OpenLineage: An Open Standard for lineage metadata collection
Apache-2.0 License
- `lineage_job_namespace` and `lineage_job_name` macros (#2582) @dolfinus: Adds `lineage_job_namespace()` and `lineage_job_name(task)` macros that return an Airflow namespace and an Airflow job name, respectively.
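As a minimal sketch of how these macros might be used from a Jinja-templated field (the DAG, task id, and command are hypothetical; only the macro names come from the entry above):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="orders_etl", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    report = BashOperator(
        task_id="report_lineage_coordinates",
        # The macros resolve at runtime to the Airflow namespace and job name.
        bash_command=(
            "echo namespace={{ lineage_job_namespace() }} "
            "job={{ lineage_job_name(task) }}"
        ),
    )
```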
- `SchemaDatasetFacet` (#2548) @dolfinus.
- (#2588) @pawel-big-lebowski: Clears `pmdTestScala212` of warnings that clutter the logs.
- (#2591) @blacklight: `dbt-ol` now propagates the exit code of the underlying dbt process even if no lineage events are emitted.
- (#2585) @harels: Fixes a `java.lang.StringIndexOutOfBoundsException`.
- Use `HashSet` in column-level lineage instead of iterating through `LinkedList` (#2584) @mobuchowski: Uses a `HashSet` for collection instead of iterating through a `LinkedList`.
- (#2572) @dolfinus: Removes the `pkg_resources` dependency and replaces it with the `packaging` lib.
- `airflow.macros.lineage_parent_id` (#2578) @blacklight: Fixes the `lineage_parent_id` Airflow macro and simplifies the format of the `lineage_parent_id` and `lineage_run_id` macros.
- (#2579) @JDarDagran.

Published by merobi-hub 7 months ago
- `SCRIPT`-type jobs in BigQuery (#2564) @kacpermuda: For `SCRIPT`-type jobs in BigQuery, no lineage was being extracted because the `SCRIPT` job had no lineage information; it only spawned child jobs that had it. With this change, the integration extracts lineage information from child jobs when dealing with `SCRIPT`-type jobs.
- `spark-interfaces-scala` package (#2272) @pawel-big-lebowski: A package that allows lineage extraction to be implemented within Spark extensions (Iceberg, Delta, GCS, etc.). The OpenLineage integration, when traversing the query plan, verifies whether nodes implement the defined interfaces; if so, the interface methods are used to extract lineage. Refer to the README for more details.
- (#2496) @mobuchowski: Adds `MeterRegistryFactory`, `MicrometerProvider`, `StatsDMetricsBuilder`, metrics config in the OpenLineage config, and a Java client implementation.
- (#2528) @mobuchowski.
- (#2556) @mobuchowski (`JDBCRelation`).
- `SparkPropertyFacetBuilder` to support recording Spark runtime config (#2523) @Ruihua98: Modifies `SparkPropertyFacetBuilder` to capture the `RuntimeConfig` of the Spark session, because the existing `SparkPropertyFacet` can only capture the static config of the Spark context. This facet will be added in both RDD-related and SQL-related runs.
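To see why capturing `RuntimeConfig` matters: session-level settings live on `spark.conf` and are not visible in the static `SparkContext` configuration. A small PySpark sketch (the property chosen is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("runtime-config-demo").getOrCreate()

# A session-level (runtime) setting: visible via spark.conf ...
spark.conf.set("spark.sql.shuffle.partitions", "64")
print(spark.conf.get("spark.sql.shuffle.partitions"))  # "64"

# ... but absent from the static SparkContext configuration.
print(spark.sparkContext.getConf().get("spark.sql.shuffle.partitions", "unset"))
```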
- `fileCount` to dataset stat facets (#2562) @dolfinus: Adds a `fileCount` field to the `DataQualityMetricsInputDatasetFacet` and `OutputStatisticsOutputDatasetFacet` specifications.
- `dbt-ol` should transparently exit with the same exit code as the child `dbt` process (#2560) @blacklight: Makes `dbt-ol` transparently exit with the same exit code as the child `dbt` process.
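As a rough sketch of the behaviour described above (not `dbt-ol`'s actual implementation), a wrapper that forwards its arguments to `dbt` and exits with the child's code might look like:

```python
import subprocess
import sys

def main() -> None:
    # Run dbt with whatever arguments the wrapper received.
    proc = subprocess.run(["dbt", *sys.argv[1:]])
    # (Lineage events would be collected and emitted here.)
    # Exit transparently with the child's exit code.
    sys.exit(proc.returncode)

if __name__ == "__main__":
    main()
```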
- (#2531) @HuangZhenQiu (`openlineage-flink.jar`).
- (#2507) @pawel-big-lebowski.
- `.emit()` method logging & annotations (#2539) @dolfinus.
- (#2547) @dolfinus: When the `OpenLineageSql` class could not load a native library, it returned `None` for all operations, but because the error message was suppressed, the user could not determine the reason.
- (#2510) @mobuchowski.
- (#2535) @pawel-big-lebowski: Ensures `IllegalStateException` is always caught when accessing `SparkSession`.
- (#2537) @pawel-big-lebowski: Fixes a `ClassNotFoundError` occurring on the Databricks runtime and extends the integration test to verify `DatabricksEnvironmentFacet`.
- (#2565) @d-m-h: The `JobMetricsHolder#cleanUp(int)` method now correctly purges unneeded state from both maps.
- `UnknownEntryFacetListener` (#2557) @pawel-big-lebowski.
- `JDBCOptions(table=...)` containing a subquery (#2546) @dolfinus: Prevents `openlineage-spark` from producing datasets with names like `database.(select * from table)` for JDBC sources.
- (#2563) @mobuchowski.
- `IllegalStateException` when accessing `SparkSession` (#2535) @pawel-big-lebowski: The `IllegalStateException` was not being caught.
Published by merobi-hub 7 months ago
- (#2518) @JDarDagran.
- (#2491) @HuangZhenQiu.
- (#2472) @HuangZhenQiu.
- `OpenLineageClientUtils#loadOpenLineageJson(InputStream)` and change `OpenLineageClientUtils#loadOpenLineageYaml(InputStream)` methods (#2490) @d-m-h: Previously, `loadOpenLineageYaml(InputStream)` expected the `InputStream` to contain bytes that represented JSON.
- (#2486) @davidjgoss.
- (#2478) @mattiabertorello.
- (#2524) @kacpermuda.
- `task_instance` copy fails (#2492) @kacpermuda: Fixes the case where the `task_instance` copy fails in `listener.on_task_instance_running`.
- `HttpTransport` timeout (#2475) @pawel-big-lebowski: The `timeout` config parameter is ambiguous: the implementation treats the value as a double in seconds, although the documentation claims it's milliseconds. A new config param, `timeoutInMillis`, has been added. The existing `timeout` has been removed from the docs and will be deprecated in 1.13.
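A hypothetical `openlineage.yml` sketch using the new, unambiguous parameter (URL and value are placeholders; key layout assumed from the standard transport config):

```yaml
transport:
  type: http
  url: http://localhost:5050
  # Milliseconds, unlike the deprecated `timeout` (seconds as a double).
  timeoutInMillis: 5000
```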
- (#2515) @pawel-big-lebowski (`end(jobEnd)`).
- (#2512) @HuangZhenQiu.
- (#2508) @pawel-big-lebowski.
- (#2507) @pawel-big-lebowski: The `CassandraUtils` class has a static `hasClasses` method, but it imports Cassandra-related classes in the header. Also, the Flink subproject contains an unnecessary `maven-publish` plugin.
- (#2504) @HuangZhenQiu.
- (#2479) @HuangZhenQiu (`cassandra://host:port`).
Published by merobi-hub 8 months ago
- `JobTypeJobFacet` properties (#2412) @mattiabertorello.
- `JobTypeJobFacet` properties (#2411) @mattiabertorello.
@mattiabertorello#2372
@pawel-big-lebowskirecordSerializer
needs to implement KafkaTopicsDescriptor
. Please refer to the limitations sections in documentation.
- (#2436) @HuangZhenQiu.
- `ServiceLoader` (#2435) @pawel-big-lebowski: Supports loading via `ServiceLoader` as an addition to the list of implemented builders available within the existing package.
- (#2371) @mobuchowski (#1672).
- `DataSourceV2Relation` lineage nodes (#2394) @pawel-big-lebowski.
- `JobTypeJobFacet` properties (#2410) @mattiabertorello.
- Disable the `spark.LogicalPlan` facet by default (#2433) @pawel-big-lebowski: `spark.LogicalPlan` has been added to the default value of `spark.openlineage.facets.disabled`.
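For example, a PySpark session could override the default disabled list to re-enable the facet. The `[facet1;facet2]` value format and the listener/artifact coordinates below are assumptions based on the integration's common documentation, so verify them before relying on this sketch:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lineage-demo")
    # Listener and artifact names as commonly documented; version illustrative.
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:1.9.1")
    .config("spark.extraListeners",
            "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "console")
    # Keep spark_unknown disabled but drop spark.LogicalPlan from the list,
    # re-enabling the logical-plan facet (semicolon-separated list assumed).
    .config("spark.openlineage.facets.disabled", "[spark_unknown]")
    .getOrCreate()
)
```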
- (#2407) @pawel-big-lebowski.
- `openlineage-spark` (#2446) @d-m-h.
- Allow the `openlineage-spark` `app` module to be compiled with Scala 2.12 and Scala 2.13 variants of Apache Spark (https://github.com/OpenLineage/OpenLineage/pull/2432) @d-m-h: The `spark.binary.version` and `spark.version` properties control which variant to build.
- `app` module (#2432) @d-m-h: Enables the `app` module to be built using both Scala 2.12 and Scala 2.13 variants of various Apache Spark versions, and enables the CI/CD pipeline to build and test them.
- `UnknownEntryFacet` creation (#2431) @mobuchowski: A failing `UnknownEntryFacet` was resulting in the event not being sent.
- (#2405) @mattiabertorello: Introduces a `vendor` folder to isolate Snowflake-specific code from the main Spark integration, enhancing organization and flexibility.
- (#2403) @HuangZhenQiu.
- (#2468) @d-m-h.
- (#2447) @kacpermuda: `FileConfig` was creating an additional file when not in append mode. Closes #2439.
- (#2441) @kacpermuda: `FileConfig` was ignoring the append key in the YAML config. Closes #2440.
- (#2379) @algorithmy1.
- `IcebergSourceWrapper` for Iceberg connector 1.17 (#2409) @ensctom: The `catalogloader` was loading the catalog in the `open` function, causing the `loadTable` method to throw a `NullPointerException` error.
- `spark35`, `spark3`, and `shared` modules to produce Scala 2.12 and Scala 2.13 variants (#2390) (#2385) (#2384) @d-m-h.
- `spark2` module to the new build process (#2391) @d-m-h: Previously, `NoSuchMethodError`s were being thrown when running the `openlineage-spark` connector in an Apache Spark runtime compiled using Scala 2.13.
Published by merobi-hub 9 months ago
- (#2366) @HuangZhenQiu.
- (#2376) @d-m-h.
- `LogicalPlan` implementation (#2361) @mattiabertorello.
- (#2357) @mattiabertorello: Adds `ScalaConversionUtils` methods that will support cross-compilation.
- `MERGE INTO` queries on Databricks (#2348) @pawel-big-lebowski: Supports `MERGE INTO` queries on the Databricks runtime.
- (#2283) @nataliezeller1.
- (#2365) @mattiabertorello.
- (#2364) @kacpermuda.
- (#2358) @kacpermuda.
- (#2373) @kacpermuda.
- (#2350) @pawel-big-lebowski.
- (#2360) @mattiabertorello.
@mattiabertorelloremovePathPattern
feature #2350
@pawel-big-lebowskiremovePathPattern
if configured to do so.
- (#2377) @d-m-h.
- (#2383) @d-m-h.

Published by merobi-hub 10 months ago
COMPATIBILITY NOTICE: Starting in 1.7.0, the Airflow integration will no longer support Airflow versions >=2.8.0. Please use the OpenLineage Airflow Provider instead.
- `COMPLETE` and `FAIL` events in the Airflow integration (#2320) @kacpermuda.
- (#2316) (#2318) @kacpermuda.
- `run_id` for the `FAIL` event in Airflow 2.6+ (#2305) @kacpermuda: The `run_id` in a `FAIL` event was different from the one in the `START` event for Airflow 2.6+.
- `TableLoader` before loading a table (#2314) @pawel-big-lebowski: Fixes a `NullPointerException` in 1.17 when dealing with Iceberg sinks.
- (#2321) @pawel-big-lebowski: Adds a `kafka://` prefix to Kafka topic datasets' namespaces.
- `JobTypeJobFacet` (#2325) @pawel-big-lebowski.
- `commons-logging` relocate in target jar (#2319) @pawel-big-lebowski.
- (#2315) @davidjgoss: Aligns the `Authority` format for consistency with other references in the same section.
- (#2330) @kacpermuda (`>=2.8.0`).
Published by merobi-hub 11 months ago
- (#2220) @tsungchih.
- `dbt-ol send-events` to send metadata of the last run without running the job (#2285) @sophiely.
- (#2229) @ensctom.
- (#2284) @mobuchowski.
- `JobTypeJobFacet` to contain additional job-related information (#2241) @pawel-big-lebowski: The new `JobTypeJobFacet` contains the processing type (`BATCH|STREAMING`), the integration (`SPARK|FLINK|...`), and the job type (`QUERY|COMMAND|DAG|...`).
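A trimmed sketch of how the facet might appear on a job in an emitted event (namespace, job name, and values are illustrative, and the mandatory `_producer`/`_schemaURL` facet metadata is omitted):

```json
{
  "job": {
    "namespace": "example_namespace",
    "name": "daily_orders",
    "facets": {
      "jobType": {
        "processingType": "BATCH",
        "integration": "SPARK",
        "jobType": "QUERY"
      }
    }
  }
}
```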
- (#2259) @JDarDagran.
- CVE-2022-1471 (#2185) @pawel-big-lebowski (2.15.3).
- (#2296) @pawel-big-lebowski.
- `commons-logging` transitive dependency removed from the published jar (#2297) @pawel-big-lebowski: Ensures `commons-logging` is not shipped, as this can lead to a version mismatch on the user's side.
Published by merobi-hub 12 months ago
- (#2175) @HuangZhenQiu.
- `rdd` and `toDF` operations available in the Spark Scala API (#2188) @pawel-big-lebowski: Adds `ExternalRddVisitor` and support for extracting inputs from `MapPartitionsRDD` and `ParallelCollectionRDD` plan nodes.
- (#2185) @pawel-big-lebowski.
- (#2107) @JDarDagran.
- (#2221) @JDarDagran.
- (#2167) @sophiely.
- (#2177) @JDarDagran: `ColumnLineageDatasetFacetFieldsAdditionalInputFields` are now skipped.
- (#2181) @pawel-big-lebowski.
- Ensure a single `START` and a single `COMPLETE` event are sent (#2103) @pawel-big-lebowski: A single `START` and a single `COMPLETE` event are emitted; other events should be sent as `RUNNING`. Please keep in mind that the Spark integration remains stateless to limit the memory footprint, and it is the backend's responsibility to merge several OpenLineage events into a meaningful snapshot of metadata changes.
Published by merobi-hub about 1 year ago
- (#2151) @mars-lan.
- (#2149) @HuangZhenQiu: Supports `FlinkIcebergSource` and `FlinkIcebergTableSource` for Flink Iceberg lineage.
- (#2147) @pawel-big-lebowski: `spark.openlineage.debugFacet=enabled` needs to be set to include the facet. By default, the debug facet is disabled.
- (#2165) @julwin.

Published by merobi-hub about 1 year ago
- (#1845) @harels.
- `airflow.lineage.Table` (if defined) (#2138) @erikalfthan: Converts `airflow.lineage.Table` inlets/outlets to OpenLineage Datasets.
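A minimal sketch of the Airflow side of this (DAG, table, and cluster names are made up; `airflow.lineage.entities.Table` is the entity the entry refers to):

```python
from datetime import datetime

from airflow import DAG
from airflow.lineage.entities import Table
from airflow.operators.bash import BashOperator

with DAG(dag_id="orders", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    load = BashOperator(
        task_id="load_orders",
        bash_command="echo load",
        # Converted by the integration into OpenLineage input/output datasets.
        inlets=[Table(database="analytics", cluster="postgres://db:5432",
                      name="raw_orders")],
        outlets=[Table(database="analytics", cluster="postgres://db:5432",
                       name="orders")],
    )
```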
- (#2136) @erikalfthan.
- (#2118) @pawel-big-lebowski: `delta` and `iceberg` are not supported for Spark `3.5` at this time.
- (#2139) @JDarDagran.
- (#2141) @JDarDagran: Checks for `airflow.providers.openlineage` and adds more graceful logging to fix a corner case.
- (#2142) @d-m-h: The `prepareDatasetIdentifierFromDefaultTablePath` method would override the scheme with the value of "file" when constructing a dataset identifier; it now uses the scheme of the `CatalogTable`'s URI for this. Thank you @pawel-big-lebowski for the quick triage and suggested fix.
Published by merobi-hub about 1 year ago
- `ProcessingEngineRunFacet` as part of the normal operation of the `OpenLineageSparkEventListener` (#2089) @d-m-h: Publishes the `ProcessingEngineRunFacet` alongside the custom `SparkVersionFacet` (for now). The `SparkVersionFacet` is deprecated and will be removed in a future release.
- `spark.databricks.clusterUsageTags.clusterAllTags` variable from the Databricks environment (#2099) @Anirudh181001: Adds `spark.databricks.clusterUsageTags.clusterAllTags` to the list of environment variables captured from Databricks.
- (#2106) @tatiana: Fixes `DbtLocalArtifactProcessor` in dbt projects that do not declare `target-path`.
- (#2091) @harels.
- (#2044) @xli-1026.
- `apiKey` if loading it from env variables (#2029) @mobuchowski: Renames `api_key` to `apiKey` in `create_token_provider`.
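A sketch of the renamed key in client config; the `create_token_provider` helper and its import path are assumptions based on the Python client's HTTP transport module, so verify before use:

```python
from openlineage.client.transport.http import create_token_provider

# After this change the expected key is camelCase "apiKey" (was "api_key").
provider = create_token_provider({"type": "api_key", "apiKey": "secret-token"})
```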
- (#2039) @pawel-big-lebowski.
- `running` events after job completes (#2075) @pawel-big-lebowski.
- (#2083) @pawel-big-lebowski.
- (#2076) @pawel-big-lebowski.
- (#2090) @JDarDagran.

Published by merobi-hub about 1 year ago
- (#2033) @pawel-big-lebowski: Properties matching `openlineage.*` are passed to the OpenLineage client.
- (#2004) @julienledem.
- (#2036) @pawel-big-lebowski: Appending the dataset name to the job name can be disabled by setting `spark.openlineage.jobName.appendDatasetName` to `false`; dots in the job name can be replaced with underscores via `spark.openlineage.jobName.replaceDotWithUnderscore`.
- (#2057) @pawel-big-lebowski.
- (#2023) @mobuchowski: Passes `None` in `TablesHierarchy` to skip filtering on the schema level in the information schema query.
- `KafkaSink` (#2042) @pentium3: Fixes the `KafkaSinkVisitor` by changing the `KafkaSinkWrapper` to catch schemas of type `AvroSerializationSchema`.
- `CreateView` events (#1968) (#1987) @pawel-big-lebowski: Handles logical plans with `CreateView` nodes as root.
- `MERGE INTO` for delta tables identified by physical locations (#2026) @pawel-big-lebowski.
- (#2035) @mobuchowski.
- `adaptive_spark_plan` in Databricks (#2061) @algorithmy1: Removes `adaptive_spark_plan` from the `excludedNodes` in `DatabricksEventFilter`.
Published by merobi-hub about 1 year ago
- `File` definition (#2006) @mobuchowski: Updates the `File` entity definition to enhance backwards compatibility.
- (#1997) @JDarDagran.
- `DEBUG` when extractor isn't found (#2012) @kaxil: Changes the log level from `WARNING` to `DEBUG` when an extractor is not available.
- (#2010) @mobuchowski.
- (#2025) @mobuchowski: Fixes `HttpTransport` by disabling automatic `requests` session reuse and not running `SnowflakeExtractor` again on job completion.
- (#2001) @mars-lan: Fixes `HttpTransport` in the case of a null URL.
Published by merobi-hub about 1 year ago
- (#1960) @pawel-big-lebowski.
- `merge into` on delta tables (#1958) @pawel-big-lebowski: Handles `merge into` on Delta tables; also refactors column-level lineage to deal with multiple Spark versions.
- `merge into` on Iceberg tables (#1971) @pawel-big-lebowski: Supports `merge into` on Iceberg tables.
- (#1963) @juancappi: Adds `rest` to the existing options of `hive` and `hadoop` in `IcebergHandler.getDatasetIdentifier()` to add support for Iceberg's `RestCatalog`.
- (#1934) @mobuchowski: Applies when the `OPENLINEAGE_AIRFLOW_ENABLE_DIRECT_EXECUTION` environment variable exists.
- `openlineage-sql-java` (#1981) @davidjgoss.
- (#1975) @julienledem: Adds a `{ _deleted: true }` object that can take the place of any job or dataset facet (but not run or input/output facets, which are valid only for a specific run).
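A trimmed sketch of facet deletion in an emitted event (the facet name is a placeholder, and the required `_producer`/`_schemaURL` metadata is omitted):

```json
{
  "job": {
    "namespace": "example_namespace",
    "name": "daily_orders",
    "facets": {
      "someOutdatedFacet": { "_deleted": true }
    }
  }
}
```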
- (#1891) @Alexkuva: Creates `FileTransport` and its configuration classes supporting append mode or write-new-file mode, which is especially useful when an object store does not support append mode, e.g. in the case of Databricks DBFS FUSE.
- (#1999) @JDarDagran: Sets `OPENLINEAGE_DISABLED` to `true` if the provider is installed.
- `config` to `config_class` (#1998) @mobuchowski: Renames the `config` class variable to `config_class` to avoid a potential conflict with the config instance.
- (#1959) @mobuchowski.
- (#1973) @pawel-big-lebowski.
- (#1968) @pawel-big-lebowski (`Project` node as root).
- Set `openlineage.*` logging levels via environment variables (#1974) @JDarDagran: Adds `OPENLINEAGE_{CLIENT/AIRFLOW/DBT}_LOGGING` environment variables that can be set according to module logging levels and cleans up some logging calls in `openlineage-airflow`.
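For example (values illustrative; set these before the respective modules are imported):

```python
import os

# Raise client verbosity to DEBUG, keep the Airflow integration at INFO.
os.environ["OPENLINEAGE_CLIENT_LOGGING"] = "DEBUG"
os.environ["OPENLINEAGE_AIRFLOW_LOGGING"] = "INFO"
```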
Published by merobi-hub over 1 year ago
- (#1947) @pawel-big-lebowski.
- (#1790) @pawel-big-lebowski.
- (#1928) @pawel-big-lebowski.
- (#1880) @pawel-big-lebowski: Adds `DatasetEvent` and `JobEvent` types to the spec, along with support for the new types in the Python client.
- (#1926) @mobuchowski.
- (#1950) @mobuchowski.

Published by merobi-hub over 1 year ago
- (#1829) @Ines70.

Published by merobi-hub over 1 year ago
- Deprecate `client.from_environment`, do not skip loading config (#1908) @mobuchowski: Deprecates the `OpenLineage.from_environment` method and recommends using the constructor instead.
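After this deprecation, construction looks like the following sketch (assuming, per the entry's recommendation, that the no-argument constructor resolves configuration itself rather than skipping it):

```python
from openlineage.client import OpenLineageClient

# Recommended: plain constructor instead of OpenLineageClient.from_environment().
client = OpenLineageClient()
```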
Published by merobi-hub over 1 year ago
- (#1878) @mobuchowski.
- (#1867) @pawel-big-lebowski.
- `[` and `]` in Snowflake URIs (#1883) @JDarDagran: URIs containing `[` or `]` were causing `urllib.parse.urlparse` to fail.
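The failure mode can be reproduced directly with the standard library (the URI shape is illustrative): an unmatched bracket in the authority trips `urllib`'s IPv6 validation.

```python
from urllib.parse import urlparse

try:
    # An unmatched "[" in the authority raises ValueError("Invalid IPv6 URL").
    urlparse("snowflake://my-account.[region/database")
except ValueError as exc:
    print("urlparse failed:", exc)
```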
Published by merobi-hub over 1 year ago
- (#1757) @pawel-big-lebowski.
- (#1856) @gaborbernat.
- (#1855) @pawel-big-lebowski (`spark.read.csv('file.csv')`).
- `logicalPlan` serialization issue on Databricks (#1858) @pawel-big-lebowski: Disables the `spark_unknown` facet by default to turn off serialization of `logicalPlan`.
Published by merobi-hub over 1 year ago
- `merge into` support (#1823) @pawel-big-lebowski: Supports `merge into` queries.
- (#1808) @nataliezeller1.
- (#1830) @pawel-big-lebowski: Adds a `DeltaEventFilter` class to filter events in cases where rewritten queries in adaptive Spark plans generate extra events.
- (#1844) @Anirudh181001: Fixes `OpenLineageRunEventBuilder` when it cast the Spark scheduler's `ShuffleMapStage` to boolean.
- (#1840) @pawel-big-lebowski: `eventTime`, `runId`, and `job` elements no longer produce errors.