OpenLineage

An Open Standard for lineage metadata collection


OpenLineage 1.12.0 (latest release)

Published by merobi-hub 6 months ago

Added

  • Airflow: add lineage_job_namespace and lineage_job_name macros #2582 @dolfinus
    Adds new Airflow macros lineage_job_namespace() and lineage_job_name(task) that return the Airflow namespace and the Airflow job name, respectively (see the sketch after this list).
  • Spec: allow nested struct fields in SchemaDatasetFacet #2548 @dolfinus
    Adds nested fields support to SchemaDatasetFacet (illustrated in the second sketch after this list).
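
A minimal usage sketch for the new macros, not taken from the release notes: it assumes the openlineage-airflow integration registers the lineage_* macros in the task template context (as it does for lineage_run_id and lineage_parent_id), and that Airflow's standard task_instance template variable is what lineage_job_name expects.

```python
# Hypothetical DAG snippet: echo the OpenLineage namespace and job name from a task.
from airflow.operators.bash import BashOperator

print_lineage_ids = BashOperator(
    task_id="print_lineage_ids",
    bash_command=(
        'echo "namespace: {{ lineage_job_namespace() }}" && '
        'echo "job name: {{ lineage_job_name(task_instance) }}"'
    ),
)
```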
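
A companion sketch for the nested SchemaDatasetFacet change: each field may now carry its own nested fields list. Shown as a plain Python dict; names and types are illustrative, and the _producer/_schemaURL boilerplate common to all facets is omitted.

```python
# Illustrative SchemaDatasetFacet payload with a nested struct field.
schema_facet = {
    "fields": [
        {"name": "id", "type": "long"},
        {
            "name": "address",
            "type": "struct",
            "fields": [  # nested fields, newly allowed by the spec change
                {"name": "city", "type": "string"},
                {"name": "zipcode", "type": "string"},
            ],
        },
    ]
}
```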

Fixed

  • Spark: fix PMD for test #2588 @pawel-big-lebowski
    Clears pmdTestScala212 of warnings that clutter the logs.
  • Dbt: propagate the dbt return code also when no OpenLineage events are emitted #2591 @blacklight
    dbt-ol now propagates the exit code of the underlying dbt process even if no lineage events are emitted.
  • Java: make sure string isn't empty to prevent going out of bounds #2585 @harels
    String lookup was not accounting for empty strings and causing a java.lang.StringIndexOutOfBoundsException.
  • Spark: use HashSet in column-level lineage instead of iterating through LinkedList #2584 @mobuchowski
    Takes advantage of performance gains available from using HashSet for collection.
  • Python: fix missing pkg_resources module on Python 3.12 #2572 @dolfinus
    Removes pkg_resources dependency and replaces it with the packaging lib.
  • Airflow: fix format returned by airflow.macros.lineage_parent_id #2578 @blacklight
    Fixes the run format returned by the lineage_parent_id Airflow macro and simplifies the format of the lineage_parent_id and lineage_run_id macros.
  • Dagster: limit Dagster version to 1.6.9 #2579 @JDarDagran
    Adds an upper limit on supported versions of Dagster as the integration is no longer actively maintained and recent releases have introduced breaking changes.

OpenLineage 1.11.3

Published by merobi-hub 7 months ago

Added

  • Common: add support for SCRIPT-type jobs in BigQuery #2564 @kacpermuda
    In the case of SCRIPT-type jobs in BigQuery, no lineage was being extracted because the SCRIPT job had no lineage information - it only spawned child jobs that had that information. With this change, the integration extracts lineage information from child jobs when dealing with SCRIPT-type jobs.
  • Spark: support for built-in lineage extraction #2272 @pawel-big-lebowski
    This PR adds a spark-interfaces-scala package that allows lineage extraction to be implemented within Spark extensions (Iceberg, Delta, GCS, etc.). The OpenLineage integration, when traversing the query plan, verifies whether nodes implement the defined interfaces; if so, the interface methods are used to extract lineage. Refer to the README for more details.
  • Spark/Java: add support for Micrometer metrics #2496 @mobuchowski
    Adds a mechanism for forwarding metrics to any Micrometer-compatible implementation. Included: MeterRegistryFactory, MicrometerProvider, StatsDMetricsBuilder, metrics config in the OpenLineage config, and a Java client implementation.
  • Spark: add support for telemetry mechanism #2528 @mobuchowski
    Adds timers, counters and additional instrumentation in order to implement Micrometer metrics collection.
  • Spark: support query option on table read #2556 @mobuchowski
    Adds support for the Spark-BigQuery connector's query input option, which executes a query directly on BigQuery, storing the result in an intermediate dataset and bypassing Spark's computation layer. Because of this, the lineage is retrieved using the SQL parser, similarly to JDBCRelation (see the sketch after this list).
  • Spark: change SparkPropertyFacetBuilder to support recording Spark runtime #2523 @Ruihua98
    Modifies SparkPropertyFacetBuilder to capture the RuntimeConfig of the Spark session because the existing SparkPropertyFacet can only capture the static config of the Spark context. This facet will be added in both RDD-related and SQL-related runs.
  • Spec: add fileCount to dataset stat facets #2562 @dolfinus
    Adds a fileCount field to DataQualityMetricsInputDatasetFacet and OutputStatisticsOutputDatasetFacet specification.
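
For the query input option above, a rough PySpark sketch, not from the release notes: it assumes an existing SparkSession named spark with the OpenLineage listener configured and the Spark-BigQuery connector on the classpath. The table, dataset, and query names are hypothetical; viewsEnabled and materializationDataset are the connector settings its documentation requires for query reads.

```python
# Read the result of a query executed directly on BigQuery; OpenLineage
# derives lineage for this read with its SQL parser rather than from the plan.
df = (
    spark.read.format("bigquery")
    .option("viewsEnabled", "true")                   # required for the query option
    .option("materializationDataset", "tmp_dataset")  # intermediate dataset (hypothetical)
    .option("query", "SELECT id, amount FROM shop.orders WHERE ds = '2024-04-01'")
    .load()
)
```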

Fixed

  • dbt: dbt-ol should transparently exit with the same exit code as the child dbt process #2560 @blacklight
    Makes dbt-ol transparently exit with the same exit code as the child dbt process.
  • Flink: disable module metadata generation #2531 @HuangZhenQiu
    Disables the module metadata generation for Flink to fix the problem of having gradle dependencies to submodules within openlineage-flink.jar.
  • Flink: fixes to version 1.19 #2507 @pawel-big-lebowski
    Fixes the class not found issue when checking for Cassandra classes. Also fixes the Maven pom dependency on subprojects.
  • Python: small improvements to .emit() method logging & annotations #2539 @dolfinus
    Updates OpenLineage.emit debug messages and annotations.
  • SQL: show error message when OpenLineageSql cannot find native library #2547 @dolfinus
    When the OpenLineageSql class could not load a native library, it returned None for all operations, and because the error message was suppressed, the user could not determine the reason.
  • SQL: update code to conform to upstream sqlparser-rs changes #2510 @mobuchowski
    Includes tests and cosmetic improvements.
  • Spark: fix access to active Spark session #2535 @pawel-big-lebowski
    Changes behavior so IllegalStateException is always caught when accessing SparkSession; previously it was not.
  • Spark: fix Databricks environment #2537 @pawel-big-lebowski
    Fixes the ClassNotFoundError occurring on Databricks runtime and extends the integration test to verify DatabricksEnvironmentFacet.
  • Spark: fixed memory leak in JobMetricsHolder #2565 @d-m-h
    The JobMetricsHolder#cleanUp(int) method now correctly purges unneeded state from both maps.
  • Spark: fixed memory leak in UnknownEntryFacetListener #2557 @pawel-big-lebowski
    Prevents storing the state when a facet is disabled, purging the state after populating run facets.
  • Spark: fix parsing JDBCOptions(table=...) containing subquery #2546 @dolfinus
    Prevents openlineage-spark from producing datasets with names like database.(select * from table) for JDBC sources.
  • Spark/Snowflake: support query option via SQL parser #2563 @mobuchowski
    When a Snowflake job is bypassing Spark's computation layer, now the SQL parser will be used to get the lineage.

OpenLineage 1.10.2

Published by merobi-hub 7 months ago

Added

  • Dagster: add new provider for version 1.6.10 #2518 @JDarDagran
    Adds the new provider required by the latest version of Dagster.
  • Flink: support lineage for a hybrid source #2491 @HuangZhenQiu
    Adds support for hybrid source lineage for users of Kafka and Iceberg sources in backfill use cases.
  • Flink: bump Flink JDBC connector version #2472 @HuangZhenQiu
    Bumps the Flink JDBC connector version to 3.1.2-1.18 for Flink 1.18.
  • Java: add an OpenLineageClientUtils#loadOpenLineageJson(InputStream) method and change the OpenLineageClientUtils#loadOpenLineageYaml(InputStream) method #2490 @d-m-h
    This improves the explicitness of the methods. Previously, loadOpenLineageYaml(InputStream) required the InputStream to contain bytes representing JSON.
  • Java: add info from the HTTP response to the client exception #2486 @davidjgoss
    Adds the status code and body as properties on the thrown exception when a non-success response is encountered in the HTTP transport.
  • Python: add support for MSK IAM authentication with a new transport #2478 @mattiabertorello
    Eases publication of events to MSK with IAM authentication.

Removed

  • Airflow: remove redundant information from facets #2524 @kacpermuda
    Refines the operator's attribute inclusion logic in facets to include only those known to be important or compact, ensuring that custom operator attributes with substantial data do not inflate the event size.

Fixed

  • Airflow: proceed without rendering templates if task_instance copy fails #2492 @kacpermuda
    Airflow will now proceed without rendering templates if task_instance copy fails in listener.on_task_instance_running.
  • Spark: fix the HttpTransport timeout #2475 @pawel-big-lebowski
    The existing timeout config parameter was ambiguous: the implementation treated the value as a double in seconds, although the documentation claimed it was in milliseconds. A new config param, timeoutInMillis, has been added (see the sketch after this list). The existing timeout has been removed from the docs and will be deprecated in 1.13.
  • Spark: prevent NPE if the context is null #2515 @pawel-big-lebowski
    Adds a check for a null context before executing end(jobEnd).
  • Flink: fix failure due to missing Cassandra classes #2507 @pawel-big-lebowski
    Flink was failing when no Cassandra classes were present on the class path because the CassandraUtils class, which has a static hasClasses method, imported Cassandra-related classes in its header. Also fixes the Maven POM dependency on subprojects and removes an unnecessary maven-publish plugin from the Flink subproject.
  • Flink: refine the JDBC table name #2512 @HuangZhenQiu
    Enables prefixing the JDBC table name with its schema.
  • Flink: fix JDBC dataset naming #2508 @pawel-big-lebowski
    For JDBC, the Flink integration was not following the OpenLineage naming convention. Code to extract the dataset namespace and name from the JDBC connection URL already existed in the Spark integration; this change moves it into the Java client so it can be reused by both the Spark and Flink integrations.
  • Flink: fix release runtime dependencies #2504 @HuangZhenQiu
    The shadow jar of Flink is not minimized, so some internal jars were listed as runtime dependencies. This removes them from the final pom.xml file in the Flink module.
  • Spec: improve Cassandra lineage metadata #2479 @HuangZhenQiu
    Updates Cassandra dataset namespaces to follow the namespace definition: cassandra://host:port.
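
A sketch of configuring the new timeout named in the HttpTransport fix above. This is an assumption-laden example, not from the release notes: it presumes the full key is spark.openlineage.transport.timeoutInMillis under the HTTP transport config, and the URL and values are placeholders; verify the key against the transport docs.

```python
from pyspark.sql import SparkSession

# Hypothetical session setup with the OpenLineage listener and HTTP transport.
spark = (
    SparkSession.builder.appName("lineage-timeout-example")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    .config("spark.openlineage.transport.timeoutInMillis", "5000")  # assumed key name
    .getOrCreate()
)
```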

OpenLineage 1.9.1

Published by merobi-hub 8 months ago

Added

  • Airflow: add support for JobTypeJobFacet properties #2412 @mattiabertorello
    Adds support for Job type properties within the Airflow Job facet.
  • dbt: add support for JobTypeJobFacet properties #2411 @mattiabertorello
    Adds support for Job type properties within the DBT Job facet.
  • Flink: support Flink Kafka dynamic source and sink #2417 @HuangZhenQiu
    Adds support for Flink Kafka Table Connector use cases for topic and schema extraction.
  • Flink: support multi-topic Kafka Sink #2372 @pawel-big-lebowski
    Adds support for multi-topic Kafka sinks. Limitations: recordSerializer needs to implement KafkaTopicsDescriptor. Please refer to the limitations section in the documentation.
  • Flink: support lineage for JDBC connector #2436 @HuangZhenQiu
    Adds support for use cases that employ this connector.
  • Flink: add common config gradle plugin #2461 @HuangZhenQiu
    Adds a common config gradle plugin to simplify the gradle files of Flink submodules.
  • Java: extend circuit breaker loaded with ServiceLoader #2435 @pawel-big-lebowski
    Loads the circuit breaker builder with ServiceLoader as an addition to a list of implemented builders available within the existing package.
  • Spark: integration now emits intermediate, application level events wrapping entire job execution #2371 @mobuchowski
    Previously, the Spark event model described only single actions, potentially linked only to some parent run. Closes #1672.
  • Spark: support built-in lineage within DataSourceV2Relation #2394 @pawel-big-lebowski
    Enables built-in lineage extraction from within DataSourceV2Relation lineage nodes.
  • Spark: add support for JobTypeJobFacet properties #2410 @mattiabertorello
    Adds support for Job type properties within the Spark Job facet.
  • Spark: stop sending spark.LogicalPlan facet by default #2433 @pawel-big-lebowski
    spark.LogicalPlan has been added to the default value of spark.openlineage.facets.disabled (see the sketch after this list).
  • Spark/Flink/Java: circuit breaker #2407 @pawel-big-lebowski
    Introduces a circuit breaker mechanism to prevent the effects of over-instrumentation. Implemented within the Java client, it serves both the Flink and Spark integrations. Read the Java client README for more details.
  • Spark: add the capability to publish Scala 2.12 and 2.13 variants of openlineage-spark #2446 @d-m-h
    Adds the capability to publish Scala 2.12 and 2.13 variants of openlineage-spark.
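
A sketch of overriding the new facets.disabled default, not from the release notes: the bracketed, semicolon-separated list format follows the Spark integration's configuration style, and the exact casing of the facet names is an assumption to verify against the docs.

```python
from pyspark.sql import SparkSession

# Re-enable the logical plan facet by overriding the default disabled list,
# keeping only spark_unknown disabled (illustrative).
spark = (
    SparkSession.builder.appName("logical-plan-facet-example")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.facets.disabled", "[spark_unknown]")
    .getOrCreate()
)
```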

Changed

  • Spark: enable the app module to be compiled with Scala 2.12 and Scala 2.13 variants of Apache Spark #2432 @d-m-h
    The spark.binary.version and spark.version properties control which variant to build, and the CI/CD pipeline now builds and tests both variants.
  • Spark: don't fail on exception of UnknownEntryFacet creation #2431 @mobuchowski
    Failure to generate UnknownEntryFacet was resulting in the event not being sent.
  • Spark: move Snowflake code into the vendor projects folders #2405 @mattiabertorello
    Creates a vendor folder to isolate Snowflake-specific code from the main Spark integration, enhancing organization and flexibility.

Fixed

  • Flink: resolve PMD rule violation warnings #2403 @HuangZhenQiu
    Resolves the PMD rule violation warnings in the Flink integration module.
  • Flink: added the isReleaseVersion property back to the build, enabling the Flink integration to be released #2468 @d-m-h
    The isReleaseVersion property had been removed from the build, preventing the Flink integration from being released.
  • Python: fix issue with file config creating additional file #2447 @kacpermuda
    FileConfig was creating an additional file when not in append mode. Closes #2439.
  • Python: fix issue with append option in file config #2441 @kacpermuda
    FileConfig was ignoring the append key in YAML config. Closes #2440.
  • Spark: fix integration catalog symlink without warehouse #2379 @algorithmy1
    In the case of symlinked Glue Catalog Tables, the parsing method was producing dataset names identical to the namespace.
  • Flink: fix IcebergSourceWrapper for Iceberg connector 1.17 #2409 @ensctom
    In Flink 1.17, the Iceberg catalogloader was loading the catalog in the open function, causing the loadTable method to throw a NullPointerException error.
  • Spark: migrate spark35, spark3, shared modules to produce Scala 2.12 and Scala 2.13 variants #2390 #2385 #2384 @d-m-h
    Migrates the three modules to use the refactored Gradle plugins. Also splits some tests into Scala 2.12- and Scala 2.13-specific versions.
  • Spark: conform the spark2 module to the new build process #2391 @d-m-h
    Due to a change in the Scala Collections API in Scala 2.13, NoSuchMethodErrors were thrown when running the openlineage-spark connector in an Apache Spark runtime compiled using Scala 2.13.

OpenLineage 1.8.0

Published by merobi-hub 9 months ago

  • Flink: support Flink 1.18 #2366 @HuangZhenQiu
    Adds support for the latest Flink version with 1.17 used for Iceberg Flink runtime and Cassandra Connector as these do not yet support 1.18.
  • Spark: add Gradle plugins to simplify the build process to support Scala 2.13 #2376 @d-m-h
    Defines a set of Gradle plugins to configure the modules and reduce duplication.
  • Spark: support multiple Scala versions in the LogicalPlan implementation #2361 @mattiabertorello
    In the LogicalPlanSerializerTest class, the implementation of the LogicalPlan interface differs between Scala 2.12 and Scala 2.13: IndexedSeq moved from the scala.collection package to scala.collection.immutable. This implements the methods necessary for both versions.
  • Spark: Use ScalaConversionUtils to convert Scala and Java collections #2357 @mattiabertorello
    This is an initial step toward supporting Scala 2.13 compilation for Spark 3.2+. Scala 2.13 changed the default collection to immutable, the methods for creating an empty collection, and the conversion between Java and Scala, so the same code does not compile under both 2.12 and 2.13. This replaces direct Scala collection usage (such as creating an empty object) and conversion utils with ScalaConversionUtils methods that support cross-compilation.
  • Spark: support MERGE INTO queries on Databricks #2348 @pawel-big-lebowski
    Supports custom plan nodes used when running MERGE INTO queries on Databricks runtime.
  • Spark: support Glue catalog in Iceberg #2283 @nataliezeller1
    Adds support for the Glue catalog based on the 'catalog-impl' property (in this case there is no 'type' property).

Changed

  • Spark: Move Spark 3.1 code from the spark3 project #2365 @mattiabertorello
    Moves the Spark 3.1-related code to a specific project, spark31, so the spark3 project can be compiled with any Spark 3.x version.

Fixed

  • Airflow: add database information to SnowflakeExtractor #2364 @kacpermuda
    Fixes missing database information in SnowflakeExtractor.
  • Airflow: add dag_id to task_run_id to avoid duplicates #2358 @kacpermuda
    The lack of dag_id in task_run_id can cause duplicates in run_id across different dags.
  • Airflow: Add tests for column lineage facet and sql parser #2373 @kacpermuda
    Improves naming (database.schema.table) in SQLExtractor's column lineage facet and adds some unit tests.
  • Spark: fix removePathPattern behaviour #2350 @pawel-big-lebowski
    The removePathPattern feature was not always applied: the method was only called when a DatasetIdentifier was constructed through PathUtils, which is not always the case. This refactors the code so that all datasets sent are processed through removePathPattern if it is configured.
  • Spark: fix a type incompatibility in RddExecutionContext between Scala 2.12 and 2.13 #2360 @mattiabertorello
    The ResultStage.func() object changes type between Scala 2.12 and 2.13, which made compilation fail. This avoids binding the function to an explicit type; instead, it is retrieved from the ResultStage object each time it is needed. This PR is part of the effort to support Scala 2.13 in the Spark integration.
  • Spark: Clean up the individual build.gradle files in preparation for Scala 2.13 support #2377 @d-m-h
    Cleans up the build.gradle files, consolidating the custom plugin and removing unused and unnecessary configuration.
  • Spark: refactor the Gradle plugins to make it easier to define Scala variants per module #2383 @d-m-h
    The third of several PRs to support producing Scala 2.12 and Scala 2.13 variants of the OpenLineage Spark integration. This PR refactors the custom Gradle plugins in order to make supporting multiple variants per module easier. This is necessary because the shared module fails its tests when consuming the Scala 2.13 variants of Apache Spark.

OpenLineage 1.7.0

Published by merobi-hub 10 months ago

COMPATIBILITY NOTICE
Starting in 1.7.0, the Airflow integration will no longer support Airflow versions >=2.8.0.
Please use the OpenLineage Airflow Provider instead.

Added

  • Airflow: add parent run facet to COMPLETE and FAIL events in Airflow integration #2320 @kacpermuda
    Adds a parent run facet to all events in the Airflow integration.

Fixed

  • Airflow: repair up.sh for macOS #2316 #2318 @kacpermuda
    Some scripts were not working well on macOS; this adjusts them.
  • Airflow: repair run_id for FAIL event in Airflow 2.6+ #2305 @kacpermuda
    The run_id in a FAIL event differed from the one in the START event on Airflow 2.6+.
  • Flink: open Iceberg TableLoader before loading a table #2314 @pawel-big-lebowski
    Fixes a potential NullPointerException in 1.17 when dealing with Iceberg sinks.
  • Flink: name Kafka datasets according to the naming convention #2321 @pawel-big-lebowski
    Adds a kafka:// prefix to Kafka topic datasets' namespaces.
  • Flink: fix properties within JobTypeJobFacet #2325 @pawel-big-lebowski
    Fixes properties assignment in the Flink visitor.
  • Spark: fix commons-logging relocate in target jar #2319 @pawel-big-lebowski
    Avoids relocating a dependency that was getting excluded from the jar.
  • Spec: fix inconsistency with Redshift authority format #2315 @davidjgoss
    Amends the Authority format for consistency with other references in the same section.

Removed

  • Airflow: remove Airflow 2.8+ support #2330 @kacpermuda
    To encourage use of the Provider, this removes the listener from the plugin if the Airflow version is >=2.8.0.

OpenLineage 1.6.2

Published by merobi-hub 11 months ago

Added

  • Dagster: support Dagster 1.5.x #2220 @tsungchih
    Gets event records for each target Dagster event type to support Dagster version 0.15.0+.
  • Dbt: add a new command dbt-ol send-events to send metadata of the last run without running the job #2285 @sophiely
    Adds a new command to send events to OpenLineage according to the latest metadata generated without running any dbt command.
  • Flink: add option for Flink job listener to read from Flink conf #2229 @ensctom
    Adds an option for the Flink job listener to read job names and namespaces from the Flink conf.
  • Spark: get column-level lineage from JDBC dbtable option #2284 @mobuchowski
    Adds support for dbtable, enables lineage in the case of single input columns, and improves dataset naming.
  • Spec: introduce JobTypeJobFacet to contain additional job-related information #2241 @pawel-big-lebowski
    The new JobTypeJobFacet contains the processing type (BATCH|STREAMING), the integration (SPARK|FLINK|...) and the job type (QUERY|COMMAND|DAG|...); an illustrative payload follows this list.
  • SQL: add quote information from sqlparser-rs #2259 @JDarDagran
    Adds quote information from sqlparser-rs.
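
An illustrative JobTypeJobFacet payload built from the values above, shown as a plain Python dict; the job names are hypothetical and the _producer/_schemaURL boilerplate shared by all facets is omitted.

```python
# The facet sits under the job's facets, keyed as "jobType".
job = {
    "namespace": "example-namespace",
    "name": "example-job",
    "facets": {
        "jobType": {
            "processingType": "BATCH",  # or STREAMING
            "integration": "SPARK",     # or FLINK, ...
            "jobType": "QUERY",         # or COMMAND, DAG, ...
        }
    },
}
```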

Fixed

  • Spark: update Jackson dependency to resolve CVE-2022-1471 #2185 @pawel-big-lebowski
    Updates Gradle for Spark and Flink to 8.1.1 and upgrades Jackson to 2.15.3.
  • Flink: avoid relying on Guava which can be missing during production runtime #2296 @pawel-big-lebowski
    Removes usage of Guava ImmutableList.
  • Spark: exclude commons-logging transitive dependency from published jar #2297 @pawel-big-lebowski
    Ensures commons-logging is not shipped as this can lead to a version mismatch on the user's side.

OpenLineage 1.5.0

Published by merobi-hub 12 months ago

Added

  • Flink: add Flink lineage for Cassandra Connectors #2175 @HuangZhenQiu
    Adds Flink Cassandra source and sink visitors and Flink Cassandra Integration test.
  • Spark: support rdd and toDF operations available in Spark Scala API #2188 @pawel-big-lebowski
    Includes the first Scala integration test, fixes ExternalRddVisitor and adds support for extracting inputs from MapPartitionsRDD and ParallelCollectionRDD plan nodes.
  • Spark: support Databricks Runtime 13.3 #2185 @pawel-big-lebowski
    Modifies the Spark integration to support the latest Databricks Runtime version.

Changed

  • Airflow: loosen attrs and requests versions #2107 @JDarDagran
    Lowers the version requirements for attrs and requests and removes an unnecessary dependency.
  • dbt: render yaml configs lazily #2221 @JDarDagran
    Avoids rendering each entry in YAML files at startup.

Fixed

  • Airflow/Athena: change dataset name to its location #2167 @sophiely
    Replaces the dataset and namespace with the data's physical location for more complete lineage across integrations.
  • Python client: skip redaction in column lineage facet #2177 @JDarDagran
    Redacted fields in ColumnLineageDatasetFacetFieldsAdditionalInputFields are now skipped.
  • Spark: unify dataset naming for RDD jobs and Spark SQL #2181 @pawel-big-lebowski
    Uses the same mechanism to extract dataset identifiers for RDD jobs as is used for Spark SQL.
  • Spark: ensure a single START and a single COMPLETE event are sent #2103 @pawel-big-lebowski
    For Spark SQL, at least four events are sent, triggered by different SparkListener methods. Each of them is required and used to collect facets unavailable elsewhere. However, only one START and one COMPLETE event should be emitted; the other events are now sent as RUNNING. Please keep in mind that the Spark integration remains stateless to limit the memory footprint, so it is the backend's responsibility to merge the several OpenLineage events into a meaningful snapshot of metadata changes.

OpenLineage 1.4.1

Published by merobi-hub about 1 year ago

Added

  • Client: allow setting client's endpoint via environment variable #2151 @mars-lan
    Enables setting this endpoint via an environment variable because creating the client manually in Airflow is not possible (see the sketch after this list).
  • Flink: expand Iceberg source types #2149 @HuangZhenQiu
    Adds support for FlinkIcebergSource and FlinkIcebergTableSource for Flink Iceberg lineage.
  • Spark: add debug facet #2147 @pawel-big-lebowski
    An extra run facet containing some system details (e.g., OS, Java and Scala versions), the classpath (e.g., package versions, jars included in the Spark job), SparkConf (the openlineage entries except auth, specified extensions, etc.) and LogicalPlan details (execution tree node names) is added to emitted events. The SparkConf setting spark.openlineage.debugFacet=enabled needs to be set to include the facet; by default, the debug facet is disabled (see the sketch after this list).
  • Spark: enable Nessie REST catalog #2165 @julwin
    Adds support for Nessie catalog in Spark.
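
A sketch of the env-var based client setup, not from the release notes: OPENLINEAGE_URL is the Python client's established variable, while OPENLINEAGE_ENDPOINT is my assumed name for the new variable from #2151; verify both names and values against the client docs.

```python
import os

# Hypothetical values; in practice, set these in the Airflow worker environment.
os.environ["OPENLINEAGE_URL"] = "http://lineage-backend:5000"
os.environ["OPENLINEAGE_ENDPOINT"] = "api/v1/lineage"  # assumed variable name

from openlineage.client import OpenLineageClient

# The constructor picks up configuration from the environment.
client = OpenLineageClient()
```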
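
For the debug facet entry above, a similar PySpark sketch; the config key and value come from the entry itself, while the session setup around them is illustrative.

```python
from pyspark.sql import SparkSession

# The debug facet is off by default; this enables it for emitted events.
spark = (
    SparkSession.builder.appName("debug-facet-example")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.debugFacet", "enabled")
    .getOrCreate()
)
```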

OpenLineage 1.3.1

Published by merobi-hub about 1 year ago

Added

  • Airflow: add some basic stats to the Airflow integration #1845 @harels
    Uses the statsd component that already exists in the Airflow codebase, wraps the section that emits the event with a timer, and emits a counter for exceptions raised while sending the event.
  • Airflow: add columns as schema facet for airflow.lineage.Table (if defined) #2138 @erikalfthan
    Adds columns (if set) from airflow.lineage.Table inlets/outlets to the OpenLineage Dataset.
  • DBT: add SQLSERVER to supported dbt profile types #2136 @erikalfthan
    Adds support for dbt-sqlserver, solving #2129.
  • Spark: support for latest 3.5 #2118 @pawel-big-lebowski
    Integration tests are now run on Spark 3.5. Also upgrades 3.3 branch to 3.3.3. Please note that delta and iceberg are not supported for Spark 3.5 at this time.

Fixed

  • Airflow: fix find-links path in tox #2139 @JDarDagran
    Fixes a broken link.
  • Airflow: add more graceful logging when no OpenLineage provider installed #2141 @JDarDagran
    Recognizes a failed import of airflow.providers.openlineage and adds more graceful logging to fix a corner case.
  • Spark: fix bug in PathUtils' prepareDatasetIdentifierFromDefaultTablePath(CatalogTable) to correctly preserve scheme from CatalogTable's location #2142 @d-m-h
    Previously, the prepareDatasetIdentifierFromDefaultTablePath method would override the scheme with the value of "file" when constructing a dataset identifier. It now uses the scheme of the CatalogTable's URI for this. Thank you @pawel-big-lebowski for the quick triage and suggested fix.

OpenLineage 1.2.2

Published by merobi-hub about 1 year ago

Added

  • Spark: publish the ProcessingEngineRunFacet as part of the normal operation of the OpenLineageSparkEventListener #2089 @d-m-h
    Publishes the spec-defined ProcessingEngineRunFacet alongside the custom SparkVersionFacet (for now).
    The SparkVersionFacet is deprecated and will be removed in a future release.
  • Spark: capture and emit spark.databricks.clusterUsageTags.clusterAllTags variable from databricks environment #2099 @Anirudh181001
    Adds spark.databricks.clusterUsageTags.clusterAllTags to the list of environment variables captured from databricks.

Fixed

  • Common: support parsing dbt_project.yml without target-path #2106 @tatiana
    As of dbt v1.5, usage of target-path in the dbt_project.yml file has been deprecated, now preferring a CLI flag or env var. It will be removed in a future version. This allows users to run DbtLocalArtifactProcessor in dbt projects that do not declare target-path.
  • Proxy: fix Proxy chart #2091 @harels
    Includes the proper image to deploy in the helm chart.
  • Python: fix serde filtering #2044 @xli-1026
    Fixes the bug causing values in list objects to be filtered accidentally.
  • Python: use non-deprecated apiKey if loading it from env variables #2029 @mobuchowski
    Changes api_key to apiKey in create_token_provider.
  • Spark: improve RDDs on S3 integration #2039 @pawel-big-lebowski
    Prepares integration test to access S3, fixes input dataset duplicates and includes other minor fixes.
  • Flink: prevent sending running events after job completes #2075 @pawel-big-lebowski
    Flink checkpoint tracking thread was not getting stopped properly on job complete.
  • Spark & Flink: Unify dataset naming from URI objects #2083 @pawel-big-lebowski
    Makes sure Spark and Flink generate the same dataset identifiers for the same datasets by having a single implementation to generate the dataset namespace and name.
  • Spark: Databricks improvements #2076 @pawel-big-lebowski
    Filters unwanted events on databricks and adds an integration test to verify this. Adds integration tests to verify dataset naming on databricks runtime is correct when table location is specified. Adds integration test for wide transformation on delta tables.

Removed

  • SQL: remove sqlparser dependency from iface-java and iface-py #2090 @JDarDagran
    Removes the dependency due to a breaking change in the latest release of the parser.

OpenLineage 1.1.0

Published by merobi-hub about 1 year ago

Added

  • Flink: create OpenLineage configuration based on Flink configuration #2033 @pawel-big-lebowski
    Flink configuration entries starting with openlineage.* are passed to the OpenLineage client.
  • Java: add Javadocs to the Java client #2004 @julienledem
    The client was missing some Javadocs.
  • Spark: append output dataset name to a job name #2036 @pawel-big-lebowski
    Solves the problem of multiple jobs writing to different datasets while having the same job name. The feature is enabled by default, results in different job names, and can be disabled by setting spark.openlineage.jobName.appendDatasetName to false (see the sketch after this list).
    Also unifies job names generated on the Databricks platform (using a dot as the job-part separator instead of an underscore). The default behaviour can be altered with spark.openlineage.jobName.replaceDotWithUnderscore.
  • Spark: support Spark 3.4.1 #2057 @pawel-big-lebowski
    Bumps the latest Spark version to be covered in integration tests.
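
A sketch combining the two job-name settings named above; both config keys come from the entry itself, while the session setup and values are illustrative of opting out of the new defaults.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("job-name-example")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # Keep job names stable even when output dataset names differ:
    .config("spark.openlineage.jobName.appendDatasetName", "false")
    # On Databricks, switch the job-part separator back to underscores:
    .config("spark.openlineage.jobName.replaceDotWithUnderscore", "true")
    .getOrCreate()
)
```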

Fixed

  • Airflow: do not use database as fallback when no schema parsed #2023 @mobuchowski
    Sets the schema to None in TablesHierarchy to skip filtering on the schema level in the information schema query.
  • Flink: fix a bug when getting schema for KafkaSink #2042 @pentium3
    Fixes the incomplete schema from KafkaSinkVisitor by changing the KafkaSinkWrapper to catch schemas of type AvroSerializationSchema.
  • Spark: filter CreateView events #1968 #1987 @pawel-big-lebowski
    Clears events generated by logical plans having CreateView nodes as root.
  • Spark: fix MERGE INTO for delta tables identified by physical locations #2026 @pawel-big-lebowski
    Delta tables identified by physical locations were not properly recognized.
  • Spark: fix incorrect naming of JDBC datasets #2035 @mobuchowski
    Makes the namespace generated by the JDBC/Spark connector conform to the naming schema in the spec.
  • Spark: fix ignored event adaptive_spark_plan in Databricks #2061 @algorithmy1
    Removes adaptive_spark_plan from the excludedNodes in DatabricksEventFilter.

OpenLineage 1.0.0

Published by merobi-hub about 1 year ago

Added

  • Airflow: convert lineage from legacy File definition #2006 @mobuchowski
    Adds coverage for File entity definition to enhance backwards compatibility.

Removed

  • Spec: remove facet ref from core #1997 @JDarDagran
    Removes references to facets from the core spec that broke compatibility with JSON schema specification.

Changed

  • Airflow: change log level to DEBUG when extractor isn't found #2012 @kaxil
    Changes log level from WARNING to DEBUG when an extractor is not available.
  • Airflow: make sure we cannot fail in thread despite direct execution #2010 @mobuchowski
    Ensures the listener is not failing tasks, even in unlikely scenarios.

Fixed

  • Airflow: stop using reusable session by default, do not send full event on Snowflake complete #2025 @mobuchowski
    Fixes the issue of the Snowflake connector clashing with HttpTransport by disabling automatic requests session reuse and not running SnowflakeExtractor again on job completion.
  • Client: fix error message to avoid confusion #2001 @mars-lan
    Fixes the error message in HttpTransport in the case of a null URL.

OpenLineage 0.30.1

Published by merobi-hub about 1 year ago

Added

  • Flink: support Iceberg sinks #1960 @pawel-big-lebowski
    Detects output datasets when using an Iceberg table as a sink.
  • Spark: column-level lineage for merge into on delta tables #1958 @pawel-big-lebowski
    Makes column-level lineage support merge into on Delta tables. Also refactors column-level lineage to deal with multiple Spark versions.
  • Spark: column-level lineage for merge into on Iceberg tables #1971 @pawel-big-lebowski
    Makes column-level lineage support merge into on Iceberg tables.
  • Spark: add support for Iceberg REST catalog #1963 @juancappi
    Adds rest to the existing options of hive and hadoop in IcebergHandler.getDatasetIdentifier() to add support for Iceberg's RestCatalog.
  • Airflow: add possibility to force direct-execution based on environment variable #1934 @mobuchowski
    Adds the option to use the direct-execution method on the Airflow listener when the existence of a non-SQLAlchemy-based Airflow event mechanism is confirmed. This happens when using Airflow 2.6 or when the OPENLINEAGE_AIRFLOW_ENABLE_DIRECT_EXECUTION environment variable exists.
  • SQL: add support for Apple Silicon to openlineage-sql-java #1981 @davidjgoss
    Expands the OS/architecture checks when compiling to produce a specific file for Apple Silicon. Also expands the corresponding OS/architecture checks when loading the binary at runtime from Java code.
  • Spec: add facet deletion #1975 @julienledem
    In order to add a mechanism for deleting job and dataset facets, adds a { _deleted: true } object that can take the place of any job or dataset facet (but not run or input/output facets, which are valid only for a specific run); see the sketch after this list.
  • Client: add a file transport #1891 @Alexkuva
    Creates a FileTransport and its configuration classes supporting append mode or write-new-file mode, which is especially useful when an object store does not support append mode, e.g. in the case of Databricks DBFS FUSE.
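
A sketch of the deletion marker in a dataset payload, shown as a plain Python dict with illustrative names, following the { _deleted: true } shape described in the facet-deletion entry above.

```python
# Emitting {"_deleted": true} in place of a facet deletes the previously
# emitted facet; valid for job and dataset facets only, not run or I/O facets.
dataset = {
    "namespace": "example://warehouse",
    "name": "orders",
    "facets": {
        "schema": {"_deleted": True},  # removes the schema facet set by an earlier event
    },
}
```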

Changed

  • Airflow: do not run plugin if OpenLineage provider is installed #1999 @JDarDagran
    Sets OPENLINEAGE_DISABLED to true if the provider is installed.
  • Python: rename config to config_class #1998 @mobuchowski
    Renames the config class variable to config_class to avoid potential conflict with the config instance.

Fixed

  • Airflow: add workaround for airflow-sqlalchemy event mechanism bug #1959 @mobuchowski
    Due to known issues with the fork and thread model in the Airflow-SQLAlchemy-based event-delivery mechanism, a Kafka producer left alone does not emit a COMPLETE event. This creates a producer for each event when we detect that we're under Airflow 2.3 - 2.5.
  • Spark: fix custom environment variables facet #1973 @pawel-big-lebowski
    Enables sending the Spark environment variables facet in a non-deterministic way.
  • Spark: filter unwanted Delta events #1968 @pawel-big-lebowski
    Clears events generated by logical plans having Project node as root.
  • Python: allow modification of openlineage.* logging levels via environment variables #1974 @JDarDagran
    Adds OPENLINEAGE_{CLIENT/AIRFLOW/DBT}_LOGGING environment variables that can be set according to module logging levels and cleans up some logging calls in openlineage-airflow.

OpenLineage 0.29.2

Published by merobi-hub over 1 year ago

Added

  • Flink: support Flink version 1.17.1 #1947 @pawel-big-lebowski
    Adds support for Flink versions 1.15.4, 1.16.2 and 1.17.1.
  • Spark: support Spark 3.4 #1790 @pawel-big-lebowski
    Introduces support for the latest Spark version, 3.4.0, along with 3.2.4 and 3.3.2.
  • Spark: add Databricks platform integration test #1928 @pawel-big-lebowski
    Adds a Spark integration test to verify behaviour on the Databricks platform, to be run manually in CircleCI when needed.
  • Spec: add static lineage event types #1880 @pawel-big-lebowski
    As a first step in implementing static lineage, this adds new DatasetEvent and JobEvent types to the spec, along with support for the new types in the Python client (see the sketch after this list).
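
A sketch of the new static-lineage DatasetEvent shape, shown as a plain Python dict; the producer and schemaURL values are placeholders to check against the published spec, and the dataset names are illustrative.

```python
# A DatasetEvent carries no run or job: it describes a dataset statically.
dataset_event = {
    "eventTime": "2023-06-01T00:00:00Z",
    "producer": "https://example.com/my-lineage-producer",
    "schemaURL": "https://openlineage.io/spec/.../DatasetEvent",  # placeholder
    "dataset": {
        "namespace": "example://warehouse",
        "name": "orders",
        "facets": {},
    },
}
```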

Removed

  • Proxy: remove unused Golang client approach #1926 @mobuchowski
    Removes the unused Golang proxy, rendered redundant by the fluentd proxy.
  • Req: bump minimum supported Python version to 3.8 #1950 @mobuchowski
    Python 3.7 is at EOL. This bumps the minimum supported version to 3.8 to keep the project aligned with the Python EOL schedule.

Fixed

  • Flink: fix KafkaSource with GenericRecord #1944 @pawel-big-lebowski
    Extracts the dataset schema from KafkaSource when a GenericRecord deserializer is used.
  • dbt: fix security vulnerabilities #1945 @JDarDagran
    Fixes vulnerabilities in the dbt integration and integration tests.

OpenLineage 0.28.0

Published by merobi-hub over 1 year ago

Added

  • dbt: add Databricks compatibility #1829 @Ines70
    Enables launching OpenLineage with a Databricks profile.

Fixed

  • Fix type-checked marker and packaging #1913 @gaborbernat
    The client was not marking itself as type-annotated.
  • Python client: add schemaURL to run event #1917 @gaborbernat
    Adds the missing schemaURL to the client's RunState class.

OpenLineage 0.27.2

Published by merobi-hub over 1 year ago

Fixed

  • Python client: deprecate client.from_environment, do not skip loading config #1908 @mobuchowski
    Deprecates the OpenLineage.from_environment method and recommends using the constructor instead.

OpenLineage 0.27.1

Published by merobi-hub over 1 year ago

Added

  • Python client: add emission filtering mechanism and exact, regex filters #1878 @mobuchowski
    Adds configurable job-name filtering to the Python client. Filters can be exact-match- or regex-based. Events will not be sent in the case of matches.

Fixed

  • Spark: fix column lineage for aggregate queries on databricks #1867 @pawel-big-lebowski
    Aggregate queries on Databricks did not return column lineage.
  • Airflow: fix unquoted [ and ] in Snowflake URIs #1883 @JDarDagran
    Snowflake connections containing one of [ or ] were causing urllib.parse.urlparse to fail.

OpenLineage 0.26.0

Published by merobi-hub over 1 year ago

Added

  • Proxy: Fluentd proxy support (experimental) #1757 @pawel-big-lebowski
    Adds a Fluentd data collector as a proxy to buffer OpenLineage events and send them to multiple backends (among many other purposes). Also implements a Fluentd OpenLineage parser to validate incoming HTTP events at the beginning of the pipeline. See the readme file for more details.

Changed

  • Python client: use Hatchling over setuptools to orchestrate Python env setup #1856 @gaborbernat
    Replaces setuptools with Hatchling for building the backend. Also includes a number of fixes, including to type definitions in transport.py and elsewhere.

Fixed

  • Spark: support single file datasets #1855 @pawel-big-lebowski
    Fixes the naming of single-file datasets (e.g., spark.read.csv('file.csv')) so they are no longer named using the parent directory's path.
  • Spark: fix logicalPlan serialization issue on Databricks #1858 @pawel-big-lebowski
    Disables the spark_unknown facet by default to turn off serialization of logicalPlan.

OpenLineage 0.25.0

Published by merobi-hub over 1 year ago

Added

  • Spark: add Spark/Delta merge into support #1823 @pawel-big-lebowski
    Adds support for merge into queries.

Fixed

  • Spark: fix JDBC query handling #1808 @nataliezeller1
    Makes query handling more tolerant of variations in syntax and formatting.
  • Spark: filter Delta adaptive plan events #1830 @pawel-big-lebowski
    Extends the DeltaEventFilter class to filter events in cases where rewritten queries in adaptive Spark plans generate extra events.
  • Spark: fix Java class cast exception #1844 @Anirudh181001
    Fixes the error caused by the OpenLineageRunEventBuilder when it cast the Spark scheduler's ShuffleMapStage to boolean.
  • Flink: include missing fields of OpenLineage events #1840 @pawel-big-lebowski
    Enriches Flink events so that missing eventTime, runId and job elements no longer produce errors.