marquez

Collect, aggregate, and visualize a data ecosystem's metadata

APACHE-2.0 License

marquez - Marquez 0.28.0

Published by merobi-hub almost 2 years ago

marquez - Marquez 0.27.0

Published by merobi-hub almost 2 years ago

Added

Changed

Fixed

marquez - Marquez 0.26.0

Published by merobi-hub about 2 years ago

Added

Changed

  • Update lineage query to only look at jobs with inputs or outputs https://github.com/MarquezProject/marquez/pull/2068 @collado-mike
    Changes the lineage query to query the job_versions_io_mapping table and INNER join with the jobs_view so that only jobs that have inputs or outputs are present in the jobs_io CTE. Hence, the table becomes very small and the recursive join in the lineage CTE very fast. (In many environments, a large number of jobs reporting events have no inputs or outputs - e.g., PythonOperators in an Airflow deployment. If a Marquez installation has many of these, the lineage query spends much of its time searching for overlaps with jobs that have no inputs or outputs.)
  • Persist OpenLineage event before updating Marquez model https://github.com/MarquezProject/marquez/pull/2069 @fm100
    Reorders the code so that the OpenLineage event is persisted first and the Marquez model is updated afterwards. (Previously, the OpenLineage event had not yet been persisted to the database when the RunTransitionListener was invoked. Because the OpenLineage event is the source of truth for all Marquez run transitions, it should be available from RunTransitionListener.)
  • Drop requirement to provide marquez.yml for seed cmd https://github.com/MarquezProject/marquez/pull/2094 @wslulciuc
    Uses io.dropwizard.cli.Command instead of io.dropwizard.cli.ConfiguredCommand to no longer require passing marquez.yml as an argument to the seed cmd. (The marquez.yml argument is not used in the seed cmd.)
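
The idea behind the lineage query change (#2068) can be illustrated outside the database. The real change is a SQL query with a recursive CTE; the Python below is only a rough sketch of the same filtering idea, and the job names, the dictionary layout, and both function names are hypothetical.

```python
from collections import deque


def build_jobs_io(jobs):
    """Keep only jobs that declare at least one input or output dataset."""
    return {
        name: io
        for name, io in jobs.items()
        if io["inputs"] or io["outputs"]
    }


def downstream_jobs(jobs_io, start):
    """Breadth-first walk: job A feeds job B if an output of A is an input of B."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        outputs = set(jobs_io[current]["outputs"])
        for name, io in jobs_io.items():
            if name not in seen and outputs & set(io["inputs"]):
                seen.add(name)
                queue.append(name)
    seen.discard(start)
    return seen


# Hypothetical jobs: "notify" has no inputs or outputs (e.g., a PythonOperator).
jobs = {
    "extract": {"inputs": [], "outputs": ["raw"]},
    "transform": {"inputs": ["raw"], "outputs": ["clean"]},
    "notify": {"inputs": [], "outputs": []},
}
jobs_io = build_jobs_io(jobs)
```

Because jobs without inputs or outputs never enter `jobs_io`, the traversal only compares jobs that can actually take part in lineage.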

Fixed

  • Fix/rewrite jobs fqn locks https://github.com/MarquezProject/marquez/pull/2067 @collado-mike
    Updates the function to only update the table if the job is a new record or if the symlink_target_uuid is distinct from the previous value. (The rewrite_jobs_fqn_table function was inadvertently updating jobs even when no metadata about the job had changed. Under load, this caused significant locking issues, as the jobs_fqn table must be locked for every job update.)
  • Fix enum string types in the OpenAPI spec https://github.com/MarquezProject/marquez/pull/2086 @studiosciences
    Changes the type to string. (type: enum was not valid in OpenAPI spec.)
  • Fix incorrect PostgreSQL version https://github.com/MarquezProject/marquez/pull/2089 @jabbera
    Corrects the tag for PostgreSQL.
  • Update OpenLineageDao to handle Airflow run UUID conflicts https://github.com/MarquezProject/marquez/pull/2097 @collado-mike
    Alleviates the problem for Airflow installations that will continue to publish events with the older OpenLineage library. This checks the namespace of the parent run and verifies that it matches the namespace in the ParentRunFacet. If not, it generates a new parent run ID that will be written with the correct namespace. (The Airflow integration was generating conflicting UUIDs based on the DAG name and the DagRun ID without accounting for different namespaces. In Marquez installations that have multiple Airflow deployments with duplicated DAG names, we generated jobs whose parents have the wrong namespace.)
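
To make the UUID conflict in #2097 concrete, here is a minimal sketch of deriving a parent run ID that accounts for the namespace. It is not the code used by Marquez or the Airflow integration; the root UUID, the name format, and both function names are assumptions made for illustration.

```python
import uuid

# Hypothetical root UUID used only for this illustration.
EXAMPLE_ROOT = uuid.UUID("00000000-0000-0000-0000-000000000000")


def parent_run_id(namespace, dag_name, dag_run_id):
    """Derive a deterministic run ID that includes the namespace."""
    return uuid.uuid5(EXAMPLE_ROOT, f"{namespace}.{dag_name}.{dag_run_id}")


def resolve_parent_run_id(existing_namespace, facet_namespace,
                          dag_name, dag_run_id, existing_id):
    """Keep the existing ID when namespaces match, else derive a new one."""
    if existing_namespace == facet_namespace:
        return existing_id
    return parent_run_id(facet_namespace, dag_name, dag_run_id)


# Two Airflow deployments with the same DAG name and DagRun ID.
id_a = parent_run_id("airflow-a", "etl", "run_1")
id_b = parent_run_id("airflow-b", "etl", "run_1")
```

With the namespace included in the hashed name, identical DAG names in different deployments no longer map to the same parent run ID.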

marquez - Marquez 0.25.0

Published by merobi-hub about 2 years ago

marquez - Marquez 0.24.0

Published by merobi-hub about 2 years ago

Added

  • Add copyright lines to all source files #1996 @merobi-hub
  • Add copyright and license guidelines in CONTRIBUTING.md @wslulciuc
  • Add @FlywayTarget annotation to migration tests to control flyway upgrades #2035 @collado-mike

Changed

Fixed

marquez - Marquez 0.23.0

Published by merobi-hub over 2 years ago

marquez - Marquez 0.22.0

Published by merobi-hub over 2 years ago

Added

  • Add support for LifecycleStateChangeFacet with an ability to softly delete datasets #1847 @pawel-big-lebowski
  • Enable pod specific annotations in Marquez Helm Chart via marquez.podAnnotations #1945 @wslulciuc
  • Add support for job renaming/redirection via symlink #1947 @collado-mike
  • Add Created by view for dataset versions along with SQL syntax highlighting in web UI #1929 @phixMe
  • Add operationId to openapi spec #1978 @phixMe

Changed

Fixed

marquez - Marquez 0.21.0

Published by merobi-hub over 2 years ago

Added

  • Add MDC to the LoggingMdcFilter to include API method, path, and request ID @fm100
  • Add Postgres sub-chart to Helm deployment for easier installation option @KevinMellott91
  • GitHub Action workflow to validate changes to Helm chart @KevinMellott91

Changed

  • Upgrade from Java 11 to Java 17 @ucg8j
  • Switch JDK image from alpine to temurin enabling Marquez to run on multiple CPU architectures @ucg8j

Fixed

  • Error when running Marquez on Apple M1 @ucg8j

Removed

  • The /api/v1-beta/lineage endpoint @wslulciuc

  • The marquez-airflow lib. has been removed. Please use the openlineage-airflow library instead. To migrate to using openlineage-airflow, make the following changes @wslulciuc:

    # Update the import in your DAG definitions
    -from marquez_airflow import DAG
    +from openlineage.airflow import DAG
    
    # Update the following environment variables in your Airflow instance
    -MARQUEZ_URL
    +OPENLINEAGE_URL
    -MARQUEZ_NAMESPACE
    +OPENLINEAGE_NAMESPACE
    
  • The marquez-spark lib. has been removed. Please use the openlineage-spark library instead. To migrate to using openlineage-spark, make the following changes @wslulciuc:

    SparkSession.builder()
    - .config("spark.jars.packages", "io.github.marquezproject:marquez-spark:0.20.+")
    + .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.2.+")
    - .config("spark.extraListeners", "marquez.spark.agent.SparkListener")
    + .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
      .config("spark.openlineage.host", "https://api.demo.datakin.com")
      .config("spark.openlineage.apiKey", "your datakin api key")
      .config("spark.openlineage.namespace", "<NAMESPACE_NAME>")
    .getOrCreate()
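
For the marquez-airflow migration above, the environment variable rename can be checked with a small helper. This is only a sketch: the two variable pairs come from the migration notes, while the function name and the example URL are hypothetical.

```python
# Old-to-new names taken from the migration notes above.
RENAMED = {
    "MARQUEZ_URL": "OPENLINEAGE_URL",
    "MARQUEZ_NAMESPACE": "OPENLINEAGE_NAMESPACE",
}


def migrate_env(env):
    """Return a copy of env with legacy MARQUEZ_* names renamed.

    An OPENLINEAGE_* value that is already set takes precedence.
    """
    migrated = dict(env)
    for old, new in RENAMED.items():
        if old in migrated:
            migrated.setdefault(new, migrated[old])
            del migrated[old]
    return migrated


# Hypothetical legacy configuration.
legacy = {"MARQUEZ_URL": "http://localhost:5000", "MARQUEZ_NAMESPACE": "dev"}
current = migrate_env(legacy)
```

Passing `dict(os.environ)` to `migrate_env` would show which values an Airflow instance needs to carry over.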
    
marquez - Marquez 0.20.0

Published by wslulciuc almost 3 years ago

Added

Changed

  • Clarify docs on using OpenLineage for metadata collection @fm100
  • Upgrade to gradle 7.x @wslulciuc
  • Use eclipse-temurin for Marquez API base docker image @fm100

Deprecated

  • The following endpoints have been deprecated and are scheduled to be removed in 0.25.0. Please use the /lineage endpoint when collecting source, dataset, and job metadata @wslulciuc:
    • /sources endpoint to collect source metadata
    • /datasets endpoint to collect dataset metadata
    • /jobs endpoint to collect job metadata

Fixed

  • Validation of OpenLineage events on write @collado-mike
  • Increase name column size for tables namespaces and sources @mmeasic

Security

marquez - Marquez 0.19.1

Published by collado-mike almost 3 years ago

Fixed

  • URI and URL DB mapper should handle empty string as null @OleksandrDvornik
  • Fix NodeId parsing when dataset name contains struct<> @fm100
  • Add encoding for dataset names in URL construction @collado-mike
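
Two of the fixes above (empty strings treated as null, and encoding of dataset names in URLs) can be sketched as follows. The Marquez mappers are Java code; this Python version only illustrates the behavior, and the function names and base URL are assumptions.

```python
from urllib.parse import quote, urlparse


def to_uri_or_none(value):
    """Treat None and empty or blank strings as a missing URI."""
    if value is None or value.strip() == "":
        return None
    return urlparse(value)


def dataset_url(base_url, namespace, dataset):
    """Build a dataset URL with every reserved character percent-encoded."""
    encoded_namespace = quote(namespace, safe="")
    encoded_dataset = quote(dataset, safe="")
    return f"{base_url}/namespaces/{encoded_namespace}/datasets/{encoded_dataset}"


# Hypothetical base URL, namespace, and dataset name.
url = dataset_url("http://localhost:5000/api/v1", "s3://bucket", "struct<a>")
```

Passing `safe=""` matters here because `quote` leaves `/` unencoded by default, which would split a namespace such as `s3://bucket` into several path segments.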

marquez - Marquez 0.19.0

Published by wslulciuc almost 3 years ago

Added

  • Add simple python client example @wslulciuc
  • Display dataset versions in web UI 🎉 @phixMe
  • Display runs and run facets in web UI 🎉 @phixMe
  • Facet formatting and highlighting as Json in web UI @phixMe
  • Add option for docker/up.sh to run in the background @rossturk
  • Return totalCount in lists of jobs and datasets @phixMe

Changed

  • Change type column in dataset_fields table to TEXT @wslulciuc
  • Set ZonedDateTime parsing to support optional offsets and default to server timezone @collado-mike
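
The timestamp parsing behavior described above (optional offset, defaulting to the server timezone) can be sketched in Python. Marquez itself uses Java's ZonedDateTime; the function name below is hypothetical and the sketch assumes ISO 8601 input.

```python
from datetime import datetime


def parse_timestamp(text):
    """Parse an ISO 8601 timestamp whose UTC offset is optional.

    When no offset is given, the value is interpreted in the local
    (server) timezone.
    """
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # astimezone() on a naive datetime assumes local time
        # and attaches the local timezone.
        parsed = parsed.astimezone()
    return parsed


with_offset = parse_timestamp("2021-09-01T12:00:00+02:00")
without_offset = parse_timestamp("2021-09-01T12:00:00")
```

Both results are timezone-aware, so later comparisons do not mix naive and aware values.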

Fixed

  • Job.location and Source.connectionUrl should be in URI format on write @OleksandrDvornik
  • Z-Index fix for nodes and edges in lineage graph @phixMe
  • Format of the index files for web UI @phixMe
  • Fix OpenLineage API to return correct response codes for exceptions propagated from async calls @collado-mike
  • Stopped overwriting nominal time information with nulls @mobuchowski

Removed

  • WriteOnly clients for Java and Python. Before OpenLineage, we added a WriteOnly implementation to our clients to emit calls to a backend. A backend enabled collecting raw HTTP requests to an HTTP endpoint, console, or file. This was our way of capturing lineage events that could then be used to automatically create resources on the Marquez backend. We soon worked on a standard that eventually became OpenLineage. That is, OpenLineage removed the need to make individual calls to create a namespace, a source, a dataset, etc.; instead, the backend accepts an event with metadata that it can process. @wslulciuc

marquez - Marquez 0.18.0

Published by wslulciuc about 3 years ago

Added

  • Add new Search API 🎉 @wslulciuc
  • Add .env.example to override variables defined in docker-compose files @wslulciuc

Changed

Fixed

Removed

  • Drop job_versions_io_mapping_inputs and job_versions_io_mapping_outputs tables @OleksandrDvornik

marquez - Marquez 0.17.0

Published by wslulciuc about 3 years ago

Changed

  • Updated Lineage runs query to improve performance, added tests @collado-mike
  • Add POST /api/v1/lineage endpoint to docs and deprecate run endpoints @wslulciuc
  • Drop FieldType enum @wslulciuc

Deprecated

Removed

marquez - Marquez 0.16.1

Published by wslulciuc over 3 years ago

Fixed

  • dbt packages should look for namespace packages @mobuchowski
  • Add common integration dependency to dbt plugins @mobuchowski
  • DatasetVersionDao queries missing input and output facets @dominiquetipton
  • (De)serialization issue for Run and JobData models @collado-mike
  • Prefix spark openlineage.* configuration parameters with spark.* @collado-mike
  • Parse multi-statement sql in class SqlParser used in Airflow integration @wslulciuc
  • URL-encode namespace on calls to API backend @phixMe

marquez - Marquez 0.16.0

Published by wslulciuc over 3 years ago

Added

Changed

Fixed

marquez - Marquez 0.15.2

Published by collado-mike over 3 years ago

Added

  • Add endpoint to create tags @hanbei

Fixed

  • Fixed build & release process for python marquez-integration-common package @collado-mike
  • Fixed snowflake and bigquery errors when connector libraries not loaded @collado-mike
  • Fixed OpenLineage API not setting Dataset current_version_uuid #1361 @collado-mike

marquez - Marquez 0.15.1

Published by collado-mike over 3 years ago

Added

  • Factored out common functionality in Python airflow integration @mobuchowski
  • Added Airflow task run macro to expose task run id @collado-mike

Changed

  • Refactored ValuesAverageExpectationParser to ValuesSumExpectationParser and ValuesCountExpectationParser @collado-mike
  • Updated SparkListener to extend Spark's SparkListener abstract class @collado-mike

Fixed

  • Use current project version in spark openlineage client @mobuchowski
  • Rewrote LineageDao queries and LineageService for performance @collado-mike
  • Updated lineage query to include new jobs that have no job version yet @collado-mike

marquez - Marquez 0.15.0

Published by wslulciuc over 3 years ago

Added

Changed

  • Augment tutorial instructions & screenshots for Airflow example @rossturk
  • Rewrite correlated subqueries when querying the lineage_events table @collado-mike

Fixed

marquez - Marquez 0.14.2

Published by wslulciuc over 3 years ago

Changed

  • Unpin requests dep in marquez-airflow integration @wslulciuc
  • Unpin attrs dep in marquez-airflow integration @wslulciuc

marquez - Marquez 0.14.1

Published by wslulciuc over 3 years ago

Changed

  • Updated dataset lineage query to find most recent job that wrote to it @collado-mike
  • Pin http-proxy-middleware to 0.20.0 @wslulciuc