marquez

marquez - Marquez 0.28.0

Published by merobi-hub almost 2 years ago

Added

Optimize current runs query for lineage API https://github.com/MarquezProject/marquez/pull/2211 @prachim-collab
Add Code Quality, DCO and Governance docs to project https://github.com/MarquezProject/marquez/pull/2237 https://github.com/MarquezProject/marquez/pull/2241 @merobi-hub
Add possibility to soft-delete namespaces https://github.com/MarquezProject/marquez/pull/2244 @mobuchowski
Add search service proposal https://github.com/MarquezProject/marquez/pull/2203 @pawel-big-lebowski

Fixed

Show facets even when dataset has no fields https://github.com/MarquezProject/marquez/pull/2214 @JDarDagran
Appreciate column prefix when given for ended_at https://github.com/MarquezProject/marquez/pull/2231 @fm100
Fix bug keeping jobs from being properly deleted https://github.com/MarquezProject/marquez/pull/2244 @mobuchowski
Fix symlink table column length https://github.com/MarquezProject/marquez/pull/2217 @pawel-big-lebowski

marquez - Marquez 0.27.0

Published by merobi-hub almost 2 years ago

Added

Implement dataset symlink feature https://github.com/MarquezProject/marquez/pull/2066 @pawel-big-lebowski
Store column lineage facets in separate table https://github.com/MarquezProject/marquez/pull/2096 @mzareba382 @pawel-big-lebowski
Add a lineage graph endpoint for column lineage https://github.com/MarquezProject/marquez/pull/2124 @pawel-big-lebowski
Enrich returned dataset resource with column lineage information https://github.com/MarquezProject/marquez/pull/2113 @pawel-big-lebowski
Add downstream column lineage https://github.com/MarquezProject/marquez/pull/2159 @pawel-big-lebowski
Implement column lineage within Marquez Java client https://github.com/MarquezProject/marquez/pull/2163 @pawel-big-lebowski
Provide dataset_symlinks table for SymlinkDatasetFacet https://github.com/MarquezProject/marquez/pull/2087 @pawel-big-lebowski
Display current run state for job node in lineage graph https://github.com/MarquezProject/marquez/pull/2146 @wslulciuc
Include column lineage in dataset resource https://github.com/MarquezProject/marquez/pull/2148 @pawel-big-lebowski
Add indices on the job table https://github.com/MarquezProject/marquez/pull/2161 @phixMe
Add endpoint to get column lineage by a job https://github.com/MarquezProject/marquez/pull/2204 @pawel-big-lebowski
Add column lineage methods to Python client https://github.com/MarquezProject/marquez/pull/2209 @pawel-big-lebowski

Changed

Update insert job function to avoid joining on symlinks for jobs with no symlinks https://github.com/MarquezProject/marquez/pull/2144 @collado-mike
Increase size of column-lineage.description column https://github.com/MarquezProject/marquez/pull/2205 @pawel-big-lebowski

Fixed

Add support for parentRun facet as reported by older Airflow OpenLineage versions https://github.com/MarquezProject/marquez/pull/2130 @collado-mike
Add fix and tests for handling Airflow DAGs with dots and task groups https://github.com/MarquezProject/marquez/pull/2126 @collado-mike @wslulciuc
Fix version bump in docker/up.sh https://github.com/MarquezProject/marquez/pull/2129 @wslulciuc
Use clean when running shadowJar in Dockerfile https://github.com/MarquezProject/marquez/pull/2145 @wslulciuc
Fix bug that caused a single run event to create multiple jobs https://github.com/MarquezProject/marquez/pull/2162 @collado-mike
Fix column lineage returning multiple entries for job run multiple times https://github.com/MarquezProject/marquez/pull/2176 @pawel-big-lebowski
Fix API spec issues https://github.com/MarquezProject/marquez/pull/2178 @phixMe
Fix downstream recursion https://github.com/MarquezProject/marquez/pull/2181 @pawel-big-lebowski
Update jobs_current_version_uuid_index and jobs_symlink_target_uuid_index to ignore NULL values https://github.com/MarquezProject/marquez/pull/2186 @collado-mike

marquez - Marquez 0.26.0

Published by merobi-hub about 2 years ago

Added

Update FlywayFactory to support an argument to customize the schema programmatically https://github.com/MarquezProject/marquez/pull/2055 @collado-mike
Note: this change does not aim to support custom schemas from configuration.
Add steps on proposing changes to Marquez https://github.com/MarquezProject/marquez/pull/2065 @wslulciuc
Adds steps on how to submit a proposal for review along with a design doc template.
Add --metadata option to seed backend with OpenLineage events https://github.com/MarquezProject/marquez/pull/2082 @wslulciuc
Updates the seed command to load metadata from a file containing an array of OpenLineage events via the --metadata option. (Metadata used in the command was not being defined using the OpenLineage standard.)
Improve documentation on nodeId in the spec https://github.com/MarquezProject/marquez/pull/2084 @howardyoo
Adds complete examples of nodeId to the spec.
Add metadata cmd https://github.com/MarquezProject/marquez/pull/2091 @wslulciuc
Adds cmd metadata to generate OpenLineage events; generated events will be saved to a file called metadata.json that can be used to seed Marquez via the seed cmd. (We lacked a way to performance test the data model of Marquez with significantly large OL events.)
Add possibility to soft-delete datasets and jobs https://github.com/MarquezProject/marquez/pull/2032 https://github.com/MarquezProject/marquez/pull/2099 https://github.com/MarquezProject/marquez/pull/2101 @mobuchowski
Adds the ability to "hide" inactive datasets and jobs through the UI. (This PR does not include the UI part.) The feature works by adding an is_hidden flag to both datasets and jobs tables. Then, it changes jobs_view and adds datasets_view, which hides rows where the is_hidden flag is set to True. This makes writing proper queries easier since there is no need to do this filtering manually. The soft-delete is reversed if the job or dataset is updated again because the new version reverts the flag.
Add raw OpenLineage events API https://github.com/MarquezProject/marquez/pull/2070 @mobuchowski
Adds an API that returns raw OpenLineage events sorted by time and optionally filtered by namespace. Filtering by namespace takes into account both job and dataset namespaces.
Create column lineage endpoint proposal https://github.com/MarquezProject/marquez/pull/2077 @julienledem @pawel-big-lebowski
Adds a proposal to implement a column-level lineage endpoint in Marquez to leverage the column-level lineage facet in OpenLineage.
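
The soft-delete entry above describes an is_hidden flag plus a view that filters hidden rows. The following is a minimal sketch of that idea, assuming an in-memory SQLite database as a stand-in for the real backend. Only is_hidden and datasets_view come from the release note; the table layout and the function names (upsert_dataset, soft_delete_dataset, visible_datasets) are hypothetical.

```python
import sqlite3

# Simplified stand-in for the Marquez schema. Only the is_hidden flag and
# the datasets_view name come from the release note; the rest is assumed.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE datasets (
        name TEXT PRIMARY KEY,
        is_hidden INTEGER NOT NULL DEFAULT 0
    );
    CREATE VIEW datasets_view AS
        SELECT name FROM datasets WHERE is_hidden = 0;
    """
)


def upsert_dataset(name):
    """Record a new version of a dataset; this also clears the hidden flag."""
    conn.execute("INSERT OR IGNORE INTO datasets (name) VALUES (?)", (name,))
    conn.execute("UPDATE datasets SET is_hidden = 0 WHERE name = ?", (name,))


def soft_delete_dataset(name):
    """Hide the dataset instead of removing its row."""
    conn.execute("UPDATE datasets SET is_hidden = 1 WHERE name = ?", (name,))


def visible_datasets():
    """Read through the view, so callers never filter on is_hidden themselves."""
    rows = conn.execute("SELECT name FROM datasets_view ORDER BY name")
    return [row[0] for row in rows]
```

In this sketch, calling soft_delete_dataset hides a row from visible_datasets, and a later upsert_dataset for the same name makes it visible again, which mirrors the "soft-delete is reversed on update" behavior described in the entry.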

Changed

Update lineage query to only look at jobs with inputs or outputs https://github.com/MarquezProject/marquez/pull/2068 @collado-mike
Changes the lineage query to query the job_versions_io_mapping table and INNER join with the jobs_view so that only jobs that have inputs or outputs are present in the jobs_io CTE. Hence, the table becomes very small and the recursive join in the lineage CTE very fast. (In many environments, a large number of jobs reporting events have no inputs or outputs - e.g., PythonOperators in an Airflow deployment. If a Marquez installation has many of these, the lineage query spends much of its time searching for overlaps with jobs that have no inputs or outputs.)
Persist OpenLineage event before updating Marquez model https://github.com/MarquezProject/marquez/pull/2069 @fm100
Switches the order of the code in order to persist the OpenLineage event first and then update the Marquez model. (When the RunTransitionListener was invoked, the OpenLineage event was not persisted to the database. Because the OpenLineage event is the source of truth for all Marquez run transitions, it should be available from RunTransitionListener.)
Drop requirement to provide marquez.yml for seed cmd https://github.com/MarquezProject/marquez/pull/2094 @wslulciuc
Uses io.dropwizard.cli.Command instead of io.dropwizard.cli.ConfiguredCommand to no longer require passing marquez.yml as an argument to the seed cmd. (The marquez.yml argument is not used in the seed cmd.)
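
The lineage query change above can be illustrated with a small in-memory sketch. It assumes a list of (job, dataset) pairs as a hypothetical stand-in for the job_versions_io_mapping table and treats lineage as simple overlap on shared datasets; the real query is a recursive SQL CTE and distinguishes inputs from outputs, which this sketch does not.

```python
from collections import defaultdict, deque


def lineage(start_job, io_mapping):
    """Return the jobs connected to start_job through shared datasets.

    io_mapping is a list of (job, dataset) pairs, a hypothetical stand-in
    for job_versions_io_mapping. Jobs with no inputs or outputs have no
    rows here, so they never take part in the traversal.
    """
    datasets_by_job = defaultdict(set)
    jobs_by_dataset = defaultdict(set)
    for job, dataset in io_mapping:
        datasets_by_job[job].add(dataset)
        jobs_by_dataset[dataset].add(job)

    seen = {start_job}
    queue = deque([start_job])
    while queue:
        job = queue.popleft()
        # .get avoids creating empty entries for jobs without any I/O rows.
        for dataset in datasets_by_job.get(job, ()):
            for neighbor in jobs_by_dataset[dataset]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
    return seen
```

Because the traversal only ever looks at rows in the mapping, the cost depends on the number of jobs that actually read or write datasets, not on the total number of jobs reporting events.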

Fixed

Fix/rewrite jobs fqn locks https://github.com/MarquezProject/marquez/pull/2067 @collado-mike
Updates the function to only update the table if the job is a new record or if the symlink_target_uuid is distinct from the previous value. (The rewrite_jobs_fqn_table function was inadvertently updating jobs even when no metadata about the job had changed. Under load, this caused significant locking issues, as the jobs_fqn table must be locked for every job update.)
Fix enum string types in the OpenAPI spec https://github.com/MarquezProject/marquez/pull/2086 @studiosciences
Changes the type to string. (type: enum was not valid in OpenAPI spec.)
Fix incorrect PostgreSQL version https://github.com/MarquezProject/marquez/pull/2089 @jabbera
Corrects the tag for PostgreSQL.
Update OpenLineageDao to handle Airflow run UUID conflicts https://github.com/MarquezProject/marquez/pull/2097 @collado-mike
Alleviates the problem for Airflow installations that will continue to publish events with the older OpenLineage library. This checks the namespace of the parent run and verifies that it matches the namespace in the ParentRunFacet. If not, it generates a new parent run ID that will be written with the correct namespace. (The Airflow integration was generating conflicting UUIDs based on the DAG name and the DagRun ID without accounting for different namespaces. In Marquez installations that have multiple Airflow deployments with duplicated DAG names, we generated jobs whose parents have the wrong namespace.)
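
The parent run check described in the last entry could look roughly like the sketch below. It is not the OpenLineageDao implementation: the function name, the known_runs lookup, and in particular the way the replacement ID is derived (a UUIDv5 over the namespace and the original ID) are assumptions made for illustration.

```python
import uuid


def resolve_parent_run_id(parent_run_id, facet_namespace, known_runs):
    """Return a parent run ID that is consistent with the facet's namespace.

    known_runs maps run ID -> namespace of the stored run (a hypothetical
    lookup standing in for a database query).
    """
    stored_namespace = known_runs.get(parent_run_id)
    if stored_namespace is None or stored_namespace == facet_namespace:
        # No stored run yet, or the namespaces agree: keep the reported ID.
        return parent_run_id
    # Conflict: the same ID is already used in another namespace.
    # Assumption: derive a deterministic replacement from namespace + old ID.
    name = f"{facet_namespace}/{parent_run_id}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, name))
```

Deriving the replacement deterministically means repeated events from the same deployment map to the same new parent run, while deployments with duplicated DAG names no longer share one.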

marquez - Marquez 0.25.0

Published by merobi-hub about 2 years ago

Fixed

Fix py module release https://github.com/MarquezProject/marquez/pull/2057 @wslulciuc
Use /bin/sh in web/docker/entrypoint.sh https://github.com/MarquezProject/marquez/pull/2059 @wslulciuc

marquez - Marquez 0.24.0

Published by merobi-hub about 2 years ago

Added

Add copyright lines to all source files #1996 @merobi-hub
Add copyright and license guidelines in CONTRIBUTING.md @wslulciuc
Add @FlywayTarget annotation to migration tests to control flyway upgrades #2035 @collado-mike

Changed

Updated jobs_view to stop computing FQN on reads and to compute on writes instead #2036 @collado-mike
Runs row reduction #2041 @collado-mike

Fixed

Update Run in the openapi spec to include a context field #2020 @esaych
Fix dataset openapi model #2038 @esaych
Fix casing on lastLifecycleState #2039 @esaych
Fix V45 migration to include initial population of jobs_fqn table #2051 @collado-mike
Fix symlinked jobs in queries #2053 @collado-mike

marquez - Marquez 0.23.0

Published by merobi-hub over 2 years ago

Added

Update docker-compose.yml: Randomly map postgres db port https://github.com/MarquezProject/marquez/pull/2000 @RNHTTR
Job parent hierarchy https://github.com/MarquezProject/marquez/pull/1935 https://github.com/MarquezProject/marquez/pull/1980 https://github.com/MarquezProject/marquez/pull/1992 @collado-mike

Changed

Set default limit for listing datasets and jobs in UI from 2000 to 25 https://github.com/MarquezProject/marquez/pull/2018 @wslulciuc

Fixed

Return the tag for postgresql to 12.1.0 https://github.com/MarquezProject/marquez/pull/2015 @rossturk

marquez - Marquez 0.22.0

Published by merobi-hub over 2 years ago

Added

Add support for LifecycleStateChangeFacet with an ability to softly delete datasets #1847 @pawel-big-lebowski
Enable pod specific annotations in Marquez Helm Chart via marquez.podAnnotations #1945 @wslulciuc
Add support for job renaming/redirection via symlink #1947 @collado-mike
Add Created by view for dataset versions along with SQL syntax highlighting in web UI #1929 @phixMe
Add operationId to openapi spec #1978 @phixMe

Changed

Upgrade Flyway to v7.6.0 #1974 @dakshin-k

Fixed

Remove size limits on namespaces, dataset names, and source connection urls #1925 @collado-mike
Update namespace names to allow =, @, and ; #1936 @mobuchowski
Time duration display in web UI #1950 @phixMe
Enable web UI to access API via Helm Chart @GZack2000

marquez - Marquez 0.21.0

Published by merobi-hub over 2 years ago

Added

Add MDC to the LoggingMdcFilter to include API method, path, and request ID @fm100
Add Postgres sub-chart to Helm deployment for easier installation option @KevinMellott91
GitHub Action workflow to validate changes to Helm chart @KevinMellott91

Changed

Upgrade from Java11 to Java17 @ucg8j
Switch JDK image from alpine to temurin enabling Marquez to run on multiple CPU architectures @ucg8j

Fixed

Error when running Marquez on Apple M1 @ucg8j

Removed

The /api/v1-beta/lineage endpoint @wslulciuc

The marquez-airflow lib. has been removed. Please use the openlineage-airflow library instead. To migrate to using openlineage-airflow, make the following changes @wslulciuc:

# Update the import in your DAG definitions
-from marquez_airflow import DAG
+from openlineage.airflow import DAG

# Update the following environment variables in your Airflow instance
-MARQUEZ_URL
+OPENLINEAGE_URL
-MARQUEZ_NAMESPACE
+OPENLINEAGE_NAMESPACE

The marquez-spark lib. has been removed. Please use the openlineage-spark library instead. To migrate to using openlineage-spark, make the following changes @wslulciuc:

SparkSession.builder()
- .config("spark.jars.packages", "io.github.marquezproject:marquez-spark:0.20.+")
+ .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.2.+")
- .config("spark.extraListeners", "marquez.spark.agent.SparkListener")
+ .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
  .config("spark.openlineage.host", "https://api.demo.datakin.com")
  .config("spark.openlineage.apiKey", "your datakin api key")
  .config("spark.openlineage.namespace", "<NAMESPACE_NAME>")
.getOrCreate()

marquez - Marquez 0.20.0

Published by wslulciuc almost 3 years ago

Added

Add deploy docs for running Marquez on AWS @wslulciuc @merobi-hub

Changed

Clarify docs on using OpenLineage for metadata collection @fm100
Upgrade to gradle 7.x @wslulciuc
Use eclipse-temurin for Marquez API base docker image @fm100

Deprecated

The following endpoints have been deprecated and are scheduled to be removed in 0.25.0. Please use the /lineage endpoint when collecting source, dataset, and job metadata @wslulciuc:
- /sources endpoint to collect source metadata
- /datasets endpoint to collect dataset metadata
- /jobs endpoint to collect job metadata
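
As a rough illustration of replacing the deprecated per-resource calls with a single lineage event, the sketch below builds (but does not send) a request for the POST /api/v1/lineage endpoint mentioned elsewhere in these notes. The function name, the base URL, and the producer value are hypothetical, and only core OpenLineage event fields are shown; the full spec may require more.

```python
import json
import uuid
from datetime import datetime, timezone
from urllib.request import Request


def build_lineage_request(base_url, namespace, job_name, inputs, outputs):
    """Build a POST request carrying one OpenLineage run event."""
    event = {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        # Datasets are described inline instead of via separate
        # /sources, /datasets, and /jobs calls.
        "inputs": [{"namespace": namespace, "name": n} for n in inputs],
        "outputs": [{"namespace": namespace, "name": n} for n in outputs],
        # Hypothetical producer identifier; replace with your own.
        "producer": "https://example.com/my-pipeline",
    }
    return Request(
        url=f"{base_url}/api/v1/lineage",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it; the base URL depends on how the Marquez API is deployed.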

Fixed

Validation of OpenLineage events on write @collado-mike
Increase name column size for tables namespaces and sources @mmeasic

Security

Fix log4j exploit @fm100

marquez - Marquez 0.19.1

Published by collado-mike almost 3 years ago

Fixed

URI and URL DB mapper should handle empty string as null @OleksandrDvornik
Fix NodeId parsing when dataset name contains struct<> @fm100
Add encoding for dataset names in URL construction @collado-mike

marquez - Marquez 0.19.0

Published by wslulciuc almost 3 years ago

Added

Add simple python client example @wslulciuc
Display dataset versions in web UI 🎉 @phixMe
Display runs and run facets in web UI 🎉 @phixMe
Facet formatting and highlighting as Json in web UI @phixMe
Add option for docker/up.sh to run in the background @rossturk
Return totalCount in lists of jobs and datasets @phixMe

Changed

Change type column in dataset_fields table to TEXT @wslulciuc
Set ZonedDateTime parsing to support optional offsets and default to server timezone @collado-mike

Fixed

Job.location and Source.connectionUrl should be in URI format on write @OleksandrDvornik
Z-Index fix for nodes and edges in lineage graph @phixMe
Format of the index files for web UI @phixMe
Fix OpenLineage API to return correct response codes for exceptions propagated from async calls @collado-mike
Stopped overwriting nominal time information with nulls @mobuchowski

Removed

WriteOnly clients for java and python. Before OpenLineage, we added a WriteOnly implementation to our clients to emit calls to a backend. A backend enabled collecting raw HTTP requests to an HTTP endpoint, console, or file. This was our way of capturing lineage events that could then be used to automatically create resources on the Marquez backend. We soon worked on a standard that eventually became OpenLineage. That is, OpenLineage removed the need to make individual calls to create a namespace, a source, a dataset, etc.; instead, the backend accepts a single event with metadata that it can process. @wslulciuc

marquez - Marquez 0.18.0

Published by wslulciuc about 3 years ago

Added

New Add Search API 🎉 @wslulciuc
Add .env.example to override variables defined in docker-compose files @wslulciuc

Changed

Add openlineage-java as dependency @OleksandrDvornik
Move class SentryConfig from marquez to marquez.tracing pkg
Major UI improvements; the UI now uses the Search and Lineage APIs 🎉 @phixMe
Set default API port to 8080 when running the Marquez shadow jar @wslulciuc

Fixed

Update examples/airflow to use openlineage-airflow and fix the SQL in DAG troubleshooting step @wslulciuc

Removed

Drop job_versions_io_mapping_inputs and job_versions_io_mapping_outputs tables @OleksandrDvornik

marquez - Marquez 0.17.0

Published by wslulciuc about 3 years ago

Changed

Updated Lineage runs query to improve performance, added tests @collado-mike
Add POST /api/v1/lineage endpoint to docs and deprecate run endpoints @wslulciuc
Drop FieldType enum @wslulciuc

Deprecated

Run API endpoints that create or modify a job run (scheduled to be removed in 0.19.0). Please use the POST /api/v1/lineage endpoint when collecting job run metadata. @wslulciuc
Airflow integration, please use the openlineage-airflow library instead. @wslulciuc
Spark integration, please use the openlineage-spark library instead. @wslulciuc
Write only clients for java and python (scheduled to be removed in 0.19.0) @wslulciuc

Removed

Dbt integration lib. @wslulciuc
Common integration lib. @wslulciuc

marquez - Marquez 0.16.1

Published by wslulciuc over 3 years ago

Fixed

dbt packages should look for namespace packages @mobuchowski
Add common integration dependency to dbt plugins @mobuchowski
DatasetVersionDao queries missing input and output facets @dominiquetipton
(De)serialization issue for Run and JobData models @collado-mike
Prefix spark openlineage.* configuration parameters with spark.* @collado-mike
Parse multi-statement sql in class SqlParser used in Airflow integration @wslulciuc
URL-encode namespace on calls to API backend @phixMe

marquez - Marquez 0.16.0

Published by wslulciuc over 3 years ago

Added

New Add JobVersion API 🎉 @collado-mike
New Add DBT integrations for BigQuery and Snowflake 🎉 @mobuchowski

Changed

Reverted delete of BigQueryNodeVisitor to work with vanilla SparkListener @collado-mike
Promote Lineage API out of beta @OleksandrDvornik

Fixed

Display job SQL in UI @phixMe
Allow upsert of tags @hanbei
Allow potentially ambiguous URIs with encoded path segments @mobuchowski
Use source naming convention defined by OpenLineage @mobuchowski
Return dataset facets @collado-mike
BigQuery source naming in integrations @mobuchowski

marquez - Marquez 0.15.2

Published by collado-mike over 3 years ago

Added

Add endpoint to create tags @hanbei

Fixed

Fixed build & release process for python marquez-integration-common package @collado-mike
Fixed snowflake and bigquery errors when connector libraries not loaded @collado-mike
Fixed OpenLineage API not setting Dataset current_version_uuid #1361 @collado-mike

marquez - Marquez 0.15.1

Published by collado-mike over 3 years ago

Added

Factored out common functionality in Python airflow integration @mobuchowski
Added Airflow task run macro to expose task run id @collado-mike

Changed

Refactored ValuesAverageExpectationParser to ValuesSumExpectationParser and ValuesCountExpectationParser @collado-mike
Updated SparkListener to extend Spark's SparkListener abstract class @collado-mike

Fixed

Use current project version in spark openlineage client @mobuchowski
Rewrote LineageDao queries and LineageService for performance @collado-mike
Updated lineage query to include new jobs that have no job version yet @collado-mike

marquez - Marquez 0.15.0

Published by wslulciuc over 3 years ago

Added

Add tracing visibility @julienledem
New Add snowflake extractor 🎉 @mobuchowski
Add SSLContext to MarquezClient @lewiesnyder
Add support for LogicalRDDs in spark plan visitors @collado-mike
New Add Great Expectations based data quality facet support 🎉 @mobuchowski

Changed

Augment tutorial instructions & screenshots for Airflow example @rossturk
Rewrite correlated subqueries when querying the lineage_events table @collado-mike

Fixed

Web time formatting display fix @kachontep

marquez - Marquez 0.14.2

Published by wslulciuc over 3 years ago

Changed

Unpin requests dep in marquez-airflow integration @wslulciuc
Unpin attrs dep in marquez-airflow integration @wslulciuc

marquez - Marquez 0.14.1

Published by wslulciuc over 3 years ago

Changed

Updated dataset lineage query to find most recent job that wrote to it @collado-mike
Pin http-proxy-middleware to 0.20.0 @wslulciuc
