An orchestration platform for the development, production, and observation of data assets.
Apache-2.0 License
Published by ajnadel about 4 years ago
New

- Added the ResourceDefinition.mock_resource helper for magic mocking resources (see the sketch after this list). Example usage can be found here
- Added a row_count metadata entry from the Dask DataFrame type check (thanks @kinghuang!)
- Added orient to the config options when materializing a Dask DataFrame to json (thanks @kinghuang!)
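A minimal sketch of how the new helper might satisfy a resource dependency in a test; the solid, pipeline, and resource key names here are hypothetical:

from dagster import ModeDefinition, ResourceDefinition, execute_pipeline, pipeline, solid

@solid(required_resource_keys={'db'})
def read_row_count(context):
    # The mocked resource is a MagicMock, so any attribute access or call succeeds.
    return context.resources.db.execute('SELECT COUNT(*) FROM users')

@pipeline(
    mode_defs=[
        ModeDefinition(
            name='test',
            # Stands in for a real database resource during tests.
            resource_defs={'db': ResourceDefinition.mock_resource()},
        )
    ]
)
def count_pipeline():
    read_row_count()

assert execute_pipeline(count_pipeline, mode='test').success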
Bugfixes

- Fixed a bug where applying configured to a solid definition would overwrite inputs from run config.

Published by johannkm about 4 years ago
Bugfixes

- Fixed an issue with the dagster-k8s-celery executor when executing solid subsets

Published by helloworld about 4 years ago
Breaking Changes

- The dagit key is no longer part of the instance configuration schema and must be removed from dagster.yaml files before they can be used.
- -d can no longer be used as a command-line argument to specify a mode. Use --mode instead.
- Use --preset instead of --preset-name to specify a preset to the pipeline launch command.
- Removed the config argument to the ConfigMapping, @composite_solid, @solid, SolidDefinition, @executor, ExecutorDefinition, @logger, LoggerDefinition, @resource, and ResourceDefinition APIs, which we deprecated in 0.8.0. Use config_schema instead (a sketch of the rename follows this list).
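A minimal before-and-after sketch of the rename, using a hypothetical solid:

from dagster import solid

# Before (removed): @solid(config={'iterations': int})
# After: the schema is passed via config_schema.
@solid(config_schema={'iterations': int})
def loop_solid(context):
    for _ in range(context.solid_config['iterations']):
        context.log.info('hello')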
New

- -d or --working-directory can be used to specify a working directory in any command that takes in a -f or --python_file argument.
- create_dagster_pandas_dataframe_type is the currently supported API for creating custom pandas DataFrame types.
- New configured API for predefining configuration for various definitions: https://docs.dagster.io/overview/configuration/#configured (see the sketch below)
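A minimal sketch of predefining resource configuration with configured; the resource and its schema here are hypothetical:

from dagster import configured, resource

@resource(config_schema={'region': str, 'bucket': str})
def object_store(context):
    return context.resource_config

# Returns a copy of the resource with its config pre-filled, so run config
# no longer needs to supply these values.
east_store = configured(object_store)({'region': 'us-east-1', 'bucket': 'example-data'})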
Published by helloworld about 4 years ago
New

- The CeleryK8sRunLauncher supports termination of pipeline runs. This can be accessed via the Terminate button in Dagit.
- The K8sRunLauncher supports termination of pipeline runs.
- AssetMaterialization events display the asset key in the Runs view.

Bugfixes

- Fixed an issue where DagsterInstance was leaving database connections open due to not being garbage collected.
- Fixed an issue where using an Enum in resource config schemas resulted in an error.

Published by helloworld about 4 years ago
New

- AssetMaterializations can now have type information attached as metadata (see the sketch below). See the materializations tutorial for more.
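A minimal sketch of attaching metadata to a materialization; the solid, asset key, and metadata labels here are hypothetical:

from dagster import AssetMaterialization, EventMetadataEntry, Output, solid

@solid
def write_table(context):
    n_rows = 42  # stand-in for a real write
    yield AssetMaterialization(
        asset_key='my_table',
        metadata_entries=[
            EventMetadataEntry.text(str(n_rows), 'row count'),
            EventMetadataEntry.text('int', 'row count type'),
        ],
    )
    yield Output(n_rows)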
Bugfixes

- Fixed an issue where context['ts'] was not passed properly.
- Fixed an issue with task_acks_late: true that resulted in a 409 Conflict error from Kubernetes. The creation of a Kubernetes Job will now be aborted if another Job with the same name exists.

Docs
New

- The configured API makes it easy to create configured versions of resources.
- Deprecated the Materialization event type in favor of the new AssetMaterialization event type, which requires an asset_key parameter. Solids yielding Materialization events will continue to work as before, though the Materialization event will be removed in a future release.
- Added the intermediate_store_defs argument to ModeDefinition, which will eventually replace system storage.
Bugfixes

- The default_value config on a field now works as expected. #2725
Breaking Changes

- The dagster and dagit CLI commands no longer add the working directory to the PYTHONPATH when resolving modules. Explicitly installed Python packages can be specified in workspaces using the python_package workspace yaml config option. The python_module config option is deprecated and will be removed in a future release.
New

- Added --path-prefix to the dagit CLI. #2073
- The date_partition_range util function now accepts an optional inclusive boolean argument (see the sketch after this list). By default, the function does not include the partition for which the end time of the date range is greater than the current time. If inclusive=True, then the list of partitions returned will include the extra partition.
- MultiDependency or fan-in inputs will now only cause the solid step to skip if all of the fanned-in upstream outputs were skipped.
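A minimal sketch of the new flag; the import path, start date, and default delta here are assumptions:

from datetime import datetime

from dagster.utils.partitions import date_partition_range

# Returns a callable producing one Partition per day; with inclusive=True the
# still-in-progress partition is included as well.
partition_fn = date_partition_range(start=datetime(2020, 1, 1), inclusive=True)
partitions = partition_fn()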
Bugfixes

- Fixed an issue with input_hydration_config arguments.
- Using alias on a solid output will produce a useful error message (thanks @iKintosh!)
- Fixed a bug in schedule decorators (e.g. daily_schedule) for certain workspace.yaml formats.
Breaking Changes

- The dagster-celery module has been broken apart to manage dependencies more coherently. There are now three modules: dagster-celery, dagster-celery-k8s, and dagster-celery-docker.
- The dagster-celery worker start command now takes a required -A parameter which must point to the app.py file within the appropriate module. E.g., if you are using the celery_k8s_job_executor then you must use the -A dagster_celery_k8s.app option when using the celery or dagster-celery cli tools. Similarly for the celery_docker_executor: -A dagster_celery_docker.app must be used.
- Renamed the input_hydration_config and output_materialization_config decorators to dagster_type_loader and dagster_type_materializer respectively (see the sketch after this list). Renamed DagsterType's input_hydration_config and output_materialization_config arguments to loader and materializer respectively.
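A minimal sketch of the renamed APIs; the loader and type here are hypothetical:

import csv

from dagster import PythonObjectDagsterType, dagster_type_loader

@dagster_type_loader({'path': str})
def rows_loader(_context, config):
    with open(config['path']) as f:
        return list(csv.DictReader(f))

# The DagsterType argument is now called loader (formerly input_hydration_config).
RowsType = PythonObjectDagsterType(list, name='Rows', loader=rows_loader)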
New

- New pipeline scoped runs tab in Dagit
- Add the following Dask Job Queue clusters: moab, sge, lsf, slurm, oar (thanks @DavidKatz-il!)
- K8s resource-requirements for run coordinator pods can be specified using the dagster-k8s/resource_requirements tag on pipeline definitions:
@pipeline(
    tags={
        'dagster-k8s/resource_requirements': {
            'requests': {'cpu': '250m', 'memory': '64Mi'},
            'limits': {'cpu': '500m', 'memory': '2560Mi'},
        }
    },
)
def foo_bar_pipeline():
    ...
- Added better error messaging in dagit for partition set and schedule configuration errors
- An initial version of the CeleryDockerExecutor was added (thanks @mrdrprofuroboros!). The celery workers will launch tasks in docker containers.
- Experimental: Great Expectations integration is currently under development in the new library dagster-ge. Example usage can be found here
Published by helloworld over 4 years ago
Breaking Changes

- Engine and ExecutorConfig have been deleted in favor of Executor. Instead of the @executor decorator decorating a function that returns an ExecutorConfig, it should now decorate a function that returns an Executor.
New

- dict can be used as an alias for Permissive() within a config schema declaration (see the sketch after this list).
- Use StringSource in the S3ComputeLogManager configuration schema to support using environment variables in the configuration (Thanks @mrdrprofuroboros!)
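A minimal sketch of the alias, using a hypothetical solid:

from dagster import solid

# dict here is shorthand for Permissive(): any keys are accepted under 'options'.
@solid(config_schema={'options': dict})
def use_options(context):
    context.log.info(str(context.solid_config['options']))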
Bugfixes

- Improved the error message shown when the $DAGSTER_HOME environment variable is not an absolute path (Thanks @AndersonReyes!)
- Updated the staging_prefix in the DatabricksPySparkStepLauncher configuration to be an absolute path (Thanks @sd2k!)
- Fixed a bug with input_hydration_config (Thanks @joeyfreund!)

Published by ajnadel over 4 years ago
Bugfix

New

- Added the dagster asset wipe <asset_key> command.

Published by natekupp over 4 years ago
Breaking Changes

- Previously, the gcs_resource returned a GCSResource wrapper which had a single client property that returned a google.cloud.storage.client.Client. Now, the gcs_resource returns the client directly.

  To update solids that use the gcs_resource, change:

  context.resources.gcs.client

  To:

  context.resources.gcs
New

- Added reexecute_pipeline to reexecute an existing pipeline run (see the sketch after this list).
- Added a project field to the gcs_resource in dagster_gcp.
- Added dagster asset wipe to remove all existing asset keys.
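A minimal sketch of re-executing a finished run; the trivial pipeline here is hypothetical:

from dagster import DagsterInstance, execute_pipeline, pipeline, reexecute_pipeline, solid

@solid
def noop(_):
    return 1

@pipeline
def my_pipeline():
    noop()

instance = DagsterInstance.get()  # run storage must persist across both calls
result = execute_pipeline(my_pipeline, instance=instance)

# The new run is linked to the original run as its parent.
reexecution = reexecute_pipeline(my_pipeline, parent_run_id=result.run_id, instance=instance)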
Bugfix

- Fixed an issue with executeRunInProcess.
- Fixed dagster schedule up output to be repository location scoped

Published by helloworld over 4 years ago
Bugfix

- Fixed an issue with dagster instance migrate.
- Fixed a bug in launch_scheduled_execution that would mask configuration errors.
- Fixed an issue in dagster-k8s when specifying per-step resources.

New

- Added a label optional parameter for materializations with asset_key specified.
- Updated the Assets page to have a typeahead selector and hierarchical views based on asset_key path.
- Added an SSHResource, replacing sftp_solid.

Docs

Published by helloworld over 4 years ago
Bugfix

- Fixed an issue that could raise OSError: [Errno 24] Too many open files when enough file descriptors were left open.

New

Published by mgasner over 4 years ago
Major Changes
Please see the 080_MIGRATION.md
migration guide for details on updating existing code to be
compatible with 0.8.0.
Workspace, host and user process separation, and repository definition: Dagit and other tools no
longer load a single repository containing user definitions such as pipelines into the same
process as the framework code. Instead, they load a "workspace" that can contain multiple
repositories sourced from a variety of different external locations (e.g., Python modules and
Python virtualenvs, with containers and source control repositories soon to come).
The repositories in a workspace are loaded into their own "user" processes distinct from the
"host" framework process. Dagit and other tools now communicate with user code over an IPC
mechanism. This architectural change has a couple of advantages.
We have introduced a new file format, workspace.yaml
, in order to support this new architecture.
The workspace yaml encodes what repositories to load and their location, and supersedes the
repository.yaml
file and associated machinery.
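A minimal sketch of the new format, with hypothetical file and module names:

load_from:
  - python_file: repos.py
  - python_module: my_package.my_repositories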
As a consequence, Dagster internals are now stricter about how pipelines are loaded. If you have
written scripts or tests in which a pipeline is defined and then passed across a process boundary
(e.g., using the multiprocess_executor
or dagstermill), you may now need to wrap the pipeline
in the reconstructable
utility function for it to be reconstructed across the process boundary.
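A minimal sketch, assuming a pipeline defined at module scope in a hypothetical my_module and era-typical multiprocess run config:

from dagster import DagsterInstance, execute_pipeline, reconstructable
from my_module import my_pipeline  # must be defined at module scope

result = execute_pipeline(
    reconstructable(my_pipeline),
    run_config={'storage': {'filesystem': {}}, 'execution': {'multiprocess': {}}},
    instance=DagsterInstance.get(),
)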
In addition, rather than instantiate the RepositoryDefinition
class directly, users should now
prefer the @repository
decorator. As part of this change, the @scheduler
and
@repository_partitions
decorators have been removed, and their functionality subsumed under
@repository
.
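A minimal sketch of the decorator; the definitions returned here are hypothetical:

from dagster import pipeline, repository, solid

@solid
def noop(_):
    return 1

@pipeline
def my_pipeline():
    noop()

@repository
def my_repository():
    # Schedules and partition sets can be returned alongside pipelines.
    return [my_pipeline]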
Dagit organization: The Dagit interface has changed substantially and is now oriented around
pipelines. Within the context of each pipeline in an environment, the previous "Pipelines" and
"Solids" tabs have been collapsed into the "Definition" tab; a new "Overview" tab provides
summary information about the pipeline, its schedules, its assets, and recent runs; the previous
"Playground" tab has been moved within the context of an individual pipeline. Related runs (e.g.,
runs created by re-executing subsets of previous runs) are now grouped together in the Playground
for easy reference. Dagit also now includes more advanced support for display of scheduled runs
that may not have executed ("schedule ticks"), as well as longitudinal views over scheduled runs,
and asset-oriented views of historical pipeline runs.
Assets: Assets are named materializations that can be generated by your pipeline solids, which
support specialized views in Dagit. For example, if we represent a database table with an asset
key, we can now index all of the pipelines and pipeline runs that materialize that table, and
view them in a single place. To use the asset system, you must enable an asset-aware storage such
as Postgres.
Run launchers: The distinction between "starting" and "launching" a run has been effaced. All
pipeline runs instigated through Dagit now make use of the RunLauncher
configured on the
Dagster instance, if one is configured. Additionally, run launchers can now support termination of
previously launched runs. If you have written your own run launcher, you may want to update it to
support termination. Note also that as of 0.7.9, the semantics of RunLauncher.launch_run
have
changed; this method now takes the run_id
of an existing run and should no longer attempt to
create the run in the instance.
Flexible reexecution: Pipeline re-execution from Dagit is now fully flexible. You may
re-execute arbitrary subsets of a pipeline's execution steps, and the re-execution now appears
in the interface as a child run of the original execution.
Support for historical runs: Snapshots of pipelines and other Dagster objects are now persisted
along with pipeline runs, so that historical runs can be loaded for review with the correct
execution plans even when pipeline code has changed. This prepares the system to be able to diff
pipeline runs and other objects against each other.
Step launchers and expanded support for PySpark on EMR and Databricks: We've introduced a new
StepLauncher
abstraction that uses the resource system to allow individual execution steps to
be run in separate processes (and thus on separate execution substrates). This has made extensive
improvements to our PySpark support possible, including the option to execute individual PySpark
steps on EMR using the EmrPySparkStepLauncher
and on Databricks using the
DatabricksPySparkStepLauncher.
The emr_pyspark
example demonstrates how to use a step launcher.
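A minimal sketch of opting a solid's steps into EMR execution via the resource system; the pipeline, solid, and pyspark_resource pairing are assumptions, and the launcher itself still needs its own configuration:

from dagster import ModeDefinition, pipeline, solid
from dagster_aws.emr import emr_pyspark_step_launcher
from dagster_pyspark import pyspark_resource

@solid(required_resource_keys={'pyspark', 'pyspark_step_launcher'})
def make_people(context):
    # This step runs on EMR because the step launcher resource is present.
    return context.resources.pyspark.spark_session.range(100)

@pipeline(
    mode_defs=[
        ModeDefinition(
            'emr',
            resource_defs={
                'pyspark_step_launcher': emr_pyspark_step_launcher,
                'pyspark': pyspark_resource,
            },
        )
    ]
)
def people_pipeline():
    make_people()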
Clearer names: What was previously known as the environment dictionary is now called the
run_config
, and the previous environment_dict
argument to APIs such as execute_pipeline
is
now deprecated. We renamed this argument to focus attention on the configuration of the run
being launched or executed, rather than on an ambiguous "environment". We've also renamed the
config
argument to all use definitions to be config_schema
, which should reduce ambiguity
between the configuration schema and the value being passed in some particular case. We've also
consolidated and improved documentation of the valid types for a config schema.
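A minimal sketch of the renamed argument; the solid and config values here are hypothetical:

from dagster import execute_pipeline, pipeline, solid

@solid(config_schema={'iterations': int})
def my_solid(context):
    for _ in range(context.solid_config['iterations']):
        context.log.info('hello')

@pipeline
def my_pipeline():
    my_solid()

# Previously: execute_pipeline(my_pipeline, environment_dict={...})
result = execute_pipeline(
    my_pipeline,
    run_config={'solids': {'my_solid': {'config': {'iterations': 3}}}},
)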
Lakehouse: We're pleased to introduce Lakehouse, an experimental, alternative programming model
for data applications, built on top of Dagster core. Lakehouse allows developers to define data
applications in terms of data assets, such as database tables or ML models, rather than in terms
of the computations that produce those assets. The simple_lakehouse
example gives a taste of
what it's like to program in Lakehouse. We'd love feedback on whether this model is helpful!
Airflow ingest: We've expanded the tooling available to teams with existing Airflow installations
that are interested in incrementally adopting Dagster. Previously, we provided only injection
tools that allowed developers to write Dagster pipelines and then compile them into Airflow DAGs
for execution. We've now added ingestion tools that allow teams to move to Dagster for execution
without having to rewrite all of their legacy pipelines in Dagster. In this approach, Airflow
DAGs are kept in their own container/environment, compiled into Dagster pipelines, and run via
the Dagster orchestrator. See the airflow_ingest
example for details!
Breaking Changes
dagster
The @scheduler
and @repository_partitions
decorators have been removed. Instances of
ScheduleDefinition
and PartitionSetDefinition
belonging to a repository should be specified
using the @repository
decorator instead.
Support for the Dagster solid selection DSL, previously introduced in Dagit, is now uniform
throughout the Python codebase, with the previous solid_subset
arguments (--solid-subset
in
the CLI) being replaced by solid_selection
(--solid-selection
). In addition to the names of
individual solids, this argument now supports selection queries like *solid_name++
(i.e.,
solid_name
, all of its ancestors, its immediate descendants, and their immediate descendants).
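A minimal sketch of the selection syntax in the Python API; my_pipeline and my_solid here are hypothetical names defined elsewhere:

from dagster import execute_pipeline

# Runs my_solid, all of its ancestors, and two generations of its descendants.
result = execute_pipeline(my_pipeline, solid_selection=['*my_solid++'])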
The built-in Dagster type Path
has been removed.
PartitionSetDefinition
names, including those defined by a PartitionScheduleDefinition
,
must now be unique within a single repository.
Asset keys are now sanitized for non-alphanumeric characters. All characters besides
alphanumerics and _
are treated as path delimiters. Asset keys can also be specified using
AssetKey
, which accepts a list of strings as an explicit path. If you are running 0.7.10 or
later and using assets, you may need to migrate your historical event log data for asset keys
from previous runs to be attributed correctly. This event_log
data migration can be invoked
as follows:
from dagster import DagsterInstance
from dagster.core.storage.event_log.migration import migrate_event_log_data

migrate_event_log_data(instance=DagsterInstance.get())
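A minimal sketch of the explicit-path form of AssetKey; the solid and key are hypothetical:

from dagster import AssetKey, AssetMaterialization, Output, solid

@solid
def update_users_table(_context):
    # An explicit path sidesteps sanitization of non-alphanumeric characters.
    yield AssetMaterialization(asset_key=AssetKey(['warehouse', 'public', 'users']))
    yield Output(None)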
The interface of the Scheduler
base class has changed substantially. If you've written a
custom scheduler, please get in touch!
The partitioned schedule decorators now generate PartitionSetDefinition
names using
the schedule name, suffixed with _partitions
.
The repository
property on ScheduleExecutionContext
is no longer available. If you were
using this property to pass to Scheduler
instance methods, this interface has changed
significantly. Please see the Scheduler
class documentation for details.
The CLI option --celery-base-priority is no longer available for the command:
dagster pipeline backfill. Use the tags option to specify the celery priority (e.g.
dagster pipeline backfill my_pipeline --tags '{ "dagster-celery/run_priority": 3 }').
The execute_partition_set
API has been removed.
The deprecated is_optional
parameter to Field
and OutputDefinition
has been removed.
Use is_required
instead.
The deprecated runtime_type
property on InputDefinition
and OutputDefinition
has been
removed. Use dagster_type
instead.
The deprecated has_runtime_type
, runtime_type_named
, and all_runtime_types
methods on
PipelineDefinition
have been removed. Use has_dagster_type
, dagster_type_named
, and
all_dagster_types
instead.
The deprecated all_runtime_types
method on SolidDefinition
and CompositeSolidDefinition
has been removed. Use all_dagster_types
instead.
The deprecated metadata
argument to SolidDefinition
and @solid
has been removed. Use
tags
instead.
The graphviz-based DAG visualization in Dagster core has been removed. Please use Dagit!
dagit

- dagit-cli has been removed, and dagit is now the only console entrypoint.

dagster-aws

- dagster_aws.EmrRunJobFlowSolidDefinition has been removed.

dagster-bash

- The bash_command_solid and bash_script_solid solid factory functions have been renamed to create_shell_command_solid and create_shell_script_solid.

dagster-celery

- The CLI option --celery-base-priority is no longer available for the command: dagster pipeline backfill. Use the tags option to specify the celery priority (e.g. dagster pipeline backfill my_pipeline --tags '{ "dagster-celery/run_priority": 3 }').

dagster-dask

- The config schema for the dagster_dask.dask_executor has changed. The previous config should now be nested under the key local.

dagster-gcp

- BigQueryClient has been removed. Use bigquery_resource instead.

dagster-dbt

dagster-spark

- dagster_spark.SparkSolidDefinition has been removed - use create_spark_solid instead.
- The SparkRDD Dagster type, which only worked with an in-memory engine, has been removed.

dagster-twilio

- TwilioClient has been removed. Use twilio_resource instead.
New

dagster

- Set asset_key on any Materialization to use the new asset system. You will also want to see the longitudinal_pipeline example.
- Added end_time support.

dagit

- Dagit's GraphQL playground is now available at /graphiql as well as at /graphql.

dagster-aws

- The dagster_aws.S3ComputeLogManager may now be configured to override the S3 endpoint and associated settings.

dagster-azure

- Added the new dagster-azure package. Use the adls2_system_storage or, for direct access, the adls2_resource resource. (Thanks!)

dagster-dask

- Improvements to the dagster_dask.dask_executor. For full support, you will need to pip install dagster-dask[yarn, pbs, kube]. (Thanks @DavidKatz-il!)

dagster-databricks

- Added the databricks_pyspark_step_launcher. (Thanks @sd2k!)

dagster-gcp

dagster-k8s

- Added the CeleryK8sRunLauncher to submit execution plan steps to Celery task queues for execution.

dagster-pandas

dagster-papertrail

- The papertrail_logger may now be set using either environment variables or literal config values.

dagster-pyspark

- PySpark solids can now run on EMR using the emr_pyspark_step_launcher, or on Databricks. The emr_pyspark example demonstrates how to use a step launcher.

dagster-snowflake

- The snowflake_resource may now be set using either environment variables or literal config values.

dagster-spark

- dagster_spark.create_spark_solid now accepts a required_resource_keys argument, which enables the use of the emr_pyspark_step_launcher.
Bugfix

- dagster pipeline execute now sets a non-zero exit code when pipeline execution fails.

Published by prha over 4 years ago
Bugfix

- Allow the NoOpComputeLogManager to be configured as the compute_logs implementation in dagster.yaml

Published by natekupp over 4 years ago
New

- The new dagster schedule logs {schedule_name} command will show the log file for a given schedule. This helps uncover errors like missing environment variables and import errors.
- These errors are also shown in the dagster schedule debug command. As before, these errors can be resolved using dagster schedule up.
Bugfix

- Fixed an issue with dagster.yaml.
Breaking Changes

- The dagster pipeline backfill command no longer takes a mode flag. Instead, it uses the mode specified on the PartitionSetDefinition. Similarly, the runs created from the backfill also use the solid_subset specified on the PartitionSetDefinition.
BugFix

- The dagster schedule debug command will display issues related to missing cron jobs, extraneous cron jobs, and duplicate cron jobs. Running dagster schedule up will fix any issues.

New
- Updates to the dagster package.

Published by helloworld over 4 years ago
Bugfix

- Fixed an issue where dagster_celery had introduced a spurious dependency on dagster_k8s