An orchestration platform for the development, production, and observation of data assets.
APACHE-2.0 License
Published by gibsondan about 3 years ago
Fixed a bug that caused the emr_pyspark_step_launcher to fail when stderr included non-Log4J-formatted lines.
Fixed a bug that caused the applyPerUniqueValue config on the QueuedRunCoordinator to fail Helm schema validation.
Added the @asset decorator and build_assets_job APIs to construct asset-based jobs, along with Dagit support.
Added load_assets_from_dbt_project and load_assets_from_dbt_manifest, which enable constructing asset-based jobs from DBT models.
Published by gibsondan about 3 years ago
[dagstermill] You can now have more precise IO control over the output notebooks by specifying output_notebook_name in define_dagstermill_solid and providing your own IO manager via the "output_notebook_io_manager" resource key.
Deprecated the output_notebook argument in define_dagstermill_solid in favor of output_notebook_name. local_output_notebook_io_manager is provided for handling local output notebook materialization.
Dagit fonts have been updated.
Fixed a bug where log statements of the form context.log.info("foo %s", "bar") would not get formatted as expected.
Fixed a bug that caused the QueuedRunCoordinator's tag_concurrency_limits to not be respected in some cases.
Added a tags argument to the @graph decorator and GraphDefinition constructor. These tags will be set on any runs of jobs that are built from invoking to_job on the graph.
You can now specify the image used for a particular solid when using the k8s_job_executor or celery_k8s_job_executor. Use the key image inside the container_config block of the k8s solid tag.
Sensors can now target multiple jobs via the jobs argument. Each RunRequest emitted from a multi-job sensor's evaluation function must specify a job_name.
Published by gibsondan about 3 years ago
The KubernetesRunLauncher image pull policy is now configurable in a separate field (thanks @yamrzou!).
The dagster-github package is now usable for GitHub Enterprise users (thanks @metinsenturk!). A hostname can now be provided via config to the dagster-github resource with the key github_hostname:

```python
execute_pipeline(
    github_pipeline, {'resources': {'github': {'config': {
        "github_app_id": os.getenv('GITHUB_APP_ID'),
        "github_app_private_rsa_key": os.getenv('GITHUB_PRIVATE_KEY'),
        "github_installation_id": os.getenv('GITHUB_INSTALLATION_ID'),
        "github_hostname": os.getenv('GITHUB_HOSTNAME'),
    }}}}
)
```

Improved performance of the pipeline_failure_sensor and run_status_sensor queries. To take advantage of these performance gains, run a schema migration with the CLI command: dagster instance migrate.
Fixed an issue where the DockerRunLauncher would raise an exception when no networks were specified in its configuration.
dagster-slack has migrated off of the deprecated slackclient and now uses [slack_sdk](https://slack.dev/python-slack-sdk/v3-migration/).
OpDefinition, the replacement for SolidDefinition and the type produced by the @op decorator, is now part of the public API.
daily_partitioned_config, hourly_partitioned_config, weekly_partitioned_config, and monthly_partitioned_config now accept an end_offset parameter, which allows extending the set of partitions so that the last partition ends after the current time.
Published by dpeng817 about 3 years ago
Previously in Dagit, when a repository location had an error when reloaded, the user could end up on an empty page with no context about the error. Now, we immediately show a dialog with the error and stack trace, with a button to try reloading the location again when the error is fixed.
Dagster is now compatible with Python's logging module. In your config YAML file, you can configure log handlers and formatters that apply to the entire Dagster instance. Configuration instructions and examples are detailed in the docs: https://docs.dagster.io/concepts/logging/python-logging
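Under the hood this builds on the standard library's logging configuration machinery. A minimal stdlib-only sketch of that mechanism (this is Python's dictConfig schema, not the dagster.yaml schema, and the logger name my_pipeline is illustrative):

```python
import logging
import logging.config

# Declare a formatter and handler once, config-style; a YAML file with the
# same structure could be parsed and passed to dictConfig identically.
logging.config.dictConfig({
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "timestamped": {"format": "%(levelname)s %(name)s: %(message)s"},
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "timestamped"},
    },
    "root": {"handlers": ["console"], "level": "INFO"},
})

# Emits "INFO my_pipeline: run started" on stderr via the root handler.
logging.getLogger("my_pipeline").info("run started")
```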
[helm] The timeout of database statements sent to the Dagster instance can now be configured using .dagit.dbStatementTimeout
.
The QueuedRunCoordinator
now supports setting separate limits for each unique value with a certain key. In the below example, 5 runs with the tag (backfill: first)
could run concurrently with 5 other runs with the tag (backfill: second)
.
```yaml
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    tag_concurrency_limits:
      - key: backfill
        value:
          applyLimitPerUniqueValue: True
        limit: 5
```
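The semantics of applyLimitPerUniqueValue can be sketched in plain Python: the limit is tracked per distinct tag value rather than per tag key. A stdlib-only illustration of the dequeue rule, not the coordinator's actual implementation:

```python
from collections import Counter

def can_dequeue(run_tags, in_progress_tag_values, key="backfill", limit=5):
    """Allow a new run unless `limit` in-progress runs already share its value for `key`."""
    value = run_tags.get(key)
    if value is None:
        return True  # untagged runs are not limited by this rule
    counts = Counter(in_progress_tag_values)
    return counts[value] < limit

# Five runs tagged backfill=first are active: a sixth "first" run must wait,
# but a "second" run may start, because the limit applies per unique value.
active = ["first"] * 5
print(can_dequeue({"backfill": "first"}, active))   # False
print(can_dequeue({"backfill": "second"}, active))  # True
```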
.migrate.enabled=True
Published by mgasner about 3 years ago
Added instance on RunStatusSensorContext for accessing the Dagster instance from within run status sensors.
The inputs of a Dagstermill solid are now loaded the same way all other inputs are loaded in the framework. This allows rerunning output notebooks with properly loaded inputs outside the Dagster context. Previously, the IO handling depended on a temporary marshal directory.
Previously, the Dagit CLI could not target a bare graph in a file, like so:

```python
from dagster import op, graph

@op
def my_op():
    pass

@graph
def my_graph():
    my_op()
```

This has been remedied. Now, a file foo.py containing just a graph can be targeted by the dagit CLI: dagit -f foo.py.
When a solid, pipeline, schedule, etc. description or event metadata entry contains a
markdown-formatted table, that table is now rendered in Dagit with better spacing between elements.
The hacker-news example now includes instructions on how to deploy the repository in a Kubernetes cluster using the Dagster Helm chart.
[dagster-dbt] The dbt_cli_resource now supports the dbt source snapshot-freshness command (thanks @emilyhawkins-drizly!)
[helm] Labels are now configurable on user code deployments.
Added EventMetadata.asset and EventMetadata.pipeline_run in AssetMaterialization metadata. (Thanks @ymrzkrrs and @drewsonne!)
fs_io_manager, which allows
An objects.inv is available at http://docs.dagster.io/objects.inv for other projects to link.
execute_solid has been removed from the testing docs (https://docs.dagster.io/concepts/testing).
gcsfs as a dependency.
The docs for create_databricks_job_solid now include an example of how to use it.
context.log.info() and other similar functions now fully respect the Python logging API. Concretely, log statements of the form context.log.error("something %s happened!", "bad") will now work as expected, and you are allowed to add things to the "extra" field to be consumed by downstream loggers: context.log.info("foo", extra={"some": "metadata"}).
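Both behaviors come straight from the standard logging API: positional arguments are %-formatted into the message lazily, and extra keys become attributes on the LogRecord that downstream formatters can reference. A stdlib-only sketch (the handler and format string are illustrative):

```python
import io
import logging

buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
# The formatter reads the custom "some" attribute injected via extra=...
handler.setFormatter(logging.Formatter("%(message)s [some=%(some)s]"))

logger = logging.getLogger("extra_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

# %-style args are formatted lazily; extra attaches metadata to the record.
logger.info("something %s happened!", "bad", extra={"some": "metadata"})

result = buffer.getvalue().strip()
# result == "something bad happened! [some=metadata]"
```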
config_from_files, config_from_pkg_resources, and config_from_yaml_strings have been added for constructing run config from yaml files and strings.
DockerRunLauncher can now be configured to launch runs that are connected to more than one network, by configuring the networks key.
env_from and volume_mounts are now properly applied to the corresponding Kubernetes run worker and job pods.
The end_mlflow_run_on_pipeline_finished hook now no longer errors whenever invoked.
Passing arbitrary keyword arguments to context.log calls is now not allowed. context.log.info("msg", foo="hi") should be rewritten as context.log.info("msg", extra={"foo":"hi"}).
Executing a dagstermill solid without an output notebook configured no longer yields an AssetMaterialization. Previously, it would still yield an AssetMaterialization where the path is a temp file path that won't exist after the notebook execution.
InputContext and OutputContext now each have an asset_key that returns the asset key that was provided to the corresponding InputDefinition or OutputDefinition.
Published by clairelin135 about 3 years ago
Added a new synchronous dbt RPC resource (dbt_rpc_sync_resource), which allows you to programmatically send dbt commands to an RPC server, returning only when the command completes (as opposed to returning as soon as the command has been sent).
The k8s_job_executor now adds to the secrets specified in K8sRunLauncher, instead of overwriting them.
local_file_manager no longer uses the current directory as the default base_dir, instead defaulting to LOCAL_ARTIFACT_STORAGE/storage/file_manager. If you wish, you can configure LOCAL_ARTIFACT_STORAGE in your dagster.yaml file.
Following the recent change to add strict Content-Security-Policy directives to Dagit, the CSP began to block the iframe used to render ipynb notebook files. This has been fixed, and these iframes should now render correctly.
Fixed an error where large files would fail to upload when using the s3_pickle_io_manager
for intermediate storage.
Fixed an issue where Kubernetes environment variables defined in pipeline tags were not being applied properly to Kubernetes jobs.
Fixed tick preview in the Recent
live tick timeline view for Sensors.
Added more descriptive error messages for invalid sensor evaluation functions.
dagit will now write to a temp directory in the current working directory when launched with the env var DAGSTER_HOME not set. This should resolve issues where the event log was not kept up to date when observing runs progress live in dagit with no DAGSTER_HOME set.
Fixed an issue where retrying from a failed run sometimes failed if the pipeline was changed after the failure.
Fixed an issue with default config on to_job
that would result in an error when using an enum config schema within a job.
A-Za-z0-9_.
EventMetadata.python_artifact.
Updated the dagster sensor list and dagster schedule list CLI commands to include schedules and sensors that have never been turned on.
Fixed a bug in execute_in_process where providing default executor config to a job would cause config errors.
Fixed a bug where using the ops config entry in place of solids would cause a config error.
adls2_io_manager
ModeDefinition now validates the keys of resource_defs at definition time.
Failure exceptions no longer bypass the RetryPolicy if one is set.
Added serviceAccount.name to the user deployment Helm subchart and schema, thanks @jrouly!
The EcsRunLauncher will now exponentially back off certain requests for up to a minute while waiting for ECS to reach a consistent state.
Memoization is now available via the launch CLI and other modes of external execution, whereas before, memoization was only available via execute_pipeline and the execute CLI.
Root input managers can now be versioned, using the version argument on the decorator:

```python
from dagster import root_input_manager

@root_input_manager(version="foo")
def my_root_manager(_):
    pass
```
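Version-based memoization keys a step's output on its code version plus its inputs, reusing the stored result when neither has changed. A stdlib-only sketch of the idea, not Dagster's actual memoization machinery:

```python
_memo_store = {}  # (step_name, version, inputs) -> cached result

def run_step(name, version, compute, *inputs):
    """Re-run `compute` only when its version or inputs change."""
    key = (name, version, inputs)
    if key not in _memo_store:
        _memo_store[key] = compute(*inputs)
    return _memo_store[key]

calls = []
def double(x):
    calls.append(x)
    return x * 2

run_step("double", "v1", double, 21)  # computes
run_step("double", "v1", double, 21)  # memoized; compute is not called again
run_step("double", "v2", double, 21)  # version bump forces a recompute
print(len(calls))  # 2
```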
versioned_fs_io_manager now defaults to using the storage directory of the instance as a base directory.
GraphDefinition.to_job now accepts a tags dictionary with non-string values, which will be serialized to JSON. This makes job tags work similarly to pipeline tags and solid tags.
The Helm chart now defaults to the NoOpComputeLogManager. It did not make sense to default to the LocalComputeLogManager, as pipeline runs are executed in ephemeral jobs, so logs could not be retrieved once these jobs were cleaned up. To have compute logs in a Kubernetes environment, users should configure a compute log manager that uses a cloud provider.
Added a dbt_pipeline to the hacker news example repo, which demonstrates how to run a dbt project within a Dagster pipeline.
Changed the default configuration of steps launched by the k8s_job_executor to match the configuration set in the K8sRunLauncher.
The maximum size of gRPC messages received by the Dagster gRPC server can now be configured via the DAGSTER_GRPC_MAX_RX_BYTES environment variable.
Fixed errors in dagster instance migrate when the asset catalog contains wiped assets.
If you supply the --models, --select, or --exclude flags while configuring the dbt_cli_resource, it will no longer attempt to supply these flags to commands that don't accept them.
Fixed an issue where yield_result wrote the output value to the same file path if output names are the same for different solids.
ops can now be used as a config entry in place of solids.
Made the EcsRunLauncher more resilient to ECS's eventual consistency model.
PipelineRunStatus.
Added solid on build_hook_context. This allows you to access the hook_context.solid parameter.
dagster's dependency on docstring-parser has been loosened.
@pipeline now pulls its description from the docstring on the decorated function if it is provided.
dagster new-project now no longer targets a non-existent mode.
@repository functions.
GraphDefinition.to_job now supports the description argument.
Previously, the ECS reference deployment granted tasks the AmazonECS_FullAccess policy. Now, the attached roles have been more narrowly scoped to only allow the daemon and dagit tasks to interact with the ECS actions required by the EcsRunLauncher.
Previously, runs would fail to launch with Error: Got unexpected extra arguments. Now, it ignores the entrypoint and launches succeed.
dagster instance migrate.
When using the ScheduleDefinition constructor to instantiate a schedule definition, if a schedule name is not provided, the name of the schedule will now default to the pipeline name plus "_schedule", instead of raising an error.
Fixed a bug where description and solid_retry_policy were getting dropped when using a solid_hook decorator on a pipeline definition (#4355).

With the new first-class Pipeline Failure sensors, you can now write sensors to perform arbitrary actions when pipelines in your repo fail, using @pipeline_failure_sensor. Out-of-the-box sensors are provided to send emails using make_email_on_pipeline_failure_sensor and Slack messages using make_slack_on_pipeline_failure_sensor.
See the Pipeline Failure Sensor docs to learn more.
New first-class Asset sensors help you define sensors that launch pipeline runs or notify appropriate stakeholders when specific asset keys are materialized. This pattern also enables Dagster to infer cross-pipeline dependency links. Check out the docs here!
Solid-level retries: A new retry_policy argument to the @solid decorator allows you to easily and flexibly control how specific solids in your pipelines will be retried if they fail, by setting a RetryPolicy.
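The behavior such a policy encodes can be sketched with a plain decorator: retry a failing call up to max_retries times, optionally waiting delay * 2**attempt between attempts for exponential backoff. A stdlib-only illustration, not Dagster's RetryPolicy implementation:

```python
import functools
import time

def retry_policy(max_retries=3, delay=0.01, exponential=True):
    """Re-invoke the wrapped function on failure, with optional backoff."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_retries:
                        raise  # retries exhausted: surface the failure
                    wait = delay * (2 ** attempt) if exponential else delay
                    time.sleep(wait)
        return wrapper
    return decorator

attempts = []

@retry_policy(max_retries=2)
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(flaky())        # "ok"
print(len(attempts))  # 3
```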
Writing tests in Dagster is now even easier, using the new suite of direct invocation APIs. Solids, resources, hooks, loggers, sensors, and schedules can all be invoked directly to test their behavior. For example, if you have some solid my_solid that you'd like to test on an input, you can now write assert my_solid(1, "foo") == "bar" (rather than explicitly calling execute_solid()).
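Direct invocation generally works by having the decorator return an object that still forwards plain calls to the wrapped function. A stdlib-only sketch of that pattern (the class and decorator names are hypothetical, not Dagster's):

```python
class SolidDefinitionSketch:
    """Wraps a compute function but stays directly callable for tests."""

    def __init__(self, fn):
        self.fn = fn
        self.name = fn.__name__

    def __call__(self, *args, **kwargs):
        # Direct invocation path: just run the user's function.
        return self.fn(*args, **kwargs)

def solid(fn):
    return SolidDefinitionSketch(fn)

@solid
def my_solid(x, word):
    return word * x

# The definition object can be registered with a framework, yet a plain
# call still exercises the business logic directly:
assert my_solid(2, "foo") == "foofoo"
```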
[Experimental] A new set of experimental core APIs. Among many benefits, these changes unify concepts such as Presets and Partition sets, make it easier to reuse common resources within an environment, make it possible to construct test-specific resources outside of your pipeline definition, and more. These changes are significant and impactful, so we encourage you to try them out and let us know how they feel! You can learn more about the specifics here
[Experimental] There’s a new reference deployment for running Dagster on AWS ECS and a new EcsRunLauncher that launches each pipeline run in its own ECS Task.
[Experimental] There's a new k8s_job_executor (https://docs.dagster.io/_apidocs/libraries/dagster-k8s#dagster_k8s.k8s_job_executor), which executes each solid of your pipeline in a separate Kubernetes job. This addition means that you can now choose at runtime (https://docs.dagster.io/deployment/guides/kubernetes/deploying-with-helm#executor) between single-pod and multi-pod isolation for solids in your run. Previously this was only configurable for the entire deployment: you could either use the K8sRunLauncher with the default executors (in-process and multiprocess) for low isolation, or you could use the CeleryK8sRunLauncher with the celery_k8s_job_executor for pod-level isolation. Now, your instance can be configured with the K8sRunLauncher and you can choose between the default executors or the k8s_job_executor.
Using the @schedule
, @resource
, or @sensor
decorator no longer requires a context parameter. If you are not using the context parameter in these, you can now do this:
```python
@schedule(cron_schedule="* * * * *", pipeline_name="my_pipeline")
def my_schedule():
    return {}

@resource
def my_resource():
    return "foo"

@sensor(pipeline_name="my_pipeline")
def my_sensor():
    return RunRequest(run_config={})
```
Dynamic mapping and collect features are no longer marked "experimental". DynamicOutputDefinition and DynamicOutput can now be imported directly from dagster.
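Dynamic mapping fans a variable-length output out across downstream steps, and collect fans the results back in. A plain-Python sketch of that dataflow shape, not the Dagster API:

```python
def emit_files():
    # Dynamic fan-out: the number of outputs is only known at runtime.
    return ["a.csv", "b.csv", "c.csv"]

def process(path):
    # Mapped step: runs once per dynamic output.
    return len(path)

def summarize(sizes):
    # Collect step: fans the mapped results back in.
    return sum(sizes)

results = [process(p) for p in emit_files()]
total = summarize(results)
print(total)  # 15
```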
Added a repository_name property on SensorEvaluationContext, which is the name of the repository that the sensor belongs to.
get_mapping_key is now available on SolidExecutionContext, allowing for discerning which downstream branch of a DynamicOutput you are in.
When viewing a run in Dagit, you can now download its debug file directly from the run view. This can be loaded into dagit-debug.
[dagster-dbt] A new dbt_cli_resource
simplifies the process of working with dbt projects in your pipelines, and allows for a wide range of potential uses. Check out the integration guide for examples!
Fixed a bug in the k8s_job_executor that caused user-defined Kubernetes config in solid tags to not be applied to the Kubernetes jobs.
The deprecated SystemCronScheduler and K8sScheduler schedulers have been removed. All schedules are now executed using the dagster-daemon process. See the deployment docs for more information about how to use the dagster-daemon process to run your schedules.
If you have written a custom run launcher, the arguments to the launch_run
function have changed in order to enable faster run launches. launch_run
now takes in a LaunchRunContext
object. Additionally, run launchers should now obtain the PipelinePythonOrigin
to pass as an argument to dagster api execute_run
. See the implementation of DockerRunLauncher for an example of the new way to write run launchers.
[helm] .Values.dagsterDaemon.queuedRunCoordinator
has had its schema altered. It is now referenced at .Values.dagsterDaemon.runCoordinator.
Previously, if you set up your run coordinator configuration in the following manner:
```yaml
dagsterDaemon:
  queuedRunCoordinator:
    enabled: true
    module: dagster.core.run_coordinator
    class: QueuedRunCoordinator
    config:
      max_concurrent_runs: 25
      tag_concurrency_limits: []
      dequeue_interval_seconds: 30
```
It is now configured like:
```yaml
dagsterDaemon:
  runCoordinator:
    enabled: true
    type: QueuedRunCoordinator
    config:
      queuedRunCoordinator:
        maxConcurrentRuns: 25
        tagConcurrencyLimits: []
        dequeueIntervalSeconds: 30
```
The method events_for_asset_key
on DagsterInstance
has been deprecated and will now issue a warning. This method was previously used in our asset sensor example code. This can be replaced by calls using the new DagsterInstance
API get_event_records
. The example code in our sensor documentation has been updated to use our new APIs as well.