dagster

An orchestration platform for the development, production, and observation of data assets.

APACHE-2.0 License


dagster - 0.10.7

Published by sryza over 3 years ago

New

  • When user code raises an error inside handle_output, load_input, or a type check function, the log output now includes context about which input or output the error occurred during.
  • Added a secondary index to improve performance when querying run status. Run dagster instance migrate to upgrade.
  • [Helm] Celery queues can now be configured with different node selectors. Previously, configuring a node selector applied it to all Celery queues.
  • In Dagit, a repository location reload button is now available in the header of every pipeline, schedule, and sensor page.
  • When viewing a run in Dagit, log filtering behavior has been improved. step and type filtering now offer fuzzy search, all log event types are now searchable, and visual bugs within the input have been repaired. Additionally, the default setting for “Hide non-matches” has been flipped to true.
  • After launching a backfill in Dagit, the success message now includes a link to view the runs for the backfill.
  • The dagster-daemon process now runs faster when running multiple schedules or sensors from the same repository.
  • When launching a backfill from Dagit, the “Re-execute From Last Run” option has been removed, because it had confusing semantics. “Re-execute From Failure” now includes a tooltip.
  • fs_io_manager now defaults the base directory to base_dir via the Dagster instance’s local_artifact_storage configuration. Previously, it defaulted to the directory where the pipeline was executed.
  • Experimental IO managers versioned_filesystem_io_manager and custom_path_fs_io_manager now require base_dir as part of the resource configs. Previously, the base_dir defaulted to the directory where the pipeline was executed.
  • Added a backfill daemon that submits backfill runs in a daemon process. This should relieve memory / CPU requirements for scheduling large backfill jobs. Enabling this feature requires a schema migration to the runs storage via the CLI command dagster instance migrate and configuring your instance with the following settings in dagster.yaml:
backfill:
  daemon_enabled: true

There is a corresponding flag in the Dagster helm chart to enable this instance configuration. See the Helm chart’s values.yaml file for more information.

  • Both sensor and schedule definitions now support a description parameter that accepts a human-readable string, which is displayed on the corresponding landing page in Dagit.
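
For illustration, here is a minimal sketch of the new description parameter, assuming the 0.10-era @schedule/@sensor decorator signatures; the pipeline name and cron string are hypothetical:

from dagster import RunRequest, schedule, sensor

@schedule(
    cron_schedule="0 6 * * *",
    pipeline_name="my_pipeline",  # hypothetical pipeline
    description="Runs my_pipeline every morning at 6am.",
)
def morning_schedule(context):
    return {}  # run_config for each scheduled run

@sensor(
    pipeline_name="my_pipeline",
    description="Requests a run whenever new upstream data is detected.",
)
def my_sensor(context):
    yield RunRequest(run_key=None, run_config={})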

Integrations

  • [dagster-gcp] The gcs_pickle_io_manager now also retries on 403 Forbidden errors, which previously would only retry on 429 TooManyRequests.

Bug Fixes

  • The use of Tuple with nested inner types in solid definitions no longer causes GraphQL errors
  • When searching assets in Dagit, keyboard navigation to the highlighted suggestion now navigates to the correct asset.
  • In some cases, run status strings in Dagit (e.g. “Queued”, “Running”, “Failed”) did not accurately match the status of the run. This has been repaired.
  • The experimental CLI command dagster new-repo should now properly generate subdirectories and files, without needing to install dagster from source (e.g. with pip install --editable).
  • Sensor minimum intervals now interact in a more compatible way with sensor daemon intervals to minimize evaluation ticks getting skipped. This should result in the cadence of sensor evaluations being less choppy.

Dependencies

  • Removed Dagster’s pin of the pendulum datetime/timezone library.

Documentation

  • Added an example of how to write a user-in-the-loop pipeline
dagster - 0.10.6

Published by gibsondan over 3 years ago

New

  • Added a dagster run delete CLI command to delete a run and its associated event log entries.
  • Added a partition_days_offset argument to the @daily_schedule decorator that allows you to customize which partition is used for each execution of your schedule. The default value of this parameter is 1, which means that a schedule that runs on day N will fill in the partition for day N-1. To create a schedule that uses the partition for the current day, set this parameter to 0, or increase it to make the schedule use an earlier day’s partition. Similar arguments have also been added for the other partitioned schedule decorators (@monthly_schedule, @weekly_schedule, and @hourly_schedule). See the sketch following this list.
  • The experimental dagster new-repo command now includes a workspace.yaml file for your new repository.
  • When specifying the location of a gRPC server in your workspace.yaml file to load your pipelines, you can now specify an environment variable for the server’s hostname and port. For example, this is now a valid workspace:
load_from:
  - grpc_server:
      host:
        env: FOO_HOST
      port:
        env: FOO_PORT
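
The partition_days_offset item above refers to this sketch: a minimal, hypothetical example (the pipeline and solid names are illustrative), assuming the 0.10-era @daily_schedule signature:

import datetime

from dagster import daily_schedule

@daily_schedule(
    pipeline_name="my_pipeline",  # hypothetical pipeline
    start_date=datetime.datetime(2021, 1, 1),
    partition_days_offset=0,  # fill in the current day's partition; the default of 1 uses the previous day
)
def my_daily_schedule(date):
    # date is the datetime of the partition being filled in
    return {"solids": {"my_solid": {"config": {"date": date.strftime("%Y-%m-%d")}}}}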

Integrations

  • [Kubernetes] K8sRunLauncher and CeleryK8sRunLauncher no longer reload the pipeline being executed just before launching it. The previous behavior ensured that the latest version of the pipeline was always being used, but was inconsistent with other run launchers. Instead, to ensure that you’re running the latest version of your pipeline, you can refresh your repository in Dagit by pressing the button next to the repository name.
  • [Kubernetes] Added a flag to the Dagster helm chart that lets you specify that the cluster already has a redis server available, so the Helm chart does not need to create one in order to use redis as a messaging queue. For more information, see the Helm chart’s values.yaml file.

Bug Fixes

  • Schedules with invalid cron strings will now throw an error when the schedule definition is loaded, instead of when the cron string is evaluated.
  • Starting in the 0.10.1 release, the Dagit playground did not load when launched with the --path-prefix option. This has been fixed.
  • In the Dagit playground, when loading the run preview results in a Python error, the link to view the error is now clickable.
  • When using the “Refresh config” button in the Dagit playground after reloading a pipeline’s repository, the user’s solid selection is now preserved.
  • When executing a pipeline with a ModeDefinition that contains a single executor, that executor is now selected by default.
  • Calling reconstructable on pipelines that were also decorated with hooks no longer raises an error.
  • The dagster-daemon liveness-check command previously returned false when daemons surfaced non-fatal errors to be displayed in Dagit, leading to crash loops in Kubernetes. The command has been fixed to return false only when the daemon has stopped running.
  • When a pipeline definition includes OutputDefinitions with io_manager_keys, or InputDefinitions with root_manager_keys, but any of the modes provided for the pipeline definition do not include a resource definition for the required key, Dagster now raises an error immediately instead of when the pipeline is executed.
  • dbt 0.19.0 introduced breaking changes to the JSON schema of dbt Artifacts. dagster-dbt has been updated to handle the new run_results.json schema for dbt 0.19.0.

Dependencies

  • The astroid library has been pinned to version 2.4 in dagster, due to version 2.5 causing problems with our pylint test suite.

dagster - 0.10.5

Published by dpeng817 over 3 years ago

Community Contributions

  • Add LICENSE for packages that claim distribution under Apache-2.0 (thanks @bollwyvl!)

New

  • [k8s] Changed our weekly docker image releases (the default images in the helm chart). dagster/dagster-k8s and dagster/dagster-celery-k8s can be used for all processes which don't require user code (Dagit, Daemon, and Celery workers when using the CeleryK8sExecutor). user-code-example can be used for a sample user repository. The prior images (k8s-dagit, k8s-celery-worker, k8s-example) are deprecated.
  • The configured api on solids now enforces the name argument as positional. The name argument remains a keyword argument on executors. The name argument has been removed from resources and loggers to reflect that they are anonymous. Previously, you would receive an error message if the name argument was provided to configured on resources or loggers.
  • [sensors] In addition to the per-sensor minimum_interval_seconds field, the overall sensor daemon interval can now be configured in the dagster.yaml instance settings with:
sensor_settings:
  interval_seconds: 30 # (default)

This changes the interval at which the daemon checks for sensors which haven't run within their minimum_interval_seconds.

  • The message logged for type check failures now includes the description provided on the TypeCheck.
  • The dagster-daemon process now runs each of its daemons in its own thread. This allows the scheduler, sensor loop, and daemon for launching queued runs to run in parallel, without slowing each other down. The dagster-daemon process will shut down if any of the daemon threads crash or hang, so that the execution environment knows that it needs to be restarted.
  • dagster new-repo is a new CLI command that generates a Dagster repository with skeleton code in your filesystem. This CLI command is experimental and it may generate different files in future versions, even between dot releases. As of 0.10.5, dagster new-repo does not support Windows. See the official API docs for more information.
  • When using a grpc_server repository location, Dagit will automatically detect changes and prompt you to reload when the remote server updates.
  • Improved consistency of headers across pages in Dagit.
  • Added support for assets to the default SQLite event log storage.

Integrations

  • [dagster-pandas] - Improved the error messages on failed pandas type checks.
  • [dagster-postgres] - postgres_url is now a StringSource and can be loaded by environment variable
  • [helm] - Users can set Kubernetes labels on Celery worker deployments
  • [helm] - Users can set environment variables for Flower deployment
  • [helm] - The redis helm chart is now included as an optional dagster helm chart dependency

Bugfixes

  • Resolved an error preventing dynamic outputs from being passed to composite_solid inputs.
  • Fixed the tick history graph for schedules defined in a lazy-loaded repository (#3626).
  • Fixed a performance regression of the Runs page in Dagit.
  • Fixed the Gantt chart on the Dagit run view to use the correct start time, repairing how steps are rendered within the chart.
  • On the Instance status page in Dagit, states where daemons have multiple errors are now handled correctly.
  • Various Dagit bugfixes and improvements.
dagster -

Published by johannkm over 3 years ago

Bugfixes

  • Fixed an issue with daemon heartbeat backwards compatibility. Resolves an error on Dagit's Daemon Status page
dagster -

Published by OwenKephart over 3 years ago

New

  • [dagster] Sensors can now specify a minimum_interval_seconds argument, which determines the minimum amount of time between sensor evaluations. See the sketch following this list.
  • [dagit] After manually reloading the current repository, users will now be prompted to regenerate preset-based or partition-set based run configs in the Playground view. This helps ensure that the generated run config is up to date when launching new runs. The prompt does not occur when the repository is automatically reloaded.
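
A minimal sketch of the new argument, assuming the 0.10-era @sensor decorator; the pipeline name is hypothetical:

from dagster import RunRequest, sensor

@sensor(pipeline_name="my_pipeline", minimum_interval_seconds=120)
def my_sensor(context):
    # The sensor daemon waits at least 120 seconds between evaluations of this sensor.
    yield RunRequest(run_key=None, run_config={})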

Bugfixes

  • Updated the -n/--max_workers default value for the dagster api grpc command to be None. When set to None, the gRPC server will use the default number of workers based on the CPU count. If you were previously setting this value to 1, we recommend removing the argument or increasing the number.
  • Fixed issue loading the schedule tick history graph for new schedules that have not been turned on.
  • In Dagit, newly launched runs will open in the current tab instead of a new tab.
  • Dagit bugfixes and improvements, including changes to loading state spinners.
  • When a user specifies both an intermediate storage and an IO manager for a particular output, we no longer silently ignore the IO manager.
dagster -

Published by sidkmenon-zz over 3 years ago

New

  • [dagstermill] Users can now specify custom tags & descriptions for notebook solids.
  • [dagster-pagerduty / dagster-slack] Added built-in hook integrations to create pagerduty/slack alerts when solids fail.
  • [dagit] Added ability to preview runs for upcoming schedule ticks.

Bugfixes

  • Fixed an issue where run start times and end times were displayed in the wrong timezone in Dagit when using Postgres storage.

  • Schedules with partitions that weren’t able to execute due to not being able to find a partition will now display the name of the partition they were unable to find on the “Last tick” entry for that schedule.

  • Improved timing information display for queued and canceled runs within the Runs table view and on individual Run pages in Dagit.

  • Improvements to the tick history view for schedules and sensors.

  • Fixed formatting issues on the Dagit instance configuration page.

  • Miscellaneous Dagit bugfixes and improvements.

  • The dagster pipeline launch command will now respect run concurrency limits if they are applied on your instance.

  • Fixed an issue where re-executing a run created by a sensor would cause the daemon to stop executing any additional runs from that sensor.

  • Sensor runs with invalid run configuration will no longer create a failed run - instead, an error will appear on the page for the sensor, allowing you to fix the configuration issue.

  • General dagstermill housekeeping: test refactoring & type annotations, as well as repinning ipykernel to solve #3401

Documentation

  • Improved dagster-dbt example.
  • Added examples to demonstrate experimental features, including Memoized Development and Dynamic Graph.
  • Added a PR template and a guide on picking a first issue for first-time contributors.
dagster -

Published by rexledesma over 3 years ago

Community Contributions

  • Reduced image size of k8s-example by 25% (104 MB) (thanks @alex-treebeard and @mrdavidlaing!)
  • [dagster-snowflake] snowflake_resource can now be configured to use the SQLAlchemy connector (thanks @basilvetas!)

New

  • When setting userDeployments.deployments in the Helm chart, replicaCount now defaults to 1 if not specified.
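
For reference, a minimal sketch of a userDeployments entry in values.yaml, assuming the chart structure of this era; the image and file paths are hypothetical, and replicaCount may now simply be omitted:

userDeployments:
  enabled: true
  deployments:
    - name: "user-code-example"
      image:
        repository: "dagster/user-code-example"
        tag: latest
        pullPolicy: Always
      dagsterApiGrpcArgs:
        - "-f"
        - "/example_project/example_repo/repo.py"
      port: 3030
      # replicaCount: 1  # now the default when unspecified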

Bugfixes

  • Fixed an issue where the Dagster daemon process couldn’t launch runs in repository locations containing more than one repository.
  • Fixed an issue where the Helm chart was not correctly templating env, envConfigMaps, and envSecrets.

Documentation

  • Added new troubleshooting guide for problems encountered while using the QueuedRunCoordinator to limit run concurrency.
  • Added documentation for the sensor command-line interface.
dagster - 0.10.0 The Edge of Glory

Published by prha almost 4 years ago

Major Changes

  • A native scheduler with support for exactly-once, fault tolerant, timezone-aware scheduling. A new Dagster daemon process has been added to manage your schedules and sensors with a reconciliation loop, ensuring that all runs are executed exactly once, even if the Dagster daemon experiences occasional failure. See the Migration Guide for instructions on moving from SystemCronScheduler or K8sScheduler to the new scheduler.
  • First-class sensors, built on the new Dagster daemon, allow you to instigate runs based on changes in external state - for example, files on S3 or assets materialized by other Dagster pipelines. See the Sensors Overview for more information.
  • Dagster now supports pipeline run queueing. You can apply instance-level run concurrency limits and prioritization rules by adding the QueuedRunCoordinator to your Dagster instance. See the Run Concurrency Overview for more information. A configuration sketch follows this list.
  • The IOManager abstraction provides a new, streamlined primitive for granular control over where and how solid outputs are stored and loaded. This is intended to replace the (deprecated) intermediate/system storage abstractions. See the IO Manager Overview for more information.
  • A new Partitions page in Dagit lets you view your pipeline runs organized by partition. You can also launch backfills from Dagit and monitor them from this page.
  • A new Instance Status page in Dagit lets you monitor the health of your Dagster instance, with repository location information, daemon statuses, instance-level schedule and sensor information, and linkable instance configuration.
  • Resources can now declare their dependencies on other resources via the required_resource_keys parameter on @resource.
  • Our support for deploying on Kubernetes is now mature and battle-tested. Our Helm chart is now easier to configure and deploy, and we’ve made big investments in observability and reliability. You can view Kubernetes interactions in the structured event log and use Dagit to help you understand what’s happening in your deployment. The defaults in the Helm chart will give you graceful degradation and failure recovery right out of the box.
  • Experimental support for dynamic orchestration with the new DynamicOutputDefinition API. Dagster can now map the downstream dependencies over a dynamic output at runtime.
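
As referenced in the run queueing item above, here is a minimal dagster.yaml sketch for enabling the QueuedRunCoordinator; the concurrency limit shown is illustrative:

run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 10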

Breaking Changes

Dropping Python 2 support

  • We’ve dropped support for Python 2.7, based on community usage and enthusiasm for Python 3-native public APIs.

Removal of deprecated APIs

These APIs were marked for deprecation with warnings in the 0.9.0 release, and have been removed in the 0.10.0 release.

  • The decorator input_hydration_config has been removed. Use the dagster_type_loader decorator instead.
  • The decorator output_materialization_config has been removed. Use dagster_type_materializer instead.
  • The system storage subsystem has been removed. This includes SystemStorageDefinition, @system_storage, and default_system_storage_defs. Use the new IOManagers API instead. See the IO Manager Overview for more information.
  • The config_field argument on decorators and definitions classes has been removed and replaced with config_schema. This is a drop-in rename.
  • The argument step_keys_to_execute to the functions reexecute_pipeline and reexecute_pipeline_iterator has been removed. Use the step_selection argument to select subsets for execution instead.
  • Repositories can no longer be loaded using the legacy repository key in your workspace.yaml; use load_from instead. See the Workspaces Overview for documentation about how to define a workspace.

Breaking API Changes

  • SolidExecutionResult.compute_output_event_dict has been renamed to SolidExecutionResult.compute_output_events_dict. A solid execution result is returned from methods such as result_for_solid. Any call sites will need to be updated.
  • The .compute suffix is no longer applied to step keys. Step keys that were previously named my_solid.compute will now be named my_solid. If you are using any API method that takes a step_selection argument, you will need to update the step keys accordingly.
  • The pipeline_def property has been removed from the InitResourceContext passed to functions decorated with @resource.

Helm Chart

  • The schema for the scheduler values in the helm chart has changed. Instead of a simple toggle on/off, we now require an explicit scheduler.type to specify usage of the DagsterDaemonScheduler, K8sScheduler, or otherwise. If your specified scheduler.type has required config, these fields must be specified under scheduler.config.
  • snake_case fields have been changed to camelCase. Please update your values.yaml as follows:
    • pipeline_run → pipelineRun
    • dagster_home → dagsterHome
    • env_secrets → envSecrets
    • env_config_maps → envConfigMaps
  • The Helm values celery and k8sRunLauncher have now been consolidated under the Helm value runLauncher for simplicity. Use the field runLauncher.type to specify usage of the K8sRunLauncher, CeleryK8sRunLauncher, or otherwise. By default, the K8sRunLauncher is enabled.
  • All Celery message brokers (i.e. RabbitMQ and Redis) are disabled by default. If you are using the CeleryK8sRunLauncher, you should explicitly enable your message broker of choice.
  • userDeployments are now enabled by default.

Core

  • Event log messages streamed to stdout and stderr have been streamlined to be a single line per event.

  • Experimental support for memoization and versioning lets you execute pipelines incrementally, selecting which solids need to be rerun based on runtime criteria and versioning their outputs with configurable identifiers that capture their upstream dependencies.

    To set up memoized step selection, users can provide a MemoizableIOManager, whose has_output function decides whether a given solid output needs to be computed or already exists. To execute a pipeline with memoized step selection, users can supply the dagster/is_memoized_run run tag to execute_pipeline.

    To set the version on a solid or resource, users can supply the version field on the definition. To access the derived version for a step output, users can access the version field on the OutputContext passed to the handle_output and load_input methods of IOManager and the has_output method of MemoizableIOManager.

  • Schedules that are executed using the new DagsterDaemonScheduler can now execute in any timezone by adding an execution_timezone parameter to the schedule. Daylight Savings Time transitions are also supported. See the Schedules Overview for more information and examples.
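
A minimal sketch of a timezone-aware schedule under the DagsterDaemonScheduler, assuming a hypothetical pipeline named my_pipeline:

import datetime

from dagster import daily_schedule

@daily_schedule(
    pipeline_name="my_pipeline",
    start_date=datetime.datetime(2021, 1, 1),
    execution_time=datetime.time(hour=9, minute=0),
    execution_timezone="US/Central",  # fires at 9am US/Central, DST transitions handled
)
def my_daily_schedule(date):
    return {}  # run_config for the partition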

Dagit

  • Countdown and refresh buttons have been added for pages with regular polling queries (e.g. Runs, Schedules).
  • Confirmation and progress dialogs are now presented when performing run terminations and deletions. Additionally, hanging/orphaned runs can now be forced to terminate, by selecting "Force termination immediately" in the run termination dialog.
  • The Runs page now shows counts for "Queued" and "In progress" tabs, and individual run pages show timing, tags, and configuration metadata.
  • The backfill experience has been improved with means to view progress and terminate the entire backfill via the partition set page. Additionally, errors related to backfills are now surfaced more clearly.
  • Shortcut hints are no longer displayed when attempting to use the screen capture command.
  • The asset page has been revamped to include a table of events and enable organizing events by partition. Asset key escaping issues in other views have been fixed as well.
  • Miscellaneous bug fixes, frontend performance tweaks, and other improvements are also included.

Kubernetes/Helm

Helm

  • We've added schema validation to our Helm chart. You can now check that your values YAML file is correct by running:

    helm lint helm/dagster -f helm/dagster/values.yaml
    
  • Added support for resource annotations throughout our Helm chart.

  • Added Helm deployment of the dagster daemon & daemon scheduler.

  • Added Helm support for configuring a compute log manager in your dagster instance.

  • User code deployments now include a user ConfigMap by default.

  • Changed the default liveness probe for Dagit to use httpGet "/dagit_info" instead of tcpSocket:80

Dagster-K8s [Kubernetes]

  • Added support for user code deployments on Kubernetes.
  • Added support for tagging pipeline executions.
  • Fixes to support version 12.0.0 of the Python Kubernetes client.
  • Improved implementation of Kubernetes+Dagster retries.
  • Many logging improvements to surface debugging information and failures in the structured event log.

Dagster-Celery-K8s

  • Improved interrupt/termination handling in Celery workers.

Integrations & Libraries

  • Added a new dagster-docker library with a DockerRunLauncher that launches each run in its own Docker container. (See Deploying with Docker docs for an example.)
  • Added support for AWS Athena. (Thanks @jmsanders!)
  • Added mocks for AWS S3, Athena, and Cloudwatch in tests. (Thanks @jmsanders!)
  • Allow setting of S3 endpoint through env variables. (Thanks @marksteve!)
  • Various bug fixes and new features for the Azure, Databricks, and Dask integrations.
  • Added a create_databricks_job_solid for creating solids that launch Databricks jobs.

Migrating to 0.10.0

Action Required: Run and event storage schema changes

# Run after migrating to 0.10.0

$ dagster instance migrate

This release includes several schema changes to the Dagster storages that improve performance and enable new features like sensors and run queueing. After upgrading to 0.10.0, run the dagster instance migrate command to migrate your instance storage to the latest schema. This will turn off any running schedules, so you will need to restart any previously running schedules after migrating the schema. Before turning them back on, you should follow the steps below to migrate to DagsterDaemonScheduler.

New scheduler: DagsterDaemonScheduler

This release includes a new DagsterDaemonScheduler with improved fault tolerance and full support for timezones. We highly recommend upgrading to the new scheduler during this release. The existing schedulers, SystemCronScheduler and K8sScheduler, are deprecated and will be removed in a future release.

Steps to migrate

Instead of relying on system cron or k8s cron jobs, the DagsterDaemonScheduler uses the new dagster-daemon service to run schedules. This requires running the dagster-daemon service as a part of your deployment.

Refer to our deployment documentation for guides on how to set up and run the daemon process for local development, Docker, or Kubernetes deployments.

If you are currently using the SystemCronScheduler or K8sScheduler:

  1. Stop any currently running schedules, to prevent any dangling cron jobs from being left behind. You can do this through the Dagit UI, or using the following command:

    dagster schedule stop --location {repository_location_name} {schedule_name}
    

    If you do not stop running schedules before changing schedulers, Dagster will throw an exception on startup due to the misconfigured running schedules.

  2. In your dagster.yaml file, remove the scheduler: entry. If there is no scheduler: entry, the DagsterDaemonScheduler is automatically used as the default scheduler.

  3. Start the dagster-daemon process. Guides can be found in our deployment documentations.

See our schedules troubleshooting guide for help if you experience any problems with the new scheduler.

If you are not using a legacy scheduler:

No migration steps are needed, but make sure you run dagster instance migrate as a part of upgrading to 0.10.0.

Deprecation: Intermediate Storage

We have deprecated the intermediate storage machinery in favor of the new IO manager abstraction, which offers finer-grained control over how inputs and outputs are serialized and persisted. Check out the IO Managers Overview for more information.

Steps to Migrate

  • We have deprecated the top level "storage" and "intermediate_storage" fields on run_config. If you are currently executing pipelines as follows:

    @pipeline
    def my_pipeline():
        ...
    
    execute_pipeline(
        my_pipeline,
        run_config={
            "intermediate_storage": {
                "filesystem": {"base_dir": ...}
            }
        },
    )
    
    execute_pipeline(
        my_pipeline,
        run_config={
            "storage": {
                "filesystem": {"base_dir": ...}
            }
        },
    )
    

    You should instead use the built-in IO manager fs_io_manager, which can be attached to your pipeline as a resource:

    @pipeline(
        mode_defs=[
            ModeDefinition(
                resource_defs={"io_manager": fs_io_manager}
            )
        ],
    )
    def my_pipeline():
        ...
    
    execute_pipeline(
        my_pipeline,
        run_config={
            "resources": {
                "io_manager": {"config": {"base_dir": ...}}
            }
        },
    )
    

    There are corresponding IO managers for other intermediate storages, such as the S3- and ADLS2-based storages.

  • We have deprecated IntermediateStorageDefinition and @intermediate_storage.

    If you have written custom intermediate storage, you should migrate to custom IO managers defined using the @io_manager API. We have provided a helper method, io_manager_from_intermediate_storage, to help migrate your existing custom intermediate storages to IO managers.

    my_io_manager_def = io_manager_from_intermediate_storage(
        my_intermediate_storage_def
    )
    
    @pipeline(
        mode_defs=[
            ModeDefinition(
                resource_defs={
                    "io_manager": my_io_manager_def
                }
            ),
        ],
    )
    def my_pipeline():
        ...
    
  • We have deprecated the intermediate_storage_defs argument to ModeDefinition, in favor of the new IO managers, which should be attached using the resource_defs argument.

Removal: input_hydration_config and output_materialization_config

Use dagster_type_loader instead of input_hydration_config and dagster_type_materializer instead of output_materialization_config.

On DagsterType and type constructors in dagster_pandas, use the loader argument instead of input_hydration_config and the materializer argument instead of output_materialization_config.
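
A minimal sketch of the replacement APIs; the CSV-flavored names and config are hypothetical:

import csv

from dagster import (
    AssetMaterialization,
    DagsterType,
    dagster_type_loader,
    dagster_type_materializer,
)

@dagster_type_loader(config_schema={"path": str})
def csv_loader(context, config):
    # Build the input value from loader config.
    with open(config["path"]) as f:
        return list(csv.DictReader(f))

@dagster_type_materializer(config_schema={"path": str})
def csv_materializer(context, config, value):
    # Persist the output value and report what was written.
    with open(config["path"], "w") as f:
        writer = csv.DictWriter(f, fieldnames=value[0].keys())
        writer.writeheader()
        writer.writerows(value)
    return AssetMaterialization(asset_key="csv_output")

CsvRows = DagsterType(
    name="CsvRows",
    type_check_fn=lambda _, value: isinstance(value, list),
    loader=csv_loader,
    materializer=csv_materializer,
)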

Removal: repository key in workspace YAML

We have removed the ability to specify a repository in your workspace using the repository: key. Use load_from: instead when specifying how to load the repositories in your workspace.
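
For example, a minimal workspace.yaml using load_from; the file name is illustrative:

load_from:
  - python_file: repo.py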

Deprecated: python_environment key in workspace YAML

The python_environment: key is now deprecated and will be removed in a future release.

Previously, when you wanted to load a repository location in your workspace using a different Python environment from Dagit’s Python environment, you needed to use a python_environment: key under load_from: instead of the python_file: or python_package: keys. Now, you can simply customize the executable_path in your workspace entries without needing to use the python_environment: key.

For example, the following workspace entry:

  - python_environment:
      executable_path: "/path/to/venvs/dagster-dev-3.7.6/bin/python"
      target:
        python_package:
          package_name: dagster_examples
          location_name: dagster_examples

should now be expressed as:

  - python_package:
      executable_path: "/path/to/venvs/dagster-dev-3.7.6/bin/python"
      package_name: dagster_examples
      location_name: dagster_examples

See our Workspaces Overview for more information and examples.

Removal: config_field property on definition classes

We have removed the property config_field on definition classes. Use config_schema instead.
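
A minimal sketch of the renamed argument on a solid definition; the solid and config field are hypothetical:

from dagster import solid

@solid(config_schema={"limit": int})
def my_solid(context):
    # Config values are accessed as before; only the argument name changed.
    context.log.info(f"limit is {context.solid_config['limit']}")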

Removal: System Storage

We have removed the system storage abstractions, i.e. SystemStorageDefinition and @system_storage (deprecated in 0.9.0).

Please note that the intermediate storage abstraction is also deprecated and will be removed in 0.11.0. Use IO managers instead.

  • We have removed the system_storage_defs argument (deprecated in 0.9.0) to ModeDefinition, in favor of intermediate_storage_defs.
  • We have removed the built-in system storages, e.g. default_system_storage_defs (deprecated in 0.9.0).

Removal: step_keys_to_execute

We have removed the step_keys_to_execute argument to reexecute_pipeline and reexecute_pipeline_iterator, in favor of step_selection. This argument accepts the Dagster selection syntax, so, for example, *solid_a+ represents solid_a, all of its upstream steps, and its immediate downstream steps.
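
A minimal sketch of re-execution with step_selection; the pipeline import and parent run id are hypothetical:

from dagster import DagsterInstance, reexecute_pipeline

from my_project.pipelines import my_pipeline  # hypothetical import

instance = DagsterInstance.get()
result = reexecute_pipeline(
    my_pipeline,
    parent_run_id="1e3f9d5c",  # hypothetical id of the run being re-executed
    step_selection=["*solid_a+"],  # solid_a, its upstream steps, and its immediate downstream steps
    instance=instance,
)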

Breaking Change: date_partition_range

Starting in 0.10.0, Dagster uses the pendulum library to ensure that schedules and partitions behave correctly with respect to timezones. As part of this change, the delta parameter to date_partition_range (which determined the time difference between partitions and was a datetime.timedelta) has been replaced by a delta_range parameter (which must be a string that's a valid argument to the pendulum.period function, such as "days", "hours", or "months").

For example, the following partition range for a monthly partition set:

date_partition_range(
    start=datetime.datetime(2018, 1, 1),
    end=datetime.datetime(2019, 1, 1),
    delta=datetime.timedelta(months=1)
)

should now be expressed as:

date_partition_range(
    start=datetime.datetime(2018, 1, 1),
    end=datetime.datetime(2019, 1, 1),
    delta_range="months"
)

Breaking Change: PartitionSetDefinition.create_schedule_definition

When you create a schedule from a partition set using PartitionSetDefinition.create_schedule_definition, you now must supply a partition_selector argument that tells the scheduler which partition to use for a given schedule time.

We have added two helper functions, create_offset_partition_selector and identity_partition_selector, that capture two common partition selectors (schedules that execute at a fixed offset from the partition times, e.g. a schedule that creates the previous day's partition each morning, and schedules that execute at the same time as the partition times).

The previous default partition selector was last_partition, which didn't always work as expected when using the default scheduler and has been removed in favor of the two helper partition selectors above.

For example, a schedule created from a daily partition set that fills in each partition the next day at 10AM would be created as follows:

partition_set = PartitionSetDefinition(
    name='hello_world_partition_set',
    pipeline_name='hello_world_pipeline',
    partition_fn=date_partition_range(
        start=datetime.datetime(2021, 1, 1),
        delta_range="days",
        timezone="US/Central",
    ),
    run_config_fn_for_partition=my_run_config_fn,
)

schedule_definition = partition_set.create_schedule_definition(
    "daily_10am_schedule",
    "0 10 * * *",
    partition_selector=create_offset_partition_selector(lambda d: d.subtract(hours=10, days=1)),
    execution_timezone="US/Central",
)

Renamed: Helm values

Following convention in the Helm docs, we now camel case all of our Helm values. To migrate to 0.10.0, you'll need to update your values.yaml with the following renames:

  • pipeline_run → pipelineRun
  • dagster_home → dagsterHome
  • env_secrets → envSecrets
  • env_config_maps → envConfigMaps

Restructured: scheduler in Helm values

When specifying the Dagster instance scheduler, rather than using a boolean field to switch between the current options of K8sScheduler and DagsterDaemonScheduler, we now require the scheduler type to be explicitly defined under scheduler.type. If the user-specified scheduler.type has required config, additional fields will need to be specified under scheduler.config.

scheduler.type and corresponding scheduler.config values are enforced via JSON Schema.

For example, if your Helm values previously were set like this to enable the DagsterDaemonScheduler:

scheduler:
  k8sEnabled: false


You should instead have:

scheduler:
  type: DagsterDaemonScheduler

Restructured: celery and k8sRunLauncher in Helm values

celery and k8sRunLauncher now live under runLauncher.config.celeryK8sRunLauncher and runLauncher.config.k8sRunLauncher respectively. Now, to enable celery, runLauncher.type must equal CeleryK8sRunLauncher. To enable the vanilla K8s run launcher, runLauncher.type must equal K8sRunLauncher.

runLauncher.type and corresponding runLauncher.config values are enforced via JSON Schema.

For example, if your Helm values previously were set like this to enable the K8sRunLauncher:

celery:
  enabled: false
k8sRunLauncher:
  enabled: true
  jobNamespace: ~
  loadInclusterConfig: true
  kubeconfigFile: ~
  envConfigMaps: []
  envSecrets: []


You should instead have:

runLauncher:
  type: K8sRunLauncher
  config:
    k8sRunLauncher:
      jobNamespace: ~
      loadInclusterConfig: true
      kubeconfigFile: ~
      envConfigMaps: []
      envSecrets: []

New Helm defaults

By default, userDeployments is enabled and the runLauncher is set to the K8sRunLauncher. Along with the latter change, all message brokers (e.g. rabbitmq and redis) are now disabled by default.

If you were using the CeleryK8sRunLauncher, one of rabbitmq or redis must now be explicitly enabled in your Helm values.
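
For example, a minimal values.yaml sketch enabling Redis as the broker for the CeleryK8sRunLauncher, assuming the chart exposes the top-level rabbitmq/redis toggles described above:

runLauncher:
  type: CeleryK8sRunLauncher
redis:
  enabled: true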

dagster - 0.9.22.post0

Published by catherinewu almost 4 years ago

Bugfixes

  • [Dask] Pin dask[dataframe] to <=2.30.0 and distributed to <=2.30.1
dagster - 0.9.22

Published by catherinewu almost 4 years ago

New

  • When using a solid selection in the Dagit Playground, non-matching solids are hidden in the RunPreview panel.

Bugfixes

  • [Helm/K8s] Fixed whitespacing bug in ingress.yaml Helm template.
dagster - 0.9.21

Published by chenbobby almost 4 years ago

Community Contributions

  • Fixed helm chart to only add flower to the K8s ingress when enabled (thanks @PenguinToast!)
  • Updated helm chart to use more lenient timeouts for liveness probes on user code deployments (thanks @PenguinToast!)

Bugfixes

  • [Helm/K8s] Due to Flower being incompatible with Celery 5.0, the Helm chart for Dagster now uses a specific image mher/flower:0.9.5 for the Flower pod.