Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
APACHE-2.0 License
Published by idanov over 2 years ago
Kedro 0.18.0 strives to reduce the complexity of the project template and get us closer to a stable release of the framework. We've introduced the full micro-packaging workflow 📦, which allows you to import packages, utility functions and existing pipelines into your Kedro project. Integration with IPython and Jupyter has been streamlined in preparation for enhancements to Kedro's interactive workflow. Additionally, the release comes with long-awaited Python 3.9 and 3.10 support 🐍.
* Added `kedro.config.abstract_config.AbstractConfigLoader` as an abstract base class for all `ConfigLoader` implementations. `ConfigLoader` and `TemplatedConfigLoader` now inherit directly from this base class.
* Simplified the `ConfigLoader.get` and `TemplatedConfigLoader.get` API and delegated the actual `get` implementation to the `kedro.config.common` module.
* The `hook_manager` is no longer a global singleton. The `hook_manager` lifecycle is now managed by the `KedroSession`, and a new `hook_manager` will be created every time a `session` is instantiated.
* Added support for specifying parameters mapping in `pipeline()` without the `params:` prefix.
* Added `Pipeline.filter()` (previously in `KedroContext._filter_pipeline()`) to filter parts of a pipeline.
* Added `username` to the Session store for logging during Experiment Tracking.
* A packaged Kedro project can now be run from Python code:

```python
from my_package.__main__ import main

main(["--pipeline", "my_pipeline"])  # or just main() if no parameters are needed for the run
```

* Removed `cli.py` from the Kedro project template. By default, all CLI commands, including `kedro run`, are now defined on the Kedro framework side. You can still define custom CLI commands by creating your own `cli.py`.
* Removed `hooks.py` from the Kedro project template. Registration hooks have been removed in favour of `settings.py` configuration, but you can still define execution timeline hooks by creating your own `hooks.py`.
* Removed the `.ipython` directory from the Kedro project template. The IPython/Jupyter workflow no longer uses IPython profiles; it now uses an IPython extension.
* The `kedro run` configuration environment names can now be set in `settings.py` using the `CONFIG_LOADER_ARGS` variable. The relevant keyword arguments to supply are `base_env` and `default_run_env`, which are set to `base` and `local` respectively by default.

Type | Description | Location |
---|---|---|
`pandas.XMLDataSet` | Read XML into a pandas DataFrame; write a pandas DataFrame to XML | `kedro.extras.datasets.pandas` |
`networkx.GraphMLDataSet` | Work with NetworkX using GraphML files | `kedro.extras.datasets.networkx` |
`networkx.GMLDataSet` | Work with NetworkX using Graph Modelling Language files | `kedro.extras.datasets.networkx` |
`redis.PickleDataSet` | Loads/saves data from/to a Redis database | `kedro.extras.datasets.redis` |
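A minimal `settings.py` sketch of the run-environment configuration described above (the values shown are the stated defaults; this is a config fragment, not a complete file):

```python
# settings.py -- sketch of the CONFIG_LOADER_ARGS environment names
CONFIG_LOADER_ARGS = {
    "base_env": "base",          # environment holding shared configuration
    "default_run_env": "local",  # environment used for a run by default
}
```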
* Added `partitionBy` support and exposed `save_args` for `SparkHiveDataSet`.
* Exposed `open_args_save` in `fs_args` for `pandas.ParquetDataSet`.
* Refactored the `load` and `save` operations for `pandas` datasets in order to leverage `pandas` own API and delegate `fsspec` operations to them. This reduces the need to have our own `fsspec` wrappers.
* Merged `pandas.AppendableExcelDataSet` into `pandas.ExcelDataSet`.
* Added `save_args` to `feather.FeatherDataSet`.
* Load the Kedro IPython extension with `%load_ext kedro.extras.extensions.ipython` and use the line magic `%reload_kedro`.
* `kedro ipython` launches an IPython session that preloads the Kedro IPython extension.
* `kedro jupyter notebook/lab` creates a custom Jupyter kernel that preloads the Kedro IPython extension and launches a notebook with that kernel selected. There is no longer a need to specify `--all-kernels` to show all available kernels.
* Bumped the minimum version of `pandas` to 1.3. Any `storage_options` should continue to be specified under `fs_args` and/or `credentials`.
* Bumped the `black` dependency in the project template to a non-pre-release version.
* Removed `RegistrationSpecs` and its associated `register_config_loader` and `register_catalog` hook specifications in favour of `CONFIG_LOADER_CLASS`/`CONFIG_LOADER_ARGS` and `DATA_CATALOG_CLASS` in `settings.py`.
* Removed the deprecated functions `load_context` and `get_project_context`.
* Removed the `CONF_SOURCE`, `package_name`, `pipeline`, `pipelines`, `config_loader` and `io` attributes from `KedroContext`, as well as the deprecated `KedroContext.run` method.
* Added a `PluginManager` `hook_manager` argument to `KedroContext` and the `Runner.run()` method, which will be provided by the `KedroSession`.
* Removed `get_hook_manager()` and replaced its functionality with `_create_hook_manager()`.
* Only one run can now be successfully executed as part of a `KedroSession`; `run_id` has been renamed to `session_id` as a result.
* The `settings.py` setting `CONF_ROOT` has been renamed to `CONF_SOURCE`. Its default value of `conf` remains unchanged.
* The `ConfigLoader` and `TemplatedConfigLoader` argument `conf_root` has been renamed to `conf_source`.
* `extra_params` has been renamed to `runtime_params` in `kedro.config.config.ConfigLoader` and `kedro.config.templated_config.TemplatedConfigLoader`.
* The environment defaulting behaviour has been removed from `KedroContext` and is now implemented in a `ConfigLoader` class (or equivalent) with the `base_env` and `default_run_env` attributes.
* `pandas.ExcelDataSet` now uses the `openpyxl` engine instead of `xlrd`.
* `pandas.ParquetDataSet` now calls `pd.to_parquet()` upon saving. Note that the argument `partition_cols` is not supported.
* The `spark.SparkHiveDataSet` API has been updated to reflect `spark.SparkDataSet`. The `write_mode=insert` option has also been replaced with `write_mode=append` as per the Spark style guide. This change addresses Issue 725 and Issue 745. Additionally, `upsert` mode now leverages `checkpoint` functionality and requires a valid `checkpointDir` to be set for the current `SparkContext`.
* `yaml.YAMLDataSet` can no longer save a `pandas.DataFrame` directly, but it can save a dictionary. Use `pandas.DataFrame.to_dict()` to convert your `pandas.DataFrame` to a dictionary before you attempt to save it to YAML.
* Removed `open_args_load` and `open_args_save` from the following datasets:
  * `pandas.CSVDataSet`
  * `pandas.ExcelDataSet`
  * `pandas.FeatherDataSet`
  * `pandas.JSONDataSet`
  * `pandas.ParquetDataSet`
* `storage_options` are now dropped if they are specified under `load_args` or `save_args` for the following datasets:
  * `pandas.CSVDataSet`
  * `pandas.ExcelDataSet`
  * `pandas.FeatherDataSet`
  * `pandas.JSONDataSet`
  * `pandas.ParquetDataSet`
* Renamed `lambda_data_set`, `memory_data_set`, and `partitioned_data_set` to `lambda_dataset`, `memory_dataset`, and `partitioned_dataset`, respectively, in `kedro.io`.
* `networkx.NetworkXDataSet` has been renamed to `networkx.JSONDataSet`.
* Removed `kedro install` in favour of `pip install -r src/requirements.txt` to install project dependencies.
* Removed the `--parallel` flag from `kedro run` in favour of `--runner=ParallelRunner`. The `-p` flag is now an alias for `--pipeline`.
* `kedro pipeline package` has been replaced by `kedro micropkg package` and, in addition to the `--alias` flag used to rename the package, now accepts a module name and path to the pipeline or utility module to package, relative to `src/<package_name>/`. The `--version` CLI option has been removed in favour of setting a `__version__` variable in the micro-package's `__init__.py` file.
* `kedro pipeline pull` has been replaced by `kedro micropkg pull`, which now also supports `--destination` to provide a location for pulling the package.
* Removed `kedro pipeline list` and `kedro pipeline describe` in favour of `kedro registry list` and `kedro registry describe`.
* `kedro package` and `kedro micropkg package` now save `egg` and `whl` or `tar` files in the `<project_root>/dist` folder (previously `<project_root>/src/dist`).
* Changed `kedro build-reqs` to compile requirements from `requirements.txt` instead of `requirements.in` and save them to `requirements.lock` instead of `requirements.txt`.
* `kedro jupyter notebook/lab` no longer accept the `--all-kernels` or `--idle-timeout` flags. `--all-kernels` is now the default behaviour.
* `KedroSession.run` now raises `ValueError` rather than `KedroContextError` when the pipeline contains no nodes. The same `ValueError` is raised when there are no matching tags.
* `KedroSession.run` now raises `ValueError` rather than `KedroContextError` when the pipeline name doesn't exist in the pipeline registry.
* … (`.tar.gz`).
* Removed the decorator API from `Node` and `Pipeline`, as well as the modules `kedro.extras.decorators` and `kedro.pipeline.decorators`.
* Removed the Transformer API from `DataCatalog`, as well as the modules `kedro.extras.transformers` and `kedro.io.transformers`.
* Removed `Journal` and `DataCatalogWithDefault`.
* Removed the `%init_kedro` IPython line magic, with its functionality incorporated into `%reload_kedro`. This means that if `%reload_kedro` is called with a filepath, that will be set as default for subsequent calls.
* Remove any existing `hook_impl` of the `register_config_loader` and `register_catalog` methods from `ProjectHooks` in `hooks.py` (or custom alternatives).
* If you use `run_id` in the `after_catalog_created` hook, replace it with `save_version` instead.
* If you use `run_id` in any of the `before_node_run`, `after_node_run`, `on_node_error`, `before_pipeline_run`, `after_pipeline_run` or `on_pipeline_error` hooks, replace it with `session_id` instead.
* In the `settings.py` file:
  * If you use `kedro.config.TemplatedConfigLoader`, alter `CONFIG_LOADER_CLASS` to specify the class and `CONFIG_LOADER_ARGS` to specify keyword arguments. If not set, these default to `kedro.config.ConfigLoader` and an empty dictionary respectively.
  * Alter `DATA_CATALOG_CLASS` to specify the class. If not set, this defaults to `kedro.io.DataCatalog`.
  * If you have a custom config location (default `conf`), update `CONF_ROOT` to `CONF_SOURCE` and set it to a string with the expected configuration location. If not set, this defaults to `"conf"`.
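Putting the `settings.py` migration steps together, a minimal sketch (this assumes you use `TemplatedConfigLoader`; `globals_pattern` is an illustrative keyword argument — adjust everything to your project):

```python
# settings.py -- migration sketch for Kedro 0.18.0 (config fragment)
from kedro.config import TemplatedConfigLoader  # only needed if you used TemplatedConfigLoader

CONFIG_LOADER_CLASS = TemplatedConfigLoader
CONFIG_LOADER_ARGS = {"globals_pattern": "*globals.yml"}  # example kwargs

# DATA_CATALOG_CLASS defaults to kedro.io.DataCatalog if unset.
# CONF_SOURCE defaults to "conf" if unset; set it only for a custom location:
CONF_SOURCE = "conf"
```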
For a given pipeline:

```python
active_pipeline = pipeline(
    pipe=[
        node(
            func=some_func,
            inputs=["model_input_table", "params:model_options"],
            outputs=["**my_output"],
        ),
        ...,
    ],
    inputs="model_input_table",
    namespace="candidate_modelling_pipeline",
)
```

The parameters should look like this:

```diff
-model_options:
-    test_size: 0.2
-    random_state: 8
-    features:
-    - engines
-    - passenger_capacity
-    - crew
+candidate_modelling_pipeline:
+    model_options:
+        test_size: 0.2
+        random_state: 8
+        features:
+        - engines
+        - passenger_capacity
+        - crew
```

* Remove the `params:` prefix when supplying values to the `parameters` argument in a `pipeline()` call.
* If you pull modular pipelines with `kedro pipeline pull my_pipeline --alias other_pipeline`, now use `kedro micropkg pull my_pipeline --alias pipelines.other_pipeline` instead.
* If you package modular pipelines with `kedro pipeline package my_pipeline`, now use `kedro micropkg package pipelines.my_pipeline` instead.
* If you package or pull modular pipelines via `pyproject.toml`, you should modify the keys to include the full module path, wrapped in double quotes, e.g.:

```diff
[tool.kedro.micropkg.package]
-data_engineering = {destination = "path/to/here"}
-data_science = {alias = "ds", env = "local"}
+"pipelines.data_engineering" = {destination = "path/to/here"}
+"pipelines.data_science" = {alias = "ds", env = "local"}

[tool.kedro.micropkg.pull]
-"s3://my_bucket/my_pipeline" = {alias = "aliased_pipeline"}
+"s3://my_bucket/my_pipeline" = {alias = "pipelines.aliased_pipeline"}
```

* If you use `pandas.ExcelDataSet`, make sure you have `openpyxl` installed in your environment. This is automatically installed if you specify `kedro[pandas.ExcelDataSet]==0.18.0` in your `requirements.txt`. You can uninstall `xlrd` if you were only using it for this dataset.
* If you use `pandas.ParquetDataSet`, pass pandas saving arguments directly to `save_args` instead of nested in `from_pandas` (e.g. `save_args = {"preserve_index": False}` instead of `save_args = {"from_pandas": {"preserve_index": False}}`).
* If you use `spark.SparkHiveDataSet` with the `write_mode` option set to `insert`, change this to `append` in line with the Spark style guide. If you use `spark.SparkHiveDataSet` with the `write_mode` option set to `upsert`, make sure that your `SparkContext` has a valid `checkpointDir` set, either by the `SparkContext.setCheckpointDir` method or directly in the `conf` folder.
* If you use `pandas~=1.2.0` and pass `storage_options` through `load_args` or `save_args`, specify them under `fs_args` or via `credentials` instead.
* If you import from `kedro.io.lambda_data_set`, `kedro.io.memory_data_set`, or `kedro.io.partitioned_data_set`, change the import to `kedro.io.lambda_dataset`, `kedro.io.memory_dataset`, or `kedro.io.partitioned_dataset`, respectively (or import the dataset directly from `kedro.io`).
* If you have any `pandas.AppendableExcelDataSet` entries in your catalog, replace them with `pandas.ExcelDataSet`.
* If you have any `networkx.NetworkXDataSet` entries in your catalog, replace them with `networkx.JSONDataSet`.
* If you use `kedro pipeline package --version`, update to use `kedro micropkg package` instead. If you wish to set a specific pipeline package version, set the `__version__` variable in the pipeline package's `__init__.py` file.
* Use `kedro run --runner=ParallelRunner` rather than `--parallel` or `-p`.
* If you call `ConfigLoader` or `TemplatedConfigLoader` directly, update the keyword arguments `conf_root` to `conf_source` and `extra_params` to `runtime_params`.
* If you use `KedroContext` to access `ConfigLoader`, use `settings.CONFIG_LOADER_CLASS` to access the currently used `ConfigLoader` instead.

Published by idanov over 2 years ago
* `pipeline` now accepts `tags` and a collection of `Node`s and/or `Pipeline`s rather than just a single `Pipeline` object. `pipeline` should be used in preference to `Pipeline` when creating a Kedro pipeline.
* `pandas.SQLTableDataSet` and `pandas.SQLQueryDataSet` now only open one connection per database, at instantiation time (therefore at catalog creation time), rather than one per load/save operation.
* Added a new command group, `micropkg`, to replace `kedro pipeline pull` and `kedro pipeline package` with `kedro micropkg pull` and `kedro micropkg package` for Kedro 0.18.0. `kedro micropkg package` saves packages to `project/dist`, while `kedro pipeline package` saves packages to `project/src/dist`.
* Pinned `pandas<1.4` to maintain compatibility with `xlrd~=1.0`.
* Bumped the `Pillow` minimum version requirement to 9.0 (Python 3.7+ only) following CVE-2022-22817.
* Fixed `PickleDataSet` to be copyable, and hence work with the parallel runner.
* Upgraded `pip-tools`, which is used by `kedro build-reqs`, to 6.5 (Python 3.7+ only). This `pip-tools` version is compatible with `pip>=21.2`, including the most recent releases of `pip`. Python 3.6 users should continue to use `pip-tools` 6.4 and `pip<22`.
* Added `astro-iris` as an alias for `astro-airflow-iris`, so that old tutorials can still be followed.
* `kedro pipeline pull` and `kedro pipeline package` will be deprecated. Please use `kedro micropkg` instead.

Published by idanov almost 3 years ago
* Added a `pipelines` global variable to the IPython extension, allowing you to access the project's pipelines in `kedro ipython` or `kedro jupyter notebook`.
* Enabled overriding nested parameters with `params` in the CLI, i.e. `kedro run --params="model.model_tuning.booster:gbtree"` updates parameters to `{"model": {"model_tuning": {"booster": "gbtree"}}}`.
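The nested-parameter update can be illustrated with a small stdlib-only sketch; `parse_params` here is a hypothetical helper for illustration, not Kedro's actual parser:

```python
def parse_params(params: str) -> dict:
    """Turn 'a.b.c:value' pairs into a nested dictionary (illustrative sketch)."""
    result: dict = {}
    for pair in params.split(","):
        keys, value = pair.split(":", 1)
        target = result
        *parents, leaf = keys.split(".")
        for key in parents:
            # walk/create intermediate dictionaries for each dotted segment
            target = target.setdefault(key, {})
        target[leaf] = value
    return result

print(parse_params("model.model_tuning.booster:gbtree"))
# {'model': {'model_tuning': {'booster': 'gbtree'}}}
```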
* Added a `filepath` option to `pandas.SQLQueryDataSet`, to specify a file containing a SQL query, in addition to the current method of supplying the query itself in the `sql` argument.
* Extended `ExcelDataSet` to support saving Excel files with multiple sheets.

Type | Description | Location |
---|---|---|
`plotly.JSONDataSet` | Works with plotly graph object Figures (saves as json file) | `kedro.extras.datasets.plotly` |
`pandas.GenericDataSet` | Provides a 'best effort' facility to read/write any format provided by the pandas library | `kedro.extras.datasets.pandas` |
`pandas.GBQQueryDataSet` | Loads data from a Google BigQuery table using a provided SQL query | `kedro.extras.datasets.pandas` |
`spark.DeltaTableDataSet` | Dataset designed to handle Delta Lake Tables and their CRUD-style operations, including `update`, `merge` and `delete` | `kedro.extras.datasets.spark` |
* Fixed an issue where `kedro new --config config.yml` was ignoring the config file when `prompts.yml` didn't exist.
* … `kedro viz --autoreload`.
* Added support for backends that satisfy the `pickle` interface to `PickleDataSet`.
* Added support for `sum` syntax for connecting pipeline objects.
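The `sum` support relies on Python's addition protocol (`__add__`/`__radd__`). A stdlib-only illustration with a hypothetical `MiniPipeline` class — not Kedro's actual implementation:

```python
class MiniPipeline:
    """Toy stand-in for a pipeline: just holds a list of node names."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def __add__(self, other):
        # combining two pipelines concatenates their nodes
        return MiniPipeline(self.nodes + other.nodes)

    def __radd__(self, other):
        # sum() starts from 0, so `0 + pipeline` must also work
        if other == 0:
            return self
        return self.__add__(other)

combined = sum([MiniPipeline(["a"]), MiniPipeline(["b"]), MiniPipeline(["c"])])
print(combined.nodes)  # ['a', 'b', 'c']
```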
* Upgraded `pip-tools`, which is used by `kedro build-reqs`, to 6.4. This `pip-tools` version requires `pip>=21.2` while adding support for `pip>=21.3`. To upgrade `pip`, please refer to their documentation.
* Relaxed the bounds on the `plotly` requirement for `plotly.PlotlyDataSet` and the `pyarrow` requirement for `pandas.ParquetDataSet`.
* `kedro pipeline package <pipeline>` now raises an error if the `<pipeline>` argument doesn't look like a valid Python module path (e.g. has `/` instead of `.`).
* Added a new `overwrite` argument to `PartitionedDataSet` and `MatplotlibWriter` to enable deletion of existing partitions and plots on dataset `save`.
* `kedro pipeline pull` now works when the project requirements contain entries such as `-r`, `--extra-index-url` and local wheel files (Issue #913).
* … during `_FrozenDatasets` creations.
* Removed `.coveragerc` from the Kedro project template. `coverage` settings are now given in `pyproject.toml`.
* … `git`.
* Fixed a bug where `load_versions` that are not found in the data catalog would silently pass.
* `kedro.extras.decorators` and `kedro.pipeline.decorators` are being deprecated in favour of Hooks.
* `kedro.extras.transformers` and `kedro.io.transformers` are being deprecated in favour of Hooks.
* The `--parallel` flag on `kedro run` is being removed in favour of `--runner=ParallelRunner`. The `-p` flag will change to be an alias for `--pipeline`.
* `kedro.io.DataCatalogWithDefault` is being deprecated, to be removed entirely in 0.18.0.
is being deprecated, to be removed entirely in 0.18.0.Deepyaman Datta,
Brites,
Manish Swami,
Avaneesh Yembadi,
Zain Patel,
Simon Brugman,
Kiyo Kunii,
Benjamin Levy,
Louis de Charsonville,
Simon Picard
Published by idanov about 3 years ago
* Added a new CLI group, `registry`, with the associated commands `kedro registry list` and `kedro registry describe`, to replace `kedro pipeline list` and `kedro pipeline describe`.
* When a modular pipeline with a `requirements.txt` is packaged, its dependencies are embedded in the modular pipeline wheel file. Upon pulling the pipeline, Kedro will append the dependencies to the project's `requirements.in`. More information is available in our documentation.
* Added support for `kedro pipeline package/pull --all` and `pyproject.toml`.
* Removed `cli.py` from the Kedro project template. By default all CLI commands, including `kedro run`, are now defined on the Kedro framework side. These can be overridden in turn by a plugin or a `cli.py` file in your project. A packaged Kedro project will respect the same hierarchy when executed with `python -m my_package`.
* Removed `.ipython/profile_default/startup/` from the Kedro project template in favour of `.ipython/profile_default/ipython_config.py` and the `kedro.extras.extensions.ipython` extension.
* Added a `dill` backend to `PickleDataSet`.
* Imports are now refactored at `kedro pipeline package` and `kedro pipeline pull` time, so that aliasing a modular pipeline doesn't break it.

Type | Description | Location |
---|---|---|
`tracking.MetricsDataSet` | Dataset to track numeric metrics for experiment tracking | `kedro.extras.datasets.tracking` |
`tracking.JSONDataSet` | Dataset to track data for experiment tracking | `kedro.extras.datasets.tracking` |

* Bumped the minimum required `fsspec` version to 2021.04.
* Fixed the `kedro install` and `kedro build-reqs` flows when uninstalled dependencies are present in a project's `settings.py`, `context.py` or `hooks.py` (Issue #829).
* Imports are now refactored at `kedro pipeline package` and `kedro pipeline pull` time, so that aliasing a modular pipeline doesn't break it.
* Pinned `dynaconf` to `<3.1.6` because the method signature for `_validate_items`, which is used in Kedro, changed.
* `kedro pipeline list` and `kedro pipeline describe` are being deprecated in favour of the new commands `kedro registry list` and `kedro registry describe`.
* `kedro install` is being deprecated in favour of using `pip install -r src/requirements.txt` to install project dependencies.

Published by idanov over 3 years ago
Type | Description | Location |
---|---|---|
`plotly.PlotlyDataSet` | Works with plotly graph object Figures (saves as json file) | `kedro.extras.datasets.plotly` |

* `ConfigLoader.get()` now raises a `BadConfigException`, with a more helpful error message, if a configuration file cannot be loaded (for instance due to wrong syntax or poor formatting).
* `run_id` now defaults to `save_version` when `after_catalog_created` is called, similarly to what happens during a `kedro run`.
* Fixed a bug where `kedro ipython` and `kedro jupyter notebook` didn't work if the `PYTHONPATH` was already set.
* Added `env` and `extra_params` arguments to `reload_kedro`, similar to how the IPython script works.
* `kedro info` now outputs whether a plugin has any `hooks` or `cli_hooks` implemented.
* `PartitionedDataSet` now supports lazily materializing data on save.
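The lazy-save pattern can be sketched without Kedro: partition values may be plain data or zero-argument callables that are only invoked at save time. `save_partitions` and `write_partition` below are hypothetical stand-ins for illustration, not Kedro's API:

```python
saved = {}

def write_partition(partition_id, data):
    # stand-in for writing one partition to storage
    saved[partition_id] = data

def save_partitions(partitions):
    # values may be plain data or zero-argument callables;
    # callables are only invoked here, at save time (lazy materialisation)
    for pid, value in partitions.items():
        write_partition(pid, value() if callable(value) else value)

save_partitions({
    "part_a": [1, 2, 3],                          # eager: data held in memory
    "part_b": lambda: [x * 2 for x in range(3)],  # lazy: computed on save
})
print(saved)  # {'part_a': [1, 2, 3], 'part_b': [0, 2, 4]}
```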
* `kedro pipeline describe` now defaults to the `__default__` pipeline when no pipeline name is provided and also shows the namespace the nodes belong to.
* `EmailMessageDataSet` added to the doctree.
* `kedro pipeline package` now only packages the parameter file that exactly matches the pipeline name specified and the parameter files in a directory with the pipeline name.
* … `model input` tables in accordance with our Data Engineering convention.
* `kedro pipeline package` now takes the pipeline package version, rather than the kedro package version. If the pipeline package version is not present, then the package version is used.

Published by idanov over 3 years ago
* Added a `before_command_run` hook for plugins to add extra behaviour before Kedro CLI commands run.
* `pipelines` from `pipeline_registry.py` and `register_pipeline` hooks are now loaded lazily when they are first accessed, not on startup:

```python
from kedro.framework.project import pipelines

print(pipelines["__default__"])  # pipeline loading is only triggered here
```

* `TemplatedConfigLoader` now correctly inserts default values when no globals are supplied.
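For context, `TemplatedConfigLoader` substitutes `${...}` placeholders from a globals dictionary, with `${key|default}` falling back to a default when the key is missing. A hypothetical catalog entry as a sketch (`bucket_name` is an assumed globals key):

```yaml
# catalog.yml -- illustrative entry using TemplatedConfigLoader syntax
my_dataset:
  type: pandas.CSVDataSet
  # ${bucket_name} comes from globals; ${env|local} falls back to "local"
  filepath: s3://${bucket_name}/${env|local}/data.csv
```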
* Fixed a bug where the `KEDRO_ENV` environment variable had no effect on instantiating the `context` variable in an iPython session or a Jupyter notebook.
* … `bootstrap_project` method.
* `configure_project` is invoked if a `package_name` is supplied to `KedroSession.create`. This is added for backwards compatibility, to support a workflow that creates a `Session` manually. It will be removed in `0.18.0`.
* Raised `ModuleNotFoundError` if `register_pipelines` is not found, so that a more helpful error message will appear when a dependency is missing, e.g. Issue #722.
* When `kedro new` is invoked using a configuration yaml file, `output_dir` is no longer a required key; by default the current working directory will be used.
* When `kedro new` is invoked using a configuration yaml file, the appropriate `prompts.yml` file is now used for validating the provided configuration. Previously, validation was always performed against the kedro project template `prompts.yml` file.
* `kedro new` now generates user prompts to obtain configuration rather than supplying empty configuration.
* Fixed a bug where the `after_dataset_loaded` run would finish before a dataset is actually loaded when using the `--async` flag.
* `kedro.versioning.journal.Journal` will be removed.
* The following properties of `kedro.framework.context.KedroContext` will be removed:
  * `io`, in favour of `KedroContext.catalog`
  * `pipeline` (equivalent to `pipelines["__default__"]`)
  * `pipelines`, in favour of `kedro.framework.project.pipelines`

Published by idanov over 3 years ago
* Added a `compress_pickle` backend to `PickleDataSet`.
* Project pipelines can now be accessed without creating a `KedroContext` instance:

```python
from kedro.framework.project import pipelines

print(pipelines)
```

* Pipelines are now registered in `pipeline_registry.py` rather than `hooks.py`.
* … `kedro run`.
* If `settings.py` is not importable, the errors will be surfaced earlier in the process, rather than at runtime.
* `kedro pipeline list` and `kedro pipeline describe` no longer accept the redundant `--env` parameter.
* `from kedro.framework.cli.cli import cli` no longer includes the `new` and `starter` commands.
* `kedro.framework.context.KedroContext.run` will be removed in release 0.18.0.

Published by idanov over 3 years ago

* Added `env` and `extra_params` to the `reload_kedro()` line magic.
* Extended the `pipeline()` API to allow strings and sets of strings as `inputs` and `outputs`, to specify when a dataset name remains the same (not namespaced).
* Renamed `default_config.yml` to `prompts.yml`.
* Added `env` and `extra_params` arguments to the `register_config_loader` hook.
* Refactored the way `settings` are loaded. You will now be able to run:

```python
from kedro.framework.project import settings

print(settings.CONF_ROOT)
```

* … `SparkDataSet` in the interactive workflow.
* The Kedro CLI now checks `pyproject.toml` for a `tool.kedro` section before treating the project as a Kedro project.
* Fixed `DataCatalog::shallow_copy` so that it now copies layers.
* `kedro pipeline pull` now uses `pip download` for protocols that are not supported by `fsspec`.
* … the `jsonschema` schema definition for the Kedro 0.17 catalog.
* `kedro install` now waits on Windows until all the requirements are installed.
* Added the `--to-outputs` option in the CLI, throughout the codebase, and as part of hooks specifications.
* Fixed a bug where `ParquetDataSet` wasn't creating parent directories on the fly.
* This release has broken the `kedro ipython` and `kedro jupyter` workflows. To fix this, follow the instructions in the migration guide below.

Note: If you're using the `ipython` extension instead, you will not encounter this problem.

You will have to update the file `<your_project>/.ipython/profile_default/startup/00-kedro-init.py` in order to make `kedro ipython` and/or `kedro jupyter` work. Add the following line before the `KedroSession` is created:

```python
configure_project(metadata.package_name)  # to add

session = KedroSession.create(metadata.package_name, path)
```

Make sure that the associated import is provided in the same place as others in the file:

```python
from kedro.framework.project import configure_project  # to add
from kedro.framework.session import KedroSession
```

Mariana Silva, Kiyohito Kunii, noklam, Ivan Doroshenko, Zain Patel, Deepyaman Datta, Sam Hiscox, Pascal Brokmeier
Published by idanov almost 4 years ago
* Added `KedroSession`, which is responsible for managing the lifecycle of a Kedro run.
* Created a new Kedro starter: `kedro new --starter=mini-kedro`. It is possible to use the DataCatalog as a standalone component in a Jupyter notebook and transition into the rest of the Kedro framework.
* Added `DatasetSpecs` with Hooks to run before and after datasets are loaded from/saved to the catalog.
* Added a command for creating a catalog entry from the CLI: `kedro catalog create`. For a registered pipeline, it creates a `<conf_root>/<env>/catalog/<pipeline_name>.yml` configuration file with `MemoryDataSet` datasets for each dataset that is missing from `DataCatalog`.
* Added `settings.py` and `pyproject.toml` (to replace `.kedro.yml`) for project configuration, in line with Python best practice.
* `ProjectContext` is no longer needed, unless for very complex customisations. `KedroContext`, `ProjectHooks` and `settings.py` together implement sensible default behaviour. As a result, `context_path` is also now an optional key in `pyproject.toml`.
* Removed `ProjectContext` from `src/<package_name>/run.py`.
* `TemplatedConfigLoader` now supports Jinja2 template syntax alongside its original syntax.
* Made registration Hooks mandatory, as the only way to customise the `ConfigLoader` or the `DataCatalog` used in a project. If no such Hook is provided in `src/<package_name>/hooks.py`, a `KedroContextError` is raised. There are sensible defaults defined in any project generated with Kedro >= 0.16.5.
* `ParallelRunner` no longer results in a run failure, when triggered from a notebook, if the run is started using `KedroSession` (`session.run()`).
* `before_node_run` can now overwrite node inputs by returning a dictionary with the corresponding updates.
* Moved the `isort` and `pytest` configuration from `<project_root>/setup.cfg` to `<project_root>/pyproject.toml`.
* … `KedroSession` to `KedroContext`.
* Relaxed the `pyspark` requirements to allow for installation of `pyspark` 3.0.
* Added a `--fs-args` option to the `kedro pipeline pull` command to specify configuration options for the `fsspec` filesystem arguments used when pulling modular pipelines from non-PyPI locations.
* Bumped the `fsspec` version to 0.9.
* Bumped the `s3fs` version to 0.5 (the `S3FileSystem` interface has changed since version 0.4.1).
* Removed the deprecated `kedro.cli` and `kedro.context` modules in favour of `kedro.framework.cli` and `kedro.framework.context` respectively.
* `kedro.io.DataCatalog.exists()` returns `False` when the dataset does not exist, as opposed to raising an exception.
* The `catalog.yml` file is no longer automatically created for modular pipelines when running `kedro pipeline create`. Use `kedro catalog create` to replace this functionality.
* Removed the `include_examples` prompt from `kedro new`. To generate boilerplate example code, you should use a Kedro starter.
* Changed the `--verbose` flag from a global command to a project-specific command flag (e.g. `kedro --verbose new` becomes `kedro new --verbose`).
* Dropped support for the `dataset_credentials` key in credentials in `PartitionedDataSet`.
* `get_source_dir()` was removed from `kedro/framework/cli/utils.py`.
* Removed the `get_config`, `create_catalog`, `create_pipeline`, `template_version`, `project_name` and `project_path` keys returned by the `get_project_context()` function (`kedro/framework/cli/cli.py`).
* `kedro new --starter` now defaults to fetching the starter template matching the installed Kedro version.
* Renamed `kedro_cli.py` to `cli.py` and moved it inside the Python package (`src/<package_name>/`), for a better packaging and deployment experience.
* Removed `.kedro.yml` from the project template and replaced it with `pyproject.toml`.
* Removed the `KEDRO_CONFIGS` constant (previously residing in `kedro.framework.context.context`).
* Modified the `kedro pipeline create` CLI command to add a boilerplate parameter config file in `conf/<env>/parameters/<pipeline_name>.yml` instead of `conf/<env>/pipelines/<pipeline_name>/parameters.yml`. The CLI commands `kedro pipeline delete`/`package`/`pull` were updated accordingly.
* Removed `get_static_project_data` from `kedro.framework.context`.
* Removed `KedroContext.static_data`.
* The `KedroContext` constructor now takes `package_name` as its first argument.
* Replaced the `context` property on `KedroSession` with a `load_context()` method.
* Renamed `_push_session` and `_pop_session` in `kedro.framework.session.session` to `_activate_session` and `_deactivate_session` respectively.
* A custom context class is now set via the `CONTEXT_CLASS` variable in `src/<your_project>/settings.py`.
* Removed the `KedroContext.hooks` attribute. Instead, hooks should be registered in `src/<your_project>/settings.py` under the `HOOKS` key.
* Restricted node names to match the regex pattern `[\w\.-]+$`.
* Removed `KedroContext._create_config_loader()` and `KedroContext._create_data_catalog()`. They have been replaced by registration hooks, namely `register_config_loader()` and `register_catalog()` (see also upcoming deprecations).
* `kedro.framework.context.load_context` will be removed in release 0.18.0.
* `kedro.framework.cli.get_project_context` will be removed in release 0.18.0.
* Added a `DeprecationWarning` to the decorator API for both `node` and `pipeline`. These will be removed in release 0.18.0. Use Hooks to extend a node's behaviour instead.
* Added a `DeprecationWarning` to the Transformers API when adding a transformer to the catalog. These will be removed in release 0.18.0. Use Hooks to customise the `load` and `save` methods.

Deepyaman Datta, Zach Schuster
Reminder: Our documentation on how to upgrade Kedro covers a few key things to remember when updating any Kedro version.
The Kedro 0.17.0 release contains some breaking changes. If you update Kedro to 0.17.0 and then try to work with projects created against earlier versions of Kedro, you may encounter some issues when trying to run kedro
commands in the terminal for that project. Here's a short guide to getting your projects running against the new version of Kedro.
Note: As always, if you hit any problems, please check out our documentation:
To get an existing Kedro project to work after you upgrade to Kedro 0.17.0, we recommend that you create a new project against Kedro 0.17.0 and move the code from your existing project into it. Let's go through the changes, but first, note that if you create a new Kedro project with Kedro 0.17.0 you will not be asked whether you want to include the boilerplate code for the Iris dataset example. We've removed this option (you should now use a Kedro starter if you want to create a project that is pre-populated with code).
To create a new, blank Kedro 0.17.0 project to drop your existing code into, you can create one, as always, with kedro new
. We also recommend creating a new virtual environment for your new project, or you might run into conflicts with existing dependencies.
Update `pyproject.toml`: Copy the following three keys from the `.kedro.yml` of your existing Kedro project into the `pyproject.toml` file of your new Kedro 0.17.0 project:

```toml
[tool.kedro]
package_name = "<package_name>"
project_name = "<project_name>"
project_version = "0.17.0"
```
Check your source directory. If you defined a different source directory (`source_dir`), make sure you also move that to `pyproject.toml`.

Copy files from your existing project:

* `project/src/project_name/pipelines` from existing to new project
* `project/src/test/pipelines` from existing to new project
* `requirements.txt` and/or `requirements.in`
* the `conf` folder. Take note of the new locations needed for modular pipeline configuration (move it from `conf/<env>/pipeline_name/catalog.yml` to `conf/<env>/catalog/pipeline_name.yml` and likewise for `parameters.yml`)
* the `data/` folder of your existing project, if needed, into the same location in your new project
* `src/<package_name>/hooks.py`

Update your new project's README and docs as necessary.
Update `settings.py`: For example, if you specified additional Hook implementations in `hooks`, or listed plugins under `disable_hooks_by_plugin` in your `.kedro.yml`, you will need to move them to `settings.py` accordingly:

```python
from <package_name>.hooks import MyCustomHooks, ProjectHooks

HOOKS = (ProjectHooks(), MyCustomHooks())

DISABLE_HOOKS_FOR_PLUGINS = ("my_plugin1",)
```
Migration for `node` names: from 0.17.0 the only allowed characters for node names are letters, digits, hyphens, underscores and/or fullstops. If you have previously defined node names that have special characters, spaces or other characters that are no longer permitted, you will need to rename those nodes.
Copy changes to kedro_cli.py. If you previously customised the kedro run command or added more CLI commands to your kedro_cli.py, you should move them into <project_root>/src/<package_name>/cli.py. Note, however, that the new way to run a Kedro pipeline is via a KedroSession, rather than using the KedroContext:

```python
from kedro.framework.session import KedroSession

with KedroSession.create(package_name=...) as session:
    session.run()
```
Copy changes made to ConfigLoader. If you have defined a custom class, such as TemplatedConfigLoader, by overriding ProjectContext._create_config_loader, you should move the contents of the function in src/<package_name>/hooks.py, under register_config_loader.
Copy changes made to DataCatalog. Likewise, if you have a DataCatalog defined with ProjectContext._create_catalog, you should copy-paste the contents into register_catalog.
Optional: If you have plugins such as Kedro-Viz installed, it's likely that Kedro 0.17.0 won't work with their older versions, so please either upgrade to the plugin's newest version or follow their migration guides.
Published by idanov almost 4 years ago
- Updated the starter alias for generating a project: kedro new --starter spaceflights.
- Fixed TypeError when converting dict inputs to a node made from a wrapped partial function.
- PartitionedDataSet improvements.
- Improved handling of non-ASCII word characters in dataset names: a dataset named jalapeño will be accessible as DataCatalog.datasets.jalapeño rather than DataCatalog.datasets.jalape__o.
- Fixed kedro install for an Anaconda environment defined in environment.yml.
- No longer need to update .kedro.yml to use kedro lint and kedro jupyter notebook convert.
- Fixed an issue with saving a TensorFlowModelDataset in the HDF5 format with versioning enabled.
- Added the missing run_result argument in the after_pipeline_run Hooks spec.
- Fixed an IPython bug; to apply the fix in an existing project, update your 00-kedro-init.py file.

Deepyaman Datta, Bhavya Merchant, Lovkush Agarwal, Varun Krishna S, Sebastian Bertoli, noklam, Daniel Petti, Waylon Walker
Published by idanov about 4 years ago
Type | Description | Location |
---|---|---|
email.EmailMessageDataSet | Manage email messages using the Python standard library | kedro.extras.datasets.email |
- Added support for pyproject.toml to configure Kedro. pyproject.toml is used if .kedro.yml doesn't exist (Kedro configuration should be under the [tool.kedro] section).
- The project template no longer uses pipeline.py, having been replaced by hooks.py, which introduces three new registration hooks:
  - register_pipelines(), to replace _get_pipelines()
  - register_config_loader(), to replace _create_config_loader()
  - register_catalog(), to replace _create_catalog()
- Hook implementations can be defined in src/<package-name>/hooks.py and added to .kedro.yml (or pyproject.toml). The order of execution is: plugin hooks, .kedro.yml hooks, hooks in ProjectContext.hooks.
- Added the ability to disable auto-registered Hooks using the .kedro.yml (or pyproject.toml) configuration file.
- Moved .isort.cfg settings into setup.cfg.
- project_name, project_version and package_name now have to be defined in .kedro.yml for projects generated using Kedro 0.16.5+.

Published by idanov about 4 years ago
- Fixed an issue affecting ParallelRunner on Windows.
- Enabled GBQTableDataSet to load customised results using customised queries from Google Big Query tables.

Ajay Bisht, Vijay Sajjanar, Deepyaman Datta, Sebastian Bertoli, Shahil Mawjee, Louis Guitton, Emanuel Ferm
Published by idanov over 4 years ago
Release 0.16.3
Published by idanov over 4 years ago
Type | Description | Location |
---|---|---|
pandas.AppendableExcelDataSet | Works with Excel file opened in append mode | kedro.extras.datasets.pandas |
tensorflow.TensorFlowModelDataset | Works with TensorFlow models using TensorFlow 2.X | kedro.extras.datasets.tensorflow |
holoviews.HoloviewsWriter | Works with Holoviews objects (saves as image file) | kedro.extras.datasets.holoviews |
- kedro install will now compile project dependencies (by running kedro build-reqs behind the scenes) before the installation if the src/requirements.in file doesn't exist.
- Added only_nodes_with_namespace in the Pipeline class to filter only nodes with a specified namespace.
- Added the kedro pipeline delete command to help delete unwanted or unused pipelines (it won't remove references to the pipeline in your create_pipelines() code).
- Added the kedro pipeline package command to help package up a modular pipeline. It will bundle up the pipeline source code, tests, and parameters configuration into a .whl file.
- DataCatalog improvements:
  - Introduced regex filtering to the DataCatalog.list() method.
  - Non-word characters in dataset names are replaced with __ in DataCatalog.datasets, for ease of access to transcoded datasets.
- Fixes for spark.SparkHiveDataSet and spark.SparkDataSet.
- Fixed handling of a pyarrow table in pandas.ParquetDataSet.
- kedro build-reqs CLI command:
  - kedro build-reqs is now called with the -q option and will no longer print out compiled requirements to the console for security reasons.
  - Positional and optional arguments of the kedro build-reqs command are now passed to the pip-compile call (e.g. kedro build-reqs --generate-hashes).
- kedro jupyter CLI command:
  - Improved the error message when running kedro jupyter notebook, kedro jupyter lab or kedro ipython with Jupyter/IPython dependencies not being installed.
  - Fixed the %run_viz line magic for showing kedro viz inside a Jupyter notebook. For the fix to be applied on existing Kedro project, please see the migration guide.
- Added a pillow.ImageDataSet entry to the documentation.

%run_viz line magic in existing project

Even though this release ships a fix for project generated with kedro==0.16.2
, after upgrading, you will still need to make a change in your existing project if it was generated with kedro>=0.16.0,<=0.16.1 for the fix to take effect. Specifically, please change the content of your project's IPython init script located at .ipython/profile_default/startup/00-kedro-init.py with the content of this file. You will also need kedro-viz>=3.3.1.
Miguel Rodriguez Gutierrez, Joel Schwarzmann, w0rdsm1th, Deepyaman Datta, Tam-Sanh Nguyen, Marcus Gawronsky
Published by idanov over 4 years ago
- Fixed deprecation warnings from kedro.cli and kedro.context when running kedro jupyter notebook.
- Fixed a bug where catalog and context were not available in Jupyter Lab and Notebook.
- Fixed a bug where kedro build-reqs would fail if you didn't have your project dependencies installed.

Published by idanov over 4 years ago
- Added new CLI commands:
  - kedro catalog list to list datasets in your catalog
  - kedro pipeline list to list pipelines
  - kedro pipeline describe to describe a specific pipeline
  - kedro pipeline create to create a modular pipeline
- The kedro CLI can now be run from any subdirectory of a project, git-style.
- kedro.cli and kedro.context have been moved into kedro.framework.cli and kedro.framework.context respectively. kedro.cli and kedro.context will be removed in future releases.
- Added Hooks, which is a new mechanism for extending Kedro.
- Fixed load_context changing the user's current working directory.
- Allowed the source directory to be configurable in .kedro.yml.
- Added the ability to specify nested parameter values inside your node inputs, e.g. node(func, "params:a.b", None)
Type | Description | Location |
---|---|---|
pillow.ImageDataSet | Work with image files using Pillow | kedro.extras.datasets.pillow |
geopandas.GeoJSONDataSet | Work with geospatial data using GeoPandas | kedro.extras.datasets.geopandas.GeoJSONDataSet |
api.APIDataSet | Work with data from HTTP(S) API requests | kedro.extras.datasets.api.APIDataSet |
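The nested parameter syntax node(func, "params:a.b", None) amounts to a dotted-path lookup into the parameters dictionary; a minimal stdlib sketch of that idea (our own illustration, not Kedro's resolver):

```python
from functools import reduce

def resolve_nested_param(params: dict, dotted_key: str):
    """Walk a dotted path such as 'a.b' through a nested parameters dict."""
    return reduce(lambda d, key: d[key], dotted_key.split("."), params)

parameters = {"a": {"b": 42}, "learning_rate": 0.1}
print(resolve_nested_param(parameters, "a.b"))  # → 42
```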
- Added joblib backend support to pickle.PickleDataSet.
- Added versioning support to the MatplotlibWriter dataset.
- Added the ability to install dependencies for a given dataset with more granular control, e.g. pip install "kedro[pandas.ParquetDataSet]".
- Added the ability to specify extra load/save arguments, e.g. encoding or compression, for fsspec.spec.AbstractFileSystem.open() calls when loading/saving a dataset. See Example 3 under docs.
- Added a namespace property on Node, related to the modular pipeline where the node belongs.
- Added an option to enable asynchronous loading of inputs and saving of outputs in both the SequentialRunner(is_async=True) and ParallelRunner(is_async=True) class.
- Added a MemoryProfiler transformer.
- Added support for pandas>=1.0.
- Enabled Python 3.8 compatibility. Please note that a Spark workflow may be unreliable for this Python version as pyspark is not fully-compatible with 3.8 yet.
- CONTRIBUTING.md - added Developer Workflow.
- Added the _exists method to the MyOwnDataSet example in 04_user_guide/08_advanced_io.

Bug fixes and other changes:
- Fixed a bug where PartitionedDataSet and IncrementalDataSet were not working with the s3a or s3n protocol.
- Fixes for pandas.ParquetDataSet.
- Replaced functools.lru_cache with cachetools.cachedmethod in PartitionedDataSet and IncrementalDataSet for per-instance cache invalidation.
- Implemented a custom glob function for SparkDataSet when running on Databricks.
- Fixed a bug in SparkDataSet not allowing for loading data from DBFS in a Windows machine using Databricks-connect.
- Improved the error message for DataSetNotFoundError to suggest possible dataset names the user meant to type.
- Added the option to run tests locally without a Spark installation with make test-no-spark.
- Added an option to lint the project without applying the formatting changes (kedro lint --check-only).

Breaking changes to the API:
- Deleted obsolete datasets from kedro.io.
- Deleted the kedro.contrib and extras folders.
- Deleted the CSVBlobDataSet and JSONBlobDataSet dataset types.
- Made the invalidate_cache method on datasets private.
- The get_last_load_version and get_last_save_version methods are no longer available on AbstractDataSet.
- get_last_load_version and get_last_save_version have been renamed to resolve_load_version and resolve_save_version on AbstractVersionedDataSet, the results of which are cached.
- The release() method on datasets extending AbstractVersionedDataSet clears the cached load and save version. All custom datasets must call super()._release() inside _release().
- TextDataSet no longer has load_args and save_args. These can instead be specified under open_args_load or open_args_save in fs_args.
- In PartitionedDataSet and IncrementalDataSet, the method invalidate_cache was made private: _invalidate_caches.
- Removed KEDRO_ENV_VAR from kedro.context to speed up the CLI run time.
- Pipeline.name has been removed in favour of Pipeline.tag().
- Deprecated Pipeline.transform() in favour of the kedro.pipeline.modular_pipeline.pipeline() helper function.
- Made the constant PARAMETER_KEYWORDS private, and moved it from kedro.pipeline.pipeline to kedro.pipeline.modular_pipeline.
Migration for DataCatalog

Since all the datasets (from kedro.io and kedro.contrib.io) were moved to kedro/extras/datasets you must update the type of all datasets in the <project>/conf/base/catalog.yml file.

Here is how it should be changed: type: <SomeDataSet> -> type: <subfolder of kedro/extras/datasets>.<SomeDataSet> (e.g. type: CSVDataSet -> type: pandas.CSVDataSet).

In addition, all the specific datasets like CSVLocalDataSet, CSVS3DataSet etc. were deprecated. Instead, you must use generalized datasets like CSVDataSet. E.g. type: CSVS3DataSet -> type: pandas.CSVDataSet.

Note: No changes required if you are using your custom dataset.
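Because the renames are mechanical, they can be applied to a parsed catalog with a simple lookup table; a sketch (the helper is hypothetical and the mapping covers only the examples given here, not the full list of moved datasets):

```python
# Hypothetical helper: rewrite deprecated dataset `type` values
# to their generalized kedro.extras.datasets equivalents.
TYPE_RENAMES = {
    "CSVDataSet": "pandas.CSVDataSet",
    "CSVLocalDataSet": "pandas.CSVDataSet",
    "CSVS3DataSet": "pandas.CSVDataSet",
}

def migrate_catalog(catalog: dict) -> dict:
    """Return a copy of a parsed catalog.yml with dataset types updated."""
    return {
        name: {**entry, "type": TYPE_RENAMES.get(entry["type"], entry["type"])}
        for name, entry in catalog.items()
    }

old = {"cars": {"type": "CSVS3DataSet", "filepath": "s3://bucket/cars.csv"}}
print(migrate_catalog(old)["cars"]["type"])  # → pandas.CSVDataSet
```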
Migration for Pipeline.transform()

Pipeline.transform() has been dropped in favour of the pipeline() constructor. The following changes apply:
- It is necessary to import it: from kedro.pipeline import pipeline
- The prefix argument has been renamed to namespace
- datasets has been broken down into more granular arguments:
  - inputs: Independent inputs to the pipeline
  - outputs: Any output created in the pipeline, whether an intermediary dataset or a leaf output
  - parameters: params:... or parameters

As an example, code that used to look like this with the Pipeline.transform() constructor:
```python
result = my_pipeline.transform(
    datasets={"input": "new_input", "output": "new_output", "params:x": "params:y"},
    prefix="pre",
)
```
When used with the new pipeline() constructor, becomes:
```python
from kedro.pipeline import pipeline

result = pipeline(
    my_pipeline,
    inputs={"input": "new_input"},
    outputs={"output": "new_output"},
    parameters={"params:x": "params:y"},
    namespace="pre",
)
```
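Conceptually, namespace="pre" prefixes every dataset name that is not pinned via inputs, outputs or parameters; the renaming rule can be sketched with plain Python (our own illustration of the behaviour, not Kedro's implementation):

```python
def apply_namespace(dataset_names, namespace, keep):
    """Prefix every dataset name not explicitly kept with '<namespace>.'."""
    return {
        name: name if name in keep else f"{namespace}.{name}"
        for name in dataset_names
    }

names = {"input", "intermediary", "output"}
mapped = apply_namespace(names, "pre", keep={"input", "output"})
print(mapped["intermediary"])  # → pre.intermediary
print(mapped["input"])         # pinned names stay unchanged
```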
Since some modules were moved to other locations you need to update import paths appropriately. You can find the list of moved files in the 0.15.6 release notes under the section titled Files with a new location.
Note: If you haven't made significant changes to your kedro_cli.py, it may be easier to simply copy the updated kedro_cli.py and .ipython/profile_default/startup/00-kedro-init.py from GitHub or a newly generated project into your old project.
Migration for KEDRO_ENV_VAR

We have removed KEDRO_ENV_VAR from kedro.context. To get your existing project template working, you'll need to remove all instances of KEDRO_ENV_VAR from your project template:
- Remove it from the imports in kedro_cli.py and .ipython/profile_default/startup/00-kedro-init.py: from kedro.context import KEDRO_ENV_VAR, load_context -> from kedro.framework.context import load_context
- Remove the envvar=KEDRO_ENV_VAR line from the click options in run, jupyter_notebook and jupyter_lab in kedro_cli.py
- Replace KEDRO_ENV_VAR with "KEDRO_ENV" in _build_jupyter_env
- Replace context = load_context(path, env=os.getenv(KEDRO_ENV_VAR)) with context = load_context(path) in .ipython/profile_default/startup/00-kedro-init.py
Migration for kedro build-reqs

We have upgraded pip-tools, which is used by kedro build-reqs, to 5.x. This pip-tools version requires pip>=20.0. To upgrade pip, please refer to their documentation.
@foolsgold, Mani Sarkar, Priyanka Shanbhag, Luis Blanche, Deepyaman Datta, Antony Milne, Panos Psimatikas, Tam-Sanh Nguyen, Tomasz Kaczmarczyk, Kody Fischer, Waylon Walker
Published by idanov over 4 years ago
Published by idanov over 4 years ago
- Updated requirements.txt so the pandas.CSVDataSet class works out of box with pip install kedro.
- Added pandas to our extra_requires in setup.py.
- Improved the error message when dependencies of a DataSet class are missing.

Published by idanov over 4 years ago
Published by idanov over 4 years ago
TL;DR We're launching kedro.extras, the new home for our revamped series of datasets, decorators and dataset transformers. The datasets in kedro.extras.datasets use fsspec to access a variety of data stores including local file systems, network file systems, cloud object stores (including S3 and GCP), and Hadoop, read more about this here. The change will allow #178 to happen in the next major release of Kedro.

An example of this new system can be seen below, loading the CSV SparkDataSet from S3:
```yaml
weather:
  type: spark.SparkDataSet  # Observe the specified type, this affects all datasets
  filepath: s3a://your_bucket/data/01_raw/weather*  # filepath uses fsspec to indicate the file storage system
  credentials: dev_s3
  file_format: csv
```
You can also load data incrementally whenever it is dumped into a directory with the extension to PartitionedDataSet, a feature that allows you to load a directory of files. The IncrementalDataSet stores the information about the last processed partition in a checkpoint, read more about this feature here.
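The checkpoint idea is simple: remember an identifier for the last processed partition and only return partitions that sort after it; a stdlib sketch of that logic (our own illustration, not the IncrementalDataSet implementation):

```python
def incremental_load(partitions, checkpoint=None):
    """Return sorted partition ids newer than the checkpoint,
    plus the checkpoint to store for the next run."""
    new = sorted(p for p in partitions if checkpoint is None or p > checkpoint)
    next_checkpoint = new[-1] if new else checkpoint
    return new, next_checkpoint

files = ["2020-01-01.csv", "2020-01-02.csv", "2020-01-03.csv"]
new, ckpt = incremental_load(files, checkpoint="2020-01-01.csv")
print(new)   # → ['2020-01-02.csv', '2020-01-03.csv']
print(ckpt)  # → 2020-01-03.csv
```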
- Added a layer attribute for datasets in kedro.extras.datasets to specify the name of a layer according to data engineering convention, this feature will be passed to kedro-viz in future releases.
- Enabled loading a particular version of a dataset in Jupyter Notebooks and iPython, using catalog.load("dataset_name", version="<2019-12-13T15.08.09.255Z>").
- Added a run_id property on ProjectContext, used for versioning using the Journal. To customise your journal run_id you can override the private method _get_run_id().
- Added the ability to install all optional kedro dependencies via pip install "kedro[all]".
- Modified the DataCatalog's load order for datasets, loading order is the following:
  - kedro.io
  - kedro.extras.datasets
  - Import path, specified in type
- Added an optional copy_mode flag to CachedDataSet and MemoryDataSet to specify the copy mode (deepcopy, copy or assign) to use when loading and saving.

Type | Description | Location |
---|---|---|
ParquetDataSet | Handles parquet datasets using Dask | kedro.extras.datasets.dask |
PickleDataSet | Work with Pickle files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pickle |
CSVDataSet | Work with CSV files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
TextDataSet | Work with text files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
ExcelDataSet | Work with Excel files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
HDFDataSet | Work with HDF using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
YAMLDataSet | Work with YAML files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.yaml |
MatplotlibWriter | Save with Matplotlib images using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.matplotlib |
NetworkXDataSet | Work with NetworkX files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.networkx |
BioSequenceDataSet | Work with bio-sequence objects using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.biosequence |
GBQTableDataSet | Work with Google BigQuery | kedro.extras.datasets.pandas |
FeatherDataSet | Work with feather files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
IncrementalDataSet | Inherit from PartitionedDataSet and remembers the last processed partition | kedro.io |
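The copy_mode flag added to CachedDataSet and MemoryDataSet above maps onto Python's standard copying semantics; a minimal stdlib sketch of the three modes (our own illustration, not the MemoryDataSet implementation):

```python
import copy

def copy_with_mode(data, copy_mode="deepcopy"):
    """Mimic the three modes: 'deepcopy' returns a fully independent copy,
    'copy' a shallow copy, and 'assign' shares the original object."""
    if copy_mode == "deepcopy":
        return copy.deepcopy(data)
    if copy_mode == "copy":
        return copy.copy(data)
    if copy_mode == "assign":
        return data
    raise ValueError(f"Invalid copy mode: {copy_mode}")

nested = {"rows": [1, 2, 3]}
print(copy_with_mode(nested, "assign") is nested)    # → True (same object)
print(copy_with_mode(nested, "deepcopy") is nested)  # → False (independent copy)
```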
Files with a new location

Type | New Location |
---|---|
JSONDataSet | kedro.extras.datasets.pandas |
CSVBlobDataSet | kedro.extras.datasets.pandas |
JSONBlobDataSet | kedro.extras.datasets.pandas |
SQLTableDataSet | kedro.extras.datasets.pandas |
SQLQueryDataSet | kedro.extras.datasets.pandas |
SparkDataSet | kedro.extras.datasets.spark |
SparkHiveDataSet | kedro.extras.datasets.spark |
SparkJDBCDataSet | kedro.extras.datasets.spark |
kedro/contrib/decorators/retry.py | kedro/extras/decorators/retry_node.py |
kedro/contrib/decorators/memory_profiler.py | kedro/extras/decorators/memory_profiler.py |
kedro/contrib/io/transformers/transformers.py | kedro/extras/transformers/time_profiler.py |
kedro/contrib/colors/logging/color_logger.py | kedro/extras/logging/color_logger.py |
extras/ipython_loader.py | tools/ipython/ipython_loader.py |
kedro/contrib/io/cached/cached_dataset.py | kedro/io/cached_dataset.py |
kedro/contrib/io/catalog_with_default/data_catalog_with_default.py | kedro/io/data_catalog_with_default.py |
kedro/contrib/config/templated_config.py | kedro/config/templated_config.py |
Category | Type |
---|---|
Datasets | BioSequenceLocalDataSet |
 | CSVGCSDataSet |
 | CSVHTTPDataSet |
 | CSVLocalDataSet |
 | CSVS3DataSet |
 | ExcelLocalDataSet |
 | FeatherLocalDataSet |
 | JSONGCSDataSet |
 | JSONLocalDataSet |
 | HDFLocalDataSet |
 | HDFS3DataSet |
 | kedro.contrib.io.cached.CachedDataSet |
 | kedro.contrib.io.catalog_with_default.DataCatalogWithDefault |
 | MatplotlibLocalWriter |
 | MatplotlibS3Writer |
 | NetworkXLocalDataSet |
 | ParquetGCSDataSet |
 | ParquetLocalDataSet |
 | ParquetS3DataSet |
 | PickleLocalDataSet |
 | PickleS3DataSet |
 | TextLocalDataSet |
 | YAMLLocalDataSet |
Decorators | kedro.contrib.decorators.memory_profiler |
 | kedro.contrib.decorators.retry |
 | kedro.contrib.decorators.pyspark.spark_to_pandas |
 | kedro.contrib.decorators.pyspark.pandas_to_spark |
Transformers | kedro.contrib.io.transformers.transformers |
Configuration Loaders | kedro.contrib.config.TemplatedConfigLoader |
- Added the option to set/overwrite params in config.yaml using YAML dict style instead of string CLI formatting only.
- The --node and --tag options support comma-separated values, alternative methods will be deprecated in future releases.
- Fixed a bug in the invalidate_cache method of ParquetGCSDataSet and CSVGCSDataSet.
- --load-version now won't break if the version value contains a colon.
- Fixed an issue with nodes that have duplicate inputs.
- Fixes for SparkJDBCDataSet.
- Fixes for template/.../pipeline.py.
- Fixes for HDFS3DataSet.
- Fixed from_nodes and to_nodes in pipelines using transcoding.
- Switched to pandas.DataFrame.to_numpy (recommended alternative to pandas.DataFrame.values).
- Pipeline.transform skips modifying node inputs/outputs containing params: or parameters keywords.
- The dataset_credentials key in the credentials for PartitionedDataSet is now deprecated. The dataset credentials should be specified explicitly inside the dataset config.
- Datasets can define a confirm function which is called after a successful node function execution if the node contains a confirms argument with such dataset name.
- Use --from-nodes instead of --from-inputs to avoid unnecessarily re-running nodes that had already executed.
- Use the --idle-timeout option to update the idle timeout.
- Added kedro-viz to the Kedro project template requirements.txt file.
- Removed the results and references folder from the project template.
- Updated CONTRIBUTING.md.
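The --load-version fix matters because the dataset and version travel in a single dataset:version token, so naive splitting on every colon would mangle version strings that themselves contain colons; splitting on the first colon only, as sketched below (our own illustration of the parsing idea, not Kedro's CLI code), keeps the version intact:

```python
def parse_load_version(token: str):
    """Split 'dataset:version' on the first colon only, so that version
    values containing further colons survive intact."""
    dataset, _, version = token.partition(":")
    return dataset, version

print(parse_load_version("weather:2019-12-13T15:08:09.255Z"))
# → ('weather', '2019-12-13T15:08:09.255Z')
```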
- The MatplotlibWriter dataset in contrib was renamed to MatplotlibLocalWriter.
- kedro/contrib/io/matplotlib/matplotlib_writer.py was renamed to kedro/contrib/io/matplotlib/matplotlib_local_writer.py.
- kedro.contrib.io.bioinformatics.sequence_dataset.py was renamed to kedro.contrib.io.bioinformatics.biosequence_local_dataset.py.

Andrii Ivaniuk, Jonas Kemper, Yuhao Zhu, Balazs Konig, Pedro Abreu, Tam-Sanh Nguyen, Peter Zhao, Deepyaman Datta, Florian Roessler, Miguel Rodriguez Gutierrez