Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
APACHE-2.0 License
Published by idanov almost 5 years ago
Major features and improvements:
* New CLI commands and command flags:
  - Load multiple `kedro run` CLI flags from a configuration file with the `--config` flag (e.g. `kedro run --config run_config.yml`).
  - Run parametrised pipelines with the `--params` flag (e.g. `kedro run --params param1:value1,param2:value2`).
  - Lint your project code using the `kedro lint` command; your project is linted with `black` (Python 3.6+), `flake8` and `isort`.
* Load specific environments with Jupyter notebooks using `KEDRO_ENV`, which will globally set `run`, `jupyter notebook` and `jupyter lab` commands using environment variables.
* Added the following datasets:
  - `CSVGCSDataSet` dataset in `contrib` for working with CSV files in Google Cloud Storage.
  - `ParquetGCSDataSet` dataset in `contrib` for working with Parquet files in Google Cloud Storage.
  - `JSONGCSDataSet` dataset in `contrib` for working with JSON files in Google Cloud Storage.
  - `MatplotlibS3Writer` dataset in `contrib` for saving Matplotlib images to S3.
  - `PartitionedDataSet` for working with datasets split across multiple files.
  - `JSONDataSet` dataset for working with JSON files that uses `fsspec` to communicate with the underlying filesystem. It doesn't support `http(s)` protocol for now.
* Added `s3fs_args` to all S3 datasets.
* Pipelines can be subtracted with `pipeline1 - pipeline2`.

Bug fixes and other changes:
* `ParallelRunner` now works with `SparkDataSet`.
* Allowed the use of nulls in `parameters.yml`.
* Fixed an issue where `%reload_kedro` wasn't reloading all user modules.
* Fixed the `pandas_to_spark` and `spark_to_pandas` decorators to work with functions with kwargs.
* Fixed a bug where `kedro jupyter notebook` and `kedro jupyter lab` would run a different Jupyter installation to the one in the local environment.
* Fixed a bug in `SparkDataSet`.
* Fixed a bug where `kedro package` would fail in certain situations where `kedro build-reqs` was used to generate `requirements.txt`.
* Made the `bucket_name` argument optional for the following datasets: `CSVS3DataSet`, `HDFS3DataSet`, `PickleS3DataSet`, `contrib.io.parquet.ParquetS3DataSet`, `contrib.io.gcs.JSONGCSDataSet` - the bucket name can now be included in the filepath along with the filesystem protocol (e.g. `s3://bucket-name/path/to/key.csv`).

Breaking changes to the API:
* Pip-installed projects are now run via `run_package()` instead of `main()` in `src/<package>/run.py`.
* The `bucket_name` key has been removed from the string representation of the following datasets: `CSVS3DataSet`, `HDFS3DataSet`, `PickleS3DataSet`, `contrib.io.parquet.ParquetS3DataSet`, `contrib.io.gcs.JSONGCSDataSet`.
* Moved the `mem_profiler` decorator to `contrib` and separated the `contrib` decorators so that dependencies are modular. You may need to update your import paths: for example, the pyspark decorators should be imported as `from kedro.contrib.decorators.pyspark import <pyspark_decorator>` instead of `from kedro.contrib.decorators import <pyspark_decorator>`.

Thanks for supporting contributions: Sheldon Tsen, @roumail, Karlson Lee, Waylon Walker, Deepyaman Datta, Giovanni, Zain Patel
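The `--params` flag above takes comma-separated `key:value` pairs. A rough sketch of that format in plain Python (this is an illustration only; `parse_cli_params` is a hypothetical helper, not Kedro's actual parser, which also merges the values over `parameters.yml`):

```python
def parse_cli_params(raw):
    """Parse a ``param1:value1,param2:value2`` string into a dict.

    Hypothetical sketch of the ``kedro run --params`` value format;
    all values are kept as strings here for simplicity.
    """
    params = {}
    for pair in raw.split(","):
        key, _, value = pair.partition(":")
        params[key.strip()] = value.strip()
    return params

print(parse_cli_params("param1:value1,param2:value2"))
# {'param1': 'value1', 'param2': 'value2'}
```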
Published by nakhan98 almost 5 years ago
Major features and improvements:
* `kedro jupyter` now gives the default kernel a sensible name.
* `Pipeline.name` has been deprecated in favour of `Pipeline.tags`.
* Reuse pipelines within a Kedro project using `Pipeline.transform`; it simplifies dataset and node renaming.
* Added a Jupyter Notebook line magic (`%run_viz`) to run `kedro viz` in a Notebook cell (requires `kedro-viz` version 3.0.0 or later).
* Added `NetworkXLocalDataSet` in `kedro.contrib.io.networkx` to load and save local graphs (JSON format) via NetworkX. (by @josephhaaga)
* Added `SparkHiveDataSet` in `kedro.contrib.io.pyspark.SparkHiveDataSet`, allowing usage of Spark and insert/upsert on non-transactional Hive tables.

Bug fixes and other changes:
* `kedro.contrib.config.TemplatedConfigLoader` now supports name/dict key templating and default values.
* The `get_last_load_version()` method for versioned datasets now returns the exact last load version if the dataset has been loaded at least once, and `None` otherwise.
* Fixed the `_exists` method for versioned `SparkDataSet`.
* Enabled the customisation of `ExcelLocalDataSet` by specifying options under the `writer` key in `save_args`.
* Fixed the `kedro install` command failing on Windows if `src/requirements.txt` contains a different version of Kedro.
* Tags can now be provided as a single string (e.g. `tags="my_tag"`).

Breaking changes to the API:
* Removed the `_check_paths_consistency()` method from `AbstractVersionedDataSet`. The version consistency check is now done in `AbstractVersionedDataSet.save()`; custom versioned datasets should modify their `save()` method implementation accordingly.

Thanks for supporting contributions: Joseph Haaga, Deepyaman Datta, Joost Duisters, Zain Patel, Tom Vigrass
Published by nakhan98 about 5 years ago
Bug fixes and other changes:
* Pinned `PyTables` so that we maintain support for Python 3.5.

Published by nakhan98 about 5 years ago
Major features and improvements:
* Added `--load-version`, a `kedro run` argument that allows you to run the pipeline with a particular load version of a dataset.
* Support for modular pipelines in `src/`: break the pipeline into isolated parts with reusability in mind.
* Support for multiple pipelines: run a specific pipeline with `kedro run --pipeline NAME`.
* Added the `MatplotlibWriter` dataset in `contrib` for saving Matplotlib images.
* An ability to template/parameterize configuration files with `kedro.contrib.config.TemplatedConfigLoader`.
* Parameters are exposed as a context property, `context.params`.
* Added a `max_workers` parameter for `ParallelRunner`.

Bug fixes and other changes:
* Users will override the `_get_pipeline` abstract method in `ProjectContext(KedroContext)` in `run.py` rather than the `pipeline` abstract property. The `pipeline` property is not abstract anymore.
* Added a `catalog` global variable to `00-kedro-init.py`, allowing you to load datasets with `catalog.load()`.
* Fixed `ConfigLoader` loading the same file more than once, and deduplicated the `conf_paths` passed in.
* Added an `--open` flag to `kedro build-docs` that opens the documentation on build.
* Updated the `Pipeline` representation to include the name of the pipeline, also making it readable as a context property.
* `kedro.contrib.io.pyspark.SparkDataSet` and `kedro.contrib.io.azure.CSVBlobDataSet` now support versioning.

Breaking changes to the API:
* `KedroContext.run()` no longer accepts `catalog` and `pipeline` arguments.
* `node.inputs` now returns the node's inputs in the order required to bind them properly to the node's function.

Thanks for supporting contributions: Deepyaman Datta, Luciano Issoe, Joost Duisters, Zain Patel, William Ashford, Karlson Lee
Published by nakhan98 about 5 years ago
Major features and improvements:
* Extended versioning support to cover the tracking of environment setup, code and datasets.
* Added `FeatherLocalDataSet` in `contrib` for usage with pandas. (by @mdomarsaleem)
* Added `get_last_load_version` and `get_last_save_version` to `AbstractVersionedDataSet`.
* Implemented a `__call__` method on `Node` to allow users to execute `my_node(input1=1, input2=2)` as an alternative to `my_node.run(dict(input1=1, input2=2))`.
* Added a new `--from-inputs` run argument.

Bug fixes and other changes:
* Fixed `load_context()` not loading context in non-Kedro Jupyter Notebooks.
* Fixed `ConfigLoader.get()` not listing nested files for `**`-ending glob patterns.
* Updated the documentation in `03_configuration` regarding how to modify the configuration path.
* Renamed `extras/kedro_project_loader.py` to `extras/ipython_loader.py`; it now runs any IPython startup scripts without relying on the Kedro project structure.

Breaking changes to the API:
* None
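The new `__call__` method means `my_node(input1=1, input2=2)` behaves like `my_node.run(dict(input1=1, input2=2))`. A minimal sketch of that delegation (a toy stand-in, not Kedro's `Node` class):

```python
class Node:
    """Toy stand-in for a pipeline node, showing __call__ delegating to run."""

    def __init__(self, func):
        self._func = func

    def run(self, inputs):
        # inputs is a dict of named inputs, as in my_node.run(dict(...))
        return self._func(**inputs)

    def __call__(self, **kwargs):
        # my_node(input1=1, input2=2) is sugar for my_node.run(dict(...))
        return self.run(dict(kwargs))

my_node = Node(lambda input1, input2: input1 + input2)
assert my_node(input1=1, input2=2) == my_node.run(dict(input1=1, input2=2))
print(my_node(input1=1, input2=2))  # 3
```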
Published by nakhan98 about 5 years ago
Major features and improvements:
* Added a `KedroContext` base class which holds the configuration and Kedro's main functionality (catalog, pipeline, config, runner).
* Added a new CLI command `kedro jupyter convert` to facilitate converting Jupyter Notebook cells into Kedro nodes.
* Added support for `pip-compile` and a new Kedro command `kedro build-reqs` that generates `requirements.txt` based on `requirements.in`.
* `kedro install` will install packages to a conda environment if `src/environment.yml` exists in your project.
* Added a new `--node` flag to `kedro run`, allowing users to run only the nodes with the specified names.
* Added new `--from-nodes` and `--to-nodes` run arguments, allowing users to run a range of nodes from the pipeline.
* Added the prefix `params:` to the parameters specified in `parameters.yml`, which allows users to differentiate between their different parameter node inputs and outputs.
* Added `CSVHTTPDataSet` to load CSV using HTTP(s) links.
* Added `JSONBlobDataSet` to load json (-delimited) files from Azure Blob Storage.
* Added `ParquetS3DataSet` in `contrib` for usage with pandas. (by @mmchougule)
* Added `CachedDataSet` in `contrib`, which will cache data in memory to avoid io/network operations. It will clear the cache once a dataset is no longer needed by a pipeline. (by @tsanikgr)
* Added `YAMLLocalDataSet` in `contrib` to load and save local YAML files. (by @Minyus)

Bug fixes and other changes:
* The `anyconfig` default log level changed from `INFO` to `WARNING`.
* Added information on installed plugins to `kedro info`.
* Documentation built via `kedro build-docs` will resemble the style of `kedro docs`.

Breaking changes to the API:
* Simplified the Kedro template in `run.py` with the introduction of the `KedroContext` class.
* Merged `FilepathVersionMixIn` and `S3VersionMixIn` under one abstract class, `AbstractVersionedDataSet`, which extends `AbstractDataSet`.
* `name` changed to be a keyword-only argument for `Pipeline`.
* `CSVLocalDataSet` no longer supports URLs. `CSVHTTPDataSet` supports URLs.

Migration guide from Kedro 0.14.* to Kedro 0.15.0

This guide assumes that your project-specific code is stored in the dedicated Python package under `src/`.
The breaking changes were introduced in the following project template files:
* `<project-name>/.ipython/profile_default/startup/00-kedro-init.py`
* `<project-name>/kedro_cli.py`
* `<project-name>/src/tests/test_run.py`
* `<project-name>/src/<package-name>/run.py`
* `<project-name>/.kedro.yml` (new file)

The easiest way to migrate your project from Kedro 0.14.* to Kedro 0.15.0 is to create a new project (by using `kedro new`) and move code and files bit by bit, as suggested in the detailed guide below:

1. Create a new project with the same name by running `kedro new`.
2. Copy the following folders to the new project:
   - `results/`
   - `references/`
   - `notebooks/`
   - `logs/`
   - `data/`
   - `conf/`
3. If you customised your old `src/<package>/run.py`, make sure you apply the same customisations to `src/<package>/run.py` in the new project:
   - If you customised `get_config()`, you can override the `config_loader` property in the `ProjectContext` derived class.
   - If you customised `create_catalog()`, you can override the `catalog()` property in the `ProjectContext` derived class.
   - If you customised `run()`, you can override the `run()` method in the `ProjectContext` derived class.
   - If you customised the default `env`, you can override it in the `ProjectContext` derived class or pass it at construction. By default, `env` is `local`.
   - If you customised the default `root_conf`, you can override the `CONF_ROOT` attribute in the `ProjectContext` derived class. By default, the `KedroContext` base class has the `CONF_ROOT` attribute set to `conf`.
4. Update the following renamed variables wherever you used them:
   - `proj_dir` -> `context.project_path`
   - `proj_name` -> `context.project_name`
   - `conf` -> `context.config_loader`
   - `io` -> `context.catalog` (e.g., `io.load()` -> `context.catalog.load()`)
5. If you customised your `kedro_cli.py`, you need to apply the same customisations to your `kedro_cli.py` in the new project.
6. Copy the contents of the old project's `src/requirements.txt` into the new project's `src/requirements.in` and, from the project root directory, run the `kedro build-reqs` command in your terminal window.
If you defined any custom dataset classes which support versioning in your project, you need to apply the following changes:
* Make sure your dataset inherits from `AbstractVersionedDataSet` only.
* Call `super().__init__()` with the appropriate arguments in the dataset's `__init__`. If storing on a local filesystem, providing the filepath and the version is enough. Otherwise, you should also pass in an `exists_function` and a `glob_function` that emulate `exists` and `glob` in a different filesystem (see `CSVS3DataSet` as an example).
* Remove the setting of the `_filepath` and `_version` attributes in the dataset's `__init__`, as this is taken care of in the base abstract class.
* Any calls to the `_get_load_path` and `_get_save_path` methods should take no arguments.
* Ensure you convert the output of `_get_load_path` and `_get_save_path` appropriately, as these now return `PurePath`s instead of strings.
* Make sure `_check_paths_consistency` is called with `PurePath`s as input arguments, instead of strings.

These steps should have brought your project to Kedro 0.15.0. There might be some more minor tweaks needed, as every project is unique, but now you have a pretty solid base to work with. If you run into any problems, please consult the Kedro documentation.
Dmitry Vukolov, Jo Stichbury, Angus Williams, Deepyaman Datta, Mayur Chougule, Marat Kopytjuk, Evan Miller, Yusuke Minami
Published by nakhan98 over 5 years ago
Major features and improvements:
* Tab completion for catalog datasets in `ipython` or `jupyter` sessions. (Thank you @datajoely and @WaylonWalker)
* Datasets can now provide a `release` function that instructs them to free any cached data. The runners will call this when the dataset is no longer needed downstream.
* Added support for `~` for `TextLocalDataSet` (see issue #19).
* Added a `short_name` property to `Node`s for a display-friendly (but not necessarily unique) name.
* Added a Kedro project loader for IPython: `extras/kedro_project_loader.py`.

Breaking changes to the API:
* Removed an argument from the `MemoryDataSet` constructor and from the `AbstractRunner.create_default_data_set` method.

Published by nakhan98 over 5 years ago
Major features and improvements:
* Merged `ExistsMixin` into `AbstractDataSet`.
* `Pipeline.node_dependencies` returns a dictionary keyed by node, with sets of parent nodes as values; `Pipeline` and `ParallelRunner` were refactored to make use of this for topological sort for node dependency resolution and running pipelines, respectively.
* `Pipeline.grouped_nodes` returns a list of sets, rather than a list of lists.

Published by nakhan98 over 5 years ago
Major features and improvements:
* Added `HDFS3DataSet`.

Bug fixes and other changes:
* `run.py` will throw a warning instead of an error if `credentials.yml` is not present.

Breaking changes to the API:
* None
Published by nakhan98 over 5 years ago
The initial release of Kedro.
Thanks for supporting contributions: Jo Stichbury, Aris Valtazanos, Fabian Peters, Guilherme Braccialli, Joel Schwarzmann, Miguel Beltre, Mohammed ElNabawy, Deepyaman Datta, Shubham Agrawal, Oleg Andreyev, Mayur Chougule, William Ashford, Ed Cannon, Nikhilesh Nukala, Sean Bailey, Vikram Tegginamath, Thomas Huijskens, Musa Bilal.
We are also grateful to everyone who advised and supported us, filed issues or helped resolve them, asked and answered questions and were part of inspiring discussions.