kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.

APACHE-2.0 License

Downloads: 590.5K | Stars: 9.4K | Committers: 182


kedro - 0.15.5

Published by idanov almost 5 years ago

Major features and improvements

  • New CLI commands and command flags:
    • Load multiple kedro run CLI flags from a configuration file with the --config flag (e.g. kedro run --config run_config.yml)
    • Run parametrised pipeline runs with the --params flag (e.g. kedro run --params param1:value1,param2:value2).
    • Lint your project code using the kedro lint command. Your project is linted with black (Python 3.6+), flake8 and isort.
  • Load specific environments in Jupyter notebooks using the KEDRO_ENV environment variable, which globally sets the environment for the run, jupyter notebook and jupyter lab commands.
  • Added the following datasets:
    • CSVGCSDataSet dataset in contrib for working with CSV files in Google Cloud Storage.
    • ParquetGCSDataSet dataset in contrib for working with Parquet files in Google Cloud Storage.
    • JSONGCSDataSet dataset in contrib for working with JSON files in Google Cloud Storage.
    • MatplotlibS3Writer dataset in contrib for saving Matplotlib images to S3.
    • PartitionedDataSet for working with datasets split across multiple files.
    • JSONDataSet dataset for working with JSON files that uses fsspec to communicate with the underlying filesystem. It does not support the http(s) protocol for now.
  • Added s3fs_args to all S3 datasets.
  • Pipelines can be subtracted with pipeline1 - pipeline2.
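
The --params value shown above is a comma-separated list of key:value pairs. As a rough illustration of that format only, the sketch below splits such a string into a dictionary. It is not Kedro's own parser: parse_params is a hypothetical helper, and values are kept as plain strings here, whereas Kedro may treat them differently.

```python
def parse_params(raw: str) -> dict:
    """Split 'key1:value1,key2:value2' into a dict of strings (hypothetical helper)."""
    params = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty segments such as a trailing comma
        key, sep, value = item.partition(":")
        if not sep or not key.strip():
            raise ValueError(f"Expected 'key:value', got {item!r}")
        params[key.strip()] = value.strip()
    return params


print(parse_params("param1:value1,param2:value2"))
```

Splitting on the first colon only (via partition) means a value may itself contain a colon, which is a design choice of this sketch rather than documented Kedro behaviour.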

Bug fixes and other changes

  • ParallelRunner now works with SparkDataSet.
  • Allowed the use of nulls in parameters.yml.
  • Fixed an issue where %reload_kedro wasn't reloading all user modules.
  • Fixed pandas_to_spark and spark_to_pandas decorators to work with functions with kwargs.
  • Fixed a bug where kedro jupyter notebook and kedro jupyter lab would run a different Jupyter installation to the one in the local environment.
  • Implemented Databricks-compatible dataset versioning for SparkDataSet.
  • Fixed a bug where kedro package would fail in certain situations where kedro build-reqs was used to generate requirements.txt.
  • Made the bucket_name argument optional for the following datasets: CSVS3DataSet, HDFS3DataSet, PickleS3DataSet, contrib.io.parquet.ParquetS3DataSet, contrib.io.gcs.JSONGCSDataSet. The bucket name can now be included in the filepath along with the filesystem protocol (e.g. s3://bucket-name/path/to/key.csv).
  • Documentation improvements and fixes.
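
To make the filepath form above concrete, here is a minimal sketch of how a path such as s3://bucket-name/path/to/key.csv decomposes into protocol, bucket and key. It uses only the standard library's urlsplit. split_filepath is a hypothetical helper, and this is not necessarily how the datasets themselves parse the path.

```python
from urllib.parse import urlsplit


def split_filepath(filepath: str):
    """Return (protocol, bucket, key) for a path like s3://bucket/key (hypothetical helper)."""
    parts = urlsplit(filepath)
    protocol = parts.scheme        # the part before "://"
    bucket = parts.netloc          # the first path-like segment after "://"
    key = parts.path.lstrip("/")   # everything after the bucket
    return protocol, bucket, key


print(split_filepath("s3://bucket-name/path/to/key.csv"))
# ('s3', 'bucket-name', 'path/to/key.csv')
```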

Breaking changes to the API

  • Renamed entry point for running pip-installed projects to run_package() instead of main() in src/<package>/run.py.
  • bucket_name key has been removed from the string representation of the following datasets: CSVS3DataSet, HDFS3DataSet, PickleS3DataSet, contrib.io.parquet.ParquetS3DataSet, contrib.io.gcs.JSONGCSDataSet.
  • Moved the mem_profiler decorator to contrib and separated the contrib decorators so that dependencies are modular. You may need to update your import paths, for example the pyspark decorators should be imported as from kedro.contrib.decorators.pyspark import <pyspark_decorator> instead of from kedro.contrib.decorators import <pyspark_decorator>.

Thanks for supporting contributions

Sheldon Tsen, @roumail, Karlson Lee, Waylon Walker, Deepyaman Datta, Giovanni, Zain Patel

kedro - 0.15.4

Published by nakhan98 almost 5 years ago

Major features and improvements

  • kedro jupyter now gives the default kernel a sensible name.
  • Pipeline.name has been deprecated in favour of Pipeline.tags.
  • Reuse pipelines within a Kedro project using Pipeline.transform, which simplifies dataset and node renaming.
  • Added Jupyter Notebook line magic (%run_viz) to run kedro viz in a Notebook cell (requires kedro-viz version 3.0.0 or later).
  • Added the following datasets:
    • NetworkXLocalDataSet in kedro.contrib.io.networkx to load and save local graphs (JSON format) via NetworkX. (by @josephhaaga)
    • SparkHiveDataSet in kedro.contrib.io.pyspark.SparkHiveDataSet allowing usage of Spark and insert/upsert on non-transactional Hive tables.
  • kedro.contrib.config.TemplatedConfigLoader now supports name/dict key templating and default values.
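
As an illustration of what templating both names (dict keys) and values can look like, the sketch below fills ${name}-style placeholders in a nested configuration. It uses the standard library's string.Template as a stand-in and is not the TemplatedConfigLoader implementation. The placeholder syntax, the fill_templates helper and the sample catalog entry are assumptions of this sketch, and default values are not covered.

```python
from string import Template


def fill_templates(config, values):
    """Recursively replace ${name} placeholders in dict keys and string values."""
    if isinstance(config, dict):
        return {
            fill_templates(key, values): fill_templates(val, values)
            for key, val in config.items()
        }
    if isinstance(config, list):
        return [fill_templates(item, values) for item in config]
    if isinstance(config, str):
        # substitute() raises KeyError if a placeholder has no value.
        return Template(config).substitute(values)
    return config  # numbers, booleans and None pass through unchanged


catalog = {"${dataset_name}": {"filepath": "${base_path}/cars.csv", "versioned": True}}
values = {"dataset_name": "cars", "base_path": "data/01_raw"}
print(fill_templates(catalog, values))
# {'cars': {'filepath': 'data/01_raw/cars.csv', 'versioned': True}}
```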

Bug fixes and other changes

  • get_last_load_version() method for versioned datasets now returns exact last load version if the dataset has been loaded at least once and None otherwise.
  • Fixed a bug in _exists method for versioned SparkDataSet.
  • Enabled the customisation of the ExcelWriter in ExcelLocalDataSet by specifying options under writer key in save_args.
  • Fixed a bug in IPython startup script, attempting to load context from the incorrect location.
  • Removed capping the length of a dataset's string representation.
  • Fixed kedro install command failing on Windows if src/requirements.txt contains a different version of Kedro.
  • Enabled passing a single tag into a node or a pipeline without having to wrap it in a list (i.e. tags="my_tag").
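
Accepting either tags="my_tag" or a list of tags needs a little care, because a Python string is itself iterable. A minimal sketch of such normalisation follows. to_tag_set is a hypothetical helper, not part of Kedro's API.

```python
def to_tag_set(tags=None) -> set:
    """Normalise None, a single string, or an iterable of strings into a set of tags."""
    if tags is None:
        return set()
    if isinstance(tags, str):
        # Without this branch, set("my_tag") would yield individual characters.
        return {tags}
    return set(tags)


print(to_tag_set("my_tag"))
print(to_tag_set(["my_tag", "other_tag"]) == {"my_tag", "other_tag"})
```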

Breaking changes to the API

  • Removed _check_paths_consistency() method from AbstractVersionedDataSet. Version consistency check is now done in AbstractVersionedDataSet.save(). Custom versioned datasets should modify save() method implementation accordingly.

Thanks for supporting contributions

Joseph Haaga, Deepyaman Datta, Joost Duisters, Zain Patel, Tom Vigrass

kedro - 0.15.3

Published by nakhan98 about 5 years ago

Bug fixes and other changes

  • Narrowed the requirements for PyTables so that we maintain support for Python 3.5.

kedro - 0.15.2

Published by nakhan98 about 5 years ago

Major features and improvements

  • Added --load-version, a kedro run argument that allows you to run the pipeline with a particular load version of a dataset.
  • Support for modular pipelines in src/: break the pipeline into isolated parts with reusability in mind.
  • Support for multiple pipelines: the ability to have multiple entry point pipelines and choose one with kedro run --pipeline NAME.
  • Added a MatplotlibWriter dataset in contrib for saving Matplotlib images.
  • The ability to template/parameterize configuration files with kedro.contrib.config.TemplatedConfigLoader.
  • Parameters are exposed as a context property for ease of access in IPython / Jupyter Notebooks with context.params.
  • Added max_workers parameter for ParallelRunner.
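
The idea behind choosing one of several entry point pipelines by name can be sketched as a simple registry lookup. Everything here is illustrative: plain lists of node names stand in for real Pipeline objects, select_pipeline is a hypothetical helper, and the "__default__" key used for the fallback is an assumption of this sketch.

```python
def select_pipeline(pipelines: dict, name: str = None):
    """Pick a pipeline by name, falling back to a default entry (hypothetical helper)."""
    key = name or "__default__"
    if key not in pipelines:
        raise KeyError(f"Unknown pipeline {key!r}; available: {sorted(pipelines)}")
    return pipelines[key]


# Plain lists of node names stand in for real Pipeline objects.
data_engineering = ["split_data", "clean_data"]
data_science = ["train_model", "evaluate_model"]
pipelines = {
    "de": data_engineering,
    "ds": data_science,
    "__default__": data_engineering + data_science,
}

print(select_pipeline(pipelines, "ds"))  # roughly what --pipeline ds would choose
print(select_pipeline(pipelines))        # no name given, so the default is used
```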

Bug fixes and other changes

  • Users now override the _get_pipeline abstract method in ProjectContext(KedroContext) in run.py rather than the pipeline abstract property. The pipeline property is not abstract anymore.
  • Improved the error message shown when a versioned local dataset is saved and an unversioned path already exists.
  • Added catalog global variable to 00-kedro-init.py, allowing you to load datasets with catalog.load().
  • Enabled tuples to be returned from a node.
  • Disallowed the ConfigLoader loading the same file more than once, and deduplicated the conf_paths passed in.
  • Added a --open flag to kedro build-docs that opens the documentation on build.
  • Updated the Pipeline representation to include name of the pipeline, also making it readable as a context property.
  • kedro.contrib.io.pyspark.SparkDataSet and kedro.contrib.io.azure.CSVBlobDataSet now support versioning.

Breaking changes to the API

  • KedroContext.run() no longer accepts catalog and pipeline arguments.
  • node.inputs now returns the node's inputs in the order required to bind them properly to the node's function.

Thanks for supporting contributions

Deepyaman Datta, Luciano Issoe, Joost Duisters, Zain Patel, William Ashford, Karlson Lee

kedro - 0.15.1

Published by nakhan98 about 5 years ago

Major features and improvements

  • Extended versioning support to cover the tracking of environment setup, code and datasets.
  • Added the following datasets:
    • FeatherLocalDataSet in contrib for usage with pandas. (by @mdomarsaleem)
  • Added get_last_load_version and get_last_save_version to AbstractVersionedDataSet.
  • Implemented the __call__ method on Node to allow users to execute my_node(input1=1, input2=2) as an alternative to my_node.run(dict(input1=1, input2=2)).
  • Added new --from-inputs run argument.
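
The relationship between the two call styles mentioned above, my_node(input1=1, input2=2) and my_node.run(dict(input1=1, input2=2)), can be sketched with a small stand-in class. MiniNode is hypothetical and far simpler than Kedro's Node. It only shows how __call__ can delegate to run().

```python
class MiniNode:
    """Tiny stand-in for a pipeline node: wraps a function and names its inputs and output."""

    def __init__(self, func, inputs, output):
        self.func = func
        self.inputs = list(inputs)
        self.output = output

    def run(self, data: dict) -> dict:
        # Bind the named inputs to the function's positional arguments, in order.
        args = [data[name] for name in self.inputs]
        return {self.output: self.func(*args)}

    def __call__(self, **kwargs) -> dict:
        # Keyword-style call that simply delegates to run().
        return self.run(kwargs)


def add(a, b):
    return a + b


my_node = MiniNode(add, inputs=["input1", "input2"], output="total")
print(my_node.run(dict(input1=1, input2=2)))  # {'total': 3}
print(my_node(input1=1, input2=2))            # {'total': 3}
```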

Bug fixes and other changes

  • Fixed a bug in load_context() not loading context in non-Kedro Jupyter Notebooks.
  • Fixed a bug in ConfigLoader.get() not listing nested files for **-ending glob patterns.
  • Fixed a logging config error in Jupyter Notebook.
  • Updated documentation in 03_configuration regarding how to modify the configuration path.
  • Documented the architecture of Kedro showing how we think about library, project and framework components.
  • extras/kedro_project_loader.py renamed to extras/ipython_loader.py and now runs any IPython startup scripts without relying on the Kedro project structure.
  • Fixed TypeError when validating partial function's signature.
  • After a node failure during a pipeline run, a resume command will be suggested in the logs. This command will not work if the required inputs are MemoryDataSets.

Breaking changes to the API

None

Thanks for supporting contributions

Omar Saleem, Mariana Silva, Anil Choudhary, Craig

kedro - 0.15.0

Published by nakhan98 about 5 years ago

Major features and improvements

  • Added KedroContext base class which holds the configuration and Kedro's main functionality (catalog, pipeline, config, runner).
  • Added a new CLI command kedro jupyter convert to facilitate converting Jupyter Notebook cells into Kedro nodes.
  • Added support for pip-compile and new Kedro command kedro build-reqs that generates requirements.txt based on requirements.in.
  • Running kedro install will install packages to conda environment if src/environment.yml exists in your project.
  • Added a new --node flag to kedro run, allowing users to run only the nodes with the specified names.
  • Added new --from-nodes and --to-nodes run arguments, allowing users to run a range of nodes from the pipeline.
  • Added prefix params: to the parameters specified in parameters.yml which allows users to differentiate between their different parameter node inputs and outputs.
  • Jupyter Lab/Notebook now starts with only one kernel by default.
  • Added the following datasets:
    • CSVHTTPDataSet to load CSV using HTTP(s) links.
    • JSONBlobDataSet to load json (-delimited) files from Azure Blob Storage.
    • ParquetS3DataSet in contrib for usage with pandas. (by @mmchougule)
    • CachedDataSet in contrib which will cache data in memory to avoid io/network operations. It will clear the cache once a dataset is no longer needed by a pipeline. (by @tsanikgr)
    • YAMLLocalDataSet in contrib to load and save local YAML files. (by @Minyus)
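
To illustrate the params: prefix described above, the sketch below turns a parameters dictionary into prefixed entries that a node could name as inputs. build_feed_dict is a hypothetical helper, the sample parameter names are made up, and keeping the whole dictionary under a "parameters" key is an assumption of this sketch.

```python
def build_feed_dict(parameters: dict) -> dict:
    """Expose each top-level parameter as 'params:<name>' (hypothetical helper)."""
    feed = {"parameters": parameters}  # the full dictionary, under one name
    for key, value in parameters.items():
        feed[f"params:{key}"] = value  # each individual parameter, prefixed
    return feed


# Stand-in for the contents of parameters.yml.
parameters = {"test_size": 0.2, "random_state": 3}
feed = build_feed_dict(parameters)
print(feed["params:test_size"])
print(sorted(feed))
```

The prefix keeps parameter inputs visibly distinct from dataset names, so an input called params:test_size cannot be confused with a dataset called test_size.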

Bug fixes and other changes

  • Documentation improvements including instructions on how to initialise a Spark session using YAML configuration.
  • anyconfig default log level changed from INFO to WARNING.
  • Added information on installed plugins to kedro info.
  • Added style sheets for project documentation, so the output of kedro build-docs will resemble the style of kedro docs.

Breaking changes to the API

  • Simplified the Kedro template in run.py with the introduction of KedroContext class.
  • Merged FilepathVersionMixIn and S3VersionMixIn under one abstract class AbstractVersionedDataSet which extends AbstractDataSet.
  • name changed to be a keyword-only argument for Pipeline.
  • CSVLocalDataSet no longer supports URLs. CSVHTTPDataSet supports URLs.

Migration guide from Kedro 0.14.X to Kedro 0.15.0

Migration for Kedro project template

This guide assumes that:

  • The framework-specific code has not been altered significantly.
  • Your project-specific code is stored in the dedicated Python package under src/.

The breaking changes were introduced in the following project template files:

  • <project-name>/.ipython/profile_default/startup/00-kedro-init.py
  • <project-name>/kedro_cli.py
  • <project-name>/src/tests/test_run.py
  • <project-name>/src/<package-name>/run.py
  • <project-name>/.kedro.yml (new file)

The easiest way to migrate your project from Kedro 0.14.* to Kedro 0.15.0 is to create a new project (by using kedro new) and move code and files bit by bit as suggested in the detailed guide below:

  1. Create a new project with the same name by running kedro new

  2. Copy the following folders to the new project:

    • results/
    • references/
    • notebooks/
    • logs/
    • data/
    • conf/

  3. If you customised your src/<package>/run.py, make sure you apply the same customisations to src/<package>/run.py in the new project:

    • If you customised get_config(), you can override config_loader property in ProjectContext derived class
    • If you customised create_catalog(), you can override catalog() property in ProjectContext derived class
    • If you customised run(), you can override run() method in ProjectContext derived class
    • If you customised default env, you can override it in ProjectContext derived class or pass it at construction. By default, env is local.
    • If you customised default root_conf, you can override CONF_ROOT attribute in ProjectContext derived class. By default, KedroContext base class has CONF_ROOT attribute set to conf.

  4. The following syntax changes are introduced in IPython or Jupyter Notebook/Lab:

    • proj_dir -> context.project_path
    • proj_name -> context.project_name
    • conf -> context.config_loader
    • io -> context.catalog (e.g., io.load() -> context.catalog.load())

  5. If you customised your kedro_cli.py, you need to apply the same customisations to your kedro_cli.py in the new project.

  6. Copy the contents of the old project's src/requirements.txt into the new project's src/requirements.in and, from the project root directory, run the kedro build-reqs command in your terminal window.

Migration for versioning custom dataset classes

If you defined any custom dataset classes which support versioning in your project, you need to apply the following changes:

  1. Make sure your dataset inherits from AbstractVersionedDataSet only.
  2. Call super().__init__() with the appropriate arguments in the dataset's __init__. If storing on a local filesystem, providing the filepath and the version is enough. Otherwise, you should also pass in an exists_function and a glob_function that emulate exists and glob in a different filesystem (see CSVS3DataSet as an example).
  3. Remove setting of the _filepath and _version attributes in the dataset's __init__, as this is taken care of in the base abstract class.
  4. Any calls to _get_load_path and _get_save_path methods should take no arguments.
  5. Ensure you convert the output of _get_load_path and _get_save_path appropriately, as these now return PurePaths instead of strings.
  6. Make sure _check_paths_consistency is called with PurePaths as input arguments, instead of strings.
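
Step 5 above is mostly a matter of converting between PurePath objects and strings at the point where a library expects one or the other. A minimal sketch with the standard library follows. The path and version string are made-up examples, and load_path merely stands in for whatever _get_load_path() returns in a real dataset.

```python
from pathlib import PurePosixPath

# Stand-in for the value returned by _get_load_path(); the layout is illustrative.
load_path = PurePosixPath("data/01_raw/cars.csv") / "2019-01-01T00.00.00.000Z" / "cars.csv"

# Libraries that expect a plain string need an explicit conversion.
as_text = str(load_path)
print(as_text)

# Going the other way, wrap a string to get path operations back.
round_trip = PurePosixPath(as_text)
print(round_trip.name, round_trip.parent.name)
```

PurePosixPath is used here so the output is the same on every operating system. Code that deals with local Windows paths may prefer PurePath, which picks the flavour for the current platform.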

These steps should have brought your project to Kedro 0.15.0. There might be some more minor tweaks needed as every project is unique, but now you have a pretty solid base to work with. If you run into any problems, please consult the Kedro documentation.

Thanks for supporting contributions

Dmitry Vukolov, Jo Stichbury, Angus Williams, Deepyaman Datta, Mayur Chougule, Marat Kopytjuk, Evan Miller, Yusuke Minami

kedro - 0.14.3

Published by nakhan98 over 5 years ago

Major features and improvements

  • Tab completion for catalog datasets in IPython or Jupyter sessions. (Thank you @datajoely and @WaylonWalker)
  • Added support for transcoding, an ability to decouple loading/saving mechanisms of a dataset from its storage location, denoted by adding '@' to the dataset name.
  • Datasets have a new release function that instructs them to free any cached data. The runners will call this when the dataset is no longer needed downstream.
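
The '@' naming convention for transcoding can be pictured as splitting a catalog entry name into a shared base name and a format suffix, so that two entries refer to the same stored data with different load/save mechanisms. The sketch below shows only that naming idea. split_transcoded_name and the example names are hypothetical, not Kedro's implementation.

```python
def split_transcoded_name(name: str):
    """Split 'base@format' into (base, format); format is None if there is no '@'."""
    base, sep, fmt = name.partition("@")
    if sep and (not base or not fmt or "@" in fmt):
        raise ValueError(f"Expected 'base@format', got {name!r}")
    return base, (fmt or None)


# Two catalog entries that would share one underlying dataset name.
print(split_transcoded_name("my_dataframe@spark"))   # ('my_dataframe', 'spark')
print(split_transcoded_name("my_dataframe@pandas"))  # ('my_dataframe', 'pandas')
print(split_transcoded_name("plain_dataset"))        # ('plain_dataset', None)
```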

Bug fixes and other changes

  • Add support for pipeline nodes made up from partial functions.
  • Expand user home directory ~ for TextLocalDataSet (see issue #19).
  • Add a short_name property to Nodes for a display-friendly (but not necessarily unique) name.
  • Add Kedro project loader for IPython: extras/kedro_project_loader.py.
  • Fix source file encoding issues with Python 3.5 on Windows.
  • Fix local project source not having priority over the same source installed as a package, leading to local updates not being recognised.

Breaking changes to the API

  • Remove the max_loads argument from the MemoryDataSet constructor and from the AbstractRunner.create_default_data_set method.

Thanks for supporting contributions

Joel Schwarzmann, Alex Kalmikov

kedro - 0.14.2

Published by nakhan98 over 5 years ago

Major features and improvements

  • Added Data Set transformer support in the form of AbstractTransformer and DataCatalog.add_transformer.

Breaking changes to the API

  • Merged the ExistsMixin into AbstractDataSet.
  • Pipeline.node_dependencies returns a dictionary keyed by node, with sets of parent nodes as values; Pipeline and ParallelRunner were refactored to make use of this for topological sort for node dependency resolution and running pipelines respectively.
  • Pipeline.grouped_nodes returns a list of sets, rather than a list of lists.
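
The two data shapes described above, a dictionary mapping each node to the set of its parents and a list of sets of nodes, fit together naturally in a topological sort. The following is a generic Kahn-style sketch, not Kedro's code. group_nodes is a hypothetical helper, strings stand in for node objects, and every parent is assumed to appear as a key in the dictionary.

```python
def group_nodes(node_dependencies: dict) -> list:
    """Topologically sort nodes into a list of sets.

    node_dependencies maps each node to the set of its parent nodes.
    Nodes in the same set do not depend on each other.
    """
    remaining = {node: set(parents) for node, parents in node_dependencies.items()}
    groups = []
    while remaining:
        # Nodes whose parents have all been placed in earlier groups are ready.
        ready = {node for node, parents in remaining.items() if not parents}
        if not ready:
            raise ValueError("Circular dependency detected")
        groups.append(ready)
        remaining = {
            node: parents - ready
            for node, parents in remaining.items()
            if node not in ready
        }
    return groups


deps = {
    "clean": set(),
    "features": {"clean"},
    "labels": {"clean"},
    "train": {"features", "labels"},
}
print(group_nodes(deps))
```

Returning sets rather than lists makes it explicit that nodes within one group have no required order among themselves, which is what lets a parallel runner schedule them together.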

Thanks for supporting contributions

Darren Gallagher, Zain Patel

kedro - 0.14.1

Published by nakhan98 over 5 years ago

Major features and improvements

  • New I/O module HDFS3DataSet.

Bug fixes and other changes

  • Improved API docs.
  • Template run.py will throw a warning instead of error if credentials.yml is not present.

Breaking changes to the API

None

kedro - 0.14.0

Published by nakhan98 over 5 years ago

Major features and improvements

The initial release of Kedro.

Thanks for supporting contributions

Jo Stichbury, Aris Valtazanos, Fabian Peters, Guilherme Braccialli, Joel Schwarzmann, Miguel Beltre, Mohammed ElNabawy, Deepyaman Datta, Shubham Agrawal, Oleg Andreyev, Mayur Chougule, William Ashford, Ed Cannon, Nikhilesh Nukala, Sean Bailey, Vikram Tegginamath, Thomas Huijskens, Musa Bilal.

We are also grateful to everyone who advised and supported us, filed issues or helped resolve them, asked and answered questions and were part of inspiring discussions.
