Processing engine and React components for constructing configuration-based data transformation and processing pipelines.
MIT License
This project provides a collection of components for executing processing pipelines, particularly oriented to data wrangling. Detailed documentation is provided in subfolders, with an overview of high-level goals and concepts here. Most of the documentation within individual packages is tailored to developers needing to understand how the code is organized and executed. Higher-level concepts for the project as a whole, constructing workflows, etc. are in the root docs folder.
There are four primary goals of the project:
Individual documentation for the JavaScript and Python implementations can be found in their respective folders. Broad documentation about building pipelines and the available verbs is available in the docs folder.
We currently have seven primary JavaScript packages:
Also note that each JavaScript package has a generated docs folder containing Markdown API documentation extracted from code comments using api-extractor.
The Python packages are much simpler, because there is no associated web application or component code.
We generate JSONSchema for formal project artifacts including resource definitions and workflow specifications. This allows validation by any consumer and/or implementor. Schema versions are published on github.io for permanent reference. Each variant of a schema is hosted in perpetuity with semantic versioning. Aliases to the most recent (unversioned latest) and major revisions are also published. Here are direct links to the latest versions of our primary schemas:
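Because the schemas are standard JSONSchema, any consumer can validate a specification with an off-the-shelf validator. A minimal sketch using the `jsonschema` package follows; the inline schema fragment here is a hypothetical simplification for illustration only — the published schemas linked below are the source of truth.

```python
import jsonschema  # any draft-compliant JSONSchema validator works

# Hypothetical, heavily simplified stand-in for the published workflow schema.
WORKFLOW_SCHEMA = {
    "type": "object",
    "required": ["steps"],
    "properties": {
        "steps": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["verb"],
                "properties": {"verb": {"type": "string"}},
            },
        }
    },
}

spec = {"steps": [{"verb": "strings.upper"}]}

# Raises jsonschema.ValidationError if the spec does not conform.
jsonschema.validate(instance=spec, schema=WORKFLOW_SCHEMA)
```

In practice you would fetch the published schema from its permanent URL rather than inlining it, so your validation tracks the pinned schema version.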
Note that for the purposes of pipeline development, the workflow schema is primary. The rest are largely used for package management and table bundling in the web application.
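For illustration, a minimal workflow specification might look like the following. The exact field names (`steps`, `args`) and the argument shape shown here are illustrative assumptions; consult the published workflow schema for the authoritative structure.

```json
{
  "steps": [
    {
      "verb": "strings.upper",
      "args": { "column": "name", "to": "name_upper" }
    }
  ]
}
```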
For new verbs within the DataShaper toolkit, you must first determine if JavaScript and Python parity is desired. For operations that should be configurable via a UX, a JavaScript implementation is necessary. However, if the verb is primarily useful for data science workflows and has potentially complicated parameters, a Python-only implementation may be fine. We have a preference for parity to reduce confusion and allow for cross-platform execution of any pipelines created with the tool, but also recognize the value of the Python-based execution engine for configuring data science and ETL workflows that will only ever be run server-side.
Core verbs are built into the toolkit, and should generally have JavaScript and Python parity. Creating these verbs involves the following steps:
```json
"verb": {
  "const": "strings.upper",
  "type": "string"
}
```
The location of the verb must be in datashaper.engine.verbs.strings.upper.
Annotate the verb function with the `@verb` decorator to make it available to the Workflow engine. The `name` parameter of the decorator must match the verb name defined in the schema. For example:

```python
@verb(name="my_package.upper")
def upper(input: VerbInput, column: str, to: str):
    ...
```
Important Note: Verb names must be unique; if a verb already exists with the same name, you will get a `ValueError`. For example, attempting to register a new `"strings.upper"` will raise a `ValueError`. To create a custom version of an existing verb, register it under a unique name such as `"my_package.upper"`, as in the example above.
The Python implementation supports custom verbs supplied by your application, allowing you to build arbitrary processing pipelines that contain custom logic and processing steps.
TODO: document custom verb format
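Pending that documentation, the core of such a verb is ordinary table manipulation. For example, the body of an upper-casing verb might look like the following plain-pandas sketch; it is independent of any DataShaper API, and the `column`/`to` parameter names simply mirror the decorator example above.

```python
import pandas as pd


def upper(table: pd.DataFrame, column: str, to: str) -> pd.DataFrame:
    """Upper-case `column` and store the result in a new column `to`."""
    result = table.copy()
    result[to] = result[column].str.upper()
    return result


df = pd.DataFrame({"name": ["ada", "grace"]})
out = upper(df, column="name", to="name_upper")
```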
JavaScript

```sh
yarn
yarn build
yarn start
```
Python

```sh
poetry install
poetry run poe test
```
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.