Smart Automation Tool for building modern Data Lakes and Data Pipelines
GPL-3.0 License
Smart Data Lake Builder is a data lake automation framework that makes loading and transforming data a breeze. It is implemented in Scala and builds on top of open-source big data technologies like Apache Hadoop and Apache Spark, including connectors for diverse data sources (HadoopFS, Hive, DeltaLake, JDBC, Splunk, Webservice, SFTP, JMS, Excel, Access) and file formats.
Some common use cases include:
See Features for a comprehensive list of Smart Data Lake Builder features.
The following diagram shows the core concepts:
A data object defines the location and format of data. Some data objects require a connection to access remote data (e.g. a database connection).
The "data processors" are called actions. An action requires at least one input and one output data object. It reads data from the input data object, processes it, and writes the result to the output data object. Many actions are predefined, e.g. transforming data from JSON to CSV, but you can also define your own custom transformer action.
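To make the read, process, write cycle concrete, here is a minimal Scala sketch of the idea. It is a toy in-memory model, not the framework's API: `InMemoryDataObject`, `TransformAction`, and the ids used are all hypothetical names chosen for illustration.

```scala
// Toy in-memory model of the concept; these types are illustrative
// assumptions and NOT Smart Data Lake Builder's actual classes.
object ActionSketch {
  type Row = Map[String, String]

  // A data object describes where data lives; here it is just a buffer in memory.
  final class InMemoryDataObject(val id: String) {
    private var rows: Seq[Row] = Seq.empty
    def read(): Seq[Row] = rows
    def write(data: Seq[Row]): Unit = { rows = data }
  }

  // An action reads from its input, applies a transformation,
  // and writes the result to its output.
  final case class TransformAction(
      id: String,
      input: InMemoryDataObject,
      output: InMemoryDataObject,
      transform: Seq[Row] => Seq[Row]
  ) {
    def run(): Unit = output.write(transform(input.read()))
  }

  def main(args: Array[String]): Unit = {
    val raw = new InMemoryDataObject("raw-airports")
    val clean = new InMemoryDataObject("clean-airports")
    raw.write(Seq(Map("code" -> "zrh"), Map("code" -> "gva")))

    // Custom transformer: upper-case the airport code of every row.
    val action = TransformAction(
      "clean-airports-action", raw, clean,
      rows => rows.map(r => r.updated("code", r("code").toUpperCase))
    )
    action.run()

    println(clean.read())
  }
}
```

In the real framework the transformation would typically operate on Spark data rather than on an in-memory sequence; the sketch only shows the shape of the contract between an action and its data objects.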
Actions connect different Data Objects and implicitly define a directed acyclic graph, as they model the dependencies needed to fill a Data Object. This automatically generated, arbitrarily complex data flow can be divided into Feeds (subgraphs) for execution and monitoring.
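The way a graph falls out of the action definitions can be sketched in a few lines of Scala. This is an illustration under assumptions, not the framework's scheduler: the `Action` case class, the `feed` label, and all ids are hypothetical, and the ordering uses a plain Kahn-style topological sort.

```scala
// Illustrative sketch only; names and algorithm are assumptions,
// not Smart Data Lake Builder's implementation.
object DagSketch {
  final case class Action(id: String, inputs: Set[String], outputs: Set[String], feed: String)

  // Action b depends on action a when b reads a data object that a writes.
  def edges(actions: Seq[Action]): Set[(String, String)] =
    (for {
      a <- actions
      b <- actions
      if a.id != b.id && a.outputs.intersect(b.inputs).nonEmpty
    } yield (a.id, b.id)).toSet

  // Kahn-style ordering: repeatedly emit the actions whose upstream
  // actions have all been emitted. Returns Left if a cycle is found.
  def executionOrder(actions: Seq[Action]): Either[String, List[String]] = {
    val deps = edges(actions)

    @annotation.tailrec
    def loop(remaining: Set[String], done: List[String]): Either[String, List[String]] =
      if (remaining.isEmpty) Right(done.reverse)
      else {
        val ready = remaining.filter { id =>
          deps.forall { case (from, to) => to != id || !remaining(from) }
        }
        if (ready.isEmpty) Left(s"cycle among: ${remaining.toList.sorted.mkString(", ")}")
        else loop(remaining -- ready, ready.toList.sorted.reverse ::: done)
      }

    loop(actions.map(_.id).toSet, Nil)
  }

  // A feed is the subgraph of actions carrying the same feed label.
  def feed(actions: Seq[Action], name: String): Seq[Action] =
    actions.filter(_.feed == name)

  def main(args: Array[String]): Unit = {
    val actions = Seq(
      Action("join-departures-airports", Set("stg-airports", "stg-departures"), Set("btl-connected"), "compute"),
      Action("download-airports", Set("ext-airports"), Set("stg-airports"), "download"),
      Action("download-departures", Set("ext-departures"), Set("stg-departures"), "download")
    )
    println(executionOrder(actions))                   // whole graph
    println(executionOrder(feed(actions, "download"))) // one feed only
  }
}
```

Note that no edges are declared anywhere: the dependency between the join and the two downloads follows purely from the shared data object ids, which is the sense in which the graph is defined implicitly.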
All metadata, i.e. connections, data objects, and actions, is defined in a central configuration file, usually called application.conf. The file format is HOCON, which makes it easy to edit.
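As a rough illustration of what such a file might look like, the following HOCON fragment declares two data objects and one action that connects them. The ids, URL, and path are made up, and the type and key names (`WebserviceFileDataObject`, `CsvFileDataObject`, `FileTransferAction`, `inputId`, `outputId`, `metadata.feed`) are assumptions that should be checked against the Reference documentation before use.

```hocon
# Hypothetical example; verify type and key names against the Reference page.
dataObjects {
  ext-airports {
    type = WebserviceFileDataObject
    url = "https://example.com/airports.csv"   # placeholder URL
  }
  stg-airports {
    type = CsvFileDataObject
    path = "stg-airports"                      # placeholder path
  }
}

actions {
  download-airports {
    type = FileTransferAction
    inputId = ext-airports
    outputId = stg-airports
    metadata {
      feed = download                          # feed label used to select a subgraph
    }
  }
}
```

The action refers to its data objects by id only, so the dependency graph can be derived from the configuration without any additional wiring.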
To see how all this works in action, head over to the Getting Started page.
www.sbb.ch: Provided the previously developed software as the foundation for the open source project
www.elca.ch: Comprehensively revised the software and released it as an open source project
Getting Started, Reference, Architecture, Testing, Glossary, Troubleshooting, FAQ, Contributing, Running in the Public Cloud