Data engineering meets software engineering
📚 Documentation: https://aidictive.github.io/tuberia
⌨️ Source Code: https://github.com/aidictive/tuberia
Tuberia is born from the need to bring the worlds of data and software engineering closer together, and it aims to solve many of the problems that are common in data projects.
You can view Tuberia as if it were a compiler. Instead of compiling a programming language, it compiles the steps necessary for your data pipeline to run successfully.
Tuberia is not an orchestrator, but it lets you run the Python code you write on any existing orchestrator: Airflow, Prefect, Databricks Jobs, Data Factory, and more.
Tuberia provides some abstraction over where the code is executed, but it defines precisely which steps are needed to execute it. For example, the following code creates a PySpark DataFrame from the `range` function and writes it as a Delta table.
```python
import pyspark.sql.functions as F

from tuberia import PySparkTable, run


class Range(PySparkTable):
    """Table with numbers from 1 to `n`.

    Attributes:
        n: Max number in table.
    """

    n: int = 10

    def df(self):
        return self.spark.range(1, self.n + 1).withColumn(
            "id", F.col(self.schema.id)
        )


class DoubleRange(PySparkTable):
    range: Range = Range()

    def df(self):
        return self.range.read().withColumn("id", F.col("id") * 2)


run(DoubleRange())
```
!!! warning
    The previous code may not work yet, and it may change. Please note that
    this project is at an early stage of development.
All docstrings included in the code are used to generate documentation about your data pipeline. That information, together with the results of data expectations/data quality rules, helps you keep documentation that is always complete and up to date.
Besides that, as you have seen, Tuberia is pure Python, so writing unit tests and data tests is easy. Programming gurus will enjoy data engineering again!
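As a minimal sketch of what such a test could look like (this is not Tuberia's actual test API; `double_ids` is a hypothetical plain-Python helper that mirrors the `DoubleRange` transformation without requiring a Spark session):

```python
def double_ids(ids):
    """Hypothetical pure-Python version of DoubleRange's logic: double every id.

    In a real pipeline this would run on a PySpark DataFrame; factoring the
    transformation into a plain function makes it trivially unit-testable.
    """
    return [i * 2 for i in ids]


def test_double_ids():
    # Numbers 1..10, as produced by the Range table with its default n=10.
    assert double_ids(list(range(1, 11))) == [
        2, 4, 6, 8, 10, 12, 14, 16, 18, 20,
    ]


test_double_ids()
```

Because tables are ordinary Python classes, the same idea extends naturally to data tests run with a framework such as `pytest`.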