Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
MIT License
Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray
Apache Spark & Python (PySpark) tutorials for Big Data Analysis and Machine Learning as IPython /...
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Data Lakehouse local stack with PySpark, Trino, and MinIO. Includes an example to process Raygun ...
Visualize column-level data lineage in Spark SQL
Python helper library for Jupyter Notebooks
A repository to keep track of all the code that I end up writing for my blog posts.
Type System for Data Analysis in Python
Python 2.7 code and learning notes for Spark 2.1.1
C4E, a JVM-friendly library written in Scala for both local and distributed (Spark) clustering.
A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars cod...
Apache Flink
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orch...
Simple and Distributed Machine Learning
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure...