Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
Suite of tools for deploying and training deep learning models using the JVM
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses
Alluxio, data orchestration for analytics and machine learning in the cloud
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet
Batch processing and real-time train (railway) data analysis to help station masters, refreshing every 20 seconds
Study project for big data (Hadoop, Zookeeper, Kafka, Flink, Spark)
A personal project that builds an end-to-end data pipeline using the 2024 Olympics data
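Daph is a general-purpose, platform-level tool for data synchronization and data processing. It offers both rich data synchronization capabilities and powerful data processing capabilities, covers all data development needs in one place, and can be used to build visual, configuration-driven data synchronization and processing platforms.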
An end-to-end ETL pipeline for analyzing and visualizing Tokyo Olympics 2021 data using Azure tools and Power BI
Welcome to StreamlineDE, an end-to-end data engineering project designed to demonstrate real-time data ingestion, processing, and storage using a modern data engineering stack
Project that completes the creation of an environment for extracting, storing, and processing YouTube data
Quickly set up and simulate a multi-node Spark cluster using Docker and docker-compose
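A simple, easy-to-use, ultra-high-performance big data governance engine built on Spark and Debezium, suited to unified batch and streaming data integration and data analysis scenarios; supports real-time CDC data capture, large-scale data synchronization, data modeling, and OLAP data analysis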
A Java-based project that aims to extract news articles from large
Contributor repository for code samples, plugins and libraries in Java for Apache Pulsar