Apache Spark Ecosystem

Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Created by
Matei Zaharia
Released
May 26, 2014
Community Repos
8,421
Total GitHub Stars
1,087,452
Core Projects
airflow
36,052
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
flink
23,713
Apache Flink
arrow
14,356
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
Popular Projects 

spark

Apache Spark - A unified analytics engine for large-scale data processing

25 Feb 2014 38,255

spark-nlp

State of the Art Natural Language Processing

24 Sep 2017 3,717

SynapseML

Simple and Distributed Machine Learning

05 Jun 2017 4,986

zeppelin

Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more

25 Mar 2015 6,276

redash

Make Your Company Data Driven

28 Oct 2013 25,090

h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc

03 Mar 2014 6,710

delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

22 Apr 2019 7,462

deeplearning4j

Suite of tools for deploying and training deep learning models using the JVM

27 Nov 2013 13,630

TensorFlowOnSpark

TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters

20 Jan 2017 3,872

spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs

14 May 2020 757

fugue

A unified interface for distributed computing

24 Mar 2020 1,901

spark-cassandra-connector

DataStax Connector for Apache Spark to Apache Cassandra

27 Jun 2014 1,931

horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet

09 Aug 2017 14,183

kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses

18 Dec 2017 1,953

alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud

21 Dec 2012 6,814

spark

22 Apr 2019 2,020

apache-spark-internals

The Internals of Apache Spark

31 Aug 2015 1,468

ytsaurus

YTsaurus is a scalable and fault-tolerant open-source big data platform

05 Dec 2022 1,797

adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet

19 Nov 2013 996

mlflow

Open source platform for the machine learning lifecycle

05 Jun 2018 17,453
Up and Coming Projects 

bigdata---train-analysis

Batch processing and real-time train (railway) data analysis to help station masters, refreshing every 20 seconds

17 Sep 2024 0

maven-hocon-extension

Apache Maven

12 Sep 2024 26

mcdd-big-data-study

Study project for big data (Hadoop, Zookeeper, Kafka, Flink, Spark)

11 Sep 2024 2

nifi-api

Apache NiFi API

09 Sep 2024 3

Olympics_data_Project

A personal project that builds an end-to-end data pipeline using the 2024 Olympics data

09 Sep 2024 1

daph

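Daph is a general-purpose, platform-level tool for data synchronization and data processing. It combines rich data synchronization capabilities with powerful data processing, covers all data development needs in one place, and can be used to build a visual, configuration-driven data synchronization and processing platform.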

09 Sep 2024 8

Tokyo-Olympics-2021-Analytics

An end-to-end ETL pipeline for analyzing and visualizing Tokyo Olympics 2021 data using Azure tools and Power BI

08 Sep 2024 0

airflow-spark

If you want to use Airflow with Spark, ready to use ;-)

08 Sep 2024 1

StreamlineDE-

Welcome to StreamlineDE, an end-to-end data engineering project designed to demonstrate real-time data ingestion, processing, and storage using a modern data engineering stack

07 Sep 2024 0

Reddit-Analysis

Sentiment Analysis on streaming data and batch data from Reddit

07 Sep 2024 0

Daily-Youtube-Extraction

A project that builds out a complete environment for extracting, storing, and processing YouTube data

05 Sep 2024 0

ozhera-site

Website sources for the Apache OzHera (Incubating) website

30 Aug 2024 0

spark-cluster-multi-node-setup

Quickly set up and simulate a multi-node Spark cluster using Docker and docker-compose

30 Aug 2024 1

arrow-go

Official Go implementation of Apache Arrow

29 Aug 2024 8

Stark

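An easy-to-use, ultra-high-performance big data governance engine built on Spark and Debezium, suited to unified batch and streaming data integration and analysis; supports real-time CDC data capture, large-scale data synchronization, data modeling, and OLAP analysis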

28 Aug 2024 8

BigBanyanTree

Gathering insights from Common Crawl using Apache Spark and LLMs

25 Aug 2024 2

BigDataETLAndSentimentAnalysis

A Java based project aims to extract news articles from large

22 Aug 2024 0

pulsar-java-contrib

Contributor repository for code samples, plugins and libraries in Java for Apache Pulsar

20 Aug 2024 6

polaris-site

Apache Polaris

19 Aug 2024 0

case-study-accidents

Spark analysis on the accidents-data

19 Aug 2024 0