Apache Spark Ecosystem

Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Created by

Matei Zaharia

Released

May 26, 2014

Community Repos

8,421

Total GitHub Stars

1,087,452

Core Projects

airflow

36,052

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

arrow

14,356

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

Popular Projects

spark

Apache Spark - A unified analytics engine for large-scale data processing

25 Feb 2014 38,255

spark-nlp

State of the Art Natural Language Processing

24 Sep 2017 3,717

SynapseML

Simple and Distributed Machine Learning

05 Jun 2017 4,986

zeppelin

Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more

25 Mar 2015 6,276

redash

Make Your Company Data Driven

28 Oct 2013 25,090

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc

03 Mar 2014 6,710

delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

22 Apr 2019 7,462

deeplearning4j

Suite of tools for deploying and training deep learning models using the JVM

27 Nov 2013 13,630

TensorFlowOnSpark

TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters

20 Jan 2017 3,872

spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs

14 May 2020 757

fugue

A unified interface for distributed computing

24 Mar 2020 1,901

spark-cassandra-connector

DataStax Connector for Apache Spark to Apache Cassandra

27 Jun 2014 1,931

horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet

09 Aug 2017 14,183

kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses

18 Dec 2017 1,953

alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud

21 Dec 2012 6,814

spark

22 Apr 2019 2,020

apache-spark-internals

The Internals of Apache Spark

31 Aug 2015 1,468

ytsaurus

YTsaurus is a scalable and fault-tolerant open-source big data platform

05 Dec 2022 1,797

adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet

19 Nov 2013 996

mlflow

Open source platform for the machine learning lifecycle

05 Jun 2018 17,453


Up and Coming Projects

bigdata---train-analysis

Batch processing and real-time train (railway) data analysis to help station masters, refreshing every 20 seconds

17 Sep 2024 0

maven-hocon-extension

Apache Maven

12 Sep 2024 26

mcdd-big-data-study

Study project for big data (Hadoop, Zookeeper, Kafka, Flink, Spark)

11 Sep 2024 2

nifi-api

Apache NiFi API

09 Sep 2024 3

Olympics_data_Project

A personal project that builds an end-to-end data pipeline using the 2024 Olympics data

09 Sep 2024 1

daph

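Daph is a general-purpose, platform-level tool for data synchronization and data processing. It offers both rich data synchronization capabilities and powerful data processing capabilities, meets all data development needs in one place, and can be used to build visual, configuration-driven data synchronization and data processing platforms.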

09 Sep 2024 8

Tokyo-Olympics-2021-Analytics

An end-to-end ETL pipeline for analyzing and visualizing Tokyo Olympics 2021 data using Azure tools and Power BI

08 Sep 2024 0

airflow-spark

A ready-to-use setup if you want to use Airflow with Spark ;-)

08 Sep 2024 1

StreamlineDE-

Welcome to StreamlineDE, an end-to-end data engineering project designed to demonstrate real-time data ingestion, processing, and storage using a modern data engineering stack

07 Sep 2024 0