If you enjoy DataFlint, please give us a ⭐️ and join our Slack community for feature requests, support, and more!

What is DataFlint?

DataFlint is a modern, user-friendly enhancement for Apache Spark that simplifies performance monitoring and debugging. It adds an intuitive tab to the existing Spark Web UI, transforming a powerful but often overwhelming interface into something easy to navigate and understand.

Why DataFlint?

Intuitive Design: DataFlint's tab in the Spark Web UI presents complex metrics in a clear, easy-to-understand format, making Spark performance accessible to everyone.
Effortless Setup: Install DataFlint in minutes with just a few lines of code or configuration, without making any changes to your existing Spark environment.
For All Skill Levels: Whether you're a seasoned data engineer or just starting with Spark, DataFlint provides valuable insights that help you work more effectively.

With DataFlint, spend less time deciphering the Spark Web UI and more time deriving value from your data. Make big data work better for you, regardless of your role or experience level with Spark.

Usage

After installation, you will see a "DataFlint" tab in the Spark Web UI. Click on it to start using DataFlint.

Features

📈 Real-time query and cluster status
📊 Query breakdown with performance heat map
📋 Application Run Summary
⚠️ Performance alerts and suggestions
👀 Identify query failures
🤖 Spark AI Assistant

See our features docs for more information.

Installation

Scala

Install DataFlint via sbt:

libraryDependencies += "io.dataflint" %% "spark" % "0.2.3"

Then instruct Spark to load the DataFlint plugin:

val spark = SparkSession
    .builder()
    .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
    ...
    .getOrCreate()

PySpark

Add these two configs to your PySpark session builder:

builder = pyspark.sql.SparkSession.builder \
    ...
    .config("spark.jars.packages", "io.dataflint:spark_2.12:0.2.3") \
    .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin") \
    ...

Spark Submit

Alternatively, install DataFlint with no code changes as a Spark Ivy package by adding these two lines to your spark-submit command:

spark-submit \
--packages io.dataflint:spark_2.12:0.2.3 \
--conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
...

Additional installation options

Scala 2.13 is also supported. If your Spark cluster uses Scala 2.13, change the package name to io.dataflint:spark_2.13:0.2.3
For more installation options, including Python and the Kubernetes spark-operator, see the Install on Spark docs
To install DataFlint on the Spark History Server for observability of completed runs, see the install on Spark History Server docs
To install DataFlint on Databricks, see the install on Databricks docs
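
The coordinates used in the installation options differ only in their Scala suffix. As a minimal sketch, the helpers below build the package coordinate and the matching spark-submit arguments. The names `dataflint_package` and `spark_submit_args` are hypothetical and not part of DataFlint. The sketch assumes version 0.2.3 and the plugin class shown in the installation snippets:

```python
# Hypothetical helpers for illustration only; they are not part of DataFlint.
PLUGIN_CLASS = "io.dataflint.spark.SparkDataflintPlugin"
SUPPORTED_SCALA = ("2.12", "2.13")


def dataflint_package(scala_version="2.12", dataflint_version="0.2.3"):
    """Return the Maven coordinate for the given Scala binary version."""
    if scala_version not in SUPPORTED_SCALA:
        raise ValueError(f"unsupported Scala version: {scala_version}")
    return f"io.dataflint:spark_{scala_version}:{dataflint_version}"


def spark_submit_args(scala_version="2.12", dataflint_version="0.2.3"):
    """Return the two spark-submit options that load DataFlint."""
    return [
        "--packages", dataflint_package(scala_version, dataflint_version),
        "--conf", f"spark.plugins={PLUGIN_CLASS}",
    ]


if __name__ == "__main__":
    print(dataflint_package("2.13"))
    print(" ".join(spark_submit_args()))
```

The same coordinate string can be passed as the value of spark.jars.packages in a PySpark session builder.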

How it Works

DataFlint is installed as a plugin on the Spark driver and history server.

The plugin exposes additional HTTP resources that serve metrics not available in the Spark UI, along with a modern single-page web app (SPA) that fetches data from Spark without requiring a page refresh.

For more information, see the how it works docs
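
To illustrate the idea of a client that pulls fresh data over HTTP instead of reloading a page, here is a rough sketch that polls Spark's standard monitoring REST API (`/api/v1/applications`). It does not use DataFlint's additional HTTP resources, whose paths are not described here. The address `http://localhost:4040` assumes a local driver on Spark's default UI port, and the function names are hypothetical:

```python
import json
import time
import urllib.request

# Assumption: a Spark driver UI is reachable here (4040 is Spark's default UI port).
DEFAULT_UI = "http://localhost:4040"


def applications_url(base_url=DEFAULT_UI):
    """URL of the application list in Spark's standard monitoring REST API."""
    return base_url.rstrip("/") + "/api/v1/applications"


def fetch_json(url, timeout=5.0):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def poll(url, interval_seconds=2.0, rounds=3, fetch=fetch_json):
    """Fetch the URL `rounds` times, yielding each response as it arrives."""
    for i in range(rounds):
        yield fetch(url)
        if i < rounds - 1:
            time.sleep(interval_seconds)


if __name__ == "__main__":
    # Requires a running Spark application; otherwise urlopen raises URLError.
    for snapshot in poll(applications_url()):
        print(snapshot)
```

The `fetch` parameter is injectable so the polling loop can be exercised without a running cluster.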

Compatibility Matrix

DataFlint requires Spark version 3.2 or later and supports Scala versions 2.12 and 2.13.

Spark Platforms	DataFlint Realtime	DataFlint History server
Local	✅	✅
Standalone	✅	✅
Kubernetes Spark Operator	✅	✅
EMR	✅	✅
Dataproc	✅	❓
HDInsights	✅	❓
Databricks	✅	❌

For more information, see the supported versions docs
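
The version requirement above can be turned into a small pre-flight check. This is a sketch with hypothetical function names; it encodes only the stated thresholds (Spark 3.2 or later, Scala 2.12 or 2.13) and does not account for the per-platform differences in the table:

```python
# Hypothetical pre-flight check; thresholds come from the requirement stated above.
MIN_SPARK = (3, 2)
SUPPORTED_SCALA = ("2.12", "2.13")


def parse_major_minor(version):
    """Turn a version such as '3.5.1' into (3, 5); later components are ignored."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def is_supported(spark_version, scala_version):
    """True when both versions fall inside the stated support range."""
    spark_ok = parse_major_minor(spark_version) >= MIN_SPARK
    scala_binary = ".".join(scala_version.split(".")[:2])
    return spark_ok and scala_binary in SUPPORTED_SCALA


if __name__ == "__main__":
    print(is_supported("3.5.1", "2.12.18"))
    print(is_supported("3.1.2", "2.12.10"))
```

In PySpark, the running Spark version is available as `spark.version` on the session and can be passed as the first argument.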
