Spark Monitor - An extension for Jupyter Lab

This project was originally written by krishnan-r as a Google Summer of Code project for Jupyter Notebook. Check his website out here.

As a part of my internship as a Software Engineer at Yelp, I created this fork to update the extension to be compatible with JupyterLab - Yelp's choice for sharing and collaborating on notebooks.

About

Requirements

At least JupyterLab 3
pyspark 3.X.X or newer (For compatibility with older pyspark versions, use jupyterlab-sparkmonitor 3.X)

Features

Automatically displays a live monitoring tool below cells that run Spark jobs in a Jupyter notebook
A table of jobs and stages with progressbars
A timeline which shows jobs, stages, and tasks
A graph showing number of active tasks & executor cores vs time
A notebook server extension that proxies the Spark UI and displays it in an iframe popup for more details
For a detailed list of features see the use case notebooks
Support for multiple SparkSessions (default port is 4040)

How it Works

Quick Start

To do a quick test of the extension, run the Docker image below. It has pyspark and several other related packages installed alongside the sparkmonitor extension.

docker run -it -p 8888:8888 itsjafer/sparkmonitor

Setting up the extension

pip install jupyterlab-sparkmonitor # install the extension

# set up ipython profile and add our kernel extension to it
ipython profile create --ipython-dir=.ipython
echo "c.InteractiveShellApp.extensions.append('sparkmonitor.kernelextension')" >>  .ipython/profile_default/ipython_config.py

# run jupyter lab
IPYTHONDIR=.ipython jupyter lab --watch
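
Re-running the echo command above appends the same line to ipython_config.py again. As a rough sketch (not part of the extension), the same step could be done from Python so that the line is only added once. The helper name ensure_kernel_extension is hypothetical; the config line itself is the one from the setup commands above.

```python
from pathlib import Path

# The exact line added by the echo command in the setup steps above.
EXTENSION_LINE = "c.InteractiveShellApp.extensions.append('sparkmonitor.kernelextension')"


def ensure_kernel_extension(config_path, line=EXTENSION_LINE):
    """Append `line` to the IPython config file unless it is already present.

    Hypothetical helper: returns True if the file was modified, False otherwise.
    """
    path = Path(config_path)
    existing = path.read_text() if path.exists() else ""
    if line in existing.splitlines():
        return False  # already configured, nothing to do
    # Start on a fresh line if the file does not end with a newline.
    prefix = "" if existing == "" or existing.endswith("\n") else "\n"
    with path.open("a") as fh:
        fh.write(prefix + line + "\n")
    return True


if __name__ == "__main__":
    # Assumes the profile was created with --ipython-dir=.ipython as above.
    config_file = Path(".ipython/profile_default/ipython_config.py")
    if config_file.parent.is_dir():
        changed = ensure_kernel_extension(config_file)
        print("updated" if changed else "already configured")
    else:
        print("profile directory not found; run 'ipython profile create' first")
```

Calling the helper twice on the same file leaves a single copy of the line, which keeps the profile tidy when the setup is scripted.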

With the extension installed, a SparkConf object called conf will be usable from your notebooks. You can use it as follows:

from pyspark import SparkContext

# start the Spark context using the SparkConf the extension inserted
sc = SparkContext.getOrCreate(conf=conf)

# The monitor should appear below the cell and show 4 jobs
sc.parallelize(range(0,100)).count()
sc.parallelize(range(0,100)).count()
sc.parallelize(range(0,100)).count()
sc.parallelize(range(0,100)).count()

If you already have your own Spark configuration, you need to set two properties: spark.extraListeners to sparkmonitor.listener.JupyterSparkMonitorListener, and spark.driver.extraClassPath to the location of listener.jar inside the installed sparkmonitor Python package (path/to/package/sparkmonitor/listener.jar). For example:

from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .config('spark.extraListeners', 'sparkmonitor.listener.JupyterSparkMonitorListener')\
        .config('spark.driver.extraClassPath', 'venv/lib/python3.7/site-packages/sparkmonitor/listener.jar')\
        .getOrCreate()

# The monitor should appear below the cell and show 4 jobs
spark.sparkContext.parallelize(range(0,100)).count()
spark.sparkContext.parallelize(range(0,100)).count()
spark.sparkContext.parallelize(range(0,100)).count()
spark.sparkContext.parallelize(range(0,100)).count()
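
The jar path in the example above is tied to one particular virtualenv layout. A minimal sketch of resolving it at runtime instead, assuming the package is importable as sparkmonitor and ships listener.jar at its top level, as the path above suggests. The names find_listener_jar and sparkmonitor_options are hypothetical helpers, not part of the extension.

```python
import importlib.util
import os


def find_listener_jar(package_name="sparkmonitor", jar_name="listener.jar"):
    """Return the path to the jar inside the installed package, or None.

    Hypothetical helper: assumes the jar sits directly in the package directory.
    """
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        return None  # package not installed, or not a regular package
    package_dir = list(spec.submodule_search_locations)[0]
    jar_path = os.path.join(package_dir, jar_name)
    return jar_path if os.path.isfile(jar_path) else None


def sparkmonitor_options(jar_path):
    """Build the two Spark properties described above for a given jar path."""
    return {
        "spark.extraListeners": "sparkmonitor.listener.JupyterSparkMonitorListener",
        "spark.driver.extraClassPath": jar_path,
    }


if __name__ == "__main__":
    jar = find_listener_jar()
    if jar is None:
        print("sparkmonitor (or its listener.jar) was not found in this environment")
    else:
        for key, value in sparkmonitor_options(jar).items():
            print(key, "=", value)
```

Each key/value pair in the returned dictionary can then be passed to SparkSession.builder.config(key, value) in place of the literal strings used above.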

Changelog

1.0 - Initial Release
2.0 - Migration to JupyterLab 2, Multiple Spark Sessions, and displaying monitors beneath the correct cell more accurately
3.0 - Migrate to JupyterLab 3 as prebuilt extension
4.0 - pyspark 3.X Compatibility; no longer compatible with PySpark 2.X or under

Development

If you'd like to develop the extension:

make all # Clean the directory, build the extension, and run it locally
