Learning PySpark locally (i.e. without using any cloud service) by following the excellent *Data Analysis with Python and PySpark* by Jonathan Rioux.
From the project root, run:
```shell
pipenv install
```
This will create a virtual environment with all the required dependencies installed.
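Once the environment exists, you can work inside it in either of two ways — a sketch, assuming `pyspark` is listed among the Pipfile dependencies:

```shell
# Spawn a subshell with the virtual environment activated:
pipenv shell

# ...or run a one-off command inside the environment without activating it:
pipenv run pyspark
```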
Although only `pipenv` is required for this setup to run, I strongly recommend having both `pyenv` and `pipenv` installed: `pyenv` manages Python versions while `pipenv` takes care of virtual environments. If you're on Windows, try pyenv-win; `pipenv` should work just fine.
The notebooks were created with Visual Studio Code's Jupyter code cells, which I prefer over standard Jupyter notebooks/labs because of much better git integration.
You can easily convert the code-cell files into Jupyter notebooks with Visual Studio Code: just open a file, right-click, and select *Export Current Python File as Jupyter Notebook*.
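For reference, VS Code's Jupyter code cells are plain `.py` files split into cells by `# %%` markers, so they diff cleanly in git. A minimal sketch of the format (the cell contents here are illustrative, not from the notebooks):

```python
# %% [markdown]
# # Example notebook title
# A `# %% [markdown]` cell renders as markdown after export.

# %%
# A regular code cell: runs like any other Python code.
values = [1, 2, 3]
total = sum(values)
print(total)
```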
The `data` directory contains only the smaller data files. You will have to download the larger ones as per the instructions in the individual notebooks, e.g.:
```python
import os

home_dir = os.environ["HOME"]
DATA_DIRECTORY = os.path.join(home_dir, "Documents", "spark", "data", "backblaze")
```
This works on my Linux machine. You may need to modify the path if you're on Windows.
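A cross-platform variant is to build the path with `pathlib` instead of reading `$HOME` (which is not set on stock Windows). A sketch, assuming the same directory layout as above:

```python
from pathlib import Path

# Path.home() resolves the user's home directory on Linux, macOS,
# and Windows alike, so no os.environ lookup is needed.
DATA_DIRECTORY = Path.home() / "Documents" / "spark" / "data" / "backblaze"
print(DATA_DIRECTORY)
```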