exploring-parquet

Introduction

In this mini-project, I compare storing a table as a CSV file with storing it as Parquet. Parquet is a columnar file format designed for efficient reads, which typically speeds up analytical queries that touch only a subset of columns. When written from Spark, it uses snappy compression by default, which can reduce file size by more than half. The compression adds a slight overhead to the write time of the file. Read my blog article to know more.
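As a rough way to check the size claim after both output folders exist, the sketch below sums the bytes under each folder and reports their ratio. The function names are hypothetical and not part of this repository. Spark also writes small marker and checksum files into each output folder, and those are counted too.

```python
from pathlib import Path


def folder_size_bytes(folder):
    """Sum the sizes of all regular files under `folder`, recursively."""
    return sum(p.stat().st_size for p in Path(folder).rglob("*") if p.is_file())


def size_ratio(csv_folder, parquet_folder):
    """Return Parquet size divided by CSV size.

    A value below 0.5 means the Parquet output is less than half the CSV output.
    """
    csv_bytes = folder_size_bytes(csv_folder)
    if csv_bytes == 0:
        raise ValueError(f"no data found under {csv_folder}")
    return folder_size_bytes(parquet_folder) / csv_bytes
```

Calling size_ratio with the two sink folders used in the procedure gives a single number to compare against the "more than half" figure.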

Procedure

Set up a MySQL Server on your local machine.
Download all Python packages and dependencies mentioned below.
Download the airportdb Database from MySQL's Official Pages.
Extract the files into your preferred directory.
Copy all files with the extension .zst and place them inside a folder called zst_folder.
Run the Python script zst_extractor.py, making appropriate changes to the source and destination folder paths.
Run the tsv_to_csv_converter.py Python script to convert the files to a format readable by MySQL Server. (This step is optional.)
Run the DDL Queries.sql file on the MySQL Server to create the schemas for all the required tables in the airportdb database.
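Enable local file loading by setting the local_infile system variable to ON in the MySQL Server configuration. If the loader connects through mysql-connector-python, also pass allow_local_infile=True when opening the connection.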
Run the Data Loader.ipynb notebook to load all the files into the database.
Query the tables in the database to verify that the data loaded properly.
Run ingestion_pipeline.py twice, once with write_mode='csv' and once with write_mode='parquet', changing the sink folder location for each run.
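The extraction and conversion steps above can be sketched as follows. This is not the code in zst_extractor.py or tsv_to_csv_converter.py. It is a minimal illustration that assumes each .zst file decompresses to one tab-separated text file, and the function names are hypothetical. It does not interpret MySQL dump escapes such as \N for NULL, so it passes those values through unchanged.

```python
import csv
from pathlib import Path


def decompress_zst_folder(src_folder, dst_folder):
    """Decompress every *.zst file in src_folder into dst_folder."""
    import zstandard  # third-party: pip install zstandard

    dst = Path(dst_folder)
    dst.mkdir(parents=True, exist_ok=True)
    dctx = zstandard.ZstdDecompressor()
    written = []
    for src in sorted(Path(src_folder).glob("*.zst")):
        target = dst / src.stem  # "name.tsv.zst" becomes "name.tsv"
        with src.open("rb") as fin, target.open("wb") as fout:
            dctx.copy_stream(fin, fout)  # streams, so large files fit in memory
        written.append(target)
    return written


def tsv_to_csv(tsv_path, csv_path):
    """Rewrite a tab-separated file as comma-separated and return the row count."""
    rows = 0
    with open(tsv_path, newline="", encoding="utf-8") as fin, \
            open(csv_path, "w", newline="", encoding="utf-8") as fout:
        # QUOTE_NONE: treat quote characters in the TSV as ordinary data.
        reader = csv.reader(fin, delimiter="\t", quoting=csv.QUOTE_NONE)
        writer = csv.writer(fout)  # quotes fields that contain commas
        for row in reader:
            writer.writerow(row)
            rows += 1
    return rows
```

The zstandard import sits inside the function so that the conversion helper can be used on its own when the files are already decompressed.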

You can monitor the Spark jobs in the Web UI by opening http://localhost:4050 in your browser. (Spark's default UI port is 4040, so adjust the port if your session reports a different one.)
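For orientation, the ingestion step could look roughly like the sketch below. It is not the contents of ingestion_pipeline.py. Only write_mode and its two values come from this README. The function names, the other parameters, and the choice to set spark.ui.port to 4050 are assumptions made for illustration.

```python
def jdbc_options(host, database, table, user, password, port=3306):
    """Build the option map for Spark's JDBC reader against MySQL."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "com.mysql.cj.jdbc.Driver",  # class shipped in Connector/J
    }


def write_table(df, sink_folder, write_mode):
    """Write a DataFrame to sink_folder as CSV or Parquet."""
    if write_mode not in ("csv", "parquet"):
        raise ValueError(f"unsupported write_mode: {write_mode!r}")
    writer = df.write.mode("overwrite")
    if write_mode == "csv":
        writer.option("header", True).csv(sink_folder)
    else:
        writer.parquet(sink_folder)  # Spark compresses Parquet with snappy by default


def run_pipeline(options, sink_folder, write_mode, jdbc_jar):
    """Read one table over JDBC and write it out in the requested format."""
    from pyspark.sql import SparkSession  # third-party: pip install pyspark

    spark = (
        SparkSession.builder.appName("exploring-parquet")
        .config("spark.jars", jdbc_jar)   # path to the MySQL JDBC driver jar
        .config("spark.ui.port", "4050")  # assumed, to match the URL above
        .getOrCreate()
    )
    try:
        df = spark.read.format("jdbc").options(**options).load()
        write_table(df, sink_folder, write_mode)
    finally:
        spark.stop()
```

Running run_pipeline once per write_mode, each time with a different sink_folder, mirrors the two runs described in the procedure.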

Dependencies

MySQL Database: Ensure you have a MySQL database set up, either locally or hosted elsewhere. You need the hostname (IP address), a username and a password (preferably for a non-root user) with read, create table, and create database privileges on the server.
Zstandard: A Python compression library, used here to decompress the .zst database dump files into TSV files. Installed version: zstandard==0.22.0. Command to install: pip install zstandard
MySQL Connector Python: A Python library to connect to and query the MySQL database. Installed version: mysql-connector-python==9.0.0. Command to install: pip install mysql-connector-python
PySpark: The Python API for Apache Spark (see the official documentation). Installed version: pyspark==3.5.1. Command to install: pip install pyspark
MySQL JDBC Driver: MySQL Connector/J, available from the official MySQL JDBC connector page. Installed version: 9.0.0
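To confirm that an environment matches the Python package versions listed above, a small check along these lines may help. It uses only the standard library. The PINNED mapping copies the versions from this section, and the function names are hypothetical. The JDBC driver is a jar file rather than a Python package, so it is not covered.

```python
from importlib import metadata

# Versions copied from the dependency list above.
PINNED = {
    "zstandard": "0.22.0",
    "mysql-connector-python": "9.0.0",
    "pyspark": "3.5.1",
}


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def version_mismatches(pinned=PINNED):
    """Return {name: (expected, installed)} for every package that differs."""
    installed = installed_versions(pinned)
    return {
        name: (expected, installed[name])
        for name, expected in pinned.items()
        if installed[name] != expected
    }
```

An empty result from version_mismatches() means all three packages are installed at the listed versions. Newer versions may work as well, but they were not the ones used here.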

References

For any questions, comments or suggestions, please feel free to reach out to me via email! 😄📬
