Project to compare write efficiency and memory efficiency of CSV and Parquet files
In this mini-project, I compare the result of storing a table as a CSV file against Parquet. Parquet is well known as a read-efficient, columnar file format that supercharges query performance for analytical use cases. By default, it uses Snappy compression, which reduces file size by more than half; this compression adds a slight overhead to the file's write time. Read my blog article to learn more.
1. Download the `airportdb` database (`.zst` dump from MySQL's official pages) and place it inside a folder called `zst_folder`.
2. Run `zst_extractor.py`, making appropriate changes in the source and destination folder paths.
3. (Optional) Run `tsv_to_csv_converter.py`, a Python script that converts the extracted files into a format readable by the MySQL Server.
4. Run the `DDL Queries.sql` file on the MySQL Server to create the schemas for all the required tables in the `airportdb` database.
5. Set the `allow_local_infile` parameter in the MySQL Server configuration to `true`.
6. Run the `Data Loader.ipynb` notebook to load all the files into the database.
7. Run `ingestion_pipeline.py` twice, once with `write_mode='csv'` and then with `write_mode='parquet'`, changing the sink folder location for each run.
8. You can monitor the Spark jobs on the Web UI by opening http://localhost:4050 in your browser.
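The two ingestion runs above can be sketched as follows. This is a minimal outline under stated assumptions, not the repo's actual `ingestion_pipeline.py`; the host, credentials, function names, and sink paths are placeholders:

```python
# Minimal sketch of the two ingestion runs. Assumes a SparkSession that was
# started with the MySQL JDBC driver on its classpath, e.g.
#   spark-submit --jars mysql-connector-j-9.0.0.jar ingestion_pipeline.py
# Host, credentials, and paths below are placeholders.

def jdbc_options(host: str, user: str, password: str, table: str) -> dict:
    """JDBC read options for one airportdb table."""
    return {
        "url": f"jdbc:mysql://{host}:3306/airportdb",
        "driver": "com.mysql.cj.jdbc.Driver",
        "dbtable": table,
        "user": user,
        "password": password,
    }

def ingest(spark, table: str, write_mode: str, sink_folder: str) -> None:
    """Read one table over JDBC and write it out as CSV or Parquet."""
    df = spark.read.format("jdbc").options(
        **jdbc_options("localhost", "airport_user", "change_me", table)
    ).load()
    out = f"{sink_folder}/{table}"
    if write_mode == "csv":
        df.write.mode("overwrite").option("header", True).csv(out)
    else:
        # Parquet applies snappy compression by default
        df.write.mode("overwrite").parquet(out)
```

Running `ingest` once with `write_mode='csv'` and once with `write_mode='parquet'` (with a different `sink_folder` each time) lets you compare the resulting file sizes on disk and the job timings on the Spark Web UI.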
MySQL Database
Ensure you have a MySQL database set up, either locally or hosted somewhere. Make sure you have the hostname (IP address), username, and password (preferably for a non-root user) with read, create-table, and create-database access on the server.
Zstandard
A Python compression library used to decompress the database's `.zst` dump into TSV files.
Installed version: zstandard=0.22.0
Command to install: pip install zstandard
MySQL Connector Python
A Python library to connect to and query the MySQL database.
Installed version: mysql-connector-python=9.0.0
Command to install: pip install mysql-connector-python
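For context, the bulk load in `Data Loader.ipynb` can be done with a `LOAD DATA LOCAL INFILE` statement, which is why the server's `allow_local_infile` setting must be enabled. The helper below is a hedged sketch, not the notebook's actual code; table names, file paths, and the CSV layout are assumptions:

```python
# Hypothetical helper for loading one CSV into MySQL. `conn` is assumed
# to be a mysql-connector-python connection opened with
# allow_local_infile=True; csv_path and table are placeholders.

def load_data_sql(csv_path: str, table: str) -> str:
    """Build the LOAD DATA statement for one CSV file."""
    return (
        f"LOAD DATA LOCAL INFILE '{csv_path}' "
        f"INTO TABLE {table} "
        "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' "
        "IGNORE 1 LINES"  # skip the header row
    )

def load_table(conn, csv_path: str, table: str) -> None:
    """Execute the bulk load over an open connection."""
    cur = conn.cursor()
    cur.execute(load_data_sql(csv_path, table))
    cur.close()
    conn.commit()
```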
PySpark
Python API for Apache Spark
Official Documentation
Installed Version: pyspark=3.5.1
Command to install: pip install pyspark
MySQL JDBC Driver
Official MySQL JDBC Connector link
Installed Version: 9.0.0
For any questions, comments or suggestions, please feel free to reach out to me via email! 😄📬