A personal project that builds an end-to-end data pipeline using the 2024 Olympics data.
This project focuses on building an end-to-end data pipeline that extracts raw data from Kaggle, processes and transforms it across three structured layers (Bronze, Silver, and Gold) using Apache Spark, and stores it in Hadoop HDFS. The final transformed data is then loaded into a Snowflake data warehouse. The entire pipeline is orchestrated using Apache Airflow. For data visualization and reporting, Apache Superset is used to create interactive and informative dashboards.
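The Bronze/Silver/Gold flow can be sketched in plain Python. This is a hypothetical illustration only (the real pipeline operates on Spark DataFrames stored in HDFS); the function names, columns, and sample rows below are made up for the sketch.

```python
from datetime import date

def to_bronze(raw_rows):
    """Bronze layer: land raw rows as-is, tagging only ingestion metadata."""
    return [dict(row, _ingested_on=str(date.today())) for row in raw_rows]

def to_silver(bronze_rows):
    """Silver layer: clean and standardize (drop rows missing keys, normalize names)."""
    return [
        {**row, "name": row["name"].strip().title()}
        for row in bronze_rows
        if row.get("athlete_id") is not None
    ]

def to_gold(silver_rows):
    """Gold layer: aggregate into report-ready facts (medals per country)."""
    medals = {}
    for row in silver_rows:
        medals[row["country"]] = medals.get(row["country"], 0) + row.get("medals", 0)
    return medals

# Hypothetical sample rows standing in for the Kaggle extract.
raw = [
    {"athlete_id": 1, "name": " some athlete ", "country": "USA", "medals": 4},
    {"athlete_id": None, "name": "bad row", "country": "???", "medals": 0},
    {"athlete_id": 2, "name": "another athlete", "country": "FRA", "medals": 5},
]
gold = to_gold(to_silver(to_bronze(raw)))
print(gold)  # {'USA': 4, 'FRA': 5}
```

Each layer only depends on the output of the previous one, which is what lets Airflow schedule them as separate tasks.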
.
├───airflow              # Airflow folder
│   └───dags             # Directory for DAG files (main.py file)
│       ├───spark_script # Directory for Spark script files for the pipeline
│       └───sql          # Directory for SQL scripts (creating and querying tables)
├───data                 # Data files
├───images               # Images
├───jars                 # JAR files for configuration between Spark and Snowflake
└───notebook             # Notebooks for demoing the data pipeline
Here is the schema, based on the snowflake schema model, which includes 3 fact tables and 8 dimension tables. This schema will be applied in the data warehouse.
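In a snowflake schema, a fact table holds foreign keys that resolve against dimension tables at query time. The sketch below shows that lookup in plain Python with hypothetical table and column names (the actual 3 fact and 8 dimension tables are defined in the SQL scripts):

```python
# Hypothetical dimension table, keyed by surrogate key.
dim_country = {
    1: {"country_code": "USA", "country_name": "United States"},
    2: {"country_code": "FRA", "country_name": "France"},
}

# Hypothetical fact table rows referencing the dimension by key.
fact_medals = [
    {"country_id": 1, "gold": 40},
    {"country_id": 2, "gold": 16},
]

# Equivalent of:
#   SELECT d.country_name, f.gold
#   FROM fact_medals f JOIN dim_country d ON f.country_id = d.country_id;
report = [
    {"country_name": dim_country[f["country_id"]]["country_name"], "gold": f["gold"]}
    for f in fact_medals
]
print(report[0])  # {'country_name': 'United States', 'gold': 40}
```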
Navigate to the notebook file to see a demo of the pipeline running at each stage.
Clone this project by running the following commands:
git clone https://github.com/mjngxwnj/Olympics_data_Project.git
cd Olympics_data_Project
After that, run the following command to start Docker Compose (make sure Docker is installed and running):
docker-compose up
First, go to localhost:8080, then log in to Airflow with username admin and password admin.
Next, navigate to Admin -> Connections, and edit the spark_default connection.
The Spark setting should look like this:
Save the connection, then go back to the DAGs page to prepare for triggering the DAGs.
Go to your Snowflake account, navigate to Databases, and create a database OLYMPICS_DB with two schemas: OLYMPICS_SCHEMA and REPORT.
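If you prefer to create these objects from a Snowflake worksheet instead of the UI, the step amounts to three DDL statements. A minimal sketch, using the names from this README (shown here as plain strings so they can be run with any Snowflake client):

```python
# The DDL this step performs; object names match the ones used in this README.
ddl = [
    "CREATE DATABASE IF NOT EXISTS OLYMPICS_DB;",
    "CREATE SCHEMA IF NOT EXISTS OLYMPICS_DB.OLYMPICS_SCHEMA;",
    "CREATE SCHEMA IF NOT EXISTS OLYMPICS_DB.REPORT;",
]
for stmt in ddl:
    print(stmt)
```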
Go to localhost:9870.
Initially, HDFS will be empty.
To trigger the DAG, click the trigger DAG button in the top right corner. The pipeline will start.
After the DAG runs successfully:
As the Olympics_data DAG runs, the data will be loaded into HDFS.
In each directory listed above, we can see all the tables we loaded.
Then, all tables in the data warehouse will be created.
First, go to localhost:8088 to open Superset, and log in with username admin and password admin.
In the top right corner, navigate to Settings -> Database Connections -> + DATABASE. Under Supported databases, select Other, then enter the connection string in the format: snowflake://{user}:{password}@{account}.{region}/{database}?role={role}&warehouse={warehouse}. Replace the placeholders with your Snowflake information.
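One gotcha when filling in that connection string: special characters in the user or password must be URL-encoded or Superset will fail to parse the URI. A small sketch of assembling it safely (all values below are placeholders, not real credentials):

```python
from urllib.parse import quote_plus

def snowflake_uri(user, password, account, region, database, role, warehouse):
    """Build the SQLAlchemy-style Snowflake URI that Superset expects,
    URL-encoding the user and password so special characters survive."""
    return (
        f"snowflake://{quote_plus(user)}:{quote_plus(password)}"
        f"@{account}.{region}/{database}?role={role}&warehouse={warehouse}"
    )

# Placeholder values only -- substitute your own Snowflake settings.
uri = snowflake_uri("admin", "p@ss word", "myaccount", "us-east-1",
                    "OLYMPICS_DB", "ACCOUNTADMIN", "COMPUTE_WH")
print(uri)
# snowflake://admin:p%40ss+word@myaccount.us-east-1/OLYMPICS_DB?role=ACCOUNTADMIN&warehouse=COMPUTE_WH
```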
Then, you can select tables in Snowflake using SQL Lab and create a dashboard according to your preferences.
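A dashboard chart usually sits on a simple aggregation query from SQL Lab. The table and column names below are hypothetical; the plain-Python version shows the same GROUP BY logic the chart relies on:

```python
from collections import Counter

# Hypothetical rows, as SQL Lab might return them from a medals fact table:
#   SELECT country, medal_type FROM OLYMPICS_DB.REPORT.FACT_MEDALS;
rows = [
    ("USA", "gold"), ("USA", "silver"), ("FRA", "gold"), ("FRA", "gold"),
]

# GROUP BY country, counting gold medals only -- the series a bar chart needs.
gold_by_country = Counter(c for c, m in rows if m == "gold")
print(gold_by_country.most_common())  # [('FRA', 2), ('USA', 1)]
```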
Here is my custom dashboard for visualizing data about the Olympic Games: