Olympics_data_Project

A personal project that builds an end-to-end data pipeline using the 2024 Olympics data.


Data Pipeline for Olympics 2024 Analysis

Description

This project focuses on building an end-to-end data pipeline that extracts raw data from Kaggle, processes and transforms it across three structured layers (Bronze, Silver, and Gold) using Apache Spark, and stores it in Hadoop HDFS. The final transformed data is then loaded into a Snowflake data warehouse. The entire pipeline is orchestrated using Apache Airflow. For data visualization and reporting, Apache Superset is used to create interactive and informative dashboards.
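For orientation, here is a minimal sketch of the Bronze -> Silver -> Gold flow with Spark and HDFS. The HDFS URI, file paths, and column names are illustrative assumptions, not the project's actual code:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("olympics_medallion_demo").getOrCreate()

# Bronze: land the raw Kaggle extract in HDFS as-is (path and file name assumed).
bronze = spark.read.option("header", True).csv("hdfs://namenode:9000/bronze/athletes.csv")

# Silver: fix types and remove duplicate rows (column names assumed).
silver = (bronze
          .dropDuplicates(["athlete_id"])
          .withColumn("birth_date", F.to_date("birth_date")))
silver.write.mode("overwrite").parquet("hdfs://namenode:9000/silver/athletes")

# Gold: aggregate into an analysis-ready table that later lands in the warehouse.
gold = silver.groupBy("country").agg(F.count("*").alias("athlete_count"))
gold.write.mode("overwrite").parquet("hdfs://namenode:9000/gold/athlete_count_by_country")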

Table of contents

Architecture

Overview

Directory tree

.
├───airflow                # Airflow folder
│   └───dags               # DAG files (contains main.py)
│       ├───spark_script   # Spark scripts used by the pipeline
│       └───sql            # SQL scripts for creating and querying tables
├───data                   # Data files
├───images                 # Images
├───jars                   # JAR files needed for Spark to connect to Snowflake
└───notebook               # Notebooks demonstrating the data pipeline
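The dags directory above holds main.py, which wires the pipeline together. As a rough, hedged sketch (the task IDs, script paths, and schedule below are assumptions, only the Olympics_data DAG name comes from this README), it could look like this:

from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="Olympics_data",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # One Spark job per layer, all submitted through the spark_default connection.
    bronze = SparkSubmitOperator(
        task_id="load_bronze",
        application="/opt/airflow/dags/spark_script/bronze.py",  # assumed path
        conn_id="spark_default",
    )
    silver = SparkSubmitOperator(
        task_id="transform_silver",
        application="/opt/airflow/dags/spark_script/silver.py",  # assumed path
        conn_id="spark_default",
    )
    gold = SparkSubmitOperator(
        task_id="build_gold",
        application="/opt/airflow/dags/spark_script/gold.py",    # assumed path
        conn_id="spark_default",
    )
    bronze >> silver >> gold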

Schema

Here is the schema, based on the snowflake schema model, which includes 3 fact tables and 8 dimension tables. This schema is applied in the data warehouse.
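For illustration only, a dimension/fact pair in this style might be created like this. The real project defines 3 fact and 8 dimension tables; the table and column names below are hypothetical, and the connection parameters are placeholders:

import snowflake.connector

conn = snowflake.connector.connect(
    user="<user>",
    password="<password>",
    account="<account>.<region>",
    database="OLYMPICS_DB",
    schema="OLYMPICS_SCHEMA",
)
cur = conn.cursor()

# A hypothetical dimension table and a fact table that references it.
cur.execute("""
    CREATE TABLE IF NOT EXISTS DIM_ATHLETE (
        athlete_key INTEGER PRIMARY KEY,
        name        VARCHAR,
        country     VARCHAR
    )
""")
cur.execute("""
    CREATE TABLE IF NOT EXISTS FACT_MEDAL (
        medal_key   INTEGER PRIMARY KEY,
        athlete_key INTEGER REFERENCES DIM_ATHLETE (athlete_key),
        medal_type  VARCHAR,
        medal_date  DATE
    )
""")
conn.close()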

Prerequisites

Demo notebook

Navigate to the notebook file to see a demo of the pipeline running at each stage.

Set up

Set up Docker

Clone this project by running the following commands:

git clone https://github.com/mjngxwnj/Olympics_data_Project.git
cd Olympics_data_Project

After that, run the following command to start Docker Compose (make sure Docker is installed and running):

docker-compose up

Set up Airflow

First, go to localhost:8080 and log in to Airflow with username admin and password admin. Next, navigate to Admin -> Connections and edit the spark_default connection. The Spark connection settings should look like this:

Save the connection, then go back to the DAGs page to prepare for triggering the DAG.
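If you prefer to script this step instead of clicking through the UI, the sketch below sets the same connection via the Airflow ORM when run inside the Airflow container. The host and port assume a Spark master service named spark-master in docker-compose, so adjust them to your setup:

from airflow.models import Connection
from airflow.utils.session import create_session

with create_session() as session:
    # Fetch the existing spark_default connection, or create it if missing.
    conn = (session.query(Connection)
            .filter(Connection.conn_id == "spark_default")
            .one_or_none())
    if conn is None:
        conn = Connection(conn_id="spark_default")
        session.add(conn)
    conn.conn_type = "spark"
    conn.host = "spark://spark-master"   # assumed docker-compose service name
    conn.port = 7077                     # default Spark master port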

Run pipeline

Go to your Snowflake account, navigate to Databases, and create a database named OLYMPICS_DB with two schemas: OLYMPICS_SCHEMA and REPORT.
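The same setup can be scripted with the Snowflake Python connector, as in this sketch (the credentials and account identifier are placeholders to replace with your own):

import snowflake.connector

conn = snowflake.connector.connect(
    user="<user>",
    password="<password>",
    account="<account>.<region>",
)
cur = conn.cursor()
# Create the database and the two schemas used by the pipeline.
cur.execute("CREATE DATABASE IF NOT EXISTS OLYMPICS_DB")
cur.execute("CREATE SCHEMA IF NOT EXISTS OLYMPICS_DB.OLYMPICS_SCHEMA")
cur.execute("CREATE SCHEMA IF NOT EXISTS OLYMPICS_DB.REPORT")
conn.close()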

Go to localhost:9870 to open the HDFS web UI.

Initially, HDFS will be empty.

To trigger the DAG, click the Trigger DAG button in the top-right corner. The pipeline will start.
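Alternatively, the same run can be started through Airflow's REST API, as sketched below. This assumes the basic-auth API backend is enabled in this Airflow image and reuses the admin/admin credentials from above:

import requests

# Trigger a new run of the Olympics_data DAG via the stable REST API.
resp = requests.post(
    "http://localhost:8080/api/v1/dags/Olympics_data/dagRuns",
    auth=("admin", "admin"),
    json={"conf": {}},
)
print(resp.status_code, resp.json())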

After the DAG runs successfully:

As the Olympics_data DAG runs, the data will be loaded into HDFS.

In each layer directory, you can see all the tables that were loaded.
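One quick way to check this from outside the containers is to list the directories through the WebHDFS endpoint on localhost:9870. The layer directory names below are assumptions, so use whatever paths your Spark scripts actually write to:

from hdfs import InsecureClient  # pip install hdfs

client = InsecureClient("http://localhost:9870")
for layer in ("bronze", "silver", "gold"):   # assumed layer directory names
    print(layer, client.list(f"/{layer}"))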

Then, all the tables in the data warehouse will be created.

Visualization

First, go to localhost:8088 to open Superset and log in with username admin and password admin.

In the top-right corner, navigate to Settings -> Database Connections -> + DATABASE. In the Supported Databases dropdown, select Other, then enter the SQLAlchemy connection string in the format: snowflake://{user}:{password}@{account}.{region}/{database}?role={role}&warehouse={warehouse}. Replace the placeholders with your Snowflake information.
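If it helps, you can assemble and print the URI in Python before pasting it into Superset. Every value below is a placeholder except the database name:

# Placeholder Snowflake credentials and settings; replace with your own.
user, password = "MY_USER", "MY_PASSWORD"
account, region = "ab12345", "us-east-1"
database, role, warehouse = "OLYMPICS_DB", "SYSADMIN", "COMPUTE_WH"

uri = (f"snowflake://{user}:{password}@{account}.{region}/"
       f"{database}?role={role}&warehouse={warehouse}")
print(uri)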

Then, you can query the Snowflake tables using SQL Lab and build a dashboard according to your preferences.

Here is my custom dashboard for visualizing the Olympic Games data:
