Olympics_data_Project

A personal project that builds an end-to-end data pipeline using the 2024 Olympics data.


Data Pipeline for Olympics 2024 Analysis

Description

This project focuses on building an end-to-end data pipeline that extracts raw data from Kaggle, processes and transforms it across three structured layers (Bronze, Silver, and Gold) using Apache Spark, and stores it in Hadoop HDFS. The final transformed data is then loaded into a Snowflake data warehouse. The entire pipeline is orchestrated using Apache Airflow. For data visualization and reporting, Apache Superset is used to create interactive and informative dashboards.
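For orientation, here is a minimal sketch of the Bronze -> Silver -> Gold flow with Spark and HDFS. The HDFS URI, file paths, and column names are illustrative assumptions, not the project's actual code:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("olympics_medallion_demo").getOrCreate()

# Bronze: land the raw Kaggle extract in HDFS as-is (path and file name assumed).
bronze = spark.read.option("header", True).csv("hdfs://namenode:9000/bronze/athletes.csv")

# Silver: fix types and remove duplicate rows (column names assumed).
silver = (bronze
          .dropDuplicates(["athlete_id"])
          .withColumn("birth_date", F.to_date("birth_date")))
silver.write.mode("overwrite").parquet("hdfs://namenode:9000/silver/athletes")

# Gold: aggregate into an analysis-ready table that later lands in the warehouse.
gold = silver.groupBy("country").agg(F.count("*").alias("athlete_count"))
gold.write.mode("overwrite").parquet("hdfs://namenode:9000/gold/athlete_count_by_country")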

Table of contents

Architecture

Overview

Directory tree

.
├───airflow                # Airflow folder
│   └───dags               # DAG files (contains main.py)
│       ├───spark_script   # Spark scripts used by the pipeline
│       └───sql            # SQL scripts for creating and querying tables
├───data                   # Data files
├───images                 # Images
├───jars                   # JAR files needed for Spark to connect to Snowflake
└───notebook               # Notebooks demonstrating the data pipeline
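The dags directory above holds main.py, which wires the pipeline together. As a rough, hedged sketch (the task IDs, script paths, and schedule below are assumptions, only the Olympics_data DAG name comes from this README), it could look like this:

from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="Olympics_data",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # One Spark job per layer, all submitted through the spark_default connection.
    bronze = SparkSubmitOperator(
        task_id="load_bronze",
        application="/opt/airflow/dags/spark_script/bronze.py",  # assumed path
        conn_id="spark_default",
    )
    silver = SparkSubmitOperator(
        task_id="transform_silver",
        application="/opt/airflow/dags/spark_script/silver.py",  # assumed path
        conn_id="spark_default",
    )
    gold = SparkSubmitOperator(
        task_id="build_gold",
        application="/opt/airflow/dags/spark_script/gold.py",    # assumed path
        conn_id="spark_default",
    )
    bronze >> silver >> gold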

Schema

Here is the schema, based on the snowflake schema model, which includes 3 fact tables and 8 dimension tables. This schema is applied in the data warehouse.
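For illustration only, a dimension/fact pair in this style might be created like this. The real project defines 3 fact and 8 dimension tables; the table and column names below are hypothetical, and the connection parameters are placeholders:

import snowflake.connector

conn = snowflake.connector.connect(
    user="<user>",
    password="<password>",
    account="<account>.<region>",
    database="OLYMPICS_DB",
    schema="OLYMPICS_SCHEMA",
)
cur = conn.cursor()

# A hypothetical dimension table and a fact table that references it.
cur.execute("""
    CREATE TABLE IF NOT EXISTS DIM_ATHLETE (
        athlete_key INTEGER PRIMARY KEY,
        name        VARCHAR,
        country     VARCHAR
    )
""")
cur.execute("""
    CREATE TABLE IF NOT EXISTS FACT_MEDAL (
        medal_key   INTEGER PRIMARY KEY,
        athlete_key INTEGER REFERENCES DIM_ATHLETE (athlete_key),
        medal_type  VARCHAR,
        medal_date  DATE
    )
""")
conn.close()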

Prerequisites

Demo notebook

Navigate to the notebook file to see a demo of the pipeline running at each stage.

Set up

Set up Docker

Clone this project by running the following commands:

git clone https://github.com/mjngxwnj/Olympics_data_Project.git
cd Olympics_data_Project

After that, run the following command to start Docker Compose (make sure Docker is installed and running):

docker-compose up

Set up Airflow

First, go to localhost:8080 and log in to Airflow with username admin and password admin. Next, navigate to Admin -> Connections and edit the spark_default connection. The Spark connection settings should look like this:

Save the connection, then go back to the DAGs page to prepare for triggering the DAG.
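If you prefer to script this step instead of clicking through the UI, the sketch below sets the same connection via the Airflow ORM when run inside the Airflow container. The host and port assume a Spark master service named spark-master in docker-compose, so adjust them to your setup:

from airflow.models import Connection
from airflow.utils.session import create_session

with create_session() as session:
    # Fetch the existing spark_default connection, or create it if missing.
    conn = (session.query(Connection)
            .filter(Connection.conn_id == "spark_default")
            .one_or_none())
    if conn is None:
        conn = Connection(conn_id="spark_default")
        session.add(conn)
    conn.conn_type = "spark"
    conn.host = "spark://spark-master"   # assumed docker-compose service name
    conn.port = 7077                     # default Spark master port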

Run pipeline

Go to your Snowflake account, navigate to Databases, and create a database named OLYMPICS_DB with two schemas: OLYMPICS_SCHEMA and REPORT.
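The same setup can be scripted with the Snowflake Python connector, as in this sketch (the credentials and account identifier are placeholders to replace with your own):

import snowflake.connector

conn = snowflake.connector.connect(
    user="<user>",
    password="<password>",
    account="<account>.<region>",
)
cur = conn.cursor()
# Create the database and the two schemas used by the pipeline.
cur.execute("CREATE DATABASE IF NOT EXISTS OLYMPICS_DB")
cur.execute("CREATE SCHEMA IF NOT EXISTS OLYMPICS_DB.OLYMPICS_SCHEMA")
cur.execute("CREATE SCHEMA IF NOT EXISTS OLYMPICS_DB.REPORT")
conn.close()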

Go to localhost:9870 to open the HDFS web UI.

Initially, HDFS will be empty.

To trigger the DAG, click the Trigger DAG button in the top-right corner. The pipeline will start.
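Alternatively, the same run can be started through Airflow's REST API, as sketched below. This assumes the basic-auth API backend is enabled in this Airflow image and reuses the admin/admin credentials from above:

import requests

# Trigger a new run of the Olympics_data DAG via the stable REST API.
resp = requests.post(
    "http://localhost:8080/api/v1/dags/Olympics_data/dagRuns",
    auth=("admin", "admin"),
    json={"conf": {}},
)
print(resp.status_code, resp.json())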

After the DAG runs successfully:

As the Olympics_data DAG runs, the data will be loaded into HDFS.

In each layer directory, you can see all the tables that were loaded.
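One quick way to check this from outside the containers is to list the directories through the WebHDFS endpoint on localhost:9870. The layer directory names below are assumptions, so use whatever paths your Spark scripts actually write to:

from hdfs import InsecureClient  # pip install hdfs

client = InsecureClient("http://localhost:9870")
for layer in ("bronze", "silver", "gold"):   # assumed layer directory names
    print(layer, client.list(f"/{layer}"))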

Then, all the tables in the data warehouse will be created.

Visualization

First, go to localhost:8088 to open Superset and log in with username admin and password admin.

In the top-right corner, navigate to Settings -> Database Connections -> + DATABASE. In the Supported Databases dropdown, select Other, then enter the SQLAlchemy connection string in the format: snowflake://{user}:{password}@{account}.{region}/{database}?role={role}&warehouse={warehouse}. Replace the placeholders with your Snowflake information.
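If it helps, you can assemble and print the URI in Python before pasting it into Superset. Every value below is a placeholder except the database name:

# Placeholder Snowflake credentials and settings; replace with your own.
user, password = "MY_USER", "MY_PASSWORD"
account, region = "ab12345", "us-east-1"
database, role, warehouse = "OLYMPICS_DB", "SYSADMIN", "COMPUTE_WH"

uri = (f"snowflake://{user}:{password}@{account}.{region}/"
       f"{database}?role={role}&warehouse={warehouse}")
print(uri)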

Then, you can query the Snowflake tables using SQL Lab and build a dashboard according to your preferences.

Here is my custom dashboard for visualizing the Olympic Games data:
