Udacity-Data-Pipeline-with-Airflow

Udacity Data Engineering Nanodegree Program, Data Pipeline with Airflow project using MinIO and Postgresql.

Project Structure:

  • dags: Directory containing Airflow DAG scripts.

    • create_tables_and_buckets.py: DAG for creating tables on PostgreSQL and buckets on MinIO.
    • create_tables.sql: SQL script containing CREATE TABLE queries.
    • pipeline_dag.py: Main DAG script for the ETL data pipeline.
  • data: Directory for storing project source data.

    • log_data: Subdirectory for log data.
    • song_data: Subdirectory for song data.
  • plugins: Directory for custom Airflow plugins.

    • operators: Subdirectory for operator scripts.

      • __init__.py: Initialization script for the operators package.
      • stage_postgresql_operator.py: Operator script for copying data from MinIO to PostgreSQL.
      • load_fact_operator.py: Operator script for executing INSERT queries into the fact table.
      • load_dimension_operator.py: Operator script for executing INSERT queries into dimension tables.
      • data_quality_operator.py: Operator script for performing data quality checks (see the illustrative sketch after this list).
    • helpers: Subdirectory for helper scripts.

      • __init__.py: Initialization script for the helpers package.
      • sql_queries.py: Script containing SQL queries for building dimensional tables.
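
As an illustration of how the custom operators under plugins are structured, below is a minimal, hypothetical sketch of a data quality operator. The class name, parameters, and check logic are assumptions for illustration only, not the project's actual code.

from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class DataQualityOperator(BaseOperator):
    # Hypothetical sketch: run COUNT(*) on each table and fail the task
    # if any table is empty. Parameter names are illustrative assumptions.
    def __init__(self, postgres_conn_id="postgres_conn", tables=None, **kwargs):
        super().__init__(**kwargs)
        self.postgres_conn_id = postgres_conn_id
        self.tables = tables or []

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)
        for table in self.tables:
            records = hook.get_records(f"SELECT COUNT(*) FROM {table}")
            if not records or records[0][0] < 1:
                raise ValueError(f"Data quality check failed: {table} is empty")
            self.log.info("Data quality check passed: %s has %s rows", table, records[0][0])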

Airflow DAG Overview

Staging & Fact Table Schema

Dimension Tables Schema

How to Run

Clone the project repository

$ git clone https://github.com/akarce/Udacity-Data-Pipeline-with-Airflow

Change directory

$ cd Udacity-Data-Pipeline-with-Airflow

Build the Docker images and start Airflow:

$ docker-compose up airflow-init

$ docker compose up -d

Create a database called sparkifydb. First open a psql session inside the Postgres container:

$ docker exec -it postgresdev psql -U postgres_user -d postgres_db

Then, at the psql prompt, run:

CREATE DATABASE sparkifydb;

WebUI links:

Airflow webserver http://localhost:8080/

MinIO http://localhost:9001/

Sign in to Airflow webserver:

Username: airflow
Password: airflow

Create Postgres connection:

Go to admin -> Connections -> Add a new record

Connection Id: postgres_conn
Connection Type: Postgres
Host: 172.18.0.1
Database: sparkifydb
Login: postgres_user
Password: postgres_password
Port: 5432
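
One way to confirm the connection is reachable is a quick ad-hoc check from a Python shell inside an Airflow container (a minimal sketch, assuming the Postgres provider is installed and the connection id postgres_conn matches the record created above):

from airflow.providers.postgres.hooks.postgres import PostgresHook

# Ad-hoc connectivity check using the connection created above.
hook = PostgresHook(postgres_conn_id="postgres_conn")
print(hook.get_first("SELECT version();"))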

Obtain Access Key from MinIO:

Go to the MinIO WebUI at http://localhost:9001/. From the Access Keys section -> Create access key -> Create. Store your Access Key and Secret Key using Download for Import.

Create MinIO connection:

Go to admin -> Connections -> Add a new record

Connection Id: minio_conn
Connection Type: Amazon Web Services
AWS Access Key ID: <your_key_id>
AWS Secret Access Key: <your_secret_key>
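
The custom operators reach MinIO through this connection via the S3 API. Depending on how the plugin operators are written, the MinIO endpoint (for example http://minio:9000) may also need to be supplied, often in the connection's Extra field; the exact key can vary by provider version. A hedged sketch of using the connection, where the bucket name comes from this project and the endpoint configuration is an assumption:

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# Hedged sketch: list objects in the udacity-dend bucket through the
# minio_conn connection. Assumes the MinIO endpoint is configured on the
# connection (e.g. in its Extra field) so the S3 client targets MinIO
# instead of AWS.
hook = S3Hook(aws_conn_id="minio_conn")
for key in hook.list_keys(bucket_name="udacity-dend", prefix="log_data/"):
    print(key)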

Trigger the create_tables_and_buckets_dag using the Airflow WebUI

This will create two buckets, named udacity-dend and processed, and 7 Postgres tables, named artists, songplays, songs, staging_events, staging_songs, times, and users.
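
For reference, a DAG like create_tables_and_buckets.py could be composed from stock providers roughly as follows. This is a hedged sketch, not the project's actual implementation; operator choices, the schedule argument, and the SQL file path are assumptions and may need adjusting for your Airflow version.

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.s3 import S3CreateBucketOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="create_tables_and_buckets",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Create the two MinIO buckets through the minio_conn connection.
    create_buckets = [
        S3CreateBucketOperator(
            task_id=f"create_{name.replace('-', '_')}_bucket",
            bucket_name=name,
            aws_conn_id="minio_conn",
        )
        for name in ("udacity-dend", "processed")
    ]

    # Run the CREATE TABLE statements against sparkifydb.
    create_tables = PostgresOperator(
        task_id="create_tables",
        postgres_conn_id="postgres_conn",
        sql="create_tables.sql",  # assumed path relative to the DAG folder
    )

    create_buckets >> create_tables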

To see the schema of a table, run the following at the psql prompt:

\d <table_name>

Upload the log_data and song_data folders from the data directory into the udacity-dend bucket using the MinIO WebUI.

Username: minioadmin
Password: minioadmin

Finally, unpause the DAG named pipeline.

After the DAG finishes successfully, you can check the logs for data quality messages, and all source objects will be moved to the processed bucket.

You can see the Airflow logs for data quality results:
