An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.
This project serves as a comprehensive guide to building such a pipeline, walking through each stage from data ingestion through processing to storage. Every component runs in its own Docker container, which keeps the stack easy to deploy and scale.
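To make the ingestion stage concrete, here is a minimal sketch of fetching one record from the randomuser.me API and flattening it into a row suitable for downstream streaming and storage. The helper names and target column names are illustrative assumptions, not the project's actual code; only the randomuser.me response fields (`results[0].name`, `email`, `location`) come from that API's documented schema.

```python
import json
import urllib.request


def flatten_user(raw: dict) -> dict:
    """Flatten one randomuser.me result into a row.

    The input keys mirror the randomuser.me response schema; the
    output column names are illustrative, not the project's schema.
    """
    return {
        "first_name": raw["name"]["first"],
        "last_name": raw["name"]["last"],
        "email": raw["email"],
        "city": raw["location"]["city"],
        "country": raw["location"]["country"],
    }


def fetch_user(url: str = "https://randomuser.me/api/") -> dict:
    """Fetch a single random user (live network call)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.loads(resp.read())
    return flatten_user(payload["results"][0])


if __name__ == "__main__":
    # Offline example shaped like a real randomuser.me response,
    # so the transform can be exercised without a network call.
    sample = {
        "name": {"first": "Ada", "last": "Lovelace"},
        "email": "ada@example.com",
        "location": {"city": "London", "country": "United Kingdom"},
    }
    print(flatten_user(sample))
```

In the full pipeline, a row like this would be produced to a Kafka topic rather than printed.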
The project is designed with the following components:

- randomuser.me API: generates random user data to feed the pipeline.
- Apache Airflow: orchestrates the pipeline and schedules data ingestion.
- Apache Kafka and Apache Zookeeper: stream the ingested data to the processing layer.
- Apache Spark: processes the streamed data.
- Cassandra: stores the processed data.

To get started, clone the repository:
git clone https://github.com/vishalbansal28/End-to-end-realtime-data-streaming.git
Navigate to the project directory:
cd End-to-end-realtime-data-streaming
Run Docker Compose to spin up the services:
docker-compose up
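The repository ships its own `docker-compose.yml`; the excerpt below is only an illustrative sketch of what such a file typically contains for the Zookeeper, Kafka, and Cassandra services (image tags, ports, and service names are common defaults, not confirmed contents of this project; the Airflow and Spark services are omitted for brevity).

```yaml
# Illustrative sketch only -- not the project's actual docker-compose.yml.
version: "3"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  broker:
    image: confluentinc/cp-kafka:7.4.0
    depends_on:
      - zookeeper
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://broker:9092

  cassandra:
    image: cassandra:4.1
    ports:
      - "9042:9042"
```

Once `docker-compose up` reports the services healthy, Kafka is reachable inside the Compose network at `broker:9092` and Cassandra on port 9042.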