An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.
This project serves as a comprehensive guide to building such a pipeline, walking through each stage from data ingestion through processing to storage. Every component runs in its own Docker container, which keeps the stack easy to deploy and scale.
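To make the ingestion stage concrete, here is a minimal sketch of fetching one record from the randomuser.me API and flattening it into a row suitable for downstream streaming and storage. The helper names and target column names are illustrative assumptions, not the project's actual code; only the randomuser.me response fields (`results[0].name`, `email`, `location`) come from that API's documented schema.

```python
import json
import urllib.request


def flatten_user(raw: dict) -> dict:
    """Flatten one randomuser.me result into a row.

    The input keys mirror the randomuser.me response schema; the
    output column names are illustrative, not the project's schema.
    """
    return {
        "first_name": raw["name"]["first"],
        "last_name": raw["name"]["last"],
        "email": raw["email"],
        "city": raw["location"]["city"],
        "country": raw["location"]["country"],
    }


def fetch_user(url: str = "https://randomuser.me/api/") -> dict:
    """Fetch a single random user (live network call)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.loads(resp.read())
    return flatten_user(payload["results"][0])


if __name__ == "__main__":
    # Offline example shaped like a real randomuser.me response,
    # so the transform can be exercised without a network call.
    sample = {
        "name": {"first": "Ada", "last": "Lovelace"},
        "email": "ada@example.com",
        "location": {"city": "London", "country": "United Kingdom"},
    }
    print(flatten_user(sample))
```

In the full pipeline, a row like this would be produced to a Kafka topic rather than printed.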
The project is designed with the following components:

- randomuser.me API: generates random user data to feed the pipeline.
- Apache Airflow: orchestrates the pipeline and schedules data ingestion.
- Apache Kafka and Apache Zookeeper: stream the ingested data to the processing layer.
- Apache Spark: processes the streamed data.
- Cassandra: stores the processed data.

To get started, clone the repository:
git clone https://github.com/vishalbansal28/End-to-end-realtime-data-streaming.git
Navigate to the project directory:
cd End-to-end-realtime-data-streaming
Run Docker Compose to spin up the services:
docker-compose up
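The repository ships its own `docker-compose.yml`; the excerpt below is only an illustrative sketch of what such a file typically contains for the Zookeeper, Kafka, and Cassandra services (image tags, ports, and service names are common defaults, not confirmed contents of this project; the Airflow and Spark services are omitted for brevity).

```yaml
# Illustrative sketch only -- not the project's actual docker-compose.yml.
version: "3"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  broker:
    image: confluentinc/cp-kafka:7.4.0
    depends_on:
      - zookeeper
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://broker:9092

  cassandra:
    image: cassandra:4.1
    ports:
      - "9042:9042"
```

Once `docker-compose up` reports the services healthy, Kafka is reachable inside the Compose network at `broker:9092` and Cassandra on port 9042.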