End-to-end data pipeline that ingests, processes, and stores data. Apache Airflow schedules a script that fetches random user profiles from an API and publishes them to Kafka topics; Spark Structured Streaming consumes the stream and writes the processed records to Cassandra. The pipeline is built with Python, uses Apache Zookeeper for Kafka coordination, and is containerized with Docker for easy deployment and scalability.
Data Source: the randomuser.me API generates user data.
Apache Airflow: orchestrates the pipeline and schedules data ingestion.
Apache Kafka & Zookeeper: stream data from the producer to Spark.
Apache Spark: processes the data in real time.
Cassandra: stores the processed data.

Scripts:
kafka_stream.py: Airflow DAG script that pushes API data to Kafka, producing one message per second for 2 minutes.
spark_stream.py: consumes and processes data from Kafka using Spark Structured Streaming.
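The core of kafka_stream.py is the step that flattens one randomuser.me response into a Kafka-ready record. A minimal sketch of that transformation, assuming the standard randomuser.me response format (the output key names are assumptions, not copied from the script):

```python
"""Sketch of the record shaping kafka_stream.py presumably performs.
Input field names follow the randomuser.me response format; output
key names are assumptions."""
import uuid


def format_data(res: dict) -> dict:
    """Flatten one randomuser.me result into a flat dict for Kafka."""
    location = res["location"]
    return {
        "id": str(uuid.uuid4()),  # synthetic primary key for Cassandra
        "first_name": res["name"]["first"],
        "last_name": res["name"]["last"],
        "gender": res["gender"],
        "address": f"{location['street']['number']} {location['street']['name']}, "
                   f"{location['city']}, {location['state']}, {location['country']}",
        "post_code": location["postcode"],
        "email": res["email"],
        "username": res["login"]["username"],
        "dob": res["dob"]["date"],
        "registered_date": res["registered"]["date"],
        "phone": res["phone"],
        "picture": res["picture"]["medium"],
    }

# The DAG task would then json.dumps() each record and produce it to the
# topic once per second for 2 minutes (topic name not shown here).
```

The UUID gives each record a unique key even though the API returns random, possibly repeating profiles.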
Setting up and orchestrating pipelines with Apache Airflow.
Real-time data streaming with Apache Kafka.
Synchronization with Apache Zookeeper.
Data processing with Apache Spark.
Storage solutions with Cassandra and PostgreSQL.
Containerization of the entire setup using Docker.

Technologies: Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, Cassandra, PostgreSQL, Docker
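The docker-compose.yml in the repo wires these components together. A rough sketch of the service layout it would declare (the spark-master and cassandra names match the commands used later in this README; images and the remaining service names are assumptions):

```yaml
services:
  postgres:            # Airflow metadata database
    image: postgres:14
  zookeeper:           # Kafka coordination
    image: confluentinc/cp-zookeeper:7.4.0
  broker:              # Kafka broker
    image: confluentinc/cp-kafka:7.4.0
    depends_on: [zookeeper]
  kafka-ui:            # "UI for Apache Kafka" web UI
    image: provectuslabs/kafka-ui:latest
    ports: ["8085:8080"]
  airflow-webserver:
    image: apache/airflow:2.9.0
    ports: ["8080:8080"]
  spark-master:
    image: bitnami/spark:3.5
  cassandra:
    image: cassandra:4.1
    ports: ["9042:9042"]
```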
Airflow: http://localhost:8080/
Kafka UI: http://localhost:8085/
$ git clone https://github.com/akarce/e2e-structured-streaming.git
$ cd e2e-structured-streaming
$ echo -e "AIRFLOW_UID=$(id -u)" > .env   # on Linux, use your host UID
$ echo AIRFLOW_UID=50000 > .env           # on macOS/Windows, use the default UID instead
$ docker compose up airflow-init
$ docker compose up -d
$ docker cp dependencies.zip spark-master:/dependencies.zip
$ docker cp spark_stream.py spark-master:/spark_stream.py
$ docker exec -it cassandra cqlsh -u cassandra -p cassandra localhost 9042
cqlsh> DESCRIBE KEYSPACES;
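Initially DESCRIBE KEYSPACES lists only the system keyspaces; once spark_stream.py has run, a spark_streaming keyspace appears. The DDL the job creates plausibly looks like the following (column names are assumptions matching the message fields, not copied from the script):

```sql
CREATE KEYSPACE IF NOT EXISTS spark_streaming
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};

CREATE TABLE IF NOT EXISTS spark_streaming.created_users (
    id UUID PRIMARY KEY,
    first_name TEXT,
    last_name TEXT,
    gender TEXT,
    address TEXT,
    post_code TEXT,
    email TEXT,
    username TEXT,
    dob TEXT,
    registered_date TEXT,
    phone TEXT,
    picture TEXT
);
```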
Go to the Airflow UI at http://localhost:8080/
Login with username: admin, password: admin
You can track topic creation and the message queue with UI for Apache Kafka, an open-source tool running as a container: http://localhost:8085/
The message schema looks like this:
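The schema itself isn't reproduced here, but given the randomuser.me fields, a message on the topic plausibly looks like this (all values illustrative):

```json
{
  "id": "a3b1c2d4-0000-0000-0000-000000000000",
  "first_name": "Ada",
  "last_name": "Lovelace",
  "gender": "female",
  "address": "10 Downing St, London, England, United Kingdom",
  "post_code": "SW1A",
  "email": "ada@example.com",
  "username": "ada123",
  "dob": "1815-12-10T00:00:00.000Z",
  "registered_date": "2020-01-01T00:00:00.000Z",
  "phone": "555-0100",
  "picture": "https://example.com/ada.jpg"
}
```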
$ docker exec -it spark-master spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.12:3.5.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1 --py-files /dependencies.zip /spark_stream.py
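The spark-submit command pulls in the Cassandra connector and the Kafka source, which suggests the shape of spark_stream.py: read the topic as a stream, parse each JSON value, and append rows to Cassandra. A sketch under those assumptions (topic name, column names, and hosts are guesses; the job is launched with spark-submit as above, not run directly):

```python
"""Sketch of a Structured Streaming job: Kafka in, Cassandra out.
Topic name, column names, and hosts are assumptions."""

# Columns assumed for spark_streaming.created_users.
COLUMNS = ["id", "first_name", "last_name", "gender", "address", "post_code",
           "email", "username", "dob", "registered_date", "phone", "picture"]


def run_stream():
    # pyspark imports stay inside the function: this file is submitted
    # via spark-submit, not executed directly.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType

    spark = (SparkSession.builder
             .appName("SparkDataStreaming")
             .config("spark.cassandra.connection.host", "cassandra")
             .getOrCreate())

    # Treat every field as a string for simplicity.
    schema = StructType([StructField(c, StringType(), True) for c in COLUMNS])

    kafka_df = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:29092")  # assumed listener
                .option("subscribe", "users_created")               # assumed topic
                .option("startingOffsets", "earliest")
                .load())

    rows = (kafka_df.selectExpr("CAST(value AS STRING)")
            .select(from_json(col("value"), schema).alias("data"))
            .select("data.*"))

    # The Cassandra connector's streaming sink requires a checkpoint location.
    (rows.writeStream
     .format("org.apache.spark.sql.cassandra")
     .option("checkpointLocation", "/tmp/checkpoint")
     .option("keyspace", "spark_streaming")
     .option("table", "created_users")
     .start()
     .awaitTermination())
```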
cqlsh> SELECT * FROM spark_streaming.created_users;
cqlsh> SELECT count(*) FROM spark_streaming.created_users;