An example data pipeline to count Twitter hashtags, built with Docker-Compose, Kafka, PySpark and Cassandra.

Prerequisites

Install Docker and Docker-Compose
Copy config.sample.py to config.py and fill in your access tokens from the Twitter Developer API
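
The field names that config.py expects are defined in config.sample.py, which is not reproduced here. As a rough sketch only, assuming hypothetical key names modelled on the four credentials Twitter issues, a check like the following could catch tokens that were left blank before the services are started:

```python
# Hypothetical key names for illustration; the real ones are in config.sample.py.
REQUIRED_KEYS = (
    "consumer_key",
    "consumer_secret",
    "access_token",
    "access_token_secret",
)


def missing_tokens(settings):
    """Return the required keys that are absent or empty in a settings dict."""
    return [
        key
        for key in REQUIRED_KEYS
        if not str(settings.get(key) or "").strip()
    ]


if __name__ == "__main__":
    # Example settings with one empty value and one key missing entirely.
    example = {"consumer_key": "abc", "consumer_secret": "", "access_token": "xyz"}
    print("Missing tokens:", missing_tokens(example))
```

An empty list means every assumed key has a non-blank value.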

Setup

Create the network:

docker network create kafka-network

Start Kafka:

docker-compose -f docker-compose.kafka.yml up -d

Start the rest of the services and start pushing Tweets:

docker-compose build && docker-compose up

Launch CQLSH in the Cassandra container and run the statements in cassandra.cql to create the table:

docker-compose exec cassandra cqlsh

Start Spark stream processing:

docker-compose exec sparksubmit bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0,anguenot:pyspark-cassandra:0.10.1,commons-configuration:commons-configuration:1.6 --conf spark.cassandra.connection.host=cassandra code/process.py
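
The contents of code/process.py are not shown in this README. To illustrate the kind of work the streaming job does, here is a minimal sketch of hashtag counting in plain Python, without Spark or Kafka. The function name, the regular expression, and the choice to lowercase tags are all assumptions made for illustration, not the project's actual logic.

```python
import re
from collections import Counter

# Assumed hashtag pattern: '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def count_hashtags(tweets):
    """Count hashtags across an iterable of tweet texts, ignoring case."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts


if __name__ == "__main__":
    sample = ["Loving #Kafka and #Spark", "#kafka rocks", "no tags here"]
    for tag, total in count_hashtags(sample).most_common():
        print(tag, total)
```

In a streaming job the same extract-and-count step would typically run over each micro-batch, with the resulting counts written to the Cassandra table.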

Open CQLSH again:

docker-compose exec cassandra cqlsh

Finally, check the table content:

SELECT * FROM MYKS.TEST;
