This project integrates real-time data processing and analytics using Apache NiFi, Kafka, Spark, Hive, and AWS services for comprehensive COVID-19 data insights.
MIT License
WARNING: This project only runs on ARM64 chips.
The Covid Data Process project aims to design and implement a comprehensive, real-time data processing pipeline specifically tailored to manage the continuous influx of COVID-19 data. This project seeks to build a scalable and efficient system capable of ingesting, processing, storing, and visualizing COVID-19 data in real time, enabling stakeholders to make informed decisions based on the most current information available.
The dataset, sourced from the COVID-19 API, offers comprehensive and regularly updated reports on the global spread and impact of COVID-19. It encompasses a wide range of data points that track the pandemic's progression across various regions, including countries, states, and provinces. This dataset captures essential metrics such as the number of confirmed cases, deaths, recoveries, and active cases, providing a granular view of the pandemic's evolution over time.
Example of the data format returned by the API:
{
  "data": [
    {
      "date": "2023-03-09",
      "confirmed": 209451,
      "deaths": 7896,
      "recovered": 0,
      "confirmed_diff": 0,
      "deaths_diff": 0,
      "recovered_diff": 0,
      "last_update": "2023-03-10 04:21:03",
      "active": 201555,
      "active_diff": 0,
      "fatality_rate": 0.0377,
      "region": {
        "iso": "AFG",
        "name": "Afghanistan",
        "province": "",
        "lat": "33.9391",
        "long": "67.7100",
        "cities": []
      }
    },
    ...
  ]
}
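For reference, a record like the one above can be retrieved directly over HTTP. The sketch below assumes the API exposes a reports endpoint accepting iso and date query parameters (inferred from the sample record, not confirmed against the API docs), and uses python3 -m json.tool purely for pretty-printing:
curl -s "https://covid-api.com/api/reports?iso=AFG&date=2023-03-09" | python3 -m json.tool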
The system architecture for this COVID-19 data processing pipeline is designed to ensure efficient data ingestion, processing, storage, and visualization. It leverages a combination of open-source technologies and cloud services to provide a scalable, robust, and flexible framework for managing and analyzing large volumes of real-time data.
The system is divided into several components, each responsible for specific tasks within the pipeline:
Data Source: COVID-19 data is retrieved from the API https://covid-api.com/api/ using HTTP requests. NiFi handles data ingestion, performing initial cleaning and transformation to prepare the data for further processing.
Producer/Consumer: NiFi acts as both producer and consumer, forwarding processed data to Apache Kafka for streaming.
Message Brokering: Kafka serves as the message broker, streaming data between system components in real time.
Monitoring: Redpanda monitors Kafka's performance, ensuring system stability.
Streaming Analytics: Spark Streaming processes the data in real time, performing computations like aggregations and filtering as data flows through Kafka.
Distributed Storage: Data is stored in Hadoop HDFS, providing scalable, reliable storage.
Data Warehousing: Apache Hive on HDFS enables efficient querying of large datasets.
Job Scheduling: Airflow orchestrates and schedules the system's workflows, ensuring smooth execution of data ingestion, processing, and storage tasks.
Batch Processing: Apache Spark runs batch jobs on data stored in HDFS, facilitating complex data analysis tasks.
Docker containers ensure consistent environments across development, testing, and production, and are deployed on AWS EC2 for scalability. Amazon QuickSight visualizes the processed data, allowing for the creation of interactive dashboards and reports.
The technology stack:
Amazon EC2: Hosts the system in a scalable and flexible cloud environment.
Docker: Containerizes the system components, ensuring consistency and easy deployment.
Apache NiFi: Handles data ingestion and initial processing from the COVID-19 API.
Apache Kafka: Enables real-time data streaming between system components.
Redpanda: Monitors Kafka to ensure stable data flow and system performance.
Apache Spark: Handles both real-time and batch data processing.
Hadoop HDFS: Provides distributed storage for large volumes of processed data.
Apache Hive: Allows SQL-like querying and analysis of data stored in HDFS.
Apache Airflow: Orchestrates and schedules the workflow of the entire system.
Amazon QuickSight: Provides business intelligence and data visualization capabilities for insightful reporting and analysis.
1.1. Log in to AWS Management Console:
1.2. Launch a New EC2 Instance:
Navigate to the EC2 Dashboard.
Click on Launch Instance.
Choose an Amazon Machine Image (AMI): Amazon Linux 2 AMI (HVM) - Kernel 5.10, SSD Volume Type.
Choose an Architecture: this project only runs on the ARM64 architecture.
Choose an Instance Type:
For this project, t4g.2xlarge is recommended for its balance between performance and cost.
Configure Instance Details:
Configure Storage:
You should increase the volume size to 60 GB for this project.
Configure Security Group:
Add inbound rules that allow access from your IP (at minimum, SSH on port 22).
Review and Launch the instance.
Download the key pair (.pem file) and keep it safe; it’s needed for SSH access.
1.3. Access the EC2 Instance
Open your terminal.
Navigate to the directory where the .pem file is stored.
Run the following command to connect to your instance:
ssh -i "your-key-file.pem" ec2-user@your-ec2-public-ip
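If SSH rejects the connection with an "unprotected private key file" error (a common first-attempt issue not covered above), restrict the key's permissions before retrying:
chmod 400 your-key-file.pem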
2. Install Docker on the EC2 Instance
2.1: Update the Package Repository
Run the following command to ensure your package repository is up to date:
sudo yum update -y
2.2. Install Docker
Install Docker by running the following command:
sudo yum install -y docker
Start Docker and enable it to start at boot:
sudo systemctl start docker
sudo systemctl enable docker
Verify the Docker installation:
docker --version
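Optionally, you can add ec2-user to the docker group so Docker commands work without sudo; this is a standard Docker convenience step rather than part of the original instructions, and requires logging out and back in to take effect:
sudo usermod -aG docker ec2-user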
3.1. Install Docker Compose
Docker Compose is not available in the default Amazon Linux 2 repositories, so you will need to download it manually:
sudo curl -L "https://github.com/docker/compose/releases/download/v2.19.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
3.2. Apply Executable Permissions
Apply executable permissions to the binary:
sudo chmod +x /usr/local/bin/docker-compose
3.3. Verify Docker Compose Installation
Verify the installation by checking the version:
docker-compose --version
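If the shell reports that docker-compose cannot be found even though the download succeeded, /usr/local/bin may be missing from the PATH used by sudo; a symlink is a common workaround (an optional step beyond the instructions above):
sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose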
1.1. Install Git (if not already installed)
Install Git to clone the repository:
sudo yum install -y git
1.2. Clone the Repository
Run the following command to clone the project repository:
git clone https://github.com/your-username/covid-data-process.git
Navigate to the project directory:
cd covid-data-process
2.1. Port forwarding to the AWS EC2 Instance
ssh -i "your-key-file.pem" \
-L 6060:localhost:6060 \
-L 8088:localhost:8088 \
-L 8080:localhost:8080 \
-L 1010:localhost:1010 \
-L 9999:localhost:9999 \
-L 8443:localhost:8443 \
ec2-user@your-ec2-public-ip
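To confirm the tunnels are established, you can list the forwarded ports on your local machine; this quick check assumes a Linux client with the ss utility available:
ss -tln | grep -E ':(6060|8088|8080|1010|9999|8443)'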
2.2. Grant Execution Permissions for Shell Scripts
Once connected, ensure all shell scripts have the correct execution permissions:
chmod +x ./*.sh
2.3. Initialize the Project Environment
Run the init-pro.sh script to initialize the environment:
./init-pro.sh
This script performs the following tasks:
Sets the AIRFLOW_UID environment variable to match the current user's ID.
Creates the required directories for Spark and NiFi.
Sets up Airflow by initializing it with Docker.
2.4. Start the Docker Containers
Next, bring up all the services defined in the docker-compose file:
docker-compose --profile all up
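If you prefer to keep the terminal free, the same stack can be started in detached mode and inspected later; -d and logs -f are standard docker-compose options:
docker-compose --profile all up -d
docker-compose logs -f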
2.5. Post-Deployment Configuration
After the Docker containers are running, execute the after-compose.sh script to perform additional setup:
./after-compose.sh
This script:
Creates the HDFS directories and assigns the appropriate permissions.
Creates the Kafka topics (covidin and covidout).
Submits a Spark job to process streaming data.
2.6. Configure and Run Apache NiFi
To configure and run Apache NiFi:
Open the NiFi Web UI by navigating to http://localhost:8080/nifi in your browser.
Add the template for the NiFi workflow:
Start the NiFi workflow to begin ingesting and processing COVID-19 data.
2.7. Monitor Kafka Using Redpanda
To monitor Kafka, you can use Redpanda, which provides an easy-to-use interface for Kafka monitoring:
Open the Redpanda Console by navigating to http://localhost:1010 in your browser.
In the Redpanda Console, you can monitor Kafka topics, consumer groups, and producer metrics.
Use this interface to ensure that the Kafka topics (covidin and covidout) are receiving and processing data as expected.
2.8. Finalize Setup and Execute SQL Commands
Once the NiFi workflow is running, execute the final setup script after-nifi.sh:
./after-nifi.sh
This script:
Ensures the HDFS directories have the correct permissions.
Starts the SSH service in the Spark master container.
Runs a SQL script in Spark to set up the necessary databases and tables.
2.9. Run Apache Airflow DAGs
Open the Airflow Web UI by navigating to http://localhost:8080 in your browser.
Add a new connection and fill in the following details:
Connection Id: spark_conn
Connection Type: SSH
Host: spark-master
User: root
Password: ren294
Port: 22
Activate the two DAGs that were set up for this project.
2.10. Visualize Data in AWS QuickSight
Log in to your AWS QuickSight account.
Connect to the Hive database to access the processed COVID-19 data.
Create dashboards and visualizations to analyze and present the data insights.
1. Issue: Docker Containers Fail to Start
When running docker-compose --profile all up, one or more containers fail to start. To resolve:
Check that Docker is running properly on your EC2 instance by using docker ps.
Review the docker-compose logs using docker-compose logs to identify the cause of the failure.
Ensure that the .env file is correctly set up, especially the AIRFLOW_UID variable.
Verify that the necessary directories (nifi, spark, etc.) have been created with the correct permissions using chmod -R 777.
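For example, from the project directory, the mounted directories can be given permissive access in one pass; this is a blunt sketch of the chmod -R 777 suggestion above, and the directory names should be adjusted to match your checkout:
sudo chmod -R 777 nifi spark
ls -ld nifi spark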
2. Issue: Airflow DAGs Fail to Trigger
DAGs in Airflow do not execute when triggered. To resolve:
Verify that the connection to Spark (spark_conn) is correctly configured in Airflow by checking it under Admin > Connections.
Review the Airflow logs for errors related to Spark or Kafka integration.
Restart the Airflow services using docker-compose restart airflow.
3. Issue: Data Not Appearing in HDFS
Data does not appear in the HDFS directories even after the NiFi workflow is running. To resolve:
Check the NiFi logs to ensure that the processors are correctly writing data to HDFS.
Verify that /data/covid_data and /data/kafka_data exist and have the correct permissions.
Inspect the HDFS file system using the command docker exec -it namenode hdfs dfs -ls /data/ to check for the presence of data files.
4. Issue: Spark Jobs Fail or Hang
Spark jobs do not complete or get stuck at a certain stage. To resolve:
Check the Spark Web UI for any signs of resource exhaustion (e.g., memory or CPU limits).
Review the Spark logs for errors that might indicate misconfiguration or issues with input data.
Restart the Spark services and re-submit the jobs if necessary.
1. Viewing Docker Container Logs
Command: To view logs from a specific container, use:
docker logs <container_name>
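To follow a log live while limiting the initial scrollback, the same command accepts the standard follow and tail flags:
docker logs -f --tail 100 <container_name>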
2. Airflow Logs
Accessing Logs: In the Airflow Web UI, navigate to the specific task instance under the DAG runs.
Direct Logs: Logs are also available directly within the container:
docker exec -it airflow-worker tail -f /opt/airflow/logs/<dag_id>/<task_id>/log.txt
3. NiFi Logs
Command: View NiFi logs by accessing the NiFi container:
docker exec -it nifi tail -f /nifi/logs/nifi-app.log
Mounted folder: logs are also available in the mounted nifi/logs/ directory.
Web UI: You can also monitor processor status and logs directly within the NiFi Web UI by clicking on individual processors.
4. Kafka Logs
Command: Access Kafka logs by connecting to the Kafka container:
docker exec -it kafka tail -f /var/log/kafka/kafkaServer.log
5. Monitoring Kafka with Redpanda
Access: Navigate to http://localhost:1010 to access the Redpanda Console.
Usage: Use the Redpanda Console to monitor Kafka brokers, topics, and consumer groups.
6. Spark Job Monitoring with Spark Web UI
Access: Navigate to http://localhost:8088 to open the Spark Web UI.
Usage: The Spark Web UI provides detailed information about running and completed Spark jobs, including stages, tasks, and executor logs.
7. Running Jupyter Lab for Testing
Open your web browser and go to http://localhost:9999.
Run the following script to retrieve the token required to log into Jupyter Lab:
./token_notebook.sh
Paste the token obtained from the token_notebook.sh script to log in.
By connecting the processed data to AWS QuickSight, the project provides a powerful visualization platform that enables stakeholders to derive meaningful insights from the data.
Several improvements could extend this project:
Scaling Kafka and improving the resource allocation for Spark jobs could help the system handle larger datasets more efficiently.
Kubernetes could also enable better management and scaling of Docker containers across multiple nodes.
Integrating machine learning models for predicting future COVID-19 trends could provide deeper insights. This would involve extending the Spark jobs to include model training and inference within the pipeline.
Adding real-time visualizations of predictive analytics alongside the current trends would add significant value.
Migrating to managed services such as AWS MSK (Managed Streaming for Kafka), AWS EMR (Elastic MapReduce), and AWS Glue could simplify maintenance and improve overall performance.
Using AWS Lambda for serverless processing tasks could reduce operational costs and improve the responsiveness of the system.
Author: Nguyen Trung Nghia