Hadoop-DataPipeline

An end-to-end data engineering pipeline to collect, store, process, and analyze property and crime data using Hadoop, Docker, MySQL, Tailscale, and Selenium

Property and Locality Data Analysis

Introduction

This project involves setting up a data engineering pipeline to collect, store, process, and analyze Property and Locality data using Hadoop, Docker, MySQL, Tailscale, and Selenium.

Project Overview

  • Objective: Analyze Property and Locality data to derive meaningful insights.
  • Scope: Collect data through web scraping, store in HDFS, process using Hadoop, and analyze with MySQL.

Video

Here is a demo video of the pipeline:

Setup and Environment

Virtual Machines Setup

  • Step: Installed Ubuntu on VirtualBox for each VM.
  • Action: Configured each VM with necessary packages including Docker and Docker Compose.

Networking with Tailscale

  • Step: Installed and configured Tailscale on all VMs.
  • Action: Created a secure virtual network to enable communication between the VMs; a join sketch follows below.
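
A minimal sketch of how a VM could be brought onto the tailnet from Python and report its Tailscale IP for later steps; the auth key and hostname below are placeholders, not values from this repository:

```python
# Hypothetical helper to bring a VM onto the tailnet; the auth key and
# hostname are placeholders.
import subprocess

def join_tailnet(auth_key: str, hostname: str) -> str:
    # Authenticate the node and give it a stable hostname on the tailnet.
    subprocess.run(
        ["sudo", "tailscale", "up", "--authkey", auth_key, "--hostname", hostname],
        check=True,
    )
    # Return the node's Tailscale IPv4 address for use in later steps
    # (e.g. advertising it to Docker Swarm).
    result = subprocess.run(
        ["tailscale", "ip", "-4"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()

if __name__ == "__main__":
    print(join_tailnet("tskey-auth-XXXXXXXX", "hadoop-worker-1"))
```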

Docker Swarm Initialization

  • Step: Initialized Docker Swarm on the master node and joined worker nodes.
  • Action: Used Docker Swarm for container orchestration; the bootstrap steps are sketched below.
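
A sketch of the Swarm bootstrap, assuming the Tailscale IPs from the previous step are used as the advertise and join addresses; `init_manager` would run on the master node and `join_worker` on each worker:

```python
import subprocess

def init_manager(advertise_ip: str) -> str:
    # Initialise the Swarm, advertising the manager's Tailscale address.
    subprocess.run(
        ["docker", "swarm", "init", "--advertise-addr", advertise_ip],
        check=True,
    )
    # Fetch the token that workers need in order to join.
    token = subprocess.run(
        ["docker", "swarm", "join-token", "worker", "-q"],
        capture_output=True, text=True, check=True,
    )
    return token.stdout.strip()

def join_worker(manager_ip: str, token: str) -> None:
    # Join this node to the Swarm as a worker over the Tailscale network.
    subprocess.run(
        ["docker", "swarm", "join", "--token", token, f"{manager_ip}:2377"],
        check=True,
    )
```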

Image of Setup Process

Data Collection

Web Scraping with Selenium

  • Step: Installed Selenium and Chrome WebDriver.
  • Action: Developed scripts to scrape Property and Locality data from various websites; a simplified example follows below.
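
A simplified example of such a scraping script; the URL, CSS selectors, and output columns are placeholders that would need to match the actual listing sites used in this project:

```python
import csv
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")   # run Chrome without a display
options.add_argument("--no-sandbox")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/property-listings")  # placeholder URL
    rows = []
    for card in driver.find_elements(By.CSS_SELECTOR, ".listing-card"):
        rows.append({
            "locality": card.find_element(By.CSS_SELECTOR, ".locality").text,
            "price": card.find_element(By.CSS_SELECTOR, ".price").text,
            "area_sqft": card.find_element(By.CSS_SELECTOR, ".area").text,
        })
    # Write the scraped records to a local CSV before loading them into HDFS.
    with open("property_data.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["locality", "price", "area_sqft"])
        writer.writeheader()
        writer.writerows(rows)
finally:
    driver.quit()
```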

Image of Collection Process

Data Storage in HDFS

  • Step: Changed into the spark cluster folder and brought up the ready-to-go Big Data cluster (Hadoop + Hadoop Streaming + Spark + PySpark + Jupyter Notebook) on Docker and Docker Swarm, then configured HDFS on the Hadoop cluster provided by Prof. Dr.-Ing. Binh Vu. See the README.md file in the spark cluster folder to begin the setup process.
  • Action: Stored the scraped data in HDFS with appropriate partitioning and replication; an upload sketch follows below.
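
A sketch of the upload step using the `hdfs` CLI from inside the cluster; the directory layout (partitioned by ingest date) and the replication factor of 3 are assumptions, not settings taken from this repository:

```python
import subprocess
from datetime import date

def upload_to_hdfs(local_path: str, replication: int = 3) -> None:
    # Partition the data by scrape date (assumed layout).
    target_dir = f"/data/property/ingest_date={date.today():%Y-%m-%d}"
    # Create the partition directory (no error if it already exists).
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", target_dir], check=True)
    # Copy the scraped CSV into HDFS, overwriting any previous attempt.
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, target_dir], check=True)
    # Set the replication factor explicitly and wait until it is satisfied.
    subprocess.run(
        ["hdfs", "dfs", "-setrep", "-w", str(replication), target_dir],
        check=True,
    )

if __name__ == "__main__":
    upload_to_hdfs("property_data.csv")
```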

Image of Data Storage in HDFS Process

Data Processing

Hadoop Job Development

  • Step: Developed and executed Hadoop jobs for data cleaning and transformation.
  • Action: Used MapReduce for distributed processing; a minimal example is sketched below.
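
A minimal sketch of such a job, written for Hadoop Streaming (which the cluster image ships) and assuming the scraped CSV has locality in the first column and a numeric price in the second; the actual jobs in this project may differ:

```python
# Submitted as a Hadoop Streaming job, e.g. with
#   -mapper "python job.py map" -reducer "python job.py reduce" -file job.py
import sys

def mapper():
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        if len(fields) < 3:
            continue                      # drop malformed rows
        locality, price = fields[0].strip(), fields[1].strip()
        try:
            print(f"{locality}\t{float(price)}")
        except ValueError:
            continue                      # drop header and non-numeric prices

def reducer():
    # Input arrives sorted by key, so averages can be computed per locality.
    current, total, count = None, 0.0, 0
    for line in sys.stdin:
        locality, price = line.rstrip("\n").split("\t")
        if locality != current:
            if current is not None:
                print(f"{current}\t{total / count:.2f}")   # average price
            current, total, count = locality, 0.0, 0
        total += float(price)
        count += 1
    if current is not None:
        print(f"{current}\t{total / count:.2f}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```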

Image of Data Processing Process

Failure Test

  • Step: Conducted data read/write operations while intentionally shutting down a worker node.
  • Action: Verified system resilience and fault tolerance; a health-check sketch follows below.
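
A small health-check sketch of the kind used during such a test, reporting live and dead DataNodes and confirming that previously stored data is still readable; the HDFS path below is a placeholder:

```python
import subprocess

def cluster_report() -> str:
    # `hdfs dfsadmin -report` lists live and dead DataNodes.
    return subprocess.run(
        ["hdfs", "dfsadmin", "-report"],
        capture_output=True, text=True, check=True,
    ).stdout

def can_read(path: str) -> bool:
    # Reading succeeds as long as at least one replica of every block survives.
    result = subprocess.run(["hdfs", "dfs", "-cat", path],
                            capture_output=True, text=True)
    return result.returncode == 0

if __name__ == "__main__":
    print(cluster_report())
    # Placeholder path matching the layout assumed in the storage sketch above.
    print("readable:", can_read("/data/property/ingest_date=2024-01-01/property_data.csv"))
```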

Image of Failure Test Process - node down

Image of Failure Test Process - ingestion to mysql

Data Ingestion into MySQL

Database Design

  • Step: Created a relational database schema in MySQL.
  • Action: Developed scripts to ingest data from HDFS into MySQL; an ingestion sketch follows below.
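
An ingestion sketch, assuming the MapReduce output format from the processing step (locality, tab, average price) and a hypothetical `locality_prices` table; the hostname, credentials, and paths are placeholders:

```python
import subprocess
import mysql.connector

def read_hdfs_output(path: str):
    # Stream the job output out of HDFS and parse it line by line.
    text = subprocess.run(["hdfs", "dfs", "-cat", path],
                          capture_output=True, text=True, check=True).stdout
    for line in text.splitlines():
        locality, avg_price = line.split("\t")
        yield locality, float(avg_price)

def ingest(path: str) -> None:
    conn = mysql.connector.connect(
        host="mysql-host", user="pipeline", password="***", database="property_db"
    )
    try:
        cur = conn.cursor()
        cur.executemany(
            "INSERT INTO locality_prices (locality, avg_price) VALUES (%s, %s)",
            list(read_hdfs_output(path)),
        )
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    ingest("/output/avg_price_by_locality/part-00000")
```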

Image of Database Schema

Business Insights

Query Development

  • Step: Developed SQL queries to extract insights from the database.
  • Action: Generated graphs and tables to present the results; an example query and chart follow below.
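
An example insight query against the hypothetical schema sketched above, ranking localities by average price and rendering a simple bar chart; the real queries and visualizations in this project may differ:

```python
import mysql.connector
import matplotlib.pyplot as plt

conn = mysql.connector.connect(
    host="mysql-host", user="pipeline", password="***", database="property_db"
)
cur = conn.cursor()
cur.execute(
    """
    SELECT locality, AVG(avg_price) AS price
    FROM locality_prices
    GROUP BY locality
    ORDER BY price DESC
    LIMIT 10
    """
)
rows = cur.fetchall()
conn.close()

# Plot the top localities by average property price.
localities = [r[0] for r in rows]
prices = [float(r[1]) for r in rows]
plt.barh(localities, prices)
plt.xlabel("Average price")
plt.title("Top 10 localities by average property price")
plt.tight_layout()
plt.savefig("top_localities.png")
```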

Image of Business Insights Visualization

Acknowledgements

  • Note: A special thank you to Prof. Dr.-Ing. Binh Vu for providing the ready-to-go Spark cluster image used in this project.