Implementing core components of a data-driven architecture using Spark: Data Management and Data Analysis Backbones with structured zones in a data lake and analytical capabilities
This repository implements two critical backbones of a data-driven architecture with Apache Spark: the Data Management Backbone and the Data Analysis Backbone. The project sets up a data lake with structured zones on the local file system, processes raw data through those zones, and performs either descriptive or predictive analysis.
The steps below set up PySpark on macOS (for Windows setup, see: https://www.machinelearningplus.com/pyspark/install-pyspark-on-windows/).
Spark Mac
brew install openjdk
java -version
whereis java
export JAVA_HOME=$(/usr/libexec/java_home)
source ~/.bashrc
brew install apache-spark
brew info apache-spark
export SPARK_HOME=/usr/local/Cellar/apache-spark/<version>/libexec
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
source ~/.bashrc
pip install pyspark
pyspark --version
Landing Zone
Stores raw data ingested into the data lake in a structured or semi-structured format. This includes data directly extracted from source systems with minimal transformation.
Formatted Zone
Stores data in a standardized format according to a canonical data model. Data may be enriched and is in a consumption-ready form.
Implementation: Parquet files on the local file system, chosen for efficient storage and schema enforcement.
Exploitation Zone
Contains processed and refined data optimized for analysis, such as features and KPIs.
Descriptive Analysis and Dashboarding
Descriptive Analysis: Performed exploratory data analysis (EDA) on the data in the Exploitation Zone to summarize and understand the data.
Dashboarding: Created interactive dashboards using tools like Tableau, Power BI, or Jupyter Notebooks with matplotlib/seaborn.
├── Documents
│ ├── BigData_Spark_notebook.ipynb
│ └── BigData_Spark_report.pdf
├── LandingZone
│ ├── cultural-sites
│ ├── income
│ └── price_opendata
├── FormattedZone
│ ├── CulturalSites
│ ├── Income
│ └── PriceOpenData
├── ExploitationZone
│ ├── CulturalSites
│ ├── Income
│ ├── Price_Income
│ └── PriceOpenData
└── README.md
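The zone folders in the tree above can be scaffolded on the local file system with a few lines of standard-library Python. This is only a sketch: the helper `create_zones` is a hypothetical name, and the example writes into a temporary directory instead of the repository root.

```python
import tempfile
from pathlib import Path

# Zone and dataset folder names copied from the tree above.
ZONES = {
    "LandingZone": ["cultural-sites", "income", "price_opendata"],
    "FormattedZone": ["CulturalSites", "Income", "PriceOpenData"],
    "ExploitationZone": ["CulturalSites", "Income", "Price_Income", "PriceOpenData"],
}


def create_zones(root: Path) -> list[Path]:
    """Create every zone/dataset folder under root and return the paths."""
    created = []
    for zone, datasets in ZONES.items():
        for dataset in datasets:
            path = root / zone / dataset
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created


# Demonstrate in a throwaway directory; pass the repository root in practice.
root = Path(tempfile.mkdtemp())
created = create_zones(root)
for path in created:
    print(path.relative_to(root))
```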