Extract data from the many databases of the Labor, Invalids and Social Affairs sector, convert it to an appropriate structure and format, then load it into a shared data warehouse and data marts. State agency staff can then easily retrieve and analyze data from the compiled warehouse.
Description: the data warehouse was designed following the Inmon approach: all data is first integrated into a single enterprise warehouse, from which several data marts are created for the associated sectors of the government system.
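The extract-and-integrate step can be sketched in plain Python with `sqlite3` standing in for the real source databases (the table and database names below are illustrative, not the project's actual schemas):

```python
import sqlite3

# Two stand-in source databases; in the real system these would be the
# operational databases of different Labor, Invalids and Social Affairs units.
employment = sqlite3.connect(":memory:")
employment.execute("CREATE TABLE workers (id INTEGER, name TEXT, province TEXT)")
employment.execute("INSERT INTO workers VALUES (1, 'An', 'Hanoi'), (2, 'Binh', 'Hue')")

social = sqlite3.connect(":memory:")
social.execute("CREATE TABLE beneficiaries (id INTEGER, name TEXT, province TEXT)")
social.execute("INSERT INTO beneficiaries VALUES (7, 'Chi', 'Da Nang')")

# Single integrated warehouse, per the Inmon approach: one conformed table
# that unifies records from every source system.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE person (source TEXT, source_id INTEGER, name TEXT, province TEXT)"
)

def extract_load(src, table, tag):
    # Extract rows from one source database and load them in the shared structure.
    rows = src.execute(f"SELECT id, name, province FROM {table}").fetchall()
    warehouse.executemany(
        "INSERT INTO person VALUES (?, ?, ?, ?)",
        [(tag, *row) for row in rows],
    )

extract_load(employment, "workers", "employment")
extract_load(social, "beneficiaries", "social_insurance")

total = warehouse.execute("SELECT COUNT(*) FROM person").fetchone()[0]
```

Data marts for individual sectors would then be derived from the integrated `person` table rather than from the sources directly.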
bronze -> silver -> gold
Every step in this project is part of an automatable data pipeline: raw data is loaded from the sources and passes through the medallion layers (bronze -> silver -> gold) to ensure data quality before it is finally loaded into the warehouse and the data marts.
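A minimal illustration of the medallion flow in plain Python (the real pipeline applies the same idea with Spark jobs orchestrated by Airflow; the sample records are invented):

```python
# Bronze: raw records exactly as extracted, duplicates and bad values included.
bronze = [
    {"id": "1", "province": "Hanoi", "benefit": "350000"},
    {"id": "1", "province": "Hanoi", "benefit": "350000"},        # duplicate
    {"id": "2", "province": "Hue",   "benefit": "not_a_number"},  # invalid value
    {"id": "3", "province": "Hanoi", "benefit": "500000"},
]

# Silver: cleaned and validated -- deduplicate by id and enforce types.
seen, silver = set(), []
for rec in bronze:
    if rec["id"] in seen:
        continue
    try:
        amount = int(rec["benefit"])
    except ValueError:
        continue  # reject records that fail validation
    seen.add(rec["id"])
    silver.append({"id": rec["id"], "province": rec["province"], "benefit": amount})

# Gold: business-level aggregates ready for the warehouse and data marts.
gold = {}
for rec in silver:
    gold[rec["province"]] = gold.get(rec["province"], 0) + rec["benefit"]
```

Each layer is persisted in practice, so a failed downstream step can be re-run without re-extracting from the sources.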
Dockerfile for Airflow and Spark
FROM apache/airflow:2.9.1-python3.11
USER root
# Install OpenJDK-17
RUN apt-get update && \
    apt-get install -y openjdk-17-jdk ant && \
    apt-get clean
# Set JAVA_HOME so Spark can locate the JDK
ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
USER airflow
# Sync files from local to Docker image
COPY ./airflow/dags /opt/airflow/dags
COPY requirements.txt .
# Install Python dependencies (including PySpark)
RUN pip install --no-cache-dir -r requirements.txt
RUN rm requirements.txt
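The referenced requirements.txt is not shown in this document; at minimum it would pin PySpark so the driver matches the cluster (the exact package set and version below are an assumption):

```
pyspark==3.5.1
```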
DAGs of data warehouse integration
DAGs of Resident data mart integration
DAGs of Time and Location integration