A data analytics solution designed to enhance credit risk assessment for financial institutions and businesses. This project leverages Apache Spark and a modern data lake architecture.
The project performs real-time credit risk assessment on the bank client profiles it is given, using big data processing and machine learning to simulate how a bank might evaluate and manage credit risk for its clients.
Data Ingestion: The system simulates the ingestion of financial data including credit scores, account balances, and transaction histories for thousands of clients.
Data Processing: Leveraging PySpark's distributed computing capabilities, the raw data is cleaned, transformed, and prepared for analysis at scale.
Machine Learning Model:
Risk Scoring:
Dynamic Rate Calculation:
Real-time Assessment: As new financial data comes in, the system can rapidly reassess a client's credit risk, allowing for up-to-date risk management.
Risk Dashboard: A comprehensive dashboard provides bank managers with key metrics including average credit scores, risk distributions, and potential high-risk clients.
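As a rough illustration of the ingestion and dashboard steps above, here is a minimal sketch in plain Python (kept free of Spark so it runs anywhere). All names in it (`generate_clients`, `risk_band`, `dashboard_summary`, the record fields) are hypothetical and not taken from the repository, and the threshold-based `risk_band` is only a stand-in for the project's machine learning model, not a description of it.

```python
import random
from statistics import mean


def generate_clients(n, seed=0):
    """Generate n mock client records (field names are assumptions)."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    return [
        {
            "client_id": i,
            "credit_score": rng.randint(300, 850),
            "account_balance": round(rng.uniform(0, 100_000), 2),
            "transactions": [round(rng.uniform(-500, 500), 2) for _ in range(5)],
        }
        for i in range(n)
    ]


def risk_band(credit_score):
    """Map a credit score to a coarse band; thresholds are illustrative only."""
    if credit_score < 580:
        return "high"
    if credit_score < 700:
        return "medium"
    return "low"


def dashboard_summary(clients):
    """Compute average credit score, risk distribution, and high-risk client IDs."""
    distribution = {"low": 0, "medium": 0, "high": 0}
    for client in clients:
        distribution[risk_band(client["credit_score"])] += 1
    return {
        "average_credit_score": mean(c["credit_score"] for c in clients),
        "risk_distribution": distribution,
        "high_risk_clients": [
            c["client_id"] for c in clients if risk_band(c["credit_score"]) == "high"
        ],
    }


if __name__ == "__main__":
    summary = dashboard_summary(generate_clients(1000))
    print(summary["average_credit_score"], summary["risk_distribution"])
```

Re-running `dashboard_summary` on an updated client list is the simplest way to picture the real-time reassessment step; at scale the same aggregation would presumably be expressed as PySpark DataFrame operations.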
Clone the repository:
git clone https://github.com/dvelkow/credit_risk_data_lake_for_lending
Install the required packages:
pip install -r requirements.txt
Run the main data lake script:
python main.py
(By default it runs on random mock data, but you can connect it to a real database through the main.py file.)
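How main.py loads its data is not shown here, so the following is only a sketch of what swapping mock data for a database read might look like. It uses Python's built-in sqlite3 module with an in-memory database; the `clients` table, its columns, and the `load_clients` helper are assumptions made for illustration, not part of the project.

```python
import sqlite3


def load_clients(conn):
    """Read rows from a hypothetical `clients` table into plain dicts."""
    conn.row_factory = sqlite3.Row  # lets each row be converted with dict()
    rows = conn.execute(
        "SELECT client_id, credit_score, account_balance "
        "FROM clients ORDER BY client_id"
    ).fetchall()
    return [dict(row) for row in rows]


if __name__ == "__main__":
    # In-memory database standing in for a real one.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE clients ("
        "client_id INTEGER PRIMARY KEY, credit_score INTEGER, account_balance REAL)"
    )
    conn.executemany(
        "INSERT INTO clients VALUES (?, ?, ?)",
        [(1, 720, 15000.0), (2, 540, 320.5)],
    )
    print(load_clients(conn))
```

For a production database, the equivalent step in a Spark pipeline would more likely be a JDBC read, which needs a driver and connection details specific to your environment.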