This repository contains a PySpark project for analyzing crash data. The project runs a series of analyses on crash data loaded from CSV files.
utils.py: Contains utility functions, including the read_csv function for reading CSV files into Spark DataFrames.
main.py: Contains the main analysis code, including data loading, transformations, and computations.
data/: Directory containing the CSV files used for analysis.
requirements.txt
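The read_csv helper mentioned above is not shown in this README. A minimal sketch of what such a helper could look like follows; the parameter names and defaults (header, infer_schema) are assumptions for illustration, not the project's actual signature.

```python
def read_csv(spark, path, header=True, infer_schema=True):
    """Load a CSV file into a Spark DataFrame.

    `spark` is expected to be an active pyspark.sql.SparkSession.
    The keyword arguments and their defaults are assumptions made for
    this sketch; the real utils.py may differ.
    """
    # DataFrameReader.csv accepts the `header` and `inferSchema` options:
    # header=True takes column names from the first row, and
    # inferSchema=True asks Spark to guess column types from the data.
    return spark.read.csv(path, header=header, inferSchema=infer_schema)
```

Called with an active SparkSession and a path under data/, this returns a DataFrame whose column names come from the CSV header row. Schema inference costs an extra pass over the file, so an explicit schema may be preferable for large inputs.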
Clone the Repository
git clone https://github.com/yourusername/your-repository.git
cd your-repository
Analysis 1: Find the number of crashes (accidents) in which the number of males killed is greater than 2.
Analysis 2: How many two-wheelers are booked for crashes?
Analysis 3: Determine the top 5 vehicle makes of the cars present in crashes in which the driver died and the airbags did not deploy.
Analysis 4: Determine the number of vehicles whose drivers hold valid licences and were involved in a hit-and-run.
Analysis 5: Which state has the highest number of accidents in which females are not involved?
Analysis 6: Which are the 3rd to 5th ranked VEH_MAKE_IDs that contribute to the largest number of injuries, including deaths?
Analysis 7: For all the body styles involved in crashes, give the top ethnic user group of each unique body style.
Analysis 8: Among the crashed cars, what are the top 5 zip codes with the highest number of crashes with alcohol as the contributing factor? (Use the driver zip code.)
Analysis 9: Count the distinct crash IDs where no damaged property was observed, the damage level (VEH_DMAG_SCL~) is above 4, and the car has insurance.
Analysis 10: Determine the top 5 vehicle makes where the drivers are charged with speeding-related offences, the drivers are licensed, the vehicle uses one of the top 10 most used vehicle colours, and the car is licensed in one of the top 25 states with the highest number of offences (to be deduced from the data).