PySpark Crash Analysis

This repository contains a PySpark project for analyzing crash data. It loads the data from CSV files into Spark DataFrames and runs the ten analyses listed under Analytics.

Project Structure

utils.py: Contains utility functions, including the read_csv function for reading CSV files into Spark DataFrames.
main.py: Contains the main analysis code, including data loading, transformations, and computations.
data/: Directory containing CSV files used for analysis.
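The read_csv helper in utils.py is not reproduced here, so the following is only a minimal sketch of what such a function might look like. The signature, the parameter names, and the choice to infer the schema are assumptions, not the project's actual code.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only
    from pyspark.sql import DataFrame, SparkSession


def read_csv(spark: SparkSession, path: str, header: bool = True,
             infer_schema: bool = True) -> DataFrame:
    """Load one CSV file into a Spark DataFrame.

    Hypothetical signature: the real helper in utils.py may differ.
    """
    # DataFrameReader.csv accepts `header` and `inferSchema` as keyword options.
    return spark.read.csv(path, header=header, inferSchema=infer_schema)
```

With a session created through SparkSession.builder.getOrCreate(), main.py could then call read_csv(spark, path) once for each CSV file under data/.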

Prerequisites

Python
Apache Spark
PySpark
Required Python packages (listed in requirements.txt)

Setup

Clone the Repository

git clone https://github.com/yourusername/your-repository.git
cd your-repository

Analytics

Analysis 1: Find the number of crashes (accidents) in which the number of males killed is greater than 2.
Analysis 2: How many two-wheelers are booked for crashes?
Analysis 3: Determine the top 5 vehicle makes of the cars present in crashes in which the driver died and the airbags did not deploy.
Analysis 4: Determine the number of vehicles whose drivers hold valid licences and were involved in a hit and run.
Analysis 5: Which state has the highest number of accidents in which females are not involved?
Analysis 6: Which VEH_MAKE_IDs rank 3rd to 5th by the number of injuries, including deaths, they contribute to?
Analysis 7: For all the body styles involved in crashes, report the top ethnic user group of each unique body style.
Analysis 8: Among the crashed cars, what are the top 5 ZIP codes with the highest number of crashes with alcohol as a contributing factor? (Use the driver ZIP code.)
Analysis 9: Count the distinct crash IDs where no damaged property was observed, the damage level (VEH_DMAG_SCL~) is above 4, and the car has insurance.
Analysis 10: Determine the top 5 vehicle makes where drivers are charged with speeding-related offences, are licensed, drive vehicles in one of the top 10 most used vehicle colours, and have cars licensed in one of the top 25 states with the highest number of offences (to be deduced from the data).
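As an illustration of how one of these questions might translate into PySpark, here is a rough sketch for Analysis 1. It assumes a person-level DataFrame with one row per person and the columns CRASH_ID, PRSN_GNDR_ID, and DEATH_CNT. Those column names, the 'MALE' value, and the function name are all assumptions made for the example and may not match the actual data or the code in main.py.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only
    from pyspark.sql import DataFrame


def count_crashes_with_male_deaths(persons: DataFrame, min_killed: int = 2) -> int:
    """Count crashes in which more than `min_killed` males were killed."""
    # Keep only rows for males who died (assumed column names and values).
    killed_males = persons.filter("PRSN_GNDR_ID = 'MALE' AND DEATH_CNT = 1")
    # One row per crash; GroupedData.count() adds a column named "count".
    per_crash = killed_males.groupBy("CRASH_ID").count()
    # Keep crashes strictly above the threshold and count them.
    return per_crash.filter(f"`count` > {min_killed}").count()
```

The filters are written as SQL expression strings so the function depends only on DataFrame methods. The same logic could equally be expressed with pyspark.sql.functions column objects.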
