This project demonstrates data cleaning and processing with Apache Spark and Apache Flink, both locally and on AWS EMR.
The taxi_tripdata.csv dataset used in this project is sourced from the New York City Taxi and Limousine Commission (TLC). It contains detailed records of taxi trips in New York City, including information such as pickup and dropoff dates and times, passenger count, trip distance, payment type, fare amount, and more.
This dataset provides a comprehensive view of taxi operations in NYC, useful for various analytical and machine learning tasks, such as demand prediction, fare estimation, and route optimization.
import pandas as pd

# Load the dataset; read store_and_fwd_flag as a categorical column
df = pd.read_csv('taxi_tripdata.csv', dtype={'store_and_fwd_flag': 'category'})

# Inspect the raw flag values (typically 'Y', 'N', and possibly NaN)
print(df['store_and_fwd_flag'].unique())

# Convert 'Y'/'N' to booleans; unmapped or missing values become NaN
df['store_and_fwd_flag'] = df['store_and_fwd_flag'].map({'Y': True, 'N': False})

# Treat missing flags as False
df['store_and_fwd_flag'] = df['store_and_fwd_flag'].fillna(False)

print(df.head())
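A quick sanity check of the cleaning above can be sketched as follows. The sample rows here are hypothetical stand-ins for the real taxi_tripdata.csv:

```python
import pandas as pd

# Hypothetical sample rows standing in for taxi_tripdata.csv
df = pd.DataFrame({'store_and_fwd_flag': ['Y', 'N', None]})

# Same cleaning steps as above, plus a final cast to a true boolean dtype
df['store_and_fwd_flag'] = df['store_and_fwd_flag'].map({'Y': True, 'N': False})
df['store_and_fwd_flag'] = df['store_and_fwd_flag'].fillna(False).astype(bool)

# After cleaning, the column should be boolean with no missing values
assert df['store_and_fwd_flag'].dtype == bool
assert df['store_and_fwd_flag'].isna().sum() == 0
print(df['store_and_fwd_flag'].tolist())
```

The final astype(bool) is optional but ensures the column has a proper boolean dtype rather than object.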
# Run the Spark job locally with 2 worker threads
spark-submit --master local[2] spark_script.py
# Run the Spark job on AWS EMR via YARN
spark-submit --master yarn spark_script.py
# Run the PyFlink job locally (replace /path/to/flink with your Flink install)
/path/to/flink/bin/flink run -py flink_script.py
# Run the PyFlink job on EMR, where Flink is installed under /usr/lib/flink
/usr/lib/flink/bin/flink run -py /path/to/flink_script.py
Data Cleaning: normalizing the store_and_fwd_flag column.
Running Spark and Flink: locally and on AWS EMR.
Installing Jupyter Notebook.
This project provided hands-on experience with data cleaning and processing using Spark and Flink, and with handling the challenges of running jobs in both local and EMR environments.