The project predicts NYC taxi trip durations using Polynomial Linear Regression and Ridge Regression, with Lasso Regression used to select features, and achieved a validation R² score of 0.67. Feature engineering included KMeans clustering, Haversine distance calculation, and date-time feature extraction.
This project aims to predict the duration of taxi trips in New York City using Polynomial Linear Regression and Ridge Regression, with Lasso Regression for feature selection. The dataset is the New York City Taxi Trip Duration Dataset, which contains detailed trip records including pickup and dropoff locations, times, and other related features.
- model_pipeline.py: Main script to run the feature engineering, data preprocessing, model training, evaluation, and prediction.
- test.py: Makes predictions and creates the sample submission file.
- README.md: Project documentation.
- grid_search.pkl: Saved GridSearchCV object for the best model.
- model.pkl: Trained model saved using joblib.
- submission.csv: Sample submission file.

The dataset includes the following key columns:
- id: Unique identifier for each trip
- vendor_id: ID of the taxi vendor
- pickup_datetime: Date and time when the trip started
- dropoff_datetime: Date and time when the trip ended
- passenger_count: Number of passengers
- pickup_longitude: Longitude where the trip started
- pickup_latitude: Latitude where the trip started
- dropoff_longitude: Longitude where the trip ended
- dropoff_latitude: Latitude where the trip ended
- store_and_fwd_flag: Flag indicating whether the trip record was held in vehicle memory before being sent to the vendor
- trip_duration: Duration of the trip in seconds (target variable)

The following feature engineering techniques were applied to enrich the dataset: KMeans clustering of pickup and dropoff coordinates, Haversine distance calculation, and date-time feature extraction.
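The Haversine distance is the great-circle distance between the pickup and dropoff points. The following is a minimal sketch of how calculate_trip_distance might compute distance and bearing, not the project's actual code. The output column names trip_distance_km and bearing, the helper names, and the Earth radius constant are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees (0 = north, 90 = east)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlon = lon2 - lon1
    y = np.sin(dlon) * np.cos(lat2)
    x = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(dlon)
    return np.degrees(np.arctan2(y, x))


def calculate_trip_distance(df):
    """Return a copy of df with distance and bearing columns added.

    The column names 'trip_distance_km' and 'bearing' are hypothetical.
    """
    out = df.copy()
    args = (out["pickup_latitude"], out["pickup_longitude"],
            out["dropoff_latitude"], out["dropoff_longitude"])
    out["trip_distance_km"] = haversine_km(*args)
    out["bearing"] = bearing_deg(*args)
    return out
```

Because the helpers use NumPy ufuncs, they operate on whole columns at once rather than row by row.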
A Ridge Regression model with polynomial features was used for the final prediction. GridSearchCV was employed to tune the hyperparameters of the model.
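One way to combine polynomial features, Ridge regression, and GridSearchCV is a scikit-learn Pipeline. The sketch below is illustrative only: the scaling step, the step names, the parameter grid values, the 3-fold split, and the R² scoring choice are all assumptions, since the project's actual settings are not listed here.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def train_model(X_train, y_train):
    """Fit a polynomial Ridge pipeline, tuning degree and alpha by grid search."""
    pipeline = Pipeline([
        ("scale", StandardScaler()),                       # assumed preprocessing step
        ("poly", PolynomialFeatures(include_bias=False)),  # Ridge fits the intercept itself
        ("ridge", Ridge()),
    ])
    # Hypothetical search space; the values the project used are not documented.
    param_grid = {
        "poly__degree": [1, 2],
        "ridge__alpha": [0.1, 1.0, 10.0],
    }
    search = GridSearchCV(pipeline, param_grid, cv=3, scoring="r2")
    search.fit(X_train, y_train)
    return search
```

The returned object exposes best_params_ and best_estimator_, so it can be saved whole (as with grid_search.pkl) or reduced to the best pipeline.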
Performance was measured on both the training set and the validation set, using RMSE, MAE, and R² score.
The main functions and their purposes are outlined below:

Data Loading and Preprocessing
- load_data(file_path): Loads data from a CSV file.
- check_missing_data(train, validation): Checks for missing values in the train and validation data.
- preprocess_data(df, xlim, ylim, hour_to_speed, isTest=False): Integrates all preprocessing steps including feature engineering.

Feature Engineering
- add_average_hourly_speed(df, hour_to_speed): Adds average hourly speed to the dataframe.
- filter_geographical_boundaries(df, xlim, ylim): Filters data within specified geographical boundaries.
- apply_clustering(df, n_clusters=6): Applies KMeans clustering to pickup and dropoff coordinates.
- calculate_trip_distance(df): Adds trip distance and bearing to the dataframe.
- extract_datetime_features(df): Extracts features from datetime columns.
- calculate_distance_to_center(df, center_coordinates): Calculates the distance to city center.
- calculate_distance_to_airport(df, airport_coordinates, column_name): Calculates the distance to an airport.
- remove_outliers(df, column): Removes outliers from a specified column using the IQR method.
- add_time_features(df): Adds granular time features.
- add_manhattan_distance(df): Calculates the Manhattan distance between pickup and dropoff coordinates.
- add_interaction_features(df): Adds interaction features.

Modeling and Evaluation
- get_important_features(df, target_column, alpha=0.1): Extracts important features using Lasso regression.
- train_model(X_train, y_train): Trains the Ridge regression model with GridSearchCV.
- evaluate_model(model, X, y): Evaluates the model and prints cross-validated RMSE, MAE, and R² score.
- save_model(model, filename): Saves the trained model to a file.
- load_model(filename): Loads a model from a file.
- predict_and_save_submission(model, test_features, test_ids, filename): Predicts test data and saves the submission file.

Install Dependencies
Make sure you have the required libraries installed. You can install them using:
pip install pandas numpy scikit-learn joblib
Load Data
Use the load_data function to load your dataset.
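A minimal sketch of what load_data and check_missing_data could look like, assuming a plain pandas.read_csv call. Returning the missing-value counts as a small table is an assumption; the original function may only print them.

```python
import pandas as pd


def load_data(file_path):
    """Load a CSV file into a DataFrame."""
    return pd.read_csv(file_path)


def check_missing_data(train, validation):
    """Count missing values per column in each split.

    Returns a DataFrame with one row per column and one column per split
    (an assumed return shape, chosen so the result is easy to inspect).
    """
    return pd.DataFrame({
        "train": train.isnull().sum(),
        "validation": validation.isnull().sum(),
    })
```

With hypothetical file names, usage would be train = load_data("train.csv") and validation = load_data("val.csv"), followed by check_missing_data(train, validation).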
Preprocess Data
Apply the preprocess_data function to your dataset.
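Two of the steps that preprocess_data integrates are date-time feature extraction and KMeans clustering of coordinates. The sketch below shows one plausible form of each, under assumptions: the output column names (pickup_hour, pickup_weekday, pickup_month, pickup_cluster, dropoff_cluster), fitting a single KMeans model on pickups and dropoffs together, and the random_state are illustrative choices rather than the project's confirmed behavior.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def extract_datetime_features(df):
    """Return a copy of df with hour, weekday, and month of the pickup time."""
    out = df.copy()
    ts = pd.to_datetime(out["pickup_datetime"])
    out["pickup_hour"] = ts.dt.hour
    out["pickup_weekday"] = ts.dt.dayofweek  # Monday = 0, Sunday = 6
    out["pickup_month"] = ts.dt.month
    return out


def apply_clustering(df, n_clusters=6):
    """Return a copy of df with a cluster label for each pickup and dropoff point."""
    out = df.copy()
    pickups = out[["pickup_latitude", "pickup_longitude"]].to_numpy()
    dropoffs = out[["dropoff_latitude", "dropoff_longitude"]].to_numpy()
    # Assumption: one shared model, so pickup and dropoff labels are comparable.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    kmeans.fit(np.vstack([pickups, dropoffs]))
    out["pickup_cluster"] = kmeans.predict(pickups)
    out["dropoff_cluster"] = kmeans.predict(dropoffs)
    return out
```

For test data, the clustering model fitted on the training set would need to be reused rather than refitted, which is presumably part of what the isTest flag controls.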
Train the Model
Use the train_model function to train the Ridge regression model.
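Before training, the input columns can be narrowed with get_important_features. A minimal sketch of Lasso-based selection follows, assuming that "important" means a non-zero Lasso coefficient and that features are standardized first. Both are assumptions; the project's own criterion is not documented here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler


def get_important_features(df, target_column, alpha=0.1):
    """Return the names of features that keep a non-zero Lasso coefficient."""
    X = df.drop(columns=[target_column])
    y = df[target_column]
    # Standardize so the L1 penalty treats all features on the same scale (assumed step).
    X_scaled = StandardScaler().fit_transform(X)
    lasso = Lasso(alpha=alpha).fit(X_scaled, y)
    return [col for col, coef in zip(X.columns, lasso.coef_) if coef != 0]
```

A larger alpha drives more coefficients to zero and therefore keeps fewer features.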
Evaluate the Model
Evaluate the trained model using the evaluate_model function.
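Cross-validated RMSE, MAE, and R² can be computed in one pass with scikit-learn's cross_validate. This is a sketch of how evaluate_model might do it; the cv parameter and the returned dictionary are additions for illustration and are not part of the documented signature.

```python
import numpy as np
from sklearn.model_selection import cross_validate


def evaluate_model(model, X, y, cv=5):
    """Print and return cross-validated RMSE, MAE, and R² for the given model."""
    scores = cross_validate(
        model, X, y, cv=cv,
        scoring=("neg_root_mean_squared_error", "neg_mean_absolute_error", "r2"),
    )
    # scikit-learn reports errors as negated scores, so flip the sign back.
    results = {
        "rmse": -scores["test_neg_root_mean_squared_error"].mean(),
        "mae": -scores["test_neg_mean_absolute_error"].mean(),
        "r2": scores["test_r2"].mean(),
    }
    for name, value in results.items():
        print(f"{name}: {value:.4f}")
    return results
```

Note that cross_validate refits a clone of the model on each fold, so passing a full GridSearchCV object would rerun the search per fold; passing its best_estimator_ is cheaper.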
Save the Model
Save the trained model using the save_model function.
Load the Model
Load the saved model using the load_model function.
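Since the trained model is saved using joblib, the save and load steps are likely thin wrappers around joblib.dump and joblib.load. A minimal sketch under that assumption:

```python
import joblib


def save_model(model, filename):
    """Serialize a fitted model (or any picklable object) to disk."""
    joblib.dump(model, filename)


def load_model(filename):
    """Load a model previously written by save_model."""
    return joblib.load(filename)
```

A typical round trip would be save_model(model, "model.pkl") followed later by model = load_model("model.pkl"). Files saved this way should only be loaded from trusted sources, because unpickling can execute arbitrary code.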
Predict and Save Submission
Use the predict_and_save_submission function to generate predictions on the test set and save the results.
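The sketch below shows one way predict_and_save_submission could be written. It assumes the submission file has two columns, id and trip_duration, matching the dataset's identifier and target columns, and that the model predicts durations directly in seconds.

```python
import pandas as pd


def predict_and_save_submission(model, test_features, test_ids, filename):
    """Predict trip durations for the test set and write them to a CSV file."""
    predictions = model.predict(test_features)
    # Assumed submission layout: one row per trip, keyed by its id.
    submission = pd.DataFrame({"id": test_ids, "trip_duration": predictions})
    submission.to_csv(filename, index=False)
    return submission
```

Any object with a predict method works here, including a fitted GridSearchCV, which delegates to its best estimator.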
The final model achieved an R² score of 0.67 on the validation set.
The project demonstrates a comprehensive approach to feature engineering and model training for predicting taxi trip durations in New York City.