Real-time prediction of used car prices using Machine Learning models and Streamlit for visualization
Automotive Industry, Data Science, Machine Learning
The objective is to enhance the CarDekho customer experience and streamline the pricing process by leveraging machine learning. This requires creating an accurate and user-friendly Streamlit tool that predicts the prices of used cars based on various features. The tool should be deployed as an interactive web application that both customers and sales representatives can use seamlessly.
We have historical data on used car prices from CarDekho, including various features such as make, model, year, fuel type, transmission type, and other relevant attributes from different cities. Your task as a data scientist is to develop a machine learning model that can accurately predict the prices of used cars based on these features. The model should be integrated into a Streamlit-based web application to allow users to input car details and receive an estimated price instantly.
i) Imported each city’s dataset, which is in an unstructured format. ii) Converted it into a structured format. iii) Added a new column named ‘Location’ and assigned each row the name of its city. iv) Concatenated all datasets into a single dataset.
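The tagging and concatenation steps above can be sketched as follows. This is a minimal illustration and not the project's actual loader: the helper name `combine_city_frames`, the toy columns, and the city names are assumptions, and each city's data is assumed to be already converted to a structured pandas DataFrame.

```python
import pandas as pd


def combine_city_frames(frames_by_city):
    """Tag each city's rows with a 'Location' column and stack them."""
    tagged = []
    for city, frame in frames_by_city.items():
        # assign() returns a copy, so the per-city frames stay untouched
        tagged.append(frame.assign(Location=city))
    return pd.concat(tagged, ignore_index=True)


# Hypothetical toy data standing in for the structured per-city tables
frames = {
    "Chennai": pd.DataFrame({"model": ["Swift", "i20"], "price": [4.5, 5.2]}),
    "Delhi": pd.DataFrame({"model": ["City"], "price": [7.0]}),
}
combined = combine_city_frames(frames)
```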
i) For numerical columns, imputed missing values with the mean or median. ii) For categorical columns, used mode imputation or labeled missing values as 'undefined' or 'other'.
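One possible way to express this imputation with pandas is shown below. The function name `impute_missing`, its parameters, and the sample columns are illustrative assumptions, not names taken from the project.

```python
import numpy as np
import pandas as pd


def impute_missing(df, numeric_strategy="median", categorical_strategy="mode",
                   fill_label="undefined"):
    """Fill numeric gaps with mean/median and categorical gaps with mode or a label."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            fill = out[col].median() if numeric_strategy == "median" else out[col].mean()
        else:
            modes = out[col].mode()  # mode() ignores missing values by default
            use_mode = categorical_strategy == "mode" and not modes.empty
            fill = modes.iloc[0] if use_mode else fill_label
        out[col] = out[col].fillna(fill)
    return out


# Hypothetical sample with one gap per column
raw = pd.DataFrame({
    "km": [10000.0, np.nan, 30000.0],
    "fuel": ["Petrol", None, "Petrol"],
})
cleaned = impute_missing(raw)
```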
i) Checked all data types and converted columns to the correct format where needed. (1) E.g., if a data point is a string such as ‘70 kms’, removed the unit ‘kms’ and changed the data type from string to integer.
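The unit-stripping step could look roughly like this. The helper `strip_unit_to_int` is a hypothetical name, and the sketch assumes the column holds whole numbers only (a decimal value such as a mileage figure would need a different pattern).

```python
import pandas as pd


def strip_unit_to_int(series):
    """Remove every non-digit character (units, commas, spaces) and cast to integer."""
    digits = series.astype(str).str.replace(r"[^\d]", "", regex=True)
    # errors="coerce" turns strings with no digits into missing values;
    # the nullable Int64 dtype can hold those alongside integers
    return pd.to_numeric(digits, errors="coerce").astype("Int64")


km = strip_unit_to_int(pd.Series(["70 kms", "1,20,000 kms"]))
```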
i) Used one-hot encoding for nominal categorical variables.
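As a small illustration of one-hot encoding, `pandas.get_dummies` expands a nominal column into one indicator column per category. The column names and values here are made up for the example.

```python
import pandas as pd

# Hypothetical sample: 'fuel' is nominal, 'km' is numeric and left as-is
cars = pd.DataFrame({"fuel": ["Petrol", "Diesel", "CNG"], "km": [10, 20, 30]})

# Each distinct fuel value becomes its own 0/1 column named fuel_<value>
encoded = pd.get_dummies(cars, columns=["fuel"], dtype=int)
```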
i) Applied Min-Max Scaling or Standard Scaling to numerical features.
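The difference between the two scalers can be seen on a tiny array. This sketch uses scikit-learn's `MinMaxScaler` and `StandardScaler` on invented values; in a real pipeline the scaler would normally be fitted on the training split only and then applied to the test split.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single numeric feature (one column, three rows)
X = np.array([[10.0], [20.0], [30.0]])

# Min-Max scaling maps the column onto the range [0, 1]
minmax = MinMaxScaler().fit_transform(X)

# Standard scaling centres the column at 0 with unit variance
standard = StandardScaler().fit_transform(X)
```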
i) Used Z-score analysis to remove outliers.
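A minimal sketch of Z-score filtering is given below. The cutoff of 3 standard deviations is a common convention and an assumption here, since the threshold used in the project is not stated; the helper name and sample prices are also illustrative.

```python
import pandas as pd


def drop_zscore_outliers(df, column, threshold=3.0):
    """Keep rows whose value lies within `threshold` standard deviations of the mean."""
    z = (df[column] - df[column].mean()) / df[column].std(ddof=0)
    return df[z.abs() <= threshold]


# Hypothetical prices: twenty typical values and one extreme listing
prices = pd.DataFrame({"price": [5.0] * 20 + [500.0]})
kept = drop_zscore_outliers(prices, "price")
```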
i) Computed summary statistics such as mean, median, mode, and standard deviation.
i) Used scatter plots, histograms, box plots, and correlation heatmaps.
i) Used techniques like correlation analysis, feature importance from models, and domain knowledge.
i) Common split ratios are 70-30 and 80-20; this project used 80-20.
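An 80-20 split can be produced with scikit-learn's `train_test_split`. The arrays below are placeholders, and the `random_state` value is an arbitrary choice made only for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and target: 50 rows, 2 feature columns
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# test_size=0.2 holds out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```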
i) Used Linear Regression, Decision Trees, Random Forests, and XGBoost.
i) Used cross-validation techniques to ensure robust performance.
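Comparing several regressors under cross-validation might be sketched as below. Synthetic data from `make_regression` stands in for the car dataset, and the fold count of 5 is an assumption. XGBoost is left out only to keep the sketch free of an extra dependency; `xgboost.XGBRegressor` follows the same scikit-learn estimator interface and could be added to the dictionary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared feature matrix and price target
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean R2 over 5 folds for each candidate model
cv_r2 = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
```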
i) Used Random Search to tune hyperparameters.
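A random search over a Random Forest could be set up as follows with scikit-learn's `RandomizedSearchCV`. The parameter grid, iteration count, fold count, and scoring metric are illustrative assumptions, not the values used in the project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data
X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

# Assumed search space; each iteration samples one combination at random
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                            # number of sampled combinations
    cv=3,                                # folds per combination
    scoring="neg_mean_absolute_error",   # higher (closer to 0) is better
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```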
i) Used Mean Absolute Error (MAE), Mean Squared Error (MSE), R-squared (R2), and Mean Absolute Percentage Error (MAPE).
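The four metrics can be computed with `sklearn.metrics` as in the short example below. The true and predicted values are invented for illustration.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

# Invented true and predicted prices
y_true = np.array([4.0, 5.0, 10.0])
y_pred = np.array([5.0, 5.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)    # average absolute error
mse = mean_squared_error(y_true, y_pred)     # average squared error
r2 = r2_score(y_true, y_pred)                # share of variance explained
# scikit-learn returns MAPE as a fraction, so 0.15 means 15%
mape = mean_absolute_percentage_error(y_true, y_pred)
```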
i) Used domain knowledge and exploratory data analysis insights.
i) Lasso (L1) and Ridge (L2) regularization.
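The practical difference between the two penalties is that L1 can drive coefficients exactly to zero while L2 only shrinks them. A small sketch on synthetic data, with `alpha=1.0` chosen arbitrarily for both models:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only 3 of the 10 features carry signal
X, y = make_regression(
    n_samples=100, n_features=10, n_informative=3, noise=5.0, random_state=0
)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

# Count coefficients that were set exactly to zero by each model
n_zero_lasso = int((lasso.coef_ == 0).sum())
n_zero_ridge = int((ridge.coef_ == 0).sum())
```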
i) Allow users to input car features and get real-time price predictions.
i) Provide clear instructions and error handling.
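A possible shape for the input form and its error handling is sketched below. Everything named here is hypothetical: the field names, the `FUEL_TYPES` list, the year bounds, and the `predict_fn` callable (standing in for the trained model) are assumptions, not the project's actual app.

```python
import datetime

FUEL_TYPES = ["Petrol", "Diesel", "CNG", "Electric"]  # assumed categories


def validate_inputs(features):
    """Return a list of readable problems; an empty list means the input is usable."""
    errors = []
    current_year = datetime.date.today().year
    if not 1980 <= features["year"] <= current_year:
        errors.append(f"Year must be between 1980 and {current_year}.")
    if features["km_driven"] < 0:
        errors.append("Kilometres driven cannot be negative.")
    if features["fuel_type"] not in FUEL_TYPES:
        errors.append("Unknown fuel type.")
    return errors


def render_app(predict_fn):
    """Draw the form; predict_fn is a hypothetical callable mapping a feature dict to a price."""
    import streamlit as st  # imported lazily so validate_inputs works without Streamlit

    st.title("Used Car Price Prediction")
    year = st.number_input("Manufacturing year", min_value=1980,
                           max_value=datetime.date.today().year, value=2018, step=1)
    km = st.number_input("Kilometres driven", min_value=0, value=30000, step=1000)
    fuel = st.selectbox("Fuel type", FUEL_TYPES)

    if st.button("Predict price"):
        features = {"year": int(year), "km_driven": int(km), "fuel_type": fuel}
        problems = validate_inputs(features)
        if problems:
            for problem in problems:
                st.error(problem)  # show each problem instead of failing silently
        else:
            st.success(f"Estimated price: {predict_fn(features):,.2f}")
```

In a script launched with `streamlit run`, `render_app` would be called at module level with the trained model's prediction function.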
The best-performing models are Random Forest and XGBoost.
You can view the full notebook with detailed analysis and code here.