mscs_ml_student_performance

This project predicts students' final grades (G3) using machine learning based on their demographics and background.

Predicting Student Performance: An Analytical Approach

This repository contains the code and analysis for predicting student performance based on various factors using multiple machine learning algorithms.

Project Overview

The objective of this project is to predict students' final grades (G3) based on their demographic information, family background, educational support, extracurricular activities, and personal attributes.

Dataset

The dataset includes information about students such as:

  • Demographics: school, sex, age, address
  • Family Background: famsize, Pstatus, Medu, Fedu, Mjob, Fjob, guardian
  • Educational Support: traveltime, studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic
  • Personal Information: famrel, freetime, goout, Dalc, Walc, health, absences
  • Grades: G1, G2, G3

Exploratory Data Analysis (EDA)

The EDA involves:

  • Visualizing the distribution of numerical features
  • Visualizing categorical features
  • Analyzing feature relationships with the target variable (G3)
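
The three EDA steps above can be sketched with pandas alone. This is a minimal illustration, not the repository's own code: the rows are invented toy values, and only the column names (`sex`, `studytime`, `absences`, `G3`) come from the dataset description.

```python
import pandas as pd

# Toy rows: column names follow the dataset description, values are invented.
df = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F", "M"],
    "studytime": [2, 1, 3, 2, 4, 1],
    "absences": [4, 10, 0, 6, 2, 12],
    "G3": [12, 8, 15, 10, 17, 7],
})

# Distribution of numerical features: count, mean, std, and quartiles.
numeric_summary = df.describe()

# Distribution of a categorical feature: frequency of each level.
sex_counts = df["sex"].value_counts()

# Relationship with the target: mean G3 per category, plus the linear
# correlation of each numerical feature with G3.
g3_by_sex = df.groupby("sex")["G3"].mean()
g3_corr = df.select_dtypes("number").corr()["G3"].drop("G3")

print(numeric_summary)
print(sex_counts)
print(g3_by_sex)
print(g3_corr)
```

The same summaries can be passed to a plotting library to produce histograms and bar charts.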

Model Selection

We evaluated eight machine learning classifiers to find the best model for predicting student performance:

  1. Logistic Regression
  2. Decision Tree Classifier
  3. Support Vector Machine
  4. Random Forest Classifier
  5. AdaBoost Classifier
  6. Gradient Boosting Classifier
  7. K Neighbors Classifier
  8. Gaussian Naive Bayes
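
One way to hold the eight candidates is a name-to-estimator dictionary, sketched below with scikit-learn. The hyperparameters shown (`max_iter`, the `SEED` value) are assumptions for this illustration. The settings actually used in the project are not stated here.

```python
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

SEED = 42  # assumed seed, used only to make this sketch reproducible

# Keys mirror the list above; every value is an unfitted estimator.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=SEED),
    "Support Vector Machine": SVC(),
    "Random Forest Classifier": RandomForestClassifier(random_state=SEED),
    "AdaBoost Classifier": AdaBoostClassifier(random_state=SEED),
    "Gradient Boosting Classifier": GradientBoostingClassifier(random_state=SEED),
    "K Neighbors Classifier": KNeighborsClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
}

for name in models:
    print(name)
```

Keeping the models in one dictionary lets a single loop fit and score all of them in the same way.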

StudentPerformanceAnalyzer Class

This class is designed to:

  • Load the dataset and initialize models
  • Preprocess the data: encode categorical features, split into training and test sets, and standardize the features
  • Train various classifiers and evaluate their metrics
  • Compare model performance
  • Plot feature importances for selected models
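
The preprocessing step in the list above could look roughly like the following. This is a sketch under assumptions, not the class's actual code: the function name `preprocess`, one-hot encoding via `pd.get_dummies`, the 25% test split, and the seed are all hypothetical choices, and the toy rows are invented.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def preprocess(df, target="G3", test_size=0.25, seed=42):
    """Encode categoricals, split into train/test, and standardize features."""
    # One-hot encode categorical columns (an assumed encoding scheme).
    X = pd.get_dummies(df.drop(columns=[target]), drop_first=True, dtype=float)
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    # Fit the scaler on the training split only, so that no test-set
    # statistics leak into training.
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    return X_train, X_test, y_train, y_test


# Toy rows: column names follow the dataset description, values are invented.
toy = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "studytime": [2, 1, 3, 2, 4, 1, 3, 2],
    "absences": [4, 10, 0, 6, 2, 12, 1, 8],
    "G3": [12, 8, 15, 10, 17, 7, 14, 9],
})
X_train, X_test, y_train, y_test = preprocess(toy)
print(X_train.shape, X_test.shape)
```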

Model Training and Evaluation

Each model was trained on the training data and evaluated on the testing data. Evaluation metrics include accuracy, precision, recall, and F1 score.
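
To make the four metrics concrete, here is a small pure-Python sketch. Because G3 has more than two classes, precision, recall, and F1 must be averaged across classes. The averaging scheme is not stated above, so macro averaging (an unweighted mean over classes) is an assumption here, and the labels are invented.

```python
def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Undefined ratios (zero denominator) are scored as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Invented grades for six students: four of six predictions are correct.
y_true = [10, 10, 12, 12, 15, 15]
y_pred = [10, 12, 12, 12, 15, 10]
scores = macro_metrics(y_true, y_pred)
print(scores)
```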

Analysis Results

Model Performance

  • Gradient Boosting Classifier: Highest accuracy (0.515385)
  • Random Forest Classifier: Moderate accuracy (0.446154)
  • Decision Tree Classifier: Lower accuracy (0.384615)
  • Other Models: Showed lower performance

Feature Importance

  • Gradient Boosting Classifier: Top features are G2, G1, absences, age, and freetime
  • Random Forest and Decision Tree Classifiers: Similar top features, emphasizing G1, G2, absences, freetime, and parental education (Medu, Fedu)
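
Rankings like these can be read from the `feature_importances_` attribute that scikit-learn's tree-based models expose after fitting. The sketch below uses synthetic data in which the target depends only on the first column. It illustrates the mechanism and does not reproduce the project's results. The column names are borrowed from the dataset only as labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300
feature_names = ["G2", "absences", "freetime"]

# Synthetic data: only the first column carries signal about the target.
X = np.column_stack([
    rng.integers(0, 21, n),  # stands in for G2
    rng.integers(0, 30, n),  # stands in for absences (pure noise here)
    rng.integers(1, 6, n),   # stands in for freetime (pure noise here)
])
y = (X[:, 0] >= 10).astype(int)  # pass/fail derived from the first column

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each feature with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances sum to 1 and describe how much the fitted model relies on each feature, which is not the same as a causal effect.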

Insights and Recommendations

  1. Importance of Continuous Assessment:

    • G1 and G2 are crucial for predicting G3.
    • Implement regular assessments and feedback.
  2. Attendance and Engagement:

    • Absences and free time rank among the most important features for predicting performance.
    • Improve attendance and participation programs.
  3. Family Background:

    • Parental education is among the top features for the Random Forest and Decision Tree models.
    • Engage families and provide support for equal opportunities.
  4. Holistic Approach:

    • Address academic, social, and emotional needs for student success.
