Energy-Data-Analytics-ML

Analyzing global data on sustainable energy, predicting CO2 emissions per capita


Energy Data Analytics

Description:

The project analyzes global energy data, derives insights from it, and presents the findings as clear visualizations. It also includes a deployed Streamlit app with an interactive dashboard, and several machine learning models that were trained and evaluated to predict CO2 emissions.

You can find the dataset on Kaggle with a detailed description.

Python libraries used:

  • pandas: data cleaning and transformation
  • plotly: visualizing data, creating various charts
  • scikit-learn: ML models and evaluation metrics
  • streamlit: creating an online dashboard

Project overview

1. Data cleaning and transformation

Checking the dataset for missing values and duplicates, shortening column names and adding new derived columns.
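
The cleaning step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy DataFrame stands in for the Kaggle CSV, and the column names, the short names and the derived `renewables_share` column are all assumptions made for the example.

```python
import pandas as pd

# Toy frame standing in for the Kaggle CSV; names and values are illustrative only.
df = pd.DataFrame({
    "Entity": ["A", "A", "B", "B"],
    "Year": [2019, 2019, 2019, 2020],
    "Electricity from renewables (TWh)": [10.0, 10.0, 5.0, None],
    "Electricity from fossil fuels (TWh)": [30.0, 30.0, 15.0, 20.0],
})

# Count missing values per column and fully duplicated rows.
missing_per_column = df.isna().sum()
duplicate_rows = int(df.duplicated().sum())

# Drop duplicates and shorten the long column names (hypothetical short names).
clean = df.drop_duplicates().rename(columns={
    "Entity": "country",
    "Year": "year",
    "Electricity from renewables (TWh)": "renewables_twh",
    "Electricity from fossil fuels (TWh)": "fossil_twh",
})

# Derived column (an assumption): renewables as a share of the two sources.
clean["renewables_share"] = clean["renewables_twh"] / (
    clean["renewables_twh"] + clean["fossil_twh"]
)

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(clean)
```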

2. Exploratory Data Analysis

Answering questions regarding the dataset by transforming and filtering the data, and creating various types of charts:

  • Top 10 countries by the amount of electricity generated from renewable sources
  • Correlation matrix of all numerical variables
  • Distribution of renewable energy share in countries on each continent
  • CO2 emissions per capita vs. GDP per capita
  • Distribution of the amount of renewable energy produced in the European Union
  • Dynamics of electricity generation in the European Union
  • Energy intensity level of primary energy vs. renewable energy share in power generation in the EU
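
Two of the questions above (a top-N ranking and the correlation matrix) might be prepared with pandas along these lines. This is a sketch under assumptions: the column names, the toy values and the helper `top_countries` are hypothetical and do not come from the project.

```python
import pandas as pd

# Illustrative data only; the real dataset has many more columns and rows.
df = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "year": [2019, 2020, 2019, 2020, 2019, 2020],
    "renewables_twh": [10.0, 12.0, 50.0, 55.0, 1.0, 2.0],
    "co2_per_capita": [5.0, 4.8, 9.0, 8.5, 1.0, 1.1],
})


def top_countries(frame, year, column, n=10):
    """Return the n countries with the largest value of `column` in `year`."""
    in_year = frame[frame["year"] == year]
    return in_year.nlargest(n, column)[["country", column]].reset_index(drop=True)


top2 = top_countries(df, 2020, "renewables_twh", n=2)

# Pairwise Pearson correlations between all numeric columns.
corr = df.select_dtypes("number").corr()

print(top2)
print(corr)
```

The resulting frames could then be passed to a plotting library such as plotly to draw the bar chart and the heatmap.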

After that, the relevant subset of the data is saved so that it can be used by the Streamlit dashboard.

3. Machine Learning - Predicting CO2 emissions

  • Pre-processing data: dropping irrelevant columns, removing rows with missing values and duplicates, encoding categorical data into numeric.
  • Feature Engineering: calculating features' correlations with the target variable and removing irrelevant ones, dropping some features that highly correlate with each other to avoid multicollinearity.
  • Modeling: splitting the data into training and testing sets, scaling the features, creating and evaluating machine learning models.
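
The three steps above could look roughly like this with scikit-learn. It is a sketch on synthetic data, not the project's pipeline: the feature names `x1`, `x2`, `x3`, the correlation thresholds (0.2 and 0.9), the 80/20 split and the model settings are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: x2 is a near copy of x1, x3 is unrelated noise.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.99 * x1 + rng.normal(scale=0.01, size=n)
x3 = rng.normal(size=n)
target = 3.0 * x1 + rng.normal(scale=0.1, size=n)
data = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "target": target})

# Keep features whose absolute correlation with the target exceeds a threshold.
corr_with_target = data.drop(columns="target").corrwith(data["target"]).abs()
kept = corr_with_target[corr_with_target > 0.2].index.tolist()

# Of each highly correlated feature pair, drop the second one.
feature_corr = data[kept].corr().abs()
to_drop = set()
for i, a in enumerate(kept):
    for b in kept[i + 1:]:
        if a not in to_drop and b not in to_drop and feature_corr.loc[a, b] > 0.9:
            to_drop.add(b)
features = [c for c in kept if c not in to_drop]

# Split first, then fit the scaler on the training set only to avoid leakage.
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["target"], test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(scaler.transform(X_train), y_train)

# Coefficient of determination on the held-out test set.
r2 = model.score(scaler.transform(X_test), y_test)
print("selected features:", features)
print("test R^2:", round(r2, 3))
```

Tree-based models such as Random Forest do not need scaled inputs; the scaler is kept here because it matters for models like K Neighbors Regression, which the project also trains.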

Five machine learning models are trained: Linear Regression, Random Forest Regression, Gradient Boosting Regression, K Neighbors Regression and XGBoost Regression. The models are evaluated with several metrics: median absolute error, mean absolute error, mean squared error, root mean squared error and the coefficient of determination (R²).
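
As an illustration, the five metrics listed above can be computed with scikit-learn as follows. The helper `regression_report` and the toy arrays are hypothetical and only show how the numbers relate to each other.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    median_absolute_error,
    r2_score,
)


def regression_report(y_true, y_pred):
    """Collect the five evaluation metrics in one dictionary."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "median_ae": median_absolute_error(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),  # RMSE is the square root of MSE
        "r2": r2_score(y_true, y_pred),
    }


# Toy values: three perfect predictions and one that is off by 2.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])
report = regression_report(y_true, y_pred)
print(report)
```

In this toy case the median absolute error is 0 while the mean absolute error is 0.5, which shows why the two are reported side by side: the median is insensitive to a single large miss.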

Based on the evaluation metrics, Random Forest Regression outperformed the other models, showing the lowest errors and the highest coefficient of determination.

4. Streamlit Dashboard

Creating an interactive dashboard with:

  • Slider and checkbox filters for choosing a particular year, continent and type of energy
  • Colored map representing the amount of electricity generation in each country for a selected year
  • Pie chart with the energy distribution for a selected continent
  • Bar chart illustrating top 5 energy producers for a selected continent and year
  • Button to download the filtered dataset
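
The data logic behind such widgets might be sketched as below. This is not the code of energy_dashboard.py: the column names, the toy values and the helpers `filter_data`, `top_producers` and `to_csv_bytes` are assumptions. In a Streamlit app the selected year and continent would typically come from widgets such as `st.slider` and `st.selectbox`, and the CSV bytes could be handed to `st.download_button`.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["Europe", "Europe", "Europe", "Asia"],
    "year": [2020, 2020, 2019, 2020],
    "electricity_twh": [100.0, 250.0, 80.0, 900.0],
})


def filter_data(frame, year, continent):
    """Keep only the rows for the selected year and continent."""
    mask = (frame["year"] == year) & (frame["continent"] == continent)
    return frame[mask]


def top_producers(frame, n=5):
    """Return the n largest electricity producers in the given frame."""
    return frame.nlargest(n, "electricity_twh")[["country", "electricity_twh"]]


def to_csv_bytes(frame):
    """Serialize a frame to UTF-8 CSV bytes, suitable for a download button."""
    return frame.to_csv(index=False).encode("utf-8")


# In the dashboard these two values would be chosen by the user.
selected = filter_data(df, year=2020, continent="Europe")
top1 = top_producers(selected, n=1)
csv_bytes = to_csv_bytes(selected)
print(selected)
print(top1)
```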

Notes

How to run the code:

  • Install the packages listed in requirements.txt (for example, with pip install -r requirements.txt)
  • Download the dataset global-data-on-sustainable-energy.csv from Kaggle
  • If you change the file name, also update the data path in the code: pd.read_csv('<YOUR DATA PATH>')
  • Download the processed dataset for the dashboard
  • Run the Streamlit dashboard by typing streamlit run energy_dashboard.py in the terminal