This project addresses the challenge of distinguishing between real and fake news articles using Natural Language Processing (NLP) techniques and machine learning algorithms. Our goal is to develop a classifier that can accurately identify fake news, contributing to the ongoing efforts to combat misinformation.
To run this project, you'll need Python and Jupyter Notebook installed. Follow these steps:
Ensure you have Jupyter Notebook installed. If not, you can install it using:
pip install jupyter
Install the required packages. You can do this directly in a code cell within the notebook:
!pip install pandas numpy scikit-learn spacy
Download the spaCy model. Run this in a code cell:
!python -m spacy download en_core_web_lg
Ensure you have the "Fake_Real_Data.csv" file in the same directory as the notebook.
Start Jupyter Notebook:
jupyter notebook
Open the "Fake_News_Classification.ipynb" file in the Jupyter interface.
Run the cells in order, following the instructions within the notebook.
The notebook guides you through the following steps, starting with loading and exploring the dataset:
import pandas as pd
# Read the dataset
df = pd.read_csv("Fake_Real_Data.csv")
# Print the shape of the dataframe
print(df.shape)
# Print top 5 rows
df.head(5)
# Check the distribution of labels
df['label'].value_counts()
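Later steps split on a numeric label_num column that the snippet above does not create. A minimal mapping sketch, assuming the label column holds exactly the strings "Fake" and "Real":

import pandas as pd
# Map text labels to integers for scikit-learn
# (assumes the two label values are exactly 'Fake' and 'Real')
df['label_num'] = df['label'].map({'Fake': 0, 'Real': 1})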
We use spaCy's en_core_web_lg model to create word embeddings:
import spacy
nlp = spacy.load("en_core_web_lg")
# This will take some time (nearly 15 minutes)
df['vector'] = df['Text'].apply(lambda text: nlp(text).vector)
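Applying nlp() row by row is the slow part. As an optional speed-up, spaCy's nlp.pipe batches documents and can skip pipeline components that static vectors don't need; a sketch under those assumptions:

# Faster variant: batch the texts and disable the tagger/parser/NER,
# which are not needed to look up static en_core_web_lg vectors
df['vector'] = [
    doc.vector
    for doc in nlp.pipe(df['Text'], batch_size=64, disable=["tagger", "parser", "ner"])
]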
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.vector.values,
    df.label_num,
    test_size=0.2,
    random_state=2022
)
import numpy as np
X_train_2d = np.stack(X_train) # converting to 2d numpy array
X_test_2d = np.stack(X_test)
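A quick sanity check on the shapes (the exact row counts depend on your copy of the dataset; 300 is the en_core_web_lg vector width):

# Each row is one document, each column one embedding dimension
print(X_train_2d.shape, X_test_2d.shape)  # e.g. (7920, 300) and (1980, 300)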
We implement and compare two models: Multinomial Naive Bayes and K-Nearest Neighbors (KNN). First, Naive Bayes:
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
# MultinomialNB requires non-negative features, so scale the embeddings to [0, 1]
scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)
clf = MultinomialNB()
clf.fit(scaled_train_embed, y_train)
from sklearn.metrics import classification_report
y_pred = clf.predict(scaled_test_embed)
print(classification_report(y_test, y_pred))
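To see where the errors fall, a confusion matrix complements the report; a minimal sketch:

from sklearn.metrics import confusion_matrix
# Rows are true labels, columns are predicted labels
print(confusion_matrix(y_test, y_pred))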
from sklearn.neighbors import KNeighborsClassifier
# 5-nearest-neighbors classifier with Euclidean distance on the raw embeddings
clf = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
clf.fit(X_train_2d, y_train)
y_pred = clf.predict(X_test_2d)
print(classification_report(y_test, y_pred))
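A trained classifier can score unseen text by embedding it the same way. A minimal sketch with a made-up headline (the text is illustrative only, and 0/1 follow the label mapping sketched earlier):

# Embed one new article and reshape to a (1, 300) batch for predict()
sample_vec = nlp("Scientists announce breakthrough in battery technology").vector.reshape(1, -1)
print(clf.predict(sample_vec))  # 0 = Fake, 1 = Real under the assumed mapping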
The notebook presents classification reports for both models.

Multinomial Naive Bayes:
              precision    recall  f1-score   support

           0       0.95      0.94      0.95      1024
           1       0.94      0.95      0.94       956

    accuracy                           0.94      1980
   macro avg       0.94      0.94      0.94      1980
weighted avg       0.94      0.94      0.94      1980
K-Nearest Neighbors:

              precision    recall  f1-score   support

           0       1.00      0.99      0.99      1024
           1       0.99      0.99      0.99       956

    accuracy                           0.99      1980
   macro avg       0.99      0.99      0.99      1980
weighted avg       0.99      0.99      0.99      1980
Effective Vectorization: spaCy's pre-trained GloVe embeddings provide rich 300-dimensional vectors that capture semantic relationships well.
Model Performance: KNN (99% accuracy) clearly outperformed Multinomial Naive Bayes (94% accuracy) on these embeddings.
Preprocessing Impact: Pre-trained GloVe embeddings significantly enhanced both models' performance, especially KNN.
Time Consideration: Computing the GloVe embeddings is time-consuming (about 15 minutes for this dataset), but it yields high-quality feature representations; a caching sketch follows below.
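One practical way to avoid recomputing the embeddings on every run is to cache them to disk; a minimal sketch (the file name embeddings.npy is hypothetical):

import numpy as np
# Save the stacked embedding matrix once...
np.save("embeddings.npy", np.stack(df['vector'].values))
# ...and on later runs load it instead of re-running nlp over every text
# X = np.load("embeddings.npy")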
Note: This project is for educational purposes. Always critically evaluate news sources and cross-reference information, regardless of model predictions.