Arabic-News-Article-Classification

Automatic document categorization consists of assigning a category to a text based on the information it contains. We follow several supervised machine learning approaches.

Arabic News Article Classification

Based on: Building TALAA, a Free General and Categorized Arabic Corpus

University of Science and Technology Houari Boumediene, Algiers, Algeria


Corpus

"The TALAA corpus is a voluminous general Arabic corpus, built from daily Arabic newspaper websites. The corpus is a collection of more than 14 million words with 15,891,729 tokens contained in 57,827 different articles." [1]

Description of the TALAA corpus [1]:

| Feature | Corpus |
|---|---|
| Nb. of articles | 57,827 |
| Nb. of categories | 8 |
| Nb. of words | 14,068,407 |
| Nb. of types | 582,531 |
| Nb. of tokens | 15,891,729 |

The corpus is distributed across 8 categories [1]:

| Category | Nb. of articles |
|---|---|
| Culture | 5322 |
| Economic | 8768 |
| Politics | 9620 |
| Religion | 4526 |
| Society | 9744 |
| Sports | 9103 |
| World | 6344 |
| Other | 4400 |
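
The counts above are uneven: the largest category (Society) holds a little over twice as many articles as the smallest (Other). A minimal sketch for checking the total and each category's share, using only the figures from the table; the variable names are illustrative:

```python
# Article counts per category, copied from the table above.
counts = {
    "Culture": 5322,
    "Economic": 8768,
    "Politics": 9620,
    "Religion": 4526,
    "Society": 9744,
    "Sports": 9103,
    "World": 6344,
    "Other": 4400,
}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Print categories from largest to smallest share of the corpus.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<10} {counts[name]:>6} {share:6.1%}")
print(f"{'Total':<10} {total:>6}")
```

The total matches the 57,827 articles reported for the corpus.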

Pre-processing

The following data pre-processing steps have been performed:

0. Example:

1. Tokenization

Each collected article was segmented into tokens using NLTK.

2. Removing stopwords

Stopwords were removed from the tokenized text, using a complete and reviewed list of 750 stop words.

3. Stemming

Each word was stemmed using the Farasa Arabic text processing toolkit.
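
The three steps above can be sketched as a single function. This is a dependency-free illustration, not the project's code. The regular expression stands in for the NLTK tokenizer (the README does not say which NLTK tokenizer was used). `STOPWORDS` is a hypothetical four-word list rather than the 750-word list. The `stem` argument is a placeholder hook where a Farasa-based stemmer would be plugged in; by default it leaves words unchanged.

```python
import re

# Assumption: a simple regex tokenizer (runs of word characters, or runs of
# punctuation) standing in for the NLTK tokenizer used by the project.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")

# Hypothetical mini stop list for illustration only; NOT the 750-word list.
STOPWORDS = {"في", "من", "على", "إلى"}


def preprocess(text, stopwords=STOPWORDS, stem=lambda word: word):
    """Tokenize, drop stop words, then stem each remaining token."""
    tokens = TOKEN_RE.findall(text)                     # 1. tokenization
    kept = [t for t in tokens if t not in stopwords]    # 2. stop word removal
    return [stem(t) for t in kept]                      # 3. stemming (identity by default)


sample = "ذهب الولد إلى المدرسة في الصباح."
result = preprocess(sample)
print(result)
```

Passing the stemmer as a callable keeps the sketch runnable without Farasa installed, and makes it easy to compare results with and without stemming.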


Dataset

Categories = { : Algeria, : entertainment, : religion, : society, : sport, : world}


Machine Learning Models

Several machine learning algorithms were evaluated:

| Algorithm | Precision | Recall | F-measure |
|---|---|---|---|
| Decision Tree | 0.82 | 0.84 | 0.83 |
| SVM (SGD) | 0.94 | 0.94 | 0.94 |
| Naive Bayes | 0.89 | 0.87 | 0.88 |
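
As a rough illustration of the best-scoring setup, the sketch below trains a linear SVM with stochastic gradient descent using scikit-learn. Several things here are assumptions rather than details from this project: the use of scikit-learn itself, TF-IDF features, the default hyperparameters, and the toy English documents, which stand in for pre-processed Arabic articles.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Made-up toy documents standing in for pre-processed articles.
docs = [
    "team wins football match",
    "player scores goal in final",
    "coach praises team after match",
    "prayer and fasting during the holy month",
    "mosque holds sermon on charity",
    "scholars discuss pilgrimage and prayer",
]
labels = ["sport", "sport", "sport", "religion", "religion", "religion"]

# loss="hinge" makes SGDClassifier fit a linear SVM.
# TF-IDF features are an assumption; the README does not state the features used.
model = make_pipeline(
    TfidfVectorizer(),
    SGDClassifier(loss="hinge", random_state=0),
)
model.fit(docs, labels)

predictions = model.predict(["team scores late goal", "sermon about prayer"])
print(predictions)
```

Swapping `SGDClassifier` for `DecisionTreeClassifier` or `MultinomialNB` in the same pipeline would give comparable baselines for the other two rows of the table.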

Evaluation (Confusion matrix)

Confusion matrix for the best model, SVM with Stochastic Gradient Descent:
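
For readers who want to relate a confusion matrix to the precision, recall, and F-measure columns reported earlier, here is a minimal sketch of the per-class computation. The matrix values are made up for three of the dataset's classes and are not this project's results; `per_class_scores` is an illustrative helper name.

```python
def per_class_scores(matrix, labels):
    """Return {label: (precision, recall, f_measure)}.

    matrix[i][j] is the number of articles whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    scores = {}
    for k, label in enumerate(labels):
        true_positives = matrix[k][k]
        predicted = sum(row[k] for row in matrix)  # column sum: predicted as k
        actual = sum(matrix[k])                    # row sum: truly k
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f_measure)
    return scores


# Made-up counts for illustration only.
labels = ["religion", "sport", "world"]
matrix = [
    [8, 1, 1],
    [0, 9, 1],
    [2, 0, 8],
]

scores = per_class_scores(matrix, labels)
for label, (p, r, f) in scores.items():
    print(f"{label:<10} P={p:.2f} R={r:.2f} F={f:.2f}")
```

The F-measure here is the harmonic mean of precision and recall, which is consistent with the values in the results table (for example, 0.82 and 0.84 give about 0.83).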


TODO


Contributing


Credits

Related Projects