Arabic News Article Classification

Based on: Building TALAA, a Free General and Categorized Arabic Corpus

University of Science and Technology Houari Boumediene, Algiers, Algeria

Corpus

"The TALAA corpus is a voluminous general Arabic corpus, built from daily Arabic newspaper websites. The corpus is a collection of more than 14 million words with 15,891,729 tokens contained in 57,827 different articles." [1]

Description of the TALAA corpus [1]:

Feature	TALAA corpus
Number of articles	57,827
Number of categories	8
Number of words	14,068,407
Number of types	582,531
Number of tokens	15,891,729

The corpus is distributed across 8 categories [1]:

Category	Number of articles
Culture	5322
Economic	8768
Politics	9620
Religion	4526
Society	9744
Sports	9103
World	6344
Other	4400
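
As a quick sanity check, the per-category counts can be summed and turned into shares of the corpus. The snippet below is only an illustrative sketch: the counts are copied from the table above, while the names `articles_per_category` and `category_shares` are invented for this example. The counts add up to the 57,827 articles reported for the corpus, with Society the largest category at roughly 17% and Other the smallest at under 8%.

```python
# Article counts per category, copied from the table above.
articles_per_category = {
    "Culture": 5322,
    "Economic": 8768,
    "Politics": 9620,
    "Religion": 4526,
    "Society": 9744,
    "Sports": 9103,
    "World": 6344,
    "Other": 4400,
}

total = sum(articles_per_category.values())


def category_shares(counts):
    """Return each category's share of the corpus as a percentage."""
    n = sum(counts.values())
    return {name: 100 * count / n for name, count in counts.items()}


shares = category_shares(articles_per_category)

# Print categories from largest to smallest share.
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<10}{pct:5.1f}%")
print("Total articles:", total)
```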

Pre-processing

The following data pre-processing steps have been performed:

0. Example:

1. Tokenization

Each collected article was segmented into tokens using NLTK.
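
A minimal sketch of this step, assuming NLTK's regex-based `wordpunct_tokenize` (the README does not say which NLTK tokenizer was actually used). That tokenizer is chosen here because it needs no downloaded model data. The helper name `tokenize`, the sample sentence, and the regex fallback for environments without NLTK are assumptions made for illustration.

```python
import re

try:
    # NLTK's regex-based tokenizer; it needs no downloaded model data.
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Fallback using the same pattern, so the sketch also runs without NLTK.
    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)


def tokenize(article):
    """Split an article into word tokens and punctuation tokens."""
    return wordpunct_tokenize(article)


# Made-up sample sentence ("The team won the match."), not taken from the corpus.
sample = "فاز الفريق في المباراة."
print(tokenize(sample))
```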

2. Removing stop words

Stop words were then removed from the tokenized text, using a complete and reviewed list of 750 stop words.
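
The filtering itself can be sketched as a simple set lookup. The five-word `SAMPLE_STOPWORDS` set below is a tiny stand-in chosen for illustration (common Arabic prepositions), not the 750-word list the project uses, and `remove_stopwords` is a hypothetical helper name.

```python
# Tiny illustrative stop list; the project's reviewed 750-word list
# is not reproduced here.
SAMPLE_STOPWORDS = frozenset({"في", "من", "على", "إلى", "عن"})


def remove_stopwords(tokens, stopwords=SAMPLE_STOPWORDS):
    """Return the tokens that are not in the stop-word set, in order."""
    return [token for token in tokens if token not in stopwords]


# Made-up token list, as produced by a tokenizer.
tokens = ["فاز", "الفريق", "في", "المباراة"]
print(remove_stopwords(tokens))
```

Using a `frozenset` keeps each membership test constant-time, which matters when filtering millions of tokens.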

3. Stemming

Each word was stemmed using the Farasa Arabic text processing toolkit.

Dataset

Categories = { : Algeria, : entertainment, : religion, : society, : sport, : world}

Machine Learning Models

Several machine learning algorithms were evaluated:

Algorithm	Precision	Recall	F-measure
Decision Tree	0.82	0.84	0.83
SVM (SGD)	0.94	0.94	0.94
Naive Bayes	0.89	0.87	0.88
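
The F-measure column is consistent with the harmonic mean of the precision and recall columns (the F1 score). The check below is a sketch under that assumption; the README does not state how the scores were averaged across categories, and `f_measure` is a helper written for this example.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure), copied from the table above.
reported = {
    "Decision Tree": (0.82, 0.84, 0.83),
    "SVM (SGD)": (0.94, 0.94, 0.94),
    "Naive Bayes": (0.89, 0.87, 0.88),
}

for name, (p, r, f) in reported.items():
    print(f"{name:<14} computed={f_measure(p, r):.2f} reported={f:.2f}")
```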

Evaluation (Confusion matrix)

Confusion matrix for the best model, an SVM trained with stochastic gradient descent:

TODO
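
A minimal sketch of how such a matrix can be computed from true and predicted labels. The category names come from the Dataset section; the `y_true` and `y_pred` lists are made up for illustration and are not project results, and `confusion_matrix` is a plain-Python helper written for this example rather than the project's code.

```python
from collections import Counter

# Category names from the Dataset section.
LABELS = ["Algeria", "entertainment", "religion", "society", "sport", "world"]


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Rows are true categories, columns are predicted categories."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


# Made-up labels for illustration only; these are not project results.
y_true = ["sport", "sport", "world", "religion", "society", "Algeria"]
y_pred = ["sport", "world", "world", "religion", "society", "society"]

matrix = confusion_matrix(y_true, y_pred)
for label, row in zip(LABELS, matrix):
    print(f"{label:<14}", row)
```

Diagonal cells count correct predictions; every off-diagonal cell shows which category an article was mistaken for.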

Contributing

Credits

Teammate: Fawzi TOUATI
Initial idea and mentor: Pr. Ahmed GUESSOUM
Mentor: Dr. Riadh BELKEBIR
