Automatic document categorization consists of assigning a category to a text based on the information it contains. We compare several supervised machine learning approaches.
Statistic | Corpus |
---|---|
Number of articles | 57,827 |
Number of categories | 8 |
Number of words | 14,068,407 |
Number of types | 582,531 |
Number of tokens | 15,891,729 |

Category | Number of articles |
---|---|
Culture | 5322 |
Economic | 8768 |
Politics | 9620 |
Religion | 4526 |
Society | 9744 |
Sports | 9103 |
World | 6344 |
Other | 4400 |
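
As a quick check on class balance, the per-category counts above can be turned into shares of the corpus. This is only an illustrative sketch: the variable names are hypothetical and the numbers are copied by hand from the table.

```python
# Article counts per category, copied from the table above.
counts = {
    "Culture": 5322,
    "Economic": 8768,
    "Politics": 9620,
    "Religion": 4526,
    "Society": 9744,
    "Sports": 9103,
    "World": 6344,
    "Other": 4400,
}

# The total should match the number of articles reported for the corpus.
total = sum(counts.values())

# Fraction of the corpus that falls in each category.
shares = {name: n / total for name, n in counts.items()}

# Print categories from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<10}{counts[name]:>7}{share:>8.1%}")
print(f"{'Total':<10}{total:>7}")
```

The largest and smallest categories differ by roughly a factor of two, so per-class scores are worth inspecting alongside any overall figure.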
Each collected article was segmented into tokens using NLTK.
Stop words were then removed from the tokenized text, using a complete and reviewed list of 750 stop words.
Each remaining word was stemmed using the Farasa Arabic text processing toolkit.
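
The three preprocessing steps can be sketched as a single function. This is a minimal illustration, not the project's actual code: `preprocess`, its `stopwords` argument, and the pluggable `stem` callable are hypothetical names. NLTK's regex-based `wordpunct_tokenize` is an assumed choice of tokenizer (the text only says NLTK was used), dropping non-alphabetic tokens is also an assumption, and the default stemmer is a placeholder identity function where a Farasa stemmer would be plugged in.

```python
import re

try:
    # Regex-based NLTK tokenizer; needs no extra data downloads.
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Fallback using the same regular expression, so the sketch
    # still runs when NLTK is not installed.
    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)


def preprocess(text, stopwords, stem=lambda word: word):
    """Tokenize, drop stop words, then stem each remaining word.

    `stem` is a placeholder: pass any callable mapping a word to its
    stem (for example, a wrapper around a Farasa stemmer).
    """
    tokens = wordpunct_tokenize(text)
    # Assumption: keep purely alphabetic tokens only, which drops
    # punctuation and numbers.
    words = [tok for tok in tokens if tok.isalpha()]
    kept = [tok for tok in words if tok not in stopwords]
    return [stem(tok) for tok in kept]


# Hypothetical usage with a two-word stop list.
sample_stopwords = {"في", "من"}
print(preprocess("فاز الفريق في المباراة.", sample_stopwords))
```

Keeping the stemmer as a parameter lets the tokenization and stop-word steps be tried out without the external toolkit installed.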
Categories = { : Algeria, : entertainment, : religion, : society, : sport, : world}
Several machine learning algorithms were evaluated:
Algorithm | Precision | Recall | F-measure |
---|---|---|---|
Decision Tree | 0.82 | 0.84 | 0.83 |
SVM (SGD) | 0.94 | 0.94 | 0.94 |
Naive Bayes | 0.89 | 0.87 | 0.88 |
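
For reference, precision, recall and F-measure can all be derived from a confusion matrix of true versus predicted labels. The sketch below shows one common way to do this on toy data. It assumes macro-averaging over categories, which the results above do not specify, and the function names and sample labels are hypothetical.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


def macro_scores(matrix):
    """Macro-averaged precision, recall and F1 from a square matrix."""
    n = len(matrix)
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        true_pos = matrix[i][i]
        predicted = sum(matrix[row][i] for row in range(n))  # column sum
        actual = sum(matrix[i])                               # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example with two categories (not real results).
labels = ["sport", "world"]
y_true = ["sport", "sport", "world", "world"]
y_pred = ["sport", "world", "world", "world"]

matrix = confusion_matrix(y_true, y_pred, labels)
precision, recall, f1 = macro_scores(matrix)
print(matrix)
print(f"P={precision:.2f} R={recall:.2f} F={f1:.2f}")
```

Other averaging schemes (micro or weighted by class size) would give different numbers on an imbalanced corpus, so the choice matters when comparing against the table.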
Confusion matrix for the best model, the SVM trained with stochastic gradient descent: