Python code for classification of documents into different classes using machine learning
MIT License
Document classification is one of the important classification problem that we deal nowadays, and is slightly different from text classification. Documents are available in many different formats and in huge numbers in enterprises and need to be classified for different purposes and end goals.
This code helps to build different classification model to classify documents to different classes. In this case into five different categories: TAX, AGGREMENTS, VALUATION, HUMAN RESOURCES, DEEDS. This code can be modified for your purpose
The raw data folder contains different file formats in different folders. Textual data needs to be extracted from them. Here doc files have been converted to PDF format first before extraction. I had issues while installing textract library which extracts test from the doc files directly, so I converted them to PDF and used pdfminer to extract texts. Texts are extracted on folder basis because the extracted texts need to be labeled.
Each data point of the extracted text belongs to one page of the document. One data point doesn't correspond to one whole document. The reason for this approach was because every document size is different, some 3 pages while some 40 pages and this doesn't create a normalized dataset to deal with.
Another approach would be to convert the PDFs to HTML form and then extract the data from the tags. The PDFs when converted to HTML give the data inside tags, however when I converted them the data was all embedded in the 'div' tags with no access to the unique heading tags or other data. So ended up extracting from PDFs.
The next step involves pre-processing the texts. Approaches taken include lowercasing, removing special characters, removing number between words and digits itself, removing stop words and converting words to their dictionary form through lemmatization.
This is an example of imbalanced dataset and needs to be taken care of before building model. The two options are oversampling or undersmapling the data. Undersampling is not a option due to small dataset and model would not be able to predict the classification, so I went with oversampling.
After oversampling each class gets 807 data points to work with.
As you can see that the model predicted the class of the document accurately as Human Resources.
The saved model has correctly predicted the class of the input file. The saved model had predicting accuracy of around 94% with a good precision for all classes above 90%. Different classification models can be trained for this task and compared for their perfomances.