A REST API that classifies resumes into occupation fields and seniority levels using machine learning. Trained on 3,000+ resumes across 26 occupations, the API provides accurate classifications with efficient PDF text extraction.
The project showcases a REST API that receives a PDF resume and returns its field of occupation and level of seniority, along with the accuracy for each. The machine learning model was trained on over 3,000 resumes covering 26 areas of occupation.
After cloning the repository run:
pip install -r requirements.txt
fastapi dev server.py
After running the project locally the documentation is available on:
http://127.0.0.1:8000/docs
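The routes can also be listed from a script instead of the browser. The sketch below assumes the app keeps FastAPI's default OpenAPI schema location (`/openapi.json`) and the default host and port shown above; the helper names `fetch_openapi` and `list_routes` are illustrative and not part of this project.

```python
import json
import urllib.request

# HTTP verbs that can appear as keys under each path in an OpenAPI schema.
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options", "trace"}


def fetch_openapi(base_url="http://127.0.0.1:8000"):
    """Download the OpenAPI schema from a locally running server (assumed default URL)."""
    with urllib.request.urlopen(f"{base_url}/openapi.json") as response:
        return json.load(response)


def list_routes(schema):
    """Return sorted (METHOD, path) pairs found in an OpenAPI schema dict."""
    routes = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            if method.lower() in HTTP_METHODS:
                routes.append((method.upper(), path))
    return sorted(routes)


# With the dev server running:
#     for method, path in list_routes(fetch_openapi()):
#         print(method, path)
```

Reading the schema this way avoids guessing endpoint names: whatever routes `server.py` defines are reported as they are.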
This is a custom dataset tailored for this use case: https://www.kaggle.com/datasets/danicardeal/resume-occupation-and-seniority
For training the seniority classifier, the text field and the seniority field from the CSV were used.
For the area of expertise classifier, the class number and text fields were utilized.
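As a rough illustration of selecting those field pairs, the sketch below reads a CSV with the standard `csv` module. The header names `text`, `seniority`, and `class`, the helper `load_columns`, and the sample rows are all assumptions made for the example; check the actual column names in the Kaggle file.

```python
import csv
import io


def load_columns(csv_file, text_column, label_column):
    """Read one text column and one label column from an open CSV file."""
    texts, labels = [], []
    for row in csv.DictReader(csv_file):
        texts.append(row[text_column])
        labels.append(row[label_column])
    return texts, labels


# Tiny in-memory stand-in for the dataset (invented rows, assumed header names).
SAMPLE_CSV = """text,seniority,class
"Led a team of five engineers",senior,3
"Recent graduate seeking first role",junior,7
"""

# Seniority classifier: text + seniority label.
seniority_texts, seniority_labels = load_columns(
    io.StringIO(SAMPLE_CSV), "text", "seniority"
)

# Area of expertise classifier: text + class number (read as a string here).
area_texts, area_labels = load_columns(io.StringIO(SAMPLE_CSV), "text", "class")
```

For the real file, pass `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` object.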
In the preprocessing phase, the following steps were implemented:
Spacy
en_core_web_lg
Re
CSV
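A minimal sketch of how these libraries might be combined for the preprocessing this README describes (hyperlink removal with the `re` module; stop-word removal, lemmatization, and tokenization with spaCy). The URL pattern, the function names, and the choice to lowercase lemmas are assumptions for illustration, not the project's actual code.

```python
import re

# Assumed pattern: anything starting with http://, https:// or www. up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def strip_links(text: str) -> str:
    """Remove hyperlinks and collapse the whitespace they leave behind."""
    without_links = URL_PATTERN.sub(" ", text)
    return " ".join(without_links.split())


def preprocess(text: str, nlp) -> str:
    """Return space-joined lowercase lemmas, skipping stop words, punctuation and whitespace.

    `nlp` is a loaded spaCy pipeline, for example spacy.load("en_core_web_lg").
    """
    doc = nlp(strip_links(text))  # spaCy tokenizes the cleaned text
    lemmas = [
        token.lemma_.lower()
        for token in doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]
    return " ".join(lemmas)


# Usage, once spaCy and the model are installed:
#     import spacy
#     nlp = spacy.load("en_core_web_lg")
#     cleaned = preprocess(resume_text, nlp)
```

Passing the pipeline in as an argument means the large model is loaded once and reused for every resume.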
Models evaluated:
Model chosen: XGBoost with parameters
The data was vectorized using CountVectorizer from sklearn. The trained model was exported and loaded using joblib for deployment and inference.
This project demonstrates a robust proof of concept for a REST API capable of classifying curricula into specific fields of occupation and levels of seniority using machine learning algorithms. The model, trained on a custom dataset with over 3,000 resumes spanning 26 areas of occupation, achieves accurate classifications while providing valuable insights into the efficacy of various PDF extraction libraries and machine learning models.
Effective PDF Processing: After evaluating multiple libraries for PDF extraction, pdftotext was selected for its superior performance in terms of processing time.
Comprehensive Preprocessing: Utilizing Spacy for text processing (stopwords removal, lemmatization, and tokenization) and Re for hyperlink removal ensured clean and relevant data for model training.
Model Evaluation and Selection: Among the evaluated models, XGBoost emerged as the best performer, providing high accuracy in both seniority and area of expertise classifications.
Data Vectorization and Persistence: The use of CountVectorizer for data vectorization and joblib for model persistence streamlined the deployment and inference process, making the system efficient and scalable.
Accuracy and Performance: The achieved accuracies and confusion matrices for both seniority and area of expertise classifications highlight the model's effectiveness and reliability.
This project not only showcases the potential for automated resume classification but also serves as an excellent learning experience in handling real-world data, evaluating multiple libraries and models, and implementing a complete machine learning pipeline from data preprocessing to deployment.
Free Software, Hell Yeah!