Implementation of a search engine using TF-IDF and Word Embedding-based vectorization techniques for efficient document retrieval
This project, developed as an assignment for an Information Retrieval course, implements a search engine using two distinct techniques: TF-IDF based vectorization and word-embedding based vectorization. Our goal is to showcase efficient and accurate document retrieval in response to user queries, and to highlight the differences between the two approaches and the advantages of each.
Process | Description |
---|---|
Offline Process | 1. Load and preprocess documents<br>2. Create vocabulary<br>3. Compute TF-IDF matrix<br>4. Store TF-IDF matrix and vocabulary |
Online Process | 1. Receive user query<br>2. Preprocess query<br>3. Convert query to TF-IDF vector<br>4. Compute similarity with document vectors<br>5. Rank and return top results |
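
The offline and online TF-IDF steps in the table above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's actual code: the function names (`tokenize`, `build_index`, `vectorize`, `search`), the regex tokenizer, and the smoothed IDF formula are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a stand-in for real preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(tokens, idf):
    """Map tokens to a sparse, L2-normalized TF-IDF vector (term -> weight)."""
    tf = Counter(t for t in tokens if t in idf)  # ignore out-of-vocabulary terms
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(docs):
    """Offline: build the vocabulary (keys of idf) and the TF-IDF document vectors."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    # Assumed smoothing: terms found in every document keep a small positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}
    return idf, [vectorize(toks, idf) for toks in tokenized]


def search(query, idf, doc_vectors, top_k=3):
    """Online: vectorize the query, score by cosine similarity, return top results."""
    q = vectorize(tokenize(query), idf)
    # Vectors are unit length, so the dot product equals the cosine similarity.
    scores = [(sum(w * d.get(t, 0.0) for t, w in q.items()), i)
              for i, d in enumerate(doc_vectors)]
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [(i, score) for score, i in ranked[:top_k] if score > 0]


docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]
idf, doc_vectors = build_index(docs)
results = search("cat on a mat", idf, doc_vectors)
```

Because matching is purely lexical, the query term `cat` does not match `cats` in the second document unless preprocessing adds stemming or lemmatization.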
Process | Description |
---|---|
Offline Process | 1. Load and preprocess documents<br>2. Train or load pre-trained Word2Vec model<br>3. Compute document embeddings<br>4. Store embeddings in Chroma DB |
Online Process | 1. Receive user query<br>2. Preprocess query<br>3. Compute query embedding<br>4. Perform similarity search in Chroma DB<br>5. Rank and return top results |
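
The embedding pipeline in the table above can be illustrated without any external dependency. In the sketch below, the tiny `WORD_VECTORS` dictionary is a hypothetical stand-in for a trained Word2Vec model, `InMemoryVectorStore` is a hypothetical stand-in for Chroma DB, and averaging word vectors is only one common way to obtain a document embedding; none of these are taken from the project's code.

```python
import math

# Hypothetical 3-dimensional word vectors standing in for a trained Word2Vec model.
WORD_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.0],
    "dog": [0.7, 0.3, 0.1],
    "stock": [0.0, 0.1, 0.9],
    "market": [0.1, 0.0, 0.8],
}


def embed(text):
    """Average the vectors of known words; return None if no word is known."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return None
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Minimal stand-in for a vector database: stores embeddings, answers top-k queries."""

    def __init__(self):
        self.items = []

    def add(self, doc_id, embedding):
        self.items.append((doc_id, embedding))

    def query(self, embedding, top_k=3):
        scored = [(doc_id, cosine(embedding, e)) for doc_id, e in self.items]
        return sorted(scored, key=lambda s: -s[1])[:top_k]


def build_store(docs):
    """Offline: embed every document and store the result."""
    store = InMemoryVectorStore()
    for doc_id, text in docs.items():
        embedding = embed(text)
        if embedding is not None:
            store.add(doc_id, embedding)
    return store


def search(query, store, top_k=3):
    """Online: embed the query and run a similarity search against the store."""
    q = embed(query)
    return store.query(q, top_k) if q is not None else []


docs = {"d1": "the cat and the dog", "d2": "stock market news"}
store = build_store(docs)
results = search("kitten", store)
```

With these toy vectors, the query `kitten` ranks `d1` first even though the two share no word, which is the kind of semantic match a purely lexical TF-IDF index cannot produce.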
The TF-IDF (Term Frequency-Inverse Document Frequency) approach involves:
The Word Embedding approach involves:
[Screenshots: Query Suggestion, Query Result, Topic Detection, Similar Documents]
Metric | TF-IDF Based | Word Embedding Based |
---|---|---|
MAP | 54% | 70% |
MRR | 63% | 80% |
The Word Embedding based approach outperforms the TF-IDF based approach on both metrics: Mean Average Precision (MAP) is 70% versus 54%, and Mean Reciprocal Rank (MRR) is 80% versus 63%.
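
For reference, MAP and MRR can be computed as in the sketch below. It assumes each evaluated query is given as a pair of a ranked list of document IDs and the set of relevant IDs, and it normalizes average precision by the total number of relevant documents, which is one common convention. The function names and the toy data are illustrative; the project's own evaluation may differ.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k where a relevant document appears."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_average_precision(runs):
    """MAP over a list of (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def mean_reciprocal_rank(runs):
    """MRR over a list of (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Toy data: two queries with their ranked results and relevance judgments.
runs = [
    (["a", "b", "c"], {"a", "c"}),
    (["x", "y", "z"], {"y"}),
]
map_score = mean_average_precision(runs)
mrr_score = mean_reciprocal_rank(runs)
```

MRR only rewards the position of the first relevant result, while MAP also accounts for how early the remaining relevant documents appear, so the two metrics can diverge for queries with several relevant documents.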
Our system provides query suggestions based on:
We implement document clustering to group similar documents and identify topics:
[To be added in a future update]
For complete documentation of the project in Arabic, please refer to the following link:
This project is licensed under the MIT License - see the LICENSE file for details.