text-based-search-engine

Implementation of a search engine using TF-IDF and Word Embedding-based vectorization techniques for efficient document retrieval

Text-Based Search Engine Project

Project Overview

This project was developed as an assignment for an Information Retrieval course. It implements two search engines, one based on TF-IDF vectorization and one based on word-embedding vectorization, and compares how efficiently and accurately each retrieves documents in response to user queries.

Features

Dual search engine implementation: TF-IDF and Word Embedding based
Query suggestion functionality
Document clustering and topic detection
Similar document retrieval
Efficient offline processing and fast online querying

Technologies Used

Python: Primary programming language
NumPy: For numerical computations
Chroma DB: Vector database for efficient similarity search
Gensim: For Word2Vec model implementation
Scikit-learn: For TF-IDF vectorization and other machine learning utilities
FastAPI: For creating the web API
NLTK: For text processing and tokenization

Datasets

Antique: A non-factoid question answering dataset
Wikipedia: A subset of Wikipedia articles

Process Workflow

TF-IDF Based Search Engine

Offline Process
1. Load and preprocess documents
2. Create vocabulary
3. Compute TF-IDF matrix
4. Store TF-IDF matrix and vocabulary

Online Process
1. Receive user query
2. Preprocess query
3. Convert query to TF-IDF vector
4. Compute similarity with document vectors
5. Rank and return top results
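
The offline and online steps above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the project's code: the project uses scikit-learn and NLTK for these steps, and the function names (`tokenize`, `build_index`, `to_vector`, `search`), the regex tokenizer, and the toy documents are all assumptions made for the example. The IDF uses the smoothed form ln((1 + N) / (1 + df)) + 1, which is also scikit-learn's default.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (a simple stand-in for NLTK preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def to_vector(tokens, idf):
    """Map tokens to a sparse {term: tf * idf} dict, ignoring out-of-vocabulary terms."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def build_index(docs):
    """Offline step: compute the IDF table and one sparse TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, so a term found in every document still gets a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = [to_vector(tokens, idf) for tokens in tokenized]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, idf, vectors, top_k=3):
    """Online step: vectorize the query, score every document, return the best matches."""
    q = to_vector(tokenize(query), idf)
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    scored = [(s, i) for s, i in scored if s > 0]  # drop documents with no term overlap
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, round(s, 3)) for s, i in scored[:top_k]]


# Toy corpus, invented for illustration.
docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]
idf, vectors = build_index(docs)          # offline
results = search("cat on a mat", idf, vectors)  # online
```

Only the first document is returned for this query: without stemming, "cats" in the second document does not match "cat", which is one reason the preprocessing step matters in a purely lexical engine.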

Word2Vec Based Search Engine

Offline Process
1. Load and preprocess documents
2. Train or load pre-trained Word2Vec model
3. Compute document embeddings
4. Store embeddings in Chroma DB

Online Process
1. Receive user query
2. Preprocess query
3. Compute query embedding
4. Perform similarity search in Chroma DB
5. Rank and return top results
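
To make the store-then-query split concrete, here is a small sketch in which a hypothetical `InMemoryVectorStore` class stands in for the vector database. It is not the Chroma DB API and not the project's code; the class, its `add` and `query` methods, the document ids, and the 3-dimensional embeddings are all invented for illustration. A real vector database typically adds persistence and approximate nearest-neighbour indexing on top of the same idea.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database collection (not the Chroma DB API)."""

    def __init__(self):
        self._ids = []
        self._vectors = []

    def add(self, doc_id, vector):
        """Offline step: store a unit-normalized copy of a document embedding."""
        self._ids.append(doc_id)
        self._vectors.append(_normalize(vector))

    def query(self, vector, top_k=2):
        """Online step: return the top_k (doc_id, cosine similarity) pairs."""
        q = _normalize(vector)
        scored = [
            (sum(a * b for a, b in zip(q, v)), doc_id)
            for doc_id, v in zip(self._ids, self._vectors)
        ]
        scored.sort(key=lambda pair: -pair[0])
        return [(doc_id, round(score, 3)) for score, doc_id in scored[:top_k]]


# Offline: store precomputed document embeddings (toy values).
store = InMemoryVectorStore()
store.add("doc-pets", [0.9, 0.1, 0.0])
store.add("doc-finance", [0.0, 0.2, 0.9])
store.add("doc-animals", [0.7, 0.3, 0.1])

# Online: embed the query (here given directly) and fetch the nearest documents.
hits = store.query([1.0, 0.0, 0.0], top_k=2)
```

The query vector points along the first axis, so "doc-pets" ranks first and "doc-animals" second, while "doc-finance" is orthogonal to the query and falls outside the top two.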

Implementation Details

TF-IDF Based Vectorization

The TF-IDF (Term Frequency-Inverse Document Frequency) approach involves:

Creating a vocabulary from all documents
Computing TF-IDF scores for each term in each document
Representing documents and queries as TF-IDF vectors
Using cosine similarity to find relevant documents
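
For reference, one common formulation of the weighting and the similarity measure is shown below. The exact variant the project uses depends on its vectorizer settings, which this README does not state, so treat the smoothed IDF here as an assumption (it is the default in scikit-learn). Here tf(t, d) is the count of term t in document d, N is the number of documents, and df(t) is the number of documents containing t.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1

\cos(\mathbf{q}, \mathbf{d}) =
  \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert}
```

As a worked example with N = 3: a term that appears in one document gets idf = ln(4/2) + 1 ≈ 1.693, while a term that appears in all three gets idf = ln(4/4) + 1 = 1, so rarer terms contribute more to the similarity score.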

Embedding-Based Vectorization

The Word Embedding approach involves:

Using pre-trained or custom-trained Word2Vec models
Representing words as dense vectors
Computing document embeddings by averaging word vectors
Using vector similarity in embedding space to find relevant documents
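
The averaging step can be sketched as follows. The `embed_document` helper and the 2-dimensional toy vectors are assumptions made for illustration; in the project the vectors would come from the Word2Vec model (Gensim exposes them through `model.wv`). Returning a zero vector when no token is in the vocabulary is one possible convention, not necessarily the one the project follows.

```python
def embed_document(tokens, word_vectors, dim):
    """Average the vectors of in-vocabulary tokens; return a zero vector if none match."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the i-th component of every word vector together.
    return [sum(component) / len(known) for component in zip(*known)]


# Toy 2-dimensional word vectors, invented for illustration.
word_vectors = {
    "cat": [1.0, 0.0],
    "dog": [0.8, 0.2],
    "bank": [0.0, 1.0],
}

# "unicorn" is out of vocabulary, so only "cat" and "dog" are averaged.
doc_vec = embed_document(["cat", "dog", "unicorn"], word_vectors, dim=2)
empty_vec = embed_document(["unicorn"], word_vectors, dim=2)
```

Here `doc_vec` is approximately [0.9, 0.1]. Because averaging discards word order and gives every word equal weight, long documents tend to drift toward a generic vector, which is worth keeping in mind when comparing results with TF-IDF.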

Examples

[Screenshots: Query Suggestion, Query Result, Topic Detection, Similar Documents]

Performance Comparison

Metric	TF-IDF Based	Word Embedding Based
MAP	54%	70%
MRR	63%	80%

The word-embedding approach outperforms TF-IDF on both metrics: Mean Average Precision (MAP) is 70% versus 54%, and Mean Reciprocal Rank (MRR) is 80% versus 63%.
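
For readers unfamiliar with the two metrics, the sketch below computes them from ranked result lists and relevance judgments. The function names and the two toy queries are assumptions for illustration; they are not the project's evaluation code and do not reproduce the figures in the table.

```python
def average_precision(ranked_ids, relevant_ids):
    """Sum precision@k at each rank k holding a relevant document, divided by the number of relevant documents."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query. Returns (MAP, MRR)."""
    n = len(runs)
    map_score = sum(average_precision(r, rel) for r, rel in runs) / n
    mrr_score = sum(reciprocal_rank(r, rel) for r, rel in runs) / n
    return map_score, mrr_score


# Two toy queries with hand-made rankings and relevance judgments.
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2, RR = 1
    (["d4", "d5", "d6"], {"d5"}),        # AP = 1/2,             RR = 1/2
]
map_score, mrr_score = evaluate(runs)
```

MAP rewards placing all relevant documents high in the ranking, while MRR only looks at the position of the first relevant one, which is why MRR is usually the higher of the two numbers.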

Additional Features

Query Suggestion

Our system generates query suggestions in four steps:

Process the user's input query
Generate word vectors using Word2Vec
Find similar terms using cosine similarity
Rank and present the top suggestions
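
A minimal sketch of these steps is shown below, assuming that suggestions are vocabulary terms close to the mean vector of the query terms. How the project actually combines terms into suggestions is not described here, so the `suggest_terms` helper and the toy vectors are illustrative assumptions. With a trained Gensim model, `model.wv.most_similar(word, topn=k)` performs a comparable nearest-term lookup.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suggest_terms(query, word_vectors, top_k=2):
    """Rank vocabulary terms by cosine similarity to the mean vector of the query terms."""
    tokens = [t for t in query.lower().split() if t in word_vectors]
    if not tokens:
        return []
    mean = [sum(c) / len(tokens) for c in zip(*(word_vectors[t] for t in tokens))]
    candidates = [
        (cosine(mean, vec), term)
        for term, vec in word_vectors.items()
        if term not in tokens  # do not suggest words the user already typed
    ]
    candidates.sort(key=lambda pair: -pair[0])
    return [term for _, term in candidates[:top_k]]


# Toy 3-dimensional word vectors, invented for illustration.
word_vectors = {
    "python": [0.9, 0.1, 0.0],
    "java": [0.8, 0.2, 0.1],
    "programming": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.1, 0.9],
}
suggestions = suggest_terms("python", word_vectors)
```

With these toy vectors the suggestions for "python" are "programming" followed by "java"; "banana" points in a different direction and is ranked last.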

Document Clustering

We implement document clustering to group similar documents and identify topics:

Using the K-Means clustering algorithm
Applying Latent Dirichlet Allocation (LDA) for topic modeling
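
The clustering step can be illustrated with a from-scratch version of Lloyd's algorithm, the standard K-Means procedure. This is a sketch under stated assumptions: the README does not say which implementation the project uses (scikit-learn's `KMeans` would be a typical choice), and the `kmeans` helper, the deterministic initialization from the first k points, and the 2-dimensional toy vectors are invented for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; centroids start at the first k points for reproducibility."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Toy document vectors forming two obvious groups.
points = [
    [0.9, 0.1],
    [0.1, 0.9],
    [0.8, 0.2],
    [0.2, 0.8],
]
labels, centroids = kmeans(points, k=2)
```

The first and third points end up in one cluster and the second and fourth in the other. In practice the input would be the TF-IDF or embedding vectors of the documents, and a randomized initialization with several restarts is usually preferred over taking the first k points.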

How to Use

[To be added in a future update]

Documentation

For complete documentation of the project in Arabic, please refer to the following link:

Arabic Documentation

Future Improvements

Implement more advanced embedding models (e.g., BERT, GPT)
Enhance query suggestion with user interaction data
Improve clustering algorithms for better topic detection
Optimize performance for larger datasets

Contributors

Alaa Aldeen Zamel
Anas Rish
Anas Durra
Mohammed Hadi Barakat
Mohammed Fares Dabbas

License

This project is licensed under the MIT License - see the LICENSE file for details.
