text-based-search-engine

Implementation of a search engine using TF-IDF and Word Embedding-based vectorization techniques for efficient document retrieval


Text-Based Search Engine Project

Project Overview

This project, developed as an assignment for the Information Retrieval course, implements search engines using two distinct techniques: TF-IDF based vectorization and embedding-based vectorization. The goal is to demonstrate efficient and accurate document retrieval for user queries while highlighting the differences and advantages of each approach.

Features

  • Dual search engine implementation: TF-IDF and Word Embedding based
  • Query suggestion functionality
  • Document clustering and topic detection
  • Similar document retrieval
  • Efficient offline processing and fast online querying

Technologies Used

  • Python: Primary programming language
  • NumPy: For numerical computations
  • Chroma DB: Vector database for efficient similarity search
  • Gensim: For Word2Vec model implementation
  • Scikit-learn: For TF-IDF vectorization and other machine learning utilities
  • FastAPI: For creating the web API
  • NLTK: For text processing and tokenization

Datasets

  • Antique: A non-factoid question answering dataset Link
  • Wikipedia: A subset of Wikipedia articles Link

Process Workflow

TF-IDF Based Search Engine

Offline Process

  1. Load and preprocess documents
  2. Create vocabulary
  3. Compute TF-IDF matrix
  4. Store TF-IDF matrix and vocabulary

Online Process

  1. Receive user query
  2. Preprocess query
  3. Convert query to TF-IDF vector
  4. Compute similarity with document vectors
  5. Rank and return top results
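
A minimal sketch of this pipeline with Scikit-learn; the toy corpus, preprocessing, and number of returned results are illustrative, not the project's exact code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Offline: build the vocabulary and the TF-IDF matrix from the (preprocessed) corpus.
documents = [
    "how do antibiotics work against bacteria",
    "history of the python programming language",
    "effects of caffeine on sleep quality",
]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(documents)

# Online: vectorize the query with the same vocabulary and rank documents by cosine similarity.
query = "how does caffeine affect sleep"
query_vector = vectorizer.transform([query])
scores = cosine_similarity(query_vector, tfidf_matrix).ravel()
for idx in scores.argsort()[::-1][:2]:
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```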

Word2Vec Based Search Engine

Offline Process

  1. Load and preprocess documents
  2. Train or load pre-trained Word2Vec model
  3. Compute document embeddings
  4. Store embeddings in Chroma DB

Online Process

  1. Receive user query
  2. Preprocess query
  3. Compute query embedding
  4. Perform similarity search in Chroma DB
  5. Rank and return top results
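
A minimal sketch of the same flow with Gensim and the Chroma DB Python client; the toy corpus, model parameters, and collection name are illustrative assumptions:

```python
import numpy as np
import chromadb
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Offline: train (or load) a Word2Vec model and index averaged document embeddings in Chroma DB.
documents = ["caffeine can reduce sleep quality", "python is a programming language"]
tokenized = [simple_preprocess(doc) for doc in documents]
model = Word2Vec(tokenized, vector_size=100, window=5, min_count=1)

def embed(tokens):
    """Average the Word2Vec vectors of the in-vocabulary tokens."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.wv.vector_size)

client = chromadb.Client()
collection = client.create_collection("documents")  # illustrative collection name
collection.add(
    ids=[str(i) for i in range(len(documents))],
    documents=documents,
    embeddings=[embed(tokens).tolist() for tokens in tokenized],
)

# Online: embed the query the same way and retrieve the nearest documents.
query_embedding = embed(simple_preprocess("does caffeine affect sleep"))
results = collection.query(query_embeddings=[query_embedding.tolist()], n_results=1)
print(results["documents"])
```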

Implementation Details

TF-IDF Based Vectorization

The TF-IDF (Term Frequency-Inverse Document Frequency) approach involves:

  • Creating a vocabulary from all documents
  • Computing TF-IDF scores for each term in each document
  • Representing documents and queries as TF-IDF vectors
  • Using cosine similarity to find relevant documents
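
For reference, a standard formulation of the TF-IDF weight and of the cosine similarity used for ranking (Scikit-learn's TfidfVectorizer applies IDF smoothing and L2 normalization by default, so its exact values differ slightly):

$$
\operatorname{tfidf}(t, d) = \operatorname{tf}(t, d) \cdot \log\frac{N}{\operatorname{df}(t)}, \qquad
\operatorname{sim}(q, d) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert}
$$

where $N$ is the total number of documents and $\operatorname{df}(t)$ is the number of documents containing term $t$.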

Embedding-Based Vectorization

The Word Embedding approach involves:

  • Using pre-trained or custom-trained Word2Vec models
  • Representing words as dense vectors
  • Computing document embeddings by averaging word vectors
  • Using vector similarity in embedding space to find relevant documents
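
With simple averaging, a document $d$ with in-vocabulary tokens $w_1, \dots, w_n$ is represented as

$$
\vec{e}(d) = \frac{1}{n} \sum_{i=1}^{n} \vec{v}_{w_i}
$$

where $\vec{v}_{w}$ is the Word2Vec vector of word $w$; out-of-vocabulary tokens are skipped, and the query is embedded the same way before the similarity search.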

Examples

  • Query Suggestion / Query Result (screenshots)
  • Topic Detection / Similar Documents (screenshots)

Performance Comparison

Metric   TF-IDF Based   Word Embedding Based
MAP      54%            70%
MRR      63%            80%

The Word Embedding-based approach outperforms the TF-IDF baseline on both Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR).
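
For reference, a sketch of how these two metrics can be computed for a single query from a ranked result list; the document IDs and relevance judgments are illustrative, and this is not the project's evaluation code:

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k taken at the rank of each relevant document."""
    hits, precisions = 0, []
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / k
    return 0.0

# MAP and MRR are the means of these per-query values over all evaluation queries.
print(average_precision(["d3", "d1", "d7"], {"d1", "d7"}))  # ~0.583
print(reciprocal_rank(["d3", "d1", "d7"], {"d1", "d7"}))    # 0.5
```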

Additional Features

Query Suggestion

Using N-Grams and Word2Vec, our system provides query suggestions by:

  1. Processing the user's input query
  2. Generating word vectors using Word2Vec
  3. Finding similar terms using cosine similarity
  4. Ranking and presenting the top suggestions
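
A minimal sketch of the Word2Vec side of this with Gensim; the toy corpus and query are placeholders for the project's corpus-trained model:

```python
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# A tiny corpus stands in for the corpus the suggestion model is trained on.
corpus = [
    "caffeine disrupts sleep at night",
    "coffee contains caffeine",
    "poor sleep affects memory",
    "drinking coffee late affects sleep",
]
model = Word2Vec([simple_preprocess(doc) for doc in corpus],
                 vector_size=50, window=3, min_count=1, epochs=50)

def suggest(query, topn=3):
    """Return expansion terms ranked by cosine similarity to the query tokens."""
    tokens = [t for t in simple_preprocess(query) if t in model.wv]
    if not tokens:
        return []
    # most_similar averages the query word vectors and ranks vocabulary terms against them.
    return model.wv.most_similar(positive=tokens, topn=topn)

print(suggest("caffeine sleep"))
```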

Document Clustering

We implement document clustering to group similar documents and identify topics:

  • Using K-Means clustering algorithm
  • Applying Latent Dirichlet Allocation (LDA) for topic modeling
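
A minimal sketch of both steps with Scikit-learn; the toy corpus and the numbers of clusters and topics are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "caffeine and sleep quality",
    "python programming tutorial",
    "coffee before bed disrupts sleep",
    "java programming basics",
]

# K-Means over TF-IDF vectors groups similar documents into clusters.
tfidf = TfidfVectorizer().fit_transform(documents)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)
print(labels)

# LDA over raw term counts represents each topic as a distribution over words.
count_vectorizer = CountVectorizer(stop_words="english")
counts = count_vectorizer.fit_transform(documents)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
terms = count_vectorizer.get_feature_names_out()
for topic in lda.components_:
    print([terms[i] for i in topic.argsort()[::-1][:3]])
```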

How to Use

[To be added in a future update]

Documentation

For complete documentation of the project in Arabic, please refer to the following link:

Arabic Documentation

Future Improvements

  • Implement more advanced embedding models (e.g., BERT, GPT)
  • Enhance query suggestion with user interaction data
  • Improve clustering algorithms for better topic detection
  • Optimize performance for larger datasets

Contributors

License

This project is licensed under the MIT License - see the LICENSE file for details.

Related Projects