Zero-shot object detection with CLIP, utilizing Faster R-CNN for region proposals.
Welcome! This project enables zero-shot object detection using OpenAI's CLIP model, complemented by a Faster R-CNN model for generating region proposals.
To dive into the magic of CLIP-based object detection, make sure you have the following installed:
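A typical setup (assuming a Python environment with pip; the package set below follows the install instructions from OpenAI's CLIP repository, with PyTorch and torchvision added for the Faster R-CNN proposals):

```shell
# PyTorch and torchvision provide the Faster R-CNN backbone for proposals.
pip install torch torchvision

# CLIP's stated dependencies, then CLIP itself from the official repo.
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git
```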
CLIP (Contrastive Language–Image Pre-training) is a neural network trained on a massive dataset of 400 million image-text pairs. By learning to predict the correct text for images and vice versa, CLIP effectively bridges the gap between visual and textual understanding. This self-supervised model excels in various image-language tasks, including classification and object detection, without the need for labeled training data.
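The core matching idea can be sketched without the real model: CLIP embeds the image and each candidate text into a shared vector space, computes their cosine similarities, scales them by a learned temperature, and applies a softmax to rank the labels. The snippet below illustrates that scoring step with tiny dummy embeddings (the vectors, labels, and the fixed temperature of 100 are illustrative stand-ins, not CLIP's actual weights):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Dummy 4-d embeddings standing in for CLIP's real image/text encoders.
image_emb = np.array([0.9, 0.1, 0.0, 0.1])
text_embs = {
    "a photo of a dog": np.array([0.8, 0.2, 0.1, 0.0]),
    "a photo of a cat": np.array([0.1, 0.9, 0.0, 0.2]),
}

sims = np.array([cosine_sim(image_emb, t) for t in text_embs.values()])

# CLIP scales similarities by a learned temperature before the softmax;
# 100 is just a representative constant here.
logits = 100.0 * sims
probs = np.exp(logits - logits.max())
probs /= probs.sum()

best_label = list(text_embs)[int(np.argmax(probs))]
print(best_label)  # → a photo of a dog
```

The same scoring is what lets CLIP act as a zero-shot classifier: swapping in a new label set requires only new text prompts, no retraining.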
Original Image:
Candidate Regions:
These are the candidate regions that Faster R-CNN's Region Proposal Network extracts from the original image.
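Region proposal stages typically emit many overlapping boxes, which are pruned with non-maximum suppression (NMS) before being passed on for classification. A minimal NMS sketch over dummy boxes (illustrative only, not this project's actual proposal code):

```python
import numpy as np

def iou(a, b):
    # Intersection-over-union of two boxes given as [x1, y1, x2, y2].
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    # Greedily keep the highest-scoring box, drop boxes overlapping it.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        order = np.array(
            [j for j in order[1:] if iou(boxes[i], boxes[j]) < thresh],
            dtype=int,
        )
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 150, 150]])
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # → [0, 2]: the near-duplicate box 1 is suppressed
```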
Detected Objects:
Here’s what CLIP detected in the image.
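Turning per-region CLIP scores into final detections usually means keeping each region's best label only when it is confident and not a background prompt. A toy sketch of that filtering step (the labels, probabilities, and 0.5 threshold are hypothetical, chosen just to show the logic):

```python
import numpy as np

labels = ["dog", "cat", "background"]

# Hypothetical per-region CLIP probabilities (rows: regions, cols: labels).
region_probs = np.array([
    [0.85, 0.10, 0.05],   # region 0: confidently a dog
    [0.40, 0.35, 0.25],   # region 1: ambiguous, below threshold
])

detections = []
for i, probs in enumerate(region_probs):
    best = int(np.argmax(probs))
    # Keep only confident, non-background matches.
    if labels[best] != "background" and probs[best] >= 0.5:
        detections.append((i, labels[best], float(probs[best])))

print(detections)  # → [(0, 'dog', 0.85)]
```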
Explore the potential of CLIP for zero-shot object detection and see how it performs without needing extensive training data. Happy detecting! 🕵️‍♂️