Builds a Docker image for loading a Pinecone database with vectors from CVE data.

This project downloads CVE records from a configured URL, processes them, and stores their embeddings in a Pinecone vector database using HuggingFace's `BAAI/bge-small-en-v1.5` model. The workflow handles downloading, unzipping, processing, embedding, and cleaning up after the data has been processed.
- **Downloading and Unzipping:** Downloads the CVE archive from the configured URL and unzips it.
- **Moving Files:** Moves the extracted files into the `data` folder for further processing.
- **Processing CVE Data:** Processes each CVE record in preparation for embedding.
- **Embedding with HuggingFace:** Uses the `BAAI/bge-small-en-v1.5` model to generate embeddings for storage in Pinecone.
- **Pinecone Integration:** Stores the embeddings in a Pinecone index via `langchain.vectorstores.Pinecone`.
- **Cleanup:** Removes the downloaded and extracted files after the data has been processed.
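The download, unzip, and move steps above can be sketched roughly as follows. This is an illustration only: the function names, the temporary `extracted` folder, and the archive handling are assumptions, not code from the actual script.

```python
import os
import shutil
import urllib.request
import zipfile


def download_archive(url: str, dest: str) -> str:
    """Download the CVE zip archive from the configured URL."""
    urllib.request.urlretrieve(url, dest)
    return dest


def unzip_and_move(archive_path: str, data_dir: str = "data") -> list[str]:
    """Unzip the archive and move the extracted files into the data folder."""
    extract_dir = "extracted"
    os.makedirs(data_dir, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(extract_dir)
    moved = []
    # Flatten whatever directory layout the archive uses into data_dir.
    for root, _dirs, files in os.walk(extract_dir):
        for name in files:
            target = os.path.join(data_dir, name)
            shutil.move(os.path.join(root, name), target)
            moved.append(target)
    shutil.rmtree(extract_dir)  # remove the temporary extraction folder
    return moved
```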
**Environment Setup:**

```shell
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install sentence-transformers
```
**Environment Variables:**

Create a `.env` file in the root directory with the following variables:

```
PINECONE_API_KEY=your_pinecone_api_key
PINECONE_CLOUD=aws
PINECONE_REGION=us-east-1
URL=https://your_download_link.com/deltaCves.zip
```
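Since the script depends on these variables, a fail-fast check at startup can help. This is a minimal sketch using `os.environ` directly; the real `pinecone_db.py` may load the `.env` file differently (for example via `python-dotenv`).

```python
import os

# The four variables the .env file is expected to provide.
REQUIRED_VARS = ["PINECONE_API_KEY", "PINECONE_CLOUD", "PINECONE_REGION", "URL"]


def load_config() -> dict:
    """Read required settings from the environment, failing fast if any is missing."""
    missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
    if missing:
        raise RuntimeError(
            f"Missing required environment variables: {', '.join(missing)}"
        )
    return {name: os.environ[name] for name in REQUIRED_VARS}
```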
**Running the Application:**

```shell
python pinecone_db.py
```

**Docker Build:**

```shell
docker build -t cve-processor:latest .
```

**Running the Docker Container:**

```shell
docker run --env-file .env cve-processor:latest
```
**pinecone_db.py**

This script handles the full workflow: downloading and unzipping the CVE archive, processing the records, generating and storing embeddings, and cleaning up the `data` directory afterward.
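How the records are processed is not spelled out here. One plausible sketch, assuming the files follow the CVE JSON 5.x record layout (`cveMetadata.cveId`, descriptions under `containers.cna`), is to flatten each record into a text-plus-metadata pair ready for embedding; the helper name and field choices below are illustrative, not taken from the actual script.

```python
import json


def record_to_document(raw: str) -> tuple[str, dict]:
    """Flatten one CVE JSON record into (text, metadata) for embedding.

    Assumes the CVE JSON 5.x layout; real records may carry more fields
    (affected products, references) worth including in the text.
    """
    record = json.loads(raw)
    cve_id = record["cveMetadata"]["cveId"]
    descriptions = (
        record.get("containers", {}).get("cna", {}).get("descriptions", [])
    )
    # Join all description strings; fall back to the CVE ID if none exist.
    text = " ".join(d.get("value", "") for d in descriptions) or cve_id
    return text, {"cve_id": cve_id}
```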
**Dockerfile**

The `Dockerfile` sets up the environment, installs dependencies, and defines the entry point to run the `pinecone_db.py` script inside a containerized environment.
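The actual `Dockerfile` is not reproduced here; a minimal sketch consistent with the setup steps above might look like the following (the base image and file layout are assumptions):

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt \
    && pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu \
    && pip install --no-cache-dir sentence-transformers

COPY pinecone_db.py .

# Configuration (PINECONE_API_KEY etc.) is supplied at run time
# via `docker run --env-file .env`.
ENTRYPOINT ["python", "pinecone_db.py"]
```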
This project is licensed under the MIT License - see the LICENSE file for details.