# wikichat
wikichat ingests Cohere's multilingual Wikipedia embeddings into a Chroma vector database and provides a Chainlit web interface for retrieval-augmented generation (RAG) against the data using `gpt-4-1106-preview`.
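The RAG flow is: embed the user's question, retrieve the nearest Wikipedia passages from Chroma, and hand them to the model as context. A minimal sketch of the prompt-assembly step (the helper name and prompt wording are illustrative, not taken from the repo):

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble a RAG prompt: numbered retrieved passages followed by the question."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The assembled string would then be sent as the user message in a chat-completion call to `gpt-4-1106-preview`.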
I wanted to explore the idea of maintaining a local copy of Wikipedia, and this seemed like a good entry point. Down the road I might update this code to regularly pull the full Wikipedia dump and create the embeddings, instead of relying on Cohere's prebuilt embeddings. I went this route as a proof of concept, and as an excuse to try out Chainlit.
Based on `Wikipedia_Semantic_Search_With_Cohere_Embeddings_Archives.ipynb`.
Clone the repository:

```shell
git clone https://github.com/deadbits/wikipedia-chat.git
cd wikipedia-chat
```
Set up a Python virtual environment:

```shell
python3 -m venv venv
source venv/bin/activate
```
Install dependencies:

```shell
pip install -r requirements.txt
```
Set the Cohere and OpenAI API keys:

```shell
export OPENAI_API_KEY="..."
export COHERE_API_KEY="..."
```
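To fail fast if either key is missing, a small startup check can help — a sketch (the helper name is illustrative, not part of the repo):

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "COHERE_API_KEY")

def missing_keys(env=os.environ):
    """Return the names of any required API keys that are unset or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

# e.g. at startup:
# if missing_keys():
#     raise SystemExit("Missing API keys: " + ", ".join(missing_keys()))
```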
The embeddings dataset contains 485,859 records (1.63 GB).

Run `ingest.py` to download the Wikipedia embeddings dataset and load it into ChromaDB:

```shell
python ingest.py
```
The script adds records in batches of 100, but this will still take some time. The batch size could probably be increased.
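The batching described above can be sketched as a simple chunking generator (names are illustrative; the actual `ingest.py` may differ):

```python
def batched(records, batch_size=100):
    """Yield successive fixed-size batches from a list of records."""
    for start in range(0, len(records), batch_size):
        yield records[start : start + batch_size]

# Sketch of the Chroma insert loop (collection setup omitted; field names assumed):
# for batch in batched(rows, batch_size=100):
#     collection.add(
#         ids=[r["id"] for r in batch],
#         embeddings=[r["emb"] for r in batch],
#         documents=[r["text"] for r in batch],
#     )
```

Raising `batch_size` reduces the number of round-trips to Chroma, at the cost of larger in-memory batches.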
To launch the web interface, run the `chainlit_ui.py` script with Chainlit:

```shell
chainlit run chainlit_ui.py
```
*Chainlit interface (screenshot)*