# wikichat
wikichat ingests Cohere's multilingual Wikipedia embeddings into a Chroma vector database and provides a Chainlit web interface for retrieval-augmented generation (RAG) against the data using `gpt-4-1106-preview`.
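The RAG flow is: embed the user's question, retrieve the nearest Wikipedia passages from Chroma, and hand them to the model as context. A minimal sketch of the prompt-assembly step (the helper name and prompt wording are illustrative, not taken from the repo):

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble a RAG prompt: numbered retrieved passages followed by the question."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The assembled string would then be sent as the user message in a chat-completion call to `gpt-4-1106-preview`.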
I wanted to explore the idea of maintaining a local copy of Wikipedia, and this seemed like a good entry point. Down the road I might update this code to regularly pull the full Wikipedia dump and create the embeddings, instead of relying on Cohere's prebuilt embeddings. I went this route as a proof of concept, and as an excuse to try out Chainlit.
Based on `Wikipedia_Semantic_Search_With_Cohere_Embeddings_Archives.ipynb`.
Clone the repository:

```shell
git clone https://github.com/deadbits/wikipedia-chat.git
cd wikipedia-chat
```
Set up a Python virtual environment:

```shell
python3 -m venv venv
source venv/bin/activate
```
Install dependencies:

```shell
pip install -r requirements.txt
```
Set the Cohere and OpenAI API keys:

```shell
export OPENAI_API_KEY="..."
export COHERE_API_KEY="..."
```
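To fail fast if either key is missing, a small startup check can help — a sketch (the helper name is illustrative, not part of the repo):

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "COHERE_API_KEY")

def missing_keys(env=os.environ):
    """Return the names of any required API keys that are unset or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

# e.g. at startup:
# if missing_keys():
#     raise SystemExit("Missing API keys: " + ", ".join(missing_keys()))
```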
The embeddings dataset contains 485,859 records (1.63 GB).

Run `ingest.py` to download the Wikipedia embeddings dataset and load it into ChromaDB:

```shell
python ingest.py
```
The script adds records in batches of 100, but this will still take some time. The batch size could probably be increased.
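The batching described above can be sketched as a simple chunking generator (names are illustrative; the actual `ingest.py` may differ):

```python
def batched(records, batch_size=100):
    """Yield successive fixed-size batches from a list of records."""
    for start in range(0, len(records), batch_size):
        yield records[start : start + batch_size]

# Sketch of the Chroma insert loop (collection setup omitted; field names assumed):
# for batch in batched(rows, batch_size=100):
#     collection.add(
#         ids=[r["id"] for r in batch],
#         embeddings=[r["emb"] for r in batch],
#         documents=[r["text"] for r in batch],
#     )
```

Raising `batch_size` reduces the number of round-trips to Chroma, at the cost of larger in-memory batches.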
To launch the web interface, run the `chainlit_ui.py` script with Chainlit:

```shell
chainlit run chainlit_ui.py
```
*Chainlit interface (screenshot)*