A simple RAG chatbot that can retrieve from a mediawiki data dump
MIT License
Chatbots are very popular right now, and much of the openly accessible information on the web is stored in some kind of MediaWiki. A RAG (Retrieval-Augmented Generation) chatbot is a powerful alternative to traditional data gathering. This project provides a basic template for creating your own chatbot that runs locally on Linux.
Mediawikis hosted by Fandom usually allow you to download an XML dump of the entire wiki as it currently exists. This project primarily leverages LangChain, together with a few other open source projects, to combine many of the readily available quickstart guides into a complete vertical application based on MediaWiki data.
```mermaid
graph TD;
a[/xml dump a/] --MWDumpLoader--> emb
b[/xml dump b/] --MWDumpLoader--> emb
emb{Embedding} --> db
db[(Chroma)] --Document Retriever--> lc
hf(Huggingface) --Sentence Transformer--> emb
hf --LLM--> modelfile
modelfile[/Modelfile/] --> Ollama
Ollama(((Ollama))) <-.ChatOllama.-> lc
lc{Langchain} <-.LLMChain.-> cl(((Chainlit)))
click db href "https://github.com/chroma-core/chroma"
click hf href "https://huggingface.co/"
click cl href "https://github.com/Chainlit/chainlit"
click lc href "https://github.com/langchain-ai/langchain"
click Ollama href "https://github.com/jmorganca/ollama"
```
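Conceptually, the MWDumpLoader step in the diagram above turns each `<page>` element of the XML export into a text document for embedding. A stdlib-only sketch of that idea (real dumps wrap elements in the MediaWiki export XML namespace, omitted here for brevity, and the project uses LangChain's `MWDumpLoader` rather than this helper):

```python
import xml.etree.ElementTree as ET

# Tiny inline stand-in for an XML dump; real exports carry the
# MediaWiki export namespace and many more fields per page.
SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>Tako</title>
    <revision><text>An intelligent race of octopuses.</text></revision>
  </page>
</mediawiki>"""

def load_pages(xml_text):
    """Yield (title, wikitext) pairs, one per <page> element."""
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        title = page.findtext("title", default="")
        text = page.findtext("./revision/text", default="")
        yield title, text

pages = list(load_pages(SAMPLE_DUMP))
```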
```text
multi-mediawiki-rag # $HOME/app
├── .chainlit
│   ├── .langchain.db # Server Cache
│   └── config.toml # Server Config
├── app.py
├── chainlit.md
├── config.yaml
├── data # VectorDB
│   ├── 47e4e036-****-****-****-************
│   │   └── *
│   └── chroma.sqlite3
├── embed.py
├── entrypoint.sh
└── requirements.txt
```
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
These steps assume you are using a modern Linux OS like Ubuntu 22.04 with Python 3.10+.
```bash
apt-get install -y curl git python3-venv
git clone https://github.com/tylertitsworth/multi-mediawiki-rag.git
curl https://ollama.ai/install.sh | sh
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip setuptools wheel
pip install -r requirements.txt
```
1. Download an XML dump of your desired wiki from its `/wiki/Special:Statistics` page, or, using a tool like wikiteam3, scrape only namespace 0 to `sources/<wikiname>_pages_current.xml`.
2. Edit `config.yaml` with the location of the XML mediawiki data you downloaded in step 1 and other configuration information.

> [!CAUTION]
> Installing Ollama will create a new user and a service on your system. Follow the manual installation steps to avoid this step and instead launch the Ollama API using `ollama serve`.
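As a rough illustration, `config.yaml` might look like the following. Only the `model` field and the XML source location are mentioned in this README; the other details are assumptions, so verify field names against the repository's actual file:

```yaml
# Illustrative sketch only -- verify field names against the repo's config.yaml.
source: sources/<wikiname>_pages_current.xml  # XML dump downloaded in step 1
model: volo                                   # Ollama model used by the chatbot
```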
After installing Ollama, we can use a Modelfile to download and tune an LLM to be more precise for Document Retrieval QA.

```bash
ollama create volo -f ./Modelfile
```
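The Modelfile passed to `ollama create` follows Ollama's Modelfile syntax (`FROM`, `PARAMETER`, `SYSTEM`). A minimal illustrative sketch; the base model and values below are placeholders, not the project's actual tuning:

```
# Illustrative sketch; base model and parameter values are placeholders.
FROM mistral:7b
PARAMETER temperature 0.2
SYSTEM """
You are a lore assistant. Answer using only the retrieved wiki context.
"""
```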
> [!TIP]
> Choose a model from the Ollama model library and download it with `ollama pull <modelname>:<version>`, then edit the `model` field in `config.yaml` with the same information.
Alternatively, download a model directly from Huggingface:

```bash
git clone https://huggingface.co/<org>/<modelname> model/<modelname>
```

If the model is not in `.GGUF` format, convert it with:

```bash
docker run --rm -v $PWD/model/<modelname>:/model ollama/quantize -q q4_0 /model
```

Then edit the Modelfile's `FROM` line to contain the path to the `q4_0.bin` file in the modelname directory.

Your XML data needs to be loaded and transformed into embeddings to create a Chroma VectorDB.
```bash
python embed.py
```

Example output:

```text
2023-12-16 09:50:53 - Loaded .env file
2023-12-16 09:50:55 - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
2023-12-16 09:51:18 - Use pytorch device: cpu
2023-12-16 09:56:09 - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
Batches: 100%|████████████████████████████████████████| 1303/1303 [1:23:14<00:00, 3.83s/it]
...
Batches: 100%|████████████████████████████████████████| 1172/1172 [1:04:08<00:00, 3.28s/it]
2023-12-16 19:47:01 - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
2023-12-16 19:47:33 - Use pytorch device: cpu
Batches: 100%|████████████████████████████████████████████████| 1/1 [00:00<00:00, 40.41it/s]
A Tako was an intelligent race of octopuses found in the Kara-Tur setting. They were known for
their territorial nature and combat skills, as well as having incredible camouflaging abilities
that allowed them to blend into various environments. Takos lived in small tribes with a
matriarchal society led by one or two female rulers. Their diet consisted mainly of crabs,
lobsters, oysters, and shellfish, while their ink was highly sought after for use in calligraphy
within Kara-Tur.
```
To support additional sources, choose a new file-type Document Loader or app Document Loader and add it using your own script. Check out the provided example.
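A loader's job is simply to turn a source into documents with content and metadata. A self-contained sketch of a hypothetical plain-text file loader (a real loader for this project would subclass LangChain's `BaseLoader` and return its `Document` type instead of the stand-in below):

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterator

@dataclass
class Document:
    """Stand-in for LangChain's Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

class TextFileLoader:
    """Hypothetical file-type loader: one Document per .txt file in a directory."""

    def __init__(self, path: str):
        self.path = Path(path)

    def lazy_load(self) -> Iterator[Document]:
        # Yield documents one at a time to keep memory use low.
        for file in sorted(self.path.glob("*.txt")):
            yield Document(file.read_text(), {"source": str(file)})
```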
```bash
chainlit run app.py -h
```

Access the Chatbot GUI at `http://localhost:8000`.
```bash
export DISCORD_BOT_TOKEN=...
chainlit run app.py -h
```
> [!TIP]
> Develop locally with ngrok.
This chatbot is hosted on Huggingface Spaces for free, which means it is very slow due to the minimal hardware resources allocated to it. Despite this, the provided Dockerfile offers a generic method for hosting this solution as one unified container; however, this method is not ideal and can lead to many issues if used for professional production systems.
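A unified container along those lines could be sketched as follows. The base image, paths, and command here are assumptions; prefer the repository's actual Dockerfile:

```dockerfile
# Illustrative sketch only; see the repository's Dockerfile for the real build.
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
# entrypoint.sh would also need to start the Ollama API before Chainlit.
CMD ["bash", "entrypoint.sh"]
```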
Cypress tests modern web applications with visual debugging. It is used to test the Chainlit UI functionality.
```bash
npm install
# Run Test Suite
bash cypress/test.sh
```
> [!NOTE]
> Cypress requires `node >= 16`.
Pytest is a mature full-featured Python testing tool that helps you write better programs.
```bash
pip install pytest
# Test Embedding Functions
pytest test/test_embed.py -W ignore::DeprecationWarning
# Test e2e with Ollama Backend
pytest test -W ignore::DeprecationWarning
```
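Pytest discovers functions named `test_*` and runs their bare `assert` statements. A hypothetical standalone example in that style (not one of the tests shipped under `test/`):

```python
# test_sample.py -- hypothetical example; the project's real tests live under test/.
def normalize_query(q: str) -> str:
    """Collapse whitespace and lowercase a user query before retrieval."""
    return " ".join(q.split()).lower()

def test_normalize_query():
    assert normalize_query("  Who are  the TAKO? ") == "who are the tako?"
```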