An open source, Gradio-based chatbot app that combines the best of retrieval augmented generation and prompt engineering into an intelligent assistant for modern professionals.
GPL-3.0 License
Welcome to PyTLDR, a Python project that pushes the boundaries of chatbot assistants by combining NoSQL databases, language models, and prompt engineering. In this README file, I'll provide an in-depth overview of the system's architecture and functionality, as well as installation and usage instructions.
If you have ever wanted to deploy a private version of ChatGPT that can index and read your ebooks and text data, then this is the app for you. Best of all, it runs locally on your own hardware across a wide range of graphics processing units (GPUs), with no third-party subscriptions or APIs necessary.
Many professionals are interested in integrating Large Language Models (LLMs) into their tooling and daily workflow, but are concerned about privacy and cost. Additionally, LLMs are prone to hallucination, and it is extremely costly to train and fine-tune your own models for your particular use case. This application alleviates those concerns by allowing users to run inference locally on hardware they own, and integrates with Cassandra databases to provide Retrieval Augmented Generation (RAG), allowing users to augment foundation models such as Llama 2 with domain-specific knowledge, without investing in training or fine-tuning.
By instructing the LLM to respond to queries based on data you own, the possibilities for education, systems automation, and accelerating development are endless. Sit down and have a conversation with your data, and never fear the mountains of text you need to consume in order to go about your day. If there is no time to read it all, you can simply ask your virtual assistant to do the dirty work for you.
PyTLDR consists of four primary components:

- A Gradio web interface for chatting and managing settings
- A llama.cpp-powered LLM inference engine, with Llama 2 as the default model
- DataStax Enterprise (Cassandra) databases that store conversations and indexed source data
- A prompt engineering and RAG pipeline that ties the pieces together
Here's how retrieval augmented generation (RAG) in PyTLDR works:

1. When you submit a query with `Search my data` enabled, the LLM first generates a list of search keywords from your query and the selected agent context.
2. Those keywords are used to search the sources stored in the Cassandra databases, using Solr and vector indexes.
3. The most relevant passages are inserted into the prompt as context, up to the token limit of the model.
4. The LLM is then instructed to answer the query using only what was found in the source data.
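By way of illustration, here is a minimal Python sketch of that flow. The `llm` and `db` objects, function names, and prompt wording are assumptions made for the example, not PyTLDR's actual internals:

```python
# A minimal sketch of the RAG flow described above; names and prompt
# wording are illustrative assumptions, not PyTLDR's actual internals.
def answer_with_rag(query: str, context_name: str, omit: set[str], llm, db) -> str:
    # 1. Ask the LLM for search keywords, then drop any on the omit list.
    raw = llm.generate(f"List search keywords for this query: {query}")
    keywords = [k.strip() for k in raw.split(",")
                if k.strip().lower() not in omit]

    # 2. Search the sources linked to the agent context (Solr and vector
    #    indexes on the Cassandra side).
    passages = db.search(context=context_name, keywords=keywords)

    # 3. Fit as much retrieved context into the prompt as the hard limit
    #    allows; everything past the limit is truncated.
    context_text = "\n".join(passages)[:8000]

    # 4. Instruct the LLM to answer only from what was found.
    prompt = (f"Answer this query in the context of {context_name}, "
              f"using only these sources:\n{context_text}\n\nQuery: {query}")
    return llm.generate(prompt)
```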
PyTLDR offers several key advantages over traditional chatbots and information retrieval systems:

- Privacy: all inference runs locally on hardware you own, so your data never leaves your machine
- Cost: no third-party subscriptions or API fees, and no expensive training or fine-tuning
- Accuracy: RAG grounds responses in your own sources, reducing hallucination
- Flexibility: augment a foundation model such as Llama 2 with domain-specific knowledge simply by uploading documents
Use the following instructions to set up the environment and dependencies for PyTLDR. You will need some familiarity with the Linux command line, Python, and Docker in order to proceed, so if you are unfamiliar please refer to the official documentation for these tools.
Clone the repository:

```bash
git clone https://github.com/mlc-delgado/pytldr-oss.git
```

Change into the `/pytldr-oss` folder for the next steps.

This app works best with persistent data storage for LLMs and sentence transformer models. Since the app runs within Docker, you will need to instruct Docker on where to locate the files that will persist outside of Docker.
Open the `docker-compose-rocm.yml` for AMD, and if using Nvidia open the `docker-compose-cuda.yml`. Map the Cassandra data volumes to the directories created under `pytldr-oss/dse`, using the included creation script at its default location. Use the format `<local directory>:<container directory>` when setting up volumes. For example:

```yaml
volumes:
  - /home/<your username>/code/pytldr-oss/6.8.37:/var/lib/cassandra
...
volumes:
  - /home/<your username>/code/pytldr-oss/7.0.0-alpha.4:/var/lib/cassandra
```
In the same compose file, map the `/pytldr` directory to where you cloned the Python code, and the `/usr/share/models` to a downloads directory of your choosing. The `/root/.cache/huggingface` can be another directory of your choosing, or you can map it to your existing Hugging Face cache if you've downloaded sentence transformer models in the past. If you prefer to use a different path for storing the LLMs in the container, make sure to update the `llmPath` in the `config.yaml` to reflect the new download path for LLMs. For example:

```yaml
volumes:
  - /home/<your username>/code/pytldr-oss:/pytldr
  - /home/<your username>/code/llama.cpp/models:/usr/share/models
  - /home/<your username>/.cache/huggingface:/root/.cache/huggingface
```
IMPORTANT: These steps are only required when setting up a PyTLDR installation for the first time, in order to prevent errors when loading DataStax Enterprise containers. The default images for DataStax Enterprise expect there to be existing directories and permissions within `/var/lib/cassandra` for a user named `dse`. When the Cassandra data volume is mapped to the host machine, the containers do not automatically create the `dse` user and data directories. To work around this, these steps will create the user and directories required for the containers to start. You will need `sudo` privileges or execute as root to proceed with these steps.
Change into the `pytldr-oss/dse` directory and run the creation script:

```bash
cd dse
sudo python3 create_folders.py
```
The script creates the `dse` user and two data directories, `6.8.37` and `7.0.0-alpha.4`, along with several subdirectories, in the current working directory. If you run into any errors during the creation of the user and directories and want to start fresh, you will need to use `sudo` again to clean these up. The following commands will remove the created user and the data directories:

```bash
sudo userdel dse
sudo rm -r 6.8.37
sudo rm -r 7.0.0-alpha.4
```
If using AMD, build with the `pytldr-rocm.dockerfile`, and if using Nvidia use the `pytldr-cuda.dockerfile`:

```bash
docker build -t pytldr -f pytldr-<platform>.dockerfile .
```

Then start the server and databases with the matching compose file:

```bash
docker compose -f docker-compose-<platform>.yml up -d
```
Watch the logs to confirm the server has started:

```bash
docker logs --follow pytldr-server-1
```

When the app is ready, the log will show the local URL for the web interface:

```
Running on local URL: http://0.0.0.0:7860
```

Press `CTRL+C` to stop watching the docker logs, or from another terminal execute the following command to gracefully shut down the server and databases:

```bash
docker compose -f docker-compose-<platform>.yml down
```
Your data and conversations will persist on the Cassandra databases between application restarts. On first startup you will simply be presented with a menu to select an LLM and context, and from there you can start a conversation. If you have previously saved conversations, they will be listed in the History. Click on the name of a conversation to switch to another conversation, or click on New Chat to start a new one.
If you have not already downloaded a compatible LLM, the LLM list will be empty. To download the default model, click on the RAG Settings tab and see the Downloading models section below before starting.
1. Select an LLM and an agent context, or leave the context set to `<None>`.
2. Check `Search my data` to enable the RAG workflow.
3. Enter a prompt and click `Submit`, then wait for the response to be generated. If RAG is enabled, a list of the generated search keywords will temporarily appear while the search is in progress.
4. Click `Clear` to immediately clear the prompt input box.
to immediately clear the prompt input box.Agent contexts are short, user-defined phrases to describe a subject or a line of thinking that the chatbot should apply when searching and answering queries. These short phrases could be the name of a topic, such as vegan cooking
, or the name of a product, such as Ubuntu Linux
. This phrase gets inserted into system prompts to guide the LLM, literally as in Answer this query in the context of < agent context >.
Defining a new context is as simple as completing the sentence with a name or description of your desired agent context.
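In practice this means the context string is substituted into a prompt template, along these lines (the template wording here is inferred from the example above, not taken from the app's code):

```python
# Hypothetical illustration of how an agent context shapes the system prompt.
def build_system_prompt(agent_context: str | None) -> str:
    if agent_context is None:  # <None>: no system prompt, plain chat
        return ""
    return f"Answer this query in the context of {agent_context}."

print(build_system_prompt("vegan cooking"))
# Answer this query in the context of vegan cooking.
```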
`Omit Search Keywords` is an optional list of keywords that you can use to clean up your RAG queries. Some agent contexts, like the rules for Warhammer 40K, have a tendency to cause the LLM to generate a lot of extra search keywords that aren't necessary in the given context, which can skew search results. For those contexts, you can list the keywords you want to omit from searches. Removing keywords such as "Warhammer" and "40K" will greatly improve the accuracy of the searches, since those are already implied by the sources.
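The effect is a simple filter over the generated keywords before the search runs, roughly like this sketch (case-insensitive matching is an assumption):

```python
# Sketch of omit-list filtering applied to generated search keywords.
def filter_keywords(keywords: list[str], omit: list[str]) -> list[str]:
    omitted = {o.lower() for o in omit}
    return [k for k in keywords if k.lower() not in omitted]

print(filter_keywords(["Warhammer", "40K", "melee", "charge phase"],
                      omit=["Warhammer", "40K"]))
# ['melee', 'charge phase']
```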
Agent contexts are saved to the local `config.yaml` file. Exercise caution when creating and removing contexts. Once created, an agent context cannot be renamed, and removing an agent context will also remove any sources associated with that context.
Agent contexts are optional for a conversation. If you leave the agent context as `<None>`, then you are freely chatting with the LLM with no system prompt and no RAG, so when you prompt it will behave the same as the default Llama 2 model. However, once you select an agent context the chatbot behaves completely differently.
Agent contexts can operate in three different modes: system prompts, system prompts with RAG, and Safe Mode.
If you have no books or sources loaded to your selected agent context, or if you have sources but leave the `Search my data` box unchecked, then selecting an agent context other than `<None>` will apply a system prompt that instructs the LLM to respond in the chosen context. It is not searching your sources, but it will attempt to follow the prompt and the current conversation history to control the context of the responses.
If you have sources loaded to the selected agent context and have checked the box for `Search my data`, then the full RAG workflow activates. When RAG is enabled the system prompts inform the LLM to present only what is found in the source data. Depending on the amount of context found, sometimes the conversation history is ignored in favor of fitting as much context as possible for answers.

Exceeding the maximum number of tokens that the LLM can process in a single prompt can cause the GPU to run out of memory, so the prompt length is given a hard limit and the app truncates everything else. Given this, it's best to enable RAG only when you want to locate information in your sources, and turn it off when not needed. Once you've received a response that contains the data you need, you can switch off `Search my data`, which allows more space in the context for the conversation history. From there you can prompt the chatbot again to transform the previous reply into the format you want.
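To make the tradeoff concrete, here is a naive sketch of truncation under a fixed token budget; the real app uses the model's actual tokenizer and context window, while this example just splits on whitespace:

```python
# Naive sketch of prompt truncation under a hard token limit. The
# whitespace "tokenizer" is a placeholder; llama.cpp models use a real one.
def fit_prompt(system: str, chunks: list[str], query: str,
               max_tokens: int = 4096) -> str:
    def ntokens(text: str) -> int:
        return len(text.split())  # placeholder for a real tokenizer

    budget = max_tokens - ntokens(system) - ntokens(query)
    kept = []
    for chunk in chunks:  # most relevant retrieved chunks first
        if ntokens(chunk) > budget:
            break  # everything past the hard limit is truncated
        kept.append(chunk)
        budget -= ntokens(chunk)
    return "\n".join([system, *kept, query])
```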
To reduce the amount of RAG prompt processing, the chatbot will attempt to suggest previous answers if a user repeats the same query in a conversation. This is currently experimental, and will not activate if you rephrase the previous query rather than repeating it.
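That description suggests an exact-match lookup over the conversation history, along these lines (a guess at the mechanism, not the actual code):

```python
# Hypothetical sketch of the repeated-query shortcut: only an exact repeat
# (after trivial normalization) returns a previous answer, which matches
# the documented behavior that rephrased queries do not activate it.
def suggest_previous_answer(query: str,
                            history: list[tuple[str, str]]) -> str | None:
    normalized = query.strip().lower()
    for past_query, past_answer in reversed(history):
        if past_query.strip().lower() == normalized:
            return past_answer
    return None  # no exact repeat; run the full RAG workflow
```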
This is the most restrictive mode for RAG, and it is disabled by default. Safe Mode applies an additional prompt engineering check at the beginning of the RAG workflow to ensure that the answers stay within the selected agent context. Enable it with the checkbox labeled `Safe Mode` in the `RAG Settings` menu. You must enable both `Search my data` and `Safe Mode` for the feature to activate.
Under Safe Mode, when users issue commands such as "Ignore all previous instructions", the chatbot will interpret that as an attempt to change the subject, and will disregard the instructions and try to redirect back to the agent context. This treats all queries outside of the agent context as potentially unsafe, and prevents the chatbot from answering outside the selected context, rather than responding according to its own definition of an unsafe request.
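Conceptually, the Safe Mode check is one extra LLM round-trip before any retrieval happens, something like this sketch (an interpretation of the behavior described above, not the app's actual prompt):

```python
# Hypothetical sketch of the Safe Mode pre-check: classify the query
# against the agent context before running the RAG workflow.
def safe_mode_check(query: str, agent_context: str, llm) -> bool:
    verdict = llm.generate(
        f"Does this query stay within the context of {agent_context}? "
        f"Answer YES or NO.\nQuery: {query}"
    )
    return verdict.strip().upper().startswith("YES")
```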
Conversations and data sources are stored in Cassandra databases running in DataStax Enterprise containers. This presents a few benefits and a few drawbacks.
On one hand, this allows the application to benefit from the features and performance enhancements provided by DataStax Enterprise. Solr and vector indexing and searching are easy to implement, with little to no configuration required out of the box. Additionally, tools and support are available for migrating to larger Cassandra clusters and enterprise management tools. On the other hand, this adds some complexity to the database infrastructure, and makes it more difficult to start and stop the application from a user's local machine.
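For a sense of what the vector search side can look like, here is a hedged example using the Python `cassandra-driver`; the keyspace, table, and column names are invented for illustration, and the exact CQL and driver support depend on the database version:

```python
# Illustrative ANN (vector) search against a DSE/Cassandra table.
# The keyspace, table, and column names here are hypothetical.
from cassandra.cluster import Cluster
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
query_embedding = model.encode("How do I install Ubuntu?").tolist()

session = Cluster(["127.0.0.1"]).connect("pytldr")
rows = session.execute(
    "SELECT chunk FROM sources ORDER BY embedding ANN OF %s LIMIT 5",
    [query_embedding],
)
for row in rows:
    print(row.chunk)
```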
Alternate databases are being evaluated and may make it into a future release. For now, be sure to give yourself enough time to gracefully start and stop the databases and application before shutting down the host operating system.
Source files are linked to the selected agent context when they are uploaded. This ensures that when you perform searches with RAG you can keep conflicting sources and contexts separate from one another.
Currently the following file extensions are supported:
Name your source files using the format `<book title> Author <author>.<extension>`, such as `The Old Man and the Sea Author Ernest Hemingway.pdf`.
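A filename following that convention splits cleanly into title and author, for example (an illustrative helper, not part of the app):

```python
# Parse "<book title> Author <author>.<extension>" into its parts.
from pathlib import Path

def parse_source_filename(filename: str) -> tuple[str, str]:
    stem = Path(filename).stem  # drop the extension
    title, _, author = stem.rpartition(" Author ")
    return title, author

print(parse_source_filename("The Old Man and the Sea Author Ernest Hemingway.pdf"))
# ('The Old Man and the Sea', 'Ernest Hemingway')
```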
The application was developed around Llama 2 at 13 billion parameters. If the default model is not found in the `llmPath` specified in the `config.yaml`, then the application will provide the option to download it for you from Hugging Face.
Alternatively, you can supply your own llama.cpp-compatible GGUF model file, but this can have unpredictable effects, as each model responds slightly differently to the same prompts. I tested with Mistral and got similar results, but not quite as accurate as Llama 2, and tested again with Llama 3 and saw it become completely unusable. For best results I recommend sticking to the default Llama 2, pending future updates.
Without RAG, the app performs the same as other chatbots when prompting Llama 2. Enabling RAG adds to the number of prompts that are executed in the background before a final answer is provided to the user, so you should expect a noticeable increase in processing time while RAG is active. On capable hardware, results can usually be generated within 30-90 seconds.
The quality and accuracy of the outputs depends entirely on your data sources and what is being asked. You will need to experiment with search keywords, the number of sources, and various types of prompts in order to determine what works best for your use case.
Changes made in the `RAG Settings` menu apply to the current session only. If you want the settings to persist between application restarts, click on `Save Settings` to save the changes to the `config.yaml`.
Prompting with `Ignore all previous instructions` or `Ignore the previous conversation` is usually enough to get the chatbot to change the subject and do what is asked.

Use `docker logs dse68` or `docker logs dse7` to view the logs from each database. In many cases you can simply run `docker compose up -d` again to see if the error corrects itself. If not, you may need to run some repair steps from another container.

Thanks to Meta for providing free and open LLMs, and thanks to the broader machine learning community for their contributions to fantastic tools such as llama.cpp and Gradio.
And a special thanks to AMD for providing a GPU to aid in developing this solution for ROCm.