A comprehensive toolkit for building Retrieval-Augmented Generation (RAG) pipelines, including data loading, vector database creation, retrieval, and chain management.
RAGFlowChain is a powerful and flexible toolkit designed for building Retrieval-Augmented Generation (RAG) pipelines. This library integrates data loading from various sources, vector database creation, and chain management, making it easier to develop advanced AI solutions that combine retrieval mechanisms with generative models.
To install RAGFlowChain, simply run:
```bash
pip install RAGFlowChain==0.5.1
```
RAGFlowChain allows you to fetch and process data from various online and local sources, all integrated into a single DataFrame.
```python
from ragflowchain import data_loader
import yaml

# Load the API keys once from a local credentials file (kept out of version control)
PATH_CREDENTIALS = '../credentials.yml'
with open(PATH_CREDENTIALS) as f:
    credentials = yaml.safe_load(f)
BOOKS_API_KEY = credentials['book']
NEWS_API_KEY = credentials['news']
YOUTUBE_API_KEY = credentials['youtube']
# Define online and local data sources
# Define URLs for websites
urls = [
"https://www.honda.ca/en",
"https://www.honda.ca/en/vehicles",
"https://www.honda.ca/en/odyssey"
]
# Define online sources
online_sources = {
'youtube': {
'topic': 'honda acura',
'api_key': YOUTUBE_API_KEY,
'max_results': 10
},
'websites': urls,
'books': {
'api_key': BOOKS_API_KEY,
'query': 'automobile industry',
'max_results': 10
},
'news_articles': {
'api_key': NEWS_API_KEY,
'query': 'automobile marketing',
'page_size': 5,
'max_pages': 1
}
}
local_sources = ["../folder/irt.ppt", "../book/book.pdf", "../documents/sample.docx", "../notes/note.txt"]
# Fetch and process the data
final_data_df = data_loader(online_sources=online_sources, local_sources=local_sources, chunk_size=1000)
# Display the DataFrame
print(final_data_df)
```

`data_loader` returns `final_data_df`, a `pandas.DataFrame` with the columns `source`, `title`, `author`, `publishedDate`, `description`, `content`, `url`, and `source_type`. Each row corresponds to a chunk of content from a source, making it ready for further processing or embedding.
```python
from ragflowchain import create_database

# Create a vector store from the processed data
vectorstore, docs_recursive = create_database(
    df=final_data_df,
    page_content="content",
    embedding_function=None,          # Uses default SentenceTransformerEmbeddings
    vectorstore_method='Chroma',      # Options: 'Chroma', 'FAISS', 'Annoy'
    vectorstore_directory="data/chroma.db",  # Interpretation depends on vectorstore_method
    chunk_size=1000,
    chunk_overlap=100
)
```
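Before wiring the store into a chain, you can sanity-check retrieval. A minimal sketch, assuming the returned `vectorstore` follows LangChain's standard `VectorStore` interface (the query string is just an illustration):

```python
# Quick retrieval check (assumes the LangChain VectorStore interface)
hits = vectorstore.similarity_search("Honda Odyssey features", k=3)
for doc in hits:
    # Each hit is a Document carrying page_content plus metadata from the DataFrame
    print(doc.metadata.get("source_type"), "->", doc.page_content[:80])
```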
`create_database` Arguments:

- `df`: `pandas.DataFrame`. The DataFrame containing the text content (in the column named by `page_content`) that you want to split into chunks and store in the vector database. Other columns might include metadata like `source`, `title`, `author`, etc.
- `page_content`: `str`. The name of the column in `df` that contains the main text content. This content is split into chunks and used to create the embeddings stored in the vector database.
- `embedding_function` (Optional): Defaults to the `SentenceTransformerEmbeddings` model from `sentence-transformers`, specifically the "all-MiniLM-L6-v2" model, which converts text chunks into high-dimensional vectors for storage in the vector database.
- `vectorstore_method`: `str`. One of:
  - `'Chroma'`: A flexible and persistent vector store that is saved to disk.
  - `'FAISS'`: High-performance, in-memory or disk-based approximate nearest neighbor search.
  - `'Annoy'`: Lightweight, memory-efficient approximate nearest neighbor search.
- `vectorstore_directory`: `str`. Where the store is persisted; its meaning depends on `vectorstore_method`:
  - `Chroma`: the directory where the database is stored.
  - `FAISS`: the path to save the FAISS index file.
  - `Annoy`: the file path for the Annoy index.
- `chunk_size`: `int`. The size of each text chunk. Default is `1000` characters.
- `chunk_overlap`: `int`. The overlap between consecutive chunks. Default is `100` characters.

Returns:

- `vectorstore`: The vector store of the type selected by `vectorstore_method` (`Chroma`, `FAISS`, or `Annoy`). It is persisted to `vectorstore_directory` and can be used for retrieval tasks.
- `docs_recursive`: `List[Document]`. The chunked documents as instances of the `Document` class, containing both the content and metadata such as source, title, and other relevant information from the original DataFrame.

Next, integrate the data and vector store into a Retrieval-Augmented Generation (RAG) chain.
```python
from ragflowchain import create_rag_chain

# Create the RAG chain
rag_chain = create_rag_chain(
    llm=YourLanguageModel(),                    # Replace with your LLM instance
    vector_database_directory="data/chroma.db",
    method='Chroma',                            # Choose 'Chroma', 'FAISS', or 'Annoy'
    embedding_function=None,                    # Optional, defaults to SentenceTransformerEmbeddings
    system_prompt="This is a system prompt.",   # Optional: customize your system prompt
    chat_history_prompt="This is a chat history prompt.",  # Optional: customize your chat history prompt
    tavily_search="YourTavilyAPIKey"            # Optional: Tavily API key or TavilySearchResults instance
)
```
`create_rag_chain` Arguments:

- `llm`: The language model instance used in the RAG chain.
- `vector_database_directory`: `str`. The directory where the vector database is located.
- `method`: `str`. One of:
  - `'Chroma'`: A flexible and persistent vector store that is saved to disk.
  - `'FAISS'`: High-performance, in-memory or disk-based approximate nearest neighbor search.
  - `'Annoy'`: Lightweight, memory-efficient approximate nearest neighbor search.
- `embedding_function` (Optional): Defaults to `SentenceTransformerEmbeddings` if not provided.
- `system_prompt`: `str` (Optional). If `None`, a default system prompt will be used.
- `chat_history_prompt`: `str` (Optional). If `None`, a default prompt for contextualizing questions will be used.
- `tavily_search`: `str` or `TavilySearchResults` instance (Optional). Your Tavily API key, or an instance of `TavilySearchResults`. If provided, the chain will include up-to-date web search results in its responses.

Returns:

- `rag_chain`: A `RunnableWithMessageHistory` that you execute with its `invoke` method.
```python
# Example usage with the invoke method
result = rag_chain.invoke(
    {"input": "Your question here"},
    config={
        "configurable": {"session_id": "user123"}
    }
)
print(result["answer"])
```
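Because the chain tracks history per session, a follow-up call that reuses the same `session_id` can build on earlier turns. A sketch (the follow-up text is illustrative; history handling depends on how the chain's session store is configured):

```python
# Follow-up in the same session: the chain can use prior turns as context
followup = rag_chain.invoke(
    {"input": "Can you summarize that in one sentence?"},
    config={"configurable": {"session_id": "user123"}}
)
print(followup["answer"])
```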
`invoke` Usage:

- `invoke`: This method triggers execution of the RAG chain. It's preferred over `run` when working with configurable settings or when `invoke` is the designated method in the LangChain version you're using.
- Input dictionary: The user's question or input is passed as a dictionary with the key `"input"`.
- `config` dictionary: Additional configuration can be passed via the `config` dictionary. Here, `"configurable": {"session_id": "user123"}` sets a session ID, which is useful for tracking the conversation history across multiple interactions.

Returns:

- `result`: `dict`. The key `"answer"` contains the generated response to the user's input, while other keys may include additional metadata depending on the RAG chain configuration.

data_loader
```python
data_loader(online_sources=None, local_sources=None, chunk_size=1000)
```
- `online_sources`: A dictionary specifying the online sources to fetch data from. The keys name the source type and the values are dictionaries of the required parameters, as in the usage example above:
  - `books`: `api_key`, `query`, and `max_results`, for fetching books from the Google Books API.
  - `news_articles`: `api_key`, `query`, `page_size`, and `max_pages`, for fetching news articles from NewsAPI.
  - `youtube`: `topic`, `api_key`, and `max_results`, for fetching YouTube videos.
  - `websites`: A list of URLs to fetch content from.
- `local_sources`: A list of paths to local files (PDF, PPT, DOCX, TXT). The function loads and processes these documents into manageable chunks.
- `chunk_size`: The size of each text chunk, in characters. Default is `1000`. This determines how the text content is split, ensuring that chunks are neither too large nor too small.
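As a quick illustration, the loader can also run with local files only, in which case no API keys are needed (a minimal sketch based on the documented signature; the file path is a placeholder):

```python
from ragflowchain import data_loader

# Local-only ingestion: online_sources defaults to None, so no API keys are required
df = data_loader(local_sources=["../book/book.pdf"], chunk_size=500)
print(df[["source_type", "title"]].head())
```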
create_database
```python
create_database(df, page_content, embedding_function=None, vectorstore_method='Chroma', vectorstore_directory="data/vectorstore.db", chunk_size=1000, chunk_overlap=100)
```
- `df`: A pandas DataFrame containing the processed data. This should include columns like `content`, `source`, etc.
- `page_content`: The name of the DataFrame column that contains the main text content to be embedded.
- `embedding_function`: (Optional) A function or model used to generate embeddings. Defaults to `SentenceTransformerEmbeddings`.
- `vectorstore_method`: The vector store backend. Options include:
  - `'Chroma'`: For a flexible and persistent vector store saved to disk.
  - `'FAISS'`: For high-performance, in-memory or disk-based approximate nearest neighbor search.
  - `'Annoy'`: For a lightweight, memory-efficient approximate nearest neighbor search.
- `vectorstore_directory`: Where the vector store is saved. The default is `"data/vectorstore.db"`, but the exact usage depends on `vectorstore_method`:
  - `'Chroma'`: the directory where the database is stored.
  - `'FAISS'`: the path to save the FAISS index file.
  - `'Annoy'`: the file path for the Annoy index.
- `chunk_size`: The size of each text chunk, in characters, used during text splitting.
- `chunk_overlap`: The overlap between consecutive chunks, to maintain context. Default is `100`.
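If you ever need to reopen a persisted Chroma store outside of `create_rag_chain`, LangChain's `Chroma` wrapper can load it from the same directory. A sketch, assuming the `langchain-community` and `sentence-transformers` packages that back the defaults are installed:

```python
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import SentenceTransformerEmbeddings

# Reopen the store created above; the embedding model must match the one
# used at creation time ("all-MiniLM-L6-v2" is the documented default)
emb = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
store = Chroma(persist_directory="data/chroma.db", embedding_function=emb)
retriever = store.as_retriever(search_kwargs={"k": 5})
```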
create_rag_chain
```python
create_rag_chain(llm, vector_database_directory, method='Chroma', embedding_function=None, system_prompt=None, chat_history_prompt=None, tavily_search=None)
```
- `llm`: The language model that will be used in the RAG chain. This could be an instance of GPT-3 or any other compatible model.
- `vector_database_directory`: The directory where the vector database is located.
- `method`: The vector store backend. Options include:
  - `'Chroma'`: For a flexible and persistent vector store saved to disk.
  - `'FAISS'`: For high-performance, in-memory or disk-based approximate nearest neighbor search.
  - `'Annoy'`: For a lightweight, memory-efficient approximate nearest neighbor search.
- `embedding_function`: (Optional) The function or model used to generate embeddings during retrieval. If not provided, it defaults to `SentenceTransformerEmbeddings`.
- `system_prompt`: (Optional) A prompt given to the language model to guide its responses, such as instructions or application-specific context. If set to `None`, a default system prompt will be used.
- `chat_history_prompt`: (Optional) A prompt template that incorporates the chat history, helping the model maintain context across multiple interactions. If set to `None`, a default prompt for contextualizing questions will be used.
- `tavily_search`: (Optional) Integrates real-time web search results into the RAG chain. Provide either your Tavily API key as a string or an instance of `TavilySearchResults`. If provided, the chain will include up-to-date web search results in its responses.
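To pass a preconfigured search tool instead of a raw key, you can construct LangChain's `TavilySearchResults` yourself. A sketch, assuming `langchain-community` is installed and `TAVILY_API_KEY` is set in the environment:

```python
from langchain_community.tools.tavily_search import TavilySearchResults
from ragflowchain import create_rag_chain

# TavilySearchResults reads TAVILY_API_KEY from the environment by default
search_tool = TavilySearchResults(max_results=3)

rag_chain = create_rag_chain(
    llm=YourLanguageModel(),  # placeholder, as in the example above
    vector_database_directory="data/chroma.db",
    method='Chroma',
    tavily_search=search_tool,
)
```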
rag_chain.invoke
```python
rag_chain.invoke({"input": question}, config={"configurable": {"session_id": "any"}})
```
- `input`: The user's question, passed as a dictionary with the key `"input"`. This is the prompt or query that the model will process.
- `config`: A dictionary of additional configuration settings. The `"configurable"` key lets you set a session ID or other parameters that influence how the chain processes the input.
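Putting it together, a minimal interactive loop might look like this (a sketch; the exit handling and session ID are illustrative):

```python
session = {"configurable": {"session_id": "demo"}}
while True:
    question = input("You: ")
    if question.lower() in {"exit", "quit"}:
        break
    result = rag_chain.invoke({"input": question}, config=session)
    print("Assistant:", result["answer"])
```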
For more detailed documentation, including advanced usage and customization options, please visit the GitHub repository.
RAGFlowChain is licensed under the MIT License. See the LICENSE file for more information.
RAGFlowChain is built on top of powerful tools like LangChain and Chroma. We thank the open-source community for their contributions.
Made with ❤️ by Kwadwo Daddy Nyame Owusu - Boakye.