Running Llama 2 and other Open-Source LLMs on CPU Inference Locally for Document Q&A

Clearly explained guide for running quantized open-source LLM applications on CPUs using Llama 2, C Transformers, GGML, and LangChain

Step-by-step guide on TowardsDataScience: https://towardsdatascience.com/running-llama-2-on-cpu-inference-for-document-q-a-3d636037a3d8

Context

Third-party commercial large language model (LLM) providers like OpenAI, with models such as GPT-4, have democratized LLM use via simple API calls.
However, some teams need self-managed or private model deployment for reasons like data privacy and data residency rules.
The proliferation of open-source LLMs has opened up a vast range of options for us, thus reducing our reliance on these third-party providers.
When we host open-source LLMs ourselves, whether on-premises or in the cloud, dedicated compute capacity becomes a key issue. While GPU instances may seem the obvious choice, their costs can quickly exceed the budget.
In this project, we show how to run quantized versions of open-source LLMs locally on CPU for document question-and-answer (Q&A).
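To make the case for quantization concrete, here is a rough back-of-the-envelope estimate of the weight size of a 7-billion-parameter model at different precisions. It is only a sketch: the function name is hypothetical, and the calculation ignores quantization metadata (such as per-block scales) and runtime buffers, so real model files and memory use are somewhat larger.

```python
def approx_weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Llama-2-7B has roughly 7 billion parameters.
n_params = 7e9

fp16_gb = approx_weight_size_gb(n_params, 16)  # half-precision weights
q8_gb = approx_weight_size_gb(n_params, 8)     # 8-bit quantization
q4_gb = approx_weight_size_gb(n_params, 4)     # 4-bit quantization

print(f"16-bit: ~{fp16_gb:.1f} GB, 8-bit: ~{q8_gb:.1f} GB, 4-bit: ~{q4_gb:.1f} GB")
```

At 16 bits per weight the model needs about 14 GB for weights alone, while a 4-bit variant needs roughly a quarter of that, which is what brings it within reach of ordinary CPU machines.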

Quickstart

Ensure you have downloaded the GGML binary file from https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML and placed it into the models/ folder
To pass a user query to the application, open a terminal in the project directory and run the following command:
poetry run python main.py "<user query>"
For example, poetry run python main.py "What is the minimum guarantee payable by Adidas?"
Note: Omit the prepended poetry run if you are NOT using Poetry
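The command above passes the query as a single quoted command-line argument. A minimal sketch of how a script could read it with argparse is shown below. This is an assumption for illustration and not necessarily how main.py is written; the function name parse_args and the argument name query are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the user query from the command line (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Ask a question about the ingested documents."
    )
    # The query must be quoted in the shell so it arrives as one argument.
    parser.add_argument("query", type=str, help="User query, in quotes")
    return parser.parse_args(argv)


# Simulate: python main.py "What is the minimum guarantee payable by Adidas?"
args = parse_args(["What is the minimum guarantee payable by Adidas?"])
print(f"Query received: {args.query}")
```

Calling parse_args() with no arguments reads from sys.argv, which is what a real command-line entry point would do.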

Tools

LangChain: Framework for developing applications powered by language models
C Transformers: Python bindings for Transformer models implemented in C/C++ using the GGML library
FAISS: Open-source library for efficient similarity search and clustering of dense vectors.
Sentence-Transformers (all-MiniLM-L6-v2): Open-source pre-trained transformer model for embedding text to a 384-dimensional dense vector space for tasks like clustering or semantic search.
Llama-2-7B-Chat: Open-source fine-tuned Llama 2 model designed for chat dialogue. Leverages publicly available instruction datasets and over 1 million human annotations.
Poetry: Tool for dependency management and Python packaging
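To illustrate what similarity search over dense vectors means, the sketch below ranks a few toy document vectors against a query vector by cosine similarity. It is a brute-force, pure-Python stand-in and not FAISS itself: FAISS performs this kind of search far more efficiently, and an index may be configured with a different metric such as L2 distance. The 3-dimensional vectors are made up; real all-MiniLM-L6-v2 embeddings have 384 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy embeddings standing in for embedded document chunks.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query_vec = [1.0, 0.05, 0.0]

print(top_k(query_vec, doc_vecs, k=2))  # → [0, 1]
```

In a document Q&A pipeline, the chunks at the returned indices would be passed to the LLM as context for answering the query.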

Files and Content

/assets: Images relevant to the project
/config: Configuration files for LLM application
/data: Dataset used for this project (i.e., Manchester United FC 2022 Annual Report - 177-page PDF document)
/models: Binary file of the GGML-quantized LLM (i.e., Llama-2-7B-Chat)
/src: Python code for the key components of the LLM application, namely llm.py, utils.py, and prompts.py
/vectorstore: FAISS vector store for documents
db_build.py: Python script to ingest dataset and generate FAISS vector store
main.py: Main Python script to launch the application and to pass user query via command line
pyproject.toml: TOML file specifying the dependency versions used (Poetry)
requirements.txt: List of Python dependencies (and versions)
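As a rough idea of what an ingestion script such as db_build.py could look like, the sketch below loads PDFs from data/, splits them into chunks, embeds them with all-MiniLM-L6-v2, and saves a FAISS index to vectorstore/. It is an assumption-laden illustration, not the project's actual code: it assumes the langchain-community, langchain-text-splitters, sentence-transformers, faiss-cpu, and pypdf packages are installed (import paths differ between LangChain versions), and the chunk size and overlap values are hypothetical.

```python
from pathlib import Path

# Folder names follow the project layout described above.
DATA_DIR = Path("data")
VECTORSTORE_DIR = Path("vectorstore")
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"


def build_vectorstore(data_dir=DATA_DIR, out_dir=VECTORSTORE_DIR,
                      chunk_size=500, chunk_overlap=50):
    """Ingest PDFs and persist a FAISS vector store; returns the chunk count."""
    # Imported here so the module loads even if the heavy dependencies are
    # missing. These import paths are an assumption tied to recent LangChain.
    from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import FAISS
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # Load every PDF in the data folder.
    loader = DirectoryLoader(str(data_dir), glob="*.pdf", loader_cls=PyPDFLoader)
    documents = loader.load()

    # Split pages into overlapping chunks (sizes are illustrative only).
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    chunks = splitter.split_documents(documents)

    # Embed on CPU and build the index.
    embeddings = HuggingFaceEmbeddings(
        model_name=EMBEDDING_MODEL, model_kwargs={"device": "cpu"}
    )
    store = FAISS.from_documents(chunks, embeddings)
    store.save_local(str(out_dir))
    return len(chunks)
```

Calling build_vectorstore() once before querying would produce the index that the application later loads for retrieval.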
