OpenAI-style, fast & lightweight local language model inference with documents
g4l is a high-level Python library that lets you run language models locally using the llama.cpp Python bindings. It is the sister project of @gpt4free, which also provides AI but through internet-based and external providers; g4l additionally offers features such as text retrieval from documents.
Pull requests are welcome!
llama-cpp-python
To use G4L, you need to have the llama.cpp Python bindings installed. You can install them using pip:
pip3 install -U llama-cpp-python
git clone https://github.com/gpt4free/gpt4local
cd gpt4local
pip install -r requirements.txt
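To check that the bindings installed correctly, you can try importing them from Python. This is a quick sanity check, not part of g4l itself; it assumes llama-cpp-python exposes a __version__ attribute, which it normally does:

import llama_cpp

# If this prints a version string, the llama.cpp bindings are ready to use.
print(llama_cpp.__version__)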
G4L runs models in the GGUF format from HuggingFace. You can find a variety of quantized .gguf models, for example on TheBloke's page, and place them in the ./models folder.
The models are available in different quantization levels, such as q2_0, q4_0, q5_0, and q8_0. Higher quantization 'bit counts' (4 bits or more) generally preserve more quality, whereas lower levels compress the model further, which can lead to a significant loss in quality. The standard quantization level is q4_0.
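If you have the huggingface_hub package installed, you can download a quantized model straight into the ./models folder. The repository and file names below are only an example, so check the model page for the exact quantization you want:

from huggingface_hub import hf_hub_download

# Example only: pick any GGUF repository/file you like, e.g. a q4_0 quant of Mistral-7B-Instruct-v0.2.
hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_0.gguf",
    local_dir="./models",
)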
Keep in mind the memory requirements for different model sizes:
- 7B-class models: around 8gb of RAM
- larger models: 16gb of RAM or more

According to chat.lmsys.org, the best models are:
- 7B: Mistral-7B-Instruct-v0.2
- 72B: Qwen1.5-72B-Chat (available here)

Basic usage looks like this:

from g4l.local import LocalEngine
engine = LocalEngine(
    gpu_layers = -1,  # use all GPU layers
    cores = 0         # use all CPU cores
)

response = engine.chat.completions.create(
    model = 'orca-mini-3b-gguf2-q4_0',
    messages = [{"role": "user", "content": "hi"}],
    stream = True
)

for token in response:
    print(token.choices[0].delta.content or "", end="", flush=True)
Note: The model parameter must match the file name of the .gguf model you placed in ./models, without the .gguf extension!
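For example, the file ./models/mistral-7b-instruct.gguf is addressed as model = 'mistral-7b-instruct'. A small helper (not part of g4l) to list the model names available in your ./models folder could look like this:

from pathlib import Path

# The stem of each .gguf file is the value to pass as the `model` parameter.
available_models = [p.stem for p in Path("./models").glob("*.gguf")]
print(available_models)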
To ground answers in your own documents, attach a DocumentRetriever to the engine:

from g4l.local import LocalEngine, DocumentRetriever

engine = LocalEngine(
    gpu_layers = -1,  # use all GPU layers
    cores = 0,        # use all CPU cores
    document_retriever = DocumentRetriever(
        files = ['einstein-albert.pdf'],
        embed_model = 'SmartComponents/bge-micro-v2',  # https://huggingface.co/spaces/mteb/leaderboard
    )
)

response = engine.chat.completions.create(
    model = 'mistral-7b-instruct',
    messages = [
        {"role": "user", "content": "how was einstein's work in the laboratory"}
    ],
    stream = True
)

for token in response:
    print(token.choices[0].delta.content or "", end="", flush=True)
Note: The embedding model will be downloaded on first use, but it is really small and lightweight.
G4L provides a DocumentRetriever class that allows you to retrieve relevant information from documents based on a query. Here's an example of how to use it:
from g4l.local import DocumentRetriever

engine = DocumentRetriever(
    files=['einstein-albert.txt'],
    embed_model='SmartComponents/bge-micro-v2',  # https://huggingface.co/spaces/mteb/leaderboard
    verbose=True,
)

retrieval_data = engine.retrieve('what inventions did he do')

for node_with_score in retrieval_data:
    node = node_with_score.node
    score = node_with_score.score
    text = node.text
    metadata = node.metadata
    page_label = metadata['page_label']
    file_name = metadata['file_name']

    print(f"Text: {text}")
    print(f"Score: {score}")
    print(f"Page Label: {page_label}")
    print(f"File Name: {file_name}")
    print("---")
You can also get a ready-to-go prompt for the language model using the retrieve_for_llm method:
retrieval_data = engine.retrieve_for_llm('what inventions did he do')
print(retrieval_data)
The prompt template used by retrieve_for_llm is as follows:
prompt = (f'Context information is below.\n'
+ '---------------------\n'
+ f'{context_batches}\n'
+ '---------------------\n'
+ 'Given the context information and not prior knowledge, answer the query.\n'
+ f'Query: {query_str}\n'
+ 'Answer: ')
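Putting the pieces together, the prompt returned by retrieve_for_llm can be sent to a local model through the chat API shown earlier. A minimal sketch, reusing the model and files from the examples above:

from g4l.local import LocalEngine, DocumentRetriever

retriever = DocumentRetriever(
    files=['einstein-albert.pdf'],
    embed_model='SmartComponents/bge-micro-v2',
)
engine = LocalEngine(gpu_layers=-1, cores=0)

# Build a context-augmented prompt and send it as a normal user message.
prompt = retriever.retrieve_for_llm('what inventions did he do')
response = engine.chat.completions.create(
    model='mistral-7b-instruct',
    messages=[{"role": "user", "content": prompt}],
    stream=True
)

for token in response:
    print(token.choices[0].delta.content or "", end="", flush=True)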
G4L provides several configuration options to customize the behavior of the LocalEngine. Here are some of the available options:

- gpu_layers: The number of layers to offload to the GPU. Use -1 to offload all layers.
- cores: The number of CPU cores to use. Use 0 to use all available cores.
- use_mmap: Whether to use memory mapping for faster model loading. Default is True.
- use_mlock: Whether to lock the model in memory to prevent swapping. Default is False.
- offload_kqv: Whether to offload the key, query, and value tensors to the GPU. Default is True.
- context_window: The maximum context window size. Default is 4900.

You can pass these options when creating an instance of LocalEngine:
engine = LocalEngine(
    gpu_layers = -1,
    cores = 0,
    use_mmap = True,
    use_mlock = False,
    offload_kqv = True,
    context_window = 4900
)
Benchmark run on a 2022 MacBook Air M2, 8GB RAM.
PC: MacBook Air M2
CPU/GPU: M2 chip
Cores: All (8)
GPU Layers: All
GPU Offload: 100%
On battery (no power):
Model: mistral-7b-instruct-v2
Number of iterations: 5
Average loading time: 1.85s
Average total tokens: 48.20
Average total time: 5.34s
Average speed: 9.02 t/s
Plugged in (with power):
Model: mistral-7b-instruct-v2
Number of iterations: 5
Average loading time: 1.88s
Average total tokens: 317
Average total time: 17.7s
Average speed: 17.9 t/s
I also tried other frontends built on llama.cpp, like LM Studio, which still had great performance, but in my case reached a speed of ~7.83 tokens/s, in contrast to 9.02 t/s with the native llama.cpp Python bindings.
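If you want to reproduce a rough tokens-per-second measurement yourself, here is a simple sketch using only the API shown above; the prompt, the single run, and counting streamed chunks as tokens are all simplifications, not the exact benchmark script used for the numbers above:

import time
from g4l.local import LocalEngine

engine = LocalEngine(gpu_layers = -1, cores = 0)

start = time.time()
chunks = 0
response = engine.chat.completions.create(
    model = 'mistral-7b-instruct-v2',
    messages = [{"role": "user", "content": "Write a short paragraph about Albert Einstein."}],
    stream = True
)

for token in response:
    if token.choices[0].delta.content:
        chunks += 1  # streamed chunks are used here as a rough proxy for tokens

elapsed = time.time() - start
print(f"{chunks} chunks in {elapsed:.2f}s -> {chunks / elapsed:.2f} t/s")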