This project is based on llama.cpp and builds only the RPC server, together with the auxiliary utilities that operate in RPC client mode, which are required for distributed inference of Large Language Models (LLMs) and embedding models converted to the GGUF format.
The general architecture of an application using the RPC server looks as follows:
Instead of `llama-server`, you can use `llama-cli` or `llama-embedding`, which are included in the standard container package.
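For reference, below is a minimal sketch of how one of these client utilities could be pointed at remote backends; the `--rpc` flag comes from upstream llama.cpp, while the hostnames, port, and model path are placeholders:

```bash
# Illustrative only: hostnames, port and model path are placeholders.
# --rpc takes a comma-separated list of host:port pairs of running RPC servers.
llama-cli \
  --model /app/models/TinyLlama-1.1B-q4_0.gguf \
  --rpc backend-cpu:50052,backend-cuda:50052 \
  --n-gpu-layers 99 \
  --prompt "Building a website can be done in 10 simple steps:"
```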
Docker images are built with support for the following architectures:
Unfortunately, CUDA builds for arm64 fail due to an error, so they are temporarily disabled.
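For example, a specific platform can be requested explicitly when pulling the images (image names as used in the examples below; the CUDA image is currently amd64-only, per the note above):

```bash
# Pull the CPU-only image for arm64 and the CUDA image for amd64 (examples).
docker pull --platform linux/arm64 evilfreelancer/llama.cpp-rpc:latest
docker pull --platform linux/amd64 evilfreelancer/llama.cpp-rpc:latest-cuda
```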
Name | Default | Description |
---|---|---|
`APP_MODE` | backend | Container operation mode; available options: server, backend, and none |
`APP_BIND` | 0.0.0.0 | Interface to bind to |
`APP_PORT` | 8080 for server, 50052 for backend | Port number the server listens on |
`APP_MEM` | 1024 | Amount of RAM (in MiB) available to the client; in CUDA mode, this is the amount of GPU memory |
`APP_RPC_BACKENDS` | backend-cuda:50052,backend-cpu:50052 | Comma-separated addresses of backends that the container will try to connect to in server mode |
`APP_MODEL` | /app/models/TinyLlama-1.1B-q4_0.gguf | Path to the model weights inside the container |
`APP_REPEAT_PENALTY` | 1.0 | Repeat penalty |
`APP_GPU_LAYERS` | 99 | Number of layers offloaded to the backend |
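As an illustration, the same variables can also be passed directly to `docker run`; the values below are examples based on the table:

```bash
# Start a standalone RPC backend, overriding a few of the defaults listed above.
docker run --rm \
  -e APP_MODE=backend \
  -e APP_BIND=0.0.0.0 \
  -e APP_PORT=50052 \
  -e APP_MEM=2048 \
  -p 50052:50052 \
  evilfreelancer/llama.cpp-rpc:latest
```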
In the example below, `llama-server` (the `main` container) is launched and initializes the model TinyLlama-1.1B-q4_0.gguf, which must be downloaded beforehand into the `./models` directory located at the same level as `docker-compose.yml`. The `./models` directory is mounted into the `main` container and is available at the path `/app/models`.
```yaml
version: "3.9"

services:
  main:
    image: evilfreelancer/llama.cpp-rpc:latest
    restart: unless-stopped
    volumes:
      - ./models:/app/models
    environment:
      # Operation mode (RPC client in API server format)
      APP_MODE: server
      # Path to the model weights, preloaded inside the container
      APP_MODEL: /app/models/TinyLlama-1.1B-q4_0.gguf
      # Addresses of the RPC servers the client will interact with
      APP_RPC_BACKENDS: backend-cuda:50052,backend-cpu:50052
    ports:
      - "127.0.0.1:8080:8080"

  backend-cpu:
    image: evilfreelancer/llama.cpp-rpc:latest
    restart: unless-stopped
    environment:
      # Operation mode (RPC server)
      APP_MODE: backend
      # Amount of system RAM available to the RPC server (in Megabytes)
      APP_MEM: 2048

  backend-cuda:
    image: evilfreelancer/llama.cpp-rpc:latest-cuda
    restart: unless-stopped
    environment:
      # Operation mode (RPC server)
      APP_MODE: backend
      # Amount of GPU memory available to the RPC server (in Megabytes)
      APP_MEM: 1024
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [ gpu ]
```
A complete example is available in `docker-compose.dist.yml`.
As a result, we obtain the following diagram:
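To bring the stack up (assuming the GGUF file is already in `./models`), something like the following should suffice:

```bash
# Start all services in the background and follow the API server's logs.
docker compose up -d
docker compose logs -f main
```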
Once launched, you can make HTTP requests like this:
```bash
curl \
  --request POST \
  --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "Building a website can be done in 10 simple steps:"}'
```
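The `/completion` endpoint responds with JSON; if `jq` is installed, the generated text can be extracted from the `content` field (field name per llama.cpp's HTTP server), for example:

```bash
# Same request as above, printing only the generated text (requires jq).
curl --silent \
  --request POST \
  --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "Building a website can be done in 10 simple steps:"}' \
  | jq -r '.content'
```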
Building containers in CPU-only mode:
```bash
docker build ./llama.cpp/
```
Building the container for CUDA:
```bash
docker build ./llama.cpp/ --file ./llama.cpp/Dockerfile.cuda
```
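It can be convenient to tag the resulting images so they can be referenced later, for example from a `docker-compose.yml`; the tag names below are arbitrary:

```bash
# Build and tag both variants locally (tag names are just examples).
docker build ./llama.cpp/ --tag llama.cpp-rpc:cpu
docker build ./llama.cpp/ --file ./llama.cpp/Dockerfile.cuda --tag llama.cpp-rpc:cuda
```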
Using the build argument `LLAMACPP_VERSION`, you can specify the tag, branch name, or commit hash of llama.cpp to build the container from. By default, the `master` branch is used.
```bash
# Build the container from the tag https://github.com/ggerganov/llama.cpp/releases/tag/b3700
docker build ./llama.cpp/ --build-arg LLAMACPP_VERSION=b3700

# Build the container from the master branch
docker build ./llama.cpp/ --build-arg LLAMACPP_VERSION=master
# or simply
docker build ./llama.cpp/
```
An example of a `docker-compose.yml` that builds the images from an explicitly specified tag:
```yaml
version: "3.9"

services:
  main:
    restart: "unless-stopped"
    build:
      context: ./llama.cpp
      args:
        - LLAMACPP_VERSION=b3700
    volumes:
      - ./models:/app/models
    environment:
      APP_MODE: none
    ports:
      - "8080:8080"

  backend:
    restart: "unless-stopped"
    build:
      context: ./llama.cpp
      args:
        - LLAMACPP_VERSION=b3700
    environment:
      APP_MODE: backend
    ports:
      - "50052:50052"
```
Pavel Rykov. (2024). llama.cpp RPC-server in Docker. GitHub. https://github.com/EvilFreelancer/docker-llama.cpp-rpc
```bibtex
@misc{pavelrykov2024llamacpprpc,
  author = {Pavel Rykov},
  title  = {llama.cpp RPC-server in Docker},
  year   = {2024},
  url    = {https://github.com/EvilFreelancer/docker-llama.cpp-rpc}
}
```