🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
MIT License
Published by borzunov about 1 year ago
🦅 Falcon support. Petals now supports all models based on Falcon, including Falcon 180B released today. We improved the 🤗 Transformers FalconModel implementation to be up to 40% faster on recent GPUs. Our chatbot app runs Falcon 180B-Chat at ~2 tokens/sec.
Falcon-40B is licensed under Apache 2.0, so you can load it by specifying tiiuae/falcon-40b or tiiuae/falcon-40b-instruct as the model name. Falcon-180B is licensed under a custom license, and it is not yet clear if we can provide a Python interface for inference and fine-tuning of this model. For now, it is only available in the chatbot app, while we wait for further clarifications from TII on this issue.
🍏 Native macOS support. You can run Petals clients and servers natively on macOS: just install Homebrew and run these commands:
brew install python
python3 -m pip install git+https://github.com/bigscience-workshop/petals
python3 -m petals.cli.run_server petals-team/StableBeluga2
If your computer has an Apple M1/M2 chip, the Petals server will use the integrated GPU automatically. We recommend hosting only Llama-based models, since other supported architectures do not work efficiently on M1/M2 chips yet. We also recommend using Python 3.10+ on macOS (Homebrew installs it automatically).
🔌 Serving custom models. Custom models now automatically show up at https://health.petals.dev as "not officially supported" models. As a reminder, you are not limited to the models listed at https://health.petals.dev: you can run a server hosting any model based on the BLOOM, Llama, or Falcon architecture (as long as the model license allows it), or even add support for a new architecture yourself. This release also improves Petals compatibility with some popular Llama-based models (e.g., models from NousResearch).
🐞 Bug fixes. This release also fixes inference of prefix-tuned models, which was broken in Petals 2.1.0.
Fix .generate(input_ids=...) by @borzunov in https://github.com/bigscience-workshop/petals/pull/485
Full Changelog: https://github.com/bigscience-workshop/petals/compare/v2.1.0...v2.2.0
Published by borzunov about 1 year ago
🔌 Compatibility with 🤗 Transformers generation utils. Petals models now directly use the 🤗 Transformers .generate() implementation instead of custom generation code. This means that you can use the variety of generation methods and constraints implemented in 🤗 Transformers (e.g., repetition_penalty, beam search, etc.) and expect an exact match between Petals and a model running locally.
Most common methods are compatible with reusing inference sessions, so you can run .generate() multiple times without reprocessing the dialogue history from scratch:
with model.inference_session(max_length=100):
    outputs1 = model.generate(user_prompt1, repetition_penalty=1.2)
    outputs2 = model.generate(user_prompt2, repetition_penalty=1.2)
⚡ Faster loading of Stable Beluga 2. We repacked Stable Beluga 2, the most popular model at the moment, to increase its loading speed and minimize RAM and disk space requirements. The repacked version can be loaded from the petals-team/StableBeluga2 repository and is fully compatible with clients and servers using the standard repository (stabilityai/StableBeluga2).
Now, clients need to download only 1.05 GB of data to run Stable Beluga 2 (instead of the ~20 GB needed before) and require only 4 GB of RAM (instead of the ~20 GB required before). Servers need to download and store 2x less data and load the model from disk significantly faster. If you're switching from the old repository, don't forget to remove the old cache in the ~/.cache/petals/models--stabilityai--StableBeluga2 directory to save disk space.
⏱️ More responsive inference. In older versions, servers could become unresponsive for a few seconds while processing large prefixes (thousands of tokens) during inference. This release allows small inference requests (a few tokens) to be performed in the middle of processing a large request, avoiding freezes during token-by-token inference caused by someone else processing a large prefix.
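As an illustration of the idea (a toy scheduler, not Petals' actual server code; the names and chunk size are made up):

```python
from collections import deque

CHUNK = 256  # toy chunk size, in tokens

def process(requests):
    """Toy scheduler: large prefixes are split into chunks so that
    small requests can be served between chunks instead of waiting."""
    queue = deque(requests)  # (name, num_tokens) pairs
    order = []  # which request each processed chunk belongs to
    while queue:
        name, tokens = queue.popleft()
        step = min(tokens, CHUNK)
        order.append(name)
        if tokens - step > 0:
            queue.append((name, tokens - step))  # re-queue the remainder
    return order

# A 1000-token prefix no longer blocks two 3-token requests:
print(process([("large", 1000), ("small1", 3), ("small2", 3)]))
# -> ['large', 'small1', 'small2', 'large', 'large', 'large']
```

The small requests run after the first chunk of the large prefix rather than after the whole thing.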
🛠️ Minor improvements. This release adds support for loading weights in the safetensors format on servers and adds the blocked_servers client option for avoiding a given set of servers:
from petals import AutoDistributedModelForCausalLM
blocked_servers = ["12D3KooWA6g...", "12D3KooWGyD..."] # Full peer IDs from https://health.petals.dev
model = AutoDistributedModelForCausalLM.from_pretrained(model_name, blocked_servers=blocked_servers)
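The effect of this option is easy to picture as a plain filter over the set of known peers (an illustration only, not Petals internals; the peer names below are made up):

```python
def usable_servers(known_peers, blocked_servers):
    """Keep only the peers the client is allowed to route requests through."""
    blocked = set(blocked_servers)
    return [peer for peer in known_peers if peer not in blocked]

# Hypothetical peer IDs:
peers = ["peer_a", "peer_b", "peer_c"]
print(usable_servers(peers, blocked_servers=["peer_b"]))  # -> ['peer_a', 'peer_c']
```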
🐞 Bug fixes. This release also includes a variety of bug fixes that speed up the chatbot app and fine-tuning, better bypass recently disconnected servers, improve the rebalancing algorithm and the usability of benchmarks, and fix throughput measurements and installation on ARM CPUs.
We also fixed Petals compatibility with the latest releases of the 🤗 Transformers, Accelerate, and PEFT libraries.
📝 Default inference sessions. If you run .generate() or forward passes inside an .inference_session() context, they now use the opened session by default. These snippets are now equivalent:
# Using the default session
with model.inference_session(max_length=100):
    output_ids = model.generate(input_ids, max_new_tokens=3)

# Explicitly specifying a session
with model.inference_session(max_length=100) as sess:
    output_ids = model.generate(input_ids, max_new_tokens=3, session=sess)
Earlier, the first snippet created a new session, which confused many people and led to bugs.
➡️ Renaming. We renamed SequenceManagerConfig to petals.ClientConfig and petals.dht_utils to petals.utils.dht. The old names now raise a DeprecationWarning and will be removed in Petals 2.2.0+.
Add blocked_servers argument by @borzunov in https://github.com/bigscience-workshop/petals/pull/462
Full Changelog: https://github.com/bigscience-workshop/petals/compare/v2.0.1...v2.1.0
Published by borzunov over 1 year ago
🗣️ Inference of longer sequences. We extended the max sequence length to 8192 tokens for Llama 2 and added chunking to avoid server out-of-memory errors (which happened when processing long prefixes). This became possible thanks to multi-query attention used in Llama 2, which uses 8x less GPU memory for attention caches. Now you can process longer sequences using a Petals client and have dialogues of up to 8192 tokens at https://chat.petals.dev.
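The 8x figure is easy to verify with back-of-the-envelope arithmetic. A sketch assuming approximate Llama 2 70B shapes (80 layers, head dim 128, 16-bit values; the saving comes from keeping 8 key/value heads instead of 64):

```python
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=64, head_dim=128, dtype_bytes=2):
    """Approximate attention-cache size for one sequence
    (the factor 2 counts the two cached tensors: keys and values)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

full = kv_cache_bytes(8192, n_kv_heads=64)  # classic multi-head attention
mqa = kv_cache_bytes(8192, n_kv_heads=8)    # multi-query/grouped attention
print(full // mqa)  # -> 8, i.e. 8x less GPU memory for caches
```

With 8x smaller caches, the same GPU can hold caches for 8192-token sequences where it previously ran out of memory much earlier.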
🐍 Python 3.11 support. Petals clients and servers now work on Python 3.11.
🐞 Bug fixes. We fixed the server's --token argument (used to provide your 🤗 Model Hub access token for loading Llama 2), possible deadlocks in the server, issues with fine-tuning speed (servers available via relays are now deprioritized), and other minor load balancing issues.
🪟 Running server on Windows. We made a better guide for running a server in WSL (Windows Subsystem for Linux).
📦 Running server on Runpod. We added a guide for using a Petals template on Runpod.
Full Changelog: https://github.com/bigscience-workshop/petals/compare/v2.0.0.post1...v2.0.1
Published by borzunov over 1 year ago
We're excited to announce Petals 2.0.0, the largest Petals release to date!
🦙 Support for LLaMA and LLaMA 2. We've added support for inference and fine-tuning of any model based on the 🤗 Transformers LlamaModel, including all variants of LLaMA and LLaMA 2, some of the strongest open-source models available today. The public swarm hosts the largest variants of these models, LLaMA-65B and LLaMA 2 (70B and 70B-Chat), providing inference at speeds of up to 5-6 tokens/sec.
🗜️ 4-bit quantization. We've integrated efficient 4-bit (NF4) quantization from the recent "QLoRA: Efficient Finetuning of Quantized LLMs" paper. This allows using ~40% less GPU memory (and thus ~40% fewer servers) to fit all model blocks and gives a ~2x speedup for token-by-token inference, compared to the 8-bit quantization we previously used, with relatively small quality loss.
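As a rough sanity check of the savings (a simplification that counts only block weights and ignores tensors kept in 16-bit; NF4 stores roughly one fp32 scale per 64-weight block):

```python
def gb_per_block(n_params, bits_per_weight, quant_overhead=0.0):
    """Approximate GPU memory for one transformer block's weights.
    quant_overhead is extra bytes per weight for quantization scales."""
    bytes_per_weight = bits_per_weight / 8 + quant_overhead
    return n_params * bytes_per_weight / 2**30

# Hypothetical 1B-parameter block:
int8 = gb_per_block(1_000_000_000, 8)
nf4 = gb_per_block(1_000_000_000, 4, quant_overhead=4 / 64)
print(f"NF4 uses {1 - nf4 / int8:.0%} less memory than int8")  # -> 44%
```

Counting block weights alone gives ~44%, in the same ballpark as the ~40% quoted above once non-quantized tensors are included.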
🔌 Pre-loading LoRA adapters, such as Guanaco. We've also added the ability to pre-load LoRA adapters compatible with the 🤗 PEFT library, which may add extra functionality to the model you host. These adapters are activated at a client's request: specifically, the client may specify .from_pretrained(..., active_adapter="adapter_repo") when loading a distributed model. One example is Guanaco, an instruction-finetuned adapter for LLaMA that turns it into a helpful chatbot that carefully follows the user's instructions. You can try LLaMA with this adapter in our chatbot app.
⚡️ Direct server-to-server communication. Previously, servers didn't send tensors to each other directly due to specifics of our fault-tolerant inference algorithm. This update changes that, saving a round trip between servers and the client and leading to substantial speedups for clients located far away from the servers they're using.
🛣️ Shortest-path routing for inference. Previously, a client didn't properly choose geographically close and fast servers, so it could end up with a slow inference chain, especially if the swarm had many servers located far away from it. Now, the client builds a full graph of client-server and server-server latencies, as well as server inference speeds, to find the fastest chain of servers for inference among all possible ones. It also considers the amount of GPU memory left for attention caches, so that it doesn't choose a nearby server that doesn't actually have memory for the request.
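Conceptually, this is a shortest-path search over a small latency graph. A toy sketch with made-up latencies (the real client also weighs inference speed and free cache, which this illustration omits):

```python
import heapq

def fastest_chain(latency, source, target):
    """Dijkstra over a latency graph: returns (total_latency_ms, path)."""
    best = {source: (0.0, [source])}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            return best[node]
        if cost > best[node][0]:
            continue  # stale heap entry
        for nxt, ms in latency.get(node, {}).items():
            new_cost = cost + ms
            if nxt not in best or new_cost < best[nxt][0]:
                best[nxt] = (new_cost, best[node][1] + [nxt])
                heapq.heappush(heap, (new_cost, nxt))
    return None

# Made-up latencies (ms): client -> servers -> back to the client.
latency = {
    "client": {"serverA": 20, "serverB": 150},
    "serverA": {"serverC": 10},
    "serverB": {"serverC": 5},
    "serverC": {"client_out": 25},
}
print(fastest_chain(latency, "client", "client_out"))
# -> (55.0, ['client', 'serverA', 'serverC', 'client_out'])
```

The path through serverA wins (55 ms total) even though serverB is closer to serverC, because the client-to-serverB hop dominates.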
📂 Loading models directly from the 🤗 Model Hub, and Auto classes. Starting from Petals 2.0.0, models do not need to be converted to a special format to be hosted by Petals. Instead, both clients and servers can load models directly from the 🤗 Model Hub, fetching only the shards they need to host their part of the model. Furthermore, you can write code supporting multiple architectures at once using Auto classes, such as AutoDistributedConfig.from_pretrained(...) and AutoDistributedModelForCausalLM.from_pretrained(...). The guide for adding new model architectures to Petals also became much simpler, since the Petals code is now generalized to multiple architectures and the model conversion step is gone.
🏋️ Fine-tuning examples. We've switched most examples to LLaMA-65B and fixed previously reported bugs. In particular, the "Getting started" notebook now includes a simple example of deep prompt tuning on a dummy task, and the sequence classification notebook uses LLaMA-65B and improved hyperparameters for stable training.
🖥️ Upgraded swarm monitor. The swarm monitor now shows much more info about each server, including pre-loaded LoRA adapters, detailed performance info, latencies to potential next servers, and so on. All this info is published to the DHT, so you don't need to ping each server to fetch it. We've also added a "Contributor" column, so that contributors hosting 10+ blocks get a chance to publish their name or advertise their company or social media account in exchange for hosting a server for Petals. The name (or link) shown there can be specified using the server's --public_name argument.
Full Changelog: https://github.com/bigscience-workshop/petals/compare/v1.1.5...v2.0.0.post1
Published by borzunov over 1 year ago
⏱ Faster fine-tuning. Fine-tuning uses ~2x less traffic (tensors are now sent in bfloat16 by default) and builds routes using a heuristic that maximizes the swarm's throughput. This should address timeout errors that could happen during fine-tuning.
🐞 Bug fixes. On servers, this release fixes out-of-memory errors and freezing network throughput evals. On clients, it fixes issues with slicing RemoteSequential and with silently ignoring unsupported .generate() kwargs. Also, this release fixes warnings originating from hivemind.p2p and hivemind.compression.
🛣️ Updated throughput formula. We have updated the throughput formula to reflect that servers hosting many blocks still run forward and backward passes through only one block at a time. Don't be surprised if your throughput became smaller than in 1.1.4: these numbers are not directly comparable!
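A toy model of the change (our illustration, not the exact formula used by Petals): since a request passes through a server's blocks one at a time, per-block compute throughput should be divided by the number of hosted blocks rather than reported as-is.

```python
def server_throughput(network_rps, compute_rps_per_block, n_blocks):
    """Toy throughput estimate: a request traverses all n_blocks
    sequentially, so compute throughput is divided across them."""
    compute_rps = compute_rps_per_block / n_blocks
    return min(network_rps, compute_rps)

# Hosting 10 blocks does not mean 10x the served work; the pipeline
# still runs one block at a time on the same GPU.
print(server_throughput(network_rps=500, compute_rps_per_block=800, n_blocks=10))
# -> 80.0
```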
🖼️ Improved lower-level interfaces. We have refactored lower-level interfaces, such as RemoteSequential and RemoteSequenceManager, to be more reliable (e.g., when doing retries) and much easier to use. Some rarely used low-level functions in petals.dht_utils were removed.
Full Changelog: https://github.com/bigscience-workshop/petals/compare/v1.1.4...v1.1.5
Published by borzunov over 1 year ago
🗜️ 8-bit servers support more GPUs. A bitsandbytes update brings 8-bit support to older generations of NVIDIA GPUs, as well as the GeForce 16 series (e.g., 1660 Ti). Please try Petals 1.1.4 if you previously had errors like "Your GPU does not support Int8 Matmul!" and "cublasLt ran into an error!" on some GPUs. This version also loads weights in 8-bit by default when tensor parallelism is enabled.
⏱️ Servers start faster. Servers take ~2x less time to load block weights from the disk cache into GPU memory. The next release will also reduce the time it takes to download the weights from the Internet, since they will be downloaded in 8-bit instead of 16-bit.
🧵 Multi-threaded clients work faster. Earlier, multi-threaded clients performed only one network request at a time due to a bug in hivemind, which was recently fixed. This significantly improves the speed of the chat.petals.ml app when multiple users chat concurrently.
⏱️ Clients start faster. Clients take ~10% less time to load the model, since they build a route through remote servers in parallel with loading the local part of the model (input/output embeddings).
🌳 Relaxed dependency requirements. We relaxed version requirements for transformers and other Hugging Face libraries, so you can update them independently of Petals. In particular, Petals works with PyTorch 2.0 and the latest transformers release. Also, we fixed a bug where the client loaded a model in float32 by default (instead of bfloat16/float16) with some transformers releases. Please try Petals 1.1.4 if you previously had out-of-memory errors when running the client.
Full Changelog: https://github.com/bigscience-workshop/petals/compare/v1.1.3...v1.1.4
Published by borzunov over 1 year ago
🐞 Bug fixes. We have fixed a variety of minor issues related to timeout errors in the client, fine-tuning, and tensor parallelism.
⚙️ New options in the client. Added the allowed_servers and max_retries options:
- allowed_servers allows restricting the set of servers a client can use for its requests (e.g., to only use the servers trusted to process your data).
- max_retries allows limiting the number of retries a client performs before raising an exception (previously, clients continued retrying indefinitely).
📖 FAQ. We have released the FAQ page that covers common questions about running clients and servers, as well as troubleshooting common problems.
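The max_retries behavior can be sketched as a simple retry loop (an illustration with a hypothetical request function, not the client's actual code):

```python
def call_with_retries(request_fn, max_retries=None):
    """Retry request_fn until it succeeds; raise after max_retries failures.
    max_retries=None reproduces the old behavior: retry indefinitely."""
    attempt = 0
    while True:
        try:
            return request_fn()
        except ConnectionError:
            attempt += 1
            if max_retries is not None and attempt > max_retries:
                raise

attempts = {"n": 0}
def flaky():  # hypothetical request that fails twice, then succeeds
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("server unreachable")
    return "ok"

print(call_with_retries(flaky, max_retries=5))  # -> ok
```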
Add allowed_servers, max_retries options to the client, improve logs by @borzunov in https://github.com/bigscience-workshop/petals/pull/235
Full Changelog: https://github.com/bigscience-workshop/petals/compare/v1.1.2...v1.1.3
Published by borzunov over 1 year ago
🏃‍♂️ Faster inference. We've shipped server-side changes improving the inference speed by up to 30%. This is a result of profiling the server's inference performance (see details in #224 and #225). The public swarm will become faster once everyone upgrades to the latest Petals version and restarts their servers.
🐞 Prompt-tuning bug fixes. We've shipped bug fixes for prompt-tuning notebooks (see details in #231).
🧑‍🏫 New pretrained model. We've added a new model, BLOOMZ-176B by BigScience, to the public swarm. You can run it (or host its blocks) by specifying bigscience/bloomz-petals as the model name.
Full Changelog: https://github.com/bigscience-workshop/petals/compare/v1.1.1...v1.1.2
Published by borzunov almost 2 years ago
⛰️ Stability. This release improves the stability and performance of the Petals DHT in the presence of many servers joined via NAT traversal & relays. Now, the DHT prefers to store keys on directly reachable peers, so that all peers can access them faster and with fewer failures. Also, this release contains a minor fix to the block reassignment algorithm that reduces the excess reassignments which previously led to swarm downtime.
🔀 Basic routing. We have improved the routing algorithm for inference, so that clients weakly prefer servers holding more blocks, to minimize latency and increase inference speed. This is only a basic algorithm, and we are working on smarter routing (taking into account latency, throughput, etc.) for both inference and fine-tuning in future releases. This release also makes the servers share more technical information about themselves (their version, free cache, etc.), so it can be used by smarter routing algorithms in the future and shown at http://health.petals.ml for debugging purposes.
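The preference for servers holding more blocks can be illustrated with a toy greedy chain builder (made-up data; the real algorithm is probabilistic and weighs more factors): covering the block range with fewer servers means fewer network hops per request.

```python
def pick_chain(servers, n_blocks):
    """servers: {name: (first_block, last_block)}, inclusive ranges.
    Greedily cover blocks 0..n_blocks-1, always picking the server
    whose range extends furthest past the current block."""
    chain, block = [], 0
    while block < n_blocks:
        candidates = {
            name: last
            for name, (first, last) in servers.items()
            if first <= block <= last
        }
        if not candidates:
            raise RuntimeError(f"no server holds block {block}")
        name = max(candidates, key=candidates.get)  # reaches furthest
        chain.append(name)
        block = candidates[name] + 1
    return chain

# Made-up servers hosting ranges of a 32-block model:
servers = {"small1": (0, 7), "big": (0, 23), "small2": (8, 15), "tail": (24, 31)}
print(pick_chain(servers, 32))  # -> ['big', 'tail'], 2 hops instead of 4
```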
Full Changelog: https://github.com/bigscience-workshop/petals/compare/v1.1.0...v1.1.1
Published by borzunov almost 2 years ago
🌐 NAT traversal & relays. Now, servers can join the swarm automatically even if your machine is located behind a NAT or a firewall, or has a dynamic IP address. You don't have to manually set up port forwarding or provide any arguments to make it work.
Please upgrade the Petals package and restart all your servers & clients to use this feature or access servers joined via relays:
pip install --upgrade petals
How does it work? If a server learns that it can't accept incoming connections due to a NAT or firewall, it opens a long-term outgoing connection to one of the relay nodes; the relay node then forwards all requests to this server through that connection. In turn, any server with a public IP may serve as a relay node if necessary. We use libp2p circuit relays under the hood: https://docs.libp2p.io/concepts/nat/circuit-relay/
💬 Chatbot app. We've released a chatbot app working over Petals: http://chat.petals.ml (source code).
Disclaimer: This chatbot uses the regular BLOOM, which is not fine-tuned for question answering. Please do not expect it to behave like ChatGPT.
How does it work? Under the hood, this web app uses our HTTP endpoint for running inference using the public Petals swarm. You can use this endpoint for your own projects, or set up another endpoint yourself (no GPU needed). See API docs here: https://github.com/borzunov/chat.petals.ml#http-api-methods
🏃‍♂️ Faster CPU-only clients. If your CPU supports the AVX512 instruction set, a CPU-only client now runs almost as fast as a GPU-enabled one. This way, you can rent cheap CPU instances to run the client or an HTTP endpoint, like the one we use for the chatbot app.
🖥 Swarm health monitor. We've updated the swarm health monitor: http://health.petals.ml (source code). It provides an overview of the servers that joined the public swarm and reports any connection issues.
Full Changelog: https://github.com/bigscience-workshop/petals/compare/v1.0.0...v1.1.0
Published by borzunov almost 2 years ago
This release contains the core functionality of the Petals platform described in our paper.
model argument as required by @borzunov in https://github.com/bigscience-workshop/petals/pull/81
low_cpu_mem_usage=True (as in Colab) by @borzunov in https://github.com/bigscience-workshop/petals/pull/103
petals as a module by @borzunov in https://github.com/bigscience-workshop/petals/pull/159
Full Changelog: https://github.com/bigscience-workshop/petals/commits/v1.0.0