Finetune Llama 3.2, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
Published by danielhanchen 24 days ago
There are some issues with Qwen 2.5 models which Unsloth has fixed!
Qwen 2.5 base models (0.5b all the way up to 72b): the EOS token should be <|endoftext|>, not <|im_end|>. The base models' <|im_end|> token is actually untrained, so using it will cause NaN gradients. You should re-pull the tokenizer from source, or you can download our fixed base models from https://huggingface.co/unsloth if that helps.
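As a quick sanity check, you can print the EOS token of a freshly downloaded base tokenizer (a minimal sketch; the repo name below is just an example):

from transformers import AutoTokenizer

# Example repo name - any Qwen 2.5 base checkpoint applies
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-7B")
print(tokenizer.eos_token) # Base models should show <|endoftext|>, not <|im_end|>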
Qwen 2.5 | 4bit | Instruct | 4bit Instruct |
---|---|---|---|
0.5b | 4bit 0.5b | Instruct 0.5b | 4bit Instruct 0.5b |
1.5b | 4bit 1.5b | Instruct 1.5b | 4bit Instruct 1.5b |
3b | 4bit 3b | Instruct 3b | 4bit Instruct 3b |
7b | 4bit 7b | Instruct 7b | 4bit Instruct 7b |
14b | 4bit 14b | Instruct 14b | 4bit Instruct 14b |
32b | 4bit 32b | Instruct 32b | 4bit Instruct 32b |
72b | 4bit 72b | Instruct 72b | 4bit Instruct 72b |
Full Changelog: https://github.com/unslothai/unsloth/compare/August-2024...September-2024
Try it out here: https://colab.research.google.com/drive/1lN6hPQveB_mHSnTOYifygFcrO8C1bxq4?usp=sharing
Full Changelog: https://github.com/unslothai/unsloth/compare/July-Mistral-2024...August-2024
Published by danielhanchen 3 months ago
Excited to announce Unsloth makes finetuning Llama 3.1 2.1x faster and uses 60% less VRAM! Read up on our release here: https://unsloth.ai/blog/llama3-1
We uploaded a Google Colab notebook to finetune Llama 3.1 (8B) on a free Tesla T4: Llama 3.1 (8B) Notebook. We also have a new UI on Google Colab for chatting with your Llama 3.1 Instruct models which uses our own 2x faster inference engine.
We created a new chat UI using Gradio where users can upload and chat with their Llama 3.1 Instruct models online for free on Google Colab.
We uploaded 4bit bitsandbytes quants here: https://huggingface.co/unsloth
To finetune Llama 3.1, please update Unsloth:
pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
Published by danielhanchen 3 months ago
See https://unsloth.ai/blog/mistral-nemo for more details. 4bit pre-quantized weights are at https://huggingface.co/unsloth.
Finetuning is 2x faster and uses 60% less VRAM; our Colab finetuning notebook is here, and our Kaggle notebook is here.
To use it, create and customize your chat template with a dataset, and Unsloth will automatically export the finetune to Ollama with automatic Modelfile creation. We also created a 'Step-by-Step Tutorial on How to Finetune Llama-3 and Deploy to Ollama'. Check out our Ollama Llama-3 Alpaca and CSV/Excel Ollama Guide notebooks.
Unlike regular chat templates that use 3 columns, Ollama simplifies the process with just 2 columns: instruction and output. And with Ollama, you can save, run, and deploy your finetuned models locally on your own device.
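As a rough sketch of the export flow (a hedged example; the Modelfile location is an assumption, and the exact steps are in the notebooks above):

# Export the finetune to GGUF - Unsloth also auto-creates an Ollama Modelfile
# in the output directory (location assumed here)
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")

# Then register and run it locally with Ollama:
#   ollama create unsloth_model -f ./model/Modelfile
#   ollama run unsloth_model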
We now support training only on the output tokens and not the inputs, which can increase accuracy. Try it with:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    ...
    args = TrainingArguments(
        ...
    ),
)

from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(trainer)
We now allow you to finetune Gemma 2, Mistral, Mistral NeMo, Qwen2 and more models with “unlimited” context lengths via RoPE linear scaling in Unsloth. Coupled with our 4x longer context support, Unsloth can handle extremely long contexts!
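For example, a minimal sketch of requesting a longer context at load time (the model name and length are illustrative; Unsloth applies RoPE linear scaling automatically when max_seq_length exceeds the model's native context):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit", # Example model
    max_seq_length = 32768, # 4x Llama 3's native 8192 context
    dtype = None,
    load_in_4bit = True,
)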
Introducing our new Documentation site which has all the most important info about Unsloth in one place. If you'd like to contribute, please contact us! Docs: https://docs.unsloth.ai/
Please update Unsloth on local machines (Colab and Kaggle users can just refresh and reload notebooks) via:
pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
Published by danielhanchen 4 months ago
We now support Gemma 2! It's 2x faster and uses 63% less VRAM than HF+FA2!
We have a Gemma 2 9b notebook here: https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing
To use Gemma 2, please update Unsloth:
pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
Head over to our blog post: https://unsloth.ai/blog/gemma2 for more details.
We uploaded 4bit quants for 4x faster downloading to:
https://huggingface.co/unsloth/gemma-2-9b-bnb-4bit
https://huggingface.co/unsloth/gemma-2-27b-bnb-4bit
https://huggingface.co/unsloth/gemma-2-9b-it-bnb-4bit
https://huggingface.co/unsloth/gemma-2-27b-it-bnb-4bit
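Loading one of these pre-quantized checkpoints uses the usual FastLanguageModel call (a minimal sketch; any of the repos above works):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2-9b-bnb-4bit",
    max_seq_length = 2048,
    dtype = None, # Auto-detects bfloat16 / float16
    load_in_4bit = True,
)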
You can now do continued pretraining with Unsloth. See https://unsloth.ai/blog/contpretraining for more details!
Continued pretraining is 2x faster and uses 50% less VRAM than HF + FA2 QLoRA. We offload `embed_tokens` and `lm_head` to disk to save VRAM!
You can now simply use both in the target modules like below:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none", # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True, # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
We also allow 2 learning rates - one for the embedding matrices and another for the LoRA adapters:
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments
trainer = UnslothTrainer(
    args = UnslothTrainingArguments(
        ...
        learning_rate = 5e-5,
        embedding_learning_rate = 5e-6,
    ),
)
We also share a free Colab to finetune Mistral v3 to learn Korean (you can select any language you like) using Wikipedia and the Aya Dataset: https://colab.research.google.com/drive/1tEd1FrOXWMnCU9UIvdYhs61tkxdMuKZu?usp=sharing
And we're sharing our free Colab notebook for continued pretraining for text completion: https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing
Full Changelog: https://github.com/unslothai/unsloth/commits/June-2024
Published by danielhanchen 5 months ago
Phi-3 models Mini and Medium are now supported.
Finetune Phi-3 Medium 1.8x faster: Colab for Phi-3 medium
Finetune Phi-3 Mini 1.85x faster: Colab for Phi-3 mini
We also resolved all issues affecting Llama 3 finetuning, so to get proper results, make sure to update Unsloth!
Many Llama 3 finetunes are broken, and we discussed this on a Reddit thread. So, be sure to use our Llama 3 base notebook or our Instruct notebook!
Mistral v3, Qwen and Yi are also now supported. We make Phi-3 2x faster with 50% less memory, and Mistral v3 2.2x faster with 73% less VRAM. All pre-quantized 4bit models (4x faster downloading) are on our Hugging Face page, including Phi-3, Qwen and more.
See our blog post for more details!
Phi-3's chat template:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-3",
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"},
)
Llama-3 Instruct's chat template:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3",
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"},
)
Please update Unsloth on local machines. For Colab or Kaggle, just refresh and restart the environment!
pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
Fixed missing `is_bfloat16_supported` by @danielhanchen in https://github.com/unslothai/unsloth/pull/510
Full Changelog: https://github.com/unslothai/unsloth/compare/April-Llama-3-2024...May-2024
Published by danielhanchen 6 months ago
Llama-3 (trained on 15 trillion tokens, GPT-3.5 level) is fully supported! Get 2x faster finetuning and 60% less VRAM usage than HF + FA2!
Colab notebook: https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing
Pre-quantized 8b and 70b weights (4x faster downloading) via https://huggingface.co/unsloth
Full Changelog: https://github.com/unslothai/unsloth/compare/April-2024...April-Llama-3-2024
Published by danielhanchen 6 months ago
You can now 2x your batch size or train on long context windows with Unsloth! 228K context windows are now possible on H100s (4x longer than HF+FA2) with Mistral 7b.
How? We coded up async offloaded gradient checkpointing in 20 lines of pure PyTorch, reducing VRAM by over 30% with only +1.9% extra overhead. We carefully mask data movement between RAM and GPU. No extra dependencies are needed.
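For intuition, here is a minimal sketch of the idea (not Unsloth's actual kernel): a custom autograd function copies each layer's input to pinned CPU memory asynchronously during the forward pass, then copies it back and recomputes the layer during backward.

import torch

class OffloadedCheckpoint(torch.autograd.Function):
    # Minimal sketch of async offloaded gradient checkpointing
    @staticmethod
    def forward(ctx, forward_fn, hidden_states):
        # Async device->host copy into pinned memory, overlapped with compute
        saved = torch.empty(hidden_states.shape, dtype = hidden_states.dtype,
                            device = "cpu", pin_memory = True)
        saved.copy_(hidden_states, non_blocking = True)
        with torch.no_grad():
            output = forward_fn(hidden_states)
        ctx.forward_fn = forward_fn
        ctx.save_for_backward(saved)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        (saved,) = ctx.saved_tensors
        # Async host->device copy, then recompute the layer with grads enabled
        hidden_states = saved.to("cuda", non_blocking = True).detach()
        hidden_states.requires_grad_(True)
        with torch.enable_grad():
            output = ctx.forward_fn(hidden_states)
        torch.autograd.backward(output, grad_output)
        return None, hidden_states.grad

A layer's forward would then be wrapped as OffloadedCheckpoint.apply(layer_forward, hidden_states); a production version additionally needs stream synchronization so the async copies complete before the buffers are reused.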
Try our Colab notebook with Mistral's new long context v2 7b model + our new VRAM savings
You can turn it on with use_gradient_checkpointing = "unsloth":
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
)
The table below shows the maximum possible sequence length with Mistral 7b QLoRA at rank = 32:
GPU | Memory | HF+FA2 | Unsloth | Unsloth New |
---|---|---|---|---|
RTX 4060 | 8 GB | 1,696 | 3,716 | 7,340 |
RTX 4070 | 12 GB | 4,797 | 11,055 | 19,610 |
RTX 4080 | 16 GB | 7,898 | 18,394 | 31,880 |
RTX 4090 | 24 GB | 14,099 | 33,073 | 56,420 |
A100 | 40 GB | 26,502 | 62,431 | 105,500 |
A6000 | 48 GB | 32,704 | 77,110 | 130,040 |
H100 | 80 GB | 57,510 | 135,826 | 228,199 |
We can now smartly convert a slow HF tokenizer to a fast one on the fly. We also now load the tokenizer automatically and fix some dangling incorrect tokens.
@HuyNguyen-hust managed to make Unsloth's RoPE embeddings around 28% faster! This is primarily useful for long context windows. Via the torch profiler, Unsloth's original kernel already made RoPE take less than 2% of total runtime, so you will see maybe 0.5 to 1% speedups, especially for large training runs. Any speedup is vastly welcome! See #238 for more details.
We edited pyproject.toml to pin protobuf<4.0.0 and make things work. As always, Colab and Kaggle do not need updating. On local machines, please use pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git to update Unsloth with no dependency changes.
Published by danielhanchen 8 months ago
You can now finetune Gemma 7b 2.43x faster than HF + Flash Attention 2, with 57.5% less VRAM use. Compared to vanilla HF, Unsloth is 2.53x faster and uses 70% less VRAM. Blog post: https://unsloth.ai/blog/gemma. On local machines, update Unsloth via pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
On 1x A100 80GB GPU, Unsloth can fit 40K total tokens (8192 * bsz of 5), whilst FA2 can fit ~15K tokens and vanilla HF can fit 9K tokens.
Gemma 7b Colab Notebook free Tesla T4: https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing
Gemma 2b Colab Notebook free Tesla T4: https://colab.research.google.com/drive/15gGm7x_jTm017_Ic8e317tdIpDG53Mtu?usp=sharing
To use Gemma, simply use FastLanguageModel:

# Load Gemma model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-7b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)
Published by danielhanchen 8 months ago
Update Unsloth on local machines (no dependency updates) with pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
Unsloth now natively supports 2x faster inference. All QLoRA, LoRA and non-LoRA inference paths are 2x faster. This requires no code changes or new dependencies.
from unsloth import FastLanguageModel
from transformers import TextStreamer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

inputs = tokenizer("Your prompt here", return_tensors = "pt").to("cuda")
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)
Assuming your dataset is a list of lists of dictionaries like the below:

[
    [{'from': 'human', 'value': 'Hi there!'},
     {'from': 'gpt', 'value': 'Hi how can I help?'},
     {'from': 'human', 'value': 'What is 2+2?'}],
    [{'from': 'human', 'value': "What's your name?"},
     {'from': 'gpt', 'value': "I'm Daniel!"},
     {'from': 'human', 'value': 'Ok! Nice!'},
     {'from': 'gpt', 'value': 'What can I do for you?'},
     {'from': 'human', 'value': 'Oh nothing :)'},],
]
You can use our `get_chat_template` to format it. Select `chat_template` to be any of zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth, and use `mapping` to map the dictionary keys `from`, `value` etc. `map_eos_token` allows you to map <|im_end|> to EOS without any training.
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "chatml", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("philschmid/guanaco-sharegpt-style", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
You can also make your own custom chat templates! For example, our internal chat template is below. You must pass in a tuple of (custom_template, eos_token) where the eos_token must be used inside the template.
unsloth_template = \
    "{{ bos_token }}"\
    "{{ 'You are a helpful assistant to the user\n' }}"\
    "{% for message in messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ '>>> User: ' + message['content'] + '\n' }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ '>>> Assistant: ' + message['content'] + eos_token + '\n' }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}"\
        "{{ '>>> Assistant: ' }}"\
    "{% endif %}"
unsloth_eos_token = "eos_token"

tokenizer = get_chat_template(
    tokenizer,
    chat_template = (unsloth_template, unsloth_eos_token,), # You must provide a template and EOS token
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)
And many bug fixes!
Published by danielhanchen 9 months ago
Upgrade Unsloth via pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git. No dependency updates will be done.
# To merge to 16bit:
model.save_pretrained_merged("dir", tokenizer, save_method = "merged_16bit")
# To merge to 4bit:
model.save_pretrained_merged("dir", tokenizer, save_method = "merged_4bit")
# To save to GGUF:
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "q4_k_m")
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "q8_0")
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "f16")
# All methods supported (listed below)
To push to HF:
model.push_to_hub_merged("hf_username/dir", tokenizer, save_method = "merged_16bit")
model.push_to_hub_merged("hf_username/dir", tokenizer, save_method = "merged_4bit")
model.push_to_hub_gguf("hf_username/dir", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf("hf_username/dir", tokenizer, quantization_method = "q8_0")
"unsloth/mistral-7b-bnb-4bit",
"unsloth/llama-2-7b-bnb-4bit",
"unsloth/llama-2-13b-bnb-4bit",
"unsloth/codellama-34b-bnb-4bit",
"unsloth/tinyllama-bnb-4bit",
`packing = True` support, making training 5x faster via TRL.
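A hedged sketch of enabling packing with TRL's SFTTrainer (argument names follow the TRL version of the time; model, tokenizer and dataset come from the earlier examples):

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    packing = True, # Packs short sequences together: up to 5x faster training
    args = TrainingArguments(output_dir = "outputs", per_device_train_batch_size = 2),
)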
GGUF: choose `quantization_method` to be one of:
"not_quantized" : "Recommended. Fast conversion. Slow inference, big files.",
"fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
"quantized" : "Recommended. Slow conversion. Fast inference, small files.",
"f32" : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
"f16" : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
"q8_0" : "Fast conversion. High resource use, but generally acceptable.",
"q4_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
"q5_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
"q2_k" : "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.",
"q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
"q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
"q3_k_s" : "Uses Q3_K for all tensors",
"q4_0" : "Original quant method, 4-bit.",
"q4_1" : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
"q4_k_s" : "Uses Q4_K for all tensors",
"q5_0" : "Higher accuracy, higher resource usage and slower inference.",
"q5_1" : "Even higher accuracy, resource usage and slower inference.",
"q5_k_s" : "Uses Q5_K for all tensors",
"q6_k" : "Uses Q8_K for all tensors",
Published by danielhanchen 10 months ago
Use Mistral as follows:
pip install "unsloth[colab_ampere] @ git+https://github.com/unslothai/unsloth.git"
from unsloth import FastMistralModel
import torch
model, tokenizer = FastMistralModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

model = FastMistralModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none", # Currently only supports bias = "none"
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)
See https://unsloth.ai/blog/mistral-benchmark for full benchmarks and more details.