Finetune Llama 3.2, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
Published by danielhanchen 24 days ago
There are some issues with Qwen 2.5 models which Unsloth has fixed!
Qwen 2.5 base models (0.5b all the way up to 72b): the EOS token should be <|endoftext|>, not <|im_end|>. The base models' <|im_end|> token is actually untrained, so using it will cause NaN gradients. You should re-pull the tokenizer from source, or you can download our fixed base models from https://huggingface.co/unsloth if that helps.
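As a quick sanity check, you can print the EOS token of a freshly downloaded base tokenizer (a minimal sketch; the repo name below is just an example):

from transformers import AutoTokenizer

# Example repo name - any Qwen 2.5 base checkpoint applies
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen2.5-7B")
print(tokenizer.eos_token) # Base models should show <|endoftext|>, not <|im_end|>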
Qwen 2.5 | 4bit | Instruct | 4bit Instruct |
---|---|---|---|
0.5b | 4bit 0.5b | Instruct 0.5b | 4bit Instruct 0.5b |
1.5b | 4bit 1.5b | Instruct 1.5b | 4bit Instruct 1.5b |
3b | 4bit 3b | Instruct 3b | 4bit Instruct 3b |
7b | 4bit 7b | Instruct 7b | 4bit Instruct 7b |
14b | 4bit 14b | Instruct 14b | 4bit Instruct 14b |
32b | 4bit 32b | Instruct 32b | 4bit Instruct 32b |
72b | 4bit 72b | Instruct 72b | 4bit Instruct 72b |
Full Changelog: https://github.com/unslothai/unsloth/compare/August-2024...September-2024
Try it out here: https://colab.research.google.com/drive/1lN6hPQveB_mHSnTOYifygFcrO8C1bxq4?usp=sharing
Full Changelog: https://github.com/unslothai/unsloth/compare/July-Mistral-2024...August-2024
Published by danielhanchen 3 months ago
Excited to announce Unsloth makes finetuning Llama 3.1 2.1x faster and uses 60% less VRAM! Read up on our release here: https://unsloth.ai/blog/llama3-1
We uploaded a Google Colab notebook to finetune Llama 3.1 (8B) on a free Tesla T4: Llama 3.1 (8B) Notebook. We also have a new UI on Google Colab for chatting with your Llama 3.1 Instruct models which uses our own 2x faster inference engine.
We created a new chat UI using Gradio where users can upload and chat with their Llama 3.1 Instruct models online for free on Google Colab.
We uploaded 4bit bitsandbytes quants here: https://huggingface.co/unsloth
To finetune Llama 3.1, please update Unsloth:
pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
Published by danielhanchen 3 months ago
See https://unsloth.ai/blog/mistral-nemo for more details. 4bit pre-quantized weights are at https://huggingface.co/unsloth.
Finetuning is 2x faster and uses 60% less VRAM; our Colab finetuning notebook is here, and our Kaggle notebook is here.
To use it, create and customize your chat template with a dataset, and Unsloth will automatically export the finetune to Ollama with automatic Modelfile creation. We also created a 'Step-by-Step Tutorial on How to Finetune Llama-3 and Deploy to Ollama'. Check out our Ollama Llama-3 Alpaca and CSV/Excel Ollama Guide notebooks.
Unlike regular chat templates that use 3 columns, Ollama simplifies the process with just 2 columns: instruction and output. And with Ollama, you can save, run, and deploy your finetuned models locally on your own device.
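As a rough sketch of the export flow (a hedged example; the Modelfile location is an assumption, and the exact steps are in the notebooks above):

# Export the finetune to GGUF - Unsloth also auto-creates an Ollama Modelfile
# in the output directory (location assumed here)
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")

# Then register and run it locally with Ollama:
#   ollama create unsloth_model -f ./model/Modelfile
#   ollama run unsloth_model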
We now support training only on the output tokens and not the inputs, which can increase accuracy. Try it with:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    ...
    args = TrainingArguments(
        ...
    ),
)

from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(trainer)
We now allow you to finetune Gemma 2, Mistral, Mistral NeMo, Qwen2 and more models with “unlimited” context lengths via RoPE linear scaling in Unsloth. Coupled with our 4x longer context support, Unsloth can handle extremely long contexts!
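For example, a minimal sketch of requesting a longer context at load time (the model name and length are illustrative; Unsloth applies RoPE linear scaling automatically when max_seq_length exceeds the model's native context):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit", # Example model
    max_seq_length = 32768, # 4x Llama 3's native 8192 context
    dtype = None,
    load_in_4bit = True,
)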
Introducing our new Documentation site which has all the most important info about Unsloth in one place. If you'd like to contribute, please contact us! Docs: https://docs.unsloth.ai/
Please update Unsloth on local machines (Colab and Kaggle users can just refresh and reload notebooks) via:
pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
Published by danielhanchen 4 months ago
We now support Gemma 2! It's 2x faster and uses 63% less VRAM than HF+FA2!
We have a Gemma 2 9b notebook here: https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing
To use Gemma 2, please update Unsloth:
pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
Head over to our blog post: https://unsloth.ai/blog/gemma2 for more details.
We uploaded 4bit quants for 4x faster downloading to:
https://huggingface.co/unsloth/gemma-2-9b-bnb-4bit
https://huggingface.co/unsloth/gemma-2-27b-bnb-4bit
https://huggingface.co/unsloth/gemma-2-9b-it-bnb-4bit
https://huggingface.co/unsloth/gemma-2-27b-it-bnb-4bit
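Loading one of these pre-quantized checkpoints uses the usual FastLanguageModel call (a minimal sketch; any of the repos above works):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2-9b-bnb-4bit",
    max_seq_length = 2048,
    dtype = None, # Auto-detects bfloat16 / float16
    load_in_4bit = True,
)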
You can now do continued pretraining with Unsloth. See https://unsloth.ai/blog/contpretraining for more details!
Continued pretraining is 2x faster and uses 50% less VRAM than HF + FA2 QLoRA. We offload `embed_tokens` and `lm_head` to disk to save VRAM!
You can now simply use both in the target modules like below:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none", # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True, # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
We also allow 2 learning rates - one for the embedding matrices and another for the LoRA adapters:
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments
trainer = UnslothTrainer(
    args = UnslothTrainingArguments(
        ...
        learning_rate = 5e-5,
        embedding_learning_rate = 5e-6,
    ),
)
We also share a free Colab to finetune Mistral v3 to learn Korean (you can select any language you like) using Wikipedia and the Aya Dataset: https://colab.research.google.com/drive/1tEd1FrOXWMnCU9UIvdYhs61tkxdMuKZu?usp=sharing
And we're sharing our free Colab notebook for continued pretraining for text completion: https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing
Full Changelog: https://github.com/unslothai/unsloth/commits/June-2024
Published by danielhanchen 5 months ago
Phi-3 models Mini and Medium are now supported.
Finetune Phi-3 Medium 1.8x faster: Colab for Phi-3 medium
Finetune Phi-3 Mini 1.85x faster: Colab for Phi-3 mini
We also resolved all issues affecting Llama 3 finetuning, so to get proper results, make sure to update Unsloth!
Many Llama 3 finetunes are broken, and we discussed this on a Reddit thread. So, be sure to use our Llama 3 base notebook or our Instruct notebook!
Mistral v3, Qwen and Yi are also now supported. We make Phi-3 2x faster with 50% less memory, and Mistral v3 2.2x faster with 73% less VRAM. All pre-quantized 4bit models (4x faster downloading) are on our Hugging Face page, including Phi-3, Qwen and more.
See our blog post for more details!
Phi-3's chat template:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-3",
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"},
)
Llama-3 Instruct's chat template:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3",
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"},
)
Please update Unsloth on local machines. For Colab or Kaggle, just refresh and restart the environment!
pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
Fixed missing `is_bfloat16_supported` by @danielhanchen in https://github.com/unslothai/unsloth/pull/510
Full Changelog: https://github.com/unslothai/unsloth/compare/April-Llama-3-2024...May-2024
Published by danielhanchen 6 months ago
Llama-3 (trained on 15 trillion tokens, GPT-3.5 level) is fully supported! Get 2x faster finetuning and 60% less VRAM usage than HF + FA2!
Colab notebook: https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing
Pre-quantized 8b and 70b weights (4x faster downloading) via https://huggingface.co/unsloth
Full Changelog: https://github.com/unslothai/unsloth/compare/April-2024...April-Llama-3-2024
Published by danielhanchen 6 months ago
You can now 2x your batch size or train on long context windows with Unsloth! 228K context windows are now possible on H100s (4x longer than HF+FA2) with Mistral 7b.
How? We coded up async offloaded gradient checkpointing in 20 lines of pure PyTorch, reducing VRAM by over 30% with only +1.9% extra overhead. We carefully mask data movement between RAM and GPU. No extra dependencies are needed.
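For intuition, here is a minimal sketch of the idea (not Unsloth's actual kernel): a custom autograd function copies each layer's input to pinned CPU memory asynchronously during the forward pass, then copies it back and recomputes the layer during backward.

import torch

class OffloadedCheckpoint(torch.autograd.Function):
    # Minimal sketch of async offloaded gradient checkpointing
    @staticmethod
    def forward(ctx, forward_fn, hidden_states):
        # Async device->host copy into pinned memory, overlapped with compute
        saved = torch.empty(hidden_states.shape, dtype = hidden_states.dtype,
                            device = "cpu", pin_memory = True)
        saved.copy_(hidden_states, non_blocking = True)
        with torch.no_grad():
            output = forward_fn(hidden_states)
        ctx.forward_fn = forward_fn
        ctx.save_for_backward(saved)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        (saved,) = ctx.saved_tensors
        # Async host->device copy, then recompute the layer with grads enabled
        hidden_states = saved.to("cuda", non_blocking = True).detach()
        hidden_states.requires_grad_(True)
        with torch.enable_grad():
            output = ctx.forward_fn(hidden_states)
        torch.autograd.backward(output, grad_output)
        return None, hidden_states.grad

A layer's forward would then be wrapped as OffloadedCheckpoint.apply(layer_forward, hidden_states); a production version additionally needs stream synchronization so the async copies complete before the buffers are reused.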
Try our Colab notebook with Mistral's new long context v2 7b model + our new VRAM savings
You can turn it on with use_gradient_checkpointing = "unsloth":
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
)
The table below shows the maximum possible sequence length with Mistral 7b QLoRA at rank = 32:
GPU | Memory | HF+FA2 | Unsloth | Unsloth New |
---|---|---|---|---|
RTX 4060 | 8 GB | 1,696 | 3,716 | 7,340 |
RTX 4070 | 12 GB | 4,797 | 11,055 | 19,610 |
RTX 4080 | 16 GB | 7,898 | 18,394 | 31,880 |
RTX 4090 | 24 GB | 14,099 | 33,073 | 56,420 |
A100 | 40 GB | 26,502 | 62,431 | 105,500 |
A6000 | 48 GB | 32,704 | 77,110 | 130,040 |
H100 | 80 GB | 57,510 | 135,826 | 228,199 |
We can now smartly convert a slow HF tokenizer to a fast one on the fly. We also now load the tokenizer automatically and fix some dangling incorrect tokens.
@HuyNguyen-hust managed to make Unsloth's RoPE embeddings around 28% faster! This is primarily useful for long context windows. Via the torch profiler, Unsloth's original kernel already made RoPE take less than 2% of total runtime, so you will see maybe 0.5 to 1% speedups, especially for large training runs. Any speedup is vastly welcome! See #238 for more details.
We edited pyproject.toml to pin protobuf<4.0.0 and make things work. As always, Colab and Kaggle do not need updating. On local machines, please use pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git to update Unsloth with no dependency changes.
Published by danielhanchen 8 months ago
You can now finetune Gemma 7b 2.43x faster than HF + Flash Attention 2, with 57.5% less VRAM use. Compared to vanilla HF, Unsloth is 2.53x faster and uses 70% less VRAM. Blog post: https://unsloth.ai/blog/gemma. On local machines, update Unsloth via pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
On 1x A100 80GB GPU, Unsloth can fit 40K total tokens (8192 * bsz of 5), whilst FA2 can fit ~15K tokens and vanilla HF can fit 9K tokens.
Gemma 7b Colab Notebook free Tesla T4: https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing
Gemma 2b Colab Notebook free Tesla T4: https://colab.research.google.com/drive/15gGm7x_jTm017_Ic8e317tdIpDG53Mtu?usp=sharing
To use Gemma, simply use FastLanguageModel:

# Load Gemma model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-7b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)
Published by danielhanchen 8 months ago
Update Unsloth on local machines (no dependency updates) with pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
Unsloth now natively supports 2x faster inference. All QLoRA, LoRA and non-LoRA inference paths are 2x faster. This requires no code changes or new dependencies.
from unsloth import FastLanguageModel
from transformers import TextStreamer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

inputs = tokenizer("Your prompt here", return_tensors = "pt").to("cuda")
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)
Assuming your dataset is a list of lists of dictionaries like the below:

[
    [{'from': 'human', 'value': 'Hi there!'},
     {'from': 'gpt', 'value': 'Hi how can I help?'},
     {'from': 'human', 'value': 'What is 2+2?'}],
    [{'from': 'human', 'value': "What's your name?"},
     {'from': 'gpt', 'value': "I'm Daniel!"},
     {'from': 'human', 'value': 'Ok! Nice!'},
     {'from': 'gpt', 'value': 'What can I do for you?'},
     {'from': 'human', 'value': 'Oh nothing :)'},],
]
You can use our `get_chat_template` to format it. Select `chat_template` to be any of zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth, and use `mapping` to map the dictionary keys `from`, `value` etc. `map_eos_token` allows you to map <|im_end|> to EOS without any training.
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "chatml", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("philschmid/guanaco-sharegpt-style", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
You can also make your own custom chat templates! For example, our internal chat template is below. You must pass in a tuple of (custom_template, eos_token) where the eos_token must be used inside the template.
unsloth_template = \
    "{{ bos_token }}"\
    "{{ 'You are a helpful assistant to the user\n' }}"\
    "{% for message in messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ '>>> User: ' + message['content'] + '\n' }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ '>>> Assistant: ' + message['content'] + eos_token + '\n' }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}"\
        "{{ '>>> Assistant: ' }}"\
    "{% endif %}"
unsloth_eos_token = "eos_token"

tokenizer = get_chat_template(
    tokenizer,
    chat_template = (unsloth_template, unsloth_eos_token,), # You must provide a template and EOS token
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)
And many bug fixes!
Published by danielhanchen 9 months ago
Upgrade Unsloth via pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git. No dependency updates will be done.
# To merge to 16bit:
model.save_pretrained_merged("dir", tokenizer, save_method = "merged_16bit")
# To merge to 4bit:
model.save_pretrained_merged("dir", tokenizer, save_method = "merged_4bit")
# To save to GGUF:
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "q4_k_m")
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "q8_0")
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "f16")
# All methods supported (listed below)
To push to HF:
model.push_to_hub_merged("hf_username/dir", tokenizer, save_method = "merged_16bit")
model.push_to_hub_merged("hf_username/dir", tokenizer, save_method = "merged_4bit")
model.push_to_hub_gguf("hf_username/dir", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf("hf_username/dir", tokenizer, quantization_method = "q8_0")
"unsloth/mistral-7b-bnb-4bit",
"unsloth/llama-2-7b-bnb-4bit",
"unsloth/llama-2-13b-bnb-4bit",
"unsloth/codellama-34b-bnb-4bit",
"unsloth/tinyllama-bnb-4bit",
`packing = True` support, making training 5x faster via TRL.
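A hedged sketch of enabling packing with TRL's SFTTrainer (argument names follow the TRL version of the time; model, tokenizer and dataset come from the earlier examples):

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    packing = True, # Packs short sequences together: up to 5x faster training
    args = TrainingArguments(output_dir = "outputs", per_device_train_batch_size = 2),
)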
GGUF: choose `quantization_method` to be one of:
"not_quantized" : "Recommended. Fast conversion. Slow inference, big files.",
"fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
"quantized" : "Recommended. Slow conversion. Fast inference, small files.",
"f32" : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
"f16" : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
"q8_0" : "Fast conversion. High resource use, but generally acceptable.",
"q4_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
"q5_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
"q2_k" : "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.",
"q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
"q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
"q3_k_s" : "Uses Q3_K for all tensors",
"q4_0" : "Original quant method, 4-bit.",
"q4_1" : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
"q4_k_s" : "Uses Q4_K for all tensors",
"q5_0" : "Higher accuracy, higher resource usage and slower inference.",
"q5_1" : "Even higher accuracy, resource usage and slower inference.",
"q5_k_s" : "Uses Q5_K for all tensors",
"q6_k" : "Uses Q8_K for all tensors",
Published by danielhanchen 10 months ago
Use Mistral as follows:
pip install "unsloth[colab_ampere] @ git+https://github.com/unslothai/unsloth.git"
from unsloth import FastMistralModel
import torch
model, tokenizer = FastMistralModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

model = FastMistralModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none", # Currently only supports bias = "none"
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)
See https://unsloth.ai/blog/mistral-benchmark for full benchmarks and more details.