An open-source implementation for fine-tuning the Llama3.2-Vision series by Meta.

Fine-tuning Llama3.2-Vision

This repository contains a script for training Llama3.2-Vision using only HuggingFace.

  • [2024/10/04] 🔥 Supports text-only data.

Supported Features

  • Deepspeed
  • LoRA, QLoRA
  • Full-finetuning
  • Multi-image and video training


Install the required packages using environment.yaml.

Using environment.yaml

conda env create -f environment.yaml
conda activate llama

Note: Llama3.2-Vision does not support flash-attention2 for now.

Dataset Preparation

The script requires a dataset formatted according to the LLaVA specification. The dataset should be a JSON file where each entry contains information about conversations and images. Ensure that the image paths in the dataset match the provided --image_folder.

When using a multi-image dataset, every image token should be <image>, and the image file names should be given as a list. Please see the examples below and format your data accordingly.

    [
        {
            "id": "000000033471",
            "image": "000000033471.jpg",
            "conversations": [
                {"from": "human", "value": "<image>\nWhat are the colors of the bus in the image?"},
                {"from": "gpt", "value": "The bus in the image is white and red."},
                {"from": "human", "value": "What feature can be seen on the back of the bus?"},
                {"from": "gpt", "value": "The back of the bus features an advertisement."},
                {"from": "human", "value": "Is the bus driving down the street or pulled off to the side?"},
                {"from": "gpt", "value": "The bus is driving down the street, which is crowded with people and other vehicles."}
            ]
        },
        {
            "id": "000000033471",
            "image": ["000000033471.jpg", "000000033472.jpg"],
            "conversations": [
                {"from": "human", "value": "<image>\n<image>\nIs the perspective of the camera different?"},
                {"from": "gpt", "value": "Yes, the perspective of the camera is different."}
            ]
        },
        {
            "id": "sample1",
            "video": "sample1.mp4",
            "conversations": [
                {"from": "human", "value": "<video>\nWhat is going on in this video?"},
                {"from": "gpt", "value": "A man is walking down the road."}
            ]
        }
    ]

Note: Llama3.2-Vision treats a video as a sequence of images.
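A quick sanity check before training can catch mismatches between image tokens and image files. The sketch below is not part of this repository: the function names `validate_entry` and `validate_dataset` are hypothetical, and it assumes that the number of <image> tokens in the human turns should equal the number of image files, as the examples above suggest.

```python
import json

IMAGE_TOKEN = "<image>"
VIDEO_TOKEN = "<video>"


def validate_entry(entry):
    """Return a list of problems found in one LLaVA-style entry (empty if none)."""
    problems = []
    conversations = entry.get("conversations", [])
    if not conversations:
        problems.append(f"{entry.get('id')}: no conversations")

    # Concatenate the text of all human turns; tokens are expected there.
    human_text = "".join(
        turn.get("value", "") for turn in conversations if turn.get("from") == "human"
    )

    # "image" may be a single file name or a list of file names.
    images = entry.get("image", [])
    if isinstance(images, str):
        images = [images]

    # Assumption: one <image> token per image file.
    n_tokens = human_text.count(IMAGE_TOKEN)
    if n_tokens != len(images):
        problems.append(
            f"{entry.get('id')}: {n_tokens} {IMAGE_TOKEN} token(s) for {len(images)} image(s)"
        )

    if "video" in entry and VIDEO_TOKEN not in human_text:
        problems.append(f"{entry.get('id')}: video entry without a {VIDEO_TOKEN} token")
    return problems


def validate_dataset(path):
    """Load a JSON dataset file and collect problems across all entries."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    return [problem for entry in entries for problem in validate_entry(entry)]
```

Text-only entries pass the check because they have no image files and no image tokens.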


To run the training script, use the following command:

Full Finetuning

bash scripts/

Finetune with LoRA

If you want to train only the language model with LoRA and perform full training for the vision model:

bash scripts/

If you want to train both the language model and the vision model with LoRA:

bash scripts/

IMPORTANT: If you want to tune the embed_token with LoRA, you need to tune lm_head together with it.
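The constraint above can be checked before launching a run. This is a minimal sketch, not code from this repository: `check_lora_targets` is a hypothetical helper, and the substrings "embed_tokens" and "lm_head" are assumed module names that may differ from the ones the training script uses.

```python
def check_lora_targets(target_modules):
    """Raise if the embedding layer is a LoRA target but lm_head is not.

    target_modules: iterable of module names selected for LoRA.
    The substrings matched here are assumptions about the module naming.
    """
    names = list(target_modules)
    tunes_embedding = any("embed_tokens" in name for name in names)
    tunes_lm_head = any("lm_head" in name for name in names)
    if tunes_embedding and not tunes_lm_head:
        raise ValueError("embed_tokens is a LoRA target, so lm_head must be one too")
    return names
```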

  • --deepspeed (str): Path to DeepSpeed config file (default: "scripts/zero2.json").
  • --data_path (str): Path to the LLaVA formatted training data (a JSON file). (Required)
  • --image_folder (str): Path to the images folder as referenced in the LLaVA formatted training data. (Required)
  • --model_id (str): Path to the Llama3.2-Vision model. (Required)
  • --output_dir (str): Output directory for model checkpoints
  • --num_train_epochs (int): Number of training epochs (default: 1).
  • --per_device_train_batch_size (int): Training batch size per GPU per forward pass.
  • --gradient_accumulation_steps (int): Gradient accumulation steps (default: 4).
  • --freeze_vision_tower (bool): Option to freeze vision_model (default: False).
  • --tune_merger (bool): Option to tune projector (default: True).
  • --num_lora_modules (int): Number of target modules to add LoRA (-1 means all layers).
  • --vision_lr (float): Learning rate for vision_model.
  • --projector_lr (float): Learning rate for projector.
  • --learning_rate (float): Learning rate for language module.
  • --bf16 (bool): Option for using bfloat16.
  • --fp16 (bool): Option for using fp16.
  • --lora_namespan_exclude (str): Exclude modules with namespans to add LoRA.
  • --max_seq_length (int): Maximum sequence length (default: 128K).
  • --bits (int): Quantization bits (default: 16).
  • --disable_flash_attn2 (bool): Disable Flash Attention 2.
  • --report_to (str): Reporting tool (choices: 'tensorboard', 'wandb', 'none') (default: 'tensorboard').
  • --logging_dir (str): Logging directory (default: "./tf-logs").
  • --lora_rank (int): LoRA rank (default: 128).
  • --lora_alpha (int): LoRA alpha (default: 256).
  • --lora_dropout (float): LoRA dropout (default: 0.05).
  • --logging_steps (int): Logging steps (default: 1).
  • --dataloader_num_workers (int): Number of data loader workers (default: 4).
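Because --per_device_train_batch_size is counted per GPU and per forward pass, the number of samples behind each optimizer update also depends on --gradient_accumulation_steps and the number of GPUs. A small sketch of that arithmetic follows; the helper names and the `num_gpus` parameter are illustrative and not arguments of the training script.

```python
import math


def effective_batch_size(per_device_train_batch_size, gradient_accumulation_steps, num_gpus):
    """Samples that contribute to one optimizer update."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus


def optimizer_steps_per_epoch(num_samples, per_device_train_batch_size,
                              gradient_accumulation_steps, num_gpus):
    """Approximate number of optimizer updates in one pass over the data."""
    batch = effective_batch_size(
        per_device_train_batch_size, gradient_accumulation_steps, num_gpus
    )
    return math.ceil(num_samples / batch)
```

For example, a per-device batch size of 1 with the default 4 accumulation steps on 8 GPUs gives an effective batch size of 32.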

Note: The learning rate of vision_model should be 5x to 10x smaller than that of the language_model.
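Separate learning rates such as --vision_lr, --projector_lr, and --learning_rate are commonly applied through optimizer parameter groups. The sketch below shows one way this could look and is not necessarily how this repository does it: `build_param_groups` is a hypothetical helper, and the name substrings "vision_model" and "projector" are assumptions about how the parameters are named.

```python
def build_param_groups(named_params, learning_rate, vision_lr, projector_lr):
    """Split (name, parameter) pairs into groups with their own learning rates.

    The substring checks below are assumptions about the parameter names.
    """
    groups = {
        "vision": {"params": [], "lr": vision_lr},
        "projector": {"params": [], "lr": projector_lr},
        "language": {"params": [], "lr": learning_rate},
    }
    for name, param in named_params:
        if "vision_model" in name:
            groups["vision"]["params"].append(param)
        elif "projector" in name:
            groups["projector"]["params"].append(param)
        else:
            groups["language"]["params"].append(param)
    # Drop empty groups, e.g. when the vision tower is frozen and filtered out.
    return [group for group in groups.values() if group["params"]]
```

A list of dictionaries with "params" and "lr" keys is the shape that PyTorch optimizers accept, so the result could be passed to an optimizer built from `model.named_parameters()`.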

Train with video dataset

You can train the model using a video dataset. However, Llama3.2-Vision processes videos as a sequence of images, so you'll need to select specific frames and treat them as multiple images for training. You can also set the LoRA configs and train on video data with LoRA.

bash scripts/

If you run out of VRAM, you can use zero3_offload instead of zero3. However, using zero3 is preferred.
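Since frames have to be selected from each video, one simple strategy is to take evenly spaced frames. The sketch below illustrates that idea only: `sample_frame_indices` is a hypothetical helper, and the repository may select frames differently.

```python
def sample_frame_indices(total_frames, num_frames):
    """Pick num_frames evenly spaced frame indices from a video.

    The video is split into num_frames equal segments and the middle
    frame of each segment is chosen. Integer arithmetic avoids rounding
    surprises from floating point.
    """
    if total_frames <= 0 or num_frames <= 0:
        return []
    if num_frames >= total_frames:
        # Fewer frames than requested: use every frame once.
        return list(range(total_frames))
    return [
        (2 * i + 1) * total_frames // (2 * num_frames)
        for i in range(num_frames)
    ]
```

For a 100-frame video and 4 requested frames, this returns the indices 12, 37, 62, and 87. The chosen frames can then be decoded and passed to the model as a multi-image sample.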

Merge LoRA Weights

bash scripts/

Note: Remember to replace the paths in the scripts with your own paths, including the script used when working with LoRA.
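For intuition about what merging does, the standard LoRA formulation folds the low-rank update into the base weight as W' = W + (lora_alpha / lora_rank) * B A. The sketch below shows only that arithmetic on small nested lists. It is not the repository's merge script, which operates on real model checkpoints, and the helper names are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_A, lora_B, lora_alpha, lora_rank):
    """Return weight + (lora_alpha / lora_rank) * (lora_B @ lora_A).

    weight: out x in, lora_B: out x rank, lora_A: rank x in.
    """
    scaling = lora_alpha / lora_rank
    delta = matmul(lora_B, lora_A)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]
```

With the defaults listed above (--lora_rank 128, --lora_alpha 256), the scaling factor in this formulation is 2.0.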

Issue for libcudnn error

Could not load library Error: /usr/local/cuda-12.1/lib/ undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version

You can run unset LD_LIBRARY_PATH to resolve this error. See this issue for details.


  • Support for multi-image & video data
  • Support for batch_size > 1

This project is licensed under the Apache-2.0 License. See the LICENSE file for details.


This project is based on

