SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
This project accompanies the research paper:

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Mingze Xu*, Mingfei Gao*, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan
SlowFast-LLaVA is a training-free multimodal large language model (LLM) for video understanding and reasoning. Without fine-tuning on any data, it achieves performance comparable to, or better than, state-of-the-art Video LLMs on a wide range of VideoQA tasks and benchmarks, as shown in the figure.
The code is developed with CUDA 11.7, Python >= 3.10.12, and PyTorch >= 2.1.0.
[Optional but recommended] Create a new conda environment.
conda create -n sf_llava python=3.10.12
And activate the environment.
conda activate sf_llava
Install the requirements.
bash setup_env.sh
Add OpenAI key and organization to the system environment to use GPT-3.5-turbo for model evaluation.
export OPENAI_API_KEY=$YOUR_OPENAI_API_KEY
export OPENAI_ORG=$YOUR_OPENAI_ORG # optional
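The evaluation scripts read these variables from the environment; a quick way to sanity-check them from Python before launching a long evaluation run (the helper below is illustrative, not part of the repo):

```python
import os

def check_openai_env():
    """Return (key, org) from the environment; org may be None since it is optional."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; GPT-based evaluation will fail.")
    # OPENAI_ORG is optional, so a missing value is fine.
    return key, os.environ.get("OPENAI_ORG")
```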
Download the pre-trained LLaVA-NeXT weights from HuggingFace, and put them under the ml-slowfast-llava folder.
git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b liuhaotian/llava-v1.6-vicuna-7b
git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-34b liuhaotian/llava-v1.6-34b
We prepare the ground-truth question and answer files based on IG-VLM, and put them under playground/gt_qa_files.
Download each source QA file and convert it with the matching script:

- MSVD_QA.csv: download the CSV file, then run
  python scripts/data/prepare_msvd_qa_file.py --qa_file $PATH_TO_CSV_FILE
- MSRVTT_QA.csv: download the CSV file, then run
  python scripts/data/prepare_msrvtt_qa_file.py --qa_file $PATH_TO_CSV_FILE
- TGIF_FrameQA.csv: download the CSV file, then run
  python scripts/data/prepare_tgif_qa_file.py --qa_file $PATH_TO_CSV_FILE
- Activitynet_QA.csv: download the CSV file, then run
  python scripts/data/prepare_activitynet_qa_file.py --qa_file $PATH_TO_CSV_FILE
- NExT_QA.csv: download the CSV file, then run
  python scripts/data/prepare_nextqa_qa_file.py --qa_file $PATH_TO_CSV_FILE
- EgoSchema.csv: download the CSV file, then run
  python scripts/data/prepare_egoschema_qa_file.py --qa_file $PATH_TO_CSV_FILE
- IntentQA.csv: download the CSV file, then run
  python scripts/data/prepare_intentqa_qa_file.py --qa_file $PATH_TO_CSV_FILE
- text_generation_benchmark: download the benchmark folder, then run
  python scripts/data/prepare_vcgbench_qa_file.py --qa_folder $TEXT_GENERATION_BENCHMARK
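If you are preparing several benchmarks at once, the commands above can be driven from a small wrapper; a sketch (the wrapper itself and the single CSV directory are assumptions -- the script paths come from the list above):

```python
import subprocess
from pathlib import Path

# Map each downloaded CSV to its conversion script (script names from the list above;
# keeping all CSVs in one directory is an assumption -- adjust to your setup).
PREPARE_SCRIPTS = {
    "MSVD_QA.csv": "scripts/data/prepare_msvd_qa_file.py",
    "MSRVTT_QA.csv": "scripts/data/prepare_msrvtt_qa_file.py",
    "TGIF_FrameQA.csv": "scripts/data/prepare_tgif_qa_file.py",
    "Activitynet_QA.csv": "scripts/data/prepare_activitynet_qa_file.py",
    "NExT_QA.csv": "scripts/data/prepare_nextqa_qa_file.py",
    "EgoSchema.csv": "scripts/data/prepare_egoschema_qa_file.py",
    "IntentQA.csv": "scripts/data/prepare_intentqa_qa_file.py",
}

def prepare_all(csv_dir, dry_run=False):
    """Run every conversion script whose CSV exists; return the commands issued."""
    cmds = []
    for csv_name, script in PREPARE_SCRIPTS.items():
        csv_path = Path(csv_dir) / csv_name
        if not csv_path.is_file():
            continue  # skip benchmarks you have not downloaded
        cmd = ["python", script, "--qa_file", str(csv_path)]
        cmds.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return cmds
```

Pass dry_run=True to preview the commands without executing anything.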
Download the raw videos from the official websites.

- Openset VideoQA: follow the instructions in Video-LLaVA to download the raw videos.
- Multiple Choice VideoQA: download the raw videos from the official dataset websites.
- Text Generation: download the raw videos from the official benchmark website.
Organize the raw videos under playground/data. To directly use our data loaders without changing paths, please organize your datasets as follows:

$ ml-slowfast-llava/playground/data
├── video_qa
│   ├── MSVD_Zero_Shot_QA
│   │   └── videos
│   │       └── ...
│   ├── MSRVTT_Zero_Shot_QA
│   │   └── videos
│   │       └── all
│   │           └── ...
│   ├── TGIF_Zero_Shot_QA
│   │   └── mp4
│   │       └── ...
│   └── Activitynet_Zero_Shot_QA
│       └── all_test
│           └── ...
└── multiple_choice_qa
    ├── NExTQA
    │   └── video
    │       └── ...
    ├── EgoSchema
    │   └── video
    │       └── ...
    └── IntentQA
        └── video
            └── ...
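Before running inference, it can help to verify that the video folders are where the data loaders expect them; the checker below is illustrative (not part of the repo) and only covers the directories shown in the tree above:

```python
from pathlib import Path

# Expected sub-folders under playground/data, taken from the tree above.
LAYOUT = {
    "video_qa": [
        "MSVD_Zero_Shot_QA/videos",
        "MSRVTT_Zero_Shot_QA/videos/all",
        "TGIF_Zero_Shot_QA/mp4",
        "Activitynet_Zero_Shot_QA/all_test",
    ],
    "multiple_choice_qa": ["NExTQA/video", "EgoSchema/video", "IntentQA/video"],
}

def check_layout(data_root="playground/data"):
    """Return the expected video directories that are missing under data_root."""
    root = Path(data_root)
    return [
        f"{group}/{sub}"
        for group, subs in LAYOUT.items()
        for sub in subs
        if not (root / group / sub).is_dir()
    ]
```

An empty return value means every expected directory is in place.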
We use a yaml config to control the design choices of SlowFast-LLaVA. We use the config of SlowFast-LLaVA-7B as an example to explain some important parameters.

- SCRIPT: controls the tasks that you want to run.
- DATA_DIR and CONV_MODE: the data directories and prompts for different tasks. They can be either a string or a list of strings, but must match SCRIPT.
- NUM_FRAMES: the total number of sampled video frames.
- TEMPORAL_AGGREGATION: controls the setting of the Slow and Fast pathways. It should be a string with the pattern slowfast-slow_{$S_FRMS}frms_{$S_POOL}-fast_{$F_OH}x{$F_OW}, where
  - $S_FRMS is an integer indicating the number of frames in the Slow pathway,
  - $S_POOL is a string indicating the pooling operation for the Slow pathway,
  - $F_OH and $F_OW are integers giving the height and width of the output tokens in the Fast pathway.

SlowFast-LLaVA is a training-free method, so we can directly run inference and evaluation without model training.
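The TEMPORAL_AGGREGATION pattern can be parsed mechanically; a small sketch (the parser and the example value in the usage note are illustrative, not taken from the repo):

```python
import re

# Pattern described above: slowfast-slow_{S_FRMS}frms_{S_POOL}-fast_{F_OH}x{F_OW}
_TA_RE = re.compile(
    r"slowfast-slow_(?P<s_frms>\d+)frms_(?P<s_pool>[^-]+)-fast_(?P<f_oh>\d+)x(?P<f_ow>\d+)"
)

def parse_temporal_aggregation(value):
    """Split a TEMPORAL_AGGREGATION string into its Slow/Fast components."""
    m = _TA_RE.fullmatch(value)
    if m is None:
        raise ValueError(f"unrecognized TEMPORAL_AGGREGATION: {value!r}")
    return {
        "slow_frames": int(m.group("s_frms")),   # frames in the Slow pathway
        "slow_pool": m.group("s_pool"),          # pooling op for the Slow pathway
        "fast_out_hw": (int(m.group("f_oh")), int(m.group("f_ow"))),  # Fast token grid
    }
```

For example, parse_temporal_aggregation("slowfast-slow_10frms_spatial_1d_max_pool-fast_4x4") yields 10 Slow frames, a spatial_1d_max_pool pooling string, and a 4x4 Fast token grid.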
By default, we use 8 GPUs for model inference. You can modify CUDA_VISIBLE_DEVICES in the config file to accommodate your own setup. Please note that model inference of SlowFast-LLaVA-34B requires GPUs with at least 80 GB of memory.
cd ml-slowfast-llava
python run_inference.py --exp_config $PATH_TO_CONFIG_FILE
Optionally, run export PYTHONWARNINGS="ignore" if you want to suppress the warnings. Model outputs are saved under outputs/artifacts, evaluation results under outputs/eval_save_dir, and logs under outputs/logs.
We provide a script for running video question-answering on a single video.
cd ml-slowfast-llava
python run_demo.py --video_path $PATH_TO_VIDEO --model_path $PATH_TO_LLAVA_MODEL --question "Describe this video in detail"
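To run the demo over a whole folder of videos, a small wrapper can shell out to run_demo.py once per file; a sketch (the wrapper and its extension filter are assumptions, not part of the repo):

```python
import subprocess
from pathlib import Path

def run_demo_on_folder(video_dir, model_path, question, exts=(".mp4", ".avi", ".gif")):
    """Invoke run_demo.py once per video file; return the names of videos processed."""
    videos = sorted(p for p in Path(video_dir).iterdir() if p.suffix.lower() in exts)
    for video in videos:
        subprocess.run(
            ["python", "run_demo.py",
             "--video_path", str(video),
             "--model_path", model_path,
             "--question", question],
            check=True,
        )
    return [v.name for v in videos]
```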
This project is licensed under the Apple Sample Code License.
If you are using the data/code/model provided here in a publication, please cite our paper:
@article{xu2024slowfast,
title={SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models},
author={Xu, Mingze and Gao, Mingfei and Gan, Zhe and Chen, Hong-You and Lai, Zhengfeng and Gang, Haiming and Kang, Kai and Dehghan, Afshin},
journal={arXiv:2407.15841},
year={2024}
}