BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Announcement: BLIP is now officially integrated into LAVIS - a one-stop library for language-and-vision research and applications!
This is the PyTorch code of the BLIP paper [blog]. The code has been tested on PyTorch 1.10.
To install the dependencies, run `pip install -r requirements.txt`.
Inference demo:
Run our interactive demo using Colab notebook (no GPU needed).
The demo includes code for:
- Image captioning
- Open-ended visual question answering
- Multimodal / unimodal feature extraction
- Image-text matching
Try out the Web demo, integrated into Huggingface Spaces 🤗 using Gradio.
A Replicate web demo and Docker image are also available.
Pre-trained checkpoints:
| Num. pre-train images | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L |
| --- | --- | --- | --- |
| 14M | Download | - | - |
| 129M | Download | Download | Download |
Finetuned checkpoints:
| Task | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L |
| --- | --- | --- | --- |
| Image-Text Retrieval (COCO) | Download | - | Download |
| Image-Text Retrieval (Flickr30k) | Download | - | Download |
| Image Captioning (COCO) | - | Download | Download |
| VQA | Download | Download | - |
| NLVR2 | Download | - | - |
Image-Text Retrieval:
- Download COCO and Flickr30k datasets from the original websites, and set 'image_root' in configs/retrieval_{dataset}.yaml accordingly.
- To evaluate the finetuned BLIP model on COCO, run:
- To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/retrieval_coco.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run:
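As a sketch, the launch commands might look like the following (assuming the repository's `train_retrieval.py` entry point; adjust the output directory to your setup):

```shell
# Evaluate the finetuned checkpoint on COCO (8 GPUs)
python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \
  --config ./configs/retrieval_coco.yaml \
  --output_dir output/retrieval_coco \
  --evaluate

# Finetune from the pre-trained checkpoint (8 GPUs)
python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \
  --config ./configs/retrieval_coco.yaml \
  --output_dir output/retrieval_coco
```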
Image-Text Captioning:
- Download COCO and NoCaps datasets from the original websites, and set 'image_root' in configs/caption_coco.yaml and configs/nocaps.yaml accordingly.
- To evaluate the finetuned BLIP model on COCO, run:
- To evaluate the finetuned BLIP model on NoCaps, generate results with: (evaluation needs to be performed on the official server)
- To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/caption_coco.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth". Then run:
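A sketch of the corresponding launch commands (assuming the repository's `train_caption.py` and `eval_nocaps.py` entry points):

```shell
# Evaluate the finetuned checkpoint on COCO (8 GPUs)
python -m torch.distributed.run --nproc_per_node=8 train_caption.py --evaluate

# Generate NoCaps results for submission to the official server
python -m torch.distributed.run --nproc_per_node=8 eval_nocaps.py

# Finetune from the pre-trained checkpoint (8 GPUs)
python -m torch.distributed.run --nproc_per_node=8 train_caption.py
```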
VQA:
- Download VQA v2 dataset and Visual Genome dataset from the original websites, and set 'vqa_root' and 'vg_root' in configs/vqa.yaml.
- To evaluate the finetuned BLIP model, generate results with: (evaluation needs to be performed on the official server)
- To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/vqa.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth". Then run:
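A sketch of the launch commands (assuming the repository's `train_vqa.py` entry point):

```shell
# Generate VQA results for the official evaluation server (8 GPUs)
python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --evaluate

# Finetune from the pre-trained checkpoint (16 GPUs)
python -m torch.distributed.run --nproc_per_node=16 train_vqa.py
```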
NLVR2:
- Download NLVR2 dataset from the original websites, and set 'image_root' in configs/nlvr.yaml.
- To evaluate the finetuned BLIP model, run
- To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/nlvr.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run:
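A sketch of the launch commands (assuming the repository's `train_nlvr.py` entry point):

```shell
# Evaluate the finetuned checkpoint (8 GPUs)
python -m torch.distributed.run --nproc_per_node=8 train_nlvr.py --evaluate

# Finetune from the pre-trained checkpoint (16 GPUs)
python -m torch.distributed.run --nproc_per_node=16 train_nlvr.py
```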
Finetune with ViT-L:
To finetune a model with ViT-L, simply set 'vit' to large in the config file. Batch size and learning rate may also need to be adjusted accordingly (please see the paper's appendix for hyper-parameter details). Gradient checkpointing can also be activated in the config file to reduce GPU memory usage.
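As an illustrative config fragment (key names follow the repo's yaml conventions, but the exact keys and values vary by task, so treat this as a sketch rather than a drop-in config):

```yaml
vit: large            # switch the vision backbone from base to large
vit_grad_ckpt: True   # enable gradient checkpointing to reduce GPU memory
batch_size: 16        # illustrative value; reduce for the larger backbone
init_lr: 5e-6         # illustrative value; ViT-L typically needs a lower LR
```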
Pre-train:
- Prepare training json files, where each json file contains a list. Each item in the list is a dictionary with two key-value pairs: {'image': path_of_image, 'caption': text_of_image}.
- In configs/pretrain.yaml, set 'train_file' as the paths of the json files.
- Pre-train the model using 8 A100 GPUs:
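A sketch of the launch command (assuming the repository's `pretrain.py` entry point):

```shell
python -m torch.distributed.run --nproc_per_node=8 pretrain.py \
  --config ./configs/pretrain.yaml \
  --output_dir output/pretrain
```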
Zero-shot video-text retrieval:
- Download MSRVTT dataset following the instructions from https://github.com/salesforce/ALPRO, and set 'video_root' accordingly in configs/retrieval_msrvtt.yaml.
- Install decord with pip install decord
- To perform zero-shot evaluation, run
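A sketch of the launch command (assuming the repository's `eval_retrieval_video.py` entry point):

```shell
python -m torch.distributed.run --nproc_per_node=8 eval_retrieval_video.py
```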
Pre-training datasets download:
We provide bootstrapped pre-training datasets as json files. Each json file contains a list. Each item in the list is a dictionary with two key-value pairs: {'url': url_of_image, 'caption': text_of_image}.
| Image source | Filtered web caption | Filtered synthetic caption by ViT-B | Filtered synthetic caption by ViT-L |
| --- | --- | --- | --- |
| CC3M+CC12M+SBU | Download | Download | Download |
| LAION115M | Download | Download | Download |
Citation
If you find this code useful for your research, please consider citing.
Acknowledgement
The implementation of BLIP relies on resources from ALBEF, Huggingface Transformers, and timm. We thank the original authors for open-sourcing their work.