Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation (NeurIPS 2023)
Try out our online demo integrated into Huggingface Spaces 🤗 using Gradio!
```bash
git clone https://github.com/yuezih/SMILE
cd SMILE/BLIP
pip install -r requirements.txt
```
The code has been tested on PyTorch 2.0.0.
The data configs are in SMILE/BLIP/configs/caption_coco.yaml. Set `image_root` to your MSCOCO image root, as sketched below.
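A minimal sketch of the relevant config entries, assuming the file follows BLIP's standard caption config (key names other than `image_root` are assumptions; check the file itself):

```yaml
# SMILE/BLIP/configs/caption_coco.yaml (illustrative excerpt)
image_root: '/path/to/coco/images/'  # set this to your MSCOCO image root
ann_root: 'annotation'               # assumed: where the caption annotations live
```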
The pre-trained and MLE-finetuned checkpoints are available at the original BLIP repo.
We provide two checkpoints finetuned on MSCOCO with SMILE:

- `blip_smile_base.pth`: the vanilla SMILE-optimized BLIP.
- `blip_mle_smile_base.pth`: BLIP finetuned with MLE+SMILE (weighted 0.01:0.99), a compromise between descriptiveness and accuracy.

| Model | Cap. Len. | Lex. Div. | R@1 | R@5 | CLIPScore | PPL |
|---|---|---|---|---|---|---|
| `blip_smile_base.pth` | 22.3 | 4.5 | 10.0 | 24.5 | 75.0 | 95.6 |
| `blip_mle_smile_base.pth` | 19.8 | 3.6 | 10.9 | 25.1 | 76.2 | 79.4 |
They are available at our Huggingface Spaces. You can clone the entire space with the following commands; the checkpoints can then be found in `BLIP-SMILE/model`:
```bash
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/spaces/yuezih/BLIP-SMILE
```
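As an alternative to cloning with git-lfs, the same files can be fetched with the huggingface_hub Python client; this is a sketch of one option, not part of the repo's own tooling:

```python
# Download the BLIP-SMILE Space (checkpoints included) without git-lfs.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="yuezih/BLIP-SMILE",
    repo_type="space",  # the checkpoints live in a Space, not a model repo
)
print(local_dir)  # the checkpoints should be under <local_dir>/model
```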
We also provide a OneDrive link to download the checkpoints.
After preparing the checkpoints, set the checkpoint path in SMILE/BLIP/configs/caption_coco.yaml.
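In BLIP-style caption configs, the checkpoint to load is typically given by a `pretrained` entry; that key name is an assumption here, so verify it against the shipped config:

```yaml
# Illustrative excerpt; the `pretrained` key is assumed from BLIP's standard config.
pretrained: '/path/to/blip_smile_base.pth'
```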
```bash
# Training
bash scripts/train.sh

# Evaluation
bash scripts/eval.sh
```
Kind reminders:

- Please use `transformers==4.15.0` rather than a higher version.
- For `torch<2.0.0`, replace `torchrun` with `python -m torch.distributed.run` in the training and inference scripts, e.g. as in the sketch below.
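A hypothetical before/after, assuming a typical torchrun launch; the actual command and arguments live in scripts/train.sh and scripts/eval.sh and may differ:

```bash
# torch >= 2.0.0 (hypothetical launch command; check scripts/train.sh):
torchrun --nproc_per_node=8 train_caption.py --config ./configs/caption_coco.yaml
# torch < 2.0.0 equivalent:
python -m torch.distributed.run --nproc_per_node=8 train_caption.py --config ./configs/caption_coco.yaml
```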
If you find this repo helpful for your research, please consider citing our paper:
```bibtex
@misc{yue2023learning,
  title={Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation},
  author={Zihao Yue and Anwen Hu and Liang Zhang and Qin Jin},
  year={2023},
  eprint={2306.13460},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
Our work relies on resources from BLIP and HuggingFace transformers. Many thanks to them for their amazing efforts.