Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation (NeurIPS 2023)
Try out our online demo integrated into Huggingface Spaces 🤗 using Gradio!
```bash
git clone https://github.com/yuezih/SMILE
cd SMILE/BLIP
pip install -r requirements.txt
```
The code has been tested on PyTorch 2.0.0.
The data configs are in SMILE/BLIP/configs/caption_coco.yaml. Set `image_root` to your MSCOCO image root, as sketched below.
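A minimal sketch of the relevant config entries, assuming the file follows BLIP's standard caption config (key names other than `image_root` are assumptions; check the file itself):

```yaml
# SMILE/BLIP/configs/caption_coco.yaml (illustrative excerpt)
image_root: '/path/to/coco/images/'  # set this to your MSCOCO image root
ann_root: 'annotation'               # assumed: where the caption annotations live
```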
The pre-trained and MLE-finetuned checkpoints are available at the original BLIP repo.
We provide two checkpoints finetuned on MSCOCO with SMILE:

- `blip_smile_base.pth`: the vanilla SMILE-optimized BLIP.
- `blip_mle_smile_base.pth`: BLIP finetuned with MLE+SMILE (weighted 0.01:0.99), a compromise between descriptiveness and accuracy.

| Model | Cap. Len. | Lex. Div. | R@1 | R@5 | CLIPScore | PPL |
|---|---|---|---|---|---|---|
| `blip_smile_base.pth` | 22.3 | 4.5 | 10.0 | 24.5 | 75.0 | 95.6 |
| `blip_mle_smile_base.pth` | 19.8 | 3.6 | 10.9 | 25.1 | 76.2 | 79.4 |
They are available at our Huggingface Spaces. You can clone the entire space with the following commands; the checkpoints can then be found in `BLIP-SMILE/model`:
```bash
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/spaces/yuezih/BLIP-SMILE
```
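As an alternative to cloning with git-lfs, the same files can be fetched with the huggingface_hub Python client; this is a sketch of one option, not part of the repo's own tooling:

```python
# Download the BLIP-SMILE Space (checkpoints included) without git-lfs.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="yuezih/BLIP-SMILE",
    repo_type="space",  # the checkpoints live in a Space, not a model repo
)
print(local_dir)  # the checkpoints should be under <local_dir>/model
```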
We also provide a OneDrive link to download the checkpoints.
After preparing the checkpoints, set the checkpoint path in SMILE/BLIP/configs/caption_coco.yaml.
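In BLIP-style caption configs, the checkpoint to load is typically given by a `pretrained` entry; that key name is an assumption here, so verify it against the shipped config:

```yaml
# Illustrative excerpt; the `pretrained` key is assumed from BLIP's standard config.
pretrained: '/path/to/blip_smile_base.pth'
```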
```bash
# Training
bash scripts/train.sh

# Evaluation
bash scripts/eval.sh
```
Kind reminders:

- Please use `transformers==4.15.0` rather than a higher version.
- For `torch<2.0.0`, replace `torchrun` with `python -m torch.distributed.run` in the training and inference scripts, e.g. as in the sketch below.
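A hypothetical before/after, assuming a typical torchrun launch; the actual command and arguments live in scripts/train.sh and scripts/eval.sh and may differ:

```bash
# torch >= 2.0.0 (hypothetical launch command; check scripts/train.sh):
torchrun --nproc_per_node=8 train_caption.py --config ./configs/caption_coco.yaml
# torch < 2.0.0 equivalent:
python -m torch.distributed.run --nproc_per_node=8 train_caption.py --config ./configs/caption_coco.yaml
```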
If you find this repo helpful for your research, please consider citing our paper:
```bibtex
@misc{yue2023learning,
  title={Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation},
  author={Zihao Yue and Anwen Hu and Liang Zhang and Qin Jin},
  year={2023},
  eprint={2306.13460},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
Our work relies on resources from BLIP and HuggingFace transformers. Many thanks to them for their amazing efforts.