Visual Parser: Representing Part-whole Hierarchies with Transformers

This repository contains the official implementation for reproducing the object detection results of ViP (Visual Parser). It is built on mmdetection.

Results and Models

Cascade Mask R-CNN

Backbone	Pretrain	Lr Schd	box mAP	mask mAP	#params	FLOPs	config	log	model
ViP-Ti	ImageNet-1K	1x	45.3	39.8	69.2M	678G	config	Google Drive	Google Drive
ViP-S	ImageNet-1K	1x	48.0	42.0	87.1M	725G	config	Google Drive	Google Drive
ViP-M	ImageNet-1K	1x	49.9	43.5	107.0M	785G	-	-	Coming Soon

RetinaNet

Backbone	Pretrain	Lr Schd	box mAP	#params	FLOPs	config	log	model
ViP-Ti	ImageNet-1K	1x	39.9	21.4M	181G	config	Google Drive	Google Drive
ViP-S	ImageNet-1K	1x	42.7	39.9M	227G	config	Google Drive	Google Drive
ViP-S	ImageNet-1K	3x	43.9	39.9M	227G	config	Google Drive	Google Drive
ViP-M	ImageNet-1K	1x	44.3	59.8M	287G	-	-	Coming Soon

Notes:

Pre-trained ImageNet-1K backbone models can be downloaded from the Visual Parser repository.

Usage

Installation

Please refer to get_started.md for installation and dataset preparation.

Inference

# single-gpu testing (box-only detectors such as RetinaNet take --eval bbox without segm)
python tools/test.py <CONFIG_FILE> <DET_CHECKPOINT_FILE> --eval bbox segm

# multi-gpu testing
tools/dist_test.sh <CONFIG_FILE> <DET_CHECKPOINT_FILE> <GPU_NUM> --eval bbox segm
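
When evaluating several checkpoints, the metric list has to match the detector: Cascade Mask R-CNN reports both box and mask mAP, while RetinaNet reports box mAP only. The following is a minimal sketch of a helper that assembles the two commands above. `build_test_command` and its parameters are hypothetical names used for illustration; they are not part of this repository or of mmdetection, and the file paths in the example are placeholders.

```python
import shlex
from typing import List


def build_test_command(config: str, checkpoint: str, gpus: int = 1,
                       with_masks: bool = True) -> List[str]:
    """Assemble the argument list for the test commands shown above.

    Hypothetical helper, not part of this repository or mmdetection.
    Set with_masks=False for box-only detectors such as RetinaNet.
    """
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    # Mask detectors are scored on boxes and masks, box-only detectors on boxes.
    metrics = ["bbox", "segm"] if with_masks else ["bbox"]
    if gpus == 1:
        # Single-GPU testing calls tools/test.py directly.
        cmd = ["python", "tools/test.py", config, checkpoint]
    else:
        # Multi-GPU testing goes through the distributed launcher script.
        cmd = ["tools/dist_test.sh", config, checkpoint, str(gpus)]
    return cmd + ["--eval"] + metrics


if __name__ == "__main__":
    # Placeholder paths; substitute a real config and checkpoint.
    cmd = build_test_command("my_config.py", "my_checkpoint.pth",
                             gpus=4, with_masks=False)
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting issues when paths contain spaces.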

Training

To train a detector with pre-trained models, run:

# single-gpu training
python tools/train.py <CONFIG_FILE>

# multi-gpu training
tools/dist_train.sh <CONFIG_FILE> <GPU_NUM>

Citing ViP

@article{sun2021visual,
  title={Visual Parser: Representing Part-whole Hierarchies with Transformers},
  author={Sun, Shuyang and Yue, Xiaoyu and Bai, Song and Torr, Philip},
  journal={arXiv preprint arXiv:2107.05790},
  year={2021}
}