# OneFormer: One Transformer to Rule Universal Image Segmentation (arXiv 2022 / CVPR 2023)
MIT License
Jitesh Jain, Jiachen Li†, MangTik Chiu†, Ali Hassani, Nikita Orlov, Humphrey Shi
† Equal Contribution
[Project Page] [arXiv] [pdf] [BibTeX]
This repo contains the code for our paper OneFormer: One Transformer to Rule Universal Image Segmentation.
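OneFormer trains a single task-conditioned model: one set of weights serves panoptic, instance, and semantic segmentation, selected at inference time by a text task token. As a quick way to try a released checkpoint, below is a minimal inference sketch against the Hugging Face `transformers` port of OneFormer; the `shi-labs/oneformer_ade20k_swin_large` model id and the class names come from that port, not from this repo's training scripts.

```python
# Minimal inference sketch using the Hugging Face `transformers` port of OneFormer.
# Assumptions: transformers>=4.26 is installed, and the published
# `shi-labs/oneformer_ade20k_swin_large` checkpoint (Swin-L, ADE20K) is available on the Hub.
import requests
import torch
from PIL import Image
from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation

name = "shi-labs/oneformer_ade20k_swin_large"
processor = OneFormerProcessor.from_pretrained(name)
model = OneFormerForUniversalSegmentation.from_pretrained(name).eval()

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The task token ("panoptic", "instance", or "semantic") conditions the shared weights.
inputs = processor(images=image, task_inputs=["panoptic"], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Merge the predicted masks into a panoptic map at the original image resolution.
panoptic = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
print(panoptic["segmentation"].shape)  # (H, W) tensor of segment ids
```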
In the tables below, † indicates that the backbone was pretrained on ImageNet-22k.

### ADE20K (val)

Method | Backbone | Crop Size | PQ | AP | mIoU (s.s.) | mIoU (ms+flip) | #params | config | Checkpoint |
---|---|---|---|---|---|---|---|---|---|
OneFormer | Swin-L† | 640×640 | 49.8 | 35.9 | 57.0 | 57.7 | 219M | config | model |
OneFormer | Swin-L† | 896×896 | 51.1 | 37.6 | 57.4 | 58.3 | 219M | config | model |
OneFormer | Swin-L† | 1280×1280 | 51.4 | 37.8 | 57.0 | 57.7 | 219M | config | model |
OneFormer | ConvNeXt-L† | 640×640 | 50.0 | 36.2 | 56.6 | 57.4 | 220M | config | model |
OneFormer | DiNAT-L† | 640×640 | 50.5 | 36.0 | 58.3 | 58.4 | 223M | config | model |
OneFormer | DiNAT-L† | 896×896 | 51.2 | 36.8 | 58.1 | 58.6 | 223M | config | model |
OneFormer | DiNAT-L† | 1280×1280 | 51.5 | 37.1 | 58.3 | 58.7 | 223M | config | model |
OneFormer (COCO-Pretrained) | DiNAT-L† | 1280×1280 | 53.4 | 40.2 | 58.4 | 58.8 | 223M | config | model, pretrained |
OneFormer | ConvNeXt-XL† | 640×640 | 50.1 | 36.3 | 57.4 | 58.8 | 372M | config | model |
### Cityscapes (val)

Method | Backbone | PQ | AP | mIoU (s.s.) | mIoU (ms+flip) | #params | config | Checkpoint |
---|---|---|---|---|---|---|---|---|
OneFormer | Swin-L† | 67.2 | 45.6 | 83.0 | 84.4 | 219M | config | model |
OneFormer | ConvNeXt-L† | 68.5 | 46.5 | 83.0 | 84.0 | 220M | config | model |
OneFormer (Mapillary Vistas-Pretrained) | ConvNeXt-L† | 70.1 | 48.7 | 84.6 | 85.2 | 220M | config | model, pretrained |
OneFormer | DiNAT-L† | 67.6 | 45.6 | 83.1 | 84.0 | 223M | config | model |
OneFormer | ConvNeXt-XL† | 68.4 | 46.7 | 83.6 | 84.6 | 372M | config | model |
OneFormer (Mapillary Vistas-Pretrained) | ConvNeXt-XL† | 69.7 | 48.9 | 84.5 | 85.8 | 372M | config | model, pretrained |
### COCO (val)

Method | Backbone | PQ | PQ<sup>Th</sup> | PQ<sup>St</sup> | AP | mIoU | #params | config | Checkpoint |
---|---|---|---|---|---|---|---|---|---|
OneFormer | Swin-L† | 57.9 | 64.4 | 48.0 | 49.0 | 67.4 | 219M | config | model |
OneFormer | DiNAT-L† | 58.0 | 64.3 | 48.4 | 49.2 | 68.1 | 223M | config | model |
### Mapillary Vistas (val)

Method | Backbone | PQ | mIoU (s.s.) | mIoU (ms+flip) | #params | config | Checkpoint |
---|---|---|---|---|---|---|---|
OneFormer | Swin-L† | 46.7 | 62.9 | 64.1 | 219M | config | model |
OneFormer | ConvNeXt-L† | 47.9 | 63.2 | 63.8 | 220M | config | model |
OneFormer | DiNAT-L† | 47.8 | 64.0 | 64.9 | 223M | config | model |
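Every row above reports a single checkpoint evaluated on all three tasks; only the task token changes between evaluations. A short sketch of that task switching, again against the Hugging Face port (the model id is an assumption, as in the sketch above; any published OneFormer checkpoint works):

```python
# One checkpoint, three tasks: only the task token differs between the calls below.
# (Hugging Face `transformers` port; the model id is an assumption for illustration.)
import requests
import torch
from PIL import Image
from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation

name = "shi-labs/oneformer_ade20k_swin_large"
processor = OneFormerProcessor.from_pretrained(name)
model = OneFormerForUniversalSegmentation.from_pretrained(name).eval()

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# Each task is paired with its matching post-processing routine on the processor.
post_process = {
    "semantic": processor.post_process_semantic_segmentation,
    "instance": processor.post_process_instance_segmentation,
    "panoptic": processor.post_process_panoptic_segmentation,
}

for task, postprocess in post_process.items():
    inputs = processor(images=image, task_inputs=[task], return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    result = postprocess(outputs, target_sizes=[image.size[::-1]])[0]
    print(task, type(result))  # semantic -> Tensor, instance/panoptic -> dict
```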
If you find OneFormer useful in your research, please consider starring us on GitHub and citing our paper!
```bibtex
@inproceedings{jain2023oneformer,
  title={{OneFormer: One Transformer to Rule Universal Image Segmentation}},
  author={Jitesh Jain and Jiachen Li and MangTik Chiu and Ali Hassani and Nikita Orlov and Humphrey Shi},
  booktitle={CVPR},
  year={2023}
}
```
We thank the authors of Mask2Former, GroupViT, and Neighborhood Attention Transformer for releasing their helpful codebases.