[ICCV 2023] Tracking Anything with Decoupled Video Segmentation
Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, Joon-Young Lee
University of Illinois Urbana-Champaign and Adobe
ICCV 2023
Note (Mar 6 2024): We have fixed a major bug (introduced in the last update) that prevented the deletion of unmatched segments in text/eval_with_detections modes. This should greatly reduce the number of accumulated noisy detections/false positives, especially for long videos. See #64.
Note (Sep 12 2023): We have improved automatic video segmentation by not querying the points in segmented regions. We correspondingly increased the number of query points per side to 64 and deprecated the "engulf" mode. The old code can be found in the "legacy_engulf" branch. The new code should run a lot faster and capture smaller objects. The text-prompted mode is still recommended for better results.
Note (Sep 11 2023): We have removed the "pluralize" option as it works weirdly sometimes with GroundingDINO. If needed, please pluralize the prompt yourself.
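The idea in the Sep 12 2023 note (skip query points that fall inside already-segmented regions) can be illustrated with a small sketch. This is not the code used in this repository: the function names, the normalized-coordinate convention, and the nested-list mask layout are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in normalized [0, 1] coordinates


def build_point_grid(points_per_side: int) -> List[Point]:
    """Return an evenly spaced points_per_side x points_per_side grid in [0, 1]^2."""
    offset = 1.0 / (2 * points_per_side)
    step = 1.0 / points_per_side
    coords = [offset + i * step for i in range(points_per_side)]
    return [(x, y) for y in coords for x in coords]


def drop_points_in_segmented_regions(
    points: Sequence[Point],
    segmented: Sequence[Sequence[bool]],
) -> List[Point]:
    """Keep only the points whose pixel is not covered by an existing segment.

    segmented[row][col] is True where some object already covers the pixel
    (hypothetical layout chosen for this sketch).
    """
    height, width = len(segmented), len(segmented[0])
    kept = []
    for x, y in points:
        col = min(int(x * width), width - 1)
        row = min(int(y * height), height - 1)
        if not segmented[row][col]:
            kept.append((x, y))
    return kept
```

With 64 points per side the full grid has 4096 candidate points; every point dropped this way is one fewer prompt to send to the image model, which is consistent with the speed-up mentioned in the note.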
We develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we propose a (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several tasks, most notably in large-vocabulary video panoptic segmentation and open-world video segmentation.
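To make the decoupling concrete, here is a toy sketch of one fusion step between temporally propagated segments and fresh image-level detections. It uses greedy IoU matching as a stand-in and is not the fusion procedure from the paper or from this repository; the names fuse, iou, and iou_threshold, and the representation of masks as sets of pixel coordinates, are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]  # (row, col)
Mask = Set[Pixel]


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def fuse(
    propagated: Dict[int, Mask],
    detections: List[Mask],
    next_id: int,
    iou_threshold: float = 0.5,
) -> Tuple[Dict[int, Mask], int]:
    """Greedily match detections to propagated objects (toy stand-in).

    A detection that overlaps an unmatched propagated object enough keeps that
    object's ID; any other detection is registered as a new object.
    """
    fused = dict(propagated)
    unmatched_ids = set(propagated)
    for det in detections:
        best_id, best_iou = None, 0.0
        for obj_id in unmatched_ids:
            score = iou(propagated[obj_id], det)
            if score > best_iou:
                best_id, best_iou = obj_id, score
        if best_id is not None and best_iou >= iou_threshold:
            fused[best_id] = det  # refresh the tracked mask with the detection
            unmatched_ids.discard(best_id)
        else:
            fused[next_id] = det  # unseen object: assign a fresh ID
            next_id += 1
    return fused, next_id
```

The point of the sketch is the division of labor: the image-level model only produces per-frame masks, the propagation model only carries object IDs through time, and a small matching step reconciles the two.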
Source: https://www.youtube.com/watch?v=FM9SemMfknA
Source: https://youtu.be/FbK3SL97zf8
Source: https://youtu.be/couz1CrlTdQ
Source: DAVIS 2017 validation set "soapbox"
Source: https://youtu.be/FQQaSyH9hZI
Tested on Ubuntu only. For installation on Windows WSL2, refer to https://github.com/hkchengrex/Tracking-Anything-with-DEVA/issues/20 (thanks @21pl).
Prerequisite:
Clone our repository:
git clone https://github.com/hkchengrex/Tracking-Anything-with-DEVA.git
Install with pip:
cd Tracking-Anything-with-DEVA
pip install -e .
(If you encounter the File "setup.py" not found error, upgrade your pip with pip install --upgrade pip)
Download the pretrained models:
bash scripts/download_models.sh
Required for the text-prompted/automatic demo:
Install our fork of Grounded-Segment-Anything. Follow its instructions.
Grounding DINO installation might fail silently. Try python -c "from groundingdino.util.inference import Model as GroundingDINOModel". If you get a warning about running on CPU mode only, make sure you have CUDA_HOME set during Grounding DINO installation.
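The import check above can also be wrapped in a small script that reports the failure reason instead of a raw traceback. This is a minimal sketch; the function name check_grounding_dino and the shape of the returned dictionary are assumptions, not part of this repository.

```python
import importlib
import os


def check_grounding_dino() -> dict:
    """Try the Grounding DINO import and report what went wrong, if anything."""
    report = {
        "importable": False,
        "error": None,
        # Value of CUDA_HOME in the current shell (None if unset).
        "cuda_home": os.environ.get("CUDA_HOME"),
    }
    try:
        module = importlib.import_module("groundingdino.util.inference")
        getattr(module, "Model")  # same symbol as in the one-line check
        report["importable"] = True
    except Exception as exc:  # broken builds can raise more than ImportError
        report["error"] = f"{type(exc).__name__}: {exc}"
    return report


if __name__ == "__main__":
    print(check_grounding_dino())
```

Note that CUDA_HOME being set now does not prove it was set when Grounding DINO was installed; if the CPU-only warning appears, reinstalling with CUDA_HOME set is still the fix described above.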
(Optional) For fast integer program solving in the semi-online setting:
Get your Gurobi license, which is free for academic use. If a license is not found, we fall back to PuLP, which is slower and has not been rigorously tested by us. All experiments are conducted with Gurobi.
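Before running the semi-online setting, it can help to know which solver package is present in the environment. The sketch below only checks whether the Python packages gurobipy and pulp are importable; it does not validate a Gurobi license and it is not the fallback logic used by this repository. The function name pick_ilp_backend is hypothetical.

```python
import importlib.util
from typing import Optional


def pick_ilp_backend() -> Optional[str]:
    """Return which solver package is installed, preferring Gurobi.

    Only package presence is checked; a missing or expired Gurobi license
    is not detected here.
    """
    if importlib.util.find_spec("gurobipy") is not None:
        return "gurobi"
    if importlib.util.find_spec("pulp") is not None:
        return "pulp"
    return None


if __name__ == "__main__":
    print(pick_ilp_backend())
```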
DEMO.md contains more details on the input arguments and tips on speeding up inference.
You can always look at deva/inference/eval_args.py and deva/ext/ext_eval_args.py for a full list of arguments.
With gradio:
python demo/demo_gradio.py
Then visit the link that pops up in the terminal. If executing on a remote server, try port forwarding.
We have prepared an example in example/vipseg/images/12_1mWNahzcsAc (a clip from the VIPSeg dataset).
The following two scripts segment the example clip using either Grounded Segment Anything with text prompts or SAM with automatic (points in grid) prompting.
Script (text-prompted):
python demo/demo_with_text.py --chunk_size 4 \
--img_path ./example/vipseg/images/12_1mWNahzcsAc \
--amp --temporal_setting semionline \
--size 480 \
--output ./example/output --prompt person.hat.horse
We support different SAM variants in text-prompted mode; by default, we use the original SAM. For higher-quality mask prediction, specify --sam_variant sam_hq. For more efficient inference, specify --sam_variant sam_hq_light or --sam_variant mobile.
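When comparing prompts or SAM variants, it can be convenient to assemble the command above programmatically. The following is a minimal sketch: the helper name build_text_demo_command is hypothetical, while the script path and flags are copied from the text-prompted command shown above.

```python
from typing import List, Optional, Sequence


def build_text_demo_command(
    img_path: str,
    output: str,
    classes: Sequence[str],
    sam_variant: Optional[str] = None,
) -> List[str]:
    """Assemble the text-prompted demo command as an argument list."""
    cmd = [
        "python", "demo/demo_with_text.py",
        "--chunk_size", "4",
        "--img_path", img_path,
        "--amp",
        "--temporal_setting", "semionline",
        "--size", "480",
        "--output", output,
        # Class names are joined with "." as in "person.hat.horse".
        "--prompt", ".".join(classes),
    ]
    if sam_variant is not None:
        # Passed through unchanged, e.g. "sam_hq", "sam_hq_light", "mobile".
        cmd += ["--sam_variant", sam_variant]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root; omitting sam_variant leaves the default variant in place.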
Script (automatic):
python demo/demo_automatic.py --chunk_size 4 \
--img_path ./example/vipseg/images/12_1mWNahzcsAc \
--amp --temporal_setting semionline \
--size 480 \
--output ./example/output
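To run the automatic demo over several clips, the same command can be generated once per clip folder. This sketch assumes a hypothetical layout in which each sub-directory of clips_root holds the frames of one clip, and writes each result to a sub-directory of output_root with the same name; the helper name build_automatic_commands is likewise an assumption. The flags are copied from the automatic command above.

```python
from pathlib import Path
from typing import List


def build_automatic_commands(clips_root: str, output_root: str) -> List[List[str]]:
    """Build one demo_automatic.py command per clip directory, in sorted order."""
    commands = []
    clip_dirs = sorted(p for p in Path(clips_root).iterdir() if p.is_dir())
    for clip_dir in clip_dirs:
        commands.append([
            "python", "demo/demo_automatic.py",
            "--chunk_size", "4",
            "--img_path", str(clip_dir),
            "--amp",
            "--temporal_setting", "semionline",
            "--size", "480",
            # One output folder per clip so results do not overwrite each other.
            "--output", str(Path(output_root) / clip_dir.name),
        ])
    return commands
```

Each command can then be executed in turn with subprocess.run(cmd, check=True) from the repository root.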
max_missed_detection_count might help since we delete objects from memory more eagerly.
@inproceedings{cheng2023tracking,
title={Tracking Anything with Decoupled Video Segmentation},
author={Cheng, Ho Kei and Oh, Seoung Wug and Price, Brian and Schwing, Alexander and Lee, Joon-Young},
booktitle={ICCV},
year={2023}
}
The demo would not be possible without ❤️ from the community:
Grounded Segment Anything: https://github.com/IDEA-Research/Grounded-Segment-Anything
Segment Anything: https://github.com/facebookresearch/segment-anything
XMem: https://github.com/hkchengrex/XMem
Title card generated with OpenPano: https://github.com/ppwwyyxx/OpenPano