[SIGGRAPH Asia 2023] Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
Shuai Yang, Yifan Zhou, Ziwei Liu and Chen Change Loy
in SIGGRAPH Asia 2023 Conference Proceedings
Project Page | Paper | Supplementary Video | Input Data and Video Results
Abstract: Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to the video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.
Features:
- Temporal consistency: hierarchical cross-frame constraints enforce coherence in shapes, textures and colors.
- Zero-shot: no re-training or optimization required.
- Flexibility: compatible with existing image diffusion techniques such as LoRA and ControlNet.
Installation:
Please make sure your installation path contains only English letters or _.

git clone git@github.com:williamyang1991/Rerender_A_Video.git --recursive
cd Rerender_A_Video
pip install -r requirements.txt

If you cloned without --recursive, fetch the submodules manually:

git submodule update --init --recursive

You can also create a new conda environment from scratch:

conda env create -f environment.yml
conda activate rerender
24GB VRAM is required. Please refer to https://github.com/williamyang1991/Rerender_A_Video/pull/23#issue-1900789461 to reduce memory consumption.
Run the installation script below; the required models will be downloaded to ./models:

python install.py

You can then test the full pipeline with the provided example config:

python rerender.py --cfg config/real2sculpture.json
Before running the above steps, you need to prepare your input data and models.

Troubleshooting:

- FileNotFoundError: [Errno 2] No such file or directory: 'xxxx.bin' or 'xxxx.jpg': check the full error log, find the python video_blend.py ... command in it, and use that command to run the Ebsynth part manually, which is more stable than the WebUI.
- KeyError: 'dataset': upgrade Gradio to the latest version (https://github.com/williamyang1991/Rerender_A_Video/issues/14#issuecomment-1722778672, https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/11855).
- ERR_ADDRESS_INVALID (cannot open the WebUI in the browser): replace 0.0.0.0 with 127.0.0.1 in webUI.py (https://github.com/williamyang1991/Rerender_A_Video/issues/19#issuecomment-1723685825).
- CUDA out of memory: set "use_limit_device_resolution" to true in the config to resize the video according to your VRAM (https://github.com/williamyang1991/Rerender_A_Video/issues/79). An example config config/van_gogh_man_dynamic_resolution.json is provided; see also the sketch after this list.
- AttributeError: module 'keras.backend' has no attribute 'is_tensor': update einops (https://github.com/williamyang1991/Rerender_A_Video/issues/26#issuecomment-1726682446).
- IndexError: list index out of range: use the original DDIM steps of 20 (https://github.com/williamyang1991/Rerender_A_Video/issues/30#issuecomment-1729039779).
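For the CUDA out-of-memory fix above, a config fragment might look like the sketch below. Apart from "use_limit_device_resolution", the key names and values are illustrative assumptions; consult config/van_gogh_man_dynamic_resolution.json for the actual format:

{
  "input": "videos/pexels-antoni-shkraba-8048492-540x960-25fps.mp4",
  "output": "result/man/man.mp4",
  "prompt": "a handsome man in van gogh painting",
  "use_limit_device_resolution": true
}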
To launch the WebUI, run:

python webUI.py
The Gradio app also allows you to flexibly change the inference options; just try it for more details. (For the WebUI, you need to download revAnimated_v11 and realisticVisionV20_v20 to ./models/ after installation.)

Upload your video, input the prompt, select the seed, and hit Run.
We provide abundant advanced options to play with.

Using customized models:
- Modify sd_model_cfg.py to add paths to the saved SD models.
- To use other ControlNet controls (e.g., depth), add the option to the dropdown, e.g., control_type = gr.Dropdown(['HED', 'canny', 'depth']) here: https://github.com/williamyang1991/Rerender_A_Video/blob/b6cafb5d80a79a3ef831c689ffad92ec095f2794/webUI.py#L690, add a corresponding elif control_type == 'depth': branch following https://github.com/williamyang1991/Rerender_A_Video/blob/b6cafb5d80a79a3ef831c689ffad92ec095f2794/webUI.py#L88, and another elif control_type == 'depth': branch following https://github.com/williamyang1991/Rerender_A_Video/blob/b6cafb5d80a79a3ef831c689ffad92ec095f2794/webUI.py#L122. A sketch of such a branch is given below.
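For illustration only, a minimal sketch of what the added branch might look like; the annotator module paths and the MidasDetector name are assumptions modeled on the ControlNet codebase, not the repository's actual code:

# Hypothetical sketch of dispatching on control_type (assumed module paths).
def build_detector(control_type: str):
    if control_type == 'HED':
        from annotator.hed import HEDdetector      # existing HED branch (assumed path)
        return HEDdetector()
    elif control_type == 'canny':
        from annotator.canny import CannyDetector  # existing canny branch (assumed path)
        return CannyDetector()
    elif control_type == 'depth':
        from annotator.midas import MidasDetector  # the new branch to add (assumed path)
        return MidasDetector()
    raise ValueError(f'unsupported control type: {control_type}')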
We also provide a flexible script rerender.py to run our method.

Set the options via the command line. For example,

python rerender.py --input videos/pexels-antoni-shkraba-8048492-540x960-25fps.mp4 --output result/man/man.mp4 --prompt "a handsome man in van gogh painting"

The script will run the full pipeline. A work directory will be created at result/man and the result video will be saved as result/man/man.mp4.

Alternatively, set the options via a config file. For example,

python rerender.py --cfg config/van_gogh_man.json

The script will run the full pipeline. We provide some examples of the config in the config directory. Most options in the config are the same as those in the WebUI, so please check the explanations in the WebUI section.
Specify customized models by setting sd_model in the config. For example:

{
  "sd_model": "models/realisticVisionV20_v20.safetensors"
}
Similar to the WebUI, we provide a three-step workflow: rerender the first key frame, then rerender the full key frames, and finally rerender the full video with propagation. To run only a single step, specify the options -one, -nb and -nr:

python rerender.py --cfg config/van_gogh_man.json -one -nb
python rerender.py --cfg config/van_gogh_man.json -nb
python rerender.py --cfg config/van_gogh_man.json -nr

The three commands correspond to the three steps in order: the first rerenders only the first key frame, the second rerenders the full key frames, and the third rerenders the full video with propagation.
We provide a separate Ebsynth Python script video_blend.py with the temporal blending algorithm introduced in Stylizing Video by Example for interpolating style between key frames. It can work on your own stylized key frames independently of our Rerender algorithm.
Usage:
video_blend.py [-h] [--output OUTPUT] [--fps FPS] [--beg BEG] [--end END] [--itv ITV] [--key KEY]
[--n_proc N_PROC] [-ps] [-ne] [-tmp]
name
positional arguments:
name Path to input video
optional arguments:
-h, --help show this help message and exit
--output OUTPUT Path to output video
--fps FPS The FPS of output video
--beg BEG The index of the first frame to be stylized
--end END The index of the last frame to be stylized
--itv ITV The interval of key frame
--key KEY The subfolder name of stylized key frames
--n_proc N_PROC The max process count
-ps Use poisson gradient blending
-ne Do not run ebsynth (use previous ebsynth output)
-tmp Keep temporary output
For example, to run Ebsynth on the video man.mp4:
- put the stylized key frames in videos/man/keys for every 10 frames (named as 0001.png, 0011.png, ...);
- put the original video frames in videos/man/video (named as 0001.png, 0002.png, ...);
- save the result to videos/man/blend.mp4 under FPS 25 with the following command:

python video_blend.py videos/man \
--beg 1 \
--end 101 \
--itv 10 \
--key keys \
--output videos/man/blend.mp4 \
--fps 25.0 \
-ps
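If your frames do not yet exist in this zero-padded naming scheme, one way to produce them (assuming ffmpeg is available; this command is not part of the repository) is:

mkdir -p videos/man/video
ffmpeg -i man.mp4 videos/man/video/%04d.png

ffmpeg numbers image-sequence outputs from 0001.png by default, matching the expected naming.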
Applications: text-guided virtual character generation; video stylization and video editing.
Compared to the conference version, we keep adding new features:
- Loose cross-frame attention: by using cross-frame attention in fewer layers, our results better match the input video, reducing ghosting artifacts caused by inconsistencies. This feature can be activated by checking Loose Cross-frame attention in the Advanced options for the key frame translation in the WebUI, or by setting loose_cfattn for the script (see config/real2sculpture_loose_cfattn.json).
- FreeU: FreeU is a method that improves diffusion model sample quality at no extra cost. We find that with FreeU, our results have higher contrast and saturation, richer details, and more vivid colors. This feature can be used by setting the FreeU backbone factors and skip factors in the Advanced options for the 1st frame translation in the WebUI, or by setting freeu_args for the script (see config/real2sculpture_freeu.json). A combined config sketch is given below.
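As a sketch only, a config enabling both new features might contain the fragment below; the value formats are illustrative assumptions (FreeU takes two backbone factors and two skip factors), so consult config/real2sculpture_loose_cfattn.json and config/real2sculpture_freeu.json for the actual format:

{
  "loose_cfattn": true,
  "freeu_args": [1.1, 1.2, 0.9, 0.2]
}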
If you find this work useful for your research, please consider citing our paper:
@inproceedings{yang2023rerender,
  title     = {Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation},
  author    = {Yang, Shuai and Zhou, Yifan and Liu, Ziwei and Loy, Chen Change},
  booktitle = {ACM SIGGRAPH Asia Conference Proceedings},
  year      = {2023},
}
The code is mainly developed based on ControlNet, Stable Diffusion, GMFlow and Ebsynth.