[ACM MM 2024] This is the official code for "AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding"
APACHE-2.0 License
Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding
An updated version of the paper will be uploaded later
conda create -n anitalker python==3.9.0
conda activate anitalker
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
pip install -r requirements.txt
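If you want to confirm that the pinned PyTorch build was installed correctly and can see your GPU before moving on, a quick check like the following (not part of the repository) may help:

```python
# Optional sanity check: verify the pinned versions and CUDA visibility.
import torch
import torchvision

print("torch:", torch.__version__)               # expect 1.8.0
print("torchvision:", torchvision.__version__)   # expect 0.9.0
print("CUDA available:", torch.cuda.is_available())
```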
Windows Tutorial (Contributed by newgenai79)
MacOS Tutorial (Contributed by airwzz999)
Please download the checkpoints from the URL and place them in the ckpts folder.
For Chinese users, we recommend downloading from here.
ckpts/
    chinese-hubert-large/
        config.json
        preprocessor_config.json
        pytorch_model.bin
    stage1.ckpt
    stage2_pose_only_mfcc.ckpt
    stage2_full_control_mfcc.ckpt
    stage2_audio_only_hubert.ckpt
    stage2_pose_only_hubert.ckpt
    stage2_full_control_hubert.ckpt
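As a convenience, a small script like the one below (not part of the repository) can verify that all the files listed above are in place before running the demo:

```python
# Hypothetical helper: check that every expected checkpoint exists under ckpts/.
from pathlib import Path

expected = [
    "ckpts/stage1.ckpt",
    "ckpts/stage2_pose_only_mfcc.ckpt",
    "ckpts/stage2_full_control_mfcc.ckpt",
    "ckpts/stage2_audio_only_hubert.ckpt",
    "ckpts/stage2_pose_only_hubert.ckpt",
    "ckpts/stage2_full_control_hubert.ckpt",
    "ckpts/chinese-hubert-large/config.json",
    "ckpts/chinese-hubert-large/preprocessor_config.json",
    "ckpts/chinese-hubert-large/pytorch_model.bin",
]
missing = [p for p in expected if not Path(p).exists()]
print("All checkpoints found." if not missing else f"Missing files: {missing}")
```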
Model Description:
Stage | Model Name | Audio-only Inference | Additional Control Signal |
---|---|---|---|
First stage | stage1.ckpt | - | Motion Encoder & Image Renderer |
Second stage (Hubert) | stage2_audio_only_hubert.ckpt | yes | - |
Second stage (Hubert) | stage2_pose_only_hubert.ckpt | yes | Head Pose |
Second stage (Hubert) | stage2_full_control_hubert.ckpt | yes | Head Pose/Location/Scale |
Second stage (MFCC) | stage2_pose_only_mfcc.ckpt | yes | Head Pose |
Second stage (MFCC) | stage2_full_control_mfcc.ckpt | yes | Head Pose/Location/Scale |
- stage1.ckpt is trained on a video dataset without audio (images only), aiming to learn the transfer of motion. After training, it provides the Motion Encoder (for extracting identity-independent motion) and the Image Renderer.
- The stage2 models are trained on a video dataset with audio and, unless otherwise specified, are trained from scratch.
- stage2_audio_only_hubert.ckpt takes Hubert audio features as input, without any control signals. It is suitable for scenes with faces oriented forward and, compared to the controllable models, requires less parameter adjustment to achieve satisfactory results. [We recommend starting with this model.]
- stage2_pose_only_hubert.ckpt is similar to stage2_pose_only_mfcc.ckpt, the difference being that the audio features are Hubert. Compared to the audio-only model, it adds pose control signals.
- stage2_full_control_hubert.ckpt is similar to stage2_full_control_mfcc.ckpt, but uses Hubert audio features.
- stage2_pose_only_mfcc.ckpt takes MFCC audio features as input and includes pose control signals (yaw, pitch, roll angles). [The performance of the MFCC models is poor; they are not recommended.]
- stage2_full_control_mfcc.ckpt takes MFCC audio features as input and, in addition to pose, adds control signals for face location and face scale.
- chinese-hubert-large is used for extracting Hubert audio features.

Quick Guide: we recommend starting with stage2_audio_only_hubert.ckpt.

Explanation of Parameters for demo.py
python ./code/demo.py \
--infer_type 'hubert_audio_only' \
--stage1_checkpoint_path 'ckpts/stage1.ckpt' \
--stage2_checkpoint_path 'ckpts/stage2_audio_only_hubert.ckpt' \
--test_image_path 'test_demos/portraits/monalisa.jpg' \
--test_audio_path 'test_demos/audios/monalisa.wav' \
--test_hubert_path 'test_demos/audios_hubert/monalisa.npy' \
--result_path 'outputs/monalisa_hubert/'
You only need to configure two items: test_image_path
(the image you want to drive) and test_audio_path
(the audio to drive the image). Other parameters like pose and eye blink are all sampled from the model!
The generated video of this sample will be saved to outputs/monalisa_hubert/monalisa-monalisa.mp4.
For Pose Controllable Hubert Cases, see more_hubert_cases_pose_only.
For Pose/Face Controllable Hubert Cases, see more_hubert_cases_more_control.
One Portrait | Result |
---|---|
(portrait image) | Generated Raw Video (256 * 256) |
You can submit your demo via an issue.
[Note] The Hubert model is our default. For environment convenience we also provided an MFCC version, but we found that the Hubert models were not being used much, while the MFCC models, which give poorer results, were used more often. This goes against our original intention, so we have deprecated the MFCC models. We recommend starting your tests with the hubert_audio_only model. Thanks.
[Upgrade for Early Users] Re-download the checkpoints, including the Hubert model, into the ckpts directory and additionally run pip install transformers==4.19.2. When the code does not detect a precomputed Hubert feature path, it will extract the features automatically and print extra instructions on how to resolve any errors encountered.
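For reference, below is a minimal sketch of what such Hubert feature extraction roughly looks like with transformers; it assumes soundfile is available and the audio is 16 kHz mono, and the layer selection and saved array shape used by the repository's own extraction code may differ. The paths are simply the demo ones from above.

```python
# Rough sketch (assumptions: 16 kHz mono audio, last hidden state saved as float32).
import numpy as np
import soundfile as sf
import torch
from transformers import Wav2Vec2FeatureExtractor, HubertModel

hubert_dir = "ckpts/chinese-hubert-large"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(hubert_dir)
model = HubertModel.from_pretrained(hubert_dir).eval()

wav, sr = sf.read("test_demos/audios/monalisa.wav")
inputs = extractor(wav, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    feats = model(inputs.input_values).last_hidden_state  # (1, T, 1024)

np.save("test_demos/audios_hubert/monalisa.npy", feats.squeeze(0).numpy())
```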
Face Super-Resolution (optional): the purpose is to upscale the output resolution from 256 to 512 and to address blurry rendering.
Please install the additional dependencies:
pip install facexlib
pip install tb-nightly -i https://mirrors.aliyun.com/pypi/simple
pip install gfpgan
# Ignore the following warning:
# espnet 202301 requires importlib-metadata<5.0, but you have importlib-metadata 7.1.0 which is incompatible.
Then enable the --face_sr option in your scripts. The first run will download the GFPGAN weights.
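For context, the snippet below is a rough illustration of the kind of GFPGAN upscaling step that --face_sr enables on the rendered frames; the model variant, weights URL, and frame filenames here are assumptions, not the repository's exact implementation.

```python
# Illustrative only: upscale one rendered 256x256 frame to 512x512 with GFPGAN.
import cv2
from gfpgan import GFPGANer

restorer = GFPGANer(
    model_path="https://github.com/TencentARC/GFPGAN/releases/download/v1.3.0/GFPGANv1.3.pth",
    upscale=2,              # 256 -> 512
    arch="clean",
    channel_multiplier=2,
    bg_upsampler=None,
)

frame = cv2.imread("outputs/monalisa_hubert/frame_0000.png")  # hypothetical frame path
_, _, restored = restorer.enhance(
    frame, has_aligned=False, only_center_face=False, paste_back=True)
cv2.imwrite("outputs/monalisa_hubert/frame_0000_sr.png", restored)
```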
We welcome any contributions to the repository.
Regarding the checkpoints provided by this repository, the issues we encountered while testing various audio clips and images reveal model biases, primarily due to the training data or model capacity, including but not limited to the following:
Please generate content carefully based on the above considerations.
@misc{liu2024anitalker,
title={AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding},
author={Tao Liu and Feilong Chen and Shuai Fan and Chenpeng Du and Qi Chen and Xie Chen and Kai Yu},
year={2024},
eprint={2405.03121},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
We hope more people can get involved, and we will promptly handle pull requests. Currently, there are still some tasks that need assistance, such as processing the crop pipeline, creating a web UI, and translation work, among others.
Special Contributors
nitinmukesh submitted the Windows installation tutorial. His YouTube channel has many amazing digital human tutorials. You are welcome to subscribe to his channel!
Thanks to https://github.com/tanshuai0219/EDTalk for the image auto-crop code.
We would like to express our sincere gratitude to the numerous prior works that have laid the foundation for the development of AniTalker.
Stage 1, which primarily focuses on training the motion encoder and the rendering module, heavily relies on resources from LIA. The second-stage diffusion training is built upon diffae and espnet. For the computation of the mutual information loss, we adopt the implementation from CLUB, and we utilize AAM-softmax when training the identity network. Moreover, we leverage the pretrained Hubert model provided by TencentGameMate, as well as MFCC features.
Additionally, we employ 3DDFA_V2 to extract head pose and torchlm to obtain face landmarks, which are used to calculate face location and scale. We have already open-sourced the code usage for these preprocessing steps at talking_face_preprocessing. We acknowledge the importance of building upon existing knowledge and are committed to contributing back to the research community by sharing our findings and code.
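As a rough illustration of how face location and scale can be derived from landmarks (the actual preprocessing lives in talking_face_preprocessing and may compute these differently; the function below is hypothetical):

```python
# Hypothetical example: derive a normalized face location and scale from 2D landmarks.
import numpy as np

def face_location_and_scale(landmarks: np.ndarray, img_w: int, img_h: int):
    """landmarks: (N, 2) array of (x, y) pixel coordinates."""
    x_min, y_min = landmarks.min(axis=0)
    x_max, y_max = landmarks.max(axis=0)
    center_x = (x_min + x_max) / 2 / img_w   # normalized horizontal location
    center_y = (y_min + y_max) / 2 / img_h   # normalized vertical location
    scale = max(x_max - x_min, y_max - y_min) / max(img_w, img_h)  # normalized face size
    return (center_x, center_y), scale
```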
1. This library's code is not a formal product, and we have not tested all use cases; therefore, it cannot be directly offered to end-service customers.
2. The main purpose of making our code public is to facilitate academic demonstrations and communication. Any use of this code to spread harmful information is strictly prohibited.
3. Please use this library in compliance with the terms specified in the license file and avoid improper use.
4. When using the code, please follow and abide by local laws and regulations.
5. During the use of this code, you will bear the corresponding responsibility. Our company (AISpeech Ltd.) is not responsible for the generated results.