An emotion-controllable speech synthesis model based on VITS that requires no emotion annotations
MIT License
Online demo ↑↑↑ bilibili demo
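The dataset needs no emotion annotations. An emotion extraction model extracts an emotion embedding from each utterance and feeds it into the network, which makes VITS synthesis emotion-controllable.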
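To make the idea of feeding an utterance-level emotion embedding into the network more concrete, here is a minimal pure-Python sketch. It assumes the embedding is linearly projected to the hidden size and added to every text frame; that conditioning scheme, the helper names `project` and `condition_on_emotion`, and all toy values are assumptions for illustration, not necessarily what this repository's model code does.

```python
def project(embedding, weight):
    """Linear projection; `weight` has one row per output dimension."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weight]


def condition_on_emotion(hidden_states, emotion_embedding, weight):
    """Add the projected emotion vector to every time step (broadcast add)."""
    emotion_vector = project(emotion_embedding, weight)
    return [[h + e for h, e in zip(frame, emotion_vector)]
            for frame in hidden_states]


# Toy sizes: 3 text frames with 2 hidden dims, and a 4-dim emotion embedding.
hidden_states = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
emotion_embedding = [1.0, 0.0, -1.0, 0.5]
# Hypothetical projection weights; in a real model these would be learned.
weight = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 2.0]]

conditioned = condition_on_emotion(hidden_states, emotion_embedding, weight)
```

Because the same vector is added to every frame, swapping in the embedding of a different reference utterance shifts the whole sentence toward that utterance's emotion without changing the text input.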
Drawbacks of this model:
Advantages of this model:
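A clustering algorithm can be used to group the audio emotion embeddings automatically. It roughly separates the categories whose emotions differ substantially. See emotion_clustering.ipynb for usage.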
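As a rough illustration of the clustering idea, the sketch below runs a minimal k-means over toy 2-D vectors standing in for emotion embeddings. The choice of k-means, the helper names, the farthest-point initialization, and the toy data are all assumptions for illustration; emotion_clustering.ipynb may use a different algorithm, and real embeddings have many more dimensions.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Deterministic farthest-point initialization (an illustrative choice)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(points, k, iterations=20):
    """Hypothetical helper: plain k-means, returns (labels, centroids)."""
    centroids = init_centroids(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Toy "embeddings": two well-separated groups, e.g. calm vs. excited utterances.
embeddings = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
              [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
labels, centroids = kmeans(embeddings, k=2)
```

Utterances that share a label can then be listened to as a group, and one of them picked as the reference for that emotion.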
# Cython-version Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace
# Preprocessing (g2p) for your own datasets. Preprocessed phonemes for nene have already been provided.
python preprocess.py --text_index 2 --filelists filelists/train.txt filelists/val.txt --text_cleaners japanese_cleaners
python emotion_extract.py --filelists filelists/train.txt filelists/val.txt
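The `--text_index 2` argument in the preprocessing command selects which field of each filelist line holds the text. The sketch below shows that idea under the assumption that filelist lines are pipe-delimited in the form `path|speaker_id|text`. The function `clean_filelist_lines` and the stand-in `toy_cleaner` are hypothetical and do not reproduce preprocess.py or japanese_cleaners.

```python
def clean_filelist_lines(lines, text_index, cleaner, delimiter="|"):
    """Apply `cleaner` to the text field of each line; other fields are kept."""
    cleaned = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        fields[text_index] = cleaner(fields[text_index])
        cleaned.append(delimiter.join(fields))
    return cleaned


def toy_cleaner(text):
    # Stand-in for a real text cleaner: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


# Assumed layout: audio path | speaker id | transcript (index 2).
lines = [
    "wavs/0001.wav|0|Hello   World\n",
    "wavs/0002.wav|0|Good  Morning\n",
]
result = clean_filelist_lines(lines, text_index=2, cleaner=toy_cleaner)
```

If your own filelist has no speaker column, the text would sit at index 1 under this assumed layout, so the index passed on the command line would need to change accordingly.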
# nene
python train_ms.py -c configs/nene.json -m nene
# If you are fine-tuning from a pretrained original VITS checkpoint:
python train_ms.py -c configs/nene.json -m nene --ckptD /path/to/D_xxxx.pth --ckptG /path/to/G_xxxx.pth
See inference.ipynb or use MoeGoe