A TensorFlow Implementation of Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model
APACHE-2.0 License
We train the model on three different speech datasets.
LJ Speech Dataset has recently become a widely used benchmark for TTS because it is publicly available. It contains 24 hours of reasonable-quality samples. Nick's audiobooks (18 hours) are additionally used to see if the model can learn even from less, more variable speech data. The World English Bible is a public-domain update of the American Standard Version of 1901 into modern English; its original audio is freely available here. Kyubyong split each chapter by verse manually and aligned the segmented audio clips to the text, 72 hours in total. You can download them at Kaggle Datasets.
Adjust hyperparameters in `hyperparams.py`. (If you want to do preprocessing, set `prepro` to `True`.) Then run `python train.py`. (If you set `prepro` to `True`, run `python prepro.py` first.) Run `python eval.py` regularly during training.

We generate speech samples based on the Harvard Sentences, as the original paper does; they are already included in the repo. Run `python synthesize.py` and check the files in `samples`.

It's important to monitor the attention plots during training. If the attention plots look good (the alignment looks linear) and then degrade, coming to resemble the plots from the beginning of training, then training has gone awry and most likely needs to be restarted from a checkpoint where the attention still looked good: in our experience, the loss is unlikely to recover on its own. This deterioration of attention corresponds with a spike in the loss.
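One rough way to quantify whether an alignment "looks linear" is to measure how much attention mass falls near the ideal diagonal. The helper below is a hypothetical sketch, not part of this repo; it assumes the attention matrix has shape (decoder_steps, encoder_steps) with each row summing to 1:

```python
import numpy as np

def diagonality(attn, half_width=2):
    """Score how close an attention matrix is to a linear (diagonal) alignment.

    attn: (decoder_steps, encoder_steps) array; rows are attention distributions.
    Returns the mean attention mass falling within `half_width` encoder positions
    of the encoder index a perfectly linear alignment would attend to.
    """
    n_dec, n_enc = attn.shape
    score = 0.0
    for t in range(n_dec):
        # Encoder position a linear alignment would attend to at decoder step t.
        center = int(round(t * (n_enc - 1) / max(n_dec - 1, 1)))
        lo, hi = max(0, center - half_width), min(n_enc, center + half_width + 1)
        score += attn[t, lo:hi].sum()
    return score / n_dec

# A perfectly linear alignment scores 1.0; unfocused attention scores much lower.
ideal = np.eye(50)                   # each decoder step attends to one encoder step
flat = np.full((50, 50), 1.0 / 50)   # attention never focuses anywhere
```

Tracking a score like this over checkpoints makes the "good, then bad" pattern described above visible without eyeballing every plot.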
In the original paper, the authors said, "An important trick we discovered was predicting multiple, non-overlapping output frames at each decoder step," where the number of frames predicted is the reduction factor, r. We originally interpreted this as predicting non-sequential frames during each decoding step t. Thus we were using the following scheme (with r=5) during decoding.
t frame numbers
-----------------------
0 [ 0  5 10 15 20]
1 [ 1  6 11 16 21]
2 [ 2  7 12 17 22]
...
After much experimentation, we were unable to get our model to learn anything useful. We then switched to predicting r sequential frames during each decoding step.
t frame numbers
-----------------------
0 [ 0 1 2 3 4]
1 [ 5 6 7 8 9]
2 [10 11 12 13 14]
...
With this setup we noticed improvements in the attention and have since kept it.
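The difference between the two groupings comes down to how the target frames are arranged into decoder steps. A minimal numpy sketch (illustrative only, not the repo's actual code), using a toy 25-frame utterance with r=5:

```python
import numpy as np

r = 5          # reduction factor
n_frames = 25  # toy utterance length, a multiple of r
frames = np.arange(n_frames)  # stand-in for frame indices 0..24

# Sequential grouping (what worked): decoder step t predicts
# frames [t*r, ..., t*r + r - 1].
sequential = frames.reshape(-1, r)
# row 0 -> [ 0  1  2  3  4], row 1 -> [ 5  6  7  8  9], ...

# Strided, non-sequential grouping (the first interpretation): decoder
# step t predicts every (n_frames // r)-th frame starting at frame t.
strided = frames.reshape(r, -1).T
# row 0 -> [ 0  5 10 15 20], row 1 -> [ 1  6 11 16 21], ...
```

In training code the same reshapes would be applied to the mel-spectrogram targets rather than to bare indices.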
Perhaps the most important hyperparameter is the learning rate. With an initial learning rate of 0.002 we were never able to learn a clean attention; the loss would frequently explode. With an initial learning rate of 0.001 we were able to learn a clean attention, train for much longer, and get discernible words during synthesis.
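For reference, the original Tacotron paper decays the learning rate from 0.001 to 0.0005, 0.0003, and 0.0001 after 500K, 1M, and 2M global steps. A piecewise schedule along those lines could be sketched as below (whether this repo uses the same decay points is not stated here):

```python
def learning_rate(global_step):
    """Piecewise-constant decay following the original Tacotron paper:
    start at 0.001, drop to 0.0005, 0.0003, and 0.0001 after
    500K, 1M, and 2M global steps, respectively."""
    if global_step < 500_000:
        return 0.001
    elif global_step < 1_000_000:
        return 0.0005
    elif global_step < 2_000_000:
        return 0.0003
    return 0.0001
```

The key point from our experiments is the starting value: 0.001 rather than 0.002.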
Check out other TTS models such as DCTTS or Deep Voice 3.
Jan. 2018, Kyubyong Park & Tommy Mulc