STT - Coqui STT v0.10.0-alpha.14

Published by reuben about 3 years ago

STT - STT v0.10.0-alpha.7

Published by reuben over 3 years ago

Alpha release, for development purposes only.

STT -

Published by reuben over 3 years ago

STT - Coqui STT 0.9.3

Published by reuben over 3 years ago

General

This is an initial release of 🐸STT, backwards compatible with mozilla/DeepSpeech 0.9.3. The model files below are identical to those in the 0.9.3 release of mozilla/DeepSpeech, and are released under the MPL 2.0 license accordingly. These models are provided as a compatibility aid so that examples in our documentation can work with the existing release links.

This release includes the source code:

v0.9.3.tar.gz

under the MPL-2.0 license, and the acoustic models:

coqui-stt-0.9.3-models.pbmm
coqui-stt-0.9.3-models.tflite

It also includes experimental Mandarin Chinese acoustic models trained on an internal corpus of 2000 hours of read speech:

coqui-stt-0.9.3-models-zh-CN.pbmm
coqui-stt-0.9.3-models-zh-CN.tflite

all under the MPL-2.0 license.

The model files with the ".pbmm" extension are memory-mapped and thus memory-efficient and fast to load. The model files with the ".tflite" extension are converted to use TensorFlow Lite, have post-training quantization enabled, and are more suitable for resource-constrained environments.

The acoustic models were trained on American English with synthetic noise augmentation, and the .pbmm model achieves a 7.06% word error rate on the LibriSpeech clean test corpus.
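
For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate can be computed. It assumes the common definition (total word-level substitutions, insertions, and deletions divided by the total number of reference words); the evaluation script actually used for this release is not reproduced here, and the function names `word_edit_distance` and `corpus_wer` are illustrative only.

```python
from typing import Iterable, List, Tuple


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER over (reference, hypothesis) transcript pairs, split on whitespace."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        total_edits += word_edit_distance(ref_words, hypothesis.split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("at least one reference word is required")
    return total_edits / total_words
```

Under this definition, a 7.06% word error rate corresponds to roughly 7 word-level edits per 100 reference words.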

Note that the model currently performs best in low-noise environments with clear recordings and has a bias towards US male accents. This does not mean the model cannot be used outside of these conditions, but accuracy may be lower. Some users may need to train the model further to meet their intended use case.

In addition we release the scorer:

coqui-stt-0.9.3-models.scorer

which takes the place of the language model and trie in older releases and which is also under the MPL-2.0 license.

There is also a corresponding scorer for the Mandarin Chinese model:

coqui-stt-0.9.3-models-zh-CN.scorer

We also include example audio files:

audio-0.9.3.tar.gz

which can be used to test the engine, and checkpoint files for both the English and Mandarin models:

coqui-stt-0.9.3-checkpoint.tar.gz
coqui-stt-0.9.3-checkpoint-zh-CN.tar.gz

which are under the MPL-2.0 license and can be used as the basis for further fine-tuning.

Training Regimen + Hyperparameters for fine-tuning

The hyperparameters used to train the model are useful for fine-tuning. Thus, we document them here along with the training regimen, the hardware used (a server with 8 Quadro RTX 6000 GPUs, each with 24GB of VRAM), and our use of cuDNN RNN.

In contrast to some previous releases, training for this release occurred as a fine-tuning of the previous 0.8.2 checkpoint, with data augmentation options enabled. The following hyperparameters were used for the fine-tuning. See the 0.8.2 release notes for the hyperparameters used for the base model.

train_files Fisher, LibriSpeech, Switchboard, Common Voice English, and approximately 1700 hours of transcribed WAMU (NPR) radio shows explicitly licensed for use as training corpora.
dev_files LibriSpeech clean dev corpus.
test_files LibriSpeech clean test corpus.
train_batch_size 128
dev_batch_size 128
test_batch_size 128
n_hidden 2048
learning_rate 0.0001
dropout_rate 0.40
epochs 200
augment pitch[pitch=1~0.1]
augment tempo[factor=1~0.1]
augment overlay[p=0.9,source=${noise},layers=1,snr=12~4] (where ${noise} is a dataset of Freesound.org background noise recordings)
augment overlay[p=0.1,source=${voices},layers=10~2,snr=12~4] (where ${voices} is a dataset of audiobook snippets extracted from Librivox)
augment resample[p=0.2,rate=12000~4000]
augment codec[p=0.2,bitrate=32000~16000]
augment reverb[p=0.2,decay=0.7~0.15,delay=10~8]
augment volume[p=0.2,dbfs=-10~10]
cache_for_epochs 10
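
The augment entries above use a compact `name[key=value,...]` notation. The sketch below parses such an entry and interprets a `center~radius` value, assuming (as described in the DeepSpeech training documentation) that it denotes a value drawn uniformly from [center - radius, center + radius]. The helper names `parse_augment`, `value_bounds`, and `sample_value` are hypothetical and are not part of the training code.

```python
import random
import re
from typing import Dict, Tuple


def parse_augment(spec: str) -> Tuple[str, Dict[str, str]]:
    """Split e.g. 'volume[p=0.2,dbfs=-10~10]' into a name and raw parameters."""
    match = re.fullmatch(r"(\w+)\[(.*)\]", spec.strip())
    if match is None:
        raise ValueError(f"not an augmentation spec: {spec!r}")
    name, body = match.groups()
    params: Dict[str, str] = {}
    for item in body.split(","):
        key, _, value = item.partition("=")
        params[key.strip()] = value.strip()
    return name, params


def value_bounds(value: str) -> Tuple[float, float]:
    """Return (low, high) for 'center' or 'center~radius' (assumed semantics)."""
    center_text, _, radius_text = value.partition("~")
    center = float(center_text)
    radius = float(radius_text) if radius_text else 0.0
    return center - radius, center + radius


def sample_value(value: str, rng: random.Random) -> float:
    """Draw one value uniformly from the interval described by 'value'."""
    low, high = value_bounds(value)
    return rng.uniform(low, high)


name, params = parse_augment("overlay[p=0.9,source=${noise},layers=1,snr=12~4]")
snr_low, snr_high = value_bounds(params["snr"])  # 8.0 and 16.0 under the assumption above
```

Read this way, `snr=12~4` would give each overlaid noise sample a signal-to-noise ratio between 8 and 16, and `p=0.9` would be the probability that the augmentation is applied to a given training sample.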

The weights with the best validation loss were selected at the end of 200 epochs using --noearly_stop.
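
As an illustration of that selection rule, the sketch below picks the epoch with the lowest validation loss after all epochs have run, rather than stopping early. The function name `best_epoch` and the loss values in the usage example are made up for illustration; they are not taken from the actual training run.

```python
from typing import Iterable, Tuple


def best_epoch(dev_losses: Iterable[float]) -> Tuple[int, float]:
    """Return (epoch index, loss) of the lowest validation loss.

    Every epoch is considered, mirroring a run with early stopping disabled.
    Ties keep the earliest epoch.
    """
    best_index = -1
    best_loss = float("inf")
    for epoch, loss in enumerate(dev_losses):
        if loss < best_loss:
            best_index, best_loss = epoch, loss
    if best_index < 0:
        raise ValueError("no validation losses were provided")
    return best_index, best_loss


# Hypothetical per-epoch validation losses, for illustration only.
epoch, loss = best_epoch([3.0, 2.5, 2.7, 2.4, 2.6])
```

The point of disabling early stopping is that a later epoch can still improve on an earlier plateau, as epoch 3 does in this made-up sequence.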

The optimal lm_alpha and lm_beta values with respect to the LibriSpeech clean dev corpus remain unchanged from the previous release:

lm_alpha 0.931289039105002
lm_beta 1.1834137581510284

For the Mandarin Chinese model, the following values are recommended:

lm_alpha 0.6940122363709647
lm_beta 4.777924224113021
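
To show what these two weights control, here is a minimal sketch of the scoring rule commonly used in Deep Speech style beam search, where lm_alpha weights the language model log-probability and lm_beta is a per-word bonus. This is an assumption about the general formulation, not a transcription of the 🐸STT decoder, and the names `combined_score`, `ENGLISH`, and `MANDARIN` are illustrative.

```python
# Recommended values from these release notes.
ENGLISH = {"lm_alpha": 0.931289039105002, "lm_beta": 1.1834137581510284}
MANDARIN = {"lm_alpha": 0.6940122363709647, "lm_beta": 4.777924224113021}


def combined_score(
    acoustic_logprob: float,
    lm_logprob: float,
    word_count: int,
    lm_alpha: float,
    lm_beta: float,
) -> float:
    """Score one candidate transcript (assumed formulation).

    score = log P_acoustic + lm_alpha * log P_lm + lm_beta * word_count
    """
    return acoustic_logprob + lm_alpha * lm_logprob + lm_beta * word_count


# Hypothetical log-probabilities for a three-word candidate.
english_score = combined_score(-12.0, -7.5, 3, **ENGLISH)
mandarin_score = combined_score(-12.0, -7.5, 3, **MANDARIN)
```

Under this formulation, a larger lm_alpha makes the decoder trust the scorer more relative to the acoustic model, while a larger lm_beta offsets the language model's tendency to favor shorter outputs.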

Documentation

Documentation is available on stt.readthedocs.io.

Contact/Getting Help

GitHub Discussions - best place to ask questions, get support, and discuss anything related to 🐸STT with other users.
Gitter - You can also join our Gitter chat.
Issues - If you have discussed a problem and identified a bug in 🐸STT, or if you have a feature request, please open an issue in our repo. Please make sure you search for an already existing issue beforehand!

Contributors to 0.9.3 release

Everyone who helped us get this far! Thank you for your continued collaboration!
