We are delighted to announce the release of our latest models, MultiNet7 and nsnet1, as well as more wake word models trained on TTS samples.
We are proud to introduce our new MultiNet7 model. This new model is optimized for efficiency, using less memory and reducing compute time while maintaining high accuracy. You can upgrade from MultiNet6 to MultiNet7 smoothly.
We are also introducing nsnet1, our first deep noise suppression model. This model is designed to enhance speech quality in noisy environments, making it perfect for real-world applications like voice assistants or telephony systems.
nsnet1 uses a deep learning approach to suppress background noise while preserving the original speech signal. It is trained on a large dataset to learn the patterns of noise and effectively cancel them out without distorting the speech.
This model is available for the ESP32-S3 chip. You can enable it by setting `afe_config.afe_ns_mode = NS_MODE_NET;`. Please refer to esp-skainet/examples/voice_communication for more details.
We have expanded our set of wake word models trained on TTS (Text-to-Speech) samples to offer more wake words to our users. By combining TTS and LLM methods, the TTS model can be trained on a large amount of unlabeled audio data through self-supervised learning. This significantly improves the zero-shot performance of the TTS model, which allows us to clone voices from a large number of short audio clips (less than 10 seconds each). As a result, we can now train a wake word model using only TTS samples.
These wake word models are designed to recognize specific keywords or phrases that trigger an action or response from your device or application.
Published by feizi over 1 year ago
Documentation for release ESP-SR is available at https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/index.html
We're excited to announce the release of ESP-SR v1.2.0, an advanced version of Espressif's Speech Recognition Library. This update brings five significant enhancements over the previous version and is designed specifically for the ESP32-S3 microcontroller platform.
With ESP-SR v1.2.0, you can develop voice-controlled applications on Espressif microcontroller platforms with increased accuracy and efficiency. Download the latest release and explore the full potential of our Speech Recognition Library.
The RNN-Transducer (RNN-T) framework is used to train MultiNet6. RNN-T combines high accuracy with naturally streaming recognition, which makes it a good choice for embedded systems.
To accelerate RNN-T inference further, MultiNet6 skips some decoding frames based on CTC guidance. Please refer to "Accelerating RNN-T Training and Inference Using CTC guidance" for details of this method.
MultiNet6 is based on Emformer and has about 3.5 M parameters. Its CPU and memory consumption are shown below (tested on ESP32-S3):
Model | Parameters | CPU (one core) | PSRAM | SRAM |
---|---|---|---|---|
MultiNet5 | 2.2 M | 44% | 2310 KB | 16 KB |
MultiNet6 | 3.5 M | 40% | 4000 KB | 48 KB |
Compared to MultiNet5, MultiNet6 has more parameters but requires less computation. The main reason is that the encoder of MultiNet6 has a 4x downsampling factor.
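To make the effect of the downsampling factor concrete, here is a minimal sketch that counts how many frames remain after downsampling. The function name and the round-up behaviour for a partial final group are assumptions for illustration only; they are not taken from the ESP-SR sources.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of frames left after downsampling `input_frames` by `factor`.
 * Rounds up, so a partial final group still produces one frame
 * (an illustrative assumption, not ESP-SR behaviour).
 */
size_t frames_after_downsampling(size_t input_frames, size_t factor)
{
    assert(factor > 0);
    return (input_frames + factor - 1) / factor;
}
```

For example, `frames_after_downsampling(100, 4)` returns 25, so any layer that runs after the downsampling step processes a quarter as many frames as it would at the input rate.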
As with MultiNet5, the weights of MultiNet6 are quantized to 8 bits.
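A back-of-the-envelope sketch of what 8-bit quantization means for weight storage. It assumes exactly one byte per weight and ignores activations, buffers and any quantization scale factors, none of which are specified in these notes. The helper names are illustrative and not part of ESP-SR.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to store `num_params` weights at `bits_per_weight` bits each. */
uint64_t weight_bytes(uint64_t num_params, unsigned bits_per_weight)
{
    /* Only whole-byte widths are handled in this sketch. */
    assert(bits_per_weight > 0 && bits_per_weight % 8 == 0);
    return num_params * (bits_per_weight / 8);
}

/* Convert bytes to KB (1 KB = 1024 bytes), rounding down. */
uint64_t bytes_to_kb(uint64_t bytes)
{
    return bytes / 1024;
}
```

With the parameter count from the table, `bytes_to_kb(weight_bytes(3500000, 8))` gives 3417 KB, compared with 13671 KB for 32-bit weights. The PSRAM figure reported above is larger than the 8-bit estimate, presumably because it covers more than the weights alone.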
MultiNet6 uses a Finite-State Transducer (FST) to build its language model instead of the Trie used in MultiNet5. Beam search based on the FST is more efficient and robust. Different units are used for different languages:
WER results on some popular open-source datasets:
Model | aishell test |
---|---|
MultiNet5_cn | 9.5% |
MultiNet6_cn | 5.2% |
Model | librispeech test-clean | librispeech test-other |
---|---|---|
MultiNet5_en | 16.5% | 41.4% |
MultiNet6_en | 9.0% | 21.3% |
Note: Pinyin syllables without tone are used to calculate the WER for Chinese.
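For readers unfamiliar with the metric, the sketch below shows the standard WER definition: the token-level edit distance (substitutions, deletions and insertions) divided by the number of reference tokens. For Chinese, each token would be a toneless Pinyin syllable, as the note above describes. This is a generic illustration with hypothetical function names, not the scoring script used to produce the tables.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Smallest of three values. */
static size_t min3(size_t a, size_t b, size_t c)
{
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/*
 * Token-level Levenshtein distance: the minimum number of substitutions,
 * deletions and insertions that turn `ref` into `hyp`.
 * Returns (size_t)-1 if memory allocation fails.
 */
size_t token_edit_distance(const char *const *ref, size_t ref_len,
                           const char *const *hyp, size_t hyp_len)
{
    size_t *prev = malloc((hyp_len + 1) * sizeof *prev);
    size_t *curr = malloc((hyp_len + 1) * sizeof *curr);
    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return (size_t)-1;
    }

    /* Row 0: turning an empty reference into j tokens takes j insertions. */
    for (size_t j = 0; j <= hyp_len; j++) {
        prev[j] = j;
    }

    for (size_t i = 1; i <= ref_len; i++) {
        curr[0] = i; /* i deletions to reach an empty hypothesis */
        for (size_t j = 1; j <= hyp_len; j++) {
            size_t cost = strcmp(ref[i - 1], hyp[j - 1]) == 0 ? 0 : 1;
            curr[j] = min3(prev[j] + 1,         /* deletion */
                           curr[j - 1] + 1,     /* insertion */
                           prev[j - 1] + cost); /* match or substitution */
        }
        size_t *tmp = prev;
        prev = curr;
        curr = tmp;
    }

    size_t distance = prev[hyp_len];
    free(prev);
    free(curr);
    return distance;
}

/* WER = edit distance / reference length. Returns -1.0 on allocation failure. */
double word_error_rate(const char *const *ref, size_t ref_len,
                       const char *const *hyp, size_t hyp_len)
{
    assert(ref_len > 0);
    size_t distance = token_edit_distance(ref, ref_len, hyp, hyp_len);
    if (distance == (size_t)-1) {
        return -1.0;
    }
    return (double)distance / (double)ref_len;
}
```

For example, scoring the hypothesis `da kai kong qiao` against the reference `da kai kong tiao` yields one substitution over four tokens, a WER of 25%.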
The Response Accuracy Rate (RAR) results on the Espressif speech commands dataset:
Model Type | Distance | Quiet | Stationary Noise (SNR = 5~10 dB) | Speech Noise (SNR = 5~10 dB) |
---|---|---|---|---|
MultiNet5_cn | 3m | 88.9% | 66.1% | 67.5% |
MultiNet6_cn | 3m | 98.8% | 88.3% | 88.0% |
MultiNet5_en | 3m | 95.4% | 85.9% | 82.7% |
MultiNet6_en | 3m | 96.8% | 87.9% | 85.5% |
Please refer to the benchmark documentation for the latest results.
Please refer to the speech_command_recognition documentation for details.
Published by feizi almost 3 years ago
ESP-SR Release V1.0 is the first release of ESP-SR. It includes the following features and performance improvements: