Avoid the gRPC protocol in clients by hiding it behind an HTTP JSON proxy, making fast prototyping easier.
MIT License
NVIDIA Riva is a next-gen, GPU-accelerated speech AI system. Its text-to-speech (TTS) service uses a two-stage pipeline: the first model generates a mel-spectrogram from the input text, and the second model generates speech audio from that spectrogram. Together they form a TTS system that lets you synthesize natural-sounding speech from raw transcripts without any additional information such as patterns or rhythms of speech.
This webserver wraps the NVIDIA Riva gRPC client in an easy-to-use HTTP JSON API. It also works around some shortcomings of using Riva TTS directly, namely:
Note about memory usage: I tried turning off the other features in Riva, but it seems to load all models regardless. This may be because I would need to run "riva_init.sh" again, which takes so long that I'm unwilling to perform that test. "riva_start.sh" should respect the config, yet it still loads all the models, so I cannot recommend trying this without a large enough GPU.
With that said, the support matrix indicates that you only need 2.1 GB of GPU memory for the TTS model.
Start the server. Be sure to forward port 5000 to somewhere on your host.
docker run -d -p 5000:5000 --restart unless-stopped --name riva_tts_proxy -t keyvanfatehi/riva_tts_proxy:latest
Now you can use it, for example, from the ReadAloud extension by entering http://localhost:5000 into the Riva TTS proxy server section.
The system will not work without a functional Riva stack, which by default is expected to run on the same Docker host. To point at a different host, set the RIVA_URI environment variable.
You may also wish to increase the number of web workers. This is possible using the WEB_CONCURRENCY environment variable.
You may build the image like so:
docker build -t keyvanfatehi/riva_tts_proxy:latest .
Responds with a JSON list of supported voices.
Headers:
Content-Type: application/json
Accept: audio/webm
Body:
{
"voice_name" : "English-US.Female-1",
"text" : "Input text from which to generate speech. Feel free to use multiple sentences. There is no artificial pause between sentences. Want there to be? Contribute to the project."
}
N.B.
The response is an audio stream in the format given by the provided Accept header. If you do not provide one, you will receive audio/wav.
If you would like to use an encoding, set Accept to one of:
I've found that audio/mpeg provides the greatest compatibility with browser APIs, particularly the picky yet powerful MediaSource API.
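For quick testing outside a browser, a small Python client can exercise the synthesis endpoint. This is a sketch only: the endpoint path "/tts" and the base URL are assumptions (the README does not name the route), and the actual network call is left commented out since it requires a running proxy and Riva stack.

```python
import json
import urllib.request

def build_tts_request(text, voice_name="English-US.Female-1",
                      accept="audio/wav", base_url="http://localhost:5000"):
    """Build a POST request carrying the JSON body shown above."""
    body = json.dumps({"voice_name": voice_name, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/tts",  # assumed endpoint path; check the proxy's routes
        data=body,
        headers={"Content-Type": "application/json", "Accept": accept},
        method="POST",
    )

req = build_tts_request("Hello from the Riva TTS proxy.", accept="audio/mpeg")
# With the proxy running, this would save the synthesized audio:
# with urllib.request.urlopen(req) as resp, open("speech.mp3", "wb") as f:
#     f.write(resp.read())
```

Swapping the accept argument between audio/wav and audio/mpeg is how you would exercise the content negotiation described above.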
aplay (written for the Comma 3, an ARM computer): https://github.com/kfatehi/tici-developer-setup/blob/master/scripts/riva-tts.py
Are you using this project? Please add it to the list.