sherpa-onnx

Speech-to-text, text-to-speech, speaker recognition, and VAD using next-gen Kaldi with onnxruntime, without requiring an Internet connection. It supports embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, and WebSocket server/client, with APIs for C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, and Rust.

License: Apache-2.0


Supported functions

  • Speech recognition
  • Speech synthesis
  • Speaker identification
  • Speaker diarization
  • Speaker verification
  • Spoken language identification
  • Audio tagging
  • Voice activity detection
  • Keyword spotting
  • Add punctuation

Supported platforms

| Architecture | Android | iOS | Windows | macOS | Linux |
|--------------|---------|-----|---------|-------|-------|
| x64          | ✔️      |     | ✔️      | ✔️    | ✔️    |
| x86          | ✔️      |     | ✔️      |       |       |
| arm64        | ✔️      | ✔️  | ✔️      | ✔️    | ✔️    |
| arm32        | ✔️      |     |         |       | ✔️    |
| riscv64      |         |     |         |       | ✔️    |

Supported programming languages

1. C++
2. C
3. Python
4. JavaScript
5. Java
6. C#
7. Kotlin
8. Swift
9. Go
10. Dart
11. Rust
12. Pascal

For Rust support, please see sherpa-rs.

It also supports WebAssembly.

Introduction

This repository supports running the following functions locally

  • Speech-to-text (i.e., ASR); both streaming and non-streaming are supported
  • Text-to-speech (i.e., TTS)
  • Speaker diarization
  • Speaker identification
  • Speaker verification
  • Spoken language identification
  • Audio tagging
  • VAD (e.g., silero-vad)
  • Keyword spotting

on the following platforms and operating systems:

  • Android, iOS
  • embedded systems, e.g., Raspberry Pi
  • RISC-V boards
  • x86_64 servers (Linux, macOS, Windows)

with the following APIs:

  • C++, C, Python, Go, C#
  • Java, Kotlin, JavaScript
  • Swift, Rust
  • Dart, Object Pascal
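
As a quick taste of the Python API (`pip install sherpa-onnx soundfile`), here is a minimal non-streaming speech-to-text sketch. The model file names are placeholders for any transducer model downloaded from the pre-trained models links below; streaming recognition follows the same pattern with `sherpa_onnx.OnlineRecognizer`.

```python
# Minimal non-streaming ASR sketch. The *.onnx / tokens.txt paths are
# placeholders for a pre-trained transducer model downloaded from the
# links in the "Links for pre-trained models" section.
import sherpa_onnx
import soundfile as sf

recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder="encoder.onnx",
    decoder="decoder.onnx",
    joiner="joiner.onnx",
    tokens="tokens.txt",
)

# Decode a 16 kHz mono wave file.
samples, sample_rate = sf.read("test.wav", dtype="float32")
stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
print(stream.result.text)
```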

Links for Huggingface Spaces

You can visit the following Huggingface spaces to try sherpa-onnx without installing anything. All you need is a browser.

| Description | URL |
|-------------|-----|
| Speech recognition | [Click me][hf-space-asr] |
| Speech recognition with [Whisper][Whisper] | [Click me][hf-space-asr-whisper] |
| Speech synthesis | [Click me][hf-space-tts] |
| Generate subtitles | [Click me][hf-space-subtitle] |
| Audio tagging | [Click me][hf-space-audio-tagging] |
| Spoken language identification with [Whisper][Whisper] | [Click me][hf-space-slid-whisper] |

We also have spaces built using WebAssembly. They are listed below:

| Description | Huggingface space | ModelScope space |
|-------------|-------------------|------------------|
| Voice activity detection with silero-vad | [Click me][wasm-hf-vad] | [Click me][wasm-ms-vad] |
| Real-time speech recognition (Chinese + English) with Zipformer | [Click me][wasm-hf-streaming-asr-zh-en-zipformer] | [Click me][wasm-ms-streaming-asr-zh-en-zipformer] |
| Real-time speech recognition (Chinese + English) with Paraformer | [Click me][wasm-hf-streaming-asr-zh-en-paraformer] | [Click me][wasm-ms-streaming-asr-zh-en-paraformer] |
| Real-time speech recognition (Chinese + English + Cantonese) with [Paraformer-large][Paraformer-large] | [Click me][wasm-hf-streaming-asr-zh-en-yue-paraformer] | [Click me][wasm-ms-streaming-asr-zh-en-yue-paraformer] |
| Real-time speech recognition (English) | [Click me][wasm-hf-streaming-asr-en-zipformer] | [Click me][wasm-ms-streaming-asr-en-zipformer] |
| VAD + speech recognition (Chinese + English + Korean + Japanese + Cantonese) with [SenseVoice][SenseVoice] | [Click me][wasm-hf-vad-asr-zh-en-ko-ja-yue-sense-voice] | [Click me][wasm-ms-vad-asr-zh-en-ko-ja-yue-sense-voice] |
| VAD + speech recognition (English) with [Whisper][Whisper] tiny.en | [Click me][wasm-hf-vad-asr-en-whisper-tiny-en] | [Click me][wasm-ms-vad-asr-en-whisper-tiny-en] |
| VAD + speech recognition (English) with Zipformer trained on [GigaSpeech][GigaSpeech] | [Click me][wasm-hf-vad-asr-en-zipformer-gigaspeech] | [Click me][wasm-ms-vad-asr-en-zipformer-gigaspeech] |
| VAD + speech recognition (Chinese) with Zipformer trained on [WenetSpeech][WenetSpeech] | [Click me][wasm-hf-vad-asr-zh-zipformer-wenetspeech] | [Click me][wasm-ms-vad-asr-zh-zipformer-wenetspeech] |
| VAD + speech recognition (Japanese) with Zipformer trained on [ReazonSpeech][ReazonSpeech] | [Click me][wasm-hf-vad-asr-ja-zipformer-reazonspeech] | [Click me][wasm-ms-vad-asr-ja-zipformer-reazonspeech] |
| VAD + speech recognition (Thai) with Zipformer trained on [GigaSpeech2][GigaSpeech2] | [Click me][wasm-hf-vad-asr-th-zipformer-gigaspeech2] | [Click me][wasm-ms-vad-asr-th-zipformer-gigaspeech2] |
| VAD + speech recognition (Chinese, multiple dialects) with a [TeleSpeech-ASR][TeleSpeech-ASR] CTC model | [Click me][wasm-hf-vad-asr-zh-telespeech] | [Click me][wasm-ms-vad-asr-zh-telespeech] |
| VAD + speech recognition (English + Chinese) with Paraformer-large | [Click me][wasm-hf-vad-asr-zh-en-paraformer-large] | [Click me][wasm-ms-vad-asr-zh-en-paraformer-large] |
| VAD + speech recognition (English + Chinese) with Paraformer-small | [Click me][wasm-hf-vad-asr-zh-en-paraformer-small] | [Click me][wasm-ms-vad-asr-zh-en-paraformer-small] |
| Speech synthesis (English) | [Click me][wasm-hf-tts-piper-en] | [Click me][wasm-ms-tts-piper-en] |
| Speech synthesis (German) | [Click me][wasm-hf-tts-piper-de] | [Click me][wasm-ms-tts-piper-de] |
| Speaker diarization | [Click me][wasm-hf-speaker-diarization] | [Click me][wasm-ms-speaker-diarization] |
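
The VAD spaces above run silero-vad in the browser; the same model also runs offline through the Python API. Below is a minimal sketch, assuming `silero_vad.onnx` has been downloaded from the VAD entry in the pre-trained models section and that `test.wav` is 16 kHz mono.

```python
# Minimal silero-vad sketch: feed audio in chunks and collect speech segments.
import sherpa_onnx
import soundfile as sf

config = sherpa_onnx.VadModelConfig()
config.silero_vad.model = "silero_vad.onnx"  # from the VAD pre-trained models link
config.sample_rate = 16000

vad = sherpa_onnx.VoiceActivityDetector(config, buffer_size_in_seconds=30)

samples, sample_rate = sf.read("test.wav", dtype="float32")  # 16 kHz mono assumed
window_size = config.silero_vad.window_size  # samples per feed
for i in range(0, len(samples), window_size):
    vad.accept_waveform(samples[i : i + window_size])
    while not vad.empty():
        segment = vad.front  # carries .start (sample index) and .samples
        start = segment.start / sample_rate
        duration = len(segment.samples) / sample_rate
        print(f"speech: {start:.3f}s -- {start + duration:.3f}s")
        vad.pop()
```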

Links for pre-built Android APKs

| Description | URL | For users in China |
|-------------|-----|--------------------|
| Streaming speech recognition | [Address][apk-streaming-asr] | [Address][apk-streaming-asr-cn] |
| Text-to-speech | [Address][apk-tts] | [Address][apk-tts-cn] |
| Voice activity detection (VAD) | [Address][apk-vad] | [Address][apk-vad-cn] |
| VAD + non-streaming speech recognition | [Address][apk-vad-asr] | [Address][apk-vad-asr-cn] |
| Two-pass speech recognition | [Address][apk-2pass] | [Address][apk-2pass-cn] |
| Audio tagging | [Address][apk-at] | [Address][apk-at-cn] |
| Audio tagging (WearOS) | [Address][apk-at-wearos] | [Address][apk-at-wearos-cn] |
| Speaker identification | [Address][apk-sid] | [Address][apk-sid-cn] |
| Spoken language identification | [Address][apk-slid] | [Address][apk-slid-cn] |
| Keyword spotting | [Address][apk-kws] | [Address][apk-kws-cn] |

Links for pre-built Flutter apps

Real-time speech recognition

| Description | URL | For users in China |
|-------------|-----|--------------------|
| Streaming speech recognition | [Address][apk-flutter-streaming-asr] | [Address][apk-flutter-streaming-asr-cn] |

Text-to-speech

| Description | URL | For users in China |
|-------------|-----|--------------------|
| Android (arm64-v8a, armeabi-v7a, x86_64) | [Address][flutter-tts-android] | [Address][flutter-tts-android-cn] |
| Linux (x64) | [Address][flutter-tts-linux] | [Address][flutter-tts-linux-cn] |
| macOS (x64) | [Address][flutter-tts-macos-x64] | [Address][flutter-tts-macos-x64-cn] |
| macOS (arm64) | [Address][flutter-tts-macos-arm64] | [Address][flutter-tts-macos-arm64-cn] |
| Windows (x64) | [Address][flutter-tts-win-x64] | [Address][flutter-tts-win-x64-cn] |

Note: You need to build from source for iOS.

Links for pre-built Lazarus apps

Generating subtitles

| Description | URL | For users in China |
|-------------|-----|--------------------|
| Generate subtitles | [Address][lazarus-subtitle] | [Address][lazarus-subtitle-cn] |

Links for pre-trained models

| Description | URL |
|-------------|-----|
| Speech recognition (speech-to-text, ASR) | [Address][asr-models] |
| Text-to-speech (TTS) | [Address][tts-models] |
| VAD | [Address][vad-models] |
| Keyword spotting | [Address][kws-models] |
| Audio tagging | [Address][at-models] |
| Speaker identification (Speaker ID) | [Address][sid-models] |
| Spoken language identification (Language ID) | See multi-lingual [Whisper][Whisper] ASR models from [Speech recognition][asr-models] |
| Punctuation | [Address][punct-models] |
| Speaker segmentation | [Address][speaker-segmentation-models] |
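
As an illustration of how a downloaded model is used, here is a minimal Python sketch for a VITS/Piper voice from the TTS link above. The file names are placeholders for whatever the unpacked model archive contains; `data_dir` applies to Piper voices that ship an `espeak-ng-data` directory.

```python
# Minimal TTS sketch for a VITS/Piper model from the tts-models link.
import sherpa_onnx
import soundfile as sf

config = sherpa_onnx.OfflineTtsConfig(
    model=sherpa_onnx.OfflineTtsModelConfig(
        vits=sherpa_onnx.OfflineTtsVitsModelConfig(
            model="model.onnx",         # placeholder file names
            tokens="tokens.txt",
            data_dir="espeak-ng-data",  # needed for Piper voices
        ),
        num_threads=2,
    ),
)
tts = sherpa_onnx.OfflineTts(config)

audio = tts.generate("Hello from sherpa-onnx!", sid=0, speed=1.0)
sf.write("out.wav", audio.samples, samplerate=audio.sample_rate)
```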

Useful links

Documentation: https://k2-fsa.github.io/sherpa/onnx/

How to reach us

Please see https://k2-fsa.github.io/sherpa/social-groups.html for the next-gen Kaldi WeChat and QQ groups.

Projects using sherpa-onnx

voiceapi

It shows how to use the ASR and TTS Python APIs with FastAPI.
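
The sketch below illustrates that idea only; it is not voiceapi's actual code. The endpoint name and model paths are hypothetical, and it reuses the same `OfflineTts` setup shown earlier.

```python
# Illustrative sketch only; the /tts endpoint and model paths are
# hypothetical, not taken from the voiceapi project.
import io

import sherpa_onnx
import soundfile as sf
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

tts = sherpa_onnx.OfflineTts(
    sherpa_onnx.OfflineTtsConfig(
        model=sherpa_onnx.OfflineTtsModelConfig(
            vits=sherpa_onnx.OfflineTtsVitsModelConfig(
                model="model.onnx",  # placeholder model files
                tokens="tokens.txt",
                data_dir="espeak-ng-data",
            ),
            num_threads=2,
        ),
    )
)

@app.get("/tts")
def synthesize(text: str):
    """Synthesize `text` and stream back a WAV file."""
    audio = tts.generate(text, sid=0, speed=1.0)
    buf = io.BytesIO()
    sf.write(buf, audio.samples, samplerate=audio.sample_rate, format="WAV")
    buf.seek(0)
    return StreamingResponse(buf, media_type="audio/wav")
```

Run it with, e.g., `uvicorn app:app`, then request `/tts?text=hello`.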

TMSpeech

It uses streaming ASR in C# and provides a graphical user interface.

A video demo in Chinese is available for Windows.

lol

It uses the JavaScript API of sherpa-onnx along with Electron.

A video demo in Chinese is available.

[hf-space-asr]: https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition
[Whisper]: https://github.com/openai/whisper
[hf-space-asr-whisper]: https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition-with-whisper
[hf-space-tts]: https://huggingface.co/spaces/k2-fsa/text-to-speech
[hf-space-subtitle]: https://huggingface.co/spaces/k2-fsa/generate-subtitles-for-videos
[hf-space-audio-tagging]: https://huggingface.co/spaces/k2-fsa/audio-tagging
[hf-space-slid-whisper]: https://huggingface.co/spaces/k2-fsa/spoken-language-identification
[wasm-hf-vad]: https://huggingface.co/spaces/k2-fsa/web-assembly-vad-sherpa-onnx
[wasm-ms-vad]: https://modelscope.cn/studios/csukuangfj/web-assembly-vad-sherpa-onnx
[wasm-hf-streaming-asr-zh-en-zipformer]: https://huggingface.co/spaces/k2-fsa/web-assembly-asr-sherpa-onnx-zh-en
[wasm-ms-streaming-asr-zh-en-zipformer]: https://modelscope.cn/studios/k2-fsa/web-assembly-asr-sherpa-onnx-zh-en
[wasm-hf-streaming-asr-zh-en-paraformer]: https://huggingface.co/spaces/k2-fsa/web-assembly-asr-sherpa-onnx-zh-en-paraformer
[wasm-ms-streaming-asr-zh-en-paraformer]: https://modelscope.cn/studios/k2-fsa/web-assembly-asr-sherpa-onnx-zh-en-paraformer
[Paraformer-large]: https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary
[wasm-hf-streaming-asr-zh-en-yue-paraformer]: https://huggingface.co/spaces/k2-fsa/web-assembly-asr-sherpa-onnx-zh-cantonese-en-paraformer
[wasm-ms-streaming-asr-zh-en-yue-paraformer]: https://modelscope.cn/studios/k2-fsa/web-assembly-asr-sherpa-onnx-zh-cantonese-en-paraformer
[wasm-hf-streaming-asr-en-zipformer]: https://huggingface.co/spaces/k2-fsa/web-assembly-asr-sherpa-onnx-en
[wasm-ms-streaming-asr-en-zipformer]: https://modelscope.cn/studios/k2-fsa/web-assembly-asr-sherpa-onnx-en
[SenseVoice]: https://github.com/FunAudioLLM/SenseVoice
[wasm-hf-vad-asr-zh-en-ko-ja-yue-sense-voice]: https://huggingface.co/spaces/k2-fsa/web-assembly-vad-asr-sherpa-onnx-zh-en-ja-ko-cantonese-sense-voice
[wasm-ms-vad-asr-zh-en-ko-ja-yue-sense-voice]: https://www.modelscope.cn/studios/csukuangfj/web-assembly-vad-asr-sherpa-onnx-zh-en-jp-ko-cantonese-sense-voice
[wasm-hf-vad-asr-en-whisper-tiny-en]: https://huggingface.co/spaces/k2-fsa/web-assembly-vad-asr-sherpa-onnx-en-whisper-tiny
[wasm-ms-vad-asr-en-whisper-tiny-en]: https://www.modelscope.cn/studios/csukuangfj/web-assembly-vad-asr-sherpa-onnx-en-whisper-tiny
[wasm-hf-vad-asr-en-zipformer-gigaspeech]: https://huggingface.co/spaces/k2-fsa/web-assembly-vad-asr-sherpa-onnx-en-zipformer-gigaspeech
[wasm-ms-vad-asr-en-zipformer-gigaspeech]: https://www.modelscope.cn/studios/k2-fsa/web-assembly-vad-asr-sherpa-onnx-en-zipformer-gigaspeech
[wasm-hf-vad-asr-zh-zipformer-wenetspeech]: https://huggingface.co/spaces/k2-fsa/web-assembly-vad-asr-sherpa-onnx-zh-zipformer-wenetspeech
[wasm-ms-vad-asr-zh-zipformer-wenetspeech]: https://www.modelscope.cn/studios/k2-fsa/web-assembly-vad-asr-sherpa-onnx-zh-zipformer-wenetspeech
[ReazonSpeech]: https://research.reazon.jp/_static/reazonspeech_nlp2023.pdf
[wasm-hf-vad-asr-ja-zipformer-reazonspeech]: https://huggingface.co/spaces/k2-fsa/web-assembly-vad-asr-sherpa-onnx-ja-zipformer
[wasm-ms-vad-asr-ja-zipformer-reazonspeech]: https://www.modelscope.cn/studios/csukuangfj/web-assembly-vad-asr-sherpa-onnx-ja-zipformer
[GigaSpeech2]: https://github.com/SpeechColab/GigaSpeech2
[wasm-hf-vad-asr-th-zipformer-gigaspeech2]: https://huggingface.co/spaces/k2-fsa/web-assembly-vad-asr-sherpa-onnx-th-zipformer
[wasm-ms-vad-asr-th-zipformer-gigaspeech2]: https://www.modelscope.cn/studios/csukuangfj/web-assembly-vad-asr-sherpa-onnx-th-zipformer
[TeleSpeech-ASR]: https://github.com/Tele-AI/TeleSpeech-ASR
[wasm-hf-vad-asr-zh-telespeech]: https://huggingface.co/spaces/k2-fsa/web-assembly-vad-asr-sherpa-onnx-zh-telespeech
[wasm-ms-vad-asr-zh-telespeech]: https://www.modelscope.cn/studios/k2-fsa/web-assembly-vad-asr-sherpa-onnx-zh-telespeech
[wasm-hf-vad-asr-zh-en-paraformer-large]: https://huggingface.co/spaces/k2-fsa/web-assembly-vad-asr-sherpa-onnx-zh-en-paraformer
[wasm-ms-vad-asr-zh-en-paraformer-large]: https://www.modelscope.cn/studios/k2-fsa/web-assembly-vad-asr-sherpa-onnx-zh-en-paraformer
[wasm-hf-vad-asr-zh-en-paraformer-small]: https://huggingface.co/spaces/k2-fsa/web-assembly-vad-asr-sherpa-onnx-zh-en-paraformer-small
[wasm-ms-vad-asr-zh-en-paraformer-small]: https://www.modelscope.cn/studios/k2-fsa/web-assembly-vad-asr-sherpa-onnx-zh-en-paraformer-small
[wasm-hf-tts-piper-en]: https://huggingface.co/spaces/k2-fsa/web-assembly-tts-sherpa-onnx-en
[wasm-ms-tts-piper-en]: https://modelscope.cn/studios/k2-fsa/web-assembly-tts-sherpa-onnx-en
[wasm-hf-tts-piper-de]: https://huggingface.co/spaces/k2-fsa/web-assembly-tts-sherpa-onnx-de
[wasm-ms-tts-piper-de]: https://modelscope.cn/studios/k2-fsa/web-assembly-tts-sherpa-onnx-de
[wasm-hf-speaker-diarization]: https://huggingface.co/spaces/k2-fsa/web-assembly-speaker-diarization-sherpa-onnx
[wasm-ms-speaker-diarization]: https://www.modelscope.cn/studios/csukuangfj/web-assembly-speaker-diarization-sherpa-onnx
[apk-streaming-asr]: https://k2-fsa.github.io/sherpa/onnx/android/apk.html
[apk-streaming-asr-cn]: https://k2-fsa.github.io/sherpa/onnx/android/apk-cn.html
[apk-tts]: https://k2-fsa.github.io/sherpa/onnx/tts/apk-engine.html
[apk-tts-cn]: https://k2-fsa.github.io/sherpa/onnx/tts/apk-engine-cn.html
[apk-vad]: https://k2-fsa.github.io/sherpa/onnx/vad/apk.html
[apk-vad-cn]: https://k2-fsa.github.io/sherpa/onnx/vad/apk-cn.html
[apk-vad-asr]: https://k2-fsa.github.io/sherpa/onnx/vad/apk-asr.html
[apk-vad-asr-cn]: https://k2-fsa.github.io/sherpa/onnx/vad/apk-asr-cn.html
[apk-2pass]: https://k2-fsa.github.io/sherpa/onnx/android/apk-2pass.html
[apk-2pass-cn]: https://k2-fsa.github.io/sherpa/onnx/android/apk-2pass-cn.html
[apk-at]: https://k2-fsa.github.io/sherpa/onnx/audio-tagging/apk.html
[apk-at-cn]: https://k2-fsa.github.io/sherpa/onnx/audio-tagging/apk-cn.html
[apk-at-wearos]: https://k2-fsa.github.io/sherpa/onnx/audio-tagging/apk-wearos.html
[apk-at-wearos-cn]: https://k2-fsa.github.io/sherpa/onnx/audio-tagging/apk-wearos-cn.html
[apk-sid]: https://k2-fsa.github.io/sherpa/onnx/speaker-identification/apk.html
[apk-sid-cn]: https://k2-fsa.github.io/sherpa/onnx/speaker-identification/apk-cn.html
[apk-slid]: https://k2-fsa.github.io/sherpa/onnx/spoken-language-identification/apk.html
[apk-slid-cn]: https://k2-fsa.github.io/sherpa/onnx/spoken-language-identification/apk-cn.html
[apk-kws]: https://k2-fsa.github.io/sherpa/onnx/kws/apk.html
[apk-kws-cn]: https://k2-fsa.github.io/sherpa/onnx/kws/apk-cn.html
[apk-flutter-streaming-asr]: https://k2-fsa.github.io/sherpa/onnx/flutter/asr/app.html
[apk-flutter-streaming-asr-cn]: https://k2-fsa.github.io/sherpa/onnx/flutter/asr/app-cn.html
[flutter-tts-android]: https://k2-fsa.github.io/sherpa/onnx/flutter/tts-android.html
[flutter-tts-android-cn]: https://k2-fsa.github.io/sherpa/onnx/flutter/tts-android-cn.html
[flutter-tts-linux]: https://k2-fsa.github.io/sherpa/onnx/flutter/tts-linux.html
[flutter-tts-linux-cn]: https://k2-fsa.github.io/sherpa/onnx/flutter/tts-linux-cn.html
[flutter-tts-macos-x64]: https://k2-fsa.github.io/sherpa/onnx/flutter/tts-macos-x64.html
[flutter-tts-macos-x64-cn]: https://k2-fsa.github.io/sherpa/onnx/flutter/tts-macos-x64-cn.html
[flutter-tts-macos-arm64]: https://k2-fsa.github.io/sherpa/onnx/flutter/tts-macos-arm64.html
[flutter-tts-macos-arm64-cn]: https://k2-fsa.github.io/sherpa/onnx/flutter/tts-macos-arm64-cn.html
[flutter-tts-win-x64]: https://k2-fsa.github.io/sherpa/onnx/flutter/tts-win.html
[flutter-tts-win-x64-cn]: https://k2-fsa.github.io/sherpa/onnx/flutter/tts-win-cn.html
[lazarus-subtitle]: https://k2-fsa.github.io/sherpa/onnx/lazarus/download-generated-subtitles.html
[lazarus-subtitle-cn]: https://k2-fsa.github.io/sherpa/onnx/lazarus/download-generated-subtitles-cn.html
[asr-models]: https://github.com/k2-fsa/sherpa-onnx/releases/tag/asr-models
[tts-models]: https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models
[vad-models]: https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/silero_vad.onnx
[kws-models]: https://github.com/k2-fsa/sherpa-onnx/releases/tag/kws-models
[at-models]: https://github.com/k2-fsa/sherpa-onnx/releases/tag/audio-tagging-models
[sid-models]: https://github.com/k2-fsa/sherpa-onnx/releases/tag/speaker-recongition-models
[slid-models]: https://github.com/k2-fsa/sherpa-onnx/releases/tag/speaker-recongition-models
[punct-models]: https://github.com/k2-fsa/sherpa-onnx/releases/tag/punctuation-models
[speaker-segmentation-models]: https://github.com/k2-fsa/sherpa-onnx/releases/tag/speaker-segmentation-models
[GigaSpeech]: https://github.com/SpeechColab/GigaSpeech
[WenetSpeech]: https://github.com/wenet-e2e/WenetSpeech
