pisets

This project provides a Python library and service for automatic speech recognition and transcription in Russian and English.

You can generate subtitles in the SubRip format for any audio or video file supported by FFmpeg.
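
For reference, a SubRip file consists of numbered cues, each with a time range and the recognized text. The generated subtitles therefore look like this (the phrases below are purely illustrative):

1
00:00:00,000 --> 00:00:03,500
The first recognized phrase.

2
00:00:04,000 --> 00:00:08,250
The second recognized phrase.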

"Pisets" is a Russian word (in Cyrillic, "писец") denoting a person who writes down text, including from dictation (the corresponding English term is "scribe"). Thus, if you need a text transcript of an audio recording of a meeting or seminar, the artificial "Pisets" will help you.

Installation

This project uses deep learning, so a key dependency is a deep learning framework. I prefer PyTorch, and you need to install a CPU- or GPU-based build of PyTorch ver. 2.3 or later. You can find a more detailed description of the dependencies in requirements.txt.

Other important dependencies are:

  • Transformers: a Python library for building neural networks with the Transformer architecture;
  • FFmpeg: a software for handling video, audio, and other multimedia files.

The first dependency is a well-known Python library, but the second one is not a Python package: you have to install FFmpeg on your system as described in the instructions at https://ffmpeg.org/download.html.

Also, for installation you need Python 3.10 or later. I recommend using a new Python virtual environment, which can be created with Anaconda.
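
For example, a suitable environment can be created and activated as follows (the environment name pisets is arbitrary):

conda create -n pisets python=3.10
conda activate pisets

To install this project in this virtual environment, run the following commands in the Terminal: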

git clone https://github.com/bond005/pisets.git
cd pisets
python -m pip install -r requirements.txt

To check that everything works and the environment is configured correctly, you can run the unit tests:

python -m unittest

Usage

Command prompt

Using the Pisets is very simple. You just have to run the following command in your command prompt:

python speech_to_srt.py \
    -i /path/to/your/sound/or/video.m4a \
    -o /path/to/resulted/transcription.srt \
    -m /path/to/local/directory/with/models \
    -lang ru

The 1st argument -i specifies the name of the source audio or video in any format supported by FFmpeg.

The 2nd argument -o specifies the name of the resulting SubRip file into which the recognized transcription will be written.

The other arguments are not required; if you do not specify them, their default values are used. Still, their description matters for any user. The argument -lang specifies the language: you can select Russian (ru, rus, russian) or English (en, eng, english), and the default is Russian. The argument -m points to the directory with all needed pre-downloaded models. This directory must include several subdirectories containing the localized models for the corresponding languages (ru and en are supported now). In turn, each language subdirectory includes three more subdirectories corresponding to the three models used (a sketch of the expected layout follows this list):

  1. wav2vec2 (for preliminary speech recognition and segmentation into speech frames);
  2. ast (for filtering non-speech segments);
  3. whisper (for final speech recognition).
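
For illustration, assuming each model subdirectory is named after the corresponding model, a directory passed via -m that supports both languages could look like this:

/path/to/local/directory/with/models
├── ru
│   ├── wav2vec2
│   ├── ast
│   └── whisper
└── en
    ├── wav2vec2
    ├── ast
    └── whisper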

If you don't specify the argument -m, then all needed models will be automatically downloaded from the Hugging Face Hub.

You can also generate the transcription of your audio recording as a DocX file:

python speech_to_docx.py \
    -i /path/to/your/sound/or/video.m4a \
    -o /path/to/resulted/transcription.docx \
    -m /path/to/local/directory/with/models \
    -lang ru

If your computer has a CUDA-compatible GPU, and PyTorch has been correctly installed for this GPU, then the Pisets will transcribe your speech very quickly: the real-time factor (xRT), defined as the ratio between the time it takes to process the input and the duration of the input, is approximately 0.15 - 0.25, depending on the concrete GPU type. For example, with xRT = 0.2 a one-hour recording is transcribed in about 12 minutes. If you use the CPU only, the Pisets computes the transcription significantly more slowly (xRT is approximately 1.0 - 1.5).

Docker and REST-API

Installation of the Pisets can be difficult, especially for Windows users (in Linux it is trivial). Accordingly, in order to simplify the installation process and hide all the difficulties from the user, I suggest using a Docker container that can be deployed and run on any operating system. In this case, audio is submitted for recognition and the transcription results are received via the REST API.

You can build the Docker container yourself:

docker build -t bond005/pisets:0.2 .

But the easiest way is to pull the pre-built image from Docker Hub:

docker pull bond005/pisets:0.2

After building (or pulling) the image, you have to run the Docker container:

docker run --rm --gpus all -p 127.0.0.1:8040:8040 bond005/pisets:0.2
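
If your machine has no GPU, you can presumably omit the --gpus all option and run the container on the CPU only, although transcription will then be much slower (this CPU-only mode is an assumption rather than a documented feature):

docker run --rm -p 127.0.0.1:8040:8040 bond005/pisets:0.2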

Hurray! The Docker container is ready for use on the GPU, and the Pisets will transcribe your speech. You can use the Python client for the Pisets service in the script client_ru_demo.py:

python client_ru_demo.py \
    -i /path/to/your/sound/or/video.m4a \
    -o /path/to/resulted/transcription.docx
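
If you prefer not to use the ready-made client, the interaction boils down to uploading an audio file over HTTP and receiving the transcription in response. Below is a minimal sketch using the requests library; note that the endpoint path /transcribe and the form field name audio are assumptions made for illustration, and the actual protocol is implemented in client_ru_demo.py:

import requests

# NOTE: the endpoint path '/transcribe' and the form field name 'audio'
# are assumptions for illustration; see client_ru_demo.py for the real protocol.
PISETS_URL = 'http://127.0.0.1:8040/transcribe'

def transcribe_file(audio_path: str, output_path: str) -> None:
    # Upload the audio file to the Pisets service and save the returned document.
    with open(audio_path, 'rb') as audio_file:
        response = requests.post(PISETS_URL, files={'audio': audio_file})
    response.raise_for_status()  # fail loudly on HTTP errors
    with open(output_path, 'wb') as result_file:
        result_file.write(response.content)

if __name__ == '__main__':
    transcribe_file('/path/to/your/sound/or/video.m4a',
                    '/path/to/resulted/transcription.docx')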

Important notes

The Pisets in the above-mentioned Docker container currently supports only Russian. If you want to transcribe English speech, you have to use the command-line tools speech_to_srt.py or speech_to_docx.py.

Cloud computing

You can open a personal account (in Russian) on SibNN.AI and upload your audio recordings of any size for automatic recognition.

In addition, you can try the demo of the cloud Pisets without registration on the web page https://pisets.dialoger.tech (the demo without registration limits audio recordings to a maximum length of 5 minutes, but allows you to record a signal from a microphone).

Contact

Ivan Bondarenko - @Bond_005 - [email protected]

Acknowledgment

This project was developed as part of a more fundamental project to create an open-source system for automatic transcription and semantic analysis of audio recordings of interviews in Russian. Many journalists, sociologists, and other specialists have to transcribe interviews manually, and automation can help them.

The Foundation for Assistance to Small Innovative Enterprises, a Russian governmental non-profit organization, supports a unique program to build free and open-source artificial intelligence systems. This program is known as "Code - Artificial Intelligence" (see https://fasie.ru/press/fund/kod-ai/?sphrase_id=114059 in Russian). The above-mentioned project was started within the first stage of the "Code - Artificial Intelligence" program. You can see the list of first-stage winners on this web page: https://fasie.ru/competitions/kod-ai-results (in Russian).

Therefore, I thank The Foundation for Assistance to Small Innovative Enterprises for this support.

License

Distributed under the Apache 2.0 License. See LICENSE for more information.
