Transcription service

This project extracts text transcriptions from video and audio recordings in most popular media formats.

Capabilities

Supports transcription of popular video and audio formats
Designed to handle long videos (up to several hours)
Scalable architecture using Redis with Redis Queue (RQ) for job queuing and worker management
Self-hosted solution using the open-source Whisper AI model
RESTful API for file upload and transcription status checking
Separate worker processes for handling transcription tasks
Everything is containerized using Docker and Docker Compose, including CUDA support setup
Option to include word timestamps in transcriptions

Architecture

The service consists of the following components:

API Server: Handles incoming requests and manages the transcription queue.
Redis: Acts as a message broker and job queue.
Worker: Processes transcription jobs using the Whisper AI model.
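
The flow between these three components can be sketched in a few lines. The snippet below is only an in-process illustration: it swaps Redis Queue for Python's standard `queue.Queue`, and `submit`, `worker_loop`, `fake_transcribe`, and the `jobs` dictionary are hypothetical names, not the service's actual code.

```python
import queue
import threading
import uuid

jobs = {}                  # job_id -> {"status": ..., "result": ...}
job_queue = queue.Queue()  # stand-in for the Redis-backed job queue


def fake_transcribe(path):
    """Placeholder for the Whisper call a real worker would make."""
    return f"transcription of {path}"


def submit(path):
    """API-server side: register the job and put it on the queue."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "result": None}
    job_queue.put((job_id, path))
    return job_id


def worker_loop():
    """Worker side: process jobs until a None sentinel arrives."""
    while True:
        item = job_queue.get()
        if item is None:
            job_queue.task_done()
            break
        job_id, path = item
        jobs[job_id]["status"] = "started"
        jobs[job_id]["result"] = fake_transcribe(path)
        jobs[job_id]["status"] = "finished"
        job_queue.task_done()


def run_demo(paths, n_workers=2):
    """Start workers, submit jobs, wait for them, and return the job ids."""
    threads = [threading.Thread(target=worker_loop) for _ in range(n_workers)]
    for t in threads:
        t.start()
    ids = [submit(p) for p in paths]
    job_queue.join()            # block until every submitted job is done
    for _ in threads:
        job_queue.put(None)     # one shutdown sentinel per worker
    for t in threads:
        t.join()
    return ids
```

Calling `run_demo(["a.mp4", "b.wav"])` leaves both entries in `jobs` with status `finished`. In the real service the queue lives in Redis, so the API server and the workers can run in separate containers.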

Running the containerized service

Prerequisites

Docker and Docker Compose
NVIDIA GPU with CUDA support (optional, for GPU acceleration)
NVIDIA drivers and NVIDIA Container Runtime installed on the host system (optional, for GPU acceleration)

Installation and setup (containerized)

Clone this repository:

```
git clone <org_path>/transcription_service.git
cd transcription_service
```

The environment variables are already set in the docker-compose.yml file. If you need to modify any settings, you can do so directly in the compose file or by creating a .env file in the project root directory.
Build and start the services using Docker Compose:
```
docker-compose up --build
```

Configuration

Configuration for the individual service components lives in the Docker Compose file at docker/docker-compose.yml.

Environment variables

The following environment variables are configured in the docker-compose.yml file:

REDIS_HOST: Hostname of the Redis server (set to redis unless one wants to use a different/external Redis server)
REDIS_PORT: Port of the Redis server (set to 6379)
REDIS_DB: Index of the Redis database used for task/job orchestration (defaults to 10)
UPLOADS_DIR: Directory for uploaded files (defaults to /app/uploads, and the directory can be accessed from the volume)
TRANSCRIPTIONS_DIR: Directory for storing transcriptions (set to /app/transcriptions, and the directory can be accessed from the volume)
WHISPER_MODEL_NAME: Whisper model to use (defaults to large-v3 for best quality; see the alternative models below for resource-scarce scenarios)
WHISPER_MODEL_DEVICE: Device to run the model on (set to cuda for GPU acceleration)
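
As an illustration of how a component might consume these variables, here is a minimal sketch that reads them into a typed object. The `Settings` class and `load_settings` function are hypothetical and not taken from the project; the fallback values simply mirror the ones listed above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    redis_host: str
    redis_port: int
    redis_db: int
    uploads_dir: str
    transcriptions_dir: str
    whisper_model_name: str
    whisper_model_device: str


def load_settings(env=None):
    """Build Settings from a mapping; uses os.environ when none is given."""
    env = os.environ if env is None else env
    return Settings(
        redis_host=env.get("REDIS_HOST", "redis"),
        redis_port=int(env.get("REDIS_PORT", "6379")),
        redis_db=int(env.get("REDIS_DB", "10")),
        uploads_dir=env.get("UPLOADS_DIR", "/app/uploads"),
        transcriptions_dir=env.get("TRANSCRIPTIONS_DIR", "/app/transcriptions"),
        whisper_model_name=env.get("WHISPER_MODEL_NAME", "large-v3"),
        whisper_model_device=env.get("WHISPER_MODEL_DEVICE", "cuda"),
    )
```

Passing a plain dictionary, for example `load_settings({"WHISPER_MODEL_NAME": "base"})`, makes it easy to check an override without touching the real environment.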

GPU support

GPU support is enabled by default in the Docker Compose configuration. To use it:

Ensure your host system has NVIDIA drivers and NVIDIA Container Runtime installed.
The docker-compose.yml file already includes the necessary configuration:

```
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [ gpu ]
runtime: nvidia
```

Scaling workers

To scale the number of worker processes horizontally:

Use Docker Compose's --scale option:

```
docker-compose up --scale worker=3
```

This command starts 3 worker containers; use whatever number suits your setup.

Alternatively, you can modify the docker-compose.yml file to include a deploy section for the worker service:

```
worker:
  # ... other configurations ...
  deploy:
    replicas: 3
```

Then run docker-compose up --build to apply the changes.

IMPORTANT: Make sure to adjust the number of workers based on the available resources on your host system – especially when using GPU acceleration.
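
A rough way to pick a worker count for a single GPU is to divide the available VRAM by the memory one worker needs. The sketch below assumes that each worker loads its own copy of the model, which this README does not state explicitly; `max_gpu_workers` and the `headroom_gb` parameter are hypothetical.

```python
def max_gpu_workers(total_vram_gb, per_worker_vram_gb, headroom_gb=1.0):
    """Estimate how many workers fit in GPU memory, keeping some headroom.

    Assumes every worker holds a full copy of the model in VRAM.
    """
    if per_worker_vram_gb <= 0:
        raise ValueError("per_worker_vram_gb must be positive")
    usable = total_vram_gb - headroom_gb
    if usable < per_worker_vram_gb:
        return 0
    return int(usable // per_worker_vram_gb)
```

For example, `max_gpu_workers(24, 11)` uses the upper end of the large-v3 estimate given in this README for a 24 GB card. Treat the result as a starting point and confirm it by watching actual memory usage.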

Whisper Model Configuration

By default, the service uses the large-v3 Whisper model, which requires approximately 10-11GB of GPU memory (VRAM). You can choose a different model based on your hardware: smaller options such as tiny or base need far fewer resources and run quickly even on CPU alone, at the cost of lower transcription quality.

For a full list of available models and their capabilities, visit: https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages

To change the model, update the WHISPER_MODEL_NAME environment variable in the docker-compose.yml file under the worker service.

Note: Models are not embedded in the worker images and are downloaded at runtime.
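
To make the trade-off concrete, the sketch below picks the largest model that fits a given amount of VRAM. The memory figures are approximations: the large-v3 value comes from the estimate above, and the others from the Whisper README linked above. `pick_model` is a hypothetical helper, and using `cpu` as the device value is an assumption (this README only documents `cuda`).

```python
# (model name, approximate VRAM in GB), ordered from largest to smallest.
APPROX_VRAM_GB = [
    ("large-v3", 11),
    ("medium", 5),
    ("small", 2),
    ("base", 1),
]


def pick_model(available_vram_gb):
    """Return (model_name, device) for the largest model that should fit."""
    for name, needed_gb in APPROX_VRAM_GB:
        if available_vram_gb >= needed_gb:
            return name, "cuda"
    # Not enough GPU memory for any listed model: use the smallest on CPU.
    return "tiny", "cpu"
```

The returned pair maps onto the WHISPER_MODEL_NAME and WHISPER_MODEL_DEVICE variables.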

Running the service locally/development setup

For local development, follow the steps below:

Clone this repository:

```
git clone <org_path>/transcription_service.git
cd transcription_service
```

Create a virtual environment and install dependencies (Poetry must already be installed on the system):
```
poetry install
```
Set up pre-commit hooks (pre-commit must already be installed on the system):
```
pre-commit install
```

Start the service using Uvicorn:

```
uvicorn transcription_service.main:app --reload --env-file=.example.env
```

Workers can be started in a similar way, using the following command:

```
python -m transcription_service.worker
```

A Redis server must be running on the default port (6379) on localhost.
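
Before starting the API server or a worker, it can help to confirm that something is listening on that port. The helper below is a minimal sketch using only the standard library; `port_is_open` is a hypothetical name, and the check only proves that a TCP connection succeeds, not that the listener is actually Redis.

```python
import socket


def port_is_open(host="localhost", port=6379, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Running `print(port_is_open())` in the project's virtual environment gives a quick yes/no answer for the default address.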

Testing and usage

To run the end-to-end transcription workflow, you can use the provided utility script:

```
python scripts/end_to_end_transcription.py
```

This script uploads a sample video file to the service, checks the transcription status, and downloads the transcription once it's ready.

Note: ensure the service is running before executing the script, and run it from a virtual environment with all dependencies installed.

Call with example arguments

```
python scripts/end_to_end_transcription.py localhost:8001 "https://www.youtube.com/watch?v=TX4s0X6FDcQ" /path/to/local/sandbox/outputs/TX4s0X6FDcQ_transcription.txt --include-word-timestamps
```

The call above uploads the video from the provided YouTube link to the service running at localhost:8001, checks the transcription status, and downloads the transcription to the specified path once it is ready. Both local file paths and YouTube links are supported; the latter are included for convenience because they were a popular internal use case. The --include-word-timestamps flag switches the generated transcription to the "rich" format, which includes word timestamps.
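
The check-then-download part of that workflow is essentially a polling loop. The sketch below shows one way to write it; it is not the script's actual code. The status strings (`finished`, `failed`) and the `get_status` and `fetch_result` callables are assumptions standing in for whatever HTTP calls the real client makes.

```python
import time


def wait_for_transcription(get_status, fetch_result, poll_interval=2.0,
                           timeout=3600.0, sleep=time.sleep,
                           clock=time.monotonic):
    """Poll get_status() until the job finishes, then return fetch_result().

    Raises RuntimeError if the job reports failure and TimeoutError if it
    does not finish within `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status == "finished":
            return fetch_result()
        if status == "failed":
            raise RuntimeError("transcription job failed")
        if clock() >= deadline:
            raise TimeoutError("transcription did not finish in time")
        sleep(poll_interval)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting. For multi-hour videos, a longer `poll_interval` avoids hammering the API.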

Swagger API documentation

All endpoints are documented using Swagger UI, which can be accessed at http://localhost:8001/docs.

TODOs

Future (possible) improvements:

Add configurable file size restrictions for uploads and video lengths to manage system resources effectively.
Implement user authentication and authorization for secure access to the API.
Add support for speech-to-text (STT) models other than Whisper (Whisper was prioritized because it was SOTA).

License

Extracted from a larger system, this component was brought to you by 🐰 datarabbit.ai 🐰

It is licensed under the Apache License, Version 2.0, to allow flexible use and modification.

Do something cool with it! 🚀
