This project extracts text transcriptions from movies and audio recordings in the most popular video and audio formats.
The service consists of the following components: an API service that accepts uploads and serves transcription results, one or more worker processes that run the Whisper model, and a Redis instance that connects them.
To get started, clone the repository and bring everything up with Docker Compose:
git clone <org_path>/transcription_service.git
cd transcription_service
docker-compose up --build
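Once the containers are up, you can verify that the API is reachable via its Swagger UI (assuming the default port mapping of 8001 used throughout this README):
curl -I http://localhost:8001/docs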
Configuration for the different service components can be found in the Docker Compose file under docker/docker-compose.yml.
Environment variables for the service components are configured in the docker-compose.yml file.
GPU support is enabled by default in the Docker Compose configuration, via the following settings on the worker service (this requires the NVIDIA Container Toolkit to be installed on the host):
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [ gpu ]
runtime: nvidia
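To verify that the workers can actually see the GPU, one quick check (assuming the stack is already running and the NVIDIA Container Toolkit is installed) is to run nvidia-smi inside a worker container:
docker-compose exec worker nvidia-smi
If the GPU is visible, this prints the usual device table; if not, revisit the host-side driver and toolkit setup.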
To scale the number of worker processes horizontally:
docker-compose up --scale worker=3
This command will start 3 worker containers; you can use whatever number you see fit. Alternatively, set a fixed number of replicas for the worker service in docker-compose.yml:
worker:
  # ... other configurations ...
  deploy:
    replicas: 3
Then run docker-compose up --build to apply the changes.
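To confirm that the expected number of worker containers came up, you can list them with Compose (a quick sanity check):
docker-compose ps worker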
IMPORTANT: Make sure to adjust the number of workers based on the available resources on your host system – especially when using GPU acceleration.
By default, the service uses the large-v3 Whisper model, which requires approximately 10-11GB of GPU memory (VRAM). For instance, on a single 24GB GPU this leaves room for at most two such workers running concurrently.
You can choose a different model based on your hardware capabilities: smaller options such as tiny or base require far fewer resources and run quickly even on CPUs alone, at the cost of lower transcription quality.
For a full list of available models and their capabilities, visit: https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages
To change the model, update the WHISPER_MODEL_NAME environment variable in the docker-compose.yml file under the worker service.
Note: Models are not embedded in the worker images and are downloaded at runtime.
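For example, to switch to the base model, the worker service entry would look roughly like this (a sketch; the exact layout of the environment section in your compose file may differ):
worker:
  # ... other configurations ...
  environment:
    - WHISPER_MODEL_NAME=base  # e.g. tiny, base, small, medium, large-v3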
For local development, follow the steps below:
git clone <org_path>/transcription_service.git
cd transcription_service
poetry install
pre-commit install
uvicorn transcription_service.main:app --reload --env-file=.example.env
Workers can be started in a similar fashion, using the following command:
python -m transcription_service.worker
A Redis server should be running on localhost on the default port (6379).
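If you don't have Redis installed locally, one convenient option (any standard Redis setup works, this is just a suggestion) is to run it in Docker:
docker run --rm -p 6379:6379 redis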
To run the end-to-end transcription workflow, you can use the provided utility script:
python scripts/end_to_end_transcription.py
This script uploads a sample video file to the service, checks the transcription status, and downloads the transcription once it's ready.
Note: ensure the service is running before executing the script, and that you run it from a virtual environment with all dependencies installed.
python scripts/end_to_end_transcription.py localhost:8001 https://www.youtube.com/watch?v=TX4s0X6FDcQ /path/to/local/sandbox/outputs/TX4s0X6FDcQ_transcription.txt --include-word-timestamps
The call above will upload the video from the provided YouTube link to the service running at localhost:8001 (we support both paths to local files and YouTube links for convenience, as the latter was a popular internal use case), check the transcription status, and download the transcription to the specified path once it's ready.
The --include-word-timestamps flag switches the generated transcription to the "rich" format, which includes word-level timestamps.
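Since paths to local files are supported as well, an equivalent invocation with a local video (the paths below are placeholders) would be:
python scripts/end_to_end_transcription.py localhost:8001 /path/to/local/video.mp4 /path/to/local/sandbox/outputs/video_transcription.txt --include-word-timestamps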
All endpoints are documented using Swagger UI, which can be accessed at http://localhost:8001/docs.
Future (possible) improvements:
Extracted from a larger system, this component was brought to you by 🐰 datarabbit.ai 🐰
It is licensed under the Apache License, Version 2.0, in order to allow for flexibility of use and modification.
Do something cool with it! 🚀