A command-line tool to extract accurate text subtitles (SRT format) from DVD and Blu-ray disc formats
MIT License
Tomoji is a command-line tool to extract accurate text subtitles (SRT format) from DVD and Blu-ray disc formats. DVD/Blu-ray discs store subtitle data as images, so converting them to text requires OCR (optical character recognition). Other tools for this purpose have had low-quality OCR that made them especially unsuitable for languages with complicated non-Latin scripts (e.g. Japanese).
Tomoji is implemented as a small Python wrapper/glue script; the real work is done by MKVToolNix, OGMRip, and the Google Cloud Vision API.
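To give a feel for the OCR step, here is a minimal sketch of sending one subtitle image to the Cloud Vision REST endpoint (`images:annotate`) with a `TEXT_DETECTION` feature. This is not tomoji's actual code: the function names (`build_ocr_request`, `extract_text`, `ocr_image`) are hypothetical, and the sketch uses only the standard library so that it stays self-contained.

```python
import base64
import json
import urllib.request

VISION_URL = "https://vision.googleapis.com/v1/images:annotate"


def build_ocr_request(image_bytes, language_hint=None):
    """Build the JSON body for a single-image TEXT_DETECTION request."""
    request = {
        # The API expects the image as base64-encoded text.
        "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
        "features": [{"type": "TEXT_DETECTION"}],
    }
    if language_hint:
        # Optional hint such as "ja"; whether tomoji sends one is an assumption.
        request["imageContext"] = {"languageHints": [language_hint]}
    return {"requests": [request]}


def extract_text(response_json):
    """Return the recognized text from an images:annotate response, or ''."""
    responses = response_json.get("responses", [])
    if not responses:
        return ""
    annotations = responses[0].get("textAnnotations", [])
    # The first annotation carries the full text found in the image.
    return annotations[0]["description"].strip() if annotations else ""


def ocr_image(image_bytes, api_key, language_hint=None):
    """POST one image to the Vision API and return the recognized text."""
    body = json.dumps(build_ocr_request(image_bytes, language_hint)).encode("utf-8")
    req = urllib.request.Request(
        VISION_URL + "?key=" + api_key,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_text(json.load(resp))
```

Calling `ocr_image(png_bytes, "YOUR_GOOG_API_KEY", "ja")` would make one billable request per subtitle image.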
NOTE: The Google Cloud Vision API is cheap but not free, so to use tomoji for OCR you'll need a Google Cloud Platform account and must provide your API key on the command line.
Tomoji requires an .mkv (Matroska video) file with embedded subtitle-image (VOBSUB) tracks as input, so you'll have to use a separate program to extract a DVD/Blu-ray disc to an .mkv file. HandBrake is a free, open source, multi-platform tool that works well for this purpose (see below for tips on the right settings to use for HandBrake).
Installing the dependencies for tomoji can be a nightmare on Mac/Windows, so I've published a Docker image (rsimmons/tomoji) that bundles them and lets you conveniently run tomoji on any platform supported by Docker. Input and output can be provided via stdin/stdout so that Docker volumes are not required:
$ docker run -i --rm rsimmons/tomoji list - < inputvideo.mkv
Available subtitle tracks (VOBSUB):
#3: Japanese (jpn)
#4: English (eng)
$ docker run -i --rm rsimmons/tomoji ocr -k YOUR_GOOG_API_KEY - 3 < inputvideo.mkv > outputsubs_ja.srt
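The `-` argument in the commands above stands for "read the video from stdin". A plausible way to support that, assuming the underlying extraction tools want a regular file path rather than a pipe, is to copy stdin to a temporary file first. The helper below is a hypothetical sketch of that idea, not tomoji's own implementation.

```python
import shutil
import sys
import tempfile


def materialize_input(path, stdin=None):
    """Return a real file path for `path`; '-' means copy stdin to a temp file."""
    if path != "-":
        return path
    source = stdin if stdin is not None else sys.stdin.buffer
    # delete=False keeps the file on disk after closing; the caller removes it.
    tmp = tempfile.NamedTemporaryFile(suffix=".mkv", delete=False)
    with tmp:
        shutil.copyfileobj(source, tmp)
    return tmp.name
```

The caller is responsible for deleting the temporary file once processing is done.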
On a recent version of Ubuntu, tomoji can be run like this:
$ sudo apt-get install -y ogmrip python3-venv
$ python3 -m venv env
$ source ./env/bin/activate
(env) $ pip install pycountry requests
(env) $ python3 tomoji.py list inputvideo.mkv
Available subtitle tracks (VOBSUB):
#3: Japanese (jpn)
#4: English (eng)
(env) $ python3 tomoji.py ocr -k YOUR_GOOG_API_KEY inputvideo.mkv 3 > outputsubs_ja.srt
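The file produced by the `ocr` command is plain-text SRT: numbered cues, each with a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line followed by the subtitle text and a blank line. As an illustration of that format only (the function names and the millisecond-based cue tuples are assumptions, not taken from tomoji), cues could be serialized like this:

```python
def srt_timestamp(ms):
    """Format a non-negative millisecond offset as HH:MM:SS,mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def format_srt(cues):
    """Serialize an iterable of (start_ms, end_ms, text) cues as SRT text."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    # Joining with a newline leaves one blank line between cues.
    return "\n".join(blocks)
```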
Coming Soon
At the time of this writing, ffmpeg can't extract VOBSUB tracks from .mp4 files, but mkvextract can extract them from .mkv files. HandBrake supports outputting .mkv files. So .mkv files end up being the best intermediate format for processing VOBSUB tracks.
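As a rough sketch of the .mkv side of this, the MKVToolNix tools can be driven from Python: `mkvmerge -J` prints a JSON description of a file, and `mkvextract` pulls a track out by ID. The function names and the exact way tomoji calls these tools are assumptions; only the parsing function is exercised without MKVToolNix installed.

```python
import json
import subprocess


def vobsub_tracks(identify_json):
    """Return (track_id, language) pairs for VOBSUB subtitle tracks."""
    found = []
    for track in identify_json.get("tracks", []):
        props = track.get("properties", {})
        if track.get("type") == "subtitles" and props.get("codec_id") == "S_VOBSUB":
            found.append((track["id"], props.get("language", "und")))
    return found


def list_vobsub_tracks(mkv_path):
    """Run `mkvmerge -J` on the file and return its VOBSUB tracks."""
    result = subprocess.run(
        ["mkvmerge", "-J", mkv_path],
        check=True, capture_output=True, text=True,
    )
    return vobsub_tracks(json.loads(result.stdout))


def extract_vobsub(mkv_path, track_id, out_sub_path):
    """Extract one track; for VOBSUB, mkvextract is expected to write the
    .sub file plus a matching .idx index file next to it."""
    subprocess.run(
        ["mkvextract", "tracks", mkv_path, f"{track_id}:{out_sub_path}"],
        check=True,
    )
```

For the example file above, `list_vobsub_tracks("inputvideo.mkv")` would be expected to return something like `[(3, "jpn"), (4, "eng")]`.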