Audiobook Revamper

Use AI to shorten and re-narrate audiobooks.

📕 ➔ 🗜️ ➔ 🗣️ ➔ 📗

Why?

Download an audiobook from Audible as an M4B file using OpenAudible.
Chapterize the M4B file into individual M4A audio files using FFmpeg.
Transcribe chapter audio files to text using WhisperX.
Shorten chapter text using Llama or Mixtral.
Narrate new chapters in a custom voice using a text-to-speech model from Eleven Labs.
Combine new chapters into a single M4B file.
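The chapterizing step above could be sketched roughly as follows. This is a minimal illustration, not the project's actual script: it assumes `ffprobe` and `ffmpeg` are on the PATH, and the function names, the `chapters` output directory, and the numbered file names are all hypothetical.

```python
import json
import subprocess
from pathlib import Path


def read_chapters(m4b_path):
    """Ask ffprobe for the chapter list embedded in the M4B file."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json",
         "-show_chapters", str(m4b_path)],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)["chapters"]


def split_commands(m4b_path, chapters, out_dir="chapters"):
    """Build one ffmpeg command per chapter. Nothing is executed here."""
    commands = []
    for index, chapter in enumerate(chapters, start=1):
        out_file = Path(out_dir) / f"{index:03d}.m4a"  # hypothetical naming scheme
        commands.append([
            "ffmpeg", "-y", "-i", str(m4b_path),
            "-ss", str(chapter["start_time"]),
            "-to", str(chapter["end_time"]),
            "-vn",                    # drop cover art, keep audio only
            "-c:a", "copy",           # copy the audio stream without re-encoding
            "-map_chapters", "-1",    # do not carry chapter markers into the output
            str(out_file),
        ])
    return commands
```

To actually split a book, create the output directory and pass each command from `split_commands("audiobook.m4b", read_chapters("audiobook.m4b"))` to `subprocess.run(..., check=True)`.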

export REPLICATE_TOKEN=r8_foobarbazzledazzle
export ELEVEN_LABS_API_KEY=eleventybillion
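A script can check for these variables up front so a missing key fails early rather than mid-pipeline. The sketch below is an assumption about how one might do that; the helper name `missing_credentials` is hypothetical and not part of this project.

```python
import os

# Variable names taken from the export lines above.
REQUIRED_VARS = ("REPLICATE_TOKEN", "ELEVEN_LABS_API_KEY")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
```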

Install Python dependencies:

pip install -r requirements.txt

Split audiobook into individual chapters

python chapterize.py audiobook.m4b

Redo a chapter, shortened and re-narrated

python compose.py chapters/*.mp3

Play with models in the browser. When first tinkering with an unfamiliar model, running it on Replicate's web UI makes it easier to get started, play with inputs, visualize outputs, then grab some code and run with it.
Use the Replicate dashboard to dig into your recent predictions and get a helpful view of inputs, outputs, and metrics.
Use Replicate deployments. Deploying your own copy of a model on Replicate gives you control over min/max instances, so you can keep a model running while you're prototyping and scale it down to zero when you're done.
Use Python for prototyping. ChatGPT is good at writing Python. Python has a big standard library, so you can build stuff with fewer external dependencies. There are none of the ESM/CJS shenanigans of the JavaScript world. The Replicate Python client also gives a better experience for working with local files.
Use Node.js for real products. When you start building something that's going to have real users, Vercel + Next.js is a winning combination. Instead of expensive long-running processes, use webhooks and serverless functions to minimize costs.
Use run counts as a proxy for model quality. There are many Whisper variants. Some are better than others. Some do diarization. Some fall over on large audio files. A high run count is usually a good indication that people are using a model with success.
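As a small illustration of the run-count tip, the sketch below ranks candidate models by their `run_count`. It assumes each model is a plain dictionary shaped like the model objects returned by Replicate's HTTP API; the owners, names, and numbers are made up for the example.

```python
def rank_by_run_count(models):
    """Sort model records from most-run to least-run."""
    return sorted(models, key=lambda m: m.get("run_count") or 0, reverse=True)


# Made-up records for illustration; real ones would come from the Replicate API.
candidates = [
    {"owner": "example-a", "name": "whisper-basic", "run_count": 1200},
    {"owner": "example-b", "name": "whisper-diarized", "run_count": 98000},
    {"owner": "example-c", "name": "whisper-experimental", "run_count": None},
]

for model in rank_by_run_count(candidates):
    print(f'{model["owner"]}/{model["name"]}: {model["run_count"] or 0} runs')
```

Run count only says a model is popular, so it is still worth testing the top few on your own audio.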

Try running the language model locally using Ollama's Python or JavaScript client libraries.
Try replacing Eleven Labs with a Replicate-hosted text-to-speech model.
Play around with the summary prompts. How do you get the best "compression" while still maintaining the essence of the original text?
Bring your own audio file as voice training data. Eleven Labs supports training on the fly without pre-creating a voice.
Build an app that finds and extracts utterances in audio that are followed by laughter.
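For the summary-prompt idea above, one way to compare prompts is to measure how much each one actually compresses a chapter. The sketch below is an assumption, not the prompt this project uses: the prompt wording, the `target_ratio` parameter, and both function names are hypothetical.

```python
def build_prompt(chapter_text, target_ratio=0.5):
    """Build a shortening prompt that asks for roughly target_ratio of the original length."""
    target_words = max(1, round(len(chapter_text.split()) * target_ratio))
    return (
        f"Rewrite the following chapter in about {target_words} words. "
        "Keep the author's voice, the key events, and the main ideas.\n\n"
        f"{chapter_text}"
    )


def compression_ratio(original, shortened):
    """Word count of the shortened text divided by the original word count."""
    original_words = len(original.split())
    if original_words == 0:
        return 0.0
    return len(shortened.split()) / original_words
```

Sending `build_prompt(text)` to the language model and then checking `compression_ratio(text, reply)` gives a rough number to track while tuning the prompt. Whether the essence survives still needs a human read.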