Chinese/Taiwanese Whisper ASR Project

An advanced Automatic Speech Recognition (ASR) system for Chinese (Traditional) and Taiwanese, leveraging the power of OpenAI's Whisper model. This project supports full fine-tuning, Parameter-Efficient Fine-Tuning (PEFT), and streaming inference, optimized for T4 GPUs.

🌟 Features

  • 🎙️ Fine-tuning of Whisper models on Chinese/Taiwanese data
  • 🚀 Support for PEFT methods (e.g., LoRA) for efficient fine-tuning
  • 🔄 Batch and streaming inference capabilities
  • 🖥️ User-friendly Gradio web interface
  • ⚡ Optimized performance on T4 GPUs

๐Ÿ“ Project Structure

ChineseTaiwaneseWhisper/
├── scripts/
│   ├── gradio_interface.py
│   ├── infer.py
│   └── train.py
├── src/
│   ├── config/
│   ├── crawler/
│   ├── data/
│   ├── models/
│   ├── trainers/
│   └── inference/
├── tests/
├── requirements.txt
├── setup.py
└── README.md

🛠️ Installation

  1. Clone the repository:

    git clone https://github.com/sandy1990418/ChineseTaiwaneseWhisper.git
    cd ChineseTaiwaneseWhisper
    
  2. Set up a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
    
  3. Install dependencies:

    pip install -r requirements.txt
    

🚀 Usage

Training

Standard Fine-tuning

python scripts/train.py --model_name_or_path "openai/whisper-small" \
                        --language "chinese" \
                        --dataset_name "mozilla-foundation/common_voice_11_0" \
                        --youtube_data_dir "./youtube_data" \
                        --output_dir "./whisper-finetuned-zh-tw" \
                        --num_train_epochs 3 \
                        --per_device_train_batch_size 16 \
                        --learning_rate 1e-5 \
                        --fp16 \
                        --timestamp False
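
For orientation, dataset preparation for a run like this typically follows the standard Hugging Face recipe: resample audio to 16 kHz and turn transcriptions into label IDs with the Whisper processor. The snippet below is an illustrative sketch only, not the project's src/data code (note that Common Voice requires accepting its terms on the Hub):

from datasets import Audio, load_dataset
from transformers import WhisperProcessor

# Illustrative preprocessing sketch; column names follow Common Voice ("audio", "sentence").
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="chinese", task="transcribe"
)

dataset = load_dataset("mozilla-foundation/common_voice_11_0", "zh-TW", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))  # Whisper expects 16 kHz input

def prepare(batch):
    audio = batch["audio"]
    # Log-Mel features for the encoder, token IDs for the decoder labels
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

dataset = dataset.map(prepare, remove_columns=dataset.column_names)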

PEFT Fine-tuning (e.g., LoRA)

python scripts/train.py --model_name_or_path "openai/whisper-small" \
                        --language "chinese" \
                        --use_peft \
                        --peft_method "lora" \
                        --dataset "common_voice_13_train","youtube_data"  \
                        --output_dir "Checkpoint_Path" \
                        --num_train_epochs 10  \
                        --per_device_train_batch_size 4 \
                        --learning_rate 1e-5  \
                        --fp16 \
                        --timestamp True
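
Under the hood, LoRA-style PEFT wraps the base model with small trainable adapter matrices. A minimal sketch with the peft library is shown below; the rank and target modules here are assumed illustrative values, while the project's real settings live in src/config/train_config.py:

from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

# Illustrative LoRA setup; only the adapter weights are trained.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_config = LoraConfig(
    r=8,                                  # adapter rank (assumed value)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # Whisper attention projections
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of all parameters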

Training Arguments

| Argument | Description | Default |
| --- | --- | --- |
| --model_name_or_path | Path or name of the pre-trained model | Required |
| --language | Language for fine-tuning (e.g., "chinese", "taiwanese") | Required |
| --dataset_name | Name of the dataset to use | Required |
| --dataset_config_names | Configuration name for the dataset | Required |
| --youtube_data_dir | Directory containing YouTube data | Optional |
| --output_dir | Directory to save the fine-tuned model | Required |
| --num_train_epochs | Number of training epochs | 3 |
| --per_device_train_batch_size | Batch size per GPU/CPU for training | 16 |
| --learning_rate | Initial learning rate | 3e-5 |
| --fp16 | Use mixed precision training | False |
| --use_timestamps | Include timestamp information in training | False |
| --use_peft | Use Parameter-Efficient Fine-Tuning | False |
| --peft_method | PEFT method to use (e.g., "lora") | None |

Inference

Gradio Interface

Launch the interactive web interface:

python scripts/gradio_interface.py

Access the interface at http://127.0.0.1:7860 (default URL).

Note: For streaming mode, use Chrome instead of Safari to avoid CPU memory issues.
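
The actual interface lives in scripts/gradio_interface.py; conceptually it is a Gradio app wrapping a Whisper pipeline. A minimal, hypothetical equivalent (not the project's code, which adds streaming and PEFT support) looks roughly like this:

import gradio as gr
from transformers import pipeline

# Hypothetical minimal UI sketch using the base Whisper model.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def transcribe(audio_path):
    if audio_path is None:
        return ""
    return asr(audio_path)["text"]

demo = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),
    outputs="text",
    title="Chinese/Taiwanese Whisper ASR",
)
demo.launch(server_name="127.0.0.1", server_port=7860)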

Batch Inference

python scripts/infer.py --model_path openai/whisper-small \
                        --audio_files audio.wav \
                        --mode batch \
                        --use_timestamps False
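
For a rough idea of what batch transcription does internally, here is an illustrative sketch with the transformers pipeline (not scripts/infer.py itself; chunk size and batch size are assumed values):

from transformers import pipeline

# Illustrative batch inference; point model at a fine-tuned checkpoint directory if you have one.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,   # Whisper processes audio in 30-second windows
    device=0,            # GPU index; use device=-1 for CPU
)

results = asr(["audio.wav"], batch_size=8, return_timestamps=True)
for result in results:
    print(result["text"])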

Inference Arguments

| Argument | Description | Default |
| --- | --- | --- |
| --model_path | Path to the fine-tuned model | Required |
| --audio_files | Path(s) to audio file(s) for transcription | Required |
| --mode | Inference mode ("batch" or "stream") | "batch" |
| --use_timestamps | Include timestamps in transcription | False |
| --device | Device to use for inference (e.g., "cuda", "cpu") | "cuda" if available, else "cpu" |
| --output_dir | Directory to save transcription results | "output" |
| --use_peft | Use PEFT model for inference | False |
| --language | Language of the audio (e.g., "chinese", "taiwanese") | "chinese" |

Audio Crawler

Collect YouTube data:

python src/crawler/youtube_crawler.py \
       --playlist_urls "YOUTUBE_PLAYLIST_URL" \
       --output_dir ./youtube_data \
       --dataset_name youtube_asr_dataset \
       --file_prefix language_prefix

Crawler Arguments

| Argument | Description | Default |
| --- | --- | --- |
| --playlist_urls | YouTube playlist URL(s) to crawl | Required |
| --output_dir | Directory to save audio files and dataset | "./output" |
| --dataset_name | Name of the output dataset file | "youtube_dataset" |
| --file_prefix | Prefix for audio and subtitle files | "youtube" |
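
For intuition only, downloading playlist audio and subtitles can be done with yt-dlp; the sketch below is a stand-alone illustration with assumed options, not the project's crawler (which additionally builds the ASR dataset file and applies the file prefix):

import yt_dlp

# Illustrative yt-dlp configuration; subtitle languages and output paths are assumptions.
ydl_opts = {
    "format": "bestaudio/best",
    "outtmpl": "youtube_data/%(id)s.%(ext)s",
    "writesubtitles": True,
    "writeautomaticsub": True,
    "subtitleslangs": ["zh-TW", "zh-Hant"],
    "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "wav"}],
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download(["YOUTUBE_PLAYLIST_URL"])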

🔧 Customization

  • 📊 Use different datasets by modifying the dataset_name parameter
  • 🛠️ Adjust PEFT methods via peft_method and configurations in src/config/train_config.py
  • 🔬 Optimize inference by modifying ChineseTaiwaneseASRInference in src/inference/flexible_inference.py

🧪 Testing

Run tests with pytest:

pytest tests/

For detailed output:

pytest -v tests/

Check test coverage:

pip install pytest-cov
pytest --cov=src tests/

💻 Performance Optimization

Baseline Performance

On a T4 GPU, without any acceleration methods:

  • Inference speed: 1:24 (1 minute of processing time can transcribe 24 minutes of audio)

This baseline gives you an idea of the default performance. Depending on your specific needs, you may want to optimize further or use acceleration techniques.

Optimization Techniques

To address memory issues or improve performance on T4 GPUs:

  1. 📉 Reduce batch size (--per_device_train_batch_size)
    • Decreases memory usage but may increase processing time
  2. 🔽 Use a smaller Whisper model (e.g., "openai/whisper-tiny")
    • Faster inference but potentially lower accuracy
  3. 📈 Increase gradient accumulation steps (--gradient_accumulation_steps)
    • Simulates larger batch sizes without increasing memory usage
  4. 🔀 Enable mixed precision training (--fp16)
    • Speeds up computation and reduces memory usage with minimal impact on accuracy

Advanced Optimization

For further performance improvements:

  1. 🚀 Use PEFT methods like LoRA
    • Significantly reduces memory usage and training time
  2. ⚡ Implement quantization (e.g., int8; see the sketch after this list)
    • Dramatically reduces model size and increases inference speed
  3. 🖥️ Utilize multi-GPU setups if available
    • Distributes computation for faster processing
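
As an example of point 2, Whisper can be loaded in 8-bit via bitsandbytes. This is a generic sketch (it assumes the bitsandbytes package and a CUDA GPU, and is not tied to this project's configuration):

from transformers import BitsAndBytesConfig, WhisperForConditionalGeneration

# Illustrative int8 loading; weights are quantized on the fly at load time.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-small",
    quantization_config=bnb_config,
    device_map="auto",
)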

Note: The actual performance may vary depending on your specific hardware, audio complexity, and chosen optimization techniques. Always benchmark your specific use case.

🔄 Streaming ASR Flow

Simplified Real-time Audio Transcription Pipeline

  1. Set Up: We prepare our system to listen and transcribe.
  2. Listen: We constantly listen for incoming audio.
  3. Check: When we get audio, we check if it contains speech.
  4. Process:
    • If there's speech, we transcribe it.
    • If not, we skip that part.
  5. Share: We immediately share what we found, whether it's words or silence.
  6. Repeat: We keep doing this until there's no more audio.
  7. Finish: When the audio ends, we wrap everything up and provide the final transcript (a simplified code sketch of this loop follows the diagrams below).

graph TD
    A[Start] --> B[Set Up System]
    B --> C{Listen for Audio}
    C -->|Audio Received| D[Check for Speech]
    D -->|Speech Found| E[Transcribe Audio]
    D -->|No Speech| F[Skip Transcription]
    E --> G[Output Result]
    F --> G
    G --> C
    C -->|No More Audio| H[Finish Up]
    H --> I[End]

Real-time Audio Transcription Pipeline

graph TD
    A[Start] --> B[Initialize Audio Stream]
    B --> C[Initialize ASR Model]
    C --> D[Initialize VAD Model]
    D --> E[Initialize Audio Buffer]
    E --> F[Initialize ThreadPoolExecutor]
    F --> G{Receive Audio Chunk}
    G -->|Yes| H[Add to Audio Buffer]
    H --> I{Buffer Full?}
    I -->|No| G
    I -->|Yes| J[Submit Chunk to ThreadPool]
    J --> K[Apply VAD]
    K --> L{Speech Detected?}
    L -->|No| O[Slide Buffer]
    L -->|Yes| M[Process Audio Chunk]
    M --> N[Generate Partial Transcription]
    N --> O
    O --> G
    G -->|No| P[Process Remaining Audio]
    P --> Q[Finalize Transcription]
    Q --> R[End]

    subgraph "Parallel Processing"
    J
    K
    L
    M
    N
    end
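
A simplified Python skeleton of the loop in the diagrams above. The energy-based check stands in for a real VAD model and transcribe_chunk is a placeholder for the actual Whisper call, so treat this as a sketch rather than the project's implementation:

import numpy as np

SAMPLE_RATE = 16000
CHUNK_SECONDS = 5          # sliding-window size fed to the model (assumed value)
ENERGY_THRESHOLD = 0.01    # placeholder VAD: low-energy chunks are treated as silence

def has_speech(chunk: np.ndarray) -> bool:
    # Stand-in for a real VAD model (e.g. Silero or WebRTC VAD).
    return float(np.mean(chunk ** 2)) > ENERGY_THRESHOLD

def transcribe_chunk(chunk: np.ndarray) -> str:
    # Placeholder for the Whisper forward pass on a 16 kHz audio chunk.
    return "<partial transcription>"

def stream_transcribe(audio_stream):
    """audio_stream yields small float32 numpy arrays sampled at 16 kHz."""
    buffer = np.zeros(0, dtype=np.float32)
    window = SAMPLE_RATE * CHUNK_SECONDS
    for piece in audio_stream:
        buffer = np.concatenate([buffer, piece])
        while len(buffer) >= window:                          # buffer full?
            chunk, buffer = buffer[:window], buffer[window:]  # slide the buffer
            yield transcribe_chunk(chunk) if has_speech(chunk) else ""
    if len(buffer) and has_speech(buffer):                    # process remaining audio
        yield transcribe_chunk(buffer)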

📊 Dataset Format

The Chinese/Taiwanese Whisper ASR project uses a specific format for its datasets to ensure compatibility with the training and inference scripts. The format can include or exclude timestamps, depending on the configuration.

Basic Structure

Each item in the dataset represents an audio file and its corresponding transcription:

{
    "audio": {
        "path": "path/to/audio/file.wav",
        "sampling_rate": 16000
    },
    "sentence": "The transcription of the audio in Chinese or Taiwanese.",
    "language": "zh-TW",  # or "zh-CN" for Mandarin, "nan" for Taiwanese, etc.
    "duration": 10.5  # Duration of the audio in seconds
}

Transcription Format Examples

Without Timestamps

labels:
<|startoftranscript|><|zh|><|transcribe|><|notimestamps|>地圖炮<|endoftext|>

In this example:

  • <|startoftranscript|>: Marks the beginning of the transcription
  • <|zh|>: Indicates the language (Chinese)
  • <|transcribe|>: Denotes that this is a transcription task
  • <|notimestamps|>: Indicates that no timestamps are included
  • 地圖炮: The actual transcription
  • <|endoftext|>: Marks the end of the transcription

With Timestamps

labels:
<|startoftranscript|><|zh|><|transcribe|><|0.00|>而對樓市成交抑制作用最大的限購<|6.00|><|endoftext|>

In this example:

  • <|startoftranscript|>, <|zh|>, and <|transcribe|>: Same as above
  • <|0.00|>: Timestamp indicating the start of the transcription (0 seconds)
  • 而對樓市成交抑制作用最大的限購: The actual transcription
  • <|6.00|>: Timestamp indicating the end of the transcription (6 seconds)
  • <|endoftext|>: Marks the end of the transcription

Notes

  • The choice between using timestamps or not should be consistent throughout your dataset and should match the use_timestamps parameter in your training and inference scripts.

Preparing Your Own Dataset

If you're preparing your own dataset:

  1. Organize your audio files and transcriptions.
  2. Ensure each transcription includes the appropriate tokens (<|startoftranscript|>, <|zh|>, etc.).
  3. If using timestamps, include them in the format <|seconds.decimals|> before each segment of transcription.
  4. Use <|notimestamps|> if not including timestamp information.
  5. Always end the transcription with <|endoftext|>.

By following this format, you ensure that your dataset is compatible with the Chinese/Taiwanese Whisper ASR system, allowing for efficient training and accurate inference.
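
A small helper showing how such label strings can be assembled for your own data; the function and its segment format are illustrative, not the project's tokenizer code:

def build_whisper_labels(segments, language="zh", use_timestamps=True):
    """segments: list of (start_seconds, end_seconds, text) tuples."""
    labels = f"<|startoftranscript|><|{language}|><|transcribe|>"
    if use_timestamps:
        for start, end, text in segments:
            labels += f"<|{start:.2f}|>{text}<|{end:.2f}|>"
    else:
        labels += "<|notimestamps|>" + "".join(text for _, _, text in segments)
    return labels + "<|endoftext|>"

# Reproduces the two examples above:
print(build_whisper_labels([(0.0, 6.0, "而對樓市成交抑制作用最大的限購")]))
print(build_whisper_labels([(0.0, 0.0, "地圖炮")], use_timestamps=False))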

๐ŸŒ FastAPI Usage

🚀 Launching the API

  1. Development Mode:

    fastapi dev api_main.py
    
  2. Production Mode:

    fastapi run api_main.py
    

The API will be accessible at http://0.0.0.0:8000 by default.

๐Ÿณ Using Docker

  1. Build and start the Docker container:
    bash app/docker.sh
    

๐Ÿ” API Documentation

Access the Swagger UI documentation at http://localhost:8000/docs when the server is running.

🛠️ Using curl to Interact with the API

  1. Health Check:

    curl -k http://localhost:8000/health
    
  2. Transcribe Audio:

    curl -k -X POST -H "Content-Type: multipart/form-data" -F "file=@/path/to/your/audio/file.wav" http://localhost:8000/transcribe
    

    Replace /path/to/your/audio/file.wav with the actual path to your audio file.

  3. List All Transcriptions:

    curl -k http://localhost:8000/transcriptions
    
  4. Get a Specific Transcription:

    curl -k http://localhost:8000/transcription/{transcription_id}
    

    Replace {transcription_id} with the actual UUID of the transcription.

  5. Delete a Transcription:

    curl -k -X DELETE http://localhost:8000/transcription/{transcription_id}
    

    Replace {transcription_id} with the UUID of the transcription you want to delete.
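
The same endpoints can also be called from Python with the requests library; this is a sketch that follows the routes and the "file" form field shown in the curl examples above:

import requests

BASE_URL = "http://localhost:8000"

# Health check
print(requests.get(f"{BASE_URL}/health").json())

# Transcribe an audio file (multipart upload, form field "file" as in the curl example)
with open("path/to/your/audio/file.wav", "rb") as audio_file:
    response = requests.post(f"{BASE_URL}/transcribe", files={"file": audio_file})
print(response.json())

# List all transcriptions; fetching or deleting one by its UUID follows the same pattern:
#   requests.get(f"{BASE_URL}/transcription/{transcription_id}")
#   requests.delete(f"{BASE_URL}/transcription/{transcription_id}")
print(requests.get(f"{BASE_URL}/transcriptions").json())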

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • OpenAI for the Whisper model
  • Hugging Face for the Transformers library
  • Mozilla Common Voice for the dataset