Course project for EE698R (2020-21 Sem 2). An x-vector based speaker diarization system with an autoencoder-based clustering method. Spectral and KMeans clustering methods are also supported.
GPL-3.0 License
Team Name: TensorSlow
Members: Aditya Singh (@adityajaas) and Shashi Kant Gupta (@shashikg)
Report: Available here. It contains 4 pages of main text, 1 page of references, and 2 pages of supplementary material.
Speaker diarization has received significant interest within the speech community because of its promise to considerably improve automatic speech transcription. Commonly used approaches to this problem cluster embedding vectors such as d-vectors, i-vectors, or x-vectors with Spectral Clustering. We propose using Unsupervised Deep Embedding Clustering with pre-trained Auto Encoders to cluster the data in a more semantically meaningful latent representation, which improves the separation of imbalanced data. Stacked layers of Auto Encoders are trained in a residual fashion, in place of De-noising Auto Encoders, for enhanced learning. We test our model on the VoxConverse and AMI Corpus split datasets. Our model shows considerable improvement over the Spectral Clustering approach. Clustering is performed on x-vectors extracted using Desplanques et al.'s ECAPA-TDNN framework. We use Silero-VAD for voice activity detection.
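To make the Deep Embedding Clustering idea more concrete, here is a minimal sketch of its two core quantities as they are usually defined in the DEC literature: a soft assignment of each latent point to a cluster centroid (Student's t kernel) and a sharpened target distribution. This is an illustration only, not the code of this repository. The function names `soft_assign` and `target_distribution`, the toy latent vectors, and the choice `alpha=1.0` are assumptions, and the project's own implementation may differ.

```python
def soft_assign(latents, centroids, alpha=1.0):
    """Soft cluster assignment q[i][j] using a Student's t kernel (hypothetical helper)."""
    q = []
    for z in latents:
        row = []
        for mu in centroids:
            sq_dist = sum((a - b) ** 2 for a, b in zip(z, mu))
            row.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([v / total for v in row])  # each row sums to 1
    return q


def target_distribution(q):
    """Sharpened targets p[i][j] proportional to q[i][j]^2 / (soft cluster size)."""
    cluster_freq = [sum(col) for col in zip(*q)]
    p = []
    for row in q:
        weights = [v * v / f for v, f in zip(row, cluster_freq)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


# Toy 2-D latent points (made up for illustration): two near each centroid.
latents = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
centroids = [[0.0, 0.0], [1.0, 1.0]]

q = soft_assign(latents, centroids)
p = target_distribution(q)
# Hard labels are the most likely cluster for each point.
labels = [row.index(max(row)) for row in q]
print(labels)
```

In DEC-style training, the encoder and centroids are typically updated to reduce the KL divergence between `p` and `q`, which pushes points toward their most confident cluster. Because `p` squares `q` and divides by the soft cluster size, large clusters are discouraged from dominating the assignment.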
The model is tested on the VoxConverse dataset (216 audio files in total). We randomly split the dataset into train and test parts, with the test part containing 50 audio files. We also tested the model on the AMI test dataset (16 audio files in total).
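A random split like the one described above can be sketched as follows. The file names, the seed, and the helper `split_files` are placeholders chosen for illustration; the actual split used for the reported results is not reproduced here.

```python
import random


def split_files(file_ids, n_test, seed=0):
    """Shuffle file IDs with a fixed seed and hold out n_test of them (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(file_ids)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder IDs standing in for the 216 VoxConverse audio files.
file_ids = [f"audio_{i:03d}" for i in range(216)]
train_files, test_files = split_files(file_ids, n_test=50)
print(len(train_files), len(test_files))
```

Fixing the seed keeps the split reproducible across runs, so results on the held-out files can be compared between clustering methods.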
Method | DER (%) |
---|---|
Spectral Clustering | 17.76 |
Ours | 12.99 |
Spectral Clustering (Oracle VAD) | 17.98 |
Ours (Oracle VAD) | 11.70 |
Method | DER (%) |
---|---|
Spectral Clustering | 21.99 |
Ours | 23.39 |
Spectral Clustering (Oracle VAD) | 14.96 |
Ours (Oracle VAD) | 13.14 |
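For readers unfamiliar with the metric, DER (diarization error rate) is conventionally the sum of false-alarm, missed-speech, and speaker-confusion durations divided by the total reference speech duration. The sketch below computes it from made-up durations. The function name and the numbers are assumptions for illustration, and scoring details such as collar size or overlap handling are not specified here.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_reference):
    """Return DER in percent from error durations given in seconds (hypothetical helper)."""
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return 100.0 * (false_alarm + missed + confusion) / total_reference


# Made-up durations in seconds, not taken from the results above.
der = diarization_error_rate(
    false_alarm=12.0, missed=18.0, confusion=30.0, total_reference=600.0
)
print(f"DER = {der:.2f}%")
```

With an oracle VAD, the false-alarm and missed-speech terms come only from overlapped-speech handling, so the score mostly reflects speaker confusion from the clustering step.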
Original Video Link: here
Diarization Output Link: here
Documentation and details about the functions inside the core module.