Self-Supervised Speech Pre-training and Representation Learning Toolkit
APACHE-2.0 License
We prefer to have discussions directly on the GitHub issue page, so that all the information is transparent to all contributors and is automatically archived on GitHub. If you wish to use email, please contact:
Please refer to the legacy citation of S3PRL and the timeline below, which justify our initiative on this project. This information is used to protect us from half-truths. We encourage you to cite the individual papers most related to the functions you are using, to give fair credit to their developers. You can find the names in the Change Log. Finally, we would like to thank our advisor, Prof. Hung-yi Lee, for his advice. The project would be impossible without his support.
If you have any questions (e.g., about who came up with or developed which ideas or functions, or how the project started), feel free to engage in an open and responsible conversation on the GitHub issue page, and we'll be happy to help!
Guideline
Tutorials
We support the following environments. The test cases are run with tox locally and on GitHub Actions:
Env | versions |
---|---|
os | ubuntu-18.04, ubuntu-20.04 |
python | 3.7, 3.8, 3.9, 3.10 |
pytorch | 1.8.1, 1.9.1, 1.10.2, 1.11.0, 1.12.1, 1.13.1, 2.0.1, 2.1.0 |
We only list the major contributors here for conciseness. However, we are deeply grateful for all the contributions. Please see the Contributors page for the full list.
This is an open source toolkit called s3prl, which stands for Self-Supervised Speech Pre-training and Representation Learning. Self-supervised speech pre-trained models are called upstream in this toolkit, and are utilized in various downstream tasks.
The toolkit has three major usages:
Here is a high-level illustration of how S3PRL might help you. Our GitHub codebase supports applying numerous SSL representations to numerous speech processing tasks:
We also modularize all the SSL models into a standalone PyPI package, so that you can easily install and use them without depending on our entire codebase. The following shows a simple example; you can find more details in our documentation.
pip install s3prl
import torch
from s3prl.nn import S3PRLUpstream

model = S3PRLUpstream("hubert")
model.eval()

with torch.no_grad():
    wavs = torch.randn(2, 16000 * 2)
    wavs_len = torch.LongTensor([16000 * 1, 16000 * 2])
    all_hs, all_hs_len = model(wavs, wavs_len)

for hs, hs_len in zip(all_hs, all_hs_len):
    assert isinstance(hs, torch.FloatTensor)
    assert isinstance(hs_len, torch.LongTensor)

    batch_size, max_seq_len, hidden_size = hs.shape
    assert hs_len.dim() == 1
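Each `hs` returned above is padded to `max_seq_len`, and `hs_len` gives the number of valid frames per utterance. As a minimal sketch of how the two might be combined, the snippet below averages each utterance over its valid frames only. The helper name `masked_mean_pool` is hypothetical (it is not part of the s3prl package), and the tensors are random stand-ins with illustrative shapes rather than real model outputs.

```python
import torch


def masked_mean_pool(hs: torch.Tensor, hs_len: torch.Tensor) -> torch.Tensor:
    """Average each sequence over its valid (non-padded) frames.

    hs:     (batch_size, max_seq_len, hidden_size) padded hidden states
    hs_len: (batch_size,) number of valid frames per sequence
    """
    max_seq_len = hs.size(1)
    # mask[b, t] is True when frame t is a real frame of sequence b
    positions = torch.arange(max_seq_len, device=hs.device).unsqueeze(0)
    mask = positions < hs_len.unsqueeze(1)
    mask = mask.unsqueeze(-1).to(hs.dtype)  # (batch_size, max_seq_len, 1)

    summed = (hs * mask).sum(dim=1)          # (batch_size, hidden_size)
    counts = mask.sum(dim=1).clamp(min=1)    # (batch_size, 1), avoids divide-by-zero
    return summed / counts


# Stand-in tensors with the same layout as one (hs, hs_len) pair;
# the sizes 99 and 768 are illustrative assumptions.
hs = torch.randn(2, 99, 768)
hs_len = torch.LongTensor([49, 99])
pooled = masked_mean_pool(hs, hs_len)  # (batch_size, hidden_size)
```

With a real model, the same function could be applied to any `(hs, hs_len)` pair from `zip(all_hs, all_hs_len)` to obtain one utterance-level vector per layer.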
With this modularization, we have achieved close integration with the general speech processing toolkit ESPnet, enabling the use of SSL models for a broader range of speech processing tasks and corpora to achieve state-of-the-art (SOTA) results (kudos to the ESPnet Team):
You can start the journey of SSL with the following entry points:
Feel free to use or modify our toolkit in your research. Here is a list of papers using our toolkit. Questions, bug reports, and improvement suggestions are welcome; please open a new issue.
If you find this toolkit helpful to your research, please do consider citing our papers, thanks!
pip install -e ".[all]"
README.md under each upstream folder. E.g., upstream/pase/README.md
The majority of S3PRL Toolkit is licensed under the Apache License version 2.0; however, all files authored by Facebook, Inc. (which have an explicit copyright statement at the top) are licensed under CC-BY-NC.
@article{mockingjay,
title={Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders},
ISBN={9781509066315},
url={http://dx.doi.org/10.1109/ICASSP40776.2020.9054458},
DOI={10.1109/icassp40776.2020.9054458},
journal={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
publisher={IEEE},
author={Liu, Andy T. and Yang, Shu-wen and Chi, Po-Han and Hsu, Po-chun and Lee, Hung-yi},
year={2020},
month={May}
}
@misc{tera,
title={TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech},
author={Andy T. Liu and Shang-Wen Li and Hung-yi Lee},
year={2020},
eprint={2007.06028},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
@inproceedings{audio_albert,
title={Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation},
author={Po-Han Chi and Pei-Hung Chung and Tsung-Han Wu and Chun-Cheng Hsieh and Shang-Wen Li and Hung-yi Lee},
year={2020},
booktitle={SLT 2020},
}
@inproceedings{understanding_sat,
author={Shu-wen Yang and Andy T. Liu and Hung-yi Lee},
title={{Understanding Self-Attention of Self-Supervised Audio Transformers}},
year=2020,
booktitle={Proc. Interspeech 2020},
pages={3785--3789},
doi={10.21437/Interspeech.2020-2231},
url={http://dx.doi.org/10.21437/Interspeech.2020-2231}
}
Defense for Black-box Attacks on Anti-spoofing Models by Self-Supervised Learning (Wu et al., 2020), code for computing LNSR: utility/observe_lnsr.py
@inproceedings{mockingjay_defense,
author={Haibin Wu and Andy T. Liu and Hung-yi Lee},
title={{Defense for Black-Box Attacks on Anti-Spoofing Models by Self-Supervised Learning}},
year=2020,
booktitle={Proc. Interspeech 2020},
pages={3780--3784},
doi={10.21437/Interspeech.2020-2026},
url={http://dx.doi.org/10.21437/Interspeech.2020-2026}
}
@misc{asv_ssl,
title={Adversarial defense for automatic speaker verification by cascaded self-supervised learning models},
author={Haibin Wu and Xu Li and Andy T. Liu and Zhiyong Wu and Helen Meng and Hung-yi Lee},
year={2021},
eprint={2102.07047},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
@misc{s2vc,
title={S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations},
author={Jheng-hao Lin and Yist Y. Lin and Chung-Ming Chien and Hung-yi Lee},
year={2021},
eprint={2104.02901},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
SUPERB: Speech processing Universal PERformance Benchmark (Yang et al., 2021)
@misc{superb,
title={SUPERB: Speech processing Universal PERformance Benchmark},
author={Shu-wen Yang and Po-Han Chi and Yung-Sung Chuang and Cheng-I Jeff Lai and Kushal Lakhotia and Yist Y. Lin and Andy T. Liu and Jiatong Shi and Xuankai Chang and Guan-Ting Lin and Tzu-Hsien Huang and Wei-Cheng Tseng and Ko-tik Lee and Da-Rong Liu and Zili Huang and Shuyan Dong and Shang-Wen Li and Shinji Watanabe and Abdelrahman Mohamed and Hung-yi Lee},
year={2021},
eprint={2105.01051},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Utilizing Self-supervised Representations for MOS Prediction (Tseng et al., 2021)
@misc{ssr_mos,
title={Utilizing Self-supervised Representations for MOS Prediction},
author={Wei-Cheng Tseng and Chien-yu Huang and Wei-Tsung Kao and Yist Y. Lin and Hung-yi Lee},
year={2021},
eprint={2104.03017},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
If you find this toolkit useful, please consider citing the following papers.
@misc{tera,
title={TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech},
author={Andy T. Liu and Shang-Wen Li and Hung-yi Lee},
year={2020},
eprint={2007.06028},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
@article{mockingjay,
title={Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders},
ISBN={9781509066315},
url={http://dx.doi.org/10.1109/ICASSP40776.2020.9054458},
DOI={10.1109/icassp40776.2020.9054458},
journal={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
publisher={IEEE},
author={Liu, Andy T. and Yang, Shu-wen and Chi, Po-Han and Hsu, Po-chun and Lee, Hung-yi},
year={2020},
month={May}
}
@article{yang2024large,
title={A Large-Scale Evaluation of Speech Foundation Models},
author={Yang, Shu-wen and Chang, Heng-Jui and Huang, Zili and Liu, Andy T and Lai, Cheng-I and Wu, Haibin and Shi, Jiatong and Chang, Xuankai and Tsai, Hsiang-Sheng and Huang, Wen-Chin and others},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
year={2024},
publisher={IEEE}
}
@inproceedings{yang21c_interspeech,
author={Shu-wen Yang and Po-Han Chi and Yung-Sung Chuang and Cheng-I Jeff Lai and Kushal Lakhotia and Yist Y. Lin and Andy T. Liu and Jiatong Shi and Xuankai Chang and Guan-Ting Lin and Tzu-Hsien Huang and Wei-Cheng Tseng and Ko-tik Lee and Da-Rong Liu and Zili Huang and Shuyan Dong and Shang-Wen Li and Shinji Watanabe and Abdelrahman Mohamed and Hung-yi Lee},
title={{SUPERB: Speech Processing Universal PERformance Benchmark}},
year=2021,
booktitle={Proc. Interspeech 2021},
pages={1194--1198},
doi={10.21437/Interspeech.2021-1775}
}