Data manipulation and transformation for audio signal processing, powered by PyTorch
BSD-2-CLAUSE License
This release is compatible with PyTorch 2.4. There are no new features added.
This release contains 2 fixes:
Published by atalman 4 months ago
This release is compatible with PyTorch 2.3.1 patch release. There are no new features added.
Published by ahmadsharif1 6 months ago
This release is compatible with PyTorch 2.3.0 patch release. There are no new features added.
This release contains minor documentation and code quality improvements (#3734, #3748, #3757, #3759)
Published by atalman 7 months ago
This release is compatible with PyTorch 2.2.2 patch release. There are no new features added.
Published by atalman 8 months ago
This release is compatible with PyTorch 2.2.1 patch release. There are no new features added.
Published by mthrok 9 months ago
torio: a new top-level module dedicated to core I/O operations (https://github.com/pytorch/audio/pull/3676, https://github.com/pytorch/audio/pull/3680, https://github.com/pytorch/audio/pull/3681, https://github.com/pytorch/audio/pull/3682). Please refer to https://pytorch.org/audio/2.2.0/torio.html for the details.
Published by huydhn 10 months ago
This is a patch release, which is compatible with PyTorch 2.1.2. There are no new features added.
Published by mthrok 11 months ago
This is a minor release, which is compatible with PyTorch 2.1.1 and includes bug fixes, improvements and documentation updates.
Published by mthrok about 1 year ago
TorchAudio v2.1 introduces the following new features and backward-incompatible changes:
torchaudio.io.AudioEffector can apply filters, effects and encodings to waveforms in an online/offline fashion.
torchaudio.functional.forced_align computes alignment from an emission, and torchaudio.pipelines.MMS_FA provides access to the model trained for multilingual forced alignment in the MMS: Scaling Speech Technology to 1000+ Languages project. Please refer to the documentation for the usage of the forced_align function, and to https://pytorch.org/audio/2.1/tutorials/forced_alignment_for_multilingual_data_tutorial.html for how one can use MMS_FA to align transcripts in multiple languages.
torchaudio.pipelines.SQUIM_SUBJECTIVE and torchaudio.pipelines.SQUIM_OBJECTIVE models estimate various speech quality and intelligibility metrics. This is helpful when evaluating the quality of speech generation models, such as TTS.
torchaudio.models.decoder.CUCTCDecoder takes emissions stored in CUDA memory and performs CTC beam search on them on the CUDA device. The beam search is fast; it eliminates the need to move data from the CUDA device to the CPU when performing automatic speech recognition. With PyTorch's CUDA support, it is now possible to perform the entire speech recognition pipeline in CUDA.
torchaudio.io.StreamWriter (#3135)
torchaudio.io.StreamReader.get_out_stream_info (#3155)
torchaudio.io.StreamReader filter graph (#3183, #3479)
torchaudio.io.StreamWriter (#3194)
torchaudio.io.StreamReader (#3216)
torchaudio.io.StreamWriter (#3207)
420p10le support to torchaudio.io.StreamReader CPU decoder (#3332)
torchaudio.io.AudioEffector (#3163, #3372, #3374)
torchaudio.transforms.SpecAugment (#3309, #3314)
torchaudio.functional.forced_align (#3348, #3355, #3533, #3536, #3354, #3365, #3433, #3357)
torchaudio.functional.merge_tokens (#3535, #3614)
torchaudio.functional.frechet_distance (#3545)
torchaudio.models.SquimObjective for speech enhancement (#3042, #3087, #3512)
torchaudio.models.SquimSubjective for speech enhancement (#3189)
torchaudio.models.decoder.CUCTCDecoder (#3096)
torchaudio.pipelines.SquimObjectiveBundle for speech enhancement (#3103)
torchaudio.pipelines.SquimSubjectiveBundle for speech enhancement (#3197)
torchaudio.pipelines.MMS_FA bundle for forced alignment (#3521, #3538)
torchaudio.io.AudioEffector (#3226)
torchaudio.models.decoder.CUCTCDecoder (#3297)
In this release, the following third-party libraries are removed from TorchAudio binary distributions. TorchAudio now searches for and links these libraries at runtime. Please install them to use the corresponding APIs.
libsox is used for various audio I/O and filtering operations.
Pre-built binaries are available via package managers, such as conda, apt and brew. Please refer to the respective documentation.
The APIs affected include:
torchaudio.load ("sox" backend)
torchaudio.info ("sox" backend)
torchaudio.save ("sox" backend)
torchaudio.sox_effects.apply_effects_tensor
torchaudio.sox_effects.apply_effects_file
torchaudio.functional.apply_codec (also deprecated, see below)
Changes related to the removal: #3232, #3246, #3497, #3035
flashlight-text is the core of the CTC decoder.
Pre-built packages are available on PyPI. Please refer to https://github.com/flashlight/text for the details.
The APIs affected include:
torchaudio.models.decoder.CTCDecoder
Changes related to the removal: #3232, #3246, #3236, #3339
A custom-built libkaldi was used to implement torchaudio.functional.compute_kaldi_pitch. This function, along with the libkaldi integration, is removed in this release. There is no replacement.
Changes related to the removal: #3368, #3403
To make I/O operations more flexible, TorchAudio introduced the backend dispatcher in v2.0, and users could opt-in to use the dispatcher.
In this release, the backend dispatcher becomes the default mechanism for selecting the I/O backend.
You can pass the backend argument to the torchaudio.info, torchaudio.load and torchaudio.save functions to select the I/O backend library on a per-call basis. (If it is omitted, an available backend is automatically selected.)
If you want to use the global backend mechanism, you can set the environment variable TORCHAUDIO_USE_BACKEND_DISPATCHER=0.
Please note, however, that the global backend mechanism is deprecated and is going to be removed in the next release.
Please see #2950 for the detail of migration work.
Previously, torchaudio.io.StreamReader accepted a byte string wrapped in a 1D torch.Tensor object. This is no longer supported.
Please wrap the underlying data with io.BytesIO instead.
The optional arguments of add_[audio|video]_stream
methods of torchaudio.io.StreamReader
and torchaudio.io.StreamWriter
are now keyword-only arguments.
Previously TorchAudio supported FFmpeg 4 (>=4.1, <=4.4). In this release, TorchAudio supports FFmpeg 4, 5 and 6 (>=4.4, <7). With this change, support for FFmpeg 4.1, 4.2 and 4.3 is dropped.
torchaudio.functional.apply_codec (#3397)
In previous versions, TorchAudio shipped a custom-built libsox so that it could perform in-memory decoding and encoding.
Now, in-memory decoding and encoding are handled by the FFmpeg binding, and with the switch to dynamic libsox linking, torchaudio.functional.apply_codec no longer processes audio in an in-memory fashion. Instead, it writes to a temporary file.
For in-memory processing, please use torchaudio.io.AudioEffector
.
lstsq when solving InverseMelScale (#3280)
Previously, torchaudio.transforms.InverseMelScale ran an SGD optimizer to find the inverse of the mel-scale transform. This approach has a number of issues, as listed in #2643.
This release switches to torch.linalg.lstsq.
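The switch can be illustrated with a small numpy sketch of the underlying idea: a mel spectrogram is a linear map of the linear-frequency spectrogram (melspec = fb.T @ spec for a filterbank matrix fb), so the inverse can be posed as a least-squares problem. This is a hypothetical illustration, not torchaudio's implementation:

```python
import numpy as np

# Hypothetical sizes; fb stands in for a (n_freq, n_mels) mel filterbank matrix.
rng = np.random.default_rng(0)
n_freq, n_mels, n_frames = 16, 8, 4
fb = np.abs(rng.standard_normal((n_freq, n_mels)))
spec = np.abs(rng.standard_normal((n_freq, n_frames)))
melspec = fb.T @ spec  # forward mel-scale transform

# One lstsq call replaces the iterative SGD search used previously.
est, *_ = np.linalg.lstsq(fb.T, melspec, rcond=None)
print(np.allclose(fb.T @ est, melspec))  # True: zero residual in the mel domain
```

The closed-form solve avoids the convergence and reproducibility issues of the SGD approach described in #2643.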
The infer
method of torchaudio.models.RNNTBeamSearch
has been updated to accept series of previous hypotheses.
bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH
decoder: RNNTBeamSearch = bundle.get_decoder()
hypothesis = None
while streaming:
    ...
    hypo, state = decoder.infer(
        features,
        length,
        beam_width,
        state=state,
        hypothesis=hypothesis,
    )
    ...
    hypothesis = hypo
    # Previously this had to be hypothesis = hypo[0]
torchaudio.functional.apply_codec function (#3386)
Due to the removal of the custom libsox binding, torchaudio.functional.apply_codec no longer supports in-memory processing. Please migrate to torchaudio.io.AudioEffector.
Please refer to the documentation for the detailed usage of torchaudio.io.AudioEffector.
get_trellis in forced alignment tutorial (#3172)
torchaudio.io.StreamWriter (#3373)
lfilter (#3432)
torchaudio.io.StreamWriter is not opened (#3152)
torchaudio.io.StreamReader (#3157, #3170, #3186, #3184, #3188, #3320, #3296, #3328, #3419, #3209)
torchaudio.io.StreamWriter (#3205, #3319, #3296, #3328, #3426, #3428)
n_fft (#3442)
torch.norm to torch.linalg.vector_norm (#3522)
torch.nn.utils.weight_norm to nn.utils.parametrizations.weight_norm (#3523)
Published by mthrok over 1 year ago
This is a minor release, which is compatible with PyTorch 2.0.1 and includes bug fixes, improvements and documentation updates. There is no new feature added.
Full Changelog: https://github.com/pytorch/audio/compare/v2.0.1...v2.0.2
Published by xiaohui-zhang over 1 year ago
TorchAudio 2.0 release includes:
info, load, save functions
The release adds several data augmentation operators under torchaudio.functional and torchaudio.transforms:
torchaudio.functional.add_noise
torchaudio.functional.convolve
torchaudio.functional.deemphasis
torchaudio.functional.fftconvolve
torchaudio.functional.preemphasis
torchaudio.functional.speed
torchaudio.transforms.AddNoise
torchaudio.transforms.Convolve
torchaudio.transforms.Deemphasis
torchaudio.transforms.FFTConvolve
torchaudio.transforms.Preemphasis
torchaudio.transforms.Speed
torchaudio.transforms.SpeedPerturbation
The operators can be used to synthetically diversify training data to improve the generalizability of downstream models.
For usage details, please refer to the documentation for torchaudio.functional
and torchaudio.transforms
, and tutorial “Audio Data Augmentation”.
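As a concrete illustration of two of these operators: pre-emphasis applies y[n] = x[n] - coeff * x[n-1] and de-emphasis inverts it. The numpy sketch below shows the idea; it is a simplified illustration, not torchaudio's implementation:

```python
import numpy as np

def preemphasis(x, coeff=0.97):
    # y[n] = x[n] - coeff * x[n-1]; y[0] = x[0]
    y = x.copy()
    y[1:] -= coeff * x[:-1]
    return y

def deemphasis(y, coeff=0.97):
    # Inverse recursion: x[n] = y[n] + coeff * x[n-1]
    x = np.empty_like(y)
    x[0] = y[0]
    for n in range(1, len(y)):
        x[n] = y[n] + coeff * x[n - 1]
    return x

x = np.sin(np.linspace(0, 8 * np.pi, 64))
roundtrip = deemphasis(preemphasis(x))
print(np.allclose(roundtrip, x))  # True: de-emphasis undoes pre-emphasis
```

Pre-emphasis boosts high frequencies before feature extraction; the round trip above simply checks that the pair is mutually inverse.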
The release adds two self-supervised learning models for speech and audio.
Besides the model architectures, torchaudio also supports corresponding pre-trained pipelines:
torchaudio.pipelines.WAVLM_BASE
torchaudio.pipelines.WAVLM_BASE_PLUS
torchaudio.pipelines.WAVLM_LARGE
torchaudio.pipelines.WAV2VEC_XLSR_300M
torchaudio.pipelines.WAV2VEC_XLSR_1B
torchaudio.pipelines.WAV2VEC_XLSR_2B
For usage details, please refer to factory function
and pre-trained pipelines
documentation.
Release 2.0 introduces new versions of I/O functions torchaudio.info
, torchaudio.load
and torchaudio.save
, backed by a dispatcher that allows for selecting one of backends FFmpeg, SoX, and SoundFile to use, subject to library availability. Users can enable the new logic in Release 2.0 by setting the environment variable TORCHAUDIO_USE_BACKEND_DISPATCHER=1
; the new logic will be enabled by default in Release 2.1.
# Fetch metadata using FFmpeg
metadata = torchaudio.info("test.wav", backend="ffmpeg")
# Load audio (with no backend parameter value provided, function prioritizes using FFmpeg if it is available)
waveform, rate = torchaudio.load("test.wav")
# Write audio using SoX
torchaudio.save("out.wav", waveform, rate, backend="sox")
Please see the documentation for torchaudio
for more details.
Dropped Python 3.7 support (#3020)
Following the upstream PyTorch (https://github.com/pytorch/pytorch/pull/93155), the support for Python 3.7 has been dropped.
Default to "precise" seek in torchaudio.io.StreamReader.seek
(#2737, #2841, #2915, #2916, #2970)
Previously, the StreamReader.seek method sought the key frame closest to the given timestamp. A new option mode has been added which can switch the behavior to seeking to any type of frame, including non-key frames, that is closest to the given timestamp; this behavior is now the default.
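The difference between the two behaviors can be illustrated with a toy timeline (hypothetical frame layout; not StreamReader's actual decoder logic):

```python
# Toy frame layout: frames every 0.5 s, key frames every 2 s.
frames = [i * 0.5 for i in range(10)]           # presentation timestamps
key_frames = [t for t in frames if t % 2 == 0]  # [0.0, 2.0, 4.0]

def seek(ts, mode):
    # "key": land on the nearest key frame; "precise": nearest frame of any type.
    pool = key_frames if mode == "key" else frames
    return min(pool, key=lambda t: abs(t - ts))

print(seek(1.3, mode="key"))      # 2.0 -- nearest key frame
print(seek(1.3, mode="precise"))  # 1.5 -- nearest frame of any type
```

Precise seek is more accurate but can be slower, since the decoder may have to decode forward from the preceding key frame.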
Removed deprecated/unused/undocumented functions from datasets.utils (#2926, #2927)
The following functions are removed from datasets.utils: stream_url, download_url, validate_file and extract_archive.
Deprecated 'onesided' init param for MelSpectrogram (#2797, #2799)
torchaudio.transforms.MelSpectrogram
assumes the onesided
argument to be always True
. The forward path fails if its value is False
. Therefore this argument is deprecated. Users specifying this argument should stop specifying it.
Deprecated "sinc_interpolation"
and "kaiser_window"
option value in favor of "sinc_interp_hann"
and "sinc_interp_kaiser"
(#2922)
The valid values of resampling_method
argument of resampling operations (torchaudio.transforms.Resample
and torchaudio.functional.resample
) are changed. "kaiser_window"
is now "sinc_interp_kaiser"
and "sinc_interpolation"
is "sinc_interp_hann"
. The old values will continue to work, but users are encouraged to update their code.
For the reasoning behind this change, please refer to #2891.
Deprecated sox initialization/shutdown public API functions (#3010)
torchaudio.sox_effects.init_sox_effects
and torchaudio.sox_effects.shutdown_sox_effects
are deprecated. They were required to use libsox-related features, but have been called automatically since v0.6, and the initialization/shutdown mechanism has been moved elsewhere. These functions are now no-ops. Users can simply remove calls to them.
torchaudio.load, torchaudio.info and torchaudio.save.
torchaudio.sox_effects.apply_effects_file and torchaudio.functional.apply_codec.
torchaudio.io.StreamReader supports decoding media from byte strings contained in 1D tensors of torch.uint8 type. Using the torch.Tensor type as a container for a byte string is now deprecated. To pass byte strings, please wrap the string with io.BytesIO.
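A minimal sketch of the migration, using a hypothetical byte payload (the StreamReader call itself is indicated only in a comment):

```python
import io

# Hypothetical in-memory media payload (stand-in for real encoded audio bytes).
data = b"RIFF....WAVEfmt "

# Deprecated: wrapping the byte string in a 1D torch.uint8 tensor, e.g.
#   tensor = torch.frombuffer(data, dtype=torch.uint8)

# Preferred: wrap the bytes in a file-like object instead.
src = io.BytesIO(data)
print(src.read(4))  # b'RIFF' -- behaves like a file opened in binary mode
# src can then be passed where torchaudio.io.StreamReader expects a source.
```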
torchaudio.functional.lfilter (#3080)
Without the change in #2873, the WER results are:
Model | dev-clean | dev-other | test-clean | test-other |
---|---|---|---|---|
WAV2VEC2_ASR_LARGE_LV60K_10M | 10.59 | 15.62 | 9.58 | 16.33 |
WAV2VEC2_ASR_LARGE_LV60K_100H | 2.80 | 6.01 | 2.82 | 6.34 |
WAV2VEC2_ASR_LARGE_LV60K_960H | 2.36 | 4.43 | 2.41 | 4.96 |
HUBERT_ASR_LARGE | 1.85 | 3.46 | 2.09 | 3.89 |
HUBERT_ASR_XLARGE | 2.21 | 3.40 | 2.26 | 4.05 |
After applying layer normalization, the updated WER results are:
Model | dev-clean | dev-other | test-clean | test-other |
---|---|---|---|---|
WAV2VEC2_ASR_LARGE_LV60K_10M | 6.77 | 10.03 | 6.87 | 10.51 |
WAV2VEC2_ASR_LARGE_LV60K_100H | 2.19 | 4.55 | 2.32 | 4.64 |
WAV2VEC2_ASR_LARGE_LV60K_960H | 1.78 | 3.51 | 2.03 | 3.68 |
HUBERT_ASR_LARGE | 1.77 | 3.32 | 2.03 | 3.68 |
HUBERT_ASR_XLARGE | 1.73 | 2.72 | 1.90 | 3.16 |
When shuffle is set to True in BucketizeBatchSampler, the seed is only the same for the first epoch. In later epochs, each BucketizeBatchSampler object will generate a different shuffled iteration list, which may cause DDP training to hang forever if the lengths of the iteration lists differ across nodes. In the 2.0.0 release, the issue is fixed by using the same seed for the RNG in all nodes.
_fail_info_fileobj (#3032) torchaudio.io.StreamReader.
torchaudio.functional.lfilter (#3018)
AddNoise, Convolve, FFTConvolve, Speed, SpeedPerturbation, Deemphasis, and Preemphasis in torchaudio.transforms, and add_noise, fftconvolve, convolve, speed, preemphasis, and deemphasis in torchaudio.functional.
fill_buffer method to torchaudio.io.StreamReader (#2954, #2971)
buffer_chunk_size=-1 option to torchaudio.io.StreamReader (#2969)
With buffer_chunk_size=-1, StreamReader does not drop any buffered frames. Together with the fill_buffer method, this is the recommended way to load the entire media.
reader = StreamReader("video.mp4")
reader.add_basic_audio_stream(buffer_chunk_size=-1)
reader.add_basic_video_stream(buffer_chunk_size=-1)
reader.fill_buffer()
audio, video = reader.pop_chunks()
torchaudio.io.StreamReader (#2975)
torchaudio.io.StreamReader now gives the PTS (presentation time stamp) of the media chunk it is returning. To maintain backward compatibility, the timestamp information is attached to the returned media chunk.
reader = StreamReader(...)
reader.add_basic_audio_stream(...)
reader.add_basic_video_stream(...)
for audio_chunk, video_chunk in reader.stream():
# Fetch timestamp
print(audio_chunk.pts)
print(video_chunk.pts)
# Chunks behave the same as torch.Tensor.
audio_chunk.mean(dim=1)
torchaudio.io.play_audio (#3026, #3051)
torchaudio.io.play_audio function (macOS only).
torchaudio.utils.ffmpeg_utils, which can be used to query the dynamically linked FFmpeg libraries:
get_demuxers()
get_muxers()
get_audio_decoders()
get_audio_encoders()
get_video_decoders()
get_video_encoders()
get_input_devices()
get_output_devices()
get_input_protocols()
get_output_protocols()
get_build_config()
Refactor StreamReader/Writer implementation
torchaudio::ffmpeg namespace with torchaudio::io (#3013)
pop_chunks implementations (#3002)
Added logging to torchaudio.io.StreamReader/Writer (#2878)
Fixed the #threads used by FilterGraph to 1 (#2985)
Fixed the default #threads used by decoder to 1 in torchaudio.io.StreamReader
(#2949)
Moved libsox integration from libtorchaudio
to libtorchaudio_sox
(#2929)
Added query methods to FilterGraph (#2976)
cuda_version (#2952)
USE_CUDA detection (#3005)
USE_ROCM detection (#3008)
Published by mthrok almost 2 years ago
This is a minor release, which is compatible with PyTorch 1.13.1 and includes bug fixes, improvements and documentation updates. There is no new feature added.
Published by carolineechen almost 2 years ago
TorchAudio 0.13.0 release includes:
Hybrid Demucs is a music source separation model that uses both spectrogram and time domain features. It has demonstrated state-of-the-art performance in the Sony Music DeMixing Challenge. (citation: https://arxiv.org/abs/2111.03600)
The TorchAudio v0.13 release includes the following features
SDR Results of pre-trained pipelines on MUSDB-HQ test set
Pipeline | All | Drums | Bass | Other | Vocals |
---|---|---|---|---|---|
HDEMUCS_HIGH_MUSDB* | 6.42 | 7.76 | 6.51 | 4.47 | 6.93 |
HDEMUCS_HIGH_MUSDB_PLUS** | 9.37 | 11.38 | 10.53 | 7.24 | 8.32 |
* Trained on the training data of MUSDB-HQ dataset.
** Trained on both training and test sets of MUSDB-HQ and 150 extra songs from an internal database that were specifically produced for Meta.
Special thanks to @adefossez for the guidance.
ConvTasNet model architecture was added in TorchAudio 0.7.0. It is the first source separation model that outperforms the oracle ideal ratio mask. In this release, TorchAudio adds the pre-trained pipeline that is trained within TorchAudio on the Libri2Mix dataset. The pipeline achieves 15.6dB SDR improvement and 15.3dB Si-SNR improvement on the Libri2Mix test set.
With the addition of four new audio-related datasets, there is now support for all downstream tasks in version 1 of the SUPERB benchmark. Furthermore, these datasets support metadata mode through a get_metadata
function, which enables faster dataset iteration or preprocessing without the need to load or store waveforms.
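The idea behind metadata mode can be sketched with a toy dataset; the class and fields below are hypothetical, not torchaudio's actual dataset API:

```python
# Hypothetical sketch: get_metadata returns the same fields as __getitem__,
# but with the file path in place of the decoded waveform, so no audio
# is loaded from disk.
class ToyDataset:
    def __init__(self, files):
        self._files = files  # (path, sample_rate, transcript) triples

    def __getitem__(self, n):
        path, sr, transcript = self._files[n]
        waveform = self._load(path)  # expensive: decodes audio
        return waveform, sr, transcript

    def get_metadata(self, n):
        path, sr, transcript = self._files[n]
        return path, sr, transcript  # cheap: no decoding

    def _load(self, path):
        return [0.0] * 4  # stand-in for real audio decoding

ds = ToyDataset([("a.wav", 16000, "hello")])
print(ds.get_metadata(0))  # ('a.wav', 16000, 'hello')
```

This is what makes fast iteration for filtering or preprocessing possible without touching the audio payload.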
Datasets with metadata functionality:
In release 0.12, TorchAudio released a CTC beam search decoder with KenLM language model support. In this release, there is added functionality for creating custom Python language models that are compatible with the decoder, using the torchaudio.models.decoder.CTCDecoderLM wrapper.
torchaudio.io.StreamWriter
is a class for encoding media including audio and video. This can handle a wide variety of codecs, chunk-by-chunk encoding and GPU encoding.
GriffinLim implementations in transforms and functional used the momentum parameter differently, resulting in inconsistent results between the two implementations. The transforms.GriffinLim usage of momentum is updated to resolve this discrepancy.
torchaudio.info decodes audio to compute num_frames if it is not found in metadata (#2740). torchaudio.info may now return non-zero values for num_frames.
torchaudio.compliance.kaldi.fbank with the dither option produced a different output from Kaldi because it used a skewed, rather than Gaussian, distribution for dither. This is updated in this release to correctly use a random Gaussian instead.
runtime_error exception with TORCH_CHECK (#2550, #2551, #2592)
torchaudio.functional.resample function using the sinc resampling method, on a float32 tensor with two channels and one second duration.
CPU
torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
---|---|---|---|---|
0.13 | 0.256 | 0.549 | 0.769 | 0.820 |
0.12 | 0.386 | 0.534 | 31.8 | 12.1 |
CUDA
torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
---|---|---|---|---|
0.13 | 0.332 | 0.336 | 0.345 | 0.381 |
0.12 | 0.524 | 0.334 | 64.4 | 22.8 |
WER improvement on LibriSpeech dev and test sets
Viterbi (v0.12) | Viterbi (v0.13) | KenLM (v0.12) | KenLM (v0.13) | |
---|---|---|---|---|
dev-clean | 10.7 | 10.9 | 4.4 | 4.2 |
dev-other | 18.3 | 17.5 | 9.7 | 9.4 |
test-clean | 10.8 | 10.9 | 4.4 | 4.4 |
test-other | 18.5 | 17.8 | 10.1 | 9.5 |
:autosummary: in torchaudio docs (#2664, #2681, #2683, #2684, #2693, #2689, #2690, #2692)
Published by atalman about 2 years ago
This is a minor release, which is compatible with PyTorch 1.12.1 and includes small bug fixes, improvements and documentation updates. There is no new feature added.
For the full feature of v0.12, please refer to the v0.12.0 release note.
Published by hwangjeff over 2 years ago
TorchAudio 0.12.0 includes the following:
To support inference-time decoding, the release adds the wav2letter CTC beam search decoder, ported over from Flashlight (GitHub). Both lexicon and lexicon-free decoding are supported, and decoding can be done without a language model or with a KenLM n-gram language model. Compatible token, lexicon, and certain pretrained KenLM files for the LibriSpeech dataset are also available for download.
For usage details, please check out the documentation and ASR inference tutorial.
To improve flexibility in usage, the release adds two new beamforming modules under torchaudio.transforms
: SoudenMVDR and RTFMVDR. They differ from MVDR mainly in that they:
take reference_channel as an input argument in the forward method, to allow users to select the reference channel in model training or dynamically change the reference channel in inference.
Besides the two modules, the release adds new function-level beamforming methods under torchaudio.functional. These include
For usage details, please check out the documentation at torchaudio.transforms and torchaudio.functional and the Speech Enhancement with MVDR Beamforming tutorial.
StreamReader
is TorchAudio’s new I/O API. It is backed by FFmpeg† and allows users to
For usage details, please check out the documentation and tutorials:
† To use StreamReader
, FFmpeg libraries are required. Please install FFmpeg. The coverage of codecs depends on how these libraries are configured. TorchAudio official binaries are compiled to work with FFmpeg 4 libraries; FFmpeg 5 can be used if TorchAudio is built from source.
torchaudio.load, please install a compatible version of FFmpeg (Version 4 when using an official binary distribution).
torchaudio.info now returns num_frames=0 for MP3.
Hypothesis subclassed namedtuple. Containers of namedtuple instances, however, are incompatible with the PyTorch Lite Interpreter. To achieve compatibility, Hypothesis has been modified in release 0.12 to instead alias tuple. This affects RNNTBeamSearch, as it accepts and returns a list of Hypothesis instances.
complex128 to improve the precision and robustness of downstream matrix computations. The output dtype, however, was not correctly converted back to the original dtype. In release 0.12, we fix the output dtype to be consistent with the original input dtype.
torchaudio.transforms.PitchShift, after its first call, to perform the operation on a float32 Tensor with two channels and 8000 frames, resampled to 44.1 kHz across various shifted steps.
TorchAudio Version | 2 | 3 | 4 | 5 |
---|---|---|---|---|
0.12 | 2.76 | 5 | 1860 | 223 |
0.11 | 6.71 | 161 | 8680 | 1450 |
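The Hypothesis change can be illustrated in plain Python (simplified, with hypothetical field names):

```python
from collections import namedtuple

# Sketch of the 0.12 change: Hypothesis used to be a namedtuple subclass,
# and now aliases plain tuple (namedtuple containers are incompatible with
# the PyTorch Lite Interpreter).
OldHypothesis = namedtuple("OldHypothesis", ["tokens", "score"])
NewHypothesis = tuple  # attribute access is gone; use indexing/unpacking

old = OldHypothesis(tokens=[1, 2, 3], score=-0.5)
new = NewHypothesis(([1, 2, 3], -0.5))

print(old.tokens)    # [1, 2, 3] -- attribute access worked before
tokens, score = new  # after 0.12, unpack or index instead
print(tokens)        # [1, 2, 3]
```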
__getattr__ to implement delayed initialization (#2377)
Published by nateanl over 2 years ago
TorchAudio 0.11.0 release includes:
To support streaming ASR use cases, the release adds implementations of Emformer (docs), an RNN-T model that uses Emformer (emformer_rnnt_base), and an RNN-T beam search decoder (RNNTBeamSearch). It also includes a pipeline bundle (EMFORMER_RNNT_BASE_LIBRISPEECH) that wraps pre- and post-processing components, the beam search decoder, and the RNN-T Emformer model with weights pre-trained on LibriSpeech, which in whole allow for performing streaming ASR inference out of the box. For reference and reproducibility, the release provides the training recipe used to produce the pre-trained weights in the examples directory.
The masked prediction training of HuBERT model requires the masked logits, unmasked logits, and feature norm as the outputs. The logits are for cross-entropy losses and the feature norm is for penalty loss. The release adds HuBERTPretrainModel and corresponding factory functions (hubert_pretrain_base, hubert_pretrain_large, and hubert_pretrain_xlarge) to enable training from scratch.
The release adds an implementation of Conformer (docs), a convolution-augmented transformer architecture that has achieved state-of-the-art results on speech recognition benchmarks.
F.magphase
, F.angle
, F.complex_norm
, and T.ComplexNorm
. (#1934, #1935, #1942)
F.spectrogram
, T.Spectrogram
, F.phase_vocoder
, and T.TimeStretch
(#1957, #1958)
create_fb_matrix
(#1998)
create_fb_matrix
was replaced by melscale_fbanks
in release 0.10. It is removed in 0.11. Please use melscale_fbanks
.
VCTK_092 class for the latest version of the dataset.
diskcache_iterator and bg_iterator were deprecated in 0.10. They are removed in 0.11. Please cease their usage.
(<s>, <pad>, </s>, <unk>) that were not related to ASR tasks and were not used. These dimensions were removed.
Published by atalman over 2 years ago
This is a minor release compatible with PyTorch 1.10.2.
There is no feature change in torchaudio from 0.10.1. For the full feature of v0.10, please refer to the v0.10.0 release notes.
Published by mthrok almost 3 years ago
This is a minor release, which is compatible with PyTorch 1.10.1 and includes small bug fixes, improvements and documentation updates. There is no new feature added.
TORCH_CUDA_ARCH_LIST delimiter
For the full feature of v0.10, please refer to the v0.10.0 release note.
Published by carolineechen almost 3 years ago
torchaudio 0.10.0 release includes:
HuBERT model architectures (“base”, “large” and “extra large” configurations) are added. In addition to that, support for pretrained weights from wav2vec 2.0, Unsupervised Cross-lingual Representation Learning and HuBERT are added.
These pretrained weights can be used for feature extractions and downstream task adaptation.
>>> import torchaudio
>>>
>>> # Build the model and load pretrained weight.
>>> model = torchaudio.pipelines.HUBERT_BASE.get_model()
>>> # Perform feature extraction.
>>> features, lengths = model.extract_features(waveforms)
>>> # Pass the features to downstream task
>>> ...
Some of the pretrained weights are fine-tuned for ASR tasks. The following example illustrates how to use weights and access to associated information, such as labels, which can be used in subsequent CTC decoding steps. (Note: torchaudio does not provide a CTC decoding mechanism.)
>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.HUBERT_ASR_LARGE
>>>
>>> # Build the model and load pretrained weight.
>>> model = bundle.get_model()
Downloading:
100%|███████████████████████████████| 1.18G/1.18G [00:17<00:00, 73.8MB/s]
>>> # Check the corresponding labels of the output.
>>> labels = bundle.get_labels()
>>> print(labels)
('<s>', '<pad>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I', 'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P', 'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z')
>>>
>>> # Infer the label probability distribution
>>> waveform, sample_rate = torchaudio.load('hello-world.wav')
>>>
>>> emissions, _ = model(waveform)
>>>
>>> # Pass emission to (hypothetical) decoder
>>> transcripts = ctc_decode(emissions, labels)
>>> print(transcripts[0])
HELLO WORLD
A new model architecture, Tacotron2, is added, alongside several pretrained weights for TTS (text-to-speech). Because these TTS pipelines are composed of multiple models and specific data processing steps, the notion of a bundle is introduced to make it easy to use the associated objects. Bundles provide a common access point to create a pipeline with a set of pretrained weights. They are available under the torchaudio.pipelines
module.
The following example illustrates a TTS pipeline where two models (Tacotron2 and WaveRNN) are used together.
>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
>>>
>>> # Build text processor, Tacotron2 and vocoder (WaveRNN) model
>>> processor = bundle.get_text_preprocessor()
>>> tacotron2 = bundle.get_tacotron2()
Downloading:
100%|███████████████████████████████| 107M/107M [00:01<00:00, 87.9MB/s]
>>> vocoder = bundle.get_vocoder()
Downloading:
100%|███████████████████████████████| 16.7M/16.7M [00:00<00:00, 78.1MB/s]
>>>
>>> text = "Hello World!"
>>>
>>> # Encode text
>>> input, lengths = processor(text)
>>>
>>> # Generate (mel-scale) spectrogram
>>> specgram, lengths, _ = tacotron2.infer(input, lengths)
>>>
>>> # Convert spectrogram to waveform
>>> waveforms, lengths = vocoder(specgram, lengths)
>>>
>>> # Save audio
>>> torchaudio.save('hello-world.wav', waveforms, vocoder.sample_rate)
The loss function used in the RNN transducer architecture, which is widely used for speech recognition tasks, is added. The loss function (torchaudio.functional.rnnt_loss
or torchaudio.transforms.RNNTLoss
) supports float16 and float32 logits, has autograd and TorchScript support, and can be run on both CPU and GPU; the GPU path has a custom CUDA kernel implementation for improved performance.
This release adds support for MVDR beamforming on multi-channel audio using Time-Frequency masks. There are three solutions (ref_channel, stv_evd, stv_power) and it supports single-channel and multi-channel (perform average in the method) masks. It provides an online option that recursively updates the parameters for streaming audio.
Please refer to the MVDR tutorial.
This release adds GPU builds that support custom CUDA kernels in torchaudio, like the one being used for RNN transducer loss. Following this change, torchaudio’s binary distribution now includes CPU-only versions and CUDA-enabled versions. To use CUDA-enabled binaries, PyTorch also needs to be compatible with CUDA.
torchaudio.functional.lfilter
now supports batch processing and multiple filters. Additional operations, including pitch shift, LFCC, and inverse spectrogram, are now supported in this release. The datasets CMUDict and LibriMix are added as well.
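For reference, the IIR difference equation that lfilter evaluates, a[0]*y[n] = b[0]*x[n] + b[1]*x[n-1] + ... - a[1]*y[n-1] - ..., can be sketched in plain numpy. This is a conceptual single-signal illustration, not torchaudio's batched implementation:

```python
import numpy as np

def iir_filter(b, a, x):
    # Direct evaluation of the IIR difference equation:
    #   a[0]*y[n] = sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k]
    y = np.zeros_like(x, dtype=float)
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y[n] = acc / a[0]
    return y

# A one-pole smoother: y[n] = 0.5*x[n] + 0.5*y[n-1]
x = np.array([1.0, 0.0, 0.0, 0.0])
y = iir_filter(b=[0.5], a=[1.0, -0.5], x=x)
print(y)  # impulse response decays geometrically: 0.5, 0.25, 0.125, 0.0625
```

Batch support means the same recursion is applied across many signals and filter coefficient sets in one call, instead of looping in Python as above.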
PCM_24 (the previous default) could cause warping. The default has been changed to PCM_16, which does not suffer from this.
With power=None, torchaudio.functional.spectrogram and torchaudio.transforms.Spectrogram now default to return_complex=True, which returns a Tensor of native complex type (such as torch.cfloat and torch.cdouble). To use a pseudo complex type, pass the resulting tensor to torch.view_as_real.
torchaudio.functional.resample.
specgram.
extract_features of Wav2Vec2Model (#1776)
Wav2Vec2Model.feature_extractor()
.Wav2Vec2Model
was updated. Wav2Vec2Model.encoder.read_out
module is moved to Wav2Vec2Model.aux
. If you have serialized state dict, please replace the key encoder.read_out
with aux
.num_out
parameter has been changed to aux_num_out
and other parameters are added before it. Please update the code from wav2vec2_base(num_out)
to wav2vec2_base(aux_num_out=num_out)
.melscale_fbanks
and deprecate create_fb_matrix
(#1653)
linear_fbanks
is introduced, create_fb_matrix
is renamed to melscale_fbanks
. The original create_fb_matrix
is now deprecated. Please use melscale_fbanks
.VCTK
dataset (#1810)
VCTK_092
dataset.bg_iterator
and diskcache_iterator
are known to not improve the throughput of data loaders. Please cease their usage.
Tacotron2
HuBERT
CUDA
Tensor shape | [1,4,8000] | [1,4,16000] | [1,4,32000] |
---|---|---|---|
0.10 | 119 | 120 | 123 |
0.9 | 160 | 184 | 240 |
Unit: msec
Published by malfet about 3 years ago
This release depends on PyTorch 1.9.1.
No functional changes other than minor updates to CI rules.