Bot releases are hidden (Show)

audio - TorchAudio 2.4.0 Release Latest Release

Published by NicolasHug 3 months ago

This release is compatible with PyTorch 2.4. There are no new features added.

This release contains 2 fixes:

Fix view size error when backpropagating through lfilter https://github.com/pytorch/audio/pull/3794
[BC-Breaking] Fix model downloading in bento https://github.com/pytorch/audio/pull/3803

audio - TorchAudio 2.3.1 Release

Published by atalman 4 months ago

This release is compatible with PyTorch 2.3.1 patch release. There are no new features added.

audio - TorchAudio 2.3.0 Release

Published by ahmadsharif1 6 months ago

This release is compatible with PyTorch 2.3.0 patch release. There are no new features added.

This release contains minor documentation and code quality improvements (#3734, #3748, #3757, #3759)

audio - TorchAudio 2.2.2 Release

Published by atalman 7 months ago

This release is compatible with PyTorch 2.2.2 patch release. There are no new features added.

audio - TorchAudio 2.2.1 Release

Published by atalman 8 months ago

This release is compatible with PyTorch 2.2.1 patch release. There are no new features added.

audio - TorchAudio 2.2.0 Release

Published by mthrok 9 months ago

New Features

Add path-like object support to StreamReader/Writer https://github.com/pytorch/audio/pull/3608
Introduce trio top-level module, dedicated for core I/O operations (https://github.com/pytorch/audio/pull/3676, https://github.com/pytorch/audio/pull/3680, https://github.com/pytorch/audio/pull/3681, https://github.com/pytorch/audio/pull/3682) Please refer to https://pytorch.org/audio/2.2.0/torio.html for the details.

Bug Fixes

https://github.com/pytorch/audio/pull/3685 Make F.vad return empty tensor for zero valued tensor input

Recipe Updates

https://github.com/pytorch/audio/pull/3631 Fix inconsistent naming

audio - TorchAudio 2.1.2 Release

Published by huydhn 10 months ago

This is a patch release, which is compatible with PyTorch 2.1.2. There are no new features added.

audio - v2.1.1

Published by mthrok 11 months ago

This is a minor release, which is compatible with PyTorch 2.1.1 and includes bug fixes, improvements and documentation updates.

Bug Fixes

Cherry-pick 2.1.1: Fix WavLM bundles (#3665)
Cherry-pick 2.1.1: Add back compression level in i/o dispatcher backend by (#3666)

audio - Torchaudio 2.1 Release Note

Published by mthrok about 1 year ago

Hilights

TorchAudio v2.1 introduces the new features and backward-incompatible changes;

[BETA] A new API to apply filter, effects and codec
torchaudio.io.AudioEffector can apply filters, effects and encodings to waveforms in online/offline fashion.
You can use it as a form of augmentation.
Please refer to https://pytorch.org/audio/2.1/tutorials/effector_tutorial.html for the examples.
[BETA] Tools for forced alignment
New functions and a pre-trained model for forced alignment were added.
torchaudio.functional.forced_align computes alignment from an emission and torchaudio.pipelines.MMS_FA provides access to the model trained for multilingual forced alignment in MMS: Scaling Speech Technology to 1000+ languages project.
Please refer to https://pytorch.org/audio/2.1/tutorials/ctc_forced_alignment_api_tutorial.html for the usage of forced_align function, and https://pytorch.org/audio/2.1/tutorials/forced_alignment_for_multilingual_data_tutorial.html for how one can use MMS_FA to align transcript in multiple languages.
[BETA] TorchAudio-Squim : Models for reference-free speech assessment
Model architectures and pre-trained models from the paper TorchAudio-Squim: Reference-less Speech Quality and Intelligibility measures in TorchAudio were added.
You can use torchaudio.pipelines.SQUIM_SUBJECTIVE and torchaudio.pipelines.SQUIM_OBJECTIVE models to estimate the various speech quality and intelligibility metrics. This is helpful when evaluating the quality of speech generation models, such as TTS.
Please refer to https://pytorch.org/audio/2.1/tutorials/squim_tutorial.html for the detail.
[BETA] CUDA-based CTC decoder
torchaudio.models.decoder.CUCTCDecoder takes emission stored in CUDA memory and performs CTC beam search on it in CUDA device. The beam search is fast. It eliminates the need to move data from CUDA device to CPU when performing automatic speech recognition. With PyTorch's CUDA support, it is now possible to perform the entire speech recognition pipeline in CUDA.
Please refer to https://pytorch.org/audio/2.1/tutorials/asr_inference_with_cuda_ctc_decoder_tutorial.html for the detail.
[Prototype] Utilities for AI music generation
We are working to add utilities that are relevant to music AI. Since the last release, the following APIs were added to the prototype.
Please refer to respective documentation for the usage.
- torchaudio.prototype.chroma_filterbank
- torchaudio.prototype.transforms.ChromaScale
- torchaudio.prototype.transforms.ChromaSpectrogram
- torchaudio.prototype.pipelines.VGGISH
New recipes for training models.
Recipes for Audio-visual ASR, multi-channel DNN beamforming and TCPGen context-biasing were added.
Please refer to the recipes
Update to FFmpeg support
The version of supported FFmpeg libraries was updated.
TorchAudio v2.1 works with FFmpeg 6, 5 and 4.4. The support for 4.3, 4.2 and 4.1 are dropped.
Please refer to https://pytorch.org/audio/2.1/installation.html#optional-dependencies for the detail of the new FFmpeg integration mechanism.
Update to libsox integration
TorchAudio now depends on libsox installed separately from torchaudio. Sox I/O backend no longer supports file-like object. (This is supported by FFmpeg backend and soundfile)
Please refer to https://pytorch.org/audio/2.1/installation.html#optional-dependencies for the detail.

New Features

I/O

Support overwriting PTS in torchaudio.io.StreamWriter (#3135)
Include format information after filter torchaudio.io.StreamReader.get_out_stream_info (#3155)
Support CUDA frame in torchaudio.io.StreamReader filter graph (#3183, #3479)
Support YUV444P in GPU decoder (#3199)
Add additional filter graph processing to torchaudio.io.StreamWriter (#3194)
Cache and reuse HW device context in GPU decoder (#3178)
Cache and reuse HW device context in GPU encoder (#3215)
Support changing the number of channels in torchaudio.io.StreamReader (#3216)
Support encode spec change in torchaudio.io.StreamWriter (#3207)
Support encode options such as compression rate and bit rate (#3179, #3203, #3224)
Add 420p10le support to torchaudio.io.StreamReader CPU decoder (#3332)
Support multiple FFmpeg versions (#3464, #3476)
Support writing opus and mp3 with soundfile (#3554)
Add switch to disable sox integration and ffmpeg integration at runtime (#3500)

Ops

Add torchaudio.io.AudioEffector (#3163, #3372, #3374)
Add torchaudio.transforms.SpecAugment (#3309, #3314)
Add torchaudio.functional.forced_align (#3348, #3355, #3533, #3536, #3354, #3365, #3433, #3357)
Add torchaudio.functional.merge_tokens (#3535, #3614)
Add torchaudio.functional.frechet_distance (#3545)

Models

Add torchaudio.models.SquimObjective for speech enhancement (#3042, 3087, #3512)
Add torchaudio.models.SquimSubjective for speech enhancement (#3189)
Add torchaudio.models.decoder.CUCTCDecoder (#3096)

Pipelines

Add torchaudio.pipelines.SquimObjectiveBundle for speech enhancement (#3103)
Add torchaudio.pipelines.SquimSubjectiveBundle for speech enhancement (#3197)
Add torchaudio.pipelines.MMS_FA Bundle for forced alignment (#3521, #3538)

Tutorials

Add tutorial for torchaudio.io.AudioEffector (#3226)
Add tutorials for CTC forced alignment API (#3356, #3443, #3529, #3534, #3542, #3546, #3566)
Add tutorial for torchaudio.models.decoder.CUCTCDecoder (#3297)
Add tutorial for real-time av-asr (#3511)
Add tutorial for TorchAudio-SQUIM pipelines (#3279, #3313)
Split HW acceleration tutorial into nvdec/nvenc tutorials (#3483, #3478)

Recipe

Add TCPGen context-biasing Conformer RNN-T (#2890)
Add AV-ASR recipe (#3278, #3421, #3441, #3489, #3493, #3498, #3492, #3532)
Add multi-channel DNN beamforming training recipe (#3036)

Backward-incompatible changes

Third-party libraries

In this release, the following third party libraries are removed from TorchAudio binary distributions. TorchAudio now search and link these libraries at runtime. Please install them to use the corresponding APIs.

SoX

libsox is used for various audio I/O, filtering operations.

Pre-built binaries are avaialble via package managers, such as conda, apt and brew. Please refer to the respective documetation.

The APIs affected include;

torchaudio.load ("sox" backend)
torchaudio.info ("sox" backend)
torchaudio.save ("sox" backend)
torchaudio.sox_effects.apply_effects_tensor
torchaudio.sox_effects.apply_effects_file
torchaudio.functional.apply_codec (also deprecated, see below)

Changes related to the removal: #3232, #3246, #3497, #3035

Flashlight Text

flashlight-text is the core of CTC decoder.

Pre-built packages are available on PyPI. Please refer to https://github.com/flashlight/text for the detail.

The APIs affected include;

torchaudio.models.decoder.CTCDecoder

Changes related to the removal: #3232, #3246, #3236, #3339

Kaldi

A custom built libkaldi was used to implement torchaudio.functional.compute_kaldi_pitch. This function, along with libkaldi integration, is removed in this release. There is no replcement.

Changes related to the removal: #3368, #3403

I/O

Switch to the backend dispatcher (#3241)

To make I/O operations more flexible, TorchAudio introduced the backend dispatcher in v2.0, and users could opt-in to use the dispatcher.
In this release, the backend dispatcher becomes the default mechanism for selecting the I/O backend.

You can pass backend argument to torchaudio.info, torchaudio.load and torchaudio.save function to select I/O backend library per-call basis. (If it is omitted, an available backend is automatically selected.)

If you want to use the global backend mechanism, you can set the environment variable, TORCHAUDIO_USE_BACKEND_DISPATCHER=0.
Please note, however, that this the global backend mechanism is deprecated and is going to be removed in the next release.

Please see #2950 for the detail of migration work.

Remove Tensor binding from StreamReader (#3093, #3272)

torchaudio.io.StreamReader accepted a byte-string wrapped in 1D torch.Tensor object. This is no longer supported.
Please wrap the underlying data with io.BytesIO instead.

Make I/O optional arguments kw-only (#3208, #3227)

The optional arguments of add_[audio|video]_stream methods of torchaudio.io.StreamReader and torchaudio.io.StreamWriter are now keyword-only arguments.

Drop the support of FFmpeg < 4.1 (#3561, 3557)

Previously TorchAudio supported FFmpeg 4 (>=4.1, <=4.4). In this release, TorchAudio supports FFmpeg 4, 5 and 6 (>=4.4, <7). With this change, support for FFmpeg 4.1, 4.2 and 4.3 are dropped.

Ops

Use named file in torchaudio.functional.apply_codec (#3397)

In previous versions, TorchAudio shipped custom built libsox, so that it can perform in-memory decoding and encoding.
Now, in-memory decoding and encoding are handled by FFmpeg binding, and with the switch to dynamic libsox linking, torchaudio.functional.apply_codec no longer process audio in in-memory fashion. Instead it writes to temporary file.
For in-memory processing, please use torchaudio.io.AudioEffector.

Switch to lstsq when solving InverseMelScale (#3280)

Previously, torchaudio.transform.InverseMelScale ran SGD optimizer to find the inverse of mel-scale transform. This approach has number of issues as listed in #2643.

This release switches to use torch.linalg.lstsq.

Models

Improve RNN-T streaming decoding (#3295, #3379)

The infer method of torchaudio.models.RNNTBeamSearch has been updated to accept series of previous hypotheses.


bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH
decoder: RNNTBeamSearch = bundle.get_decoder()

hypothesis = None
while streaming:
    ...
    hypo, state = decoder.infer(
        features,
        length,
        beam_width,
        state=state,
        hypothesis=hypothesis,
    )
    ...
    hypothesis = hypo
    # Previously this had to be hypothesis = hypo[0]

Deprecations

Ops

Update and deprecate torchaudio.functional.apply_codec function (#3386)

Due to the removal of custom libsox binding, torchaudio.functional.apply_codec no longer supports in-memory processing. Please migrate to torchaudio.io.AudioEffector.

Please refer to for the detailed usage of torchaudio.io.AudioEffector.

Bug Fixes

Models

Fix the negative sampling in ConformerWav2Vec2PretrainModel (#3085)
Fix extract_features method for WavLM models (#3350)

Tutorials

Fix backtracking in forced alignment tutorial (#3440)
Fix initialization of get_trellis in forced alignment tutorial (#3172)

Build

Fix MKL issue on Intel mac build (#3307)

I/O

Surpress warning when saving vorbis with sox backend (#3359)
Fix g722 encoding in torchaudio.io.StreamWriter (#3373)
Refactor arg mapping in ffmpeg save function (#3387)
Fix save INT16 sox backend (#3524)
Fix SoundfileBackend method decorators (#3550)
Fix PTS initialization when using NVIDIA encoder (#3312)

Ops

Add non-default CUDA device support to lfilter (#3432)

Improvements

I/O

Set "experimental" automatically when using native opus/vorbis encoder (#3192)
Improve the performance of NV12 frame conversion (#3344)
Improve the performance of YUV420P frame conversion (#3342)
Refactor backend implementations (#3547, #3548, #3549)
Raise an error if torchaudio.io.StreamWriter is not opened (#3152)
Warn if decoding YUV images with different plane size (#3201)
Expose AudioMetadata (#3556)
Refactor the internal of torchaudio.io.StreamReader (#3157, #3170, #3186, #3184, #3188, #3320, #3296, #3328, #3419, #3209)
Refactor the internal of torchaudio.io.StreamWriter (#3205, #3319, #3296, #3328, #3426, #3428)
Refactor the FFmpeg abstraction layer (#3249, #3251)
Migrate the binding of FFmpeg utils to PyBind11 (#3228)
Simplify sox namespace (#3383)
Use const reference in sox implementation (#3389)
Ensure StreamReader returns tensors with requires_grad is False (#3467)
Set the default #threads to 1 in StreamWriter (#3370)
Remove ffmpeg fallback from sox_io backend (#3516)

Ops

Add arbitrary dim Tensor support to mask_along_axis{,_iid} (#3289)
Fix resampling to support dynamic input lengths for onnx exports. (#3473)
Optimize Torchaudio Vad (#3382)

Documentation

Build and use GPU-enabled FFmpeg in doc CI (#3045)
Misc tutorial update (#3449)
Update notes on FFmpeg version (#3480)
Update documentation about dependencies (#3517)
Update I/O and backend docs (#3555)

Tutorials

Update data augmentation tutorial (#3375)
Add more explanation about n_fft (#3442)

Build

Resolve some compilation warnings (#3471)
Use pre-built binaries for ffmpeg extension (#3460)
Add aarch64 workflow (#3553)
Add CUDA 12.1 builds (#3284)
Update CUDA to 12.1 U1 (#3563)

Recipe

Fix Adam and AdamW initializers in wav2letter example (#3145)
Update Librispeech RNNT recipe to support Lightening 2.0 (#3336)
Update HuBERT/SSL training recipes to support Lightning 2.x (#3396)
Add wav2vec2 loss function in self_supervised_learning training recipe (#3090)
Add Wav2Vec2DataModule in self_supervised_learning training recipe (#3081)

Other

Use FFmpeg6 in build doc (#3475)
Use FFmpeg6 in unit test (#3570)
Migrate torch.norm to torch.linalg.vector_norm (#3522)
Migrate torch.nn.utils.weight_norm to nn.utils.parametrizations.weight_norm (#3523)

audio - v2.0.2

Published by mthrok over 1 year ago

TorchAudio 2.0.2 Release Note

This is a minor release, which is compatible with PyTorch 2.0.1 and includes bug fixes, improvements and documentation updates. There is no new feature added.

Bug fix

#3239 Properly set #samples passed to encoder (#3204)
#3238 Fix virtual function issue with CTC decoder (#3230)
#3245 Fix path-like object support in FFmpeg dispatcher (#3243, #3248)
#3261 Use scaled_dot_product_attention in Wav2vec2/HuBERT's SelfAttention (#3253)
#3264 Use scaled_dot_product_attention in WavLM attention (#3252, #3265)

Full Changelog: https://github.com/pytorch/audio/compare/v2.0.1...v2.0.2

audio - Torchaudio 2.0 Release Note

Published by xiaohui-zhang over 1 year ago

Highlights

TorchAudio 2.0 release includes:

Data augmentation operators, e.g. convolution, additive noise, speed perturbation
WavLM and XLS-R models and pre-trained pipelines
Backend dispatcher powering revised info, load, save functions
Dropped support of Python 3.7
Added Python 3.11 support

[Beta] Data augmentation operators

The release adds several data augmentation operators under torchaudio.functional and torchaudio.transforms:

torchaudio.functional.add_noise
torchaudio.functional.convolve
torchaudio.functional.deemphasis
torchaudio.functional.fftconvolve
torchaudio.functional.preemphasis
torchaudio.functional.speed
torchaudio.transforms.AddNoise
torchaudio.transforms.Convolve
torchaudio.transforms.Deemphasis
torchaudio.transforms.FFTConvolve
torchaudio.transforms.Preemphasis
torchaudio.transforms.Speed
torchaudio.transforms.SpeedPerturbation

The operators can be used to synthetically diversify training data to improve the generalizability of downstream models.

For usage details, please refer to the documentation for torchaudio.functional and torchaudio.transforms, and tutorial “Audio Data Augmentation”.

[Beta] WavLM and XLS-R models and pre-trained pipelines

The release adds two self-supervised learning models for speech and audio.

WavLM that is robust to noise and reverberation.
XLS-R that is trained on cross-lingual datasets.

Besides the model architectures, torchaudio also supports corresponding pre-trained pipelines:

torchaudio.pipelines.WAVLM_BASE
torchaudio.pipelines.WAVLM_BASE_PLUS
torchaudio.pipelines.WAVLM_LARGE
torchaudio.pipelines.WAV2VEC_XLSR_300M
torchaudio.pipelines.WAV2VEC_XLSR_1B
torchaudio.pipelines.WAV2VEC_XLSR_2B

For usage details, please refer to factory function and pre-trained pipelines documentation.

Backend dispatcher

Release 2.0 introduces new versions of I/O functions torchaudio.info, torchaudio.load and torchaudio.save, backed by a dispatcher that allows for selecting one of backends FFmpeg, SoX, and SoundFile to use, subject to library availability. Users can enable the new logic in Release 2.0 by setting the environment variable TORCHAUDIO_USE_BACKEND_DISPATCHER=1; the new logic will be enabled by default in Release 2.1.

# Fetch metadata using FFmpeg
metadata = torchaudio.info("test.wav", backend="ffmpeg")

# Load audio (with no backend parameter value provided, function prioritizes using FFmpeg if it is available)
waveform, rate = torchaudio.load("test.wav")

# Write audio using SoX
torchaudio.save("out.wav", waveform, rate, backend="sox")

Please see the documentation for torchaudio for more details.

Backward-incompatible changes

Dropped Python 3.7 support (#3020)
Following the upstream PyTorch (https://github.com/pytorch/pytorch/pull/93155), the support for Python 3.7 has been dropped.
Default to "precise" seek in torchaudio.io.StreamReader.seek (#2737, #2841, #2915, #2916, #2970)
Previously, the StreamReader.seek method seeked into a key frame closest to the given time stamp. A new option mode has been added which can switch the behavior to seeking into any type of frame, including non-key frames, that is closest to the given timestamp, and this behavior is now default.
Removed deprecated/unused/undocumented functions from datasets.utils (#2926, #2927)
The following functions are removed from datasets.utils
- stream_url
- download_url
- validate_file
- extract_archive.

Deprecations

Ops

Deprecated 'onesided' init param for MelSpectrogram (#2797, #2799)
torchaudio.transforms.MelSpectrogram assumes the onesided argument to be always True. The forward path fails if its value is False. Therefore this argument is deprecated. Users specifying this argument should stop specifying it.
Deprecated "sinc_interpolation" and "kaiser_window" option value in favor of "sinc_interp_hann" and "sinc_interp_kaiser" (#2922)
The valid values of resampling_method argument of resampling operations (torchaudio.transforms.Resample and torchaudio.functional.resample) are changed. "kaiser_window" is now "sinc_interp_kaiser" and "sinc_interpolation" is "sinc_interp_hann". The old values will continue to work, but users are encouraged to update their code.
For the reason behind of this change, please refer #2891.
Deprecated sox initialization/shutdown public API functions (#3010)
torchaudio.sox_effects.init_sox_effects and torchaudio.sox_effects.shutdown_sox_effects are deprecated. They were required to use libsox-related features, but are called automatically since v0.6, and the initialization/shutdown mechanism have been moved elsewhere. These functions are now no-op. Users can simply remove the call to these functions.

Models

Deprecated static binding of Flashlight-text based CTC decoder (#3055, #3089)
Since v0.12, TorchAudio binary distributions included the CTC decoder based on flashlight-text project. In a future release, TorchAudio will switch to dynamic binding of underlying CTC decoder implementation, and stop shipping the core CTC decoder implementations. Users who would like to use the CTC decoder need to separately install the CTC decoder from the upstream flashlight-text project. Other functionalities of TorchAudio will continue to work without flashlight-text.
Note: The API and numerical behavior does not change.
For more detail, please refer #3088.

I/O

Deprecated file-like object support in sox_io (#3033)
As a preparation to switch to dynamically bound libsox, file-like object support in sox_io backend has been deprecated. It will be removed in 2.1 release in favor of the dispatcher. This deprecation affects the following functionalities.
- I/O: torchaudio.load, torchaudio.info and torchaudio.save.
- Effects: torchaudio.sox_effects.apply_effects_file and torchaudio.functional.apply_codec.
  For I/O, to continue using file-like objects, please use the new dispatcher mechanism.
  For effects, replacement functions will be added in the next release.
Deprecated the use of Tensor as a container for byte string in StreamReader (#3086)
torchaudio.io.StreamReader supports decoding media from byte strings contained in 1D tensors of torch.uint8 type. Using torch.Tensor type as a container for byte string is now deprecated. To pass byte strings, please wrap the string with io.BytesIO.

Bug Fixes

Ops

Fixed contiguous error when backpropagating through torchaudio.functional.lfilter (#3080)

Pipelines

Added layer normalization to wav2vec2 large+ pretrained models (#2873)
In self-supervised learning models such as Wav2Vec 2.0, HuBERT, or WavLM, layer normalization should be applied to waveforms if the convolutional feature extraction module uses layer normalization and is trained on a large-scale dataset. After adding layer normalization to those affected models, the Word Error Rate is significantly reduced.

Without the change in #2873, the WER results are:

Model	dev-clean	dev-other	test-clean	test-other
WAV2VEC2_ASR_LARGE_LV60K_10M	10.59	15.62	9.58	16.33
WAV2VEC2_ASR_LARGE_LV60K_100H	2.80	6.01	2.82	6.34
WAV2VEC2_ASR_LARGE_LV60K_960H	2.36	4.43	2.41	4.96
HUBERT_ASR_LARGE	1.85	3.46	2.09	3.89
HUBERT_ASR_XLARGE	2.21	3.40	2.26	4.05

After applying layer normalization, the updated WER results are:

Model	dev-clean	dev-other	test-clean	test-other
WAV2VEC2_ASR_LARGE_LV60K_10M	6.77	10.03	6.87	10.51
WAV2VEC2_ASR_LARGE_LV60K_100H	2.19	4.55	2.32	4.64
WAV2VEC2_ASR_LARGE_LV60K_960H	1.78	3.51	2.03	3.68
HUBERT_ASR_LARGE	1.77	3.32	2.03	3.68
HUBERT_ASR_XLARGE	1.73	2.72	1.90	3.16

Recipe

Fixed DDP training in HuBERT recipes (#3068)
If shuffle is set True in BucketizeBatchSampler, the seed is only the same for the first epoch. In later epochs, each BucketizeBatchSampler object will generate a different shuffled iteration list, which may cause DPP training to hang forever if the lengths of iteration lists are different across nodes. In the 2.0.0 release, the issue is fixed by using the same seed for RNG in all nodes.

IO

Fixed signature mismatch on _fail_info_fileobj (#3032)
Remove unnecessary AVFrame allocation (#3021)
This fixes the memory leak reported in torchaudio.io.StreamReader.

New Features

Ops

Added CUDA kernel for torchaudio.functional.lfilter (#3018)
Added data augmentation ops (#2801, #2809, #2829, #2811, #2871, #2874, #2892, #2935, #2977, #3001, #3009, #3061, #3072)
Introduces AddNoise, Convolve, FFTConvolve, Speed, SpeedPerturbation, Deemphasis, and Preemphasis in torchaudio.transforms, and add_noise, fftconvolve, convolve, speed, preemphasis, and deemphasis in torchaudio.functional.

Models

Added WavLM model (#2822, #2842)
Added XLS-R models (#2959)

Pipelines

Added WavLM bundles (#2833, #2895)
Added pre-trained pipelines for XLS-R models (#2978)

I/O

Added rgb48le and CUDA p010 support (HDR/10bit) to StreamReader (#3023)
Added fill_buffer method to torchaudio.io.StreamReader (#2954, #2971)
Added buffer_chunk_size=-1 option to torchaudio.io.StreamReader (#2969)
When buffer_chunk_size=-1, StreamReader does not drop any buffered frame. Together with the fill_buffer method, this is a recommended way to load the entire media.
```
reader = StreamReader("video.mp4")
reader.add_basic_audio_stream(buffer_chunk_size=-1)
reader.add_basic_video_stream(buffer_chunk_size=-1)
reader.fill_buffer()
audio, video = reader.pop_chunks()
```

Added PTS support to torchaudio.io.StreamReader (#2975)
torchaudio.io.SteramReader now gives PTS (presentation time stamp) of the media chunk it is returning. To maintain backward compatibility, the timestamp information is attached to the returned media chunk.

reader = StreamReader(...)
reader.add_basic_audio_stream(...)
reader.add_basic_video_stream(...)
for audio_chunk, video_chunk in reader.stream():
    # Fetch timestamp
    print(audio_chunk.pts)
    print(video_chunk.pts)
    # Chunks behave the same as torch.Tensor.
    audio_chunk.mean(dim=1)

Added playback function torchaudio.io.play_audio (#3026, #3051)
You can play audio with the torchaudio.io.play_audio function. (macOS only)
Added new dispatcher (#3015, #3058, #3073)

Other

Add utility functions to check information about FFmpeg (#2958, #3014)
The following functions are added to torchaudio.utils.ffmpeg_utils, which can be used to query into the dynamically linked FFmpeg libraries.
- get_demuxers()
- get_muxers()
- get_audio_decoders()
- get_audio_encoders()
- get_video_decoders()
- get_video_encoders()
- get_input_devices()
- get_output_devices()
- get_input_protocols()
- get_output_protocols()
- get_build_config()

Recipes

Add modularized SSL training recipe (#2876)

Improvements

I/O

Refactor StreamReader/Writer implementation
- Refactored StreamProcessor interface (#2791)
- Refactored Buffer implementation (#2939, #2943, #2962, #2984, #2988)
- Refactored AVFrame to Tensor conversions (#2940, #2946)
- Refactored and optimize yuv420p and nv12 processing (#2945)
- Abstracted away AVFormatContext from constructor (#3007)
- Removed unused/redundant things (#2995)
- Replaced torchaudio::ffmpeg namespace with torchaudio::io (#3013)
- Merged pop_chunks implementations (#3002)
- Cleaned up private methods (#3030)
- Moved drain method to private (#2996)
Added logging to torchaudio.io.StreamReader/Writer (#2878)
Fixed the #threads used by FilterGraph to 1 (#2985)
Fixed the default #threads used by decoder to 1 in torchaudio.io.StreamReader (#2949)
Moved libsox integration from libtorchaudio to libtorchaudio_sox (#2929)
Added query methods to FilterGraph (#2976)

Ops

Added logging to MelSpectrogram and Spectrogram (#2861)
Fixed filtering function fallback mechanism (#2953)
Enabled log probs input for RNN-T loss (#2798)
Refactored extension modules initialization (#2968)
Updated the guard mechanism for FFmpeg-related features (#3028)
Updated the guard mechanism for cuda_version (#2952)

Models

Renamed generator to vocoder in HiFiGAN model and factory functions (#2955)
Enforces contiguous tensor in CTC decoder (#3074)

Datasets

Validates the input path in LibriMix dataset (#2944)

Documentation

Fixed docs warnings for conformer w2v2 (#2900)
Updated model documentation structure (#2902)
Fixed document for MelScale and InverseMelScale (#2967)
Updated highlighting in doc (#3000)
Added installation / build instruction to doc (#3038)
Redirect build instruction to official doc (#3053)
Tweak docs around IO (#3064)
Improved docstring about input path to LibriMix (#2937)

Recipes

Simplify train step in Conformer RNN-T LibriSpeech recipe (#2981)
Update WER results for CTC n-gram decoding (#3070)
Update ssl example (#3060)
fix import bug in global_stats.py (#2858)
Fixes examples/source_separation for WSJ0_2mix dataset (#2987)

Tutorials

Added mel spectrogram visualization to Streaming ASR tutorial (#2974)
Fixed mel spectrogram visualization in TTS tutorial (#2989)
Updated data augmentation tutorial to use new operators (#3062)
Fixed hybrid demucs tutorial for CUDA (#3017)
Updated hardware accelerated video processing tutorial (#3050)

Builds

Fixed USE_CUDA detection (#3005)
Fixed USE_ROCM detection (#3008)
Added M1 Conda builds (#2840)
Added M1 Wheels builds (#2839)
Added CUDA 11.8 builds (#2951)
Switched CI to CUDA 11.7 from CUDA 11.6 (#3031, #3034)
Added python 3.11 support (#3039, #3071)
Updated C++ standard to 17 (#2973)

Tests

Fix integration test for WAV2VEC2_ASR_LARGE_LV60K_10M (#2910)
Fix CI tests on gpu machines (#2982)
Remove function input parameters from data aug functional tests (#3011)
Reduce the sample rate of some tests (#2963)

Style

Fix type of arguments in torchaudio.io classes (#2913)

audio - TorchAudio 0.13.1 Release Note

Published by mthrok almost 2 years ago

This is a minor release, which is compatible with PyTorch 1.13.1 and includes bug fixes, improvements and documentation updates. There is no new feature added.

Bug Fix

IO

Make buffer size configurable in ffmpeg file object operations and set size in backend (#2810)
Fix issue with the missing video frame in StreamWriter (#2789)
Fix decimal FPS handling StreamWriter (#2831)
Fix wrong frame allocation in StreamWriter (#2905)
Fix duplicated memory allocation in StreamWriter (#2906)

Model

Fix HuBERT model initialization (#2846, #2886)

Recipe

Fix issues in HuBERT fine-tuning recipe (#2851)
Fix automatic mixed precision in HuBERT pre-training recipe (#2854)

audio - torchaudio 0.13.0 Release Note

Published by carolineechen almost 2 years ago

Highlights

TorchAudio 0.13.0 release includes:

Source separation models and pre-trained bundles (Hybrid Demucs, ConvTasNet)
New datasets and metadata mode for the SUPERB benchmark
Custom language model support for CTC beam search decoding
StreamWriter for audio and video encoding

[Beta] Source Separation Models and Bundles

Hybrid Demucs is a music source separation model that uses both spectrogram and time domain features. It has demonstrated state-of-the-art performance in the Sony Music DeMixing Challenge. (citation: https://arxiv.org/abs/2111.03600)

The TorchAudio v0.13 release includes the following features

MUSDB_HQ Dataset, which is used in Hybrid Demucs training (docs)
Hybrid Demucs model architecture (docs)
Three factory functions suitable for different sample rate ranges
Pre-trained pipelines (docs) and tutorial

SDR Results of pre-trained pipelines on MUSDB-HQ test set

Pipeline	All	Drums	Bass	Other	Vocals
HDEMUCS_HIGH_MUSDB*	6.42	7.76	6.51	4.47	6.93
HDEMUCS_HIGH_MUSDB_PLUS**	9.37	11.38	10.53	7.24	8.32

* Trained on the training data of MUSDB-HQ dataset.
** Trained on both training and test sets of MUSDB-HQ and 150 extra songs from an internal database that were specifically produced for Meta.

Special thanks to @adefossez for the guidance.

ConvTasNet model architecture was added in TorchAudio 0.7.0. It is the first source separation model that outperforms the oracle ideal ratio mask. In this release, TorchAudio adds the pre-trained pipeline that is trained within TorchAudio on the Libri2Mix dataset. The pipeline achieves 15.6dB SDR improvement and 15.3dB Si-SNR improvement on the Libri2Mix test set.

[Beta] Datasets and Metadata Mode for SUPERB Benchmarks

With the addition of four new audio-related datasets, there is now support for all downstream tasks in version 1 of the SUPERB benchmark. Furthermore, these datasets support metadata mode through a get_metadata function, which enables faster dataset iteration or preprocessing without the need to load or store waveforms.

Datasets with metadata functionality:

LIBRISPEECH (docs)
LibriMix (docs)
QUESST14 (docs)
SPEECHCOMMANDS (docs)
(new) FluentSpeechCommands (docs)
(new) Snips (docs)
(new) IEMOCAP (docs)
(new) VoxCeleb1 (Identification, Verification)

[Beta] Custom Language Model support in CTC Beam Search Decoding

In release 0.12, TorchAudio released a CTC beam search decoder with KenLM language model support. This release, there is added functionality for creating custom Python language models that are compatible with the decoder, using the torchaudio.models.decoder.CTCDecoderLM wrapper.

[Beta] StreamWriter

torchaudio.io.StreamWriter is a class for encoding media including audio and video. This can handle a wide variety of codecs, chunk-by-chunk encoding and GPU encoding.

Backward-incompatible changes

[BC-breaking] Fix momentum in transforms.GriffinLim (#2568)
The GriffinLim implementations in transforms and functional used the momentum parameter differently, resulting in inconsistent results between the two implementations. The transforms.GriffinLim usage of momentum is updated to resolve this discrepancy.
Make torchaudio.info decode audio to compute num_frames if it is not found in metadata (#2740).
In such cases, torchaudio.info may now return non-zero values for num_frames.

Bug Fixes

Fix random Gaussian generation (#2639)
torchaudio.compliance.kaldi.fbank with dither option produced a different output from kaldi because it used a skewed, rather than gaussian, distribution for dither. This is updated in this release to correctly use a random gaussian instead.
Update download link for speech commands (#2777)
The previous download link for SpeechCommands v2 did not include data for the valid and test sets, resulting in errors when trying to use those subsets. Update the download link to correctly download the whole dataset.

New Features

IO

Add metadata to source stream info (#2461, #2464)
Add utility function to fetch FFmpeg library versions (#2467)
Add YUV444P support to StreamReader (#2516)
Add StreamWriter (#2628, #2648, #2505)
Support in-memory decoding via Tensor wrapper in StreamReader (#2694)
Add StreamReader Tensor Binding to src (#2699)
Add StreamWriter media device/streaming tutorial (#2708)
Add StreamWriter tutorial (#2698)

Ops

Add ITU-R BS.1770-4 loudness recommendation (#2472)
Add convolution operator (#2602)
Add additive noise function (#2608)

Models

Hybrid Demucs model implementation (#2506)
Docstring change for Hybrid Demucs (#2542, #2570)
Add NNLM support to CTC Decoder (#2528, #2658)
Move hybrid demucs model out of prototype (#2668)
Move conv_tasnet_base doc out of prototype (#2675)
Add custom lm example to decoder tutorial (#2762)

Pipelines

Add SourceSeparationBundle to prototype (#2440, #2559)
Adding pipeline changes, factory functions to HDemucs (#2547, #2565)
Create tutorial for HDemucs (#2572)
Add HDEMUCS_HIGH_MUSDB (#2601)
Move SourceSeparationBundle and pre-trained ConvTasNet pipeline into Beta (#2669)
Move Hybrid Demucs pipeline to beta (#2673)
Update description of HDemucs pipelines

Datasets

Add fluent speech commands (#2480, #2510)
Add musdb dataset and tests (#2484)
Add VoxCeleb1 dataset (#2349)
Add metadata function for LibriSpeech (#2653)
Add Speech Commands metadata function (#2687)
Add metadata mode for various datasets (#2697)
Add IEMOCAP dataset (#2732)
Add Snips Dataset (#2738)
Add metadata for Librimix (#2751)
Add file name to returned item in Snips dataset (#2775)
Update IEMOCAP variants and labels (#2778)

Improvements

IO

Replace runtime_error exception with TORCH_CHECK (#2550, #2551, #2592)
Refactor StreamReader (#2507, #2508, #2512, #2530, #2531, #2533, #2534)
Refactor sox C++ (#2636, #2663)
Delay the import of kaldi_io (#2573)

Ops

Speed up resample with kernel generation modification (#2553, #2561)
The kernel generation for resampling is optimized in this release. The following table illustrates the performance improvements from the previous release for the torchaudio.functional.resample function using the sinc resampling method, on float32 tensor with two channels and one second duration.

CPU

torchaudio version	8k → 16k [Hz]	16k → 8k	16k → 44.1k	44.1k → 16k
0.13	0.256	0.549	0.769	0.820
0.12	0.386	0.534	31.8	12.1

CUDA

torchaudio version	8k → 16k [Hz]	16k → 8k	16k → 44.1k	44.1k → 16k
0.13	0.332	0.336	0.345	0.381
0.12	0.524	0.334	64.4	22.8

Add normalization parameter on spectrogram and inverse spectrogram (#2554)
Replace assert with raise for ops (#2579, #2599)
Replace CHECK_ by TORCH_CHECK_ (#2582)
Fix argument validation in TorchAudio filtering (#2609)

Models

Switch to flashlight decoder from upstream (#2557)
Add dimension and shape check (#2563)
Replace assert with raise in models (#2578, #2590)
Migrate CTC decoder code (#2580)
Enable CTC decoder in Windows (#2587)

Datasets

Replace assert with raise in datasets (#2571)
Add unit test for LibriMix dataset (#2659)
Add gtzan download note (#2763)

Tutorials

Tweak tutorials (#2630, #2733)
Update ASR inference tutorial (#2631)
Update and fix tutorials (#2661, #2701)
Introduce IO section to getting started tutorials (#2703)
Update HW video processing tutorial (#2739)
Update tutorial author information (#2764)
Fix typos in tacotron2 tutorial (#2761)
Fix fading in hybrid demucs tutorial (#2771)
Fix leaking matplotlib figure (#2769)
Update resampling tutorial (#2773)

Recipes

Use lazy import for joblib (#2498)
Revise LibriSpeech Conformer RNN-T recipe (#2535)
Fix bug in Conformer RNN-T recipe (#2611)
Replace bg_iterator in examples (#2645)
Remove obsolete examples (#2655)
Fix LibriSpeech Conforner RNN-T eval script (#2666)
Replace IValue::toString()->string() with IValue::toStringRef() (#2700)
Improve wav2vec2/hubert model for pre-training (#2716)
Improve hubert recipe for pre-training and fine-tuning (#2744)

WER improvement on LibriSpeech dev and test sets

	Viterbi (v0.12)	Viterbi (v0.13)	KenLM (v0.12)	KenLM (v0.13)
dev-clean	10.7	10.9	4.4	4.2
dev-other	18.3	17.5	9.7	9.4
test-clean	10.8	10.9	4.4	4.4
test-other	18.5	17.8	10.1	9.5

Documentation

Examples

Add example for Vol transform (#2597)
Add example for Vad transform (#2598)
Add example for SlidingWindowCmn transform (#2600)
Add example for MelScale transform (#2616)
Add example for AmplitudeToDB transform (#2615)
Add example for InverseMelScale transform (#2635)
Add example for MFCC transform (#2637)
Add example for LFCC transform (#2640)
Add example for Loudness transform (#2641)

Other

Remove CTC decoder prototype message (#2459)
Fix docstring (#2540)
Dataset docstring change (#2575)
Fix typo - "dimension" (#2596)
Add note for lexicon free decoder output (#2603)
Fix stylecheck (#2606)
Fix dataset docs parsing issue with extra spaces (#2607)
Remove outdated doc (#2617)
Use double quotes for string in functional and transforms (#2618)
Fix doc warning (#2627)
Update README.md (#2633)
Sphinx-gallery updates (#2629, #2638, #2736, #2678, #2679)
Tweak documentation (#2656)
Consolidate bibliography / reference (#2676)
Tweak badge link URL generation (#2677)
Adopt :autosummary: in torchaudio docs (#2664, #2681, #2683, #2684, #2693, #2689, #2690, #2692)
Update sox info docstring to account for mp3 frame count handling (#2742)
Fix HuBERT docstring (#2746)
Fix CTCDecoder doc (#2766)
Fix torchaudio.backend doc (#2781)

Build/CI

Simplify the requirements to minimum runtime dependencies (#2313)
Bump version to 0.13 (#2460)
Add tagged builds to torchaudio (#2471)
Update config.guess to the latest (#2479)
Pin MKL to 2020.04 (#2486)
Integration test fix deleting temporary directory (#2569)
Refactor cmake (#2585)
Introducing pytorch-cuda metapackage (#2612)
Move xcode to 14 from 12.5 (#2622)
Update nightly wheels to ROCm5.2 (#2672)
Lint updates (#2389, #2487)
M1 build updates (#2473, #2474, #2496, #2674)
CUDA-related updates: versions, builds, and checks (#2501, #2623, #2670, #2707, #2710, #2721, #2724)
Release-related updates (#2489, #2492, #2495, #2759)
Fix Anaconda upload (#2581, #2621)
Fix windows python 3.8 loading path (#2735, #2747)

audio - torchaudio 0.12.1 Release Note

Published by atalman about 2 years ago

This is a minor release, which is compatible with PyTorch 1.12.1 and include small bug fixes, improvements and documentation update. There is no new feature added.

Bug Fix

#2560 Fix fall back failure in sox_io backend
#2588 Fix hubert fine-tuning recipe bugs

Improvement

#2552 Remove unused boost source code
#2527 Improve speech enhancement tutorial
#2544 Update forced alignment tutorial
#2595 Update data augmentation tutorial

For the full feature of v0.12, please refer to the v0.12.0 release note.

audio - v0.12.0

Published by hwangjeff over 2 years ago

TorchAudio 0.12.0 Release Notes

Highlights

TorchAudio 0.12.0 includes the following:

CTC beam search decoder
New beamforming modules and methods
Streaming API

[Beta] CTC beam search decoder

To support inference-time decoding, the release adds the wav2letter CTC beam search decoder, ported over from Flashlight (GitHub). Both lexicon and lexicon-free decoding are supported, and decoding can be done without a language model or with a KenLM n-gram language model. Compatible token, lexicon, and certain pretrained KenLM files for the LibriSpeech dataset are also available for download.

For usage details, please check out the documentation and ASR inference tutorial.

[Beta] New beamforming modules and methods

To improve flexibility in usage, the release adds two new beamforming modules under torchaudio.transforms: SoudenMVDR and RTFMVDR. They differ from MVDR mainly in that they:

Use power spectral density (PSD) and relative transfer function (RTF) matrices as inputs instead of time-frequency masks. The module can be integrated with neural networks that directly predict complex-valued STFT coefficients of speech and noise.
Add reference_channel as an input argument in the forward method to allow users to select the reference channel in model training or dynamically change the reference channel in inference.

Besides the two modules, the release adds new function-level beamforming methods under torchaudio.functional. These include

For usage details, please check out the documentation at torchaudio.transforms and torchaudio.functional and the Speech Enhancement with MVDR Beamforming tutorial.

[Beta] Streaming API

StreamReader is TorchAudio’s new I/O API. It is backed by FFmpeg† and allows users to

Decode various audio and video formats, including MP4 and AAC.
Handle various input forms, such as local files, network protocols, microphones, webcams, screen captures and file-like objects.
Iterate over and decode media chunk-by-chunk, while changing the sample rate or frame rate.
Apply various audio and video filters, such as low-pass filter and image scaling.
Decode video with Nvidia's hardware-based decoder (NVDEC).

For usage details, please check out the documentation and tutorials:

† To use StreamReader, FFmpeg libraries are required. Please install FFmpeg. The coverage of codecs depends on how these libraries are configured. TorchAudio official binaries are compiled to work with FFmpeg 4 libraries; FFmpeg 5 can be used if TorchAudio is built from source.

Backwards-incompatible changes

I/O

MP3 decoding is now handled by FFmpeg in sox_io backend. (#2419, #2428)
- FFmpeg is now used as fallback in sox_io backend, and now MP3 decoding is handled by FFmpeg. To load MP3 audio with torchaudio.load, please install a compatible version of FFmpeg (Version 4 when using an official binary distribution).
- Note that, whereas the previous MP3 decoding scheme pads the output audio, the new scheme does not. As a consequence, the new version returns shorter audio tensors.
- torchaudio.info now returns num_frames=0 for MP3.

Models

Change underlying implementation of RNN-T hypothesis to tuple (#2339)
- In release 0.11, Hypothesis subclassed namedtuple. Containers of namedtuple instances, however, are incompatible with the PyTorch Lite Interpreter. To achieve compatibility, Hypothesis has been modified in release 0.12 to instead alias tuple. This affects RNNTBeamSearch as it accepts and returns a list of Hypothesis instances.

Bug Fixes

Ops

Fix return dtype in MVDR module (#2376)
- In release 0.11, the MVDR module converts the dtype of input spectrum to complex128 to improve the precision and robustness of downstream matrix computations. The output dtype, however, is not correctly converted back to the original dtype. In release 0.12, we fix the output dtype to be consistent with the original input dtype.

Build

Fix Kaldi submodule integration (#2269)
Pin jinja2 version for build_docs (#2292)
Use sourceforge url to fetch zlib (#2297)

New Features

I/O

Add Streaming API (#2041, #2042, #2043, #2044, #2045, #2046, #2047, #2111, #2113, #2114, #2115, #2135, #2164, #2168, #2202, #2204, #2263, #2264, #2312, #2373, #2378, #2402, #2403, #2427, #2429)
Add YUV420P format support to Streaming API (#2334)
Support specifying decoder and its options (#2327)
Add NV12 format support in Streaming API (#2330)
Add HW acceleration support on Streaming API (#2331)
Add file-like object support to Streaming API (#2400)
Make FFmpeg log level configurable (#2439)
Set the default ffmpeg log level to FATAL (#2447)

Ops

New beamforming methods (#2227, #2228, #2229, #2230, #2231, #2232, #2369, #2401)
New MVDR modules (#2367, #2368)
Add and refactor CTC lexicon beam search decoder (#2075, #2079, #2089, #2112, #2117, #2136, #2174, #2184, #2185, #2273, #2289)
Add lexicon free CTC decoder (#2342)
Add Pretrained LM Support for Decoder (#2275)
Move CTC beam search decoder to beta (#2410)

Datasets

Add QUESST14 dataset (#2290, #2435, #2458)
Add LibriLightLimited dataset (#2302)

Improvements

I/O

Use FFmpeg-based I/O as fallback in sox_io backend. (#2416, #2418, #2423)

Ops

Raise error for resampling int waveform (#2318)
Move multi-channel modules to a separate file (#2382)
Refactor MVDR module (#2383)

Models

Add an option to use Tanh instead of ReLU in RNNT joiner (#2319)
Support GroupNorm and re-ordering Convolution/MHA in Conformer (#2320)
Add extra arguments to hubert pretrain factory functions (#2345)
Add feature_grad_mult argument to HuBERTPretrainModel (#2335)

Datasets

Refactor LibriSpeech dataset (#2387)
Raising RuntimeErrors when datasets missing (#2430)

Performance

Make Pitchshift for faster by caching resampling kernel (#2441)
The following table illustrates the performance improvement over the previous release by comparing the time in msecs it takes torchaudio.transforms.PitchShift, after its first call, to perform the operation on float32 Tensor with two channels and 8000 frames, resampled to 44.1 kHz across various shifted steps.

TorchAudio Version	2	3	4	5
0.12	2.76	5	1860	223
0.11	6.71	161	8680	1450

Tests

Add complex dtype support in functional autograd test (#2244)
Refactor torchscript consistency test in functional (#2246)
Add unit tests for PyTorch Lightning modules of emformer_rnnt recipes (#2240)
Refactor batch consistency test in functional (#2245)
Run smoke tests on regular PRs (#2364)
Refactor smoke test executions (#2365)
Move seed to setup (#2425)
Remove possible manual seeds from test files (#2436)

Build

Revise the parameterization of third party libraries (#2282)
Use zlib v1.2.12 with GitHub source (#2300)
Fix ffmpeg integration for ffmpeg 5.0 (#2326)
Use custom FFmpeg libraries for torchaudio binary distributions (#2355)
Adding m1 builds to torchaudio (#2421)

Other

Add download utility specialized for torchaudio (#2283)
Use module-level __getattr__ to implement delayed initialization (#2377)
Update build_doc job to use Conda CUDA package (#2395)
Update I/O initialization (#2417)
Add Python 3.10 (build and test) (#2224)
Retrieve version from version.txt (#2434)
Disable OpenMP on mac (#2431)

Examples

Ops

Add CTC decoder example for librispeech (#2130, #2161)
Fix LM, arguments in CTC decoding script (#2235, #2315)
Use pretrained LM API for decoder example (#2317)

Pipelines

Refactor pipeline_demo.py to support variant EMFORMER_RNNT bundles (#2203)
Refactor eval and pipeline_demo scripts in emformer_rnnt (#2238)
Refactor pipeline_demo script in emformer_rnnt recipes (#2239)
Add EMFORMER_RNNT_BASE_MUSTC into pipeline demo script (#2248)

Tests

Add unit tests for Emformer RNN-T LibriSpeech recipe (#2216)
Add fixed random seed for Emformer RNN-T recipe test (#2220)

Training recipes

Add recipe for HuBERT model pre-training (#2143, #2198, #2296, #2310, #2311, #2412)
Add HuBERT fine-tuning recipe (#2352)
Refactor Emformer RNNT recipes (#2212)
Fix bugs from Emformer RNN-T recipes merge (#2217)
Add SentencePiece model training script for LibriSpeech Emformer RNN-T (#2218)
Add training recipe for Emformer RNNT trained on MuST-C release v2.0 dataset (#2219)
Refactor ArgumentParser arguments in emformer_rnnt recipes (#2236)
Add shebang lines to scripts in emformer_rnnt recipes (#2237)
Introduce DistributedBatchSampler (#2299)
Add Conformer RNN-T LibriSpeech training recipe (#2329)
Refactor LibriSpeech Conformer RNN-T recipe (#2366)
Refactor LibriSpeech Lightning datamodule to accommodate different dataset implementations (#2437)

Prototypes

Models

Add Conformer RNN-T model prototype (#2322)
Add ConvEmformer module (streaming-capable Conformer) (#2324, #2358)
Add conv_tasnet_base factory function to prototype (#2411)

Pipelines

Add EMFORMER_RNNT_BASE_MUSTC bundle to torchaudio.prototype (#2241)

Documentation

Add ASR CTC decoding inference tutorial (#2106)
Update context building to not delay the inference (#2213)
Update online ASR tutorial (#2226)
Update CTC decoder docs and add citation (#2278)
[Doc] fix typo and backlink (#2281)
Fix calculation of SNR value in tutorial (#2285)
Add notes about prototype features in tutorials (#2288)
Update README around version compatibility matrix (#2293)
Update decoder pretrained lm docs (#2291)
Add devices/properties badges (#2321)
Fix LibriMix documentation (#2351)
Update wavernn.py (#2347)
Add citations for datasets (#2371)
Update audio I/O tutorials (#2385)
Update MVDR beamforming tutorial (#2398)
Update audio feature extraction tutorial (#2391)
Update audio resampling tutorial (#2386)
Update audio data augmentation tutorial (#2388)
Add tutorial to use NVDEC with Stream API (#2393)
Expand subsections in tutorials by default (#2397)
Fix documentation (#2407)
Fix documentation (#2409)
Dataset doc fixes (#2426)
Update CTC decoder docs (#2443)
Split Streaming API tutorials into two (#2446)
Update HW decoding tutorial and add notes about unseekable object (#2408)

audio - v0.11.0

Published by nateanl over 2 years ago

torchaudio 0.11.0 Release Note

Highlights

TorchAudio 0.11.0 release includes:

Emformer (paper) RNN-T components, training recipe, and pre-trained pipeline for streaming ASR
Voxpopuli pre-trained pipelines
HuBERTPretrainModel for training HuBERT from scratch
Conformer model for speech recognition
Drop Python 3.6 support

[Beta] Emformer RNN-T

To support streaming ASR use cases, the release adds implementations of Emformer (docs), an RNN-T model that uses Emformer (emformer_rnnt_base), and an RNN-T beam search decoder (RNNTBeamSearch). It also includes a pipeline bundle (EMFORMER_RNNT_BASE_LIBRISPEECH) that wraps pre- and post-processing components, the beam search decoder, and the RNN-T Emformer model with weights pre-trained on LibriSpeech, which in whole allow for performing streaming ASR inference out of the box. For reference and reproducibility, the release provides the training recipe used to produce the pre-trained weights in the examples directory.

[Beta] HuBERT Pretrain Model

The masked prediction training of HuBERT model requires the masked logits, unmasked logits, and feature norm as the outputs. The logits are for cross-entropy losses and the feature norm is for penalty loss. The release adds HuBERTPretrainModel and corresponding factory functions (hubert_pretrain_base, hubert_pretrain_large, and hubert_pretrain_xlarge) to enable training from scratch.

[Beta] Conformer (paper)

The release adds an implementation of Conformer (docs), a convolution-augmented transformer architecture that has achieved state-of-the-art results on speech recognition benchmarks.

Backward-incompatible changes

Ops

Removed deprecated F.magphase, F.angle, F.complex_norm, and T.ComplexNorm. (#1934, #1935, #1942)
- Utility functions for pseudo complex types were deprecated in 0.10, and now they are removed in 0.11. For the detail of this migration plan, please refer to #1337.
Dropped pseudo complex support from F.spectrogram, T.Spectrogram, F.phase_vocoder, and T.TimeStretch (#1957, #1958)
- The support for the pseudo complex type was deprecated in 0.10, and now they are removed in 0.11. For the detail of this migration plan, please refer to #1337.
Removed deprecated create_fb_matrix (#1998)
- create_fb_matrix was replaced by melscale_fbanks in release 0.10. It is removed in 0.11. Please use melscale_fbanks.

Datasets

Removed deprecated VCTK (#1825)
- The original VCTK archive file is no longer accessible. Please migrate to VCTK_092 class for the latest version of the dataset.
Removed deprecated dataset utils (#1826)
- Undocumented methods diskcache_iterator and bg_iterator were deprecated in 0.10. They are removed in 0.11. Please cease the usage of them.

Models

Removed unused dimension from pretrained Wav2Vec2 ASR (#1914)
- The final linear layer of Wav2Vec2 ASR models included dimensions (<s>, <pad>, </s>, <unk>) that were not related to ASR tasks and not used. These dimensions were removed.

Build

Dropped support for Python3.6 (#2119, #2139)
- Following the lifecycle of Python-3.6, torchaudio dropped the support for Python 3.6.

New Features

RNN-T Emformer

Introduced Emformer (#1801)
Added Emformer RNN-T model (#2003)
Added RNN-T beam search decoder (#2028)
Cleaned up Emformer module (#2091)
Added pretrained Emformer RNN-T streaming ASR inference pipeline (#2093)
Reorganized RNN-T components in prototype module (#2110)
Added integration test for Emformer RNN-T LibriSpeech pipeline (#2172)
Registered RNN-T pipeline global stats constants as buffers (#2175)
Refactored RNN-T factory function to support num_symbols argument (#2178)
Fixed output shape description in RNN-T docstrings (#2179)
Removed invalid token blanking logic from RNN-T decoder (#2180)
Updated stale prototype references (#2189)
Revised RNN-T pipeline streaming decoding logic (#2192)
Cleaned up Emformer (#2207)
Applied minor fixes to Emformer implementation (#2252)

Conformer

Introduced Conformer (#2068)
Removed subsampling and positional embedding logic from Conformer (#2171)
Moved ASR features out of prototype (#2187)
Passed bias and dropout args to Conformer convolution block (#2215)
Adjusted Conformer args (#2223)

Datasets

Added DR-VCTK dataset (#1819)

Models

Added HuBERT pretrain model to enable training from scratch (#2064)
Added feature mean square value to HuBERT Pretrain model output (#2128)

Pipelines

Added wav2vec2 ASR French pretrained from voxpopuli (#1919)
Added wav2vec2 ASR Spanish pretrained model from voxpopuli (#1924)
Added wav2vec2 ASR German pretrained model from voxpopuli (#1953)
Added wav2vec2 ASR Italian pretrained model from voxpopuli (#1954)
Added wav2vec2 ASR English pretrained model from voxpopuli (#1956)

Build

Added CUDA-11.5 builds to torchaudio (#2067)

Improvements

I/O

Fixed load behavior for 24-bit input (#2084)

Ops

Added OpenMP support (#1761)
Improved MVDR stability (#2004)
Relaxed dtype for MVDR (#2024)
Added warnings in mu_law* for the wrong input type (#2034)
Added parameter p to TimeMasking (#2090)
Removed unused vars from RNN-T loss (#2142)
Removed complex32 dtype in F.griffinlim (#2233)

Datasets

Deprecated data utils (#2073)
Updated URLs for libritts (#2074)
Added subset support for TEDLIUM release3 dataset (#2157)

Models

Replaced dropout with Dropout (#1815)
Inplace initialization of RNN weights (#2010)
Updated to xavier_uniform and avoid legacy data.uniform_ initialization (#2018)
Allowed Tacotron2 decode batch_size 1 examples (#2156)

Pipelines

Added tool to convert voxpopuli model (#1923)
Refactored wav2vec2 pipeline util (#1925)
Allowed the customization of axis exclusion for ASR head (#1932)
Tweaked wav2vec2 checkpoint conversion tool (#1938)
Added melkwargs setting for MFCC in HuBERT pipeline (#1949)

Documentation

Added 0.10.0 to version compatibility matrix (#1862)
Removed MACOSX_DEPLOYMENT_TARGET (#1880)
Updated intersphinx inventory (#1893)
Updated compatibility matrix to include LTS version (#1896)
Updated CONTRIBUTING with doc conventions (#1898)
Added anaconda stats to README (#1910)
Updated README.md (#1916)
Added citation information (#1947)
Updated CONTRIBUTING.md (#1975)
Doc fixes (#1982)
Added tutorial to CONTRIBUTING (#1990)
Fixed docstring (#2002)
Fixed minor typo (#2012)
Updated audio augmentation tutorial (#2082)
Added Sphinx gallery automatically (#2101)
Disabled matplotlib warning in tutorial rendering (#2107)
Updated prototype documentations (#2108)
Added custom CSS to make signatures appear in multi-line (#2123)
Updated prototype pipeline documentation (#2148)
Tweaked documentation (#2152)

Tests

Refactored integration test (#1922)
Enabled integration tests on CI (#1939)
Removed facebook folder in wav2vec unit tests (#2015)
Temporarily skipped threadpool test (#2025)
Revised Griffin-Lim transform test to reduce execution time (#2037)
Fixed CircleCI test failures (#2069)
Do not auto-skip tests on CI (#2127)
Relaxed absolute tolerance for Kaldi compat tests (#2165)
Added tacotron2 unit test with different batch_size (#2176)

Build

Updated GPU resource class (#1791)
Updated the main version to 0.11.0 (#1793)
Updated windows cuda installer 11.1.0 to 11.1.1 (#1795)
Renamed build_tools to tools (#1812)
Limit Windows GPU testing to CUDA-11.3 only (#1842)
Used cu113 for unittest_windows_gpu (#1853)
USE_CUDA in windows and reduce one vcvarsall (#1854)
Check torch installation before building package (#1867)
Install tools from conda instead of brew (#1873)
Cleaned up setup.py (#1900)
Moved TorchAudio conda package to use pytorch-mutex (#1904)
Updated smoke test docker image (#1905)
Fixed formatting CIRCLECI_TAG when building docs (#1915)
Fetch third party sources automatically (#1966)
Disabled SPHINXOPT=-W for local env (#2013)
Improved installing nightly pytorch (#2026)
Improved cuda installation on windows (#2032)
Refactored the library loading mechanism (#2038)
Cleaned up libtorchaudio customization logic (#2039)
Refactored and functionize the library definition (#2040)
Introduced helper function to define extension (#2077)
Standardized the location of third-party source code (#2086)
Show lint diff with color (#2102)
Updated third party submodule setup (#2132)
Suppressed stderr from subprocess in setup.py (#2133)
Fixed header include (#2135)
Updated ROCM version 4.1 -> 4.3.1 and 4.5 (#2186)
Added "cu102" back (#2190)
Pinned flake8 version (#2191)

Style

Removed trailing whitespace (#1803)
Fixed style checks (#1913)
Resolved lint warning (#1971)
Enabled CLANGFORMAT (#1999)
Fixed style checks in examples/tutorials (#2006)
OSS config for lint checks (#2066)
Excluded sphinx-gallery examples (#2071)
Reverted linting exemptions introduced in #2071 (#2087)
Applied arc lint to pytorch audio (#2096)
Enforced lint checks and fix/mute lint errors (#2116)

Other

Replaced issue templates with new issue forms (#1802)
Notify merger if PR is incorrectly labeled (#1937)
Added script to collect PRs between commits (#1943)
Fixed PR labeling requirement (#1946)
Refactored collecting-PR script for release note (#1951)
Fixed bandit failure (#1960)
Renamed bug fix label (#1961)
Updated PR label notifier (#1964)
Reverted "Update PR label notifier (#1964)" (#1965)
Consolidated network utils (#1974)
Added PR collecting script (#2008)
Re-sync with internal repository (#2017)
Updated script for getting PR merger and labels (#2030)
Fixed third party archive fetch job (#2095)
Use python:3.X Docker image for build doc (#2151)
Updated PR labeling workflow (#2160)
Fixed librosa calls (#2208)

Examples

Ops

Removed the MVDR tutorial in examples (#2109)
Abstracted BucketizeSampler to be usable outside of HuBERT example (#2147)
Refactored BucketizeBatchSampler and HuBERTDataset (#2150)
Removed multiprocessing from audio dataset tutorial (#2163)

Models

Added training recipe for RNN-T Emformer ASR model (#2052)
Added global stats script and new json for LibriSpeech RNN-T training recipe (#2183)

Pipelines

Added preprocessing scripts for HuBERT model training (#1911)
Supported multi-node training for source separation pipeline (#1968)
Added bucketize sampler and dataset for HuBERT Base model training pipeline (#2000)
Added librispeech inference script (#2130)

Other

Added unmaintained warnings (#1813)
torch.quantization -> torch.ao.quantization (#1823)
Use download.pytorch.org for asset URL (#2182)
Added deprecation path for renamed training type plugins (#11227)
Renamed DDPPlugin to DDPStrategy (#11142)

audio - torchaudio v0.10.2 Minor release

Published by atalman over 2 years ago

This is a minor release compatible with PyTorch 1.10.2.

There is no feature change in torchaudio from 0.10.1. For the full feature of v0.10, please refer to the v0.10.0 release notes.

audio - torchaudio 0.10.1 Release Note

Published by mthrok almost 3 years ago

This is a minor release, which is compatible with PyTorch 1.10.1 and include small bug fix, improvements and documentation update. There is no new feature added.

Bug Fix

#2050 Allow whitespace as TORCH_CUDA_ARCH_LIST delimiter

Improvement

#2054 Fetch third party source code automatically
The build process now fetches third party source code (git submodule and cmake external projects)
#2059 Improve documentation

For the full feature of v0.10, please refer to the v0.10.0 release note.

audio - v0.10.0

Published by carolineechen almost 3 years ago

torchaudio 0.10.0 Release Note

Highlights

torchaudio 0.10.0 release includes:

New models (Tacotron2, HuBERT) and datasets (CMUDict, LibriMix)
Pretrained model support for ASR (Wav2Vec2, HuBERT) and TTS (WaveRNN, Tacotron2)
New operations (RNN Transducer loss, MVDR beamforming, PitchShift, etc)
CUDA-enabled binaries

[Beta] Wav2Vec2 / HuBERT Models and Pretrained Weights

HuBERT model architectures (“base”, “large” and “extra large” configurations) are added. In addition to that, support for pretrained weights from wav2vec 2.0, Unsupervised Cross-lingual Representation Learning and HuBERT are added.

These pretrained weights can be used for feature extractions and downstream task adaptation.

>>> import torchaudio
>>>
>>> # Build the model and load pretrained weight.
>>> model = torchaudio.pipelines.HUBERT_BASE.get_model()
>>> # Perform feature extraction.
>>> features, lengths = model.extract_features(waveforms)
>>> # Pass the features to downstream task
>>> ...

Some of the pretrained weights are fine-tuned for ASR tasks. The following example illustrates how to use weights and access to associated information, such as labels, which can be used in subsequent CTC decoding steps. (Note: torchaudio does not provide a CTC decoding mechanism.)

>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.HUBERT_ASR_LARGE
>>>
>>> # Build the model and load pretrained weight.
>>> model = bundle.get_model()
Downloading:
100%|███████████████████████████████| 1.18G/1.18G [00:17<00:00, 73.8MB/s]
>>> # Check the corresponding labels of the output.
>>> labels = bundle.get_labels()
>>> print(labels)
('<s>', '<pad>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I', 'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P', 'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z')
>>>
>>> # Infer the label probability distribution
>>> waveform, sample_rate = torchaudio.load(hello-world.wav')
>>>
>>> emissions, _ = model(waveform)
>>>
>>> # Pass emission to (hypothetical) decoder
>>> transcripts = ctc_decode(emissions, labels)
>>> print(transcripts[0])
HELLO WORLD

[Beta] Tacotron2 and TTS Pipeline

A new model architecture, Tacotron2 is added, alongside several pretrained weights for TTS (text-to-speech). Since these TTS pipelines are composed of multiple models and specific data processing, so as to make it easy to use associated objects, a notion of bundle is introduced. Bundles provide a common access point to create a pipeline with a set of pretrained weights. They are available under torchaudio.pipelines module.
The following example illustrates a TTS pipeline where two models (Tacotron2 and WaveRNN) are used together.

>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
>>>
>>> # Build text processor, Tacotron2 and vocoder (WaveRNN) model
>>> processor = bundle.get_text_preprocessor()
>>> tacotron2 = bundle.get_tacotron2()
Downloading:
100%|███████████████████████████████| 107M/107M [00:01<00:00, 87.9MB/s]
>>> vocoder = bundle.get_vocoder()
Downloading:
100%|███████████████████████████████| 16.7M/16.7M [00:00<00:00, 78.1MB/s]
>>>
>>> text = "Hello World!"
>>>
>>> # Encode text
>>> input, lengths = processor(text)
>>>
>>> # Generate (mel-scale) spectrogram
>>> specgram, lengths, _ = tacotron2.infer(input, lengths)
>>>
>>> # Convert spectrogram to waveform
>>> waveforms, lengths = vocoder(specgram, lengths)
>>>
>>> # Save audio
>>> torchaudio.save('hello-world.wav', waveforms, vocoder.sample_rate)

[Beta] RNN Transducer Loss

The loss function used in the RNN transducer architecture, which is widely used for speech recognition tasks, is added. The loss function (torchaudio.functional.rnnt_loss or torchaudio.transforms.RNNTLoss) supports float16 and float32 logits, has autograd and torchscript support, and can be run on both CPU and GPU, which has a custom CUDA kernel implementation for improved performance.

[Beta] MVDR Beamforming

This release adds support for MVDR beamforming on multi-channel audio using Time-Frequency masks. There are three solutions (ref_channel, stv_evd, stv_power) and it supports single-channel and multi-channel (perform average in the method) masks. It provides an online option that recursively updates the parameters for streaming audio.
Please refer to the MVDR tutorial.

GPU Build

This release adds GPU builds that support custom CUDA kernels in torchaudio, like the one being used for RNN transducer loss. Following this change, torchaudio’s binary distribution now includes CPU-only versions and CUDA-enabled versions. To use CUDA-enabled binaries, PyTorch also needs to be compatible with CUDA.

Additional Features

torchaudio.functional.lfilter now supports batch processing and multiple filters. Additional operations, including pitch shift, LFCC, and inverse spectrogram, are now supported in this release. The datasets CMUDict and LibriMix are added as well.

Backward Incompatible Changes

I/O

Default to PCM_16 for flac on soundfile backend (#1604)
- When saving FLAC format with “soundfile” backend, PCM_24 (the previous default) could cause warping. The default has been changed to PCM_16, which does not suffer this.

Ops

Default to native complex type when returning raw spectrogram (#1549)
- When power=None, torchaudio.functional.spectrogram and torchaudio.transforms.Spectrogram now defaults to return_complex=True, which returns Tensor of native complex type (such as torch.cfloat and torch.cdouble). To use a pseudo complex type, pass the resulting tensor to torch.view_as_real.
Remove deprecated kaldi.resample_waveform (#1555)
- Please use torchaudio.functional.resample.
Replace waveform with specgram in SlidingWindowCmn (#1859)
- The argument name was corrected to specgram.
Ensure integer input frequencies for resample (#1857)
- Sampling rates were silently cast to integers in the resampling implementation, so it now requires integer sampling rate inputs to ensure expected resampling quality.

Wav2Vec2

Update extract_features of Wav2Vec2Model (#1776)
- The previous implementation returned outputs from convolutional feature extractors. To match the behavior with the original fairseq’s implementation, the method was changed to return the outputs of the intermediate layers of transformer layers. To achieve the original behavior, please use Wav2Vec2Model.feature_extractor().
Move fine-tune specific module out of wav2vec2 encoder (#1782)
- The internal structure of Wav2Vec2Model was updated. Wav2Vec2Model.encoder.read_out module is moved to Wav2Vec2Model.aux. If you have serialized state dict, please replace the key encoder.read_out with aux.
Updated wav2vec2 factory functions for more customizability (#1783, #1804, #1830)
- The signatures of wav2vec2 factory functions are changed. num_out parameter has been changed to aux_num_out and other parameters are added before it. Please update the code from wav2vec2_base(num_out) to wav2vec2_base(aux_num_out=num_out).

Deprecations

Add melscale_fbanks and deprecate create_fb_matrix (#1653)
- As linear_fbanks is introduced, create_fb_matrix is renamed to melscale_fbanks. The original create_fb_matrix is now deprecated. Please use melscale_fbanks.
Deprecate VCTK dataset (#1810)
- This dataset has been taken down and is no longer available. Please use VCTK_092 dataset.
Deprecate data utils (#1809)
- bg_iterator and diskcache_iterator are known to not improve the throughput of data loaders. Please cease their usage.

New Features

Models

Tacotron2

Add Tacotron2 model (#1621, #1647, #1844)
Add Tacotron2 loss function (#1764)
Add Tacotron2 inference method (#1648, #1839, #1849)
Add phoneme text preprocessing for Tacotron2 (#1668)
Move Tacotron2 out of prototype (#1714)

HuBERT

Add HuBERT model architectures (#1769, #1811)

Pretrained Weights and Pipelines

Add pretrained weights for wavernn (#1612) 
Add Tacotron2 pretrained models (#1693) 
Add HUBERT pretrained weights (#1821, #1824) 
Add pretrained weights from wav2vec2.0 and XLSR papers (#1827) 
Add customization support to wav2vec2 labels (#1834) 
Default pretrained weights to eval mode (#1843) 
Move wav2vec2 pretrained models to pipelines module (#1876) 
Add TTS bundle/pipelines (#1872) 
Fix vocoder interface (#1895) 
Fix Phonemizer download (#1897)

RNN Transducer Loss

Add reduction parameter for RNNT loss (#1590) 
Rename RNNT loss C++ parameters (#1602) 
Rename transducer to RNNT (#1603) 
Remove gradient variable from RNNT loss Python code (#1616) 
Remove reuse_logits_for_grads option for RNNT loss (#1610) 
Remove fused_log_softmax option from RNNT loss (#1615) 
RNNT loss resolve null gradient (#1707) 
Move RNNT loss out of prototype (#1711)

MVDR Beamforming

Add MVDR module to example (#1709) 
Add normalization to steering vector solutions in MVDR Module (#1765) 
Move MVDR and PSD modules to transforms (#1771) 
Add MVDR beamforming tutorial to example directory (#1768)

Ops

Add edit_distance (#1601) 
Add PitchShift to functional and transform (#1629) 
Add LFCC feature to transforms (#1611) 
Add InverseSpectrogram to transforms and functional (#1652)

Datasets

Add CMUDict dataset (#1627) 
Move LibriMix dataset to datasets directory (#1833)

Improvements

I/O

Make buffer size for function info configurable (#1634)

Ops

Replace deprecated AutoNonVariableTypeMode (#1583) 
Remove lazy behavior from MelScale (#1636) 
Simplify axis value checks (#1501) 
Use at::parallel_for in lfilter core loop (#1557) 
Add filterbanks support to lfilter (#1587) 
Add batch support to lfilter (#1638) 
Use integer rates in pitch shift resample (#1861)

Models

Rename infer method to forward for WaveRNNInferenceWrapper (#1650) 
Refactor WaveRNN infer and move it to the codebase (#1704) 
Make the core wav2vec2 factory function public (#1829) 
Refactor WaveRNNInferenceWrapper (#1845) 
Store n_bits in WaveRNN (#1847) 
Replace custom padding with torch’s native impl (#1846) 
Avoid concatenation in loop (#1850) 
Add lengths param to WaveRNN.infer (#1851) 
Add sample rate to wav2vec2 bundle (#1878) 
Remove factory functions of Tacotron2 and WaveRNN (#1874)

Datasets

Fix encoding of CMUDict data reading (#1665) 
Rename utterance to transcript in datasets (#1841) 
Clean up constructor of CMUDict (#1852)

Performance

Refactor transforms.Fade on GPU computation (#1871)

CUDA

Tensor shape	[1,4,8000]	[1,4,16000]	[1,4,32000]
0.10	119	120	123
0.9	160	184	240

Unit: msec

Examples

Add text preprocessing utilities for TTS pipeline (#1639) 
Replace simple_ctc with Python greedy decoder (#1558) 
Add an inference example for WaveRNN (#1637) 
Refactor coding style for WaveRNN example (#1663) 
Add style checks on example files on CI (#1667) 
Add Tacotron2 training script (#1642) 
Add an inference example for Tacotron2 (#1654) 
Fix Tacotron2 inference example (#1716) 
Fix WaveRNN training example (#1740) 
Training recipe for ConvTasNet on Libri2Mix dataset (#1757)

Build

Update skipIfNoCuda decorator and force GPU tests in GPU CIs (#1559) 
Temporarily pin nightly version on Linux/macOS CPU unittest (#1598) 
Temporarily pin nightly version on Linux GPU unitest (#1606) 
Revert CI hot fix (#1614) 
Expose USE_CUDA in build (#1609) 
Pin MKL to 2021.2.0 (#1655) 
Simplify extension initialization (#1649) 
Synchronize extension initialization mechanism with fbcode (#1682) 
Ensure we’re propagating BUILD_VERSION (#1697) 
Guard Kaldi’s version generation (#1715) 
Update sphinx to 3.5.4 (#1685) 
Default to BUILD_SOX=1 in non-Windows systems (#1725) 
Add CUDA install step to Win Packaging jobs (#1732) 
setup.py should parse TORCH_CUDA_ARCH_LIST (#1733) 
Simplify the extension initialization process (#1734) 
Fix CUDA build logic for _torchaudio.so (#1737) 
Enable Linux wheel/conda GPU package builds (#1730) 
Increase no_output_timeout to 20m for WinConda (#1738) 
Build torchaudio for 11.3 as well (#1747) 
Upload wheels to respective folders (#1751) 
Extract PyBind11 feature implementations (#1739) 
Update the way to access libsox global config (#1755) 
Fix ROCM build error (#1729) 
Fix compile warnings (#1762) 
Migrate CircleCI docker image (#1767) 
Split extension into custom impl and Python wrapper libraries (#1752) 
Put libtorchaudio in lib directory (#1773) 
Update win gpu image from previous to stable (#1786) 
Set libtorch audio suffix as pyd on Windows (#1788) 
Fix build on Windows with CUDA (#1787) 
Enable audio windows cuda tests (#1777) 
Set release and base PyTorch version (#1816) 
Exclude prototype if it is in release (#1870) 
Log prototype exclusion (#1882) 
Update prototype exclusion (#1885) 
Remove alpha from version number (#1901)

Testing

Migrate resample tests from kaldi to functional (#1520) 
Add autograd gradcheck test for RNN transducer loss (#1532) 
Fix HF wav2vec2 test (#1585) 
Update unit test CUDA to 10.2 (#1605) 
Fix CircleCI unittest environemnt 
Remove skipIfRocm from test_fileobj_flac in soundfile.save_test (#1626) 
MFCC test refactor (#1618) 
Refactor RNNT Loss Unit Tests (#1630) 
Reduce sample rate to avoid test time out (#1640) 
Refactor text preprocessing tests in Tacotron2 example (#1635) 
Move test initialization logic to dedicated directory (#1680) 
Update pitch shift batch consistency test (#1700) 
Refactor scripting in test (#1727) 
Update the version of fairseq used for testing (#1745) 
Put output tensor on proper device in get_whitenoise (#1744) 
Refactor batch consistency test in transforms (#1772) 
Tweak test name by appending factory function name (#1780) 
Enable audio windows cuda tests (#1777) 
Skip hubert_asr_xlarge TS test on Windows (#1800) 
Skip hubert_xlarge TS test on Windows (#1807)

Others

Remove unused files (#1588) 
Remove residuals for removed modules (#1599) 
Remove torchscript bc test references (#1623) 
Remove torchaudio._internal.fft module (#1631)

Misc

Rename master branch to main (#1649) 
Fix Python spacing (#1670) 
Lint fix (#1726) 
Add .gitattributes (#1731) 
Style fixes (#1766) 
Update reference from master to main elsewhere (#1784)

Bug Fixes

Fix models import (#1664) 
Fix HF model integration (#1781)

Documentation

README Updates 
- Update README (#1544) 
- Remove NumPy dependency from README (#1582) 
- Fix typos and sentence structure in README.md (#1633) 
- Update and move convention section to CONTRIBUTING.md (#1635) 
- Remove unnecessary README (#1728) 
- Add link to TTS colab example to README (#1748) 
- Fix typo in source separation README (#1774) 
Docstring Changes 
- Set removal version of pseudo complex support (#1553) 
- Update docs (#1584) 
- Add return type in doc for RNNT loss (#1591) 
- Improve RNNT loss docstrings (#1642) 
- Add documentation for CMUDict’s property (#1683) 
- Refactor lfilter docs (#1698) 
- Standardize optional types in docstrings (#1746) 
- Fix return type of wav2vec2 model (#1790) 
- Add equations to MVDR docstring (#1789) 
- Standardize tensor shapes format in docs (#1838) 
- Add license to pre-trained model doc (#1836) 
- Update Tacotron2 docs (#1840) 
- Fix PitchShift docstring (#1866) 
- Update descriptions of lengths parameters (#1890) 
- Standardization and minor fixes (#1892) 
- Update models/pipelines doc (#1894) 
Docs formatting 
- Remove override CSS (#1554) 
- Add prototype.tacotron2 page to docs (#1695) 
- Add doc for InverseSepctrogram (#1706) 
- Add sections to transforms docs (#1720) 
- Add edit_distance to documentation with a new category Metric (#1743) 
- Fix model subsections (#1775) 
- List all the pre-trained models on right bar (#1828) 
- Put pretrained weights to subsection (#1879) 
Examples (see #1564) 
- Add example code for Resample (#1644) 
- Fix examples in transforms (#1646) 
- Add example for ComplexNorm (#1658) 
- Add example for MuLawEncoding (#1586) 
- Add example for Spectrogram (#1566) 
- Add example for GriffinLim (#1671) 
- Add example for MuLawDecoding (#1684) 
- Add example for Fade transform (#1719) 
- Update RNNT loss docs and add example (#1835) 
- Add SpecAugment figure/citation (#1887) 
- Add filter bank figures (#1891)