Data manipulation and transformation for audio signal processing, powered by PyTorch
BSD-2-Clause License
Published by mthrok over 3 years ago
The torchaudio 0.9.0 release includes:
This release includes model architectures from the wav2vec 2.0 paper, with utility functions that allow importing pretrained model parameters published on fairseq and the Hugging Face Hub. Now you can easily run speech recognition with torchaudio. These model architectures also support TorchScript, and you can deploy them with ONNX or in non-Python environments, such as C++, Android and iOS. Please check out our C++, Android and iOS examples. The following snippets illustrate how to create a deployable model.
# Import fine-tuned model from Hugging Face Hub
import torch
from transformers import Wav2Vec2ForCTC
from torchaudio.models.wav2vec2.utils import import_huggingface_model

original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
imported = import_huggingface_model(original)

# Import fine-tuned model from fairseq
import fairseq
from torchaudio.models.wav2vec2.utils import import_fairseq_model

original, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task(
    ["wav2vec_small_960h.pt"], arg_overrides={'data': "<data_dir>"})
imported = import_fairseq_model(original[0].w2v_encoder)

# Build uninitialized model and load state dict
from torchaudio.models import wav2vec2_base

model = wav2vec2_base(num_out=32)
model.load_state_dict(imported.state_dict())

# Quantize / script / optimize for mobile
from torch.utils.mobile_optimizer import optimize_for_mobile

quantized_model = torch.quantization.quantize_dynamic(
    model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
scripted_model = torch.jit.script(quantized_model)
optimized_model = optimize_for_mobile(scripted_model)
optimized_model.save("model_for_deployment.pt")
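The quantize-and-script steps above can be tried on a small stand-in module without any pretrained weights; the TinyModel class below is a hypothetical placeholder for illustration, not part of torchaudio:

```python
import torch

# Hypothetical stand-in for the acoustic model, so the
# quantize -> script pipeline can be exercised end to end.
class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 4)

    def forward(self, x):
        return self.linear(x)

model = TinyModel().eval()
# Dynamic quantization converts Linear weights to int8
quantized = torch.quantization.quantize_dynamic(
    model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
# TorchScript makes the module deployable outside Python
scripted = torch.jit.script(quantized)
out = scripted(torch.randn(2, 16))
```

The same pattern applies to the real wav2vec2 model; only the module being quantized changes.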
The internal implementation of lfilter has been updated to support autograd on both CPU and CUDA. Additionally, the performance on CPU is significantly improved. These improvements also apply to biquad variants.

The following table illustrates the performance improvements compared against the previous releases. lfilter was applied on float32 tensors with one channel and different numbers of frames.
Unit: msec
torchaudio has functions that handle complex-valued tensors. In the early days, when PyTorch did not have a complex dtype, torchaudio adopted the convention of using an extra dimension to represent the real and imaginary parts. In PyTorch 1.6, new dtypes such as torch.cfloat and torch.cdouble were introduced to represent complex values natively. (In the following, we refer to torchaudio's original convention as pseudo complex types, and PyTorch's native dtypes as native complex types.)

As the native complex types have become mature and stable, torchaudio has started to migrate complex functions to the native complex type. In this release, the internal implementation was updated to use native complex types, and interfaces were updated to allow passing/receiving native complex types directly. Users can keep using the pseudo complex type or opt in to the native complex type; however, please note that use of the pseudo complex type is now deprecated. These functions are tested to support TorchScript and autograd. For the details of this migration plan, please refer to #1337.
Additionally, switching the internal implementation to the native complex types improved the performance. Since the internal implementation uses native complex type regardless of which complex type is passed/returned, users will automatically benefit from this performance improvement.
The following table illustrates the performance improvements from the previous release by comparing the time it takes for complex transforms to perform the operation on a float32 tensor with two channels and 256 frames.
Unit: msec
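The relationship between the two conventions can be illustrated with plain PyTorch: view_as_complex turns a pseudo complex tensor (trailing dimension of size 2) into a native complex tensor, and view_as_real goes back.

```python
import torch

# pseudo complex: real and imaginary parts in an extra trailing dimension
pseudo = torch.randn(2, 201, 100, 2)

# native complex: a single tensor with dtype torch.cfloat
native = torch.view_as_complex(pseudo)

roundtrip = torch.view_as_real(native)
```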
Along with the work on Complex Tensor Migration and Filtering Improvement mentioned above, more tests were added to ensure autograd support. The following operations are now guaranteed to support autograd up to second order.

Functional:
lfilter
allpass_biquad
biquad
band_biquad
bandpass_biquad
bandreject_biquad
bass_biquad
equalizer_biquad
treble_biquad
highpass_biquad
lowpass_biquad
amplitude_to_DB
spectrogram
griffinlim
resample
phase_vocoder *
mask_along_axis_iid
mask_along_axis
gain
spectral_centroid

Transforms:
AmplitudeToDB
ComputeDeltas
Fade
GriffinLim
TimeMasking
FrequencyMasking
MFCC
MelScale
MelSpectrogram
Resample
SpectralCentroid
Spectrogram
SlidingWindowCmn
TimeStretch *
Vol

NOTE: torchaudio.transforms.TimeStretch and torchaudio.functional.phase_vocoder call atan2, which is not differentiable around zero. Therefore these functions are differentiable only when the input spectrogram does not contain values around zero.

In release 0.8, the resampling operation was vectorized and its performance improved. In this release, the implementation of the resampling algorithm has been further revised.
- A rolloff parameter has been added for anti-aliasing control.
- torchaudio.transforms.Resample precomputes the kernel using float64 precision and caches it for even faster operation.
- torchaudio.functional.resample has been added, and the original entry point, torchaudio.compliance.kaldi.resample_waveform, is deprecated.

The following table illustrates the performance improvements from the previous release by comparing the time it takes for torchaudio.transforms.Resample to complete the operation on a float32 tensor with two channels and one-second duration.
Unit: msec
torchaudio implements some operations in C++ for reasons such as performance and integration with third-party libraries. This C++ module was only available on Linux and macOS. In this release, Windows packages also come with C++ module.
The C++ module in the Windows package includes the efficient filtering implementation mentioned above; however, the “sox_io” backend and torchaudio.functional.compute_kaldi_pitch are not included.
Since the 0.6 release, we have continuously improved I/O functionality. Specifically, in 0.8 the default backend was changed from “sox” to “sox_io”, and a similar API change was applied to the “soundfile” backend. The 0.9 release concludes this migration by removing the deprecated backends. For details, please refer to #903.
- The normalized argument was removed from torchaudio.functional.griffinlim (#1369)
- The torchaudio.functional.sliding_window_cmn arg was renamed for correctness (#1347). If you were using waveform=..., please change it to specgram=...
- Changed torchaudio.transforms.Resample to precompute and cache the resampling kernel (#1499, #1514). To use the transform on a CUDA tensor, move the instantiated transform to the CUDA device first:

resampler = torchaudio.transforms.Resample(orig_freq=8000, new_freq=44100)
resampler.to(torch.device("cuda"))

- torchaudio no longer supports programmatic download of the Common Voice dataset. Please remove the arguments from your code.
- torchaudio is adopting the native complex type; the pseudo complex type and the related utility functions are now deprecated. Please refer to #1337 for the migration process.
- Deprecated torchaudio.compliance.kaldi.resample_waveform (#1533). Please use torchaudio.functional.resample.
- torchaudio.transforms.MelScale now expects a valid n_stft value (#1515). Please provide a valid n_stft.
- torchaudio.functional.lfilter (#1319)
- torchaudio.functional.lfilter (#1310, #1441)
- torchaudio.functional.resample (#1402)
- rolloff parameter (#1488)
- torchaudio.transforms.Resample (#1499, #1514, #1556)
- torchaudio.functional.phase_vocoder and torchaudio.transforms.TimeStretch (#1410)
- return_complex to torchaudio.functional.spectrogram and torchaudio.transforms.Spectrogram (#1366, #1551)
- __str__ override to AudioMetaData for easy print (#1339)
- sox/utils.cpp (#1306)
- check_length from validate_input_file (#1312)
- torchaudio.functional.griffinlim (#1368)
- torchaudio.transforms.MelScale when n_stft is invalid (#1505)
- __all__ (#1458)
- reference_cast in make_boxed_from_unboxed_functor (#1300)
- torchaudio.transforms.GriffinLim (#1433)
- librosa's Mel scale conversion with torchaudio's in WaveRNN example (#1444)
- config.guess to support source build in recent architectures (#1484)
- torchaudio.functional.lfilter and biquad variants (#1400, #1438)
- torchaudio.transforms.FrequencyMasking (#1498)
- torchaudio.transforms.SlidingWindowCmn (#1482)
- torchaudio.transforms.MelScale (#1467)
- torchaudio.transforms.Vol (#1460)
- torchaudio.transforms.TimeStretch (#1420)
- torchaudio.transforms.AmplitudeToDB (#1447)
- torchaudio.transforms.GriffinLim (#1421)
- torchaudio.transforms.SpectralCentroid (#1425)
- torchaudio.transforms.ComputeDeltas (#1422)
- torchaudio.transforms.Fade (#1424)
- torchaudio.transforms.Resample (#1416)
- torchaudio.transforms.MFCC (#1415)
- torchaudio.transforms.Spectrogram / MelSpectrogram (#1340)
- torchaudio.functional.lfilter shape (#1360)
- torchaudio.functional.resample (#1516)
- torchaudio.functional.phase_vocoder (#1379)
- floor_divide with div (#1455)
- torch.assert_allclose with assertEqual (#1387)
- torchaudio.functional.lfilter autograd tests input size (#1443)
- torchaudio.transforms.InverseMelScale comparison test (#1437)
- torchaudio.transforms.TimeMasking and torchaudio.transforms.FrequencyMasking to perform out-of-place masking (#1481)
- power of torchaudio.transforms.MelSpectrogram as float only (#1572)
- torch.nn.functional.conv1d in torchaudio.functional.lfilter (#1318)
- torchaudio.functional.overdrive (#1299)
- sox_effects.apply_effects_tensor is CPU-only (#1459)
- sliding_window_cmn (#1383)

Published by vincentqb over 3 years ago
This release depends on PyTorch 1.8.1.
Published by vincentqb over 3 years ago
This release supports Python 3.9.
Continuing from the previous release, torchaudio improves the audio I/O mechanism. In this release, we have four major updates.
Backend migration.
We have migrated the default backend for audio I/O. The new default backend is “sox_io” (for Linux/macOS). The interface for the “soundfile” backend has also been changed to align with that of “sox_io”. Following the change of default backends, the legacy backend/interface has been marked as deprecated. The legacy backend/interface is still accessible, though it is strongly discouraged to use it. For details on the migration, please refer to #903.
File-like object support.
We have added file-like object support to I/O functions and sox_effects. You can perform the info, load, save, and apply_effects_file operations on file-like objects.
import io
import tarfile

import boto3
import requests
import torchaudio

# Query audio metadata over HTTP
# Will only fetch the first few kB
with requests.get(URL, stream=True) as response:
    metadata = torchaudio.info(response.raw)

# Load audio from TAR file
# No need to extract the TAR file.
with tarfile.open(TAR_PATH, mode='r') as tarfile_:
    fileobj = tarfile_.extractfile(SAMPLE_TAR_ITEM)
    waveform, sample_rate = torchaudio.load(fileobj)

# Save to a bytes buffer
# Using BytesIO, you can perform in-memory encoding/decoding.
buffer_ = io.BytesIO()
torchaudio.save(buffer_, waveform, sample_rate, format="wav")

# Apply effects (lowpass filter / resampling) while loading audio from S3
client = boto3.client('s3')
response = client.get_object(Bucket=S3_BUCKET, Key=S3_KEY)
waveform, sample_rate = torchaudio.sox_effects.apply_effects_file(
    response['Body'], [["lowpass", "-1", "300"], ["rate", "8000"]])
[Beta] Codec Application.
Built upon the file-like object support, we added the functional.apply_codec function, which can degrade audio data by applying audio codecs supported by the “sox_io” backend, in an in-memory fashion.
import torchaudio.functional as F

# Apply MP3 codec
degraded = F.apply_codec(
    waveform, sample_rate, format="mp3", compression=-9)
# Apply GSM codec
degraded = F.apply_codec(waveform, sample_rate, format="gsm")
Encoding options.
We have added encoding options to the save function of the new backends. Now you can change the format and encoding with the format, encoding, and bits_per_sample options.
# Save without any encoding option.
# The function will pick the encoding that fits the provided data.
# For a Tensor of float32 type, that is 32-bit floating-point PCM.
torchaudio.save("data.wav", waveform, sample_rate)
# Save as 16-bit signed integer Linear PCM
# The resulting file occupies half the storage but loses precision
torchaudio.save(
"data.wav", waveform, sample_rate, encoding="PCM_S", bits_per_sample=16)
More format support for "sox_io"'s save function.
We have added support for the GSM, HTK, AMB, and AMR-NB formats to "sox_io"'s save function.
torchaudio was using CMake to build third-party dependencies. Now torchaudio uses CMake to build its C++ extension as well. This will open the door to integrating torchaudio in non-Python environments (such as C++ applications and mobile). We will work on adding example applications and mobile integrations in upcoming releases.
Published by vincentqb almost 4 years ago
This release introduces support for Python 3.9. There is no 0.7.1 release; the following changes are compared to 0.7.0.
- download=True in CommonVoice (#1076)

Published by vincentqb almost 4 years ago
torchaudio is expanding its support for models and end-to-end applications. Please file an issue on GitHub to provide feedback on them.
As you are likely already aware from the last release, we’re currently in the process of making sox_io, which ships with new features such as TorchScript support and performance improvements, the new default backend. If you want to benefit from these features now, we encourage you to migrate. For more information, see issue #903.
- str.format to adopt changes in PyTorch, leading to improved error messages for TorchScript (#850)
- sox_utils.list_formats() for read and write (#811)
- VCTK_092 dataset (#812)
- sox_io backend (#871)
- soundfile backend to the one identical to sox_io backend (#922)
- soundfile compatibility backend (#922)
- torchaudio.compliance.kaldi.fbank (#947)
- pathlib.Path support to sox_io backend (#907)
- sox_io C++ implementation (#779)
- sox_io and sox_effects (#806)
- noise_shaping = True (#865)
- zip_safe = False to disable egg installation (#842)
- istft wrapper in favor of torch.istft (#841)
- SoxEffect and SoxEffectsChain (#787)
- sox backend (#904)
- soundfile (#922)
- load_wav functions (#905)

Published by vincentqb about 4 years ago
torchaudio now includes a new model module (with wav2letter included), new functionals (contrast, cvm, dcshift, overdrive, vad, phaser, flanger, biquad), datasets (GTZAN, CMU), and a new optional sox backend with support for TorchScript. torchaudio now also supports Windows, with the soundfile backend.
torchaudio requires Python 3.6 or later.
Published by vincentqb over 4 years ago
torchaudio includes new transforms (e.g. Griffin-Lim and inverse Mel scale), new filters (e.g. all pass, fade, band pass/reject, band, treble, deemph, riaa), and datasets (LJ Speech and SpeechCommands).
Published by vincentqb almost 5 years ago
torchaudio 0.4 improves on current transformations, datasets, and backend support.
We would like to thank again our contributors and the wider community for their significant contributions to this release. In particular we'd like to thank @keunwoochoi, @ksanjeevan, and all the other maintainers and contributors of torchaudio-contrib for their significant and valuable additions around augmentations (#285) and batching (#327).
- downsample, transform, target_transform, and return_dict are being deprecated.
- torchaudio.functional.detect_pitch_frequency. (#313, #322)
- torchaudio.transforms: TimeStretch, FrequencyMasking, TimeMasking. (#285, #333, #348)
- torchaudio.transforms.ComplexNorm. (#285, #333)
- torchaudio.functional.compute_deltas. (#268, #326)
- torchaudio.functional.gain and torchaudio.functional.dither (#319, #360). We welcome work to continue the effort to implement features available in SoX, see #260.
- equalizer_biquad (#315, #340), lowpass_biquad, highpass_biquad (#275), lfilter, and biquad (#275, #291, #326) in torchaudio.functional.
- torchaudio.functional.mfcc. (#228)
- MelScale and librosa. (#294)
- torchaudio.compliance.kaldi.resample_waveform where internal variables were not moved to the GPU when used. (#277)
- istft where the dtype and device of parameters were not created on the same device as the tensor provided by the user. (#264)
- load_state_dict). (#246)
- torchaudio.load to [-1, 1]. (#283)

Published by vincentqb almost 5 years ago
This release is to update the dependency to PyTorch 1.3.1.
Published by vincentqb almost 5 years ago
This release is to update the dependency to PyTorch 1.3.0.
Published by jamarshon about 5 years ago
torchaudio has been redesigned to be an extension of PyTorch and part of the domain APIs (DAPI) ecosystem. Domain specific libraries such as this one are kept separated in order to maintain a coherent environment for each of them. As such, torchaudio is an ML library that provides relevant signal processing functionality, but it is not a general signal processing library. The full rationale of this new standardization can be found in the README.md.
In light of these changes some transforms have been removed or have different argument names and conventions. See the section on backwards breaking changes for a migration guide.
We provide binaries via pip and conda. They require PyTorch 1.2.0 and newer. See https://pytorch.org/ for installation instructions.
We would like to thank our contributors and the wider community for their significant contributions to this release. We are happy to see an active community around torchaudio and are eager to further grow and support it.
In particular we'd like to thank @keunwoochoi, @ksanjeevan, and all the other maintainers and contributors of torchaudio-contrib for their significant and valuable additions around standardization and the support of complex numbers (https://github.com/pytorch/audio/pull/131, https://github.com/pytorch/audio/issues/110, https://github.com/keunwoochoi/torchaudio-contrib/issues/61, https://github.com/keunwoochoi/torchaudio-contrib/issues/36).
An implementation of basic transforms with a Kaldi-like interface.
We added the functions spectrogram, fbank, and resample_waveform (https://github.com/pytorch/audio/pull/119, https://github.com/pytorch/audio/pull/127, and https://github.com/pytorch/audio/pull/134). For more details see the documentation on torchaudio.compliance.kaldi which mirrors the arguments and outputs of Kaldi features.
As an example we can look at the sinc interpolation resampling similar to Kaldi’s implementation. In the figure below, the blue dots are the original signal and red dots are the downsampled signal with half the original frequency. The red dot elements are approximately every other original element.
specgram = torchaudio.compliance.kaldi.spectrogram(waveform, frame_length=...)
fbank = torchaudio.compliance.kaldi.fbank(waveform, num_mel_bins=...)
resampled_waveform = torchaudio.compliance.kaldi.resample_waveform(waveform, orig_freq=...)
Constructing a signal from a spectrogram can be used in applications like source separation or to generate audio signals to listen to. More specifically, torchaudio.functional.istft is the inverse of torch.stft. It has the same parameters (plus an additional optional length parameter) and returns the least squares estimation of the original signal.
torch.manual_seed(0)
n_fft = 5
waveform = torch.rand(2, 5)
stft = torch.stft(waveform, n_fft=n_fft)
approx_waveform = torchaudio.functional.istft(stft, n_fft=n_fft, length=waveform.size(1))
>>> waveform
tensor([[0.4963, 0.7682, 0.0885, 0.1320, 0.3074],
[0.6341, 0.4901, 0.8964, 0.4556, 0.6323]])
>>> approx_waveform
tensor([[0.4963, 0.7682, 0.0885, 0.1320, 0.3074],
[0.6341, 0.4901, 0.8964, 0.4556, 0.6323]])
- Compose: SPECTROGRAM, F2M, and MEL have been removed. Please use Spectrogram, MelScale, and MelSpectrogram.
- LC2CL and BLC2CBL: While the LC layout might be common in signal processing, support for it is out of scope of this library, and transforms such as LC2CL only aid their proliferation. Please use transpose if you need this behavior.
- Scale, PadTrim, DownmixMono: Please use division in place of Scale, torch.nn.functional.pad/trim in place of PadTrim, and torch.mean on the channel dimension in place of DownmixMono.
- torchaudio.legacy has been removed. Please use torchaudio.load and torchaudio.save.
- Spectrogram used to be of dimension (channel, time, freq) and is now (channel, freq, time). Similarly for MelScale, MelSpectrogram, and MFCC, time is the last dimension. Please see our README for an explanation of the rationale behind these changes. Please use transpose to get the previous behavior.
- MuLawExpanding was renamed to MuLawDecoding as the inverse of MuLawEncoding (https://github.com/pytorch/audio/pull/159)
- SpectrogramToDB was renamed to AmplitudeToDB (https://github.com/pytorch/audio/pull/170). The input does not necessarily have to be a spectrogram, so the transform can be used in many more cases, as the new name reflects.
- Spectrogram, AmplitudeToDB, MelScale, MelSpectrogram, MFCC, MuLawEncoding, and MuLawDecoding. (https://github.com/pytorch/audio/pull/118)
- Spectrogram, AmplitudeToDB, MelScale, MelSpectrogram, MFCC, MuLawEncoding, and MuLawDecoding (https://github.com/pytorch/audio/pull/118)
- test_transforms.py where double tensors were compared with floats (https://github.com/pytorch/audio/pull/132)
- vctk.read_audio (issue https://github.com/pytorch/audio/issues/143), as there were issues with downsampling using SoxEffectsChain (https://github.com/pytorch/audio/pull/145)
- sox_close (https://github.com/pytorch/audio/pull/174)

Published by jamarshon about 5 years ago
The goal of this release is to fix the current API, as there will be future changes that break backward compatibility in order to improve the library as more thought is given to design, capabilities, and usability.
While this release is compatible with all currently known PyTorch versions (<=1.2.0), the available binaries will only require PyTorch 1.1.0. Installation commands:
# Wheels for Python 2 are NOT supported
# Python 3.5
$ pip3 install http://download.pytorch.org/whl/torchaudio-0.2-cp35-cp35m-linux_x86_64.whl
# Python 3.6
$ pip3 install http://download.pytorch.org/whl/torchaudio-0.2-cp36-cp36m-linux_x86_64.whl
# Python 3.7
$ pip3 install http://download.pytorch.org/whl/torchaudio-0.2-cp37-cp37m-linux_x86_64.whl
Continuous integration (Travis CI) has been set up in https://github.com/pytorch/audio/pull/117. This means all the tests have been fixed, and their status can be checked at https://travis-ci.org/pytorch/audio. The test files have to be run separately via build_tools/travis/test_script.sh because closing sox after a test file is completed prevents it from being reopened. The testing framework is pytest.
# Run the whole test suite
$ build_tools/travis/test_script.sh
# Run an individual test
$ python -m pytest test/test_transforms.py
Kaldi IO has been added as an optional dependency in https://github.com/pytorch/audio/pull/111. torchaudio provides a simple wrapper around it by converting the np.ndarray into a torch.Tensor. Functions include: read_vec_int_ark, read_vec_flt_scp, read_vec_flt_ark, read_mat_scp, and read_mat_ark.
>>> # read ark to a 'dictionary'
>>> d = { u:d for u,d in torchaudio.kaldi_io.read_vec_int_ark(file) }
In https://github.com/pytorch/audio/pull/105, the computations have been moved into functional.py. The reasoning behind this is that tracking state is a separate problem by itself and should be separate from computing a function. It also allows us to annotate the functional as weak scriptable, which in turn allows us to utilize the JIT and create efficient code. The functional itself might then also be used by other functionals, which is much easier and more efficient than having another Module create an instance of the class. This also makes it easier to implement performance improvements and create a generic API. If someone implements a function that adheres to the contract of your functional, it can be an immediate drop-in. This is important if we want to support different backends (e.g. move a functional entirely into C++).
>>> torchaudio.transforms.Spectrogram(n_fft=...)(waveform)
>>> torchaudio.functional.spectrogram(waveform, …)
Tensors can be read and written to various file formats (e.g. “mp3”, “wav”, etc.) through torchaudio.
sound, sample_rate = torchaudio.load('input.wav')
torchaudio.save('output.wav', sound, sample_rate)
Transforms
class Compose(object):
def __init__(self, transforms):
def __call__(self, audio):
class Scale(object):
def __init__(self, factor=2**31):
def __call__(self, tensor):
class PadTrim(object):
def __init__(self, max_len, fill_value=0, channels_first=True):
def __call__(self, tensor):
class DownmixMono(object):
def __init__(self, channels_first=None):
def __call__(self, tensor):
class LC2CL(object):
def __call__(self, tensor):
def SPECTROGRAM(*args, **kwargs):
class Spectrogram(object):
def __init__(self, n_fft=400, ws=None, hop=None,
pad=0, window=torch.hann_window,
power=2, normalize=False, wkwargs=None):
def __call__(self, sig):
def F2M(*args, **kwargs):
class MelScale(object):
def __init__(self, n_mels=128, sr=16000, f_max=None, f_min=0., n_stft=None):
def __call__(self, spec_f):
class SpectrogramToDB(object):
def __init__(self, stype="power", top_db=None):
def __call__(self, spec):
class MFCC(object):
def __init__(self, sr=16000, n_mfcc=40, dct_type=2, norm='ortho', log_mels=False,
melkwargs=None):
def __call__(self, sig):
class MelSpectrogram(object):
def __init__(self, sr=16000, n_fft=400, ws=None, hop=None, f_min=0., f_max=None,
pad=0, n_mels=128, window=torch.hann_window, wkwargs=None):
def __call__(self, sig):
def MEL(*args, **kwargs):
class BLC2CBL(object):
def __call__(self, tensor):
class MuLawEncoding(object):
def __init__(self, quantization_channels=256):
def __call__(self, x):
class MuLawExpanding(object):
def __init__(self, quantization_channels=256):
def __call__(self, x_mu):
Functional
def scale(tensor, factor):
# type: (Tensor, int) -> Tensor
def pad_trim(tensor, ch_dim, max_len, len_dim, fill_value):
# type: (Tensor, int, int, int, float) -> Tensor
def downmix_mono(tensor, ch_dim):
# type: (Tensor, int) -> Tensor
def LC2CL(tensor):
# type: (Tensor) -> Tensor
def spectrogram(sig, pad, window, n_fft, hop, ws, power, normalize):
# type: (Tensor, int, Tensor, int, int, int, int, bool) -> Tensor
def create_fb_matrix(n_stft, f_min, f_max, n_mels):
# type: (int, float, float, int) -> Tensor
def mel_scale(spec_f, f_min, f_max, n_mels, fb=None):
# type: (Tensor, float, float, int, Optional[Tensor]) -> Tuple[Tensor, Tensor]
def spectrogram_to_DB(spec, multiplier, amin, db_multiplier, top_db=None):
# type: (Tensor, float, float, float, Optional[float]) -> Tensor
def create_dct(n_mfcc, n_mels, norm):
# type: (int, int, string) -> Tensor
def MFCC(sig, mel_spect, log_mels, s2db, dct_mat):
# type: (Tensor, MelSpectrogram, bool, SpectrogramToDB, Tensor) -> Tensor
def BLC2CBL(tensor):
# type: (Tensor) -> Tensor
def mu_law_encoding(x, qc):
# type: (Tensor, int) -> Tensor
def mu_law_expanding(x_mu, qc):
# type: (Tensor, int) -> Tensor
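For reference, the mu-law companding pair listed above can be sketched in plain PyTorch; this is a simplified reimplementation for illustration, not the library code:

```python
import torch

def mu_law_encode(x, qc=256):
    # compress [-1, 1] audio with mu-law, then quantize to qc discrete levels
    mu = qc - 1.0
    x_mu = torch.sign(x) * torch.log1p(mu * torch.abs(x)) / torch.log1p(torch.tensor(mu))
    return ((x_mu + 1) / 2 * mu + 0.5).to(torch.int64)

def mu_law_decode(x_mu, qc=256):
    # invert the quantization, then expand the companded signal
    mu = qc - 1.0
    x = x_mu.to(torch.float32) / mu * 2 - 1
    return torch.sign(x) * torch.expm1(torch.abs(x) * torch.log1p(torch.tensor(mu))) / mu

x = torch.linspace(-1, 1, 9)
decoded = mu_law_decode(mu_law_encode(x))
```

The round trip is lossy only up to the quantization step, which mu-law makes smaller near zero where audio energy concentrates.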
All datasets are subclasses of torch.utils.data.Dataset, i.e., they have __getitem__ and __len__ methods implemented. Hence, they can all be passed to a torch.utils.data.DataLoader, which can load multiple samples in parallel using torch.multiprocessing workers. For example:
yesno_data = torchaudio.datasets.YESNO('.', download=True)
data_loader = torch.utils.data.DataLoader(yesno_data,
batch_size=1,
shuffle=True,
num_workers=args.nThreads)
The two datasets available are VCTK and YESNO. They download the datasets and preprocess them so that the loaded data is in convenient format.
SoxEffects and SoxEffectsChain in torchaudio.sox_effects expose sox operations through a Python interface. Various useful effects like downmixing a multichannel signal or resampling a signal can be done here.
torchaudio.initialize_sox()
E = torchaudio.sox_effects.SoxEffectsChain()
E.append_effect_to_chain("rate", [16000]) # resample to 16000 Hz
E.append_effect_to_chain("channels", ["1"]) # mono signal
E.set_input_file(fn)
waveform, sample_rate = E.sox_build_flow_effects()
torchaudio.shutdown_sox()