Data manipulation and transformation for audio signal processing, powered by PyTorch
BSD-2-Clause License
Published by mthrok over 3 years ago
The torchaudio 0.9.0 release includes:
This release includes model architectures from the wav2vec 2.0 paper, with utility functions that allow importing pretrained model parameters published on fairseq and the Hugging Face Hub. Now you can easily run speech recognition with torchaudio. These model architectures also support TorchScript, and you can deploy them with ONNX or in non-Python environments, such as C++, Android and iOS. Please check out our C++, Android and iOS examples. The following snippets illustrate how to create a deployable model.
# Import fine-tuned model from Hugging Face Hub
import torch
from transformers import Wav2Vec2ForCTC
from torchaudio.models.wav2vec2.utils import import_huggingface_model

original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
imported = import_huggingface_model(original)

# Import fine-tuned model from fairseq
import fairseq
from torchaudio.models.wav2vec2.utils import import_fairseq_model

original, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task(
    ["wav2vec_small_960h.pt"], arg_overrides={'data': "<data_dir>"})
imported = import_fairseq_model(original[0].w2v_encoder)

# Build uninitialized model and load state dict
from torchaudio.models import wav2vec2_base

model = wav2vec2_base(num_out=32)
model.load_state_dict(imported.state_dict())

# Quantize / script / optimize for mobile
from torch.utils.mobile_optimizer import optimize_for_mobile

quantized_model = torch.quantization.quantize_dynamic(
    model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
scripted_model = torch.jit.script(quantized_model)
optimized_model = optimize_for_mobile(scripted_model)
optimized_model.save("model_for_deployment.pt")
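The quantize-and-script steps above can be tried on a small stand-in module without any pretrained weights; the TinyModel class below is a hypothetical placeholder for illustration, not part of torchaudio:

```python
import torch

# Hypothetical stand-in for the acoustic model, so the
# quantize -> script pipeline can be exercised end to end.
class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 4)

    def forward(self, x):
        return self.linear(x)

model = TinyModel().eval()
# Dynamic quantization converts Linear weights to int8
quantized = torch.quantization.quantize_dynamic(
    model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
# TorchScript makes the module deployable outside Python
scripted = torch.jit.script(quantized)
out = scripted(torch.randn(2, 16))
```

The same pattern applies to the real wav2vec2 model; only the module being quantized changes.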
The internal implementation of lfilter has been updated to support autograd on both CPU and CUDA. Additionally, the performance on CPU is significantly improved. These improvements also apply to biquad variants.

The following table illustrates the performance improvements compared against the previous releases. lfilter was applied on float32 tensors with one channel and different numbers of frames.
Unit: msec
torchaudio has functions that handle complex-valued tensors. In the early days, when PyTorch did not have a complex dtype, torchaudio adopted the convention of using an extra dimension to represent the real and imaginary parts. In PyTorch 1.6, new dtypes such as torch.cfloat and torch.cdouble were introduced to represent complex values natively. (In the following, we refer to torchaudio's original convention as pseudo complex types, and PyTorch's native dtypes as native complex types.)

As the native complex types have become mature and stable, torchaudio has started to migrate complex functions to the native complex type. In this release, the internal implementation was updated to use native complex types, and interfaces were updated to allow passing/receiving native complex types directly. Users can keep using the pseudo complex type or opt in to the native complex type; however, please note that use of the pseudo complex type is now deprecated. These functions are tested to support TorchScript and autograd. For the details of this migration plan, please refer to #1337.
Additionally, switching the internal implementation to the native complex types improved the performance. Since the internal implementation uses native complex type regardless of which complex type is passed/returned, users will automatically benefit from this performance improvement.
The following table illustrates the performance improvements from the previous release by comparing the time it takes for complex transforms to perform the operation on a float32 tensor with two channels and 256 frames.
Unit: msec
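The relationship between the two conventions can be illustrated with plain PyTorch: view_as_complex turns a pseudo complex tensor (trailing dimension of size 2) into a native complex tensor, and view_as_real goes back.

```python
import torch

# pseudo complex: real and imaginary parts in an extra trailing dimension
pseudo = torch.randn(2, 201, 100, 2)

# native complex: a single tensor with dtype torch.cfloat
native = torch.view_as_complex(pseudo)

roundtrip = torch.view_as_real(native)
```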
Along with the work on Complex Tensor Migration and Filtering Improvement mentioned above, more tests were added to ensure autograd support. The following operations are now guaranteed to support autograd up to second order.

Functional:
lfilter
allpass_biquad
biquad
band_biquad
bandpass_biquad
bandreject_biquad
bass_biquad
equalizer_biquad
treble_biquad
highpass_biquad
lowpass_biquad
amplitude_to_DB
spectrogram
griffinlim
resample
phase_vocoder *
mask_along_axis_iid
mask_along_axis
gain
spectral_centroid

Transforms:
AmplitudeToDB
ComputeDeltas
Fade
GriffinLim
TimeMasking
FrequencyMasking
MFCC
MelScale
MelSpectrogram
Resample
SpectralCentroid
Spectrogram
SlidingWindowCmn
TimeStretch *
Vol

NOTE: torchaudio.transforms.TimeStretch and torchaudio.functional.phase_vocoder call atan2, which is not differentiable around zero. Therefore these functions are differentiable only when the input spectrogram does not contain values around zero.

In release 0.8, the resampling operation was vectorized and its performance improved. In this release, the implementation of the resampling algorithm has been further revised.
- A rolloff parameter has been added for anti-aliasing control.
- torchaudio.transforms.Resample precomputes the kernel using float64 precision and caches it for even faster operation.
- torchaudio.functional.resample has been added, and the original entry point, torchaudio.compliance.kaldi.resample_waveform, is deprecated.

The following table illustrates the performance improvements from the previous release by comparing the time it takes for torchaudio.transforms.Resample to complete the operation on a float32 tensor with two channels and one-second duration.
Unit: msec
torchaudio implements some operations in C++ for reasons such as performance and integration with third-party libraries. This C++ module was only available on Linux and macOS. In this release, Windows packages also come with C++ module.
The C++ module in the Windows package includes the efficient filtering implementation mentioned above; however, the “sox_io” backend and torchaudio.functional.compute_kaldi_pitch are not included.
Since the 0.6 release, we have continuously improved I/O functionality. Specifically, in 0.8 the default backend was changed from “sox” to “sox_io”, and a similar API change was applied to the “soundfile” backend. The 0.9 release concludes this migration by removing the deprecated backends. For details, please refer to #903.
- The normalized argument was removed from torchaudio.functional.griffinlim (#1369)
- The torchaudio.functional.sliding_window_cmn arg was renamed for correctness (#1347). If you were using waveform=..., please change it to specgram=...
- Changed torchaudio.transforms.Resample to precompute and cache the resampling kernel (#1499, #1514). To use the transform on a CUDA tensor, move the instantiated transform to the CUDA device first:

resampler = torchaudio.transforms.Resample(orig_freq=8000, new_freq=44100)
resampler.to(torch.device("cuda"))

- torchaudio no longer supports programmatic download of the Common Voice dataset. Please remove the arguments from your code.
- torchaudio is adopting the native complex type; the pseudo complex type and the related utility functions are now deprecated. Please refer to #1337 for the migration process.
- Deprecated torchaudio.compliance.kaldi.resample_waveform (#1533). Please use torchaudio.functional.resample.
- torchaudio.transforms.MelScale now expects a valid n_stft value (#1515). Please provide a valid n_stft.
- torchaudio.functional.lfilter (#1319)
- torchaudio.functional.lfilter (#1310, #1441)
- torchaudio.functional.resample (#1402)
- rolloff parameter (#1488)
- torchaudio.transforms.Resample (#1499, #1514, #1556)
- torchaudio.functional.phase_vocoder and torchaudio.transforms.TimeStretch (#1410)
- return_complex to torchaudio.functional.spectrogram and torchaudio.transforms.Spectrogram (#1366, #1551)
- __str__ override to AudioMetaData for easy print (#1339)
- sox/utils.cpp (#1306)
- check_length from validate_input_file (#1312)
- torchaudio.functional.griffinlim (#1368)
- torchaudio.transforms.MelScale when n_stft is invalid (#1505)
- __all__ (#1458)
- reference_cast in make_boxed_from_unboxed_functor (#1300)
- torchaudio.transforms.GriffinLim (#1433)
- librosa's Mel scale conversion with torchaudio's in WaveRNN example (#1444)
- config.guess to support source build in recent architectures (#1484)
- torchaudio.functional.lfilter and biquad variants (#1400, #1438)
- torchaudio.transforms.FrequencyMasking (#1498)
- torchaudio.transforms.SlidingWindowCmn (#1482)
- torchaudio.transforms.MelScale (#1467)
- torchaudio.transforms.Vol (#1460)
- torchaudio.transforms.TimeStretch (#1420)
- torchaudio.transforms.AmplitudeToDB (#1447)
- torchaudio.transforms.GriffinLim (#1421)
- torchaudio.transforms.SpectralCentroid (#1425)
- torchaudio.transforms.ComputeDeltas (#1422)
- torchaudio.transforms.Fade (#1424)
- torchaudio.transforms.Resample (#1416)
- torchaudio.transforms.MFCC (#1415)
- torchaudio.transforms.Spectrogram / MelSpectrogram (#1340)
- torchaudio.functional.lfilter shape (#1360)
- torchaudio.functional.resample (#1516)
- torchaudio.functional.phase_vocoder (#1379)
- floor_divide with div (#1455)
- torch.assert_allclose with assertEqual (#1387)
- torchaudio.functional.lfilter autograd tests input size (#1443)
- torchaudio.transforms.InverseMelScale comparison test (#1437)
- torchaudio.transforms.TimeMasking and torchaudio.transforms.FrequencyMasking to perform out-of-place masking (#1481)
- power of torchaudio.transforms.MelSpectrogram as float only (#1572)
- torch.nn.functional.conv1d in torchaudio.functional.lfilter (#1318)
- torchaudio.functional.overdrive (#1299)
- sox_effects.apply_effects_tensor is CPU-only (#1459)
- sliding_window_cmn (#1383)

Published by vincentqb over 3 years ago
This release depends on PyTorch 1.8.1.
Published by vincentqb over 3 years ago
This release supports Python 3.9.
Continuing from the previous release, torchaudio improves the audio I/O mechanism. In this release, we have four major updates.
Backend migration.
We have migrated the default backend for audio I/O. The new default backend is “sox_io” (for Linux/macOS). The interface for the “soundfile” backend has also been changed to align with that of “sox_io”. Following the change of default backends, the legacy backend/interface has been marked as deprecated. The legacy backend/interface is still accessible, though it is strongly discouraged to use it. For details on the migration, please refer to #903.
File-like object support.
We have added file-like object support to I/O functions and sox_effects. You can perform the info, load, save, and apply_effects_file operations on file-like objects.
import io
import tarfile

import boto3
import requests
import torchaudio

# Query audio metadata over HTTP
# Will only fetch the first few kB
with requests.get(URL, stream=True) as response:
    metadata = torchaudio.info(response.raw)

# Load audio from TAR file
# No need to extract the TAR file.
with tarfile.open(TAR_PATH, mode='r') as tarfile_:
    fileobj = tarfile_.extractfile(SAMPLE_TAR_ITEM)
    waveform, sample_rate = torchaudio.load(fileobj)

# Save to a bytes buffer
# Using BytesIO, you can perform in-memory encoding/decoding.
buffer_ = io.BytesIO()
torchaudio.save(buffer_, waveform, sample_rate, format="wav")

# Apply effects (lowpass filter / resampling) while loading audio from S3
client = boto3.client('s3')
response = client.get_object(Bucket=S3_BUCKET, Key=S3_KEY)
waveform, sample_rate = torchaudio.sox_effects.apply_effects_file(
    response['Body'], [["lowpass", "-1", "300"], ["rate", "8000"]])
[Beta] Codec Application.
Built upon the file-like object support, we added the functional.apply_codec function, which can degrade audio data by applying audio codecs supported by the “sox_io” backend, in an in-memory fashion.
import torchaudio.functional as F

# Apply MP3 codec
degraded = F.apply_codec(
    waveform, sample_rate, format="mp3", compression=-9)
# Apply GSM codec
degraded = F.apply_codec(waveform, sample_rate, format="gsm")
Encoding options.
We have added encoding options to the save function of the new backends. Now you can change the format and encoding with the format, encoding, and bits_per_sample options.
# Save without any encoding option.
# The function will pick the encoding that fits the provided data.
# For a Tensor of float32 type, that is 32-bit floating-point PCM.
torchaudio.save("data.wav", waveform, sample_rate)
# Save as 16-bit signed integer Linear PCM
# The resulting file occupies half the storage but loses precision
torchaudio.save(
"data.wav", waveform, sample_rate, encoding="PCM_S", bits_per_sample=16)
More format support for "sox_io"'s save function.
We have added support for the GSM, HTK, AMB, and AMR-NB formats to "sox_io"'s save function.
torchaudio was using CMake to build third-party dependencies. Now torchaudio uses CMake to build its C++ extension as well. This will open the door to integrating torchaudio in non-Python environments (such as C++ applications and mobile). We will work on adding example applications and mobile integrations in upcoming releases.
Published by vincentqb almost 4 years ago
This release introduces support for Python 3.9. There is no 0.7.1 release; the following changes are compared to 0.7.0.
- download=True in CommonVoice (#1076)

Published by vincentqb almost 4 years ago
torchaudio is expanding its support for models and end-to-end applications. Please file an issue on GitHub to provide feedback on them.
As you are likely already aware from the last release, we’re currently in the process of making sox_io, which ships with new features such as TorchScript support and performance improvements, the new default backend. If you want to benefit from these features now, we encourage you to migrate. For more information, see issue #903.
- str.format to adopt changes in PyTorch, leading to improved error messages for TorchScript (#850)
- sox_utils.list_formats() for read and write (#811)
- VCTK_092 dataset (#812)
- sox_io backend (#871)
- soundfile backend to the one identical to sox_io backend (#922)
- soundfile compatibility backend (#922)
- torchaudio.compliance.kaldi.fbank (#947)
- pathlib.Path support to sox_io backend (#907)
- sox_io C++ implementation (#779)
- sox_io and sox_effects (#806)
- noise_shaping = True (#865)
- zip_safe = False to disable egg installation (#842)
- istft wrapper in favor of torch.istft (#841)
- SoxEffect and SoxEffectsChain (#787)
- sox backend (#904)
- soundfile (#922)
- load_wav functions (#905)

Published by vincentqb about 4 years ago
torchaudio now includes a new model module (with wav2letter included), new functionals (contrast, cvm, dcshift, overdrive, vad, phaser, flanger, biquad), datasets (GTZAN, CMU), and a new optional sox backend with support for TorchScript. torchaudio now also supports Windows, with the soundfile backend.
torchaudio requires Python 3.6 or later.
Published by vincentqb over 4 years ago
torchaudio includes new transforms (e.g. Griffin-Lim and inverse Mel scale), new filters (e.g. all pass, fade, band pass/reject, band, treble, deemph, riaa), and datasets (LJ Speech and SpeechCommands).
Published by vincentqb almost 5 years ago
torchaudio 0.4 improves on current transformations, datasets, and backend support.
We would like to thank again our contributors and the wider community for their significant contributions to this release. In particular we'd like to thank @keunwoochoi, @ksanjeevan, and all the other maintainers and contributors of torchaudio-contrib for their significant and valuable additions around augmentations (#285) and batching (#327).
- downsample, transform, target_transform, and return_dict are being deprecated.
- torchaudio.functional.detect_pitch_frequency. (#313, #322)
- torchaudio.transforms: TimeStretch, FrequencyMasking, TimeMasking. (#285, #333, #348)
- torchaudio.transforms.ComplexNorm. (#285, #333)
- torchaudio.functional.compute_deltas. (#268, #326)
- torchaudio.functional.gain and torchaudio.functional.dither (#319, #360). We welcome work to continue the effort to implement features available in SoX, see #260.
- equalizer_biquad (#315, #340), lowpass_biquad, highpass_biquad (#275), lfilter, and biquad (#275, #291, #326) in torchaudio.functional.
- torchaudio.functional.mfcc. (#228)
- MelScale and librosa. (#294)
- torchaudio.compliance.kaldi.resample_waveform where internal variables were not moved to the GPU when used. (#277)
- istft where the dtype and device of parameters were not created on the same device as the tensor provided by the user. (#264)
- load_state_dict). (#246)
- torchaudio.load to [-1, 1]. (#283)

Published by vincentqb almost 5 years ago
This release is to update the dependency to PyTorch 1.3.1.
Published by vincentqb almost 5 years ago
This release is to update the dependency to PyTorch 1.3.0.
Published by jamarshon about 5 years ago
torchaudio has been redesigned to be an extension of PyTorch and part of the domain APIs (DAPI) ecosystem. Domain specific libraries such as this one are kept separated in order to maintain a coherent environment for each of them. As such, torchaudio is an ML library that provides relevant signal processing functionality, but it is not a general signal processing library. The full rationale of this new standardization can be found in the README.md.
In light of these changes some transforms have been removed or have different argument names and conventions. See the section on backwards breaking changes for a migration guide.
We provide binaries via pip and conda. They require PyTorch 1.2.0 and newer. See https://pytorch.org/ for installation instructions.
We would like to thank our contributors and the wider community for their significant contributions to this release. We are happy to see an active community around torchaudio and are eager to further grow and support it.
In particular we'd like to thank @keunwoochoi, @ksanjeevan, and all the other maintainers and contributors of torchaudio-contrib for their significant and valuable additions around standardization and the support of complex numbers (https://github.com/pytorch/audio/pull/131, https://github.com/pytorch/audio/issues/110, https://github.com/keunwoochoi/torchaudio-contrib/issues/61, https://github.com/keunwoochoi/torchaudio-contrib/issues/36).
An implementation of basic transforms with a Kaldi-like interface.
We added the functions spectrogram, fbank, and resample_waveform (https://github.com/pytorch/audio/pull/119, https://github.com/pytorch/audio/pull/127, and https://github.com/pytorch/audio/pull/134). For more details see the documentation on torchaudio.compliance.kaldi which mirrors the arguments and outputs of Kaldi features.
As an example we can look at the sinc interpolation resampling similar to Kaldi’s implementation. In the figure below, the blue dots are the original signal and red dots are the downsampled signal with half the original frequency. The red dot elements are approximately every other original element.
specgram = torchaudio.compliance.kaldi.spectrogram(waveform, frame_length=...)
fbank = torchaudio.compliance.kaldi.fbank(waveform, num_mel_bins=...)
resampled_waveform = torchaudio.compliance.kaldi.resample_waveform(waveform, orig_freq=...)
Constructing a signal from a spectrogram can be used in applications like source separation or to generate audio signals to listen to. More specifically, torchaudio.functional.istft is the inverse of torch.stft. It has the same parameters (plus an additional optional length parameter) and returns the least squares estimation of the original signal.
torch.manual_seed(0)
n_fft = 5
waveform = torch.rand(2, 5)
stft = torch.stft(waveform, n_fft=n_fft)
approx_waveform = torchaudio.functional.istft(stft, n_fft=n_fft, length=waveform.size(1))
>>> waveform
tensor([[0.4963, 0.7682, 0.0885, 0.1320, 0.3074],
[0.6341, 0.4901, 0.8964, 0.4556, 0.6323]])
>>> approx_waveform
tensor([[0.4963, 0.7682, 0.0885, 0.1320, 0.3074],
[0.6341, 0.4901, 0.8964, 0.4556, 0.6323]])
- Compose: SPECTROGRAM, F2M, and MEL have been removed. Please use Spectrogram, MelScale, and MelSpectrogram.
- LC2CL and BLC2CBL: While the LC layout might be common in signal processing, support for it is out of scope of this library, and transforms such as LC2CL only aid their proliferation. Please use transpose if you need this behavior.
- Scale, PadTrim, DownmixMono: Please use division in place of Scale, torch.nn.functional.pad/trim in place of PadTrim, and torch.mean on the channel dimension in place of DownmixMono.
- torchaudio.legacy has been removed. Please use torchaudio.load and torchaudio.save.
- Spectrogram used to be of dimension (channel, time, freq) and is now (channel, freq, time). Similarly for MelScale, MelSpectrogram, and MFCC, time is the last dimension. Please see our README for an explanation of the rationale behind these changes. Please use transpose to get the previous behavior.
- MuLawExpanding was renamed to MuLawDecoding as the inverse of MuLawEncoding (https://github.com/pytorch/audio/pull/159)
- SpectrogramToDB was renamed to AmplitudeToDB (https://github.com/pytorch/audio/pull/170). The input does not necessarily have to be a spectrogram, so the transform can be used in many more cases, as the new name reflects.
- Spectrogram, AmplitudeToDB, MelScale, MelSpectrogram, MFCC, MuLawEncoding, and MuLawDecoding. (https://github.com/pytorch/audio/pull/118)
- Spectrogram, AmplitudeToDB, MelScale, MelSpectrogram, MFCC, MuLawEncoding, and MuLawDecoding (https://github.com/pytorch/audio/pull/118)
- test_transforms.py where double tensors were compared with floats (https://github.com/pytorch/audio/pull/132)
- vctk.read_audio (issue https://github.com/pytorch/audio/issues/143), as there were issues with downsampling using SoxEffectsChain (https://github.com/pytorch/audio/pull/145)
- sox_close (https://github.com/pytorch/audio/pull/174)

Published by jamarshon about 5 years ago
The goal of this release is to fix the current API, as there will be future changes that break backward compatibility in order to improve the library as more thought is given to design, capabilities, and usability.
While this release is compatible with all currently known PyTorch versions (<=1.2.0), the available binaries will only require PyTorch 1.1.0. Installation commands:
# Wheels for Python 2 are NOT supported
# Python 3.5
$ pip3 install http://download.pytorch.org/whl/torchaudio-0.2-cp35-cp35m-linux_x86_64.whl
# Python 3.6
$ pip3 install http://download.pytorch.org/whl/torchaudio-0.2-cp36-cp36m-linux_x86_64.whl
# Python 3.7
$ pip3 install http://download.pytorch.org/whl/torchaudio-0.2-cp37-cp37m-linux_x86_64.whl
Continuous integration (Travis CI) has been set up in https://github.com/pytorch/audio/pull/117. This means all the tests have been fixed, and their status can be checked at https://travis-ci.org/pytorch/audio. The test files have to be run separately via build_tools/travis/test_script.sh because closing sox after a test file is completed prevents it from being reopened. The testing framework is pytest.
# Run the whole test suite
$ build_tools/travis/test_script.sh
# Run an individual test
$ python -m pytest test/test_transforms.py
Kaldi IO has been added as an optional dependency in https://github.com/pytorch/audio/pull/111. torchaudio provides a simple wrapper around it by converting the np.ndarray into a torch.Tensor. Functions include: read_vec_int_ark, read_vec_flt_scp, read_vec_flt_ark, read_mat_scp, and read_mat_ark.
>>> # read ark to a 'dictionary'
>>> d = { u:d for u,d in torchaudio.kaldi_io.read_vec_int_ark(file) }
In https://github.com/pytorch/audio/pull/105, the computations have been moved into functional.py. The reasoning behind this is that tracking state is a separate problem by itself and should be separate from computing a function. It also allows us to annotate the functional as weak scriptable, which in turn allows us to utilize the JIT and create efficient code. The functional itself might then also be used by other functionals, which is much easier and more efficient than having another Module create an instance of the class. This also makes it easier to implement performance improvements and create a generic API. If someone implements a function that adheres to the contract of your functional, it can be an immediate drop-in. This is important if we want to support different backends (e.g. move a functional entirely into C++).
>>> torchaudio.transforms.Spectrogram(n_fft=...)(waveform)
>>> torchaudio.functional.spectrogram(waveform, …)
Tensors can be read and written to various file formats (e.g. “mp3”, “wav”, etc.) through torchaudio.
sound, sample_rate = torchaudio.load('input.wav')
torchaudio.save('output.wav', sound, sample_rate)
Transforms
class Compose(object):
def __init__(self, transforms):
def __call__(self, audio):
class Scale(object):
def __init__(self, factor=2**31):
def __call__(self, tensor):
class PadTrim(object):
def __init__(self, max_len, fill_value=0, channels_first=True):
def __call__(self, tensor):
class DownmixMono(object):
def __init__(self, channels_first=None):
def __call__(self, tensor):
class LC2CL(object):
def __call__(self, tensor):
def SPECTROGRAM(*args, **kwargs):
class Spectrogram(object):
def __init__(self, n_fft=400, ws=None, hop=None,
pad=0, window=torch.hann_window,
power=2, normalize=False, wkwargs=None):
def __call__(self, sig):
def F2M(*args, **kwargs):
class MelScale(object):
def __init__(self, n_mels=128, sr=16000, f_max=None, f_min=0., n_stft=None):
def __call__(self, spec_f):
class SpectrogramToDB(object):
def __init__(self, stype="power", top_db=None):
def __call__(self, spec):
class MFCC(object):
def __init__(self, sr=16000, n_mfcc=40, dct_type=2, norm='ortho', log_mels=False,
melkwargs=None):
def __call__(self, sig):
class MelSpectrogram(object):
def __init__(self, sr=16000, n_fft=400, ws=None, hop=None, f_min=0., f_max=None,
pad=0, n_mels=128, window=torch.hann_window, wkwargs=None):
def __call__(self, sig):
def MEL(*args, **kwargs):
class BLC2CBL(object):
def __call__(self, tensor):
class MuLawEncoding(object):
def __init__(self, quantization_channels=256):
def __call__(self, x):
class MuLawExpanding(object):
def __init__(self, quantization_channels=256):
def __call__(self, x_mu):
Functional
def scale(tensor, factor):
# type: (Tensor, int) -> Tensor
def pad_trim(tensor, ch_dim, max_len, len_dim, fill_value):
# type: (Tensor, int, int, int, float) -> Tensor
def downmix_mono(tensor, ch_dim):
# type: (Tensor, int) -> Tensor
def LC2CL(tensor):
# type: (Tensor) -> Tensor
def spectrogram(sig, pad, window, n_fft, hop, ws, power, normalize):
# type: (Tensor, int, Tensor, int, int, int, int, bool) -> Tensor
def create_fb_matrix(n_stft, f_min, f_max, n_mels):
# type: (int, float, float, int) -> Tensor
def mel_scale(spec_f, f_min, f_max, n_mels, fb=None):
# type: (Tensor, float, float, int, Optional[Tensor]) -> Tuple[Tensor, Tensor]
def spectrogram_to_DB(spec, multiplier, amin, db_multiplier, top_db=None):
# type: (Tensor, float, float, float, Optional[float]) -> Tensor
def create_dct(n_mfcc, n_mels, norm):
# type: (int, int, string) -> Tensor
def MFCC(sig, mel_spect, log_mels, s2db, dct_mat):
# type: (Tensor, MelSpectrogram, bool, SpectrogramToDB, Tensor) -> Tensor
def BLC2CBL(tensor):
# type: (Tensor) -> Tensor
def mu_law_encoding(x, qc):
# type: (Tensor, int) -> Tensor
def mu_law_expanding(x_mu, qc):
# type: (Tensor, int) -> Tensor
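For reference, the mu-law companding pair listed above can be sketched in plain PyTorch; this is a simplified reimplementation for illustration, not the library code:

```python
import torch

def mu_law_encode(x, qc=256):
    # compress [-1, 1] audio with mu-law, then quantize to qc discrete levels
    mu = qc - 1.0
    x_mu = torch.sign(x) * torch.log1p(mu * torch.abs(x)) / torch.log1p(torch.tensor(mu))
    return ((x_mu + 1) / 2 * mu + 0.5).to(torch.int64)

def mu_law_decode(x_mu, qc=256):
    # invert the quantization, then expand the companded signal
    mu = qc - 1.0
    x = x_mu.to(torch.float32) / mu * 2 - 1
    return torch.sign(x) * torch.expm1(torch.abs(x) * torch.log1p(torch.tensor(mu))) / mu

x = torch.linspace(-1, 1, 9)
decoded = mu_law_decode(mu_law_encode(x))
```

The round trip is lossy only up to the quantization step, which mu-law makes smaller near zero where audio energy concentrates.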
All datasets are subclasses of torch.utils.data.Dataset, i.e., they have __getitem__ and __len__ methods implemented. Hence, they can all be passed to a torch.utils.data.DataLoader, which can load multiple samples in parallel using torch.multiprocessing workers. For example:
yesno_data = torchaudio.datasets.YESNO('.', download=True)
data_loader = torch.utils.data.DataLoader(yesno_data,
batch_size=1,
shuffle=True,
num_workers=args.nThreads)
The two datasets available are VCTK and YESNO. They download the datasets and preprocess them so that the loaded data is in convenient format.
SoxEffects and SoxEffectsChain in torchaudio.sox_effects expose sox operations through a Python interface. Various useful effects like downmixing a multichannel signal or resampling a signal can be done here.
torchaudio.initialize_sox()
E = torchaudio.sox_effects.SoxEffectsChain()
E.append_effect_to_chain("rate", [16000]) # resample to 16000 Hz
E.append_effect_to_chain("channels", ["1"]) # mono signal
E.set_input_file(fn)
waveform, sample_rate = E.sox_build_flow_effects()
torchaudio.shutdown_sox()