From c4adefe54ec1a2246194cae20c0ff4e6e07ebb04 Mon Sep 17 00:00:00 2001
From: Vincent Quenneville-Belair
Date: Fri, 26 Jul 2019 07:55:17 -0700
Subject: [PATCH 01/15] adding manifesto to readme.

---
 README.md | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 51e5bbafa4..0bf4ae101b 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@
 torchaudio: an audio library for PyTorch
-================================================
+========================================

 [![Build Status](https://travis-ci.org/pytorch/audio.svg?branch=master)](https://travis-ci.org/pytorch/audio)

@@ -54,6 +54,30 @@ torchaudio.save('foo_save.mp3', sound, sample_rate) # saves tensor to file
 ```

 API Reference
-------------
+-------------

 API Reference is located here: http://pytorch.org/audio/
+
+Conventions
+-----------
+
+Torchaudio is standardized around the following conventions. The following variables are used with their corresponding definitions.
+
+* waveform: a tensor of audio samples with shape (channels, time)
+* sample_rate: the rate of audio samples (samples per second)
+* specgram: a tensor of spectrogram with shape (channels, frequency, time)
+* mel_specgram: a mel spectrogram with shape (channels, frequency, time)
+* hop_length: the number of samples between the starts of consecutive frames
+* n_freqs: the number of bins in a linear spectrogram
+* min_freq: the lowest frequency of the lowest band in a spectrogram
+* max_freq: the highest frequency of the highest band in a spectrogram
+* n_fft: the number of fourier bins
+* n_mfcc, n_mels: to be consistent with other similarly named variables, with shape (channel, n_mfcc, time) and (channel, n_mels, times)
+* win_length: the length of the STFT window
+* window_fn: for functions that creates windows e.g. torch.hann_window
+
+A spectrogram can be converted to DB scale or Mel scale, using AmplitudeToDB and AmplitudetoMel.
+
+The input (Spectrogram, MFCC, MelSpectrogram, Resample, etc.) of all transforms and functions assumes channel first. The output of STFT is (channel, frequency, time, 2).
+
+The Kaldi compliance interface follow Kaldi's interface.

From 117b43b4664a88fe8b8927ea77a652dd9d69070f Mon Sep 17 00:00:00 2001
From: Vincent Quenneville-Belair
Date: Fri, 26 Jul 2019 08:40:03 -0700
Subject: [PATCH 02/15] shape of transforms

---
 README.md | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 0bf4ae101b..58be8722fd 100644
--- a/README.md
+++ b/README.md
@@ -61,12 +61,12 @@ API Reference is located here: http://pytorch.org/audio/
 Conventions
 -----------

-Torchaudio is standardized around the following conventions. The following variables are used with their corresponding definitions.
+Torchaudio is standardized around the following naming conventions.

 * waveform: a tensor of audio samples with shape (channels, time)
 * sample_rate: the rate of audio samples (samples per second)
-* specgram: a tensor of spectrogram with shape (channels, frequency, time)
-* mel_specgram: a mel spectrogram with shape (channels, frequency, time)
+* specgram: a tensor of spectrogram with shape (channels, time)
+* mel_specgram: a mel spectrogram with shape (channels, time)
 * hop_length: the number of samples between the starts of consecutive frames
 * n_freqs: the number of bins in a linear spectrogram
 * min_freq: the lowest frequency of the lowest band in a spectrogram
@@ -76,8 +76,17 @@ Torchaudio is standardized around the following conventions. The following varia
 * win_length: the length of the STFT window
 * window_fn: for functions that creates windows e.g. torch.hann_window

-A spectrogram can be converted to DB scale or Mel scale, using AmplitudeToDB and AmplitudetoMel.
+Transforms expect the following shapes. In particular, the input of all transforms and functions assumes channel first.

-The input (Spectrogram, MFCC, MelSpectrogram, Resample, etc.) of all transforms and functions assumes channel first. The output of STFT is (channel, frequency, time, 2).
+* Spectrogram: (channel, time) -> (channel, frequency, time, 2)
+* MelScale: (channel, time) -> (channel, n_mels, time)
+* MFCC: (channel, time) -> (channel, n_mfcc, time)
+* MuLawEncode: (channel, time) -> (channel, n_mulaw, time)
+* MuLawDecode: (channel, n_mulaw, time) -> (channel, time)
+* Resample: (channel, time) -> (channel, time)
+* STFT: (channel, time) -> (channel, frequency, time, 2).
+* ISTFT: (channel, frequency, time) -> (channel, time, 2).
+
+A spectrogram can be converted to DB scale or Mel scale, using AmplitudeToDB and AmplitudeToMel.

 The Kaldi compliance interface follow Kaldi's interface.

From fa72cf4a310482a3a418f67535276475284b8962 Mon Sep 17 00:00:00 2001
From: Vincent Quenneville-Belair
Date: Fri, 26 Jul 2019 08:46:10 -0700
Subject: [PATCH 03/15] typo.

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 58be8722fd..1c2cfb9ae7 100644
--- a/README.md
+++ b/README.md
@@ -89,4 +89,4 @@ Transforms expect the following shapes. In particular, the input of all transfor

 A spectrogram can be converted to DB scale or Mel scale, using AmplitudeToDB and AmplitudeToMel.

-The Kaldi compliance interface follow Kaldi's interface.
+The Kaldi compliance interface follows Kaldi's interface.

From 416dfafd5832fb1f1b25252ca00ccca63cecc1dd Mon Sep 17 00:00:00 2001
From: Vincent Quenneville-Belair
Date: Fri, 26 Jul 2019 09:00:07 -0700
Subject: [PATCH 04/15] complex input too.

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 1c2cfb9ae7..44e95f9177 100644
--- a/README.md
+++ b/README.md
@@ -84,8 +84,8 @@ Transforms expect the following shapes. In particular, the input of all transfor
 * MuLawEncode: (channel, time) -> (channel, n_mulaw, time)
 * MuLawDecode: (channel, n_mulaw, time) -> (channel, time)
 * Resample: (channel, time) -> (channel, time)
-* STFT: (channel, time) -> (channel, frequency, time, 2).
-* ISTFT: (channel, frequency, time) -> (channel, time, 2).
+* STFT: (channel, time, 2) -> (channel, frequency, time, 2).
+* ISTFT: (channel, frequency, time, 2) -> (channel, time, 2).

 A spectrogram can be converted to DB scale or Mel scale, using AmplitudeToDB and AmplitudeToMel.

From d9de34688bb53e155f87355b71f754cfbfb270fe Mon Sep 17 00:00:00 2001
From: Vincent Quenneville-Belair
Date: Fri, 26 Jul 2019 09:29:05 -0700
Subject: [PATCH 05/15] listing kaldi's function.

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 44e95f9177..c26fb36bde 100644
--- a/README.md
+++ b/README.md
@@ -89,4 +89,4 @@ Transforms expect the following shapes. In particular, the input of all transfor

 A spectrogram can be converted to DB scale or Mel scale, using AmplitudeToDB and AmplitudeToMel.

-The Kaldi compliance interface follows Kaldi's interface.
+The Kaldi compliance interface follows Kaldi's interface, and provides access to: Kaldi's `fbank`, `spectrogram`, and `resample_waveform`.

From 81ce38409a609a8a97906d904ba86ab2572e1a3c Mon Sep 17 00:00:00 2001
From: Vincent Quenneville-Belair
Date: Fri, 26 Jul 2019 11:59:23 -0700
Subject: [PATCH 06/15] mulaw shape.

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index c26fb36bde..8b7f858cb4 100644
--- a/README.md
+++ b/README.md
@@ -81,8 +81,8 @@ Transforms expect the following shapes. In particular, the input of all transfor
 * Spectrogram: (channel, time) -> (channel, frequency, time, 2)
 * MelScale: (channel, time) -> (channel, n_mels, time)
 * MFCC: (channel, time) -> (channel, n_mfcc, time)
-* MuLawEncode: (channel, time) -> (channel, n_mulaw, time)
-* MuLawDecode: (channel, n_mulaw, time) -> (channel, time)
+* MuLawEncode: (channel, time) -> (channel, time)
+* MuLawDecode: (channel, time) -> (channel, time)
 * Resample: (channel, time) -> (channel, time)
 * STFT: (channel, time, 2) -> (channel, frequency, time, 2).
 * ISTFT: (channel, frequency, time, 2) -> (channel, time, 2).

From 9cecc6d8012822c39eef73bc6a0045b564637d73 Mon Sep 17 00:00:00 2001
From: Vincent Quenneville-Belair
Date: Fri, 26 Jul 2019 12:12:29 -0700
Subject: [PATCH 07/15] +AmplitudeToDB -Kaldi.

---
 README.md | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 8b7f858cb4..ec20b8d41f 100644
--- a/README.md
+++ b/README.md
@@ -79,6 +79,7 @@ Torchaudio is standardized around the following naming conventions.
 Transforms expect the following shapes. In particular, the input of all transforms and functions assumes channel first.

 * Spectrogram: (channel, time) -> (channel, frequency, time, 2)
+* AmplitudeToDB: (channel, frequency, time, 2) -> (channel, frequency, time, 2)
 * MelScale: (channel, time) -> (channel, n_mels, time)
 * MFCC: (channel, time) -> (channel, n_mfcc, time)
 * MuLawEncode: (channel, time) -> (channel, time)
@@ -86,7 +87,3 @@ Transforms expect the following shapes. In particular, the input of all transfor
 * Resample: (channel, time) -> (channel, time)
 * STFT: (channel, time, 2) -> (channel, frequency, time, 2).
 * ISTFT: (channel, frequency, time, 2) -> (channel, time, 2).
-
-A spectrogram can be converted to DB scale or Mel scale, using AmplitudeToDB and AmplitudeToMel.
-
-The Kaldi compliance interface follows Kaldi's interface, and provides access to: Kaldi's `fbank`, `spectrogram`, and `resample_waveform`.
From 0602d6c234ff28bdb7205d526dc7ea233dbb440f Mon Sep 17 00:00:00 2001
From: Vincent Quenneville-Belair
Date: Fri, 26 Jul 2019 12:29:01 -0700
Subject: [PATCH 08/15] Shape of spectrogram.

---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index ec20b8d41f..6ee665a792 100644
--- a/README.md
+++ b/README.md
@@ -63,10 +63,10 @@ Conventions

 Torchaudio is standardized around the following naming conventions.

-* waveform: a tensor of audio samples with shape (channels, time)
+* waveform: a tensor of audio samples with shape (channel, time)
 * sample_rate: the rate of audio samples (samples per second)
-* specgram: a tensor of spectrogram with shape (channels, time)
-* mel_specgram: a mel spectrogram with shape (channels, time)
+* specgram: a tensor of spectrogram with shape (channel, frequency, time)
+* mel_specgram: a mel spectrogram with shape (channel, frequency, time)
 * hop_length: the number of samples between the starts of consecutive frames
 * n_freqs: the number of bins in a linear spectrogram
 * min_freq: the lowest frequency of the lowest band in a spectrogram

From 5dbbad1350d6ab276bde4909bf2ebc59c91f16cd Mon Sep 17 00:00:00 2001
From: Vincent Quenneville-Belair
Date: Fri, 26 Jul 2019 12:48:50 -0700
Subject: [PATCH 09/15] Fourier, n_mels, n_freqs.

---
 README.md | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index 6ee665a792..4161992834 100644
--- a/README.md
+++ b/README.md
@@ -65,25 +65,26 @@ Torchaudio is standardized around the following naming conventions.

 * waveform: a tensor of audio samples with shape (channel, time)
 * sample_rate: the rate of audio samples (samples per second)
-* specgram: a tensor of spectrogram with shape (channel, frequency, time)
-* mel_specgram: a mel spectrogram with shape (channel, frequency, time)
+* specgram: a tensor of spectrogram with shape (channel, n_freqs, time)
+* mel_specgram: a mel spectrogram with shape (channel, n_mels, time)
 * hop_length: the number of samples between the starts of consecutive frames
 * n_freqs: the number of bins in a linear spectrogram
 * min_freq: the lowest frequency of the lowest band in a spectrogram
 * max_freq: the highest frequency of the highest band in a spectrogram
-* n_fft: the number of fourier bins
+* n_fft: the number of Fourier bins
 * n_mfcc, n_mels: to be consistent with other similarly named variables, with shape (channel, n_mfcc, time) and (channel, n_mels, times)
 * win_length: the length of the STFT window
 * window_fn: for functions that creates windows e.g. torch.hann_window

 Transforms expect the following shapes. In particular, the input of all transforms and functions assumes channel first.

-* Spectrogram: (channel, time) -> (channel, frequency, time, 2)
-* AmplitudeToDB: (channel, frequency, time, 2) -> (channel, frequency, time, 2)
+* Spectrogram: (channel, time) -> (channel, n_freqs, time, 2)
+* AmplitudeToDB: (channel, n_freqs, time, 2) -> (channel, n_freqs, time, 2)
 * MelScale: (channel, time) -> (channel, n_mels, time)
+* MelSpectrogram: (channel, time) -> (channel, n_mels, time, 2)
 * MFCC: (channel, time) -> (channel, n_mfcc, time)
 * MuLawEncode: (channel, time) -> (channel, time)
 * MuLawDecode: (channel, time) -> (channel, time)
 * Resample: (channel, time) -> (channel, time)
-* STFT: (channel, time, 2) -> (channel, frequency, time, 2).
-* ISTFT: (channel, frequency, time, 2) -> (channel, time, 2).
+* STFT: (channel, time, 2) -> (channel, n_freqs, time, 2).
+* ISTFT: (channel, n_freqs, time, 2) -> (channel, time, 2).
From cf3cfab436ec172aad6f1327c07efd3355327039 Mon Sep 17 00:00:00 2001
From: Vincent Quenneville-Belair
Date: Fri, 26 Jul 2019 12:51:45 -0700
Subject: [PATCH 10/15] time.

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 4161992834..43f81a241c 100644
--- a/README.md
+++ b/README.md
@@ -72,7 +72,7 @@ Torchaudio is standardized around the following naming conventions.
 * min_freq: the lowest frequency of the lowest band in a spectrogram
 * max_freq: the highest frequency of the highest band in a spectrogram
 * n_fft: the number of Fourier bins
-* n_mfcc, n_mels: to be consistent with other similarly named variables, with shape (channel, n_mfcc, time) and (channel, n_mels, times)
+* n_mfcc, n_mels: to be consistent with other similarly named variables, with shape (channel, n_mfcc, time) and (channel, n_mels, time)
 * win_length: the length of the STFT window
 * window_fn: for functions that creates windows e.g. torch.hann_window

From 17b5667cac2eaed426b205e9713c77bdd7e78742 Mon Sep 17 00:00:00 2001
From: Vincent Quenneville-Belair
Date: Fri, 26 Jul 2019 14:10:49 -0700
Subject: [PATCH 11/15] dimensions (or dimension names) vs number of them.

---
 README.md | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index 43f81a241c..b44ca2655d 100644
--- a/README.md
+++ b/README.md
@@ -63,28 +63,28 @@ Conventions

 Torchaudio is standardized around the following naming conventions.

-* waveform: a tensor of audio samples with shape (channel, time)
-* sample_rate: the rate of audio samples (samples per second)
-* specgram: a tensor of spectrogram with shape (channel, n_freqs, time)
-* mel_specgram: a mel spectrogram with shape (channel, n_mels, time)
+* waveform: a tensor of audio samples with dimensions (channel, time)
+* sample_rate: the rate of audio dimensions (samples per second)
+* specgram: a tensor of spectrogram with dimensions (channel, freq, time)
+* mel_specgram: a mel spectrogram with dimensions (channel, freq, time)
 * hop_length: the number of samples between the starts of consecutive frames
-* n_freqs: the number of bins in a linear spectrogram
+* n_fft: the number of Fourier bins
+* n_mfcc, n_mel: the number of mel and MFCC bins,
+* n_freq: the number of bins in a linear spectrogram
 * min_freq: the lowest frequency of the lowest band in a spectrogram
 * max_freq: the highest frequency of the highest band in a spectrogram
-* n_fft: the number of Fourier bins
-* n_mfcc, n_mels: to be consistent with other similarly named variables, with shape (channel, n_mfcc, time) and (channel, n_mels, time)
 * win_length: the length of the STFT window
 * window_fn: for functions that creates windows e.g. torch.hann_window

-Transforms expect the following shapes. In particular, the input of all transforms and functions assumes channel first.
+Transforms expect the following dimensions. In particular, the input of all transforms and functions assumes channel first.

-* Spectrogram: (channel, time) -> (channel, n_freqs, time, 2)
-* AmplitudeToDB: (channel, n_freqs, time, 2) -> (channel, n_freqs, time, 2)
-* MelScale: (channel, time) -> (channel, n_mels, time)
-* MelSpectrogram: (channel, time) -> (channel, n_mels, time, 2)
-* MFCC: (channel, time) -> (channel, n_mfcc, time)
+* Spectrogram: (channel, time) -> (channel, freq, time, 2)
+* AmplitudeToDB: (channel, freq, time, 2) -> (channel, freq, time, 2)
+* MelScale: (channel, time) -> (channel, mel, time)
+* MelSpectrogram: (channel, time) -> (channel, mel, time, 2)
+* MFCC: (channel, time) -> (channel, mfcc, time)
 * MuLawEncode: (channel, time) -> (channel, time)
 * MuLawDecode: (channel, time) -> (channel, time)
 * Resample: (channel, time) -> (channel, time)
-* STFT: (channel, time, 2) -> (channel, n_freqs, time, 2).
-* ISTFT: (channel, n_freqs, time, 2) -> (channel, time, 2).
+* STFT: (channel, time, 2) -> (channel, freq, time, 2)
+* ISTFT: (channel, freq, time, 2) -> (channel, time, 2)

From ac29e50a009bb4164a7e7be3c87bd5401df62a76 Mon Sep 17 00:00:00 2001
From: Vincent Quenneville-Belair
Date: Fri, 26 Jul 2019 14:14:30 -0700
Subject: [PATCH 12/15] typo.

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index b44ca2655d..31e1490124 100644
--- a/README.md
+++ b/README.md
@@ -69,7 +69,7 @@ Torchaudio is standardized around the following naming conventions.
 * mel_specgram: a mel spectrogram with dimensions (channel, freq, time)
 * hop_length: the number of samples between the starts of consecutive frames
 * n_fft: the number of Fourier bins
-* n_mfcc, n_mel: the number of mel and MFCC bins,
+* n_mfcc, n_mel: the number of mel and MFCC bins
 * n_freq: the number of bins in a linear spectrogram
 * min_freq: the lowest frequency of the lowest band in a spectrogram
 * max_freq: the highest frequency of the highest band in a spectrogram

From c36ed5162b22ced92bf5cddab1ddc331c311c25c Mon Sep 17 00:00:00 2001
From: Vincent Quenneville-Belair
Date: Fri, 26 Jul 2019 14:15:25 -0700
Subject: [PATCH 13/15] mel.

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 31e1490124..199f50c0ed 100644
--- a/README.md
+++ b/README.md
@@ -66,7 +66,7 @@ Torchaudio is standardized around the following naming conventions.
 * waveform: a tensor of audio samples with dimensions (channel, time)
 * sample_rate: the rate of audio dimensions (samples per second)
 * specgram: a tensor of spectrogram with dimensions (channel, freq, time)
-* mel_specgram: a mel spectrogram with dimensions (channel, freq, time)
+* mel_specgram: a mel spectrogram with dimensions (channel, mel, time)
 * hop_length: the number of samples between the starts of consecutive frames
 * n_fft: the number of Fourier bins
 * n_mfcc, n_mel: the number of mel and MFCC bins

From 9a52757df9fd5c053085e39f729229977f22d63a Mon Sep 17 00:00:00 2001
From: Vincent Quenneville-Belair
Date: Fri, 26 Jul 2019 14:36:27 -0700
Subject: [PATCH 14/15] order, and complex.

---
 README.md | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 199f50c0ed..0eb23bb9a1 100644
--- a/README.md
+++ b/README.md
@@ -69,7 +69,7 @@ Torchaudio is standardized around the following naming conventions.
 * mel_specgram: a mel spectrogram with dimensions (channel, mel, time)
 * hop_length: the number of samples between the starts of consecutive frames
 * n_fft: the number of Fourier bins
-* n_mfcc, n_mel: the number of mel and MFCC bins
+* n_mel, n_mfcc: the number of mel and MFCC bins
 * n_freq: the number of bins in a linear spectrogram
 * min_freq: the lowest frequency of the lowest band in a spectrogram
 * max_freq: the highest frequency of the highest band in a spectrogram
@@ -78,13 +78,15 @@ Torchaudio is standardized around the following naming conventions.

 Transforms expect the following dimensions. In particular, the input of all transforms and functions assumes channel first.

-* Spectrogram: (channel, time) -> (channel, freq, time, 2)
-* AmplitudeToDB: (channel, freq, time, 2) -> (channel, freq, time, 2)
+* Spectrogram: (channel, time) -> (channel, freq, time, complex)
+* AmplitudeToDB: (channel, freq, time, complex) -> (channel, freq, time, complex)
 * MelScale: (channel, time) -> (channel, mel, time)
-* MelSpectrogram: (channel, time) -> (channel, mel, time, 2)
+* MelSpectrogram: (channel, time) -> (channel, mel, time, complex)
 * MFCC: (channel, time) -> (channel, mfcc, time)
 * MuLawEncode: (channel, time) -> (channel, time)
 * MuLawDecode: (channel, time) -> (channel, time)
 * Resample: (channel, time) -> (channel, time)
-* STFT: (channel, time, 2) -> (channel, freq, time, 2)
-* ISTFT: (channel, freq, time, 2) -> (channel, time, 2)
+* STFT: (channel, time, complex) -> (channel, freq, time, complex)
+* ISTFT: (channel, freq, time, complex) -> (channel, time, complex)
+
+where complex refers to the 2 dimensions required to represent a complex number using real numbers.

From da66e1d73f31dc0d5f3f79837d6aa15f2b6c186d Mon Sep 17 00:00:00 2001
From: Vincent Quenneville-Belair
Date: Fri, 26 Jul 2019 14:40:13 -0700
Subject: [PATCH 15/15] no complex in transforms.

---
 README.md | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index 0eb23bb9a1..cbbfa87853 100644
--- a/README.md
+++ b/README.md
@@ -78,15 +78,11 @@ Torchaudio is standardized around the following naming conventions.

 Transforms expect the following dimensions. In particular, the input of all transforms and functions assumes channel first.

-* Spectrogram: (channel, time) -> (channel, freq, time, complex)
-* AmplitudeToDB: (channel, freq, time, complex) -> (channel, freq, time, complex)
+* Spectrogram: (channel, time) -> (channel, freq, time)
+* AmplitudeToDB: (channel, freq, time) -> (channel, freq, time)
 * MelScale: (channel, time) -> (channel, mel, time)
-* MelSpectrogram: (channel, time) -> (channel, mel, time, complex)
+* MelSpectrogram: (channel, time) -> (channel, mel, time)
 * MFCC: (channel, time) -> (channel, mfcc, time)
 * MuLawEncode: (channel, time) -> (channel, time)
 * MuLawDecode: (channel, time) -> (channel, time)
 * Resample: (channel, time) -> (channel, time)
-* STFT: (channel, time, complex) -> (channel, freq, time, complex)
-* ISTFT: (channel, freq, time, complex) -> (channel, time, complex)
-
-where complex refers to the 2 dimensions required to represent a complex number using real numbers.
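
Reviewer note: the final table states that MuLawEncode and MuLawDecode both map (channel, time) to (channel, time). The sketch below illustrates that channel-first convention in plain Python using the standard mu-law companding formula. It is not torchaudio's implementation: the helpers `mu_law_encode`, `mu_law_decode` and `dims`, the `quantization_channels=256` default, and the nested-list waveform are assumptions chosen for illustration only.

```python
import math


def mu_law_encode(waveform, quantization_channels=256):
    """Compand and quantize samples in [-1, 1]; dimensions stay (channel, time)."""
    mu = quantization_channels - 1
    encoded = []
    for channel in waveform:  # channel first, as in the README convention
        row = []
        for x in channel:
            # Continuous mu-law companding: sign(x) * ln(1 + mu|x|) / ln(1 + mu)
            compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
            # Map [-1, 1] onto integer codes 0..mu
            row.append(int((compressed + 1) / 2 * mu + 0.5))
        encoded.append(row)
    return encoded


def mu_law_decode(encoded, quantization_channels=256):
    """Invert mu_law_encode up to quantization error; dimensions stay (channel, time)."""
    mu = quantization_channels - 1
    decoded = []
    for channel in encoded:
        row = []
        for q in channel:
            y = q / mu * 2 - 1  # back to [-1, 1]
            row.append(math.copysign(((1 + mu) ** abs(y) - 1) / mu, y))
        decoded.append(row)
    return decoded


def dims(nested):
    """Return (channel, time) for a rectangular list of lists (hypothetical helper)."""
    return (len(nested), len(nested[0]))


# Hypothetical two-channel waveform with four samples per channel.
waveform = [[0.0, 0.5, -0.5, 1.0], [-1.0, 0.25, -0.25, 0.0]]
encoded = mu_law_encode(waveform)
decoded = mu_law_decode(encoded)
print(dims(waveform), dims(encoded), dims(decoded))
```

Both helpers iterate over the channel dimension first and never add or drop a dimension, which is the property the "mulaw shape." patch (06/15) records when it removes the `n_mulaw` dimension from the table.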