-
Notifications
You must be signed in to change notification settings - Fork 739
torchaudio-contrib: some augmentations #285
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
torchaudio/functional.py
Outdated
|
|
||
|
|
||
| @torch.jit.script | ||
| def stft(waveform, pad, window, n_fft, hop_length, win_length): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stft is already in pytorch core. I know that not having batching is annoying but replicating a function across both libraries is too. We're working on a more principled abstraction around batching, but we'd like to avoid having a reimplementation of this. Maybe we can do this in a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're working on a more principled abstraction around batching, but we'd like to avoid having a reimplementation of this. Maybe we can do this in a separate PR?
👍
Gotcha, so I can take this out no problem and then have the STFT layer just use torch.stft? I think the bigger point for this was to have the non-normalized output of stft as a functional/layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying there's an operation provided by torchaudio that normalizes the output when you would like it not to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, so Spectrogram calls the spectrogram functional which is basically computing the stft and then getting the power of the complex tensor (saying "normalizing" might have confusing on my part, sorry).
But in order to use the phase vocoder, the TimeStretch transform has to be applied to the output of the stft before the complex norm (i.e. before doingspec_f = spec_f.pow(power).sum(-1)).
So that's why I was kinda wanting torchaudio to have a transform that gives the complex STFT like we have in contrib, because if someone then wants to work with randomly stretched spectrograms, they can simply do:
nn.Sequential(STFT, TimeStretch, ComplexNorm, ...)whereas currently STFT and ComplexNorm are "coupled" together as Spectrogram, and we can't put the time stretching in between (i.e. nn.Sequential(Spectrogram, TimeStretch, ...) won't work).
In this PR, we can focus on single inputs in order to avoid adding STFT here again, as mentioned by @cpuhrsch. For batching, I would prefer waiting on a standardized approach. If that is so important, I'd open a separate PR for that so we don't block the rest :) |
Quick thought: I'm not necessarily suggesting to change something, but maybe |
vincentqb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! Eventually, it might be nice to add the snippet as a quick test to ensure all work well together in the future.
Yeah that's actually how we have it in -contrib. The layers are STFT, ApplyFilterbank, TimeStretch, etc. So then if say you want a db melspectrogram, simply chain |
I'd say we write the implementations here as if |
|
So I've removed the functional and transform complex-and-also-batched |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also added some tests for the masking functionals. Are the checks I wrote enough? I can add comments and/or more checks if needed. Should the next steps be writing some preliminary tests for the augmentation layers?
Is there something in SpecAugment we could compare against? Otherwise, the tests and the demo you provided here seems good to me.
I can also put a gist together showing how to chain these transforms.
If you have a demo of some code that could be fun to show, it could be shown as a new torchaudio tutorial, e.g. here, or example. :) These are useful since they also provide an integration test. This can be done as a separate PR.
vincentqb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this. LGTM!
Hmm I'll look into it! I think the check we have to make sure the masked # of columns is < the mask parameter is a good start but there may be other things to compare with.
Yeah absolutely. Happy to do a separate PR with a demo/tutorial once we've sorted out the |
|
PS: let's not worry about this in this PR, but we try to make the transforms be thin wrappers around functionals as much as possible. :) |
This reverts commit 008791c.
This reverts commit 008791c.
This reverts commit 008791c.
* STFT transform and function from #285 * merge options in existing functionality. * remove dimension 2 check. add test. * using ... * update spectrogram test.
Adding augmentations:
TimeStretch,TimeMaskingandFrequencyMasking.@vincentqb @cpuhrsch @keunwoochoi
On complex STFT
We had discussions about this in torchaudio-contrib, about how it's useful to have a transform that outputs the complex stft. So basically the same as the current
Spectrogrambut without normalizing the output. In our previous PR we added thecomplex_normto the functionals so then theSpectrogramtransform could just call the complex stft functional (which will basically just wraptorch.stftplus whatever padding we're currently doing, batching (?)) and then pass the output tocomplex_normfunctional.An example where this can be needed is for example with the introduction of the
TimeStretchlayer. It takes as input a complex spectrogram so if a user wanted to make a module where a spectrogram is computed and stretched it could look something like:I have added a
STFTtransform as well as astftfunctional to be able to showcase theTimeStretchaugmentation (we have something like this in torchaudio-contrib), but of course this should be reworked once we get some feedback!On batching
I've made
STFTbe able to handle a batch dimension, and so can the augmentation layers. I can't seem to find it but I remember a discussion on how the layers should handle batch of inputs. Should the layers apply transforms only to single inputs for now? I can make changes if that is the case.On the augmentations
So we talked about what augmentations to do in #259 about doing
TimeStretch,TimeMasking,FrequencyMaskingandPitchShiftbut I've only included the first three in this PR since pitch shifting I think will require a bit more discussion and this can be a solid first step. I haven't included the tests for them yet since I figured many things might change after feedback is given. A quick summary of what they do:Time Stretching
Simply wrapping and allowing for batching for the

phase_vocoderfunctional introduced in the last PR. A fixed rate can be passed when initializing, or a (randomly generated if doing augmentation) rate can be given to the forward method each time. Looks like:Time and Frequency masking
From SpecAugment, apply masks of a desired value to an input spectrogram. I've included a flag in case the input is batched and the user wants the masks to be independent of each other. Examples:


Same mask for batch:
Independent mask for batch:
Some demo code:
gives:
