Hi all,
I think it's good timing to discuss a potential merging plan from torchaudio-contrib to here, especially because there are going to be new features and changes by @jamarshon @cpuhrsch.
Main idea
A lot of things are well summarized in https://github.com/keunwoochoi/torchaudio-contrib. In short, we wanted to re-design torch-based audio processing so that
- things can be `Layer`s, which are based on corresponding `Functional`s
- names for layers and arguments are carefully chosen
- all work for multi-channel
- complex numbers are supported when it makes sense (e.g., STFTs)
Review - layers
torchaudio-contrib already covers many of the functions that transforms.py covers now, but not all of them, and that's why I feel it's time to discuss this here.
Let me list the classes in transforms.py one by one with some notes.
1. Already in torchaudio-contrib. Hoping we'd replace.
- `class Spectrogram`: we have it in torchaudio-contrib. On top of this, we also have an `STFT` layer which outputs complex representations (same as `torch.stft`, since we're wrapping it).
- `class MelScale`: we have it and would like to suggest changing the name to something more general. We named it `class MelFilterbank`, assuming there can be other types of filterbanks, too. It also supports `htk` and non-`htk` mel filterbanks.
- `class SpectrogramToDB`: we would like to propose a more general approach -- `class AmplitudeToDb(ref=1.0, amin=1e-7)` and `class DbToAmplitude(ref=1.0)` -- because decibel scaling changes the input's unit, not its core content.
- `class MelSpectrogram`: we have it; it returns an `nn.Sequential` model consisting of a `Spectrogram` and a mel-scale filterbank.
- `class MuLawEncoding`, `class MuLawExpanding`: we have them -- actually a 99% copy of the implementation here.
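To make the `AmplitudeToDb` / `DbToAmplitude` proposal above concrete, here is a minimal sketch of how the pair could look. The signatures follow the proposal, but the implementation details (clamping with `amin`, the `20 * log10` amplitude convention) are my assumptions, not settled API:

```python
import torch


class AmplitudeToDb(torch.nn.Module):
    """Sketch: convert amplitudes to decibels as 20 * log10(x / ref)."""

    def __init__(self, ref=1.0, amin=1e-7):
        super().__init__()
        self.ref = ref
        self.amin = amin  # floor to avoid log(0)

    def forward(self, x):
        x = torch.clamp(x, min=self.amin)
        return 20.0 * torch.log10(x / self.ref)


class DbToAmplitude(torch.nn.Module):
    """Sketch: inverse of AmplitudeToDb, i.e. ref * 10^(x / 20)."""

    def __init__(self, ref=1.0):
        super().__init__()
        self.ref = ref

    def forward(self, x):
        return self.ref * torch.pow(10.0, x / 20.0)
```

The point of splitting the pair out of `SpectrogramToDB` is that the round trip `DbToAmplitude(AmplitudeToDb(x))` works on any amplitude-like tensor, not just spectrograms.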
2. Wouldn't need these
- `class Compose`: we wouldn't need it, because once things are based on `Layer`s people can simply build an `nn.Sequential()`.
- `class Scale`: it converts `int16` to `float`. I think we should deprecate this, because if we really need it, it should come with a more intuitive and precise name, and it should probably support other conversions as well.
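For illustration, a minimal sketch of what would stand in for `Compose` and `Scale` under this plan. The names `int16_to_float` and the toy `Abs` layer are hypothetical, chosen only to show the pattern:

```python
import torch


def int16_to_float(waveforms):
    """Hypothetical, better-named replacement for `Scale`:
    map int16 PCM samples to floats in [-1.0, 1.0)."""
    return waveforms.to(torch.float32) / 32768.0


class Abs(torch.nn.Module):
    """Toy layer standing in for Spectrogram, MelFilterbank, etc."""

    def forward(self, x):
        return x.abs()


# Once transforms are nn.Modules, Compose becomes unnecessary:
pipeline = torch.nn.Sequential(Abs(), Abs())
```

Chaining via `nn.Sequential` also means pipelines move to GPU and serialize like any other model, which `Compose` lists do not.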
3. To-be-added
- `class DownmixMono`: I would like to have one. But we are also considering a time-frequency representation-based downmix (an energy-preserving operation) (@faroit). I'm open to discussion. Personally I'd prefer to have separate classes, `DownmixWaveform()` and `DownmixSpecgram()`. Maybe until we have a better one, we should keep it as it is.
- `class MFCC`: we currently don't have it. The current torch/audio implementation uses `s2db` (`SpectrogramToDB`), but this class seems a little arbitrary to me, so we might want to re-implement it.
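A rough sketch of the waveform half of the `DownmixWaveform()` / `DownmixSpecgram()` split mentioned above. The class name comes from the proposal, but the channels-first `(batch, channel, time)` shape and the plain mean are my assumptions about the semantics:

```python
import torch


class DownmixWaveform(torch.nn.Module):
    """Sketch of a time-domain downmix: average over the channel axis.

    Assumes channels-first input of shape (batch, channel, time).
    The energy-preserving, spectrogram-domain variant would live in a
    separate DownmixSpecgram class, as discussed above.
    """

    def __init__(self, dim=1):
        super().__init__()
        self.dim = dim  # channel axis

    def forward(self, waveforms):
        # keepdim=True keeps a singleton channel axis, so downstream
        # layers see the same rank as for multi-channel input.
        return waveforms.mean(dim=self.dim, keepdim=True)
```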
4. Not sure about these
- `class PadTrim`: I don't actually know why we need it exactly; I would love to hear about this!
- `class LC2CL`: so far, torchaudio-contrib code hasn't considered channels-first tensors. If it's a thing, we'd i) update our code to make them compatible and ii) have the same or a similar class to this. But... do we really need this?
- `class BLC2CBL`: same as `LC2CL` -- I'd like to know its use cases.
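For what it's worth, the `LC2CL`-style layout swap reduces to a single transpose, which is part of why I wonder whether dedicated classes are needed. `lc2cl` here is just an illustrative helper, not proposed API:

```python
import torch


def lc2cl(waveforms):
    """Hypothetical one-liner covering LC2CL: swap the last two axes,
    turning (..., length, channel) into (..., channel, length)."""
    return waveforms.transpose(-1, -2)
```

Because it only touches the last two axes, the same helper covers both the unbatched `LC2CL` case and the batched `BLC2CBL` case.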
Review - argument and variable names
As summarised in keunwoochoi/torchaudio-contrib#46, we'd like to use
- `waveforms` for a batch of waveforms
- `real_specgrams` for magnitude spectrograms
- `complex_specgrams` for complex spectrograms
(This is relatively less discussed.)
Audio loading
@faroit has been working on replacing Sox with others. But here in this issue, I'd like to focus on the topics above.
So,
- Any opinion on this?
- Any answers to the questions I raised above?
- If it looks good, what else would you like to have in the one-shot PR that would replace the current transforms.py?