Provide more efficient implementation of lfilter on CPU #1238

Description

@cpuhrsch

lfilter is at the heart of biquad, which is called by 11 other biquad variants such as band_biquad or highpass_biquad.

Some basic investigation shows that, at the moment, the majority of the time is spent in the core loop copied below:

    # For each sample, subtract the dot product of the trailing n_order-wide
    # window of past outputs with the flipped feedback coefficients, then
    # append the new output sample to the padded output.
    for i_sample, o0 in enumerate(input_signal_windows.t()):
        windowed_output_signal = padded_output_waveform[
            :, i_sample:i_sample + n_order
        ]
        # In-place: o0 -= windowed_output_signal @ a_coeffs_flipped
        o0.addmv_(windowed_output_signal, a_coeffs_flipped, alpha=-1)
        padded_output_waveform[:, i_sample + n_order - 1] = o0

The likely bottleneck is the narrow indexing into the torch.Tensor padded_output_waveform, which materializes a very small Tensor for every sample in input_signal_windows and thus exacerbates known issues with framework overhead. Writing a naive one-off CPU kernel in C++ over the raw data pointers is expected to yield significant improvements, because it avoids creating intermediate Tensor structs and because addmv is currently passed very small inputs. The loop is also simple in nature, so it should be quick to verify any potential gains with a naive implementation on which we can then build.
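To make the overhead argument concrete, here is roughly what that loop computes once the Tensor machinery is stripped away, as a plain C++ sketch over raw float32 buffers. Shapes follow the Python code above (input of shape (n_channel, n_sample), zero-initialized padded output of shape (n_channel, n_sample + n_order - 1)); all names here are illustrative, not the final kernel.

    #include <cstdint>

    // Sketch only: the recurrence from the Python loop over raw float32
    // pointers. `input` already contains the feedforward (b) part, which
    // lfilter computes before entering the loop; `a_flipped` holds the
    // n_order flipped feedback coefficients.
    void lfilter_core_loop_sketch(
        const float* input,      // (n_channel, n_sample), contiguous
        const float* a_flipped,  // (n_order,)
        float* padded_output,    // (n_channel, n_sample + n_order - 1), zeroed
        int64_t n_channel,
        int64_t n_sample,
        int64_t n_order) {
      const int64_t out_stride = n_sample + n_order - 1;
      // Each channel's recurrence depends only on that channel's past
      // outputs, so iterating channel-major is valid and keeps each row's
      // memory accesses sequential.
      for (int64_t c = 0; c < n_channel; ++c) {
        const float* in_row = input + c * n_sample;
        float* out_row = padded_output + c * out_stride;
        for (int64_t i = 0; i < n_sample; ++i) {
          float o0 = in_row[i];
          // Equivalent of o0.addmv_(windowed_output_signal,
          // a_coeffs_flipped, alpha=-1). The last window slot is still
          // zero at this point, so its term is a no-op, exactly as in
          // the Python version.
          for (int64_t k = 0; k < n_order; ++k) {
            o0 -= a_flipped[k] * out_row[i + k];
          }
          out_row[i + n_order - 1] = o0;
        }
      }
    }

The per-sample work collapses to a handful of multiply-accumulates with no allocation, which is exactly the part the Tensor-based loop pays framework overhead for.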

Steps

  1. Set up a basic benchmark that runs lfilter on some basic input data. The existing lfilter tests can serve as a good starting point.
  2. Create torchaudio/csrc/lfilter.cpp by copying the transducer bindings (see the registration sketch after this list).
  3. Modify lfilter.cpp to host a function with a signature along the lines of _lfilter_core_loop(Tensor input_signal_windows, Tensor a_coeffs_flipped, Tensor padded_output_waveform) (the loop needs the flipped feedback coefficients, and n_order follows as a_coeffs_flipped.size(0)) and translate the above code into C++. A composite operator such as group_norm can provide an additional code example, but the C++ API is very similar to the Python API.
  4. Write a pure C-style for-loop over the underlying data pointers, as in the sketch above. For this, verify that the inputs all live on CPU, are contiguous, and are of type float32 (for now). One case of the batchnorm inference code is an example of this pattern.
  5. Run the tests and measure the performance gains. We can then use this analysis to decide whether we need more sophisticated strategies.
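Putting steps 2-4 together, here is a rough sketch of what torchaudio/csrc/lfilter.cpp could look like, using the TORCH_LIBRARY_FRAGMENT/TORCH_LIBRARY_IMPL registration pattern and the lfilter_core_loop_sketch function from above. This illustrates the shape of the code, not a final implementation:

    #include <torch/script.h>

    // Step 4: restrict the fast path to CPU, contiguous, float32 inputs
    // (for now), then hand the raw pointers to the plain loop.
    void lfilter_core_loop(
        const torch::Tensor& input_signal_windows,
        const torch::Tensor& a_coeffs_flipped,
        torch::Tensor& padded_output_waveform) {
      TORCH_CHECK(input_signal_windows.device().is_cpu());
      TORCH_CHECK(a_coeffs_flipped.device().is_cpu());
      TORCH_CHECK(padded_output_waveform.device().is_cpu());
      TORCH_CHECK(input_signal_windows.is_contiguous());
      TORCH_CHECK(a_coeffs_flipped.is_contiguous());
      TORCH_CHECK(padded_output_waveform.is_contiguous());
      TORCH_CHECK(input_signal_windows.scalar_type() == torch::kFloat32);
      TORCH_CHECK(a_coeffs_flipped.scalar_type() == torch::kFloat32);
      TORCH_CHECK(padded_output_waveform.scalar_type() == torch::kFloat32);

      const int64_t n_channel = input_signal_windows.size(0);
      const int64_t n_sample = input_signal_windows.size(1);
      const int64_t n_order = a_coeffs_flipped.size(0);
      TORCH_CHECK(
          padded_output_waveform.size(1) == n_sample + n_order - 1);

      // Raw-pointer loop from the sketch above.
      lfilter_core_loop_sketch(
          input_signal_windows.data_ptr<float>(),
          a_coeffs_flipped.data_ptr<float>(),
          padded_output_waveform.data_ptr<float>(),
          n_channel,
          n_sample,
          n_order);
    }

    // Register the schema (the Tensor(a!) annotation marks the output as
    // mutated in place) and a CPU implementation.
    TORCH_LIBRARY_FRAGMENT(torchaudio, m) {
      m.def(
          "_lfilter_core_loop(Tensor input_signal_windows, "
          "Tensor a_coeffs_flipped, Tensor(a!) padded_output_waveform) -> ()");
    }

    TORCH_LIBRARY_IMPL(torchaudio, CPU, m) {
      m.impl("_lfilter_core_loop", lfilter_core_loop);
    }

From Python the operator would then be reachable as torch.ops.torchaudio._lfilter_core_loop(...), which lfilter can call in place of the loop above whenever the fast-path conditions hold.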

Building and Testing

  1. Install a nightly build of PyTorch
  2. Clone repo and build torchaudio
    # mac
    BUILD_SOX=1 MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py develop
    # linux
    BUILD_SOX=1 python setup.py develop
    
  3. Run py.test -s -v test/torchaudio_unittest/functional/functional_cpu_test.py, or py.test -s -v test/torchaudio_unittest/functional/functional_cpu_test.py::TestLFilterFloat32::test_simple for an lfilter-specific test and faster iteration.
