Provide more efficient implementation of lfilter on CPU #1238

Description

@cpuhrsch

lfilter is at the heart of biquad, which is called by 11 other biquad variants such as band_biquad or highpass_biquad.

Some basic investigation shows that, at the moment, the majority of the time is spent in the core loop copied below:

    # For each sample, subtract the dot product of the trailing n_order-wide
    # window of past outputs with the flipped feedback coefficients, then
    # append the new output sample to the padded output.
    for i_sample, o0 in enumerate(input_signal_windows.t()):
        windowed_output_signal = padded_output_waveform[
            :, i_sample:i_sample + n_order
        ]
        # In-place: o0 -= windowed_output_signal @ a_coeffs_flipped
        o0.addmv_(windowed_output_signal, a_coeffs_flipped, alpha=-1)
        padded_output_waveform[:, i_sample + n_order - 1] = o0

The likely bottleneck is the narrow indexing into the torch.Tensor padded_output_waveform, which materializes a very small Tensor for every sample in input_signal_windows and thus exacerbates known issues with framework overhead. Writing a naive one-off CPU kernel in C++ over the raw data pointers is expected to yield significant improvements, because it avoids creating intermediate Tensor structs and because addmv is currently passed very small inputs. The loop is also simple in nature, so it should be quick to verify any potential gains with a naive implementation on which we can then build.
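To make the overhead argument concrete, here is roughly what that loop computes once the Tensor machinery is stripped away, as a plain C++ sketch over raw float32 buffers. Shapes follow the Python code above (input of shape (n_channel, n_sample), zero-initialized padded output of shape (n_channel, n_sample + n_order - 1)); all names here are illustrative, not the final kernel.

    #include <cstdint>

    // Sketch only: the recurrence from the Python loop over raw float32
    // pointers. `input` already contains the feedforward (b) part, which
    // lfilter computes before entering the loop; `a_flipped` holds the
    // n_order flipped feedback coefficients.
    void lfilter_core_loop_sketch(
        const float* input,      // (n_channel, n_sample), contiguous
        const float* a_flipped,  // (n_order,)
        float* padded_output,    // (n_channel, n_sample + n_order - 1), zeroed
        int64_t n_channel,
        int64_t n_sample,
        int64_t n_order) {
      const int64_t out_stride = n_sample + n_order - 1;
      // Each channel's recurrence depends only on that channel's past
      // outputs, so iterating channel-major is valid and keeps each row's
      // memory accesses sequential.
      for (int64_t c = 0; c < n_channel; ++c) {
        const float* in_row = input + c * n_sample;
        float* out_row = padded_output + c * out_stride;
        for (int64_t i = 0; i < n_sample; ++i) {
          float o0 = in_row[i];
          // Equivalent of o0.addmv_(windowed_output_signal,
          // a_coeffs_flipped, alpha=-1). The last window slot is still
          // zero at this point, so its term is a no-op, exactly as in
          // the Python version.
          for (int64_t k = 0; k < n_order; ++k) {
            o0 -= a_flipped[k] * out_row[i + k];
          }
          out_row[i + n_order - 1] = o0;
        }
      }
    }

The per-sample work collapses to a handful of multiply-accumulates with no allocation, which is exactly the part the Tensor-based loop pays framework overhead for.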

Steps

  1. Set up a basic benchmark that runs lfilter on some basic input data. The existing lfilter tests can serve as a good starting point.
  2. Create torchaudio/csrc/lfilter.cpp by copying the transducer bindings (see the registration sketch after this list).
  3. Modify lfilter.cpp to host a function with a signature along the lines of _lfilter_core_loop(Tensor input_signal_windows, Tensor a_coeffs_flipped, Tensor padded_output_waveform) (the loop needs the flipped feedback coefficients, and n_order follows as a_coeffs_flipped.size(0)) and translate the above code into C++. A composite operator such as group_norm can provide an additional code example, but the C++ API is very similar to the Python API.
  4. Write a pure C-style for-loop over the underlying data pointers, as in the sketch above. For this, verify that the inputs all live on CPU, are contiguous, and are of type float32 (for now). One case of the batchnorm inference code is an example of this pattern.
  5. Run the tests and measure the performance gains. We can then use this analysis to decide whether we need more sophisticated strategies.
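Putting steps 2-4 together, here is a rough sketch of what torchaudio/csrc/lfilter.cpp could look like, using the TORCH_LIBRARY_FRAGMENT/TORCH_LIBRARY_IMPL registration pattern and the lfilter_core_loop_sketch function from above. This illustrates the shape of the code, not a final implementation:

    #include <torch/script.h>

    // Step 4: restrict the fast path to CPU, contiguous, float32 inputs
    // (for now), then hand the raw pointers to the plain loop.
    void lfilter_core_loop(
        const torch::Tensor& input_signal_windows,
        const torch::Tensor& a_coeffs_flipped,
        torch::Tensor& padded_output_waveform) {
      TORCH_CHECK(input_signal_windows.device().is_cpu());
      TORCH_CHECK(a_coeffs_flipped.device().is_cpu());
      TORCH_CHECK(padded_output_waveform.device().is_cpu());
      TORCH_CHECK(input_signal_windows.is_contiguous());
      TORCH_CHECK(a_coeffs_flipped.is_contiguous());
      TORCH_CHECK(padded_output_waveform.is_contiguous());
      TORCH_CHECK(input_signal_windows.scalar_type() == torch::kFloat32);
      TORCH_CHECK(a_coeffs_flipped.scalar_type() == torch::kFloat32);
      TORCH_CHECK(padded_output_waveform.scalar_type() == torch::kFloat32);

      const int64_t n_channel = input_signal_windows.size(0);
      const int64_t n_sample = input_signal_windows.size(1);
      const int64_t n_order = a_coeffs_flipped.size(0);
      TORCH_CHECK(
          padded_output_waveform.size(1) == n_sample + n_order - 1);

      // Raw-pointer loop from the sketch above.
      lfilter_core_loop_sketch(
          input_signal_windows.data_ptr<float>(),
          a_coeffs_flipped.data_ptr<float>(),
          padded_output_waveform.data_ptr<float>(),
          n_channel,
          n_sample,
          n_order);
    }

    // Register the schema (the Tensor(a!) annotation marks the output as
    // mutated in place) and a CPU implementation.
    TORCH_LIBRARY_FRAGMENT(torchaudio, m) {
      m.def(
          "_lfilter_core_loop(Tensor input_signal_windows, "
          "Tensor a_coeffs_flipped, Tensor(a!) padded_output_waveform) -> ()");
    }

    TORCH_LIBRARY_IMPL(torchaudio, CPU, m) {
      m.impl("_lfilter_core_loop", lfilter_core_loop);
    }

From Python the operator would then be reachable as torch.ops.torchaudio._lfilter_core_loop(...), which lfilter can call in place of the loop above whenever the fast-path conditions hold.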

Building and Testing

  1. Install a nightly build of PyTorch
  2. Clone repo and build torchaudio
    # mac
    BUILD_SOX=1 MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py develop
    # linux
    BUILD_SOX=1 python setup.py develop
    
  3. Run py.test -s -v test/torchaudio_unittest/functional/functional_cpu_test.py, or py.test -s -v test/torchaudio_unittest/functional/functional_cpu_test.py::TestLFilterFloat32::test_simple for an lfilter-specific test and faster iteration.
