Conversation

@yoyolicoris (Contributor) commented Jun 7, 2021

Relates to #1476.

Replacing the for-loop along the channel dimension with at::parallel_for should gain some speed.

To benchmark the changes, one can use the same script I presented in #1441 (comment).
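
For reference, the shape of the change looks roughly like the following; this is a minimal sketch, not the actual torchaudio kernel (n_channels and process_channel are illustrative placeholders):

```cpp
#include <ATen/Parallel.h>
#include <cstdint>
#include <functional>

// Before: a serial loop over independent channels,
//   for (int64_t c = 0; c < n_channels; ++c) process_channel(c);
// After: at::parallel_for splits [0, n_channels) into contiguous
// [begin, end) chunks and runs them on the intra-op thread pool.
void run_channels(int64_t n_channels,
                  const std::function<void(int64_t)>& process_channel) {
  at::parallel_for(0, n_channels, /*grain_size=*/1,
                   [&](int64_t begin, int64_t end) {
                     for (int64_t c = begin; c < end; ++c) {
                       process_channel(c);  // the IIR recursion itself stays serial
                     }
                   });
}
```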

yoyolicoris changed the title from "feat: parallelized for loop in lfilter" to "feat: parallelized for-loop in lfilter" on Jun 7, 2021
@yoyolicoris (Contributor, Author) commented

With at::parallel_for:

[-------------- IIR filter --------------]
                   |  forward  |  backward
1 threads: -------------------------------
      [32, 256]    |    337.6  |   1075.9 
      [32, 1024]   |    738.5  |   2584.3 
      [32, 4096]   |   2524.8  |   9364.3 
      [64, 256]    |    460.9  |   1536.4 
      [64, 1024]   |   1224.4  |   4683.3 
      [64, 4096]   |   9359.8  |  23350.0 
      [128, 256]   |    746.1  |   2598.0 
      [128, 1024]  |   2574.4  |   9781.5 
      [128, 4096]  |  22190.1  |  52730.7 
2 threads: -------------------------------
      [32, 256]    |    300.7  |    930.1 
      [32, 1024]   |    595.6  |   1923.3 
      [32, 4096]   |   1880.7  |   6998.0 
      [64, 256]    |    392.9  |   1225.6 
      [64, 1024]   |    905.3  |   3012.8 
      [64, 4096]   |   7791.2  |  18255.1 
      [128, 256]   |    603.3  |   1908.1 
      [128, 1024]  |   1773.9  |   6606.6 
      [128, 4096]  |  21661.1  |  44138.0 
4 threads: -------------------------------
      [32, 256]    |    284.4  |    859.1 
      [32, 1024]   |    510.3  |   1586.0 
      [32, 4096]   |   1750.4  |   6001.8 
      [64, 256]    |    345.5  |   1063.2 
      [64, 1024]   |    790.4  |   2402.6 
      [64, 4096]   |   9509.3  |  16874.3 
      [128, 256]   |    512.7  |   1592.8 
      [128, 1024]  |   1490.4  |   5722.3 
      [128, 4096]  |  21262.4  |  38738.6 

Times are in microseconds (us).

With the regular for-loop:

[-------------- IIR filter --------------]
                   |  forward  |  backward
1 threads: -------------------------------
      [32, 256]    |    327.1  |   1027.5 
      [32, 1024]   |    712.4  |   2505.0 
      [32, 4096]   |   2525.3  |   9485.6 
      [64, 256]    |    485.8  |   1583.1 
      [64, 1024]   |   1266.9  |   4837.8 
      [64, 4096]   |   7617.4  |  22548.3 
      [128, 256]   |    751.1  |   2572.2 
      [128, 1024]  |   2493.4  |  10458.1 
      [128, 4096]  |  22067.7  |  53279.5 
2 threads: -------------------------------
      [32, 256]    |    293.1  |    893.5 
      [32, 1024]   |    583.9  |   1869.9 
      [32, 4096]   |   1799.5  |   6761.2 
      [64, 256]    |    414.2  |   1270.5 
      [64, 1024]   |    926.2  |   3096.8 
      [64, 4096]   |   7121.1  |  18839.0 
      [128, 256]   |    607.0  |   1915.1 
      [128, 1024]  |   1752.3  |   7151.7 
      [128, 4096]  |  23125.6  |  44405.3 
4 threads: -------------------------------
      [32, 256]    |    275.8  |    828.8 
      [32, 1024]   |    514.1  |   1588.4 
      [32, 4096]   |   1590.3  |   6235.1 
      [64, 256]    |    348.8  |   1087.3 
      [64, 1024]   |    801.0  |   2430.8 
      [64, 4096]   |   7748.5  |  17719.1 
      [128, 256]   |    515.8  |   1585.6 
      [128, 1024]  |   1433.9  |   5702.6 
      [128, 4096]  |  20941.0  |  40453.1 

Times are in microseconds (us).

Seems like there's not much difference.

@mthrok (Contributor) commented Jun 7, 2021

Hi @yoyololicon

Thanks for working on this. And thanks for benchmarking it as well.
What command did you use to compile? It is possible that our standard compilation is not passing OMP-related flags.

@yoyolicoris (Contributor, Author) commented

I just typed python setup.py develop to build the source. Are there any compilation flags that I can apply?

@mthrok (Contributor) commented Jun 7, 2021

> I just typed python setup.py develop to build the source. Are there any compilation flags that I can apply?

I do not have a definitive answer. Probably you can try passing something like -fopenmp to CMake. However, note that the same OMP library that PyTorch uses has to be used for the compilation (Intel OMP and GNU OMP cannot be mixed), and torchaudio has not figured out a way to detect OMP and apply it automatically. (We have to consider many different build cases, and figuring it out for all the platforms is not easy, AFAIK.)

If you check the configuration of your PyTorch, then you might get a clue.

$ python -c 'import torch;print(torch.__config__.show())'
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

I think in the above case I need to install Intel OpenMP and pass the compilation flag.

@mthrok (Contributor) commented Jun 7, 2021

BTW, on a separate note: I was wondering, since the core of lfilter is convolution, is it possible to leverage the convolution operation for batch / filter-bank application? I was planning to look into the plausibility of the approach after the release, when I have time. But if you know a reason that it is not possible, then I will think of a different way.

@yoyolicoris (Contributor, Author) commented Jun 7, 2021

> I do not have a definitive answer. Probably you can try passing something like -fopenmp to CMake. However, note that the same OMP library that PyTorch uses has to be used for the compilation (Intel OMP and GNU OMP cannot be mixed), and torchaudio has not figured out a way to detect OMP and apply it automatically. (We have to consider many different build cases, and figuring it out for all the platforms is not easy, AFAIK.)

Thanks for the information, I will look into this later.

> BTW, on a separate note: I was wondering, since the core of lfilter is convolution, is it possible to leverage the convolution operation for batch / filter-bank application? I was planning to look into the plausibility of the approach after the release, when I have time. But if you know a reason that it is not possible, then I will think of a different way.

I assume you are talking about the FIR part of the lfilter, which is just regular convolution. It is totally possible to implement batch and filter-bank application at the same time by using PyTorch's F.conv1d operator. The backward pass can also be done with a convolution operator.
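
A rough sketch of that idea, assuming a single input channel per batch item; the comment above refers to the Python F.conv1d, and this uses the equivalent C++ frontend (fir_filter_bank is a hypothetical helper, not torchaudio code):

```cpp
#include <torch/torch.h>

// waveform:   [batch, 1, time]
// fir_coeffs: [n_filters, 1, n_taps]
// Returns [batch, n_filters, time]: every filter applied to every
// batch item in one conv1d call, with no per-channel loop.
torch::Tensor fir_filter_bank(const torch::Tensor& waveform,
                              const torch::Tensor& fir_coeffs) {
  namespace F = torch::nn::functional;
  const int64_t n_taps = fir_coeffs.size(-1);
  // Left-pad so the filter is causal and the output keeps the input length.
  auto padded = F::pad(waveform, F::PadFuncOptions({n_taps - 1, 0}));
  // conv1d computes cross-correlation, so flip the taps for true convolution.
  return F::conv1d(padded, fir_coeffs.flip({-1}));
}
```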

@yoyolicoris (Contributor, Author) commented Jun 8, 2021

After adding -fopenmp in CMakeLists.txt:

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -fopenmp ${TORCH_CXX_FLAGS}")

It does gain some speed, especially when the number of threads is 4; with 1 thread it introduces some extra overhead.

[-------------- IIR filter --------------]
                   |  forward  |  backward
1 threads: -------------------------------
      [32, 256]    |    344.6  |   1078.0 
      [32, 1024]   |    755.7  |   2597.8 
      [32, 4096]   |   2622.4  |   9882.9 
      [64, 256]    |    490.1  |   1588.8 
      [64, 1024]   |   1225.9  |   4533.5 
      [64, 4096]   |   8193.3  |  22827.1 
      [128, 256]   |    727.2  |   2515.0 
      [128, 1024]  |   2364.4  |   9630.4 
      [128, 4096]  |  22848.5  |  51971.1 
2 threads: -------------------------------
      [32, 256]    |    290.9  |    912.4 
      [32, 1024]   |    528.6  |   1787.4 
      [32, 4096]   |   1734.2  |   6401.9 
      [64, 256]    |    379.5  |   1211.1 
      [64, 1024]   |    771.5  |   2754.4 
      [64, 4096]   |   6484.9  |  16385.8 
      [128, 256]   |    528.2  |   1744.4 
      [128, 1024]  |   1422.7  |   6248.1 
      [128, 4096]  |  20337.9  |  41913.5 
4 threads: -------------------------------
      [32, 256]    |    261.5  |    822.1 
      [32, 1024]   |    425.8  |   1433.9 
      [32, 4096]   |    969.4  |   5079.1 
      [64, 256]    |    304.2  |    979.5 
      [64, 1024]   |    583.3  |   1963.3 
      [64, 4096]   |   5668.1  |  15176.9 
      [128, 256]   |    415.4  |   1365.0 
      [128, 1024]  |    985.4  |   4659.8 
      [128, 4096]  |  19454.9  |  36843.9 

Times are in microseconds (us).

@mthrok (Contributor) commented Aug 16, 2021

Hi @yoyololicon

I recently learned that at::parallel_for is not a good fit for batch parallelization, as it will disable the intra-op parallelization. I guess we can close the PR. What do you think?

@yoyolicoris (Contributor, Author) commented Aug 17, 2021

> I recently learned that at::parallel_for is not a good fit for batch parallelization, as it will disable the intra-op parallelization. I guess we can close the PR. What do you think?

@mthrok
Thanks for the kind info.
Do you mean it will disable intra-op parallelization from other concurrent processes? Because host_lfilter_core_loop doesn't seem to have any intra-op parallelism. Please correct me if I'm wrong.

@mthrok (Contributor) commented Aug 27, 2021

> > I recently learned that at::parallel_for is not a good fit for batch parallelization, as it will disable the intra-op parallelization. I guess we can close the PR. What do you think?
>
> @mthrok
> Thanks for the kind info.
> Do you mean it will disable intra-op parallelization from other concurrent processes? Because host_lfilter_core_loop doesn't seem to have any intra-op parallelism. Please correct me if I'm wrong.

Hi @yoyololicon

Sorry for the late response. You are right: the code inside parallel_for does not use any parallelism, so this should be good. (Although we need to enable OMP multithreading at build time.)
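
To illustrate the concern, a minimal sketch (illustrative code, not from torchaudio): ATen ops launched from inside an at::parallel_for region fall back to their single-threaded path, which at::in_parallel_region() makes observable.

```cpp
#include <ATen/Parallel.h>
#include <iostream>

int main() {
  std::cout << at::in_parallel_region() << "\n";  // 0: not in a parallel region
  at::parallel_for(0, 4, /*grain_size=*/1, [](int64_t, int64_t) {
    // Prints 1 here; any op called from this lambda (e.g. a matmul)
    // will not spawn its own intra-op threads.
    std::cout << at::in_parallel_region() << "\n";
  });
  return 0;
}
```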

@mthrok (Contributor) commented Aug 27, 2021

Can you resolve the conflict?

mthrok merged commit a525abb into pytorch:main on Aug 30, 2021
@mthrok (Contributor) commented Aug 30, 2021

@yoyololicon Thanks!

yoyolicoris deleted the feat/parallel-lfilter-loop branch on August 30, 2021, 14:47
mthrok mentioned this pull request on Sep 14, 2021