Conversation

@yoyolicoris (Contributor) commented Jun 7, 2021

Relates to #1476.

Replacing the for-loop along the channel dimension with at::parallel_for should gain some speed.

To benchmark the changes, one can use the same script I presented in #1441 (comment).
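
For reference, the shape of the change looks roughly like the following; this is a minimal sketch, not the actual torchaudio kernel (n_channels and process_channel are illustrative placeholders):

```cpp
#include <ATen/Parallel.h>
#include <cstdint>
#include <functional>

// Before: a serial loop over independent channels,
//   for (int64_t c = 0; c < n_channels; ++c) process_channel(c);
// After: at::parallel_for splits [0, n_channels) into contiguous
// [begin, end) chunks and runs them on the intra-op thread pool.
void run_channels(int64_t n_channels,
                  const std::function<void(int64_t)>& process_channel) {
  at::parallel_for(0, n_channels, /*grain_size=*/1,
                   [&](int64_t begin, int64_t end) {
                     for (int64_t c = begin; c < end; ++c) {
                       process_channel(c);  // the IIR recursion itself stays serial
                     }
                   });
}
```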

yoyolicoris changed the title from "feat: parallelized for loop in lfilter" to "feat: parallelized for-loop in lfilter" on Jun 7, 2021
@yoyolicoris (Contributor, Author) commented

With at::parallel_for:

[-------------- IIR filter --------------]
                   |  forward  |  backward
1 threads: -------------------------------
      [32, 256]    |    337.6  |   1075.9 
      [32, 1024]   |    738.5  |   2584.3 
      [32, 4096]   |   2524.8  |   9364.3 
      [64, 256]    |    460.9  |   1536.4 
      [64, 1024]   |   1224.4  |   4683.3 
      [64, 4096]   |   9359.8  |  23350.0 
      [128, 256]   |    746.1  |   2598.0 
      [128, 1024]  |   2574.4  |   9781.5 
      [128, 4096]  |  22190.1  |  52730.7 
2 threads: -------------------------------
      [32, 256]    |    300.7  |    930.1 
      [32, 1024]   |    595.6  |   1923.3 
      [32, 4096]   |   1880.7  |   6998.0 
      [64, 256]    |    392.9  |   1225.6 
      [64, 1024]   |    905.3  |   3012.8 
      [64, 4096]   |   7791.2  |  18255.1 
      [128, 256]   |    603.3  |   1908.1 
      [128, 1024]  |   1773.9  |   6606.6 
      [128, 4096]  |  21661.1  |  44138.0 
4 threads: -------------------------------
      [32, 256]    |    284.4  |    859.1 
      [32, 1024]   |    510.3  |   1586.0 
      [32, 4096]   |   1750.4  |   6001.8 
      [64, 256]    |    345.5  |   1063.2 
      [64, 1024]   |    790.4  |   2402.6 
      [64, 4096]   |   9509.3  |  16874.3 
      [128, 256]   |    512.7  |   1592.8 
      [128, 1024]  |   1490.4  |   5722.3 
      [128, 4096]  |  21262.4  |  38738.6 

Times are in microseconds (us).

With the regular for-loop:

[-------------- IIR filter --------------]
                   |  forward  |  backward
1 threads: -------------------------------
      [32, 256]    |    327.1  |   1027.5 
      [32, 1024]   |    712.4  |   2505.0 
      [32, 4096]   |   2525.3  |   9485.6 
      [64, 256]    |    485.8  |   1583.1 
      [64, 1024]   |   1266.9  |   4837.8 
      [64, 4096]   |   7617.4  |  22548.3 
      [128, 256]   |    751.1  |   2572.2 
      [128, 1024]  |   2493.4  |  10458.1 
      [128, 4096]  |  22067.7  |  53279.5 
2 threads: -------------------------------
      [32, 256]    |    293.1  |    893.5 
      [32, 1024]   |    583.9  |   1869.9 
      [32, 4096]   |   1799.5  |   6761.2 
      [64, 256]    |    414.2  |   1270.5 
      [64, 1024]   |    926.2  |   3096.8 
      [64, 4096]   |   7121.1  |  18839.0 
      [128, 256]   |    607.0  |   1915.1 
      [128, 1024]  |   1752.3  |   7151.7 
      [128, 4096]  |  23125.6  |  44405.3 
4 threads: -------------------------------
      [32, 256]    |    275.8  |    828.8 
      [32, 1024]   |    514.1  |   1588.4 
      [32, 4096]   |   1590.3  |   6235.1 
      [64, 256]    |    348.8  |   1087.3 
      [64, 1024]   |    801.0  |   2430.8 
      [64, 4096]   |   7748.5  |  17719.1 
      [128, 256]   |    515.8  |   1585.6 
      [128, 1024]  |   1433.9  |   5702.6 
      [128, 4096]  |  20941.0  |  40453.1 

Times are in microseconds (us).

Seems like there's not much difference.

@mthrok (Contributor) commented Jun 7, 2021

Hi @yoyololicon

Thanks for working on this. And thanks for benchmarking it as well.
What command did you use to compile? It is possible that our standard compilation is not passing OMP-related flags.

@yoyolicoris (Contributor, Author) commented

I just typed python setup.py develop to build the source. Are there any compilation flags that I can apply?

@mthrok (Contributor) commented Jun 7, 2021

> I just typed python setup.py develop to build the source. Are there any compilation flags that I can apply?

I do not have a definitive answer. Probably you can try passing something like -fopenmp to CMake. However, note that the same OMP library that PyTorch uses has to be used for the compilation (Intel OMP and GNU OMP cannot be mixed), and torchaudio has not figured out a way to detect OMP and apply it automatically. (We have to consider many different build cases, and figuring it out for all the platforms is not easy, AFAIK.)

If you check the configuration of your PyTorch, then you might get a clue.

$ python -c 'import torch;print(torch.__config__.show())'
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

I think in the above case I need to install Intel OpenMP and pass the compilation flag.

@mthrok (Contributor) commented Jun 7, 2021

BTW, on a separate note: I was wondering, since the core of lfilter is convolution, is it possible to leverage the convolution operation for batch / filter-bank application? I was planning to look into the plausibility of the approach after the release, when I have time. But if you know a reason that it is not possible, then I will think of a different way.

@yoyolicoris (Contributor, Author) commented Jun 7, 2021

> I do not have a definitive answer. Probably you can try passing something like -fopenmp to CMake. However, note that the same OMP library that PyTorch uses has to be used for the compilation (Intel OMP and GNU OMP cannot be mixed), and torchaudio has not figured out a way to detect OMP and apply it automatically. (We have to consider many different build cases, and figuring it out for all the platforms is not easy, AFAIK.)

Thanks for the information, I will look into this later.

> BTW, on a separate note: I was wondering, since the core of lfilter is convolution, is it possible to leverage the convolution operation for batch / filter-bank application? I was planning to look into the plausibility of the approach after the release, when I have time. But if you know a reason that it is not possible, then I will think of a different way.

I assume you are talking about the FIR part of the lfilter, which is just regular convolution. It is totally possible to implement batch and filter-bank application at the same time by using PyTorch's F.conv1d operator. The backward pass can also be done with a convolution operator.
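
A rough sketch of that idea, assuming a single input channel per batch item; the comment above refers to the Python F.conv1d, and this uses the equivalent C++ frontend (fir_filter_bank is a hypothetical helper, not torchaudio code):

```cpp
#include <torch/torch.h>

// waveform:   [batch, 1, time]
// fir_coeffs: [n_filters, 1, n_taps]
// Returns [batch, n_filters, time]: every filter applied to every
// batch item in one conv1d call, with no per-channel loop.
torch::Tensor fir_filter_bank(const torch::Tensor& waveform,
                              const torch::Tensor& fir_coeffs) {
  namespace F = torch::nn::functional;
  const int64_t n_taps = fir_coeffs.size(-1);
  // Left-pad so the filter is causal and the output keeps the input length.
  auto padded = F::pad(waveform, F::PadFuncOptions({n_taps - 1, 0}));
  // conv1d computes cross-correlation, so flip the taps for true convolution.
  return F::conv1d(padded, fir_coeffs.flip({-1}));
}
```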

@yoyolicoris (Contributor, Author) commented Jun 8, 2021

After adding -fopenmp in CMakeLists.txt:

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -fopenmp ${TORCH_CXX_FLAGS}")

It does gain some speed, especially when the number of threads is 4; with 1 thread it introduces some extra overhead.

[-------------- IIR filter --------------]
                   |  forward  |  backward
1 threads: -------------------------------
      [32, 256]    |    344.6  |   1078.0 
      [32, 1024]   |    755.7  |   2597.8 
      [32, 4096]   |   2622.4  |   9882.9 
      [64, 256]    |    490.1  |   1588.8 
      [64, 1024]   |   1225.9  |   4533.5 
      [64, 4096]   |   8193.3  |  22827.1 
      [128, 256]   |    727.2  |   2515.0 
      [128, 1024]  |   2364.4  |   9630.4 
      [128, 4096]  |  22848.5  |  51971.1 
2 threads: -------------------------------
      [32, 256]    |    290.9  |    912.4 
      [32, 1024]   |    528.6  |   1787.4 
      [32, 4096]   |   1734.2  |   6401.9 
      [64, 256]    |    379.5  |   1211.1 
      [64, 1024]   |    771.5  |   2754.4 
      [64, 4096]   |   6484.9  |  16385.8 
      [128, 256]   |    528.2  |   1744.4 
      [128, 1024]  |   1422.7  |   6248.1 
      [128, 4096]  |  20337.9  |  41913.5 
4 threads: -------------------------------
      [32, 256]    |    261.5  |    822.1 
      [32, 1024]   |    425.8  |   1433.9 
      [32, 4096]   |    969.4  |   5079.1 
      [64, 256]    |    304.2  |    979.5 
      [64, 1024]   |    583.3  |   1963.3 
      [64, 4096]   |   5668.1  |  15176.9 
      [128, 256]   |    415.4  |   1365.0 
      [128, 1024]  |    985.4  |   4659.8 
      [128, 4096]  |  19454.9  |  36843.9 

Times are in microseconds (us).

@mthrok (Contributor) commented Aug 16, 2021

Hi @yoyololicon

I recently learned that at::parallel_for is not a good fit for batch parallelization, as it will disable the intra-op parallelization. I guess we can close the PR. What do you think?

@yoyolicoris (Contributor, Author) commented Aug 17, 2021

> I recently learned that at::parallel_for is not a good fit for batch parallelization, as it will disable the intra-op parallelization. I guess we can close the PR. What do you think?

@mthrok
Thanks for the kind info.
Do you mean it will disable intra-op parallelization from other concurrent processes? Because host_lfilter_core_loop doesn't seem to have any intra-op parallelism. Please correct me if I'm wrong.

@mthrok (Contributor) commented Aug 27, 2021

> > I recently learned that at::parallel_for is not a good fit for batch parallelization, as it will disable the intra-op parallelization. I guess we can close the PR. What do you think?
>
> @mthrok
> Thanks for the kind info.
> Do you mean it will disable intra-op parallelization from other concurrent processes? Because host_lfilter_core_loop doesn't seem to have any intra-op parallelism. Please correct me if I'm wrong.

Hi @yoyololicon

Sorry for the late response. You are right: the code inside parallel_for does not use any parallelism, so this should be good. (Although we need to enable OMP multithreading at build time.)
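
To illustrate the concern, a minimal sketch (illustrative code, not from torchaudio): ATen ops launched from inside an at::parallel_for region fall back to their single-threaded path, which at::in_parallel_region() makes observable.

```cpp
#include <ATen/Parallel.h>
#include <iostream>

int main() {
  std::cout << at::in_parallel_region() << "\n";  // 0: not in a parallel region
  at::parallel_for(0, 4, /*grain_size=*/1, [](int64_t, int64_t) {
    // Prints 1 here; any op called from this lambda (e.g. a matmul)
    // will not spawn its own intra-op threads.
    std::cout << at::in_parallel_region() << "\n";
  });
  return 0;
}
```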

@mthrok (Contributor) commented Aug 27, 2021

Can you resolve the conflict?

mthrok merged commit a525abb into pytorch:main on Aug 30, 2021
@mthrok (Contributor) commented Aug 30, 2021

@yoyololicon Thanks!

yoyolicoris deleted the feat/parallel-lfilter-loop branch on August 30, 2021, 14:47
mthrok mentioned this pull request on Sep 14, 2021