
Conversation

@bhargavkathivarapu (Contributor)
New PR for the overdrive C++ extension, similar to lfilter. Relates to #580 (old PR).

@mthrok (Contributor) left a comment

Hi @bhargavkathivarapu

Thanks for the contribution, and sorry for taking so long to get back to you.
Overall it looks good to me.
@cpuhrsch Can you take a second look?

# for i in range(waveform.shape[-1]):
# last_out = temp[:, i] - last_in + 0.995 * last_out
# last_in = temp[:, i]
# output_waveform[:, i] = waveform[:, i] * 0.5 + last_out * 0.75
Contributor

I think you can get rid of these comments now.

Contributor Author

@mthrok Sure, I will remove them.
I will also try to implement parallel_for, as mentioned by @cpuhrsch in the old PR comments (#580).
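
For context, below is a minimal sketch of how at::parallel_for could be applied to this loop: the work is split across channels, while the time recurrence within each channel stays sequential because every sample depends on the previous one. The function name, tensor layout, and variable names are assumptions for illustration (they mirror the commented-out Python above), not the code that was merged in this PR.

```cpp
// Hedged sketch, not the merged implementation: parallelize the overdrive
// recurrence over channels with at::parallel_for. The time loop stays
// sequential because each sample depends on the previous one.
#include <ATen/ATen.h>
#include <ATen/Parallel.h>

void overdrive_core_loop_cpu(
    at::Tensor waveform,          // (n_channels, n_frames), float32, contiguous
    at::Tensor temp,              // pre-shaped signal, same shape as waveform
    at::Tensor last_in,           // (n_channels,) filter state
    at::Tensor last_out,          // (n_channels,) filter state
    at::Tensor output_waveform) { // (n_channels, n_frames) result buffer
  const int64_t n_channels = waveform.size(0);
  const int64_t n_frames = waveform.size(1);
  const float* wave = waveform.data_ptr<float>();
  const float* tmp = temp.data_ptr<float>();
  float* in = last_in.data_ptr<float>();
  float* out = last_out.data_ptr<float>();
  float* result = output_waveform.data_ptr<float>();

  // Each thread handles a contiguous range of channels [begin, end).
  at::parallel_for(0, n_channels, 1, [&](int64_t begin, int64_t end) {
    for (int64_t c = begin; c < end; ++c) {
      for (int64_t i = 0; i < n_frames; ++i) {
        const int64_t idx = c * n_frames + i;
        // Same recurrence as the commented-out Python reference loop above.
        out[c] = tmp[idx] - in[c] + 0.995f * out[c];
        in[c] = tmp[idx];
        result[idx] = wave[idx] * 0.5f + out[c] * 0.75f;
      }
    }
  });
}
```

The grain size of 1 means each task handles at least one channel; with many very short channels a larger grain size can reduce threading overhead.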

Contributor

@bhargavkathivarapu sounds good. Could you try measuring the performance improvement?
Something like the following, changing the tensor to a variety of shapes.
If numactl is missing, you can install it with sudo apt-get install -y numactl on Ubuntu.

OMP_NUM_THREADS=1 numactl --membind 0 --cpubind 0 python -m timeit -n 100 -r 5 -s """
import torch;
import torchaudio
x = torch.zeros(32, 100, dtype=torch.float)
""" """
torchaudio.functional.overdrive(x)
"""

@cpuhrsch (Contributor) left a comment

LGTM, pending some basic benchmarks that show clear improvement.

@bhargavkathivarapu (Contributor Author) commented Feb 24, 2021

@mthrok

Changes made

  • Removed the commented-out code
  • Added parallel_for in the overdrive C++ code

Performance details
(60-second audio clip, VM with 2 vCPUs)

[Screenshot: timing results]

numactl details (VM with 2 vCPUs)
Command used:

OMP_NUM_THREADS=1 numactl --membind 0 --cpubind 0 python -m timeit -n 100 -r 5 -s """
import torch
import torchaudio
x = torch.zeros(2, 100, dtype=torch.float)
torchaudio.functional.overdrive(x)
"""

(2, 100) = 100 loops, best of 5: 14.1 nsec per loop
(32, 100) = 100 loops, best of 5: 16.8 nsec per loop
(1000, 100) = 100 loops, best of 5: 28.2 nsec per loop
(2, 100000) = 100 loops, best of 5: 40.4 nsec per loop
(100, 100000) = 100 loops, best of 5: 44.3 nsec per loop
(1000, 100000) = 100 loops, best of 5: 55.7 nsec per loop
(10000, 10000) = 100 loops, best of 5: 49.7 nsec per loop

@bhargavkathivarapu marked this pull request as ready for review February 24, 2021 12:13
@mthrok (Contributor) commented Feb 24, 2021

> @mthrok
>
> Changes made
>
> • Removed the commented-out code
> • Added parallel_for in the overdrive C++ code
>
> Performance details
> (60-second audio clip, VM with 2 vCPUs)
>
> [Screenshot: timing results]
>
> numactl details (VM with 2 vCPUs)
> Command used:
>
> OMP_NUM_THREADS=1 numactl --membind 0 --cpubind 0 python -m timeit -n 100 -r 5 -s """
> import torch
> import torchaudio
> x = torch.zeros(2, 100, dtype=torch.float)
> torchaudio.functional.overdrive(x)
> """
>
> (2, 100) = 100 loops, best of 5: 14.1 nsec per loop
> (32, 100) = 100 loops, best of 5: 16.8 nsec per loop
> (1000, 100) = 100 loops, best of 5: 28.2 nsec per loop
> (2, 100000) = 100 loops, best of 5: 40.4 nsec per loop
> (100, 100000) = 100 loops, best of 5: 44.3 nsec per loop
> (1000, 100000) = 100 loops, best of 5: 55.7 nsec per loop
> (10000, 10000) = 100 loops, best of 5: 49.7 nsec per loop

@bhargavkathivarapu

Thanks for the update. So my understanding is that the speed is comparable to the sox implementation.

Can you benchmark the same torchaudio.functional.overdrive function and Tensor shape as it was before the C++ code was added? That will show more directly how much the C++ implementation improves performance.

@mthrok (Contributor) commented Feb 24, 2021

Can you merge the latest master commit to include #1297? I believe I killed the flaky bug. 🐛

@bhargavkathivarapu (Contributor Author) commented Feb 24, 2021

@mthrok All checks now pass after merging master.

> Can you benchmark the same torchaudio.functional.overdrive function and Tensor shape as it was before the C++ code was added? That will show more directly how much the C++ implementation improves performance.

(Reference 60-second clip) Tensor shape of w = [1, 1323000]
Existing torchaudio overdrive timing:

[Screenshot: timing of the existing torchaudio overdrive]

Comparison for that tensor w

| Implementation | Relative run time | Run time |
| --- | --- | --- |
| sox overdrive | 1X | 38.7 ms |
| overdrive new (C++) | 2X | 78.2 ms |
| overdrive old (Python) | ~2000X | 77000 ms |

@mthrok (Contributor) commented Feb 24, 2021

> @mthrok All checks now pass after merging master.
>
> > Can you benchmark the same torchaudio.functional.overdrive function and Tensor shape as it was before the C++ code was added? That will show more directly how much the C++ implementation improves performance.
>
> (Reference 60-second clip) Tensor shape of w = [1, 1323000]
> Existing torchaudio overdrive timing:
>
> [Screenshot: timing of the existing torchaudio overdrive]
>
> Comparison for that tensor w
>
> | Implementation | Relative run time | Run time |
> | --- | --- | --- |
> | sox overdrive | 1X | 38.7 ms |
> | overdrive new (C++) | 2X | 78.2 ms |
> | overdrive old (Python) | ~2000X | 77000 ms |

Wonderful! Thanks!

@mthrok merged commit 23e9ed3 into pytorch:master Feb 24, 2021
mthrok pushed a commit to mthrok/audio that referenced this pull request Feb 26, 2021
