cudev: Add __shfl_down implementation for long long and unsigned long for CUDA Tookit < 9.0 #3963

cudawarped · 2025-06-25T12:41:37Z

Draft fix for #3962.

Support for __shfl_down on long long was not introduced until CUDA Toolkit 9.0. I don't know if this is just software support or if hardware support was added as well. It's a long shot, but it may be the reason that the tests are failing on Compute Capability 5.3 devices.

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

I agree to contribute to the project under Apache 2 License.
To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
The PR is proposed to the proper branch
There is a reference to the original bug report and related work
There is accuracy test, performance test and test data in opencv_extra repository, if applicable
Patch to opencv_extra has the same branch name.
The feature is well documented and sample code can be built with the project CMake

troelsy · 2025-06-30T07:30:52Z

Hi, I thought I would chime in as this relates to my recent PR. The company I work for uses Jetson TX2 with CC=6.2 and CUDA Toolkit 10.2, and everything seems to work, so I took a look at it. It looks like all devices that support warp shuffle (CC≥3.0) will support shuffle with long long as long as the CUDA Toolkit is ≥ 9.0.

To the question of whether it is implemented in software or hardware: I think warp shuffles are always done 32 bits at a time because the registers are limited to 32 bits, so a 64-bit type just takes two shuffles. The PTX in Compiler Explorer also indicates this: https://godbolt.org/z/nxdYcqoWe. If the PTX view doesn't show, try opening a new compiler window.
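The "two shuffles for a 64-bit type" idea can be illustrated with a small host-side C++ simulation. This is only a sketch under stated assumptions: `shuffleDown32` and `shuffleDown64` are hypothetical names that model a 32-lane warp as an array, not the cudev implementation and not real device code. The out-of-range behaviour (a lane keeps its own value when lane + delta falls outside the warp) follows the documented __shfl_down semantics.

```cpp
#include <array>
#include <cstdint>

constexpr unsigned kWarpSize = 32;

using Warp32 = std::array<std::uint32_t, kWarpSize>;
using Warp64 = std::array<std::uint64_t, kWarpSize>;

// Hypothetical stand-in for a 32-bit warp shuffle-down: each lane reads the
// register of lane + delta, or keeps its own value if that lane does not exist.
Warp32 shuffleDown32(const Warp32& regs, unsigned delta)
{
    Warp32 out{};
    for (unsigned lane = 0; lane < kWarpSize; ++lane) {
        const unsigned src = lane + delta;
        out[lane] = (src < kWarpSize) ? regs[src] : regs[lane];
    }
    return out;
}

// 64-bit shuffle-down composed of two 32-bit shuffles: split every value into
// low and high halves, shuffle each half separately, then recombine per lane.
Warp64 shuffleDown64(const Warp64& vals, unsigned delta)
{
    Warp32 lo{}, hi{};
    for (unsigned lane = 0; lane < kWarpSize; ++lane) {
        lo[lane] = static_cast<std::uint32_t>(vals[lane]);
        hi[lane] = static_cast<std::uint32_t>(vals[lane] >> 32);
    }
    const Warp32 loShifted = shuffleDown32(lo, delta);
    const Warp32 hiShifted = shuffleDown32(hi, delta);

    Warp64 out{};
    for (unsigned lane = 0; lane < kWarpSize; ++lane) {
        out[lane] = (static_cast<std::uint64_t>(hiShifted[lane]) << 32) | loShifted[lane];
    }
    return out;
}
```

Because both halves are shuffled with the same delta, each lane recombines a low and a high word that came from the same source lane, so the 64-bit result matches what a single wide shuffle would return.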

I think the if-statement should be changed to check the CUDA Toolkit version instead. The current code will change the behavior on Jetson TX2 even though it should be supported. Does OpenCV specify a minimum version of CUDA Toolkit?

cudawarped · 2025-06-30T07:43:34Z

@troelsy The flag was just a test to try to fix the crash on CC 5.3 devices. Do you have access to a CUDA Toolkit < 9.0 to test whether __shfl_down compiles with it or not? Godbolt doesn't have NVCC <= 9.1.85.

troelsy · 2025-06-30T09:01:19Z

TX2 should be able to run CUDA Toolkit 8.0, but my department doesn't have access to the firmware, so I can't try it out.

cudawarped · 2025-06-30T09:05:26Z

@asmorkalov Do you have a machine with CUDA Toolkit < 9.0 on it to check this?

asmorkalov · 2025-06-30T10:17:08Z

No, unfortunately. I want to deploy something with a desktop PC, but only after the 4.12 release. It's almost ready.

cudawarped · 2025-06-30T10:37:43Z

Would you like me to kill this PR or change the #define from __CUDA_ARCH__ < 700 to __CUDACC_VER_MAJOR__ < 9?
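The difference between the two guards can be sketched as follows. `__CUDA_ARCH__` describes the device architecture being compiled for, whereas `__CUDACC_VER_MAJOR__` describes the nvcc release doing the compiling, so only the latter tracks "CUDA Toolkit < 9.0". The macro and constant names below are hypothetical and chosen for illustration; the exact condition used in the merged patch is not reproduced here.

```cpp
// Hypothetical feature macro, not an OpenCV name.
// The defined() check matters: in a plain host C++ compiler the nvcc version
// macro is undefined and would evaluate to 0 inside #if, which would wrongly
// satisfy "< 9".
#if defined(__CUDACC_VER_MAJOR__) && (__CUDACC_VER_MAJOR__ < 9)
    // Old toolkit: no built-in 64-bit __shfl_down overloads, so a fallback
    // assembled from two 32-bit shuffles would be compiled in.
    #define EXAMPLE_NEEDS_SHFL_DOWN_64_FALLBACK 1
#else
    // Toolkit 9.0 or newer (or not nvcc at all): rely on the built-in overloads.
    #define EXAMPLE_NEEDS_SHFL_DOWN_64_FALLBACK 0
#endif

// Exposed as a constant so the decision can be inspected from ordinary code.
constexpr bool kNeedsShflDown64Fallback = (EXAMPLE_NEEDS_SHFL_DOWN_64_FALLBACK != 0);
```

With a toolkit-based guard like this, a Jetson TX2 build on CUDA Toolkit 10.2 keeps using the built-in shuffle, which an architecture-based guard such as `__CUDA_ARCH__ < 700` would not.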

asmorkalov · 2025-08-14T09:34:00Z

@cudawarped Is it ready for merge?

cudawarped · 2025-08-14T10:07:16Z

@asmorkalov 👍

cudawarped mentioned this pull request Jun 25, 2025

CUDA_Arithm/ThresholdOtsu.Accuracy/0 fails on Jetson NANO (old one) #3962

Closed

cudawarped marked this pull request as draft June 25, 2025 13:23

asmorkalov approved these changes Jun 27, 2025

asmorkalov added the category: cuda label Jun 27, 2025

cudev: Add _shfl_down implementation for long long and unsigned long …

af8945e

…long for CUDA Tookit versions < 9.0

cudawarped force-pushed the fix_shufl_down_on_cc_lt_70 branch from 3051f9a to af8945e Compare June 30, 2025 12:18

cudawarped changed the title ~~cudev: Add __shfl_down implementation for long long and unsigned long on devices of CC < 7.0~~ cudev: Add __shfl_down implementation for long long and unsigned long for CUDA Tookit < 9.0 Jun 30, 2025

asmorkalov self-assigned this Aug 14, 2025

asmorkalov marked this pull request as ready for review August 14, 2025 10:09

asmorkalov merged commit 408ee9f into opencv:4.x Aug 14, 2025
1 check passed

This was referenced Aug 20, 2025

5.x merge 4.x #3989

Closed

5.x merge 4.x #3990

Merged
