@maleadt (Member) commented Sep 3, 2025

No description provided.

christiangnrd and others added 4 commits September 3, 2025 08:54
Requiring multiple launches probably didn't make it worth it anyway,
and it introduced complexities wrt. the launch configuration: having
to recompile and re-compute the size of the partial reduction.
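For context, the single-launch path this enables can be sketched as follows (assumed CUDA.jl occupancy API, mirroring src/mapreduce.jl in the diff further down; a sketch, not a drop-in implementation):

```julia
# Sketch: compile once with launch=false, query occupancy once, and derive
# the launch geometry from that single query -- no second compilation or
# re-computed partial-reduction size is needed.
kernel = @cuda launch=false serial_mapreduce_kernel(args...)
config = launch_configuration(kernel.fun)
threads = config.threads
blocks = cld(length(Rother), threads)
kernel(args...; threads, blocks)
```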
github-actions bot (Contributor) commented Sep 3, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic release-5.8) to apply these changes.

Suggested changes:
diff --git a/lib/cupti/wrappers.jl b/lib/cupti/wrappers.jl
index 30eeee63a..030d564dd 100644
--- a/lib/cupti/wrappers.jl
+++ b/lib/cupti/wrappers.jl
@@ -27,12 +27,13 @@ const cupti_versions = [
     v"12.7",
     v"12.8",
     v"12.9",
-    v"12.9.1"]
+    v"12.9.1",
+]
 
 function version()
     version_ref = Ref{Cuint}()
     cuptiGetVersion(version_ref)
-    if CUDA.runtime_version() < v"13"
+    return if CUDA.runtime_version() < v"13"
         cupti_versions[version_ref[]]
     else
         major, ver = divrem(version_ref[], 10000)
diff --git a/src/CUDA.jl b/src/CUDA.jl
index 8a82201a0..97402ecd5 100644
--- a/src/CUDA.jl
+++ b/src/CUDA.jl
@@ -54,7 +54,7 @@ using Printf
 # - Base.aligned_sizeof is the size of an object in an array/inline alloced
 # Both of them are equivalent for immutable objects, but differ for mutable singtons and Symbol
 # We use `aligned_sizeof` since we care about the size of a type in an array
-@generated function aligned_sizeof(::Type{T}) where T
+@generated function aligned_sizeof(::Type{T}) where {T}
     return :($(Base.aligned_sizeof(T)))
 end
 
diff --git a/src/mapreduce.jl b/src/mapreduce.jl
index 7f0b242e4..a1bf3dea3 100644
--- a/src/mapreduce.jl
+++ b/src/mapreduce.jl
@@ -198,7 +198,7 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
     # If `Rother` is large enough, then a naive loop is more efficient than partial reductions.
     if length(Rother) >= serial_mapreduce_threshold(dev)
         args = (f, op, init, Rreduce, Rother, R, A)
-        kernel = @cuda launch=false serial_mapreduce_kernel(args...)
+        kernel = @cuda launch = false serial_mapreduce_kernel(args...)
         kernel_config = launch_configuration(kernel.fun)
         threads = kernel_config.threads
         blocks = cld(length(Rother), threads)
@@ -257,24 +257,26 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
         # TODO: provide a version that atomically reduces from different blocks
 
         # temporary empty array whose type will match the final partial array
-	    partial = similar(R, ntuple(_ -> 0, Val(ndims(R)+1)))
+        partial = similar(R, ntuple(_ -> 0, Val(ndims(R) + 1)))
 
         # NOTE: we can't use the previously-compiled kernel, or its launch configuration,
         #       since the type of `partial` might not match the original output container
         #       (e.g. if that was a view).
-        partial_kernel = @cuda launch=false partial_mapreduce_grid(f, op, init, Rreduce, Rother, Val(shuffle), partial, A)
-        partial_kernel_config = launch_configuration(partial_kernel.fun; shmem=compute_shmem∘compute_threads)
+        partial_kernel = @cuda launch = false partial_mapreduce_grid(f, op, init, Rreduce, Rother, Val(shuffle), partial, A)
+        partial_kernel_config = launch_configuration(partial_kernel.fun; shmem = compute_shmem ∘ compute_threads)
         partial_reduce_threads = compute_threads(partial_kernel_config.threads)
         partial_reduce_shmem = compute_shmem(partial_reduce_threads)
         partial_reduce_blocks = if other_blocks >= partial_kernel_config.blocks
             1
         else
-            min(cld(length(Rreduce), partial_reduce_threads),
-                cld(partial_kernel_config.blocks, other_blocks))
+            min(
+                cld(length(Rreduce), partial_reduce_threads),
+                cld(partial_kernel_config.blocks, other_blocks)
+            )
         end
         partial_threads = partial_reduce_threads
         partial_shmem = partial_reduce_shmem
-        partial_blocks = partial_reduce_blocks*other_blocks
+        partial_blocks = partial_reduce_blocks * other_blocks
 
         partial = similar(R, (size(R)..., partial_blocks))
         if init === nothing
@@ -282,8 +284,10 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
             partial .= R
         end
 
-        partial_kernel(f, op, init, Rreduce, Rother, Val(shuffle), partial, A;
-                       threads=partial_threads, blocks=partial_blocks, shmem=partial_shmem)
+        partial_kernel(
+            f, op, init, Rreduce, Rother, Val(shuffle), partial, A;
+            threads = partial_threads, blocks = partial_blocks, shmem = partial_shmem
+        )
 
         GPUArrays.mapreducedim!(identity, op, R, partial; init)
     end
diff --git a/test/base/array.jl b/test/base/array.jl
index ee1b42cd9..b7b3ce82e 100644
--- a/test/base/array.jl
+++ b/test/base/array.jl
@@ -718,7 +718,7 @@ end
 @testset "large map reduce" begin
   dev = device()
 
-  big_size = CUDA.serial_mapreduce_threshold(dev) + 5
+    big_size = CUDA.serial_mapreduce_threshold(dev) + 5
   a = rand(Float32, big_size, 31)
   c = CuArray(a)
 

@maleadt (Member, Author) commented Sep 3, 2025

CI issue looks legit. Fixed in JuliaGPU/GPUArrays.jl#617.

@kshyatt (Member) commented Sep 5, 2025

Looks like some profiler related tests are failing because of NVIDIA/NVTX#125

CUDA 13.0 Update 1 has in theory been released (https://docs.nvidia.com/cuda/), but it looks like the runfiles aren't yet available; as soon as those drop we can build them on Yggdrasil and fix these tests.

@kshyatt (Member) commented Sep 9, 2025

Pushed a commit testing the various memory allocators on CUDA 12.9 for now, hoping to bypass the CUPTI issue on CUDA 13 so we can merge this.

@maleadt (Member, Author) commented Sep 9, 2025

Looks like some profiler related tests are failing because of NVIDIA/NVTX#125

We should disable the NVTX-related tests for CTK 13.0 (by checking if the CUPTI version matches the one from 13.0.0)
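A minimal sketch of that gating, assuming the CUPTI.version() wrapper shown in the diff above; the exact version value and the test file name are assumptions, and the real constant would need checking against what CTK 13.0.0 actually ships:

```julia
# Hypothetical sketch: gate the NVTX-related profiler tests on the detected
# CUPTI version, skipping when it matches the one from CTK 13.0.0 (affected
# by NVIDIA/NVTX#125). Version value and file name are assumptions.
if CUPTI.version() == v"13.0"
    @warn "Skipping NVTX tests: CUPTI from CTK 13.0.0 hits NVIDIA/NVTX#125"
else
    include("profiler_nvtx.jl")  # hypothetical test file
end
```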

codecov bot commented Sep 9, 2025

Codecov Report

❌ Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 75.27%. Comparing base (d3ce45c) to head (f1cc693).

Files with missing lines Patch % Lines
lib/cupti/wrappers.jl 80.00% 1 Missing ⚠️
Additional details and impacted files
@@               Coverage Diff                @@
##           release-5.8    #2872       +/-   ##
================================================
- Coverage        89.82%   75.27%   -14.56%     
================================================
  Files              150      147        -3     
  Lines            13229    13114      -115     
================================================
- Hits             11883     9871     -2012     
- Misses            1346     3243     +1897     


@kshyatt (Member) commented Sep 9, 2025

Should we revert my change to run the specials on 12.9, or is this good to merge?

@maleadt (Member, Author) commented Sep 9, 2025

The mapreduce fix here ruins performance, so this should probably wait for #2880.

@kshyatt (Member) commented Sep 11, 2025

Cherry-picked that on top of this branch.

@maleadt (Member, Author) commented Sep 11, 2025

Subpackages fail because they install an older version of CUDA.jl. Maybe we should unconditionally depend on the latest CUDA.jl, i.e., default to the logic in

catch
    # if we fail to instantiate, assume that we need newer dependencies
    deps = [PackageSpec(path=".")]
    if "{{matrix.package}}" == "cuTensorNet"
        push!(deps, PackageSpec(path="lib/cutensor"))
    end
    Pkg.develop(deps)
end
? That does expose us to inadvertently relying on unreleased functionality, which isn't great either...

Thoughts, @kshyatt?
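The "unconditionally depend on the latest CUDA.jl" option would amount to always taking the catch branch above, i.e. always developing the in-repo packages rather than only falling back on instantiation failure; a sketch, assuming the same CI template placeholder as the snippet above:

```julia
using Pkg
# Always develop the checked-out CUDA.jl (i.e. #master) instead of only
# falling back to it when Pkg.instantiate() fails.
deps = [PackageSpec(path = ".")]
if "{{matrix.package}}" == "cuTensorNet"   # CI template placeholder, assumed
    push!(deps, PackageSpec(path = "lib/cutensor"))
end
Pkg.develop(deps)
```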

@kshyatt (Member) commented Sep 11, 2025

In general I'd say defaulting to the latest makes sense, since in most cases we won't be working on backport branches. Given NV's proclivity for breaking stuff in minor versions, I'm not sure what else we can do, honestly...

@maleadt (Member, Author) commented Sep 11, 2025

defaulting to the latest makes sense since in most cases we won't be working on backports branches

The problem is that this is #master (Pkg.develop) vs. the latest released version, and we don't expect users to install the former.

@kshyatt (Member) commented Sep 11, 2025

Ah, I see. Well, usually master isn't in the current odd situation of dropping support for a whole CUDA version that the last tagged release supported. The other option (which I don't really endorse) is to test the subpackages on multiple versions...
