@maleadt (Member) commented Sep 3, 2025

No description provided.

christiangnrd and others added 4 commits September 3, 2025 08:54
Requiring multiple launches probably didn't make it worth it anyway,
and it introduced complexities wrt. the launch configuration: having
to recompile and re-compute the size of the partial reduction.
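For context, the single-launch path this enables can be sketched as follows (assumed CUDA.jl occupancy API, mirroring src/mapreduce.jl in the diff further down; a sketch, not a drop-in implementation):

```julia
# Sketch: compile once with launch=false, query occupancy once, and derive
# the launch geometry from that single query -- no second compilation or
# re-computed partial-reduction size is needed.
kernel = @cuda launch=false serial_mapreduce_kernel(args...)
config = launch_configuration(kernel.fun)
threads = config.threads
blocks = cld(length(Rother), threads)
kernel(args...; threads, blocks)
```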
github-actions bot (Contributor) commented Sep 3, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic release-5.8) to apply these changes.

Suggested changes:
diff --git a/lib/cupti/wrappers.jl b/lib/cupti/wrappers.jl
index 30eeee63a..030d564dd 100644
--- a/lib/cupti/wrappers.jl
+++ b/lib/cupti/wrappers.jl
@@ -27,12 +27,13 @@ const cupti_versions = [
     v"12.7",
     v"12.8",
     v"12.9",
-    v"12.9.1"]
+    v"12.9.1",
+]
 
 function version()
     version_ref = Ref{Cuint}()
     cuptiGetVersion(version_ref)
-    if CUDA.runtime_version() < v"13"
+    return if CUDA.runtime_version() < v"13"
         cupti_versions[version_ref[]]
     else
         major, ver = divrem(version_ref[], 10000)
diff --git a/src/CUDA.jl b/src/CUDA.jl
index 8a82201a0..97402ecd5 100644
--- a/src/CUDA.jl
+++ b/src/CUDA.jl
@@ -54,7 +54,7 @@ using Printf
 # - Base.aligned_sizeof is the size of an object in an array/inline alloced
 # Both of them are equivalent for immutable objects, but differ for mutable singtons and Symbol
 # We use `aligned_sizeof` since we care about the size of a type in an array
-@generated function aligned_sizeof(::Type{T}) where T
+@generated function aligned_sizeof(::Type{T}) where {T}
     return :($(Base.aligned_sizeof(T)))
 end
 
diff --git a/src/mapreduce.jl b/src/mapreduce.jl
index 7f0b242e4..a1bf3dea3 100644
--- a/src/mapreduce.jl
+++ b/src/mapreduce.jl
@@ -198,7 +198,7 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
     # If `Rother` is large enough, then a naive loop is more efficient than partial reductions.
     if length(Rother) >= serial_mapreduce_threshold(dev)
         args = (f, op, init, Rreduce, Rother, R, A)
-        kernel = @cuda launch=false serial_mapreduce_kernel(args...)
+        kernel = @cuda launch = false serial_mapreduce_kernel(args...)
         kernel_config = launch_configuration(kernel.fun)
         threads = kernel_config.threads
         blocks = cld(length(Rother), threads)
@@ -257,24 +257,26 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
         # TODO: provide a version that atomically reduces from different blocks
 
         # temporary empty array whose type will match the final partial array
-	    partial = similar(R, ntuple(_ -> 0, Val(ndims(R)+1)))
+        partial = similar(R, ntuple(_ -> 0, Val(ndims(R) + 1)))
 
         # NOTE: we can't use the previously-compiled kernel, or its launch configuration,
         #       since the type of `partial` might not match the original output container
         #       (e.g. if that was a view).
-        partial_kernel = @cuda launch=false partial_mapreduce_grid(f, op, init, Rreduce, Rother, Val(shuffle), partial, A)
-        partial_kernel_config = launch_configuration(partial_kernel.fun; shmem=compute_shmem∘compute_threads)
+        partial_kernel = @cuda launch = false partial_mapreduce_grid(f, op, init, Rreduce, Rother, Val(shuffle), partial, A)
+        partial_kernel_config = launch_configuration(partial_kernel.fun; shmem = compute_shmem ∘ compute_threads)
         partial_reduce_threads = compute_threads(partial_kernel_config.threads)
         partial_reduce_shmem = compute_shmem(partial_reduce_threads)
         partial_reduce_blocks = if other_blocks >= partial_kernel_config.blocks
             1
         else
-            min(cld(length(Rreduce), partial_reduce_threads),
-                cld(partial_kernel_config.blocks, other_blocks))
+            min(
+                cld(length(Rreduce), partial_reduce_threads),
+                cld(partial_kernel_config.blocks, other_blocks)
+            )
         end
         partial_threads = partial_reduce_threads
         partial_shmem = partial_reduce_shmem
-        partial_blocks = partial_reduce_blocks*other_blocks
+        partial_blocks = partial_reduce_blocks * other_blocks
 
         partial = similar(R, (size(R)..., partial_blocks))
         if init === nothing
@@ -282,8 +284,10 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
             partial .= R
         end
 
-        partial_kernel(f, op, init, Rreduce, Rother, Val(shuffle), partial, A;
-                       threads=partial_threads, blocks=partial_blocks, shmem=partial_shmem)
+        partial_kernel(
+            f, op, init, Rreduce, Rother, Val(shuffle), partial, A;
+            threads = partial_threads, blocks = partial_blocks, shmem = partial_shmem
+        )
 
         GPUArrays.mapreducedim!(identity, op, R, partial; init)
     end
diff --git a/test/base/array.jl b/test/base/array.jl
index ee1b42cd9..b7b3ce82e 100644
--- a/test/base/array.jl
+++ b/test/base/array.jl
@@ -718,7 +718,7 @@ end
 @testset "large map reduce" begin
   dev = device()
 
-  big_size = CUDA.serial_mapreduce_threshold(dev) + 5
+    big_size = CUDA.serial_mapreduce_threshold(dev) + 5
   a = rand(Float32, big_size, 31)
   c = CuArray(a)
 

@maleadt (Member, Author) commented Sep 3, 2025

CI issue looks legit. Fixed in JuliaGPU/GPUArrays.jl#617.

@kshyatt (Member) commented Sep 5, 2025

Looks like some profiler related tests are failing because of NVIDIA/NVTX#125

CUDA 13.0 Update 1 has in theory been released (https://docs.nvidia.com/cuda/), but it looks like the runfiles aren't yet available; as soon as those drop we can build them on Yggdrasil and fix these tests.

@kshyatt (Member) commented Sep 9, 2025

Pushed a commit testing the various memory allocators on CUDA 12.9 for now, hoping to bypass the CUPTI issue on CUDA 13 so we can merge this.

@maleadt (Member, Author) commented Sep 9, 2025

Looks like some profiler related tests are failing because of NVIDIA/NVTX#125

We should disable the NVTX-related tests for CTK 13.0 (by checking if the CUPTI version matches the one from 13.0.0)
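A minimal sketch of that gating, assuming the CUPTI.version() wrapper shown in the diff above; the exact version value and the test file name are assumptions, and the real constant would need checking against what CTK 13.0.0 actually ships:

```julia
# Hypothetical sketch: gate the NVTX-related profiler tests on the detected
# CUPTI version, skipping when it matches the one from CTK 13.0.0 (affected
# by NVIDIA/NVTX#125). Version value and file name are assumptions.
if CUPTI.version() == v"13.0"
    @warn "Skipping NVTX tests: CUPTI from CTK 13.0.0 hits NVIDIA/NVTX#125"
else
    include("profiler_nvtx.jl")  # hypothetical test file
end
```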

codecov bot commented Sep 9, 2025

Codecov Report

❌ Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 75.27%. Comparing base (d3ce45c) to head (f1cc693).

Files with missing lines Patch % Lines
lib/cupti/wrappers.jl 80.00% 1 Missing ⚠️
Additional details and impacted files
@@               Coverage Diff                @@
##           release-5.8    #2872       +/-   ##
================================================
- Coverage        89.82%   75.27%   -14.56%     
================================================
  Files              150      147        -3     
  Lines            13229    13114      -115     
================================================
- Hits             11883     9871     -2012     
- Misses            1346     3243     +1897     


@kshyatt (Member) commented Sep 9, 2025

Should we revert my change to run the specials on 12.9, or is this good to merge?

@maleadt (Member, Author) commented Sep 9, 2025

The mapreduce fix here ruins performance, so this should probably wait for #2880.

@kshyatt (Member) commented Sep 11, 2025

Cherry-picked that on top of this branch.

@maleadt (Member, Author) commented Sep 11, 2025

Subpackages fail because they install an older version of CUDA.jl. Maybe we should unconditionally depend on the latest CUDA.jl, i.e., default to the logic in

catch
    # if we fail to instantiate, assume that we need newer dependencies
    deps = [PackageSpec(path=".")]
    if "{{matrix.package}}" == "cuTensorNet"
        push!(deps, PackageSpec(path="lib/cutensor"))
    end
    Pkg.develop(deps)
end
? That does expose us to inadvertently relying on unreleased functionality, which isn't great either...

Thoughts, @kshyatt?
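The "unconditionally depend on the latest CUDA.jl" option would amount to always taking the catch branch above, i.e. always developing the in-repo packages rather than only falling back on instantiation failure; a sketch, assuming the same CI template placeholder as the snippet above:

```julia
using Pkg
# Always develop the checked-out CUDA.jl (i.e. #master) instead of only
# falling back to it when Pkg.instantiate() fails.
deps = [PackageSpec(path = ".")]
if "{{matrix.package}}" == "cuTensorNet"   # CI template placeholder, assumed
    push!(deps, PackageSpec(path = "lib/cutensor"))
end
Pkg.develop(deps)
```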

@kshyatt (Member) commented Sep 11, 2025

In general I'd say defaulting to the latest makes sense, since in most cases we won't be working on backport branches. Given NV's proclivity for breaking stuff in minor versions, I'm not sure what else we can do, honestly...

@maleadt (Member, Author) commented Sep 11, 2025

defaulting to the latest makes sense since in most cases we won't be working on backports branches

The problem is that this is #master (Pkg.develop) vs. the latest released version, and we don't expect users to install the former.

@kshyatt (Member) commented Sep 11, 2025

Ah, I see. Well, usually master isn't in the current odd situation of dropping support for a whole CUDA version that the last tagged release supported. The other option (which I don't really endorse) is to test the subpackages on multiple versions...
