Backports for v5.8.4 #2872
base: release-5.8
Conversation
Requiring multiple launches probably didn't make it worth it anyway, and it introduced complexity with respect to the launch configuration: we had to recompile the kernel and re-compute the size of the partial reduction.
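For reference, a minimal sketch of the compile-then-configure launch pattern this comment refers to (the kernel, names, and sizes are illustrative, not from the PR):

```julia
using CUDA

# Illustrative kernel: elementwise vector addition.
function vadd_kernel(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

a, b = CUDA.rand(1024), CUDA.rand(1024)
c = similar(a)

# Compile without launching, query an occupancy-based configuration,
# then launch once with the computed threads/blocks.
kernel = @cuda launch=false vadd_kernel(c, a, b)
config = launch_configuration(kernel.fun)
threads = min(length(c), config.threads)
blocks = cld(length(c), threads)
kernel(c, a, b; threads, blocks)
```

The point of the single-launch design is that this configuration step happens once; requiring multiple launches would mean redoing it per launch.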
Your PR requires formatting changes to meet the project's style guidelines. The suggested changes:

diff --git a/lib/cupti/wrappers.jl b/lib/cupti/wrappers.jl
index 30eeee63a..030d564dd 100644
--- a/lib/cupti/wrappers.jl
+++ b/lib/cupti/wrappers.jl
@@ -27,12 +27,13 @@ const cupti_versions = [
v"12.7",
v"12.8",
v"12.9",
- v"12.9.1"]
+ v"12.9.1",
+]
function version()
version_ref = Ref{Cuint}()
cuptiGetVersion(version_ref)
- if CUDA.runtime_version() < v"13"
+ return if CUDA.runtime_version() < v"13"
cupti_versions[version_ref[]]
else
major, ver = divrem(version_ref[], 10000)
diff --git a/src/CUDA.jl b/src/CUDA.jl
index 8a82201a0..97402ecd5 100644
--- a/src/CUDA.jl
+++ b/src/CUDA.jl
@@ -54,7 +54,7 @@ using Printf
# - Base.aligned_sizeof is the size of an object in an array/inline alloced
# Both of them are equivalent for immutable objects, but differ for mutable singtons and Symbol
# We use `aligned_sizeof` since we care about the size of a type in an array
-@generated function aligned_sizeof(::Type{T}) where T
+@generated function aligned_sizeof(::Type{T}) where {T}
return :($(Base.aligned_sizeof(T)))
end
diff --git a/src/mapreduce.jl b/src/mapreduce.jl
index 7f0b242e4..a1bf3dea3 100644
--- a/src/mapreduce.jl
+++ b/src/mapreduce.jl
@@ -198,7 +198,7 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
# If `Rother` is large enough, then a naive loop is more efficient than partial reductions.
if length(Rother) >= serial_mapreduce_threshold(dev)
args = (f, op, init, Rreduce, Rother, R, A)
- kernel = @cuda launch=false serial_mapreduce_kernel(args...)
+ kernel = @cuda launch = false serial_mapreduce_kernel(args...)
kernel_config = launch_configuration(kernel.fun)
threads = kernel_config.threads
blocks = cld(length(Rother), threads)
@@ -257,24 +257,26 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
# TODO: provide a version that atomically reduces from different blocks
# temporary empty array whose type will match the final partial array
- partial = similar(R, ntuple(_ -> 0, Val(ndims(R)+1)))
+ partial = similar(R, ntuple(_ -> 0, Val(ndims(R) + 1)))
# NOTE: we can't use the previously-compiled kernel, or its launch configuration,
# since the type of `partial` might not match the original output container
# (e.g. if that was a view).
- partial_kernel = @cuda launch=false partial_mapreduce_grid(f, op, init, Rreduce, Rother, Val(shuffle), partial, A)
- partial_kernel_config = launch_configuration(partial_kernel.fun; shmem=compute_shmem∘compute_threads)
+ partial_kernel = @cuda launch = false partial_mapreduce_grid(f, op, init, Rreduce, Rother, Val(shuffle), partial, A)
+ partial_kernel_config = launch_configuration(partial_kernel.fun; shmem = compute_shmem ∘ compute_threads)
partial_reduce_threads = compute_threads(partial_kernel_config.threads)
partial_reduce_shmem = compute_shmem(partial_reduce_threads)
partial_reduce_blocks = if other_blocks >= partial_kernel_config.blocks
1
else
- min(cld(length(Rreduce), partial_reduce_threads),
- cld(partial_kernel_config.blocks, other_blocks))
+ min(
+ cld(length(Rreduce), partial_reduce_threads),
+ cld(partial_kernel_config.blocks, other_blocks)
+ )
end
partial_threads = partial_reduce_threads
partial_shmem = partial_reduce_shmem
- partial_blocks = partial_reduce_blocks*other_blocks
+ partial_blocks = partial_reduce_blocks * other_blocks
partial = similar(R, (size(R)..., partial_blocks))
if init === nothing
@@ -282,8 +284,10 @@ function GPUArrays.mapreducedim!(f::F, op::OP, R::AnyCuArray{T},
partial .= R
end
- partial_kernel(f, op, init, Rreduce, Rother, Val(shuffle), partial, A;
- threads=partial_threads, blocks=partial_blocks, shmem=partial_shmem)
+ partial_kernel(
+ f, op, init, Rreduce, Rother, Val(shuffle), partial, A;
+ threads = partial_threads, blocks = partial_blocks, shmem = partial_shmem
+ )
GPUArrays.mapreducedim!(identity, op, R, partial; init)
end
diff --git a/test/base/array.jl b/test/base/array.jl
index ee1b42cd9..b7b3ce82e 100644
--- a/test/base/array.jl
+++ b/test/base/array.jl
@@ -718,7 +718,7 @@ end
@testset "large map reduce" begin
dev = device()
- big_size = CUDA.serial_mapreduce_threshold(dev) + 5
+ big_size = CUDA.serial_mapreduce_threshold(dev) + 5
a = rand(Float32, big_size, 31)
c = CuArray(a)
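For context on the CUPTI hunk above: before CUDA 13, cuptiGetVersion returns an API index that is looked up in the cupti_versions table, while from CUDA 13 on the raw value encodes the version numerically (major = value ÷ 10000, as in the diff). A hedged sketch of that decoding — the minor/patch split is an assumption, since the hunk is truncated before it:

```julia
# Hypothetical sketch of decoding a numerically-encoded CUPTI version,
# e.g. 130001. The major digits match the diff above; the minor/patch
# split below is an assumption.
function decode_cupti_version(raw::Integer)
    major, rest = divrem(raw, 10000)
    minor, patch = divrem(rest, 100)   # assumed layout
    return VersionNumber(major, minor, patch)
end

decode_cupti_version(130001)  # ⇒ v"13.0.1" under the assumed layout
```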
CI issue looks legit. Fixed in JuliaGPU/GPUArrays.jl#617.
Looks like some profiler-related tests are failing because of NVIDIA/NVTX#125. CUDA 13.0 Update 1 has in theory been released (https://docs.nvidia.com/cuda/), but it looks like the runfiles aren't available yet; as soon as those drop we can build them on Yggdrasil and fix these tests.
Pushed a commit testing the various memory allocators on CUDA 12.9 for now, hoping to bypass the CUPTI issue on CUDA 13 so we can merge this.
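For reproducing that locally, a rough sketch — assuming CUDA.jl's documented JULIA_CUDA_MEMORY_POOL variable, which must be set before CUDA.jl initializes — of running the tests under each allocator:

```julia
# Rough sketch: run the test suite once per memory pool. "cuda" is the
# stream-ordered pool and "none" disables pooling; since the variable is
# read at initialization, each run gets a fresh process.
for pool in ("cuda", "none")
    cmd = `julia --project -e 'using Pkg; Pkg.test("CUDA")'`
    run(addenv(cmd, "JULIA_CUDA_MEMORY_POOL" => pool))
end
```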
We should disable the NVTX-related tests for CTK 13.0 (by checking whether the CUPTI version matches the one from 13.0.0).
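A minimal sketch of that gate, using the CUPTI.version() touched in the diff above; the exact version constant and test-file name are assumptions:

```julia
using CUDA: CUPTI

# Hypothetical sketch: skip the NVTX-related tests when CUPTI matches the
# version shipped with CTK 13.0.0. The constant below is an assumption.
const CTK_13_0_0_CUPTI = v"13.0.0"
if CUPTI.version() == CTK_13_0_0_CUPTI
    @warn "Skipping NVTX-related tests (NVIDIA/NVTX#125)"
else
    include("profiler.jl")  # hypothetical test file
end
```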
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

@@             Coverage Diff                @@
##           release-5.8    #2872     +/-   ##
================================================
- Coverage        89.82%   75.27%   -14.56%
================================================
  Files              150      147        -3
  Lines            13229    13114      -115
================================================
- Hits             11883     9871     -2012
- Misses            1346     3243     +1897
================================================

☔ View full report in Codecov by Sentry.
Should we revert my change to run the specials on 12.9, or is this good to merge?
Co-authored-by: Tim Besard <[email protected]>
Cherry-picked that on top of this branch.
Subpackages fail because they install an older version of CUDA.jl. Maybe we should unconditionally depend on the latest CUDA.jl, i.e., default to the logic in CUDA.jl/.buildkite/pipeline.yml (lines 149 to 156 at aa2e912).
Thoughts, @kshyatt?
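One way a subpackage CI step could use the branch's CUDA.jl rather than the registered release — a sketch, with the paths assumed:

```julia
# Sketch (paths are assumptions): instead of letting the subpackage resolve
# CUDA.jl from the registry (which picks up the older release), develop the
# checkout from the current branch before testing.
using Pkg
Pkg.activate("lib/cudnn")      # assumed subpackage directory
Pkg.develop(path=".")          # the CUDA.jl checkout under test
Pkg.test()
```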
In general I'd say defaulting to the latest makes sense, since in most cases we won't be working on backport branches. Given NV's proclivity for breaking stuff in minor versions, I'm not sure what else we can do, honestly...
The problem is that this is a backports branch.
Ah I see, well usually...