[Cuda] Handle -fcuda-short-ptr even with -nocudalib #111682

frasercrmck · 2024-10-09T14:03:54Z

When passed -nocudalib/-nogpulib, Cuda's argument handling would bail out before handling -fcuda-short-ptr, meaning the frontend and backend data layouts would mismatch.

When padded -nocudalib/-nogpulib, Cuda's argument handling would bail out before handling -fcuda-short-ptr, meaning the frontend and backend data layouts would mismatch.

llvmbot · 2024-10-09T14:04:29Z

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-clang-driver

Author: Fraser Cormack (frasercrmck)

Changes

When padded -nocudalib/-nogpulib, Cuda's argument handling would bail out before handling -fcuda-short-ptr, meaning the frontend and backend data layouts would mismatch.

Full diff: https://github.com/llvm/llvm-project/pull/111682.diff

2 Files Affected:

(modified) clang/lib/Driver/ToolChains/Cuda.cpp (+4-4)
(added) clang/test/Driver/cuda-short-ptr.cu (+6)

diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp
index 509cd87b28c37e..7a70cf1c5694fd 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -848,6 +848,10 @@ void CudaToolChain::addClangTargetOptions(
   if (CudaInstallation.version() >= CudaVersion::CUDA_90)
     CC1Args.push_back("-fcuda-allow-variadic-functions");
 
+  if (DriverArgs.hasFlag(options::OPT_fcuda_short_ptr,
+                         options::OPT_fno_cuda_short_ptr, false))
+    CC1Args.append({"-mllvm", "--nvptx-short-ptr"});
+
   if (DriverArgs.hasArg(options::OPT_nogpulib))
     return;
 
@@ -873,10 +877,6 @@ void CudaToolChain::addClangTargetOptions(
 
   clang::CudaVersion CudaInstallationVersion = CudaInstallation.version();
 
-  if (DriverArgs.hasFlag(options::OPT_fcuda_short_ptr,
-                         options::OPT_fno_cuda_short_ptr, false))
-    CC1Args.append({"-mllvm", "--nvptx-short-ptr"});
-
   if (CudaInstallationVersion >= CudaVersion::UNKNOWN)
     CC1Args.push_back(
         DriverArgs.MakeArgString(Twine("-target-sdk-version=") +
diff --git a/clang/test/Driver/cuda-short-ptr.cu b/clang/test/Driver/cuda-short-ptr.cu
new file mode 100644
index 00000000000000..e0ae4505e0b567
--- /dev/null
+++ b/clang/test/Driver/cuda-short-ptr.cu
@@ -0,0 +1,6 @@
+// Checks that cuda compilation does the right thing when passed -fcuda-short-ptr
+
+// RUN: %clang -### --target=x86_64-linux-gnu -c -march=haswell --cuda-gpu-arch=sm_20 -fcuda-short-ptr -nocudainc -nocudalib --cuda-path=%S/Inputs/CUDA/usr/local/cuda %s 2>&1 | FileCheck %s
+
+// CHECK: "-mllvm" "--nvptx-short-ptr"
+// CHECK-SAME: "-fcuda-short-ptr"

jhuber6

Seems reasonable, which architectures require this? I know that NVIDIA deprecated the 32-bit nvptx target in CUDA 12 or something.

frasercrmck · 2024-10-09T14:17:56Z

Seems reasonable, which architectures require this? I know that NVIDIA deprecated the 32-bit nvptx target in CUDA 12 or something.

I'm not an expert on CUDA but, AFAICT, even in 64-bit CUDA, certain pointers such as those pointing to shared memory are 32 bit, because the size of shared memory is somewhere in the kB range. This generates better code, fewer registers, etc. I'm not sure why the option isn't enabled by default, personally - it seems like nvcc is doing this by default.

I was just playing with the option downstream and noticed this issue.

jhuber6 · 2024-10-09T14:20:32Z

Seems reasonable, which architectures require this? I know that NVIDIA deprecated the 32-bit nvptx target in CUDA 12 or something.

I'm not an expert on CUDA but, AFAICT, even in 64-bit CUDA, certain pointers such as those pointing to shared memory are 32 bit, because the size of shared memory is somewhere in the kB range. This generates better code, fewer registers, etc. I'm not sure why the option isn't enabled by default, personally - it seems like nvcc is doing this by default.

I was just playing with the option downstream and noticed this issue.

I figured it was something like that, since it saves a register per address. I don't know the history for why this isn't the default, it's pretty much just a data layout modifier to state that certain address spaces are 32-bit. Maybe @Artem-B or @jlebar can comment.

frasercrmck · 2024-10-09T14:22:35Z

Seems reasonable, which architectures require this? I know that NVIDIA deprecated the 32-bit nvptx target in CUDA 12 or something.

I'm not an expert on CUDA but, AFAICT, even in 64-bit CUDA, certain pointers such as those pointing to shared memory are 32 bit, because the size of shared memory is somewhere in the kB range. This generates better code, fewer registers, etc. I'm not sure why the option isn't enabled by default, personally - it seems like nvcc is doing this by default.
I was just playing with the option downstream and noticed this issue.

I figured it was something like that, since it saves a register per address. I don't know the history for why this isn't the default, it's pretty much just a data layout modifier to state that certain address spaces are 32-bit. Maybe @Artem-B or @jlebar can comment.

Just threw together a nonsensical example for godbolt: https://godbolt.org/z/bhdEhrxd7. Notice the mov.u32 %r7, As, etc.

ldrumm · 2024-10-09T14:31:29Z

I'm not sure why we would ever want the current default if this is an option.
I'm trying to see it, but I can't work out a case where a 64bit pointer would make sense, since the even tens-of-thousands of money supercomputer cards have less than 256KiB of addressable shared memory.

It might be a bit of an intrusive change (albeit a relatively mechanical one), but until we see a GPU come to market that has >4GiB addressable shared memory, I think we should use the "short pointer" datalayout as default

jhuber6 · 2024-10-09T14:36:07Z

I'm not sure why we would ever want the current default if this is an option. I'm trying to see it, but I can't work out a case where a 64bit pointer would make sense, since the even tens-of-thousands of money supercomputer cards have less than 256KiB of addressable shared memory.

It might be a bit of an intrusive change (albeit a relatively mechanical one), but until we see a GPU come to market that has >4GiB addressable shared memory, I think we should use the "short pointer" datalayout as default

It also applies to constant and private / local address spaces. I don't think those hit 4 GiB yet but it's more feasible than shared. Making address space 3 32-bit by default would make sense to me.

frasercrmck · 2024-10-09T15:16:42Z

I'm not sure why we would ever want the current default if this is an option. I'm trying to see it, but I can't work out a case where a 64bit pointer would make sense, since the even tens-of-thousands of money supercomputer cards have less than 256KiB of addressable shared memory.
It might be a bit of an intrusive change (albeit a relatively mechanical one), but until we see a GPU come to market that has >4GiB addressable shared memory, I think we should use the "short pointer" datalayout as default

It also applies to constant and private / local address spaces. I don't think those hit 4 GiB yet but it's more feasible than shared. Making address space 3 32-bit by default would make sense to me.

I was also considering separating out address space 3 into its own option, yep. At that point I suppose you might as well split out the remaining two, and have -fcuda-short-ptr alias the three of them.

frasercrmck · 2024-10-10T09:47:09Z

I'm not sure why we would ever want the current default if this is an option. I'm trying to see it, but I can't work out a case where a 64bit pointer would make sense, since the even tens-of-thousands of money supercomputer cards have less than 256KiB of addressable shared memory.
It might be a bit of an intrusive change (albeit a relatively mechanical one), but until we see a GPU come to market that has >4GiB addressable shared memory, I think we should use the "short pointer" datalayout as default

It also applies to constant and private / local address spaces. I don't think those hit 4 GiB yet but it's more feasible than shared. Making address space 3 32-bit by default would make sense to me.

I was also considering separating out address space 3 into its own option, yep. At that point I suppose you might as well split out the remaining two, and have -fcuda-short-ptr alias the three of them.

@jhuber6 do you have any thoughts on how we'd best split up -fcuda-short-ptr? I was initially thinking of -f[no-]cuda-short-ptr={shared,local,const} and have the old -f[no-]cuda-short-ptr alias "all". However, it seems like this sort of (bitfield toggling on//off) option isn't supported by clang's automated marshalling, and the documented advice is to reconsider the option, rather than do custom marshalling.

It seems as though a comma-joined operation is best marshalled as a list of strings, but having the corresponding "no-" case is pretty rare, and would undoubtedly add extra code and complicate things when trying to tie it in with the old option format.

Maybe we don't need -fno-cuda-short-ptr=..., which I think would simplify things here. Of course, I might be thinking about this all wrong, and the frontend and cc1 options could be split up somehow.

jhuber6 · 2024-10-10T12:09:26Z

We don't need marshalling because this isn't a cc1 option. This is just handled by the driver which forwards it as -mllvm to the backend. You'd need to update the LLVM option to take multiple options and then make the clang driver option pick between them.

Artem-B · 2024-10-10T17:57:40Z

I'm not sure why the option isn't enabled by default, personally

While it does indeed help with generating better code, using this option while compiling CUDA code may be problematic.
Front-end is not aware of address spaces and all pointers are generic, so sizeof(any pointer) == 8. If we enable short pointers, then, after inferring AS some pointers become 32-bit and this discrepancy may lead to interesting bugs.

IMO short pointers are currently are only safe to use on IR level. You may get by using them from CUDA, but I do not think that enabling them by default is a good idea.

[Cuda] Handle -fcuda-short-ptr even with -nocudalib

67b4b2a

When padded -nocudalib/-nogpulib, Cuda's argument handling would bail out before handling -fcuda-short-ptr, meaning the frontend and backend data layouts would mismatch.

llvmbot added clang Clang issues not falling into any other category clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' labels Oct 9, 2024

frasercrmck requested a review from jhuber6 October 9, 2024 14:04

jhuber6 approved these changes Oct 9, 2024

View reviewed changes

frasercrmck merged commit 72a957b into llvm:main Oct 9, 2024
12 checks passed

frasercrmck deleted the nocudalib-fcuda-short-ptr branch October 9, 2024 15:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Cuda] Handle -fcuda-short-ptr even with -nocudalib #111682

[Cuda] Handle -fcuda-short-ptr even with -nocudalib #111682

Uh oh!

frasercrmck commented Oct 9, 2024 •

edited

Loading

Uh oh!

llvmbot commented Oct 9, 2024 •

edited

Loading

Uh oh!

jhuber6 left a comment •

edited

Loading

Uh oh!

frasercrmck commented Oct 9, 2024

Uh oh!

jhuber6 commented Oct 9, 2024

Uh oh!

frasercrmck commented Oct 9, 2024

Uh oh!

ldrumm commented Oct 9, 2024

Uh oh!

jhuber6 commented Oct 9, 2024

Uh oh!

frasercrmck commented Oct 9, 2024

Uh oh!

Uh oh!

frasercrmck commented Oct 10, 2024

Uh oh!

jhuber6 commented Oct 10, 2024

Uh oh!

Artem-B commented Oct 10, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

[Cuda] Handle -fcuda-short-ptr even with -nocudalib #111682

[Cuda] Handle -fcuda-short-ptr even with -nocudalib #111682

Uh oh!

Conversation

frasercrmck commented Oct 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Oct 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jhuber6 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

frasercrmck commented Oct 9, 2024

Uh oh!

jhuber6 commented Oct 9, 2024

Uh oh!

frasercrmck commented Oct 9, 2024

Uh oh!

ldrumm commented Oct 9, 2024

Uh oh!

jhuber6 commented Oct 9, 2024

Uh oh!

frasercrmck commented Oct 9, 2024

Uh oh!

Uh oh!

frasercrmck commented Oct 10, 2024

Uh oh!

jhuber6 commented Oct 10, 2024

Uh oh!

Artem-B commented Oct 10, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

frasercrmck commented Oct 9, 2024 •

edited

Loading

llvmbot commented Oct 9, 2024 •

edited

Loading

jhuber6 left a comment •

edited

Loading