Skip to content

Conversation

@frasercrmck
Copy link
Contributor

@frasercrmck frasercrmck commented Oct 9, 2024

When passed -nocudalib/-nogpulib, Cuda's argument handling would bail out before handling -fcuda-short-ptr, meaning the frontend and backend data layouts would mismatch.

When padded -nocudalib/-nogpulib, Cuda's argument handling would bail
out before handling -fcuda-short-ptr, meaning the frontend and backend
data layouts would mismatch.
@llvmbot llvmbot added clang Clang issues not falling into any other category clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' labels Oct 9, 2024
@frasercrmck frasercrmck requested a review from jhuber6 October 9, 2024 14:04
@llvmbot
Copy link
Member

llvmbot commented Oct 9, 2024

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-clang-driver

Author: Fraser Cormack (frasercrmck)

Changes

When padded -nocudalib/-nogpulib, Cuda's argument handling would bail out before handling -fcuda-short-ptr, meaning the frontend and backend data layouts would mismatch.


Full diff: https://github.com/llvm/llvm-project/pull/111682.diff

2 Files Affected:

  • (modified) clang/lib/Driver/ToolChains/Cuda.cpp (+4-4)
  • (added) clang/test/Driver/cuda-short-ptr.cu (+6)
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp
index 509cd87b28c37e..7a70cf1c5694fd 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -848,6 +848,10 @@ void CudaToolChain::addClangTargetOptions(
   if (CudaInstallation.version() >= CudaVersion::CUDA_90)
     CC1Args.push_back("-fcuda-allow-variadic-functions");
 
+  if (DriverArgs.hasFlag(options::OPT_fcuda_short_ptr,
+                         options::OPT_fno_cuda_short_ptr, false))
+    CC1Args.append({"-mllvm", "--nvptx-short-ptr"});
+
   if (DriverArgs.hasArg(options::OPT_nogpulib))
     return;
 
@@ -873,10 +877,6 @@ void CudaToolChain::addClangTargetOptions(
 
   clang::CudaVersion CudaInstallationVersion = CudaInstallation.version();
 
-  if (DriverArgs.hasFlag(options::OPT_fcuda_short_ptr,
-                         options::OPT_fno_cuda_short_ptr, false))
-    CC1Args.append({"-mllvm", "--nvptx-short-ptr"});
-
   if (CudaInstallationVersion >= CudaVersion::UNKNOWN)
     CC1Args.push_back(
         DriverArgs.MakeArgString(Twine("-target-sdk-version=") +
diff --git a/clang/test/Driver/cuda-short-ptr.cu b/clang/test/Driver/cuda-short-ptr.cu
new file mode 100644
index 00000000000000..e0ae4505e0b567
--- /dev/null
+++ b/clang/test/Driver/cuda-short-ptr.cu
@@ -0,0 +1,6 @@
+// Checks that cuda compilation does the right thing when passed -fcuda-short-ptr
+
+// RUN: %clang -### --target=x86_64-linux-gnu -c -march=haswell --cuda-gpu-arch=sm_20 -fcuda-short-ptr -nocudainc -nocudalib --cuda-path=%S/Inputs/CUDA/usr/local/cuda %s 2>&1 | FileCheck %s
+
+// CHECK: "-mllvm" "--nvptx-short-ptr"
+// CHECK-SAME: "-fcuda-short-ptr"

Copy link
Contributor

@jhuber6 jhuber6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable, which architectures require this? I know that NVIDIA deprecated the 32-bit nvptx target in CUDA 12 or something.

@frasercrmck
Copy link
Contributor Author

Seems reasonable, which architectures require this? I know that NVIDIA deprecated the 32-bit nvptx target in CUDA 12 or something.

I'm not an expert on CUDA but, AFAICT, even in 64-bit CUDA, certain pointers such as those pointing to shared memory are 32 bit, because the size of shared memory is somewhere in the kB range. This generates better code, fewer registers, etc. I'm not sure why the option isn't enabled by default, personally - it seems like nvcc is doing this by default.

I was just playing with the option downstream and noticed this issue.

@jhuber6
Copy link
Contributor

jhuber6 commented Oct 9, 2024

Seems reasonable, which architectures require this? I know that NVIDIA deprecated the 32-bit nvptx target in CUDA 12 or something.

I'm not an expert on CUDA but, AFAICT, even in 64-bit CUDA, certain pointers such as those pointing to shared memory are 32 bit, because the size of shared memory is somewhere in the kB range. This generates better code, fewer registers, etc. I'm not sure why the option isn't enabled by default, personally - it seems like nvcc is doing this by default.

I was just playing with the option downstream and noticed this issue.

I figured it was something like that, since it saves a register per address. I don't know the history for why this isn't the default, it's pretty much just a data layout modifier to state that certain address spaces are 32-bit. Maybe @Artem-B or @jlebar can comment.

@frasercrmck
Copy link
Contributor Author

Seems reasonable, which architectures require this? I know that NVIDIA deprecated the 32-bit nvptx target in CUDA 12 or something.

I'm not an expert on CUDA but, AFAICT, even in 64-bit CUDA, certain pointers such as those pointing to shared memory are 32 bit, because the size of shared memory is somewhere in the kB range. This generates better code, fewer registers, etc. I'm not sure why the option isn't enabled by default, personally - it seems like nvcc is doing this by default.
I was just playing with the option downstream and noticed this issue.

I figured it was something like that, since it saves a register per address. I don't know the history for why this isn't the default, it's pretty much just a data layout modifier to state that certain address spaces are 32-bit. Maybe @Artem-B or @jlebar can comment.

Just threw together a nonsensical example for godbolt: https://godbolt.org/z/bhdEhrxd7. Notice the mov.u32 %r7, As, etc.

@ldrumm
Copy link
Contributor

ldrumm commented Oct 9, 2024

I'm not sure why we would ever want the current default if this is an option.
I'm trying to see it, but I can't work out a case where a 64bit pointer would make sense, since the even tens-of-thousands of money supercomputer cards have less than 256KiB of addressable shared memory.

It might be a bit of an intrusive change (albeit a relatively mechanical one), but until we see a GPU come to market that has >4GiB addressable shared memory, I think we should use the "short pointer" datalayout as default

@jhuber6
Copy link
Contributor

jhuber6 commented Oct 9, 2024

I'm not sure why we would ever want the current default if this is an option. I'm trying to see it, but I can't work out a case where a 64bit pointer would make sense, since the even tens-of-thousands of money supercomputer cards have less than 256KiB of addressable shared memory.

It might be a bit of an intrusive change (albeit a relatively mechanical one), but until we see a GPU come to market that has >4GiB addressable shared memory, I think we should use the "short pointer" datalayout as default

It also applies to constant and private / local address spaces. I don't think those hit 4 GiB yet but it's more feasible than shared. Making address space 3 32-bit by default would make sense to me.

@frasercrmck
Copy link
Contributor Author

I'm not sure why we would ever want the current default if this is an option. I'm trying to see it, but I can't work out a case where a 64bit pointer would make sense, since the even tens-of-thousands of money supercomputer cards have less than 256KiB of addressable shared memory.
It might be a bit of an intrusive change (albeit a relatively mechanical one), but until we see a GPU come to market that has >4GiB addressable shared memory, I think we should use the "short pointer" datalayout as default

It also applies to constant and private / local address spaces. I don't think those hit 4 GiB yet but it's more feasible than shared. Making address space 3 32-bit by default would make sense to me.

I was also considering separating out address space 3 into its own option, yep. At that point I suppose you might as well split out the remaining two, and have -fcuda-short-ptr alias the three of them.

@frasercrmck frasercrmck merged commit 72a957b into llvm:main Oct 9, 2024
12 checks passed
@frasercrmck frasercrmck deleted the nocudalib-fcuda-short-ptr branch October 9, 2024 15:17
@frasercrmck
Copy link
Contributor Author

I'm not sure why we would ever want the current default if this is an option. I'm trying to see it, but I can't work out a case where a 64bit pointer would make sense, since the even tens-of-thousands of money supercomputer cards have less than 256KiB of addressable shared memory.
It might be a bit of an intrusive change (albeit a relatively mechanical one), but until we see a GPU come to market that has >4GiB addressable shared memory, I think we should use the "short pointer" datalayout as default

It also applies to constant and private / local address spaces. I don't think those hit 4 GiB yet but it's more feasible than shared. Making address space 3 32-bit by default would make sense to me.

I was also considering separating out address space 3 into its own option, yep. At that point I suppose you might as well split out the remaining two, and have -fcuda-short-ptr alias the three of them.

@jhuber6 do you have any thoughts on how we'd best split up -fcuda-short-ptr? I was initially thinking of -f[no-]cuda-short-ptr={shared,local,const} and have the old -f[no-]cuda-short-ptr alias "all". However, it seems like this sort of (bitfield toggling on//off) option isn't supported by clang's automated marshalling, and the documented advice is to reconsider the option, rather than do custom marshalling.

It seems as though a comma-joined operation is best marshalled as a list of strings, but having the corresponding "no-" case is pretty rare, and would undoubtedly add extra code and complicate things when trying to tie it in with the old option format.

Maybe we don't need -fno-cuda-short-ptr=..., which I think would simplify things here. Of course, I might be thinking about this all wrong, and the frontend and cc1 options could be split up somehow.

@jhuber6
Copy link
Contributor

jhuber6 commented Oct 10, 2024

We don't need marshalling because this isn't a cc1 option. This is just handled by the driver which forwards it as -mllvm to the backend. You'd need to update the LLVM option to take multiple options and then make the clang driver option pick between them.

@Artem-B
Copy link
Member

Artem-B commented Oct 10, 2024

I'm not sure why the option isn't enabled by default, personally

While it does indeed help with generating better code, using this option while compiling CUDA code may be problematic.
Front-end is not aware of address spaces and all pointers are generic, so sizeof(any pointer) == 8. If we enable short pointers, then, after inferring AS some pointers become 32-bit and this discrepancy may lead to interesting bugs.

IMO short pointers are currently are only safe to use on IR level. You may get by using them from CUDA, but I do not think that enabling them by default is a good idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' clang Clang issues not falling into any other category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants