Skip to content

Conversation

@justinfargnoli
Copy link
Contributor

@justinfargnoli justinfargnoli commented Aug 15, 2024

When lowering return values from LLVM IR to SelectionDAG, we check that the number of values SelectionDAG tells us to return is equal to the number of values that ComputePTXValueVTs() tells us to return. However, this check can fail on valid IR. For example:

define <6 x half> @foo() {
  ret <6 x half> zeroinitializer
}

ComputePTXValueVTs() tells us to return 3 v2f16 values, while SelectionDAG tells us to return 6 f16 values. Thus, the compiler will crash.

ComputePTXValueVTs() supports all half element vectors with an even number of elements. Whereas SelectionDAG only supports power-of-2 sized vectors. This is the root of the discrepancy.

Assuming that the developers who added the code to ComputePTXValueVTs() overlooked this, I've restricted ComputePTXValueVTs() to compute the same number of return values as SelectionDAG, instead of extending SelectionDAG to support non-power-of-2 sized vectors.

@llvmbot
Copy link
Member

llvmbot commented Aug 15, 2024

@llvm/pr-subscribers-backend-nvptx

Author: Justin Fargnoli (justinfargnoli)

Changes

When lowering return values from LLVM IR to SelectionDAG, we check that the number of values SelectionDAG tells us to return is equal to the number of values that ComputePTXValueVTs() tells us to return. However, this check can fail on valid IR. For example:

define &lt;6 x half&gt; @<!-- -->foo() {
  ret &lt;6 x half&gt; zeroinitializer
}

ComputePTXValueVTs() tells us to return 3 v2f16 values, while SelectionDAG tells us to return 6 f16 values.

ComputePTXValueVTs() supports all half element vectors with an even number of elements. Whereas SelectionDAG only supports power-of-2 sized vectors.

Assuming that the developers who added the code to ComputePTXValueVTs() overlooked this, I've restricted ComputePTXValueVTs() to compute the same number of return values as SelectionDAG instead of extending SelectionDAG to support non-power-of-2 vector sizes.


Full diff: https://github.com/llvm/llvm-project/pull/104524.diff

2 Files Affected:

  • (modified) llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp (+7-5)
  • (added) llvm/test/CodeGen/NVPTX/compute-ptx-value-vts.ll (+45)
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index 43a3fbf4d1306a..878c792a9b06ca 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -207,10 +207,11 @@ static void ComputePTXValueVTs(const TargetLowering &TLI, const DataLayout &DL,
     if (VT.isVector()) {
       unsigned NumElts = VT.getVectorNumElements();
       EVT EltVT = VT.getVectorElementType();
-      // Vectors with an even number of f16 elements will be passed to
-      // us as an array of v2f16/v2bf16 elements. We must match this so we
-      // stay in sync with Ins/Outs.
-      if ((Is16bitsType(EltVT.getSimpleVT())) && NumElts % 2 == 0) {
+      if ((Is16bitsType(EltVT.getSimpleVT())) && NumElts % 2 == 0 &&
+          isPowerOf2_32(NumElts)) {
+        // Vectors with an even number of f16 elements will be passed to
+        // us as an array of v2f16/v2bf16 elements. We must match this so we
+        // stay in sync with Ins/Outs.
         switch (EltVT.getSimpleVT().SimpleTy) {
         case MVT::f16:
           EltVT = MVT::v2f16;
@@ -226,7 +227,8 @@ static void ComputePTXValueVTs(const TargetLowering &TLI, const DataLayout &DL,
         }
         NumElts /= 2;
       } else if (EltVT.getSimpleVT() == MVT::i8 &&
-                 (NumElts % 4 == 0 || NumElts == 3)) {
+                 ((NumElts % 4 == 0 && isPowerOf2_32(NumElts)) ||
+                  NumElts == 3)) {
         // v*i8 are formally lowered as v4i8
         EltVT = MVT::v4i8;
         NumElts = (NumElts + 3) / 4;
diff --git a/llvm/test/CodeGen/NVPTX/compute-ptx-value-vts.ll b/llvm/test/CodeGen/NVPTX/compute-ptx-value-vts.ll
new file mode 100644
index 00000000000000..4c56b5fb5a34c3
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/compute-ptx-value-vts.ll
@@ -0,0 +1,45 @@
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_20
+
+define <6 x half> @half6() {
+  ret <6 x half> zeroinitializer
+}
+
+define <10 x half> @half10() {
+  ret <10 x half> zeroinitializer
+}
+
+define <14 x half> @half14() {
+  ret <14 x half> zeroinitializer
+}
+
+define <18 x half> @half18() {
+  ret <18 x half> zeroinitializer
+}
+
+define <998 x half> @half998() {
+  ret <998 x half> zeroinitializer
+}
+
+define <12 x i8> @byte12() {
+  ret <12 x i8> zeroinitializer
+}
+
+define <20 x i8> @byte20() {
+  ret <20 x i8> zeroinitializer
+}
+
+define <24 x i8> @byte24() {
+  ret <24 x i8> zeroinitializer
+}
+
+define <28 x i8> @byte28() {
+  ret <28 x i8> zeroinitializer
+}
+
+define <36 x i8> @byte36() {
+  ret <36 x i8> zeroinitializer
+}
+
+define <996 x i8> @byte996() {
+  ret <996 x i8> zeroinitializer
+}

@justinfargnoli justinfargnoli requested a review from Artem-B August 16, 2024 19:42
@justinfargnoli justinfargnoli requested a review from Artem-B August 17, 2024 01:10
; CHECK-NEXT: st.param.v4.b8 [func_retval0+84], {%rs1, %rs1, %rs1, %rs1};
; CHECK-NEXT: st.param.v4.b8 [func_retval0+88], {%rs1, %rs1, %rs1, %rs1};
; CHECK-NEXT: st.param.v4.b8 [func_retval0+92], {%rs1, %rs1, %rs1, %rs1};
; CHECK-NEXT: st.param.v4.b8 [func_retval0+96], {%rs1, %
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why:

  • Is a .b16 register being used here for the zero initialization, instead o f a .b8 ?
  • Is an <3 x i1> taking 3 bytes of storage instead of 1?

I'd expect a warp-wide <32 x i1> mask to be taking 4 bytes, but according to this, it takes 32 bytes.

I'm not suggesting this PR should fix those, but maybe we need an issue tracking these things.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NVPTX data layout says that i1 is stored as 8 bits.

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"

Comment on lines 332 to 334
define <2 x i8> @byte2() {
ret <2 x i8> zeroinitializer
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test exposes another crash caused by lowering a v2i8 return type.

Unless I'm able to fix the bug before I submit this PR, I'll fix the crash and reenable the test in a separate PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the status of this crash?

Copy link
Contributor Author

@justinfargnoli justinfargnoli Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've briefly looked into it.

I've posted what I found on #104864.

However, I must postpone fixing this until I resolve two other bugs.

Comment on lines +246 to +249
; CHECK-NEXT: st.param.v4.b8 [func_retval0+0], {%rs1, %rs1, %rs1, %rs1};
; CHECK-NEXT: st.param.v4.b8 [func_retval0+4], {%rs1, %rs1, %rs1, %rs1};
; CHECK-NEXT: st.param.v4.b8 [func_retval0+8], {%rs1, %rs1, %rs1, %rs1};
; CHECK-NEXT: st.param.v2.b8 [func_retval0+12], {%rs1, %rs1};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another optimization opportunity would be to lower first 8 bytes in one instruction.

@justinfargnoli
Copy link
Contributor Author

Ping @Artem-B

NumElts /= 2;
} else if (EltVT.getSimpleVT() == MVT::i8 &&
(NumElts % 4 == 0 || NumElts == 3)) {
((NumElts % 4 == 0 && isPowerOf2_32(NumElts)) ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition is rather puzzling.
AFAICT, previously, we'd accept i8 vectors with multiples of 4 elements, and a special case of v3i8, and lowered them all as N * v4i8.

Now we'll only accept multiples of 4, that are also a power of 2 and it's not clear why. Is there any reason we should be able to handle v16i8 here but not v12i8?.

This, at the very least, needs a comment explaining what's going on, or, possibly, a change to the condition to better reflect what we're checking for here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT, previously, we'd accept i8 vectors with multiples of 4 elements, and a special case of v3i8, and lowered them all as N * v4i8.

Without this PR, vector types like v12i8 will result in a compiler crash.

This PR should only disable this code on inputs, which, when invoked via LowerReturn(), would've resulted in a crash.

This, at the very least, needs a comment explaining what's going on

I've added a comment in 0c7cdb6.

@justinfargnoli justinfargnoli requested a review from Artem-B August 29, 2024 20:57
@justinfargnoli justinfargnoli merged commit cdaebf6 into llvm:main Aug 30, 2024
@llvm-ci
Copy link
Collaborator

llvm-ci commented Aug 30, 2024

LLVM Buildbot has detected a new failure on builder clang-hip-vega20 running on hip-vega20-0 while building llvm at step 3 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/123/builds/4576

Here is the relevant piece of the build log for the reference
Step 3 (annotate) failure: '../llvm-zorg/zorg/buildbot/builders/annotated/hip-build.sh --jobs=' (failure)
...
[38/40] : && /buildbot/hip-vega20-0/clang-hip-vega20/llvm/bin/clang++ -O3 -DNDEBUG  External/HIP/CMakeFiles/memmove-hip-6.0.2.dir/memmove.hip.o -o External/HIP/memmove-hip-6.0.2  --rocm-path=/buildbot/Externals/hip/rocm-6.0.2 --hip-link -rtlib=compiler-rt -unwindlib=libgcc -frtlib-add-rpath && cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP && /usr/local/bin/cmake -E create_symlink /buildbot/llvm-test-suite/External/HIP/memmove.reference_output /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/memmove.reference_output-hip-6.0.2
[39/40] /buildbot/hip-vega20-0/clang-hip-vega20/llvm/bin/clang++ -DNDEBUG  -O3 -DNDEBUG   -w -Werror=date-time --rocm-path=/buildbot/Externals/hip/rocm-6.0.2 --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx1030 --offload-arch=gfx1100 -xhip -mfma -MD -MT External/HIP/CMakeFiles/TheNextWeek-hip-6.0.2.dir/workload/ray-tracing/TheNextWeek/main.cc.o -MF External/HIP/CMakeFiles/TheNextWeek-hip-6.0.2.dir/workload/ray-tracing/TheNextWeek/main.cc.o.d -o External/HIP/CMakeFiles/TheNextWeek-hip-6.0.2.dir/workload/ray-tracing/TheNextWeek/main.cc.o -c /buildbot/llvm-test-suite/External/HIP/workload/ray-tracing/TheNextWeek/main.cc
[40/40] : && /buildbot/hip-vega20-0/clang-hip-vega20/llvm/bin/clang++ -O3 -DNDEBUG  External/HIP/CMakeFiles/TheNextWeek-hip-6.0.2.dir/workload/ray-tracing/TheNextWeek/main.cc.o -o External/HIP/TheNextWeek-hip-6.0.2  --rocm-path=/buildbot/Externals/hip/rocm-6.0.2 --hip-link -rtlib=compiler-rt -unwindlib=libgcc -frtlib-add-rpath && cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP && /usr/local/bin/cmake -E create_symlink /buildbot/llvm-test-suite/External/HIP/TheNextWeek.reference_output /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/TheNextWeek.reference_output-hip-6.0.2
+ build_step 'Testing HIP test-suite'
+ echo '@@@BUILD_STEP Testing HIP test-suite@@@'
+ ninja -v check-hip-simple
@@@BUILD_STEP Testing HIP test-suite@@@
[0/1] cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP && /buildbot/hip-vega20-0/clang-hip-vega20/llvm/bin/llvm-lit -sv empty-hip-6.0.2.test with-fopenmp-hip-6.0.2.test saxpy-hip-6.0.2.test memmove-hip-6.0.2.test InOneWeekend-hip-6.0.2.test TheNextWeek-hip-6.0.2.test blender.test
-- Testing: 7 tests, 7 workers --
Testing:  0.. 10.. 20.. 30.. 40
FAIL: test-suite :: External/HIP/InOneWeekend-hip-6.0.2.test (4 of 7)
******************** TEST 'test-suite :: External/HIP/InOneWeekend-hip-6.0.2.test' FAILED ********************

/buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/tools/timeit-target --timeout 7200 --limit-core 0 --limit-cpu 7200 --limit-file-size 209715200 --limit-rss-size 838860800 --append-exitstatus --redirect-output /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/Output/InOneWeekend-hip-6.0.2.test.out --redirect-input /dev/null --summary /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/Output/InOneWeekend-hip-6.0.2.test.time /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/InOneWeekend-hip-6.0.2
cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP ; /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/tools/fpcmp-target /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/Output/InOneWeekend-hip-6.0.2.test.out InOneWeekend.reference_output-hip-6.0.2

+ cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP
+ /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/tools/fpcmp-target /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/Output/InOneWeekend-hip-6.0.2.test.out InOneWeekend.reference_output-hip-6.0.2
/buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/tools/fpcmp-target: Comparison failed, textual difference between 'M' and 'i'

********************
/usr/bin/strip: /bin/bash.stripped: Bad file descriptor
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90.. 
********************
Failed Tests (1):
  test-suite :: External/HIP/InOneWeekend-hip-6.0.2.test


Testing Time: 371.55s

Total Discovered Tests: 7
  Passed: 6 (85.71%)
  Failed: 1 (14.29%)
FAILED: External/HIP/CMakeFiles/check-hip-simple-hip-6.0.2 
cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP && /buildbot/hip-vega20-0/clang-hip-vega20/llvm/bin/llvm-lit -sv empty-hip-6.0.2.test with-fopenmp-hip-6.0.2.test saxpy-hip-6.0.2.test memmove-hip-6.0.2.test InOneWeekend-hip-6.0.2.test TheNextWeek-hip-6.0.2.test blender.test
ninja: build stopped: subcommand failed.
Step 12 (Testing HIP test-suite) failure: Testing HIP test-suite (failure)
@@@BUILD_STEP Testing HIP test-suite@@@
[0/1] cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP && /buildbot/hip-vega20-0/clang-hip-vega20/llvm/bin/llvm-lit -sv empty-hip-6.0.2.test with-fopenmp-hip-6.0.2.test saxpy-hip-6.0.2.test memmove-hip-6.0.2.test InOneWeekend-hip-6.0.2.test TheNextWeek-hip-6.0.2.test blender.test
-- Testing: 7 tests, 7 workers --
Testing:  0.. 10.. 20.. 30.. 40
FAIL: test-suite :: External/HIP/InOneWeekend-hip-6.0.2.test (4 of 7)
******************** TEST 'test-suite :: External/HIP/InOneWeekend-hip-6.0.2.test' FAILED ********************

/buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/tools/timeit-target --timeout 7200 --limit-core 0 --limit-cpu 7200 --limit-file-size 209715200 --limit-rss-size 838860800 --append-exitstatus --redirect-output /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/Output/InOneWeekend-hip-6.0.2.test.out --redirect-input /dev/null --summary /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/Output/InOneWeekend-hip-6.0.2.test.time /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/InOneWeekend-hip-6.0.2
cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP ; /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/tools/fpcmp-target /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/Output/InOneWeekend-hip-6.0.2.test.out InOneWeekend.reference_output-hip-6.0.2

+ cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP
+ /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/tools/fpcmp-target /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP/Output/InOneWeekend-hip-6.0.2.test.out InOneWeekend.reference_output-hip-6.0.2
/buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/tools/fpcmp-target: Comparison failed, textual difference between 'M' and 'i'

********************
/usr/bin/strip: /bin/bash.stripped: Bad file descriptor
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90.. 
********************
Failed Tests (1):
  test-suite :: External/HIP/InOneWeekend-hip-6.0.2.test


Testing Time: 371.55s

Total Discovered Tests: 7
  Passed: 6 (85.71%)
  Failed: 1 (14.29%)
FAILED: External/HIP/CMakeFiles/check-hip-simple-hip-6.0.2 
cd /buildbot/hip-vega20-0/clang-hip-vega20/test-suite-build/External/HIP && /buildbot/hip-vega20-0/clang-hip-vega20/llvm/bin/llvm-lit -sv empty-hip-6.0.2.test with-fopenmp-hip-6.0.2.test saxpy-hip-6.0.2.test memmove-hip-6.0.2.test InOneWeekend-hip-6.0.2.test TheNextWeek-hip-6.0.2.test blender.test
ninja: build stopped: subcommand failed.
program finished with exit code 1
elapsedTime=485.177021

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants