Conversation

@harrisonGPU harrisonGPU (Contributor) commented Jun 20, 2025

SILoadStoreOptimizer can now recognise consecutive 16-bit and 8-bit
TBUFFER_LOAD/TBUFFER_STORE instructions that each access

  • a single component (X), or
  • two components (XY),

and fold them into the wider native variants:

X + X          -->  XY
X + X + X + X  -->  XYZW
XY + XY        -->  XYZW
X + X + X      -->  XYZ
XY + X         -->  XYZ

The optimisation cuts the number of TBUFFER instructions, shrinking code
size and improving memory throughput.
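
As a quick illustration (a standalone sketch, not part of the patch; the helper name mergedWidth is hypothetical), the fold legality above reduces to simple width arithmetic: the component counts of two adjacent accesses are summed, and the result must not exceed the 4-component XYZW limit of buffer instructions.

#include <cstdio>

// Merged component count for two adjacent tbuffer accesses, or 0 when the
// pair cannot fold (component counts: X=1, XY=2, XYZ=3, XYZW=4).
static unsigned mergedWidth(unsigned Width0, unsigned Width1) {
  unsigned Combined = Width0 + Width1;
  return Combined <= 4 ? Combined : 0;
}

int main() {
  const char *Names[] = {"none", "X", "XY", "XYZ", "XYZW"};
  unsigned Pairs[][2] = {{1, 1}, {2, 2}, {2, 1}, {3, 2}};
  for (auto &P : Pairs)
    std::printf("%s + %s --> %s\n", Names[P[0]], Names[P[1]],
                Names[mergedWidth(P[0], P[1])]);
}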

@llvmbot (Member) commented Jun 20, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Harrison Hao (harrisonGPU)

Changes

SILoadStoreOptimizer can now recognise consecutive 16-bit
TBUFFER_LOAD/TBUFFER_STORE instructions that each access

  • a single component (X), or
  • two components (XY),

and fold them into the wider native variants:

X  +  X                 -->  XY
X  +  X  +  X  +  X     -->  XYZW
XY +  XY                -->  XYZW

The optimisation cuts the number of TBUFFER instructions, shrinking code
size and improving memory throughput.


Patch is 39.91 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/145078.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp (+32-6)
  • (modified) llvm/test/CodeGen/AMDGPU/merge-tbuffer.mir (+455)
diff --git a/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp b/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
index b0d6fd95cd271..83dbad9a1ba20 100644
--- a/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
@@ -1040,8 +1040,21 @@ bool SILoadStoreOptimizer::offsetsCanBeCombined(CombineInfo &CI,
   if (CI.Offset == Paired.Offset)
     return false;
 
+  // Use 2-byte element size if both tbuffer formats are 16-bit.
+  unsigned EltSize = CI.EltSize;
+  auto Has16BitComponents = [&](unsigned Format) -> bool {
+    const auto *Info = AMDGPU::getGcnBufferFormatInfo(Format, STI);
+    return Info && Info->BitsPerComp == 16;
+  };
+
+  if ((CI.InstClass == TBUFFER_LOAD || CI.InstClass == TBUFFER_STORE)) {
+    // TODO: Support merging 8-bit tbuffer load/store instructions
+    if (Has16BitComponents(CI.Format) && Has16BitComponents(Paired.Format))
+      EltSize = 2;
+  }
+
   // This won't be valid if the offset isn't aligned.
-  if ((CI.Offset % CI.EltSize != 0) || (Paired.Offset % CI.EltSize != 0))
+  if ((CI.Offset % EltSize != 0) || (Paired.Offset % EltSize != 0))
     return false;
 
   if (CI.InstClass == TBUFFER_LOAD || CI.InstClass == TBUFFER_STORE) {
@@ -1059,13 +1072,26 @@ bool SILoadStoreOptimizer::offsetsCanBeCombined(CombineInfo &CI,
         Info0->NumFormat != Info1->NumFormat)
       return false;
 
-    // TODO: Should be possible to support more formats, but if format loads
-    // are not dword-aligned, the merged load might not be valid.
-    if (Info0->BitsPerComp != 32)
+    // Buffer instructions support up to 4 components per access (e.g., x, xy,
+    // xyz, xyzw).
+    unsigned NumCombinedComponents = CI.Width + Paired.Width;
+    if (NumCombinedComponents > 4)
       return false;
 
-    if (getBufferFormatWithCompCount(CI.Format, CI.Width + Paired.Width, STI) == 0)
+    if (getBufferFormatWithCompCount(CI.Format, NumCombinedComponents, STI) ==
+        0)
       return false;
+
+    // Merge only when the two access ranges are strictly back-to-back;
+    // any gap or overlap could overwrite data or leave holes.
+    unsigned BytePerComp = Info0->BitsPerComp / 8;
+    unsigned ElemIndex0 = CI.Offset / BytePerComp;
+    unsigned ElemIndex1 = Paired.Offset / BytePerComp;
+    if (!(ElemIndex0 + CI.Width == ElemIndex1 ||
+          ElemIndex1 + Paired.Width == ElemIndex0))
+      return false;
+
+    return true;
   }
 
   uint32_t EltOffset0 = CI.Offset / CI.EltSize;
@@ -1076,7 +1102,7 @@ bool SILoadStoreOptimizer::offsetsCanBeCombined(CombineInfo &CI,
   // Handle all non-DS instructions.
   if ((CI.InstClass != DS_READ) && (CI.InstClass != DS_WRITE)) {
     if (EltOffset0 + CI.Width != EltOffset1 &&
-            EltOffset1 + Paired.Width != EltOffset0)
+        EltOffset1 + Paired.Width != EltOffset0)
       return false;
     if (CI.CPol != Paired.CPol)
       return false;
diff --git a/llvm/test/CodeGen/AMDGPU/merge-tbuffer.mir b/llvm/test/CodeGen/AMDGPU/merge-tbuffer.mir
index 9766b427b4325..4a604513e9bbe 100644
--- a/llvm/test/CodeGen/AMDGPU/merge-tbuffer.mir
+++ b/llvm/test/CodeGen/AMDGPU/merge-tbuffer.mir
@@ -8706,3 +8706,458 @@ body:             |
     %8:vgpr_32 = TBUFFER_LOAD_FORMAT_X_BOTHEN_exact %4, %5:sgpr_128, 0, 8, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)
     %9:vgpr_32 = TBUFFER_LOAD_FORMAT_X_BOTHEN_exact %4, %5:sgpr_128, 0, 12, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)
 ...
+---
+
+name: gfx11_tbuffer_load_x_x_x_idxen_16bit
+body: |
+  bb.0.entry:
+    liveins: $sgpr0,$sgpr1,$sgpr2,$sgpr3,$vgpr0
+    ; GFX9-LABEL: name: gfx11_tbuffer_load_x_x_x_idxen_16bit
+    ; GFX9: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX9-NEXT: {{  $}}
+    ; GFX9-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; GFX9-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+    ; GFX9-NEXT: %x0:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], %rsrc, 0, 0, 13, 0, 0, implicit $exec :: (dereferenceable load (s16), addrspace 8)
+    ; GFX9-NEXT: %x1:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], %rsrc, 0, 2, 13, 0, 0, implicit $exec :: (dereferenceable load (s16), addrspace 8)
+    ; GFX9-NEXT: %x2:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], %rsrc, 0, 4, 13, 0, 0, implicit $exec :: (dereferenceable load (s16), addrspace 8)
+    ;
+    ; GFX10-LABEL: name: gfx11_tbuffer_load_x_x_x_idxen_16bit
+    ; GFX10: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX10-NEXT: {{  $}}
+    ; GFX10-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; GFX10-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+    ; GFX10-NEXT: [[TBUFFER_LOAD_FORMAT_XY_IDXEN:%[0-9]+]]:vreg_64 = TBUFFER_LOAD_FORMAT_XY_IDXEN [[COPY]], %rsrc, 0, 0, 29, 0, 0, implicit $exec :: (dereferenceable load (s32), align 2, addrspace 8)
+    ; GFX10-NEXT: %x0:vgpr_32 = COPY [[TBUFFER_LOAD_FORMAT_XY_IDXEN]].sub0
+    ; GFX10-NEXT: %x1:vgpr_32 = COPY killed [[TBUFFER_LOAD_FORMAT_XY_IDXEN]].sub1
+    ; GFX10-NEXT: %x2:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], %rsrc, 0, 4, 13, 0, 0, implicit $exec :: (dereferenceable load (s16), addrspace 8)
+    ;
+    ; GFX11-LABEL: name: gfx11_tbuffer_load_x_x_x_idxen_16bit
+    ; GFX11: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+    ; GFX11-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+    ; GFX11-NEXT: [[TBUFFER_LOAD_FORMAT_XY_IDXEN:%[0-9]+]]:vreg_64 = TBUFFER_LOAD_FORMAT_XY_IDXEN [[COPY]], %rsrc, 0, 0, 29, 0, 0, implicit $exec :: (dereferenceable load (s32), align 2, addrspace 8)
+    ; GFX11-NEXT: %x0:vgpr_32 = COPY [[TBUFFER_LOAD_FORMAT_XY_IDXEN]].sub0
+    ; GFX11-NEXT: %x1:vgpr_32 = COPY killed [[TBUFFER_LOAD_FORMAT_XY_IDXEN]].sub1
+    ; GFX11-NEXT: %x2:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], %rsrc, 0, 4, 13, 0, 0, implicit $exec :: (dereferenceable load (s16), addrspace 8)
+    %0:vgpr_32 = COPY $vgpr0
+    %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0,%subreg.sub0,$sgpr1,%subreg.sub1,$sgpr2,%subreg.sub2,$sgpr3,%subreg.sub3
+    %x0:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %0, %rsrc, 0, 0, 13, 0, 0, implicit $exec :: (dereferenceable load (s16),align 2,addrspace 8)
+    %x1:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %0, %rsrc, 0, 2, 13, 0, 0, implicit $exec :: (dereferenceable load (s16),align 2,addrspace 8)
+    %x2:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %0, %rsrc, 0, 4, 13, 0, 0, implicit $exec :: (dereferenceable load (s16),align 2,addrspace 8)
+...
+---
+
+name: gfx11_tbuffer_load_idxen_16_bit
+body: |
+  bb.0.entry:
+    liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $sgpr4
+    ; GFX9-LABEL: name: gfx11_tbuffer_load_idxen_16_bit
+    ; GFX9: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $sgpr4
+    ; GFX9-NEXT: {{  $}}
+    ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr_32 = COPY $sgpr4
+    ; GFX9-NEXT: [[COPY1:%[0-9]+]]:sgpr_32 = COPY $sgpr3
+    ; GFX9-NEXT: [[COPY2:%[0-9]+]]:sgpr_32 = COPY $sgpr2
+    ; GFX9-NEXT: [[COPY3:%[0-9]+]]:sgpr_32 = COPY $sgpr1
+    ; GFX9-NEXT: [[COPY4:%[0-9]+]]:sgpr_32 = COPY $sgpr0
+    ; GFX9-NEXT: [[REG_SEQUENCE:%[0-9]+]]:sgpr_128 = REG_SEQUENCE [[COPY4]], %subreg.sub0, [[COPY3]], %subreg.sub1, [[COPY2]], %subreg.sub2, [[COPY1]], %subreg.sub3
+    ; GFX9-NEXT: [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[COPY]]
+    ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 0, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN1:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 2, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN2:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 4, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN3:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 6, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN4:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 16, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN5:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 18, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN6:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 20, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN7:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 22, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN8:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 24, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ;
+    ; GFX10-LABEL: name: gfx11_tbuffer_load_idxen_16_bit
+    ; GFX10: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $sgpr4
+    ; GFX10-NEXT: {{  $}}
+    ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr_32 = COPY $sgpr4
+    ; GFX10-NEXT: [[COPY1:%[0-9]+]]:sgpr_32 = COPY $sgpr3
+    ; GFX10-NEXT: [[COPY2:%[0-9]+]]:sgpr_32 = COPY $sgpr2
+    ; GFX10-NEXT: [[COPY3:%[0-9]+]]:sgpr_32 = COPY $sgpr1
+    ; GFX10-NEXT: [[COPY4:%[0-9]+]]:sgpr_32 = COPY $sgpr0
+    ; GFX10-NEXT: [[REG_SEQUENCE:%[0-9]+]]:sgpr_128 = REG_SEQUENCE [[COPY4]], %subreg.sub0, [[COPY3]], %subreg.sub1, [[COPY2]], %subreg.sub2, [[COPY1]], %subreg.sub3
+    ; GFX10-NEXT: [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[COPY]]
+    ; GFX10-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 0, 71, 0, 0, implicit $exec :: (dereferenceable load (s128), align 1, addrspace 8)
+    ; GFX10-NEXT: [[COPY6:%[0-9]+]]:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub0_sub1
+    ; GFX10-NEXT: [[COPY7:%[0-9]+]]:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub2_sub3
+    ; GFX10-NEXT: [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY6]].sub0
+    ; GFX10-NEXT: [[COPY9:%[0-9]+]]:vgpr_32 = COPY killed [[COPY6]].sub1
+    ; GFX10-NEXT: [[COPY10:%[0-9]+]]:vgpr_32 = COPY [[COPY7]].sub0
+    ; GFX10-NEXT: [[COPY11:%[0-9]+]]:vgpr_32 = COPY killed [[COPY7]].sub1
+    ; GFX10-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN1:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 16, 71, 0, 0, implicit $exec :: (dereferenceable load (s128), align 1, addrspace 8)
+    ; GFX10-NEXT: [[COPY12:%[0-9]+]]:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN1]].sub0_sub1
+    ; GFX10-NEXT: [[COPY13:%[0-9]+]]:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN1]].sub2_sub3
+    ; GFX10-NEXT: [[COPY14:%[0-9]+]]:vgpr_32 = COPY [[COPY12]].sub0
+    ; GFX10-NEXT: [[COPY15:%[0-9]+]]:vgpr_32 = COPY killed [[COPY12]].sub1
+    ; GFX10-NEXT: [[COPY16:%[0-9]+]]:vgpr_32 = COPY [[COPY13]].sub0
+    ; GFX10-NEXT: [[COPY17:%[0-9]+]]:vgpr_32 = COPY killed [[COPY13]].sub1
+    ; GFX10-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 24, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    ;
+    ; GFX11-LABEL: name: gfx11_tbuffer_load_idxen_16_bit
+    ; GFX11: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $sgpr4
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: [[COPY:%[0-9]+]]:sgpr_32 = COPY $sgpr4
+    ; GFX11-NEXT: [[COPY1:%[0-9]+]]:sgpr_32 = COPY $sgpr3
+    ; GFX11-NEXT: [[COPY2:%[0-9]+]]:sgpr_32 = COPY $sgpr2
+    ; GFX11-NEXT: [[COPY3:%[0-9]+]]:sgpr_32 = COPY $sgpr1
+    ; GFX11-NEXT: [[COPY4:%[0-9]+]]:sgpr_32 = COPY $sgpr0
+    ; GFX11-NEXT: [[REG_SEQUENCE:%[0-9]+]]:sgpr_128 = REG_SEQUENCE [[COPY4]], %subreg.sub0, [[COPY3]], %subreg.sub1, [[COPY2]], %subreg.sub2, [[COPY1]], %subreg.sub3
+    ; GFX11-NEXT: [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[COPY]]
+    ; GFX11-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 0, 57, 0, 0, implicit $exec :: (dereferenceable load (s128), align 1, addrspace 8)
+    ; GFX11-NEXT: [[COPY6:%[0-9]+]]:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub0_sub1
+    ; GFX11-NEXT: [[COPY7:%[0-9]+]]:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub2_sub3
+    ; GFX11-NEXT: [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY6]].sub0
+    ; GFX11-NEXT: [[COPY9:%[0-9]+]]:vgpr_32 = COPY killed [[COPY6]].sub1
+    ; GFX11-NEXT: [[COPY10:%[0-9]+]]:vgpr_32 = COPY [[COPY7]].sub0
+    ; GFX11-NEXT: [[COPY11:%[0-9]+]]:vgpr_32 = COPY killed [[COPY7]].sub1
+    ; GFX11-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN1:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 16, 57, 0, 0, implicit $exec :: (dereferenceable load (s128), align 1, addrspace 8)
+    ; GFX11-NEXT: [[COPY12:%[0-9]+]]:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN1]].sub0_sub1
+    ; GFX11-NEXT: [[COPY13:%[0-9]+]]:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN1]].sub2_sub3
+    ; GFX11-NEXT: [[COPY14:%[0-9]+]]:vgpr_32 = COPY [[COPY12]].sub0
+    ; GFX11-NEXT: [[COPY15:%[0-9]+]]:vgpr_32 = COPY killed [[COPY12]].sub1
+    ; GFX11-NEXT: [[COPY16:%[0-9]+]]:vgpr_32 = COPY [[COPY13]].sub0
+    ; GFX11-NEXT: [[COPY17:%[0-9]+]]:vgpr_32 = COPY killed [[COPY13]].sub1
+    ; GFX11-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 24, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %4:sgpr_32 = COPY $sgpr4
+    %3:sgpr_32 = COPY $sgpr3
+    %2:sgpr_32 = COPY $sgpr2
+    %1:sgpr_32 = COPY $sgpr1
+    %0:sgpr_32 = COPY $sgpr0
+    %5:sgpr_128 = REG_SEQUENCE %0:sgpr_32, %subreg.sub0, %1:sgpr_32, %subreg.sub1, %2:sgpr_32, %subreg.sub2, %3:sgpr_32, %subreg.sub3
+    %8:vgpr_32 = COPY %4:sgpr_32
+    %7:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 0, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %9:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 2, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %11:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 4, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %13:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 6, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %15:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 16, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %17:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 18, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %19:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 20, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %21:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 22, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+    %22:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 24, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+...
+---
+
+name: gfx11_tbuffer_load_xy_xy_idxen_uint_16_bit
+body: |
+  bb.0.entry:
+    liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX9-LABEL: name: gfx11_tbuffer_load_xy_xy_idxen_uint_16_bit
+    ; GFX9: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX9-NEXT: {{  $}}
+    ; GFX9-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+    ; GFX9-NEXT: %idx:vgpr_32 = COPY $vgpr0
+    ; GFX9-NEXT: %v0:vreg_64 = TBUFFER_LOAD_FORMAT_XY_IDXEN %idx, %rsrc, 0, 0, 27, 0, 0, implicit $exec :: (dereferenceable load (s32), align 2, addrspace 4)
+    ; GFX9-NEXT: %v1:vreg_64 = TBUFFER_LOAD_FORMAT_XY_IDXEN %idx, %rsrc, 0, 4, 27, 0, 0, implicit $exec :: (dereferenceable load (s32), align 2, addrspace 4)
+    ;
+    ; GFX10-LABEL: name: gfx11_tbuffer_load_xy_xy_idxen_uint_16_bit
+    ; GFX10: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX10-NEXT: {{  $}}
+    ; GFX10-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+    ; GFX10-NEXT: %idx:vgpr_32 = COPY $vgpr0
+    ; GFX10-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN %idx, %rsrc, 0, 0, 69, 0, 0, implicit $exec :: (dereferenceable load (s64), align 2, addrspace 4)
+    ; GFX10-NEXT: %v0:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub0_sub1
+    ; GFX10-NEXT: %v1:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub2_sub3
+    ;
+    ; GFX11-LABEL: name: gfx11_tbuffer_load_xy_xy_idxen_uint_16_bit
+    ; GFX11: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+    ; GFX11-NEXT: %idx:vgpr_32 = COPY $vgpr0
+    ; GFX11-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN %idx, %rsrc, 0, 0, 55, 0, 0, implicit $exec :: (dereferenceable load (s64), align 2, addrspace 4)
+    ; GFX11-NEXT: %v0:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub0_sub1
+    ; GFX11-NEXT: %v1:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub2_sub3
+    %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1,%subreg.sub1, $sgpr2,%subreg.sub2, $sgpr3,%subreg.sub3
+    %idx:vgpr_32 = COPY $vgpr0
+    %v0:vreg_64 = TBUFFER_LOAD_FORMAT_XY_IDXEN  %idx, %rsrc, 0, 0, 27, 0, 0, implicit $exec :: (dereferenceable load (s32),align 2,addrspace 4)
+    %v1:vreg_64 = TBUFFER_LOAD_FORMAT_XY_IDXEN  %idx, %rsrc, 0, 4, 27, 0, 0, implicit $exec :: (dereferenceable load (s32),align 2,addrspace 4)
+...
+---
+
+name: gfx11_tbuffer_load_xy_xy_idxen_sint_16_bit
+body: |
+  bb.0.entry:
+    liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX9-LABEL: name: gfx11_tbuffer_load_xy_xy_idxen_sint_16_bit
+    ; GFX9: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX9-NEXT: {{  $}}
+    ; GFX9-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+    ; GFX9-NEXT: %idx:vgpr_32 = COPY $vgpr0
+    ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN %idx, %rsrc, 0, 0, 28, 0, 0, implicit $exec :: (dereferenceable load (s64), align 2, addrspace 4)
+    ; GFX9-NEXT: %v0:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub0_sub1
+    ; GFX9-NEXT: %v1:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub2_sub3
+    ;
+    ; GFX10-LABEL: name: gfx11_tbuffer_load_xy_xy_idxen_sint_16_bit
+    ; GFX10: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX10-NEXT: {{  $}}
+    ; GFX10-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+    ; GFX10-NEXT: %idx:vgpr_32 = COPY $vgpr0
+    ; GFX10-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN %idx, %rsrc, 0, 0, 70, 0, 0, implicit $exec :: (dereferenceable load (s64), align 2, addrspace 4)
+    ; GFX10-NEXT: %v0:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub0_sub1
+    ; GFX10-NEXT: %v1:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub2_sub3
+    ;
+    ; GFX11-LABEL: name: gfx11_tbuffer_load_xy_xy_idxen_sint_16_bit
+    ; GFX11: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+    ; GFX11-NEXT: {{  $}}
+    ; GFX11-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+    ; GFX11-NEXT: %idx:vgpr_32 = COPY $vgpr0
+    ; GFX11-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN %idx, %rsrc, 0, ...
[truncated]
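
The core of the new legality test in offsetsCanBeCombined is element-index arithmetic. The following standalone restatement is an assumed simplification, not the in-tree code, and the Access struct and areBackToBack name are invented for illustration:

#include <cassert>

struct Access {
  unsigned Offset; // byte offset of the access
  unsigned Width;  // components accessed (1 = X, 2 = XY, ...)
};

// True when the two ranges abut exactly, in either order; any gap or
// overlap disqualifies the pair, mirroring the check added in the patch.
static bool areBackToBack(Access A, Access B, unsigned BitsPerComp) {
  unsigned BytesPerComp = BitsPerComp / 8;
  if (A.Offset % BytesPerComp || B.Offset % BytesPerComp)
    return false; // offsets must fall on element boundaries
  unsigned Idx0 = A.Offset / BytesPerComp;
  unsigned Idx1 = B.Offset / BytesPerComp;
  return Idx0 + A.Width == Idx1 || Idx1 + B.Width == Idx0;
}

int main() {
  // Two 16-bit X loads at byte offsets 0 and 2 touch elements 0 and 1.
  assert(areBackToBack({0, 1}, {2, 1}, 16));
  // Offsets 0 and 4 leave a one-element hole between them: rejected.
  assert(!areBackToBack({0, 1}, {4, 1}, 16));
  // Order does not matter: XY at offset 4 preceded by X at offset 2 abuts.
  assert(areBackToBack({4, 2}, {2, 1}, 16));
}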

@harrisonGPU harrisonGPU requested a review from shiltian June 20, 2025 17:37
@harrisonGPU harrisonGPU self-assigned this Jun 20, 2025
@harrisonGPU harrisonGPU force-pushed the amdgpu/tbuffer-16bit-merge branch from bb567ab to 47589ee on June 21, 2025 03:31
Comment on lines -1062 to -1064
Contributor

Will your patch merge two 16-bit loads at offsets 2 and 4 into a single 32-bit load at offset 2?

If it does, then the merged load is not dword-aligned. Is that allowed?

Contributor Author

Thanks, I understand your concern. I've checked section 9.5, "Alignment", of the RDNA 3 Shader Instruction Set Architecture manual, which states:

Formatted ops such as BUFFER_LOAD_FORMAT_* must be aligned as follows:
• 1-byte formats → 1-byte alignment
• 2-byte formats → 2-byte alignment
• 4-byte and larger formats → 4-byte alignment

I’ve therefore added an explicit alignment check and a new Lit test, gfx11_tbuffer_load_x_off2_off4_16bit_no_merge.

Reference:
https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf
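
For reference, the quoted rule can be encoded in a few lines. This is a hedged sketch of the rule itself; requiredAlignment and offsetIsLegal are illustrative names, not functions from the patch:

#include <cassert>

// Required alignment of a formatted buffer access per RDNA 3 ISA 9.5:
// 1-byte formats are byte-aligned, 2-byte formats are 2-byte-aligned,
// and 4-byte or larger formats are dword-aligned.
static unsigned requiredAlignment(unsigned FormatBytes) {
  if (FormatBytes <= 1)
    return 1;
  if (FormatBytes == 2)
    return 2;
  return 4;
}

static bool offsetIsLegal(unsigned Offset, unsigned FormatBytes) {
  return Offset % requiredAlignment(FormatBytes) == 0;
}

int main() {
  assert(offsetIsLegal(2, 2));  // a 2-byte format at offset 2 is fine
  assert(!offsetIsLegal(2, 4)); // a 4-byte access at offset 2 is not
}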

Contributor

@arsenm arsenm left a comment

I still think we should be doing this kind of merging in the IR. SILoadStoreOptimizer was originally intended only for combining DS reads/writes from non-consecutive offsets; everything else could have been done as normal vectorization.

@harrisonGPU (Contributor Author)

I still think we should be doing this kind of merging in the IR. SILoadStoreOptimizer was originally intended only for combining DS reads/writes from non-consecutive offsets; everything else could have been done as normal vectorization.

Thanks Matt. If we try to merge tbuffer loads in the IR, we first have to expose the buffer-format information there. At the moment SILoadStoreOptimizer already has easy access to that data (e.g. BitsPerComp and NumFormat), so extending the existing pass feels more pragmatic.
The pass can already merge 32-bit tbuffer loads; covering the 16-bit and 8-bit cases mainly means handling the different element sizes, which is a relatively small change compared with plumbing format metadata through the whole IR pipeline.
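
To make the format-metadata point concrete, here is a self-contained toy model; the BufferFormatInfo struct stands in for the data returned by the real AMDGPU::getGcnBufferFormatInfo and is not the actual API:

#include <cassert>
#include <optional>

// Stand-in for the fields the pass reads from the buffer-format table.
struct BufferFormatInfo {
  unsigned BitsPerComp; // 8, 16, or 32
  unsigned NumFormat;   // numeric format id (UINT, SINT, FLOAT, ...)
};

// Two accesses are format-compatible only when both formats resolve and
// their component size and numeric format agree; the real pass also
// requires a native encoding for the wider merged access.
static bool formatsCompatible(const std::optional<BufferFormatInfo> &A,
                              const std::optional<BufferFormatInfo> &B) {
  return A && B && A->BitsPerComp == B->BitsPerComp &&
         A->NumFormat == B->NumFormat;
}

int main() {
  std::optional<BufferFormatInfo> U16 = BufferFormatInfo{16, 0};
  std::optional<BufferFormatInfo> S16 = BufferFormatInfo{16, 1};
  std::optional<BufferFormatInfo> Unknown;
  assert(formatsCompatible(U16, U16));      // same size and numeric format
  assert(!formatsCompatible(U16, S16));     // different numeric format
  assert(!formatsCompatible(U16, Unknown)); // unresolved format
}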

@arsenm (Contributor) commented Jun 26, 2025

Thanks Matt. If we try to merge tbuffer loads in the IR, we first have to expose the buffer-format information there. At the moment SILoadStoreOptimizer already has easy access to that data (e.g. BitsPerComp and NumFormat),

Right, this is a consequence of doing the wrong thing up front, and then the next 100 patches keep following along with the original mistake.

@harrisonGPU (Contributor Author)

Thanks Matt. If we try to merge tbuffer loads in the IR, we first have to expose the buffer-format information there. At the moment SILoadStoreOptimizer already has easy access to that data (e.g. BitsPerComp and NumFormat),

Right, this is a consequence of doing the wrong thing up front, and then the next 100 patches keep following along with the original mistake.

Well, I think we should first support 16-bit and 8-bit TBUFFER merging, and then consider moving it to the IR stage; right now, leaving these accesses unmerged can impact performance.

@harrisonGPU (Contributor Author)

@jayfoad @arsenm Could you please take a look?

@harrisonGPU harrisonGPU requested review from jayfoad and arsenm July 9, 2025 06:03
@harrisonGPU harrisonGPU changed the title from "[AMDGPU] Support merging 16-bit TBUFFER load/store instruction" to "[AMDGPU] Support merging 16-bit and 8-bit TBUFFER load/store instruction" Jul 10, 2025
@harrisonGPU harrisonGPU requested a review from jayfoad July 10, 2025 14:41
Contributor

Can getGcnBufferFormatInfo ever fail here? Why?

Contributor Author

Yes, it can fail, and that causes some lit tests to fail. I think this is a problem with the tests themselves: for example, if we run a test with gfx900 but it is meant to target gfx10, the format won't be found and Info will be null. That triggers failures, like gfx10_tbuffer_load_x_xyz when run on gfx900.
So I think it makes sense to keep the null check for compatibility, right?

Contributor

Really we ought to split the tests into separate files for gfx9 and gfx10.

Contributor Author

Should I separate this and create a new NFC PR?

Contributor

yes

Contributor Author

I think splitting merge-tbuffer.mir makes sense, but these lit tests would also need to be split:

Failed Tests (6):
  LLVM :: CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.tbuffer.load.ll
  LLVM :: CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.tbuffer.store.ll
  LLVM :: CodeGen/AMDGPU/llvm.amdgcn.raw.tbuffer.load.ll
  LLVM :: CodeGen/AMDGPU/llvm.amdgcn.raw.tbuffer.store.ll
  LLVM :: CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.tbuffer.load.ll
  LLVM :: CodeGen/AMDGPU/llvm.amdgcn.struct.tbuffer.load.ll

Should I go ahead and split these as well?

Contributor

I think it would be better to change this pass so that if !Info then the instruction is not added to the list of mergeable instructions. I.e. either change setMI so that it can return failure, or check that getGcnBufferFormatInfo succeeds before calling setMI. I think this is safer than treating unknown formats as if they are 32-bit.

Contributor Author

Hi Jay, I tried changing setMI to return a bool, but it causes some DS_Load tests to fail. Also, if we want to check getGcnBufferFormatInfo before calling setMI, we would need to check whether the instruction is a tbuffer load/store first; otherwise it could also affect DS_Load behavior.
I believe the current approach is still safe: even if Info is null and the instruction is added to the mergeable list, we still check whether the format info is valid before merging, and if it is null the instruction won't be merged. So from a correctness standpoint, the current logic holds.

Contributor Author

Hi Jay, I’ve already updated the patch. I now check that getGcnBufferFormatInfo succeeds before calling setMI. :-)
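
A rough, self-contained sketch of what that guard amounts to (stand-in types and names; the in-tree code calls AMDGPU::getGcnBufferFormatInfo and setMI, and only tbuffer loads/stores carry a format):

#include <cassert>
#include <optional>
#include <vector>

struct FormatInfo { unsigned BitsPerComp; };

// Stand-in for the subtarget format-table lookup; unknown encodings yield
// nullopt, mirroring a null result from getGcnBufferFormatInfo.
static std::optional<FormatInfo> lookupFormat(unsigned Format) {
  if (Format == 13 || Format == 22) // hypothetical known encodings
    return FormatInfo{Format == 13 ? 16u : 32u};
  return std::nullopt;
}

struct Candidate {
  bool IsTBuffer; // only tbuffer loads/stores carry a format
  unsigned Format;
};

// Reject tbuffer candidates whose format cannot be resolved before they
// reach the mergeable list, so later merge code needs no null checks.
static std::vector<Candidate>
collectMergeable(const std::vector<Candidate> &Insts) {
  std::vector<Candidate> Mergeable;
  for (const Candidate &C : Insts) {
    if (C.IsTBuffer && !lookupFormat(C.Format))
      continue;
    Mergeable.push_back(C);
  }
  return Mergeable;
}

int main() {
  std::vector<Candidate> In = {{true, 13}, {true, 99}, {false, 0}};
  assert(collectMergeable(In).size() == 2); // unknown format 99 is dropped
}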

Contributor

So now you don't need the if (Info) check here, right?

And there are similar checks in offsetsCanBeCombined that can be removed.

harrisonGPU added a commit that referenced this pull request Jul 16, 2025
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Jul 16, 2025
@harrisonGPU harrisonGPU force-pushed the amdgpu/tbuffer-16bit-merge branch from dba154b to 3e66c02 on July 16, 2025 04:28
@harrisonGPU harrisonGPU requested a review from jayfoad July 16, 2025 06:20
@harrisonGPU (Contributor Author)

Ping.

@harrisonGPU harrisonGPU force-pushed the amdgpu/tbuffer-16bit-merge branch from cd979ac to 15a9d34 on August 18, 2025 07:03
@harrisonGPU (Contributor Author)

Rebase.

@harrisonGPU harrisonGPU force-pushed the amdgpu/tbuffer-16bit-merge branch from 520b084 to 8cf32f0 on August 19, 2025 09:13

@harrisonGPU (Contributor Author)

So now you don't need the if (Info) check here, right?

And there are similar checks in offsetsCanBeCombined that can be removed.

I have already removed them.

Contributor

@jayfoad jayfoad left a comment

LGTM, thanks!

@harrisonGPU harrisonGPU merged commit 23a5a7b into llvm:main Aug 20, 2025
9 checks passed
@harrisonGPU harrisonGPU deleted the amdgpu/tbuffer-16bit-merge branch August 20, 2025 13:16