[AMDGPU] Support merging 16-bit and 8-bit TBUFFER load/store instruction #145078
Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Harrison Hao (harrisonGPU)

Changes: SILoadStoreOptimizer can now recognise consecutive 16-bit and 8-bit TBUFFER_LOAD/TBUFFER_STORE instructions and fold them into the wider native variants. The optimisation cuts the number of TBUFFER instructions, shrinking code size and improving memory throughput.

Patch is 39.91 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/145078.diff 2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp b/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
index b0d6fd95cd271..83dbad9a1ba20 100644
--- a/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
@@ -1040,8 +1040,21 @@ bool SILoadStoreOptimizer::offsetsCanBeCombined(CombineInfo &CI,
if (CI.Offset == Paired.Offset)
return false;
+ // Use 2-byte element size if both tbuffer formats are 16-bit.
+ unsigned EltSize = CI.EltSize;
+ auto Has16BitComponents = [&](unsigned Format) -> bool {
+ const auto *Info = AMDGPU::getGcnBufferFormatInfo(Format, STI);
+ return Info && Info->BitsPerComp == 16;
+ };
+
+ if ((CI.InstClass == TBUFFER_LOAD || CI.InstClass == TBUFFER_STORE)) {
+ // TODO: Support merging 8-bit tbuffer load/store instructions
+ if (Has16BitComponents(CI.Format) && Has16BitComponents(Paired.Format))
+ EltSize = 2;
+ }
+
// This won't be valid if the offset isn't aligned.
- if ((CI.Offset % CI.EltSize != 0) || (Paired.Offset % CI.EltSize != 0))
+ if ((CI.Offset % EltSize != 0) || (Paired.Offset % EltSize != 0))
return false;
if (CI.InstClass == TBUFFER_LOAD || CI.InstClass == TBUFFER_STORE) {
@@ -1059,13 +1072,26 @@ bool SILoadStoreOptimizer::offsetsCanBeCombined(CombineInfo &CI,
Info0->NumFormat != Info1->NumFormat)
return false;
- // TODO: Should be possible to support more formats, but if format loads
- // are not dword-aligned, the merged load might not be valid.
- if (Info0->BitsPerComp != 32)
+ // Buffer instructions support up to 4 components per access (e.g., x, xy,
+ // xyz, xyzw).
+ unsigned NumCombinedComponents = CI.Width + Paired.Width;
+ if (NumCombinedComponents > 4)
return false;
- if (getBufferFormatWithCompCount(CI.Format, CI.Width + Paired.Width, STI) == 0)
+ if (getBufferFormatWithCompCount(CI.Format, NumCombinedComponents, STI) ==
+ 0)
return false;
+
+  // Merge only when the two access ranges are strictly back-to-back;
+  // any gap or overlap can overwrite data or leave holes.
+ unsigned BytePerComp = Info0->BitsPerComp / 8;
+ unsigned ElemIndex0 = CI.Offset / BytePerComp;
+ unsigned ElemIndex1 = Paired.Offset / BytePerComp;
+ if (!(ElemIndex0 + CI.Width == ElemIndex1 ||
+ ElemIndex1 + Paired.Width == ElemIndex0))
+ return false;
+
+ return true;
}
uint32_t EltOffset0 = CI.Offset / CI.EltSize;
@@ -1076,7 +1102,7 @@ bool SILoadStoreOptimizer::offsetsCanBeCombined(CombineInfo &CI,
// Handle all non-DS instructions.
if ((CI.InstClass != DS_READ) && (CI.InstClass != DS_WRITE)) {
if (EltOffset0 + CI.Width != EltOffset1 &&
- EltOffset1 + Paired.Width != EltOffset0)
+ EltOffset1 + Paired.Width != EltOffset0)
return false;
if (CI.CPol != Paired.CPol)
return false;
diff --git a/llvm/test/CodeGen/AMDGPU/merge-tbuffer.mir b/llvm/test/CodeGen/AMDGPU/merge-tbuffer.mir
index 9766b427b4325..4a604513e9bbe 100644
--- a/llvm/test/CodeGen/AMDGPU/merge-tbuffer.mir
+++ b/llvm/test/CodeGen/AMDGPU/merge-tbuffer.mir
@@ -8706,3 +8706,458 @@ body: |
%8:vgpr_32 = TBUFFER_LOAD_FORMAT_X_BOTHEN_exact %4, %5:sgpr_128, 0, 8, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)
%9:vgpr_32 = TBUFFER_LOAD_FORMAT_X_BOTHEN_exact %4, %5:sgpr_128, 0, 12, 22, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)
...
+---
+
+name: gfx11_tbuffer_load_x_x_x_idxen_16bit
+body: |
+ bb.0.entry:
+ liveins: $sgpr0,$sgpr1,$sgpr2,$sgpr3,$vgpr0
+ ; GFX9-LABEL: name: gfx11_tbuffer_load_x_x_x_idxen_16bit
+ ; GFX9: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+ ; GFX9-NEXT: {{ $}}
+ ; GFX9-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+ ; GFX9-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+ ; GFX9-NEXT: %x0:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], %rsrc, 0, 0, 13, 0, 0, implicit $exec :: (dereferenceable load (s16), addrspace 8)
+ ; GFX9-NEXT: %x1:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], %rsrc, 0, 2, 13, 0, 0, implicit $exec :: (dereferenceable load (s16), addrspace 8)
+ ; GFX9-NEXT: %x2:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], %rsrc, 0, 4, 13, 0, 0, implicit $exec :: (dereferenceable load (s16), addrspace 8)
+ ;
+ ; GFX10-LABEL: name: gfx11_tbuffer_load_x_x_x_idxen_16bit
+ ; GFX10: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+ ; GFX10-NEXT: {{ $}}
+ ; GFX10-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+ ; GFX10-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+ ; GFX10-NEXT: [[TBUFFER_LOAD_FORMAT_XY_IDXEN:%[0-9]+]]:vreg_64 = TBUFFER_LOAD_FORMAT_XY_IDXEN [[COPY]], %rsrc, 0, 0, 29, 0, 0, implicit $exec :: (dereferenceable load (s32), align 2, addrspace 8)
+ ; GFX10-NEXT: %x0:vgpr_32 = COPY [[TBUFFER_LOAD_FORMAT_XY_IDXEN]].sub0
+ ; GFX10-NEXT: %x1:vgpr_32 = COPY killed [[TBUFFER_LOAD_FORMAT_XY_IDXEN]].sub1
+ ; GFX10-NEXT: %x2:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], %rsrc, 0, 4, 13, 0, 0, implicit $exec :: (dereferenceable load (s16), addrspace 8)
+ ;
+ ; GFX11-LABEL: name: gfx11_tbuffer_load_x_x_x_idxen_16bit
+ ; GFX11: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+ ; GFX11-NEXT: {{ $}}
+ ; GFX11-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
+ ; GFX11-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+ ; GFX11-NEXT: [[TBUFFER_LOAD_FORMAT_XY_IDXEN:%[0-9]+]]:vreg_64 = TBUFFER_LOAD_FORMAT_XY_IDXEN [[COPY]], %rsrc, 0, 0, 29, 0, 0, implicit $exec :: (dereferenceable load (s32), align 2, addrspace 8)
+ ; GFX11-NEXT: %x0:vgpr_32 = COPY [[TBUFFER_LOAD_FORMAT_XY_IDXEN]].sub0
+ ; GFX11-NEXT: %x1:vgpr_32 = COPY killed [[TBUFFER_LOAD_FORMAT_XY_IDXEN]].sub1
+ ; GFX11-NEXT: %x2:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY]], %rsrc, 0, 4, 13, 0, 0, implicit $exec :: (dereferenceable load (s16), addrspace 8)
+ %0:vgpr_32 = COPY $vgpr0
+ %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0,%subreg.sub0,$sgpr1,%subreg.sub1,$sgpr2,%subreg.sub2,$sgpr3,%subreg.sub3
+ %x0:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %0, %rsrc, 0, 0, 13, 0, 0, implicit $exec :: (dereferenceable load (s16),align 2,addrspace 8)
+ %x1:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %0, %rsrc, 0, 2, 13, 0, 0, implicit $exec :: (dereferenceable load (s16),align 2,addrspace 8)
+ %x2:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %0, %rsrc, 0, 4, 13, 0, 0, implicit $exec :: (dereferenceable load (s16),align 2,addrspace 8)
+...
+---
+
+name: gfx11_tbuffer_load_idxen_16_bit
+body: |
+ bb.0.entry:
+ liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $sgpr4
+ ; GFX9-LABEL: name: gfx11_tbuffer_load_idxen_16_bit
+ ; GFX9: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $sgpr4
+ ; GFX9-NEXT: {{ $}}
+ ; GFX9-NEXT: [[COPY:%[0-9]+]]:sgpr_32 = COPY $sgpr4
+ ; GFX9-NEXT: [[COPY1:%[0-9]+]]:sgpr_32 = COPY $sgpr3
+ ; GFX9-NEXT: [[COPY2:%[0-9]+]]:sgpr_32 = COPY $sgpr2
+ ; GFX9-NEXT: [[COPY3:%[0-9]+]]:sgpr_32 = COPY $sgpr1
+ ; GFX9-NEXT: [[COPY4:%[0-9]+]]:sgpr_32 = COPY $sgpr0
+ ; GFX9-NEXT: [[REG_SEQUENCE:%[0-9]+]]:sgpr_128 = REG_SEQUENCE [[COPY4]], %subreg.sub0, [[COPY3]], %subreg.sub1, [[COPY2]], %subreg.sub2, [[COPY1]], %subreg.sub3
+ ; GFX9-NEXT: [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[COPY]]
+ ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 0, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN1:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 2, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN2:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 4, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN3:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 6, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN4:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 16, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN5:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 18, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN6:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 20, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN7:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 22, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN8:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 24, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ ;
+ ; GFX10-LABEL: name: gfx11_tbuffer_load_idxen_16_bit
+ ; GFX10: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $sgpr4
+ ; GFX10-NEXT: {{ $}}
+ ; GFX10-NEXT: [[COPY:%[0-9]+]]:sgpr_32 = COPY $sgpr4
+ ; GFX10-NEXT: [[COPY1:%[0-9]+]]:sgpr_32 = COPY $sgpr3
+ ; GFX10-NEXT: [[COPY2:%[0-9]+]]:sgpr_32 = COPY $sgpr2
+ ; GFX10-NEXT: [[COPY3:%[0-9]+]]:sgpr_32 = COPY $sgpr1
+ ; GFX10-NEXT: [[COPY4:%[0-9]+]]:sgpr_32 = COPY $sgpr0
+ ; GFX10-NEXT: [[REG_SEQUENCE:%[0-9]+]]:sgpr_128 = REG_SEQUENCE [[COPY4]], %subreg.sub0, [[COPY3]], %subreg.sub1, [[COPY2]], %subreg.sub2, [[COPY1]], %subreg.sub3
+ ; GFX10-NEXT: [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[COPY]]
+ ; GFX10-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 0, 71, 0, 0, implicit $exec :: (dereferenceable load (s128), align 1, addrspace 8)
+ ; GFX10-NEXT: [[COPY6:%[0-9]+]]:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub0_sub1
+ ; GFX10-NEXT: [[COPY7:%[0-9]+]]:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub2_sub3
+ ; GFX10-NEXT: [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY6]].sub0
+ ; GFX10-NEXT: [[COPY9:%[0-9]+]]:vgpr_32 = COPY killed [[COPY6]].sub1
+ ; GFX10-NEXT: [[COPY10:%[0-9]+]]:vgpr_32 = COPY [[COPY7]].sub0
+ ; GFX10-NEXT: [[COPY11:%[0-9]+]]:vgpr_32 = COPY killed [[COPY7]].sub1
+ ; GFX10-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN1:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 16, 71, 0, 0, implicit $exec :: (dereferenceable load (s128), align 1, addrspace 8)
+ ; GFX10-NEXT: [[COPY12:%[0-9]+]]:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN1]].sub0_sub1
+ ; GFX10-NEXT: [[COPY13:%[0-9]+]]:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN1]].sub2_sub3
+ ; GFX10-NEXT: [[COPY14:%[0-9]+]]:vgpr_32 = COPY [[COPY12]].sub0
+ ; GFX10-NEXT: [[COPY15:%[0-9]+]]:vgpr_32 = COPY killed [[COPY12]].sub1
+ ; GFX10-NEXT: [[COPY16:%[0-9]+]]:vgpr_32 = COPY [[COPY13]].sub0
+ ; GFX10-NEXT: [[COPY17:%[0-9]+]]:vgpr_32 = COPY killed [[COPY13]].sub1
+ ; GFX10-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 24, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ ;
+ ; GFX11-LABEL: name: gfx11_tbuffer_load_idxen_16_bit
+ ; GFX11: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $sgpr4
+ ; GFX11-NEXT: {{ $}}
+ ; GFX11-NEXT: [[COPY:%[0-9]+]]:sgpr_32 = COPY $sgpr4
+ ; GFX11-NEXT: [[COPY1:%[0-9]+]]:sgpr_32 = COPY $sgpr3
+ ; GFX11-NEXT: [[COPY2:%[0-9]+]]:sgpr_32 = COPY $sgpr2
+ ; GFX11-NEXT: [[COPY3:%[0-9]+]]:sgpr_32 = COPY $sgpr1
+ ; GFX11-NEXT: [[COPY4:%[0-9]+]]:sgpr_32 = COPY $sgpr0
+ ; GFX11-NEXT: [[REG_SEQUENCE:%[0-9]+]]:sgpr_128 = REG_SEQUENCE [[COPY4]], %subreg.sub0, [[COPY3]], %subreg.sub1, [[COPY2]], %subreg.sub2, [[COPY1]], %subreg.sub3
+ ; GFX11-NEXT: [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[COPY]]
+ ; GFX11-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 0, 57, 0, 0, implicit $exec :: (dereferenceable load (s128), align 1, addrspace 8)
+ ; GFX11-NEXT: [[COPY6:%[0-9]+]]:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub0_sub1
+ ; GFX11-NEXT: [[COPY7:%[0-9]+]]:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub2_sub3
+ ; GFX11-NEXT: [[COPY8:%[0-9]+]]:vgpr_32 = COPY [[COPY6]].sub0
+ ; GFX11-NEXT: [[COPY9:%[0-9]+]]:vgpr_32 = COPY killed [[COPY6]].sub1
+ ; GFX11-NEXT: [[COPY10:%[0-9]+]]:vgpr_32 = COPY [[COPY7]].sub0
+ ; GFX11-NEXT: [[COPY11:%[0-9]+]]:vgpr_32 = COPY killed [[COPY7]].sub1
+ ; GFX11-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN1:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 16, 57, 0, 0, implicit $exec :: (dereferenceable load (s128), align 1, addrspace 8)
+ ; GFX11-NEXT: [[COPY12:%[0-9]+]]:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN1]].sub0_sub1
+ ; GFX11-NEXT: [[COPY13:%[0-9]+]]:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN1]].sub2_sub3
+ ; GFX11-NEXT: [[COPY14:%[0-9]+]]:vgpr_32 = COPY [[COPY12]].sub0
+ ; GFX11-NEXT: [[COPY15:%[0-9]+]]:vgpr_32 = COPY killed [[COPY12]].sub1
+ ; GFX11-NEXT: [[COPY16:%[0-9]+]]:vgpr_32 = COPY [[COPY13]].sub0
+ ; GFX11-NEXT: [[COPY17:%[0-9]+]]:vgpr_32 = COPY killed [[COPY13]].sub1
+ ; GFX11-NEXT: [[TBUFFER_LOAD_FORMAT_X_IDXEN:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN [[COPY5]], [[REG_SEQUENCE]], 0, 24, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ %4:sgpr_32 = COPY $sgpr4
+ %3:sgpr_32 = COPY $sgpr3
+ %2:sgpr_32 = COPY $sgpr2
+ %1:sgpr_32 = COPY $sgpr1
+ %0:sgpr_32 = COPY $sgpr0
+ %5:sgpr_128 = REG_SEQUENCE %0:sgpr_32, %subreg.sub0, %1:sgpr_32, %subreg.sub1, %2:sgpr_32, %subreg.sub2, %3:sgpr_32, %subreg.sub3
+ %8:vgpr_32 = COPY %4:sgpr_32
+ %7:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 0, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ %9:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 2, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ %11:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 4, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ %13:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 6, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ %15:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 16, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ %17:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 18, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ %19:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 20, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ %21:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 22, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+ %22:vgpr_32 = TBUFFER_LOAD_FORMAT_X_IDXEN %8:vgpr_32, %5:sgpr_128, 0, 24, 13, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
+...
+---
+
+name: gfx11_tbuffer_load_xy_xy_idxen_uint_16_bit
+body: |
+ bb.0.entry:
+ liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+ ; GFX9-LABEL: name: gfx11_tbuffer_load_xy_xy_idxen_uint_16_bit
+ ; GFX9: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+ ; GFX9-NEXT: {{ $}}
+ ; GFX9-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+ ; GFX9-NEXT: %idx:vgpr_32 = COPY $vgpr0
+ ; GFX9-NEXT: %v0:vreg_64 = TBUFFER_LOAD_FORMAT_XY_IDXEN %idx, %rsrc, 0, 0, 27, 0, 0, implicit $exec :: (dereferenceable load (s32), align 2, addrspace 4)
+ ; GFX9-NEXT: %v1:vreg_64 = TBUFFER_LOAD_FORMAT_XY_IDXEN %idx, %rsrc, 0, 4, 27, 0, 0, implicit $exec :: (dereferenceable load (s32), align 2, addrspace 4)
+ ;
+ ; GFX10-LABEL: name: gfx11_tbuffer_load_xy_xy_idxen_uint_16_bit
+ ; GFX10: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+ ; GFX10-NEXT: {{ $}}
+ ; GFX10-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+ ; GFX10-NEXT: %idx:vgpr_32 = COPY $vgpr0
+ ; GFX10-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN %idx, %rsrc, 0, 0, 69, 0, 0, implicit $exec :: (dereferenceable load (s64), align 2, addrspace 4)
+ ; GFX10-NEXT: %v0:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub0_sub1
+ ; GFX10-NEXT: %v1:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub2_sub3
+ ;
+ ; GFX11-LABEL: name: gfx11_tbuffer_load_xy_xy_idxen_uint_16_bit
+ ; GFX11: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+ ; GFX11-NEXT: {{ $}}
+ ; GFX11-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+ ; GFX11-NEXT: %idx:vgpr_32 = COPY $vgpr0
+ ; GFX11-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN %idx, %rsrc, 0, 0, 55, 0, 0, implicit $exec :: (dereferenceable load (s64), align 2, addrspace 4)
+ ; GFX11-NEXT: %v0:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub0_sub1
+ ; GFX11-NEXT: %v1:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub2_sub3
+ %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1,%subreg.sub1, $sgpr2,%subreg.sub2, $sgpr3,%subreg.sub3
+ %idx:vgpr_32 = COPY $vgpr0
+ %v0:vreg_64 = TBUFFER_LOAD_FORMAT_XY_IDXEN %idx, %rsrc, 0, 0, 27, 0, 0, implicit $exec :: (dereferenceable load (s32),align 2,addrspace 4)
+ %v1:vreg_64 = TBUFFER_LOAD_FORMAT_XY_IDXEN %idx, %rsrc, 0, 4, 27, 0, 0, implicit $exec :: (dereferenceable load (s32),align 2,addrspace 4)
+...
+---
+
+name: gfx11_tbuffer_load_xy_xy_idxen_sint_16_bit
+body: |
+ bb.0.entry:
+ liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+ ; GFX9-LABEL: name: gfx11_tbuffer_load_xy_xy_idxen_sint_16_bit
+ ; GFX9: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+ ; GFX9-NEXT: {{ $}}
+ ; GFX9-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+ ; GFX9-NEXT: %idx:vgpr_32 = COPY $vgpr0
+ ; GFX9-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN %idx, %rsrc, 0, 0, 28, 0, 0, implicit $exec :: (dereferenceable load (s64), align 2, addrspace 4)
+ ; GFX9-NEXT: %v0:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub0_sub1
+ ; GFX9-NEXT: %v1:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub2_sub3
+ ;
+ ; GFX10-LABEL: name: gfx11_tbuffer_load_xy_xy_idxen_sint_16_bit
+ ; GFX10: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+ ; GFX10-NEXT: {{ $}}
+ ; GFX10-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+ ; GFX10-NEXT: %idx:vgpr_32 = COPY $vgpr0
+ ; GFX10-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN %idx, %rsrc, 0, 0, 70, 0, 0, implicit $exec :: (dereferenceable load (s64), align 2, addrspace 4)
+ ; GFX10-NEXT: %v0:vreg_64 = COPY [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub0_sub1
+ ; GFX10-NEXT: %v1:vreg_64 = COPY killed [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN]].sub2_sub3
+ ;
+ ; GFX11-LABEL: name: gfx11_tbuffer_load_xy_xy_idxen_sint_16_bit
+ ; GFX11: liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0
+ ; GFX11-NEXT: {{ $}}
+ ; GFX11-NEXT: %rsrc:sgpr_128 = REG_SEQUENCE $sgpr0, %subreg.sub0, $sgpr1, %subreg.sub1, $sgpr2, %subreg.sub2, $sgpr3, %subreg.sub3
+ ; GFX11-NEXT: %idx:vgpr_32 = COPY $vgpr0
+ ; GFX11-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_IDXEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_IDXEN %idx, %rsrc, 0, ...
[truncated]
Will your patch merge two 16-bit loads at offsets 2 and 4 into a single 32-bit load at offset 2?
If it does that then the merged load is not dword aligned. Is that allowed?
Thanks, I understand your concern. I’ve verified section 9.5 “Alignment” of the RDNA 3 Shader Instruction Set Architecture manual, which states:
Formatted ops such as BUFFER_LOAD_FORMAT_* must be aligned as follows:
• 1-byte formats → 1-byte alignment
• 2-byte formats → 2-byte alignment
• 4-byte and larger formats → 4-byte alignment
I’ve therefore added an explicit alignment check and a new Lit test, gfx11_tbuffer_load_x_off2_off4_16bit_no_merge.
I still think we should be doing this kind of merging in the IR. SILoadStoreOptimizer was originally intended only for the case of combining the DS read/write from non-consecutive offsets. Everything else could have been done like a normal vectorization.
Thanks Matt. If we try to merge tbuffer loads in the IR, we first have to expose the buffer-format information there. At the moment …
Right, this is a consequence of doing the wrong thing up front, and then the next 100 patches keep following along with the original mistake.
Well, I think we should first support 16-bit and 8-bit TBUFFER merging, and then consider moving it to the IR stage. Right now, it could impact performance.
Can getGcnBufferFormatInfo ever fail here? Why?
Yes, it will cause some lit tests to fail. I think this is a problem with the test itself: for example, if we run it with gfx900 but the test is meant to target gfx10, the format won't be found and Info will be null. This will trigger failures, like in gfx10_tbuffer_load_x_xyz when run on gfx900.
So I think it makes sense to keep the null check for compatibility, right?
Really we ought to split the tests into separate files for gfx9 and gfx10.
Should I separate this and create a new NFC PR?
yes
I think separating merge-tbuffer.mir makes sense, but we still have these lit tests that need to be separated:
Failed Tests (6):
LLVM :: CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.tbuffer.load.ll
LLVM :: CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.tbuffer.store.ll
LLVM :: CodeGen/AMDGPU/llvm.amdgcn.raw.tbuffer.load.ll
LLVM :: CodeGen/AMDGPU/llvm.amdgcn.raw.tbuffer.store.ll
LLVM :: CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.tbuffer.load.ll
LLVM :: CodeGen/AMDGPU/llvm.amdgcn.struct.tbuffer.load.ll
Should I continue to do this?
I think it would be better to change this pass so that if !Info then the instruction is not added to the list of mergeable instructions. I.e. either change setMI so that it can return failure, or check that getGcnBufferFormatInfo succeeds before calling setMI. I think this is safer than treating unknown formats as if they are 32-bit.
Hi Jay, I tried changing setMI to return a bool, but it causes some DS_Load tests to fail. Also, if we want to check getGcnBufferFormatInfo before calling setMI, we would need to check whether the instruction is a tbuffer load/store first; otherwise it could also affect DS_Load behavior.
I believe the current approach is still safe. Even if Info is null and the instruction is added to the mergeable list, we still check whether the format info is valid before merging. If it's null, the instruction won't be merged.
So from a correctness standpoint, I think the current logic is safe.
Hi Jay, I’ve already updated the patch. I now check that getGcnBufferFormatInfo succeeds before calling setMI. :-)
So now you don't need the if (Info) check here, right? And there are similar checks in offsetsCanBeCombined that can be removed.
…for gfx9, gfx10, and gfx11 (#148786) For llvm/llvm-project#145078
Ping.
Rebase.
So now you don't need the if (Info) check here, right? And there are similar checks in offsetsCanBeCombined that can be removed.

I have already removed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
SILoadStoreOptimizer can now recognise consecutive 16-bit and 8-bit TBUFFER_LOAD/TBUFFER_STORE instructions that each write one component (X) or two (XY), and fold them into the wider native variants. The optimisation cuts the number of TBUFFER instructions, shrinking code size and improving memory throughput.
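As a concrete illustration of the new legality checks (back-to-back element ranges, at most four combined components, element-size alignment), here is a self-contained sketch specialised to 16-bit components. It mirrors the logic added to offsetsCanBeCombined but is not the pass itself; offsets are in bytes and widths are in components.

```cpp
#include <cstdio>

// Mirrors the merge legality checks from offsetsCanBeCombined for the
// 16-bit-per-component case. Illustrative sketch, not the pass itself.
static bool canMerge16Bit(unsigned Offset0, unsigned Width0,
                          unsigned Offset1, unsigned Width1) {
  const unsigned BytesPerComp = 2;          // 16-bit components
  if (Width0 + Width1 > 4)                  // at most xyzw after merging
    return false;
  if (Offset0 % BytesPerComp || Offset1 % BytesPerComp)
    return false;                           // element-size alignment
  unsigned Elem0 = Offset0 / BytesPerComp;
  unsigned Elem1 = Offset1 / BytesPerComp;
  // The ranges must be strictly back-to-back: no gap, no overlap.
  return Elem0 + Width0 == Elem1 || Elem1 + Width1 == Elem0;
}

int main() {
  // Two one-component 16-bit loads at offsets 0 and 2 merge into an XY load.
  printf("%d\n", canMerge16Bit(0, 1, 2, 1)); // prints 1
  // Offsets 0 and 4 leave a 2-byte hole, so they must not merge.
  printf("%d\n", canMerge16Bit(0, 1, 4, 1)); // prints 0
  return 0;
}
```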