Skip to content

Commit e089aab

Browse files
Pierre-vhLukacma
authored andcommitted
[AMDGPU] Update code sequence for CU-mode Release Fences in GFX10+ (llvm#161638)
They were previously optimized to not emit any waitcnt, which is technically correct because there is no reordering of operations at workgroup scope in CU mode for GFX10+. This breaks transitivity however, for example if we have the following sequence of events in one thread: - some stores - store atomic release syncscope("workgroup") - barrier then another thread follows with - barrier - load atomic acquire - store atomic release syncscope("agent") It does not work because, while the other thread sees the stores, it cannot release them at the wider scope. Our release fences aren't strong enough to "wait" on stores from other waves. We also cannot strengthen our release fences any further to allow for releasing other wave's stores because only GFX12 can do that with `global_wb`. GFX10-11 do not have the writeback instruction. It'd also add yet another level of complexity to code sequences, with both acquire/release having CU-mode only alternatives. Lastly, acq/rel are always used together. The price for synchronization has to be paid either at the acq, or the rel. Strengthening the releases would just make the memory model more complex but wouldn't help performance. So the choice here is to streamline the code sequences by making CU and WGP mode emit almost identical (vL0 inv is not needed in CU mode) code for release (or stronger) atomic ordering. This also removes the `vm_vsrc(0)` wait before barriers. Now that the release fence in CU mode is strong enough, it is no longer needed. Supersedes llvm#160501 Solves SC1-6454
1 parent a64c510 commit e089aab

16 files changed

+2270
-769
lines changed

llvm/docs/AMDGPUUsage.rst

Lines changed: 3 additions & 58 deletions
Large diffs are not rendered by default.

llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp

Lines changed: 14 additions & 37 deletions
Original file line numberDiff line numberDiff line change
@@ -397,12 +397,6 @@ class SICacheControl {
397397
bool IsCrossAddrSpaceOrdering,
398398
Position Pos) const = 0;
399399

400-
/// Inserts any necessary instructions before the barrier start instruction
401-
/// \p MI in order to support pairing of barriers and fences.
402-
virtual bool insertBarrierStart(MachineBasicBlock::iterator &MI) const {
403-
return false;
404-
};
405-
406400
/// Virtual destructor to allow derivations to be deleted.
407401
virtual ~SICacheControl() = default;
408402
};
@@ -583,12 +577,8 @@ class SIGfx10CacheControl : public SIGfx7CacheControl {
583577
bool IsCrossAddrSpaceOrdering, Position Pos,
584578
AtomicOrdering Order, bool AtomicsOnly) const override;
585579

586-
bool insertAcquire(MachineBasicBlock::iterator &MI,
587-
SIAtomicScope Scope,
588-
SIAtomicAddrSpace AddrSpace,
589-
Position Pos) const override;
590-
591-
bool insertBarrierStart(MachineBasicBlock::iterator &MI) const override;
580+
bool insertAcquire(MachineBasicBlock::iterator &MI, SIAtomicScope Scope,
581+
SIAtomicAddrSpace AddrSpace, Position Pos) const override;
592582
};
593583

594584
class SIGfx11CacheControl : public SIGfx10CacheControl {
@@ -2069,8 +2059,11 @@ bool SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
20692059
// the WGP. Therefore need to wait for operations to complete to ensure
20702060
// they are visible to waves in the other CU as the L0 is per CU.
20712061
// Otherwise in CU mode and all waves of a work-group are on the same CU
2072-
// which shares the same L0.
2073-
if (!ST.isCuModeEnabled()) {
2062+
// which shares the same L0. Note that we still need to wait when
2063+
// performing a release in this mode to respect the transitivity of
2064+
// happens-before, e.g. other waves of the workgroup must be able to
2065+
// release the memory from another wave at a wider scope.
2066+
if (!ST.isCuModeEnabled() || isReleaseOrStronger(Order)) {
20742067
if ((Op & SIMemOp::LOAD) != SIMemOp::NONE)
20752068
VMCnt |= true;
20762069
if ((Op & SIMemOp::STORE) != SIMemOp::NONE)
@@ -2225,22 +2218,6 @@ bool SIGfx10CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
22252218
return Changed;
22262219
}
22272220

2228-
bool SIGfx10CacheControl::insertBarrierStart(
2229-
MachineBasicBlock::iterator &MI) const {
2230-
// We need to wait on vm_vsrc so barriers can pair with fences in GFX10+ CU
2231-
// mode. This is because a CU mode release fence does not emit any wait, which
2232-
// is fine when only dealing with vmem, but isn't sufficient in the presence
2233-
// of barriers which do not go through vmem.
2234-
// GFX12.5 does not require this additional wait.
2235-
if (!ST.isCuModeEnabled() || ST.hasGFX1250Insts())
2236-
return false;
2237-
2238-
BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),
2239-
TII->get(AMDGPU::S_WAITCNT_DEPCTR))
2240-
.addImm(AMDGPU::DepCtr::encodeFieldVmVsrc(0));
2241-
return true;
2242-
}
2243-
22442221
bool SIGfx11CacheControl::enableLoadCacheBypass(
22452222
const MachineBasicBlock::iterator &MI, SIAtomicScope Scope,
22462223
SIAtomicAddrSpace AddrSpace) const {
@@ -2419,15 +2396,20 @@ bool SIGfx12CacheControl::insertWait(MachineBasicBlock::iterator &MI,
24192396
// In WGP mode the waves of a work-group can be executing on either CU
24202397
// of the WGP. Therefore need to wait for operations to complete to
24212398
// ensure they are visible to waves in the other CU as the L0 is per CU.
2399+
//
24222400
// Otherwise in CU mode and all waves of a work-group are on the same CU
2423-
// which shares the same L0.
2401+
// which shares the same L0. Note that we still need to wait when
2402+
// performing a release in this mode to respect the transitivity of
2403+
// happens-before, e.g. other waves of the workgroup must be able to
2404+
// release the memory from another wave at a wider scope.
24242405
//
24252406
// GFX12.5:
24262407
// CU$ has two ports. To ensure operations are visible at the workgroup
24272408
// level, we need to ensure all operations in this port have completed
24282409
// so the other SIMDs in the WG can see them. There is no ordering
24292410
// guarantee between the ports.
2430-
if (!ST.isCuModeEnabled() || ST.hasGFX1250Insts()) {
2411+
if (!ST.isCuModeEnabled() || ST.hasGFX1250Insts() ||
2412+
isReleaseOrStronger(Order)) {
24312413
if ((Op & SIMemOp::LOAD) != SIMemOp::NONE)
24322414
LOADCnt |= true;
24332415
if ((Op & SIMemOp::STORE) != SIMemOp::NONE)
@@ -3017,11 +2999,6 @@ bool SIMemoryLegalizer::run(MachineFunction &MF) {
30172999
MI = II->getIterator();
30183000
}
30193001

3020-
if (ST.getInstrInfo()->isBarrierStart(MI->getOpcode())) {
3021-
Changed |= CC->insertBarrierStart(MI);
3022-
continue;
3023-
}
3024-
30253002
if (!(MI->getDesc().TSFlags & SIInstrFlags::maybeAtomic))
30263003
continue;
30273004

llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll

Lines changed: 24 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -551,7 +551,9 @@ define amdgpu_kernel void @workgroup_one_as_release() #0 {
551551
;
552552
; GFX10CU-LABEL: name: workgroup_one_as_release
553553
; GFX10CU: bb.0.entry:
554+
; GFX10CU-NEXT: S_WAITCNT_soft 16240
554555
; GFX10CU-NEXT: S_WAITCNT_lds_direct
556+
; GFX10CU-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
555557
; GFX10CU-NEXT: S_ENDPGM 0
556558
;
557559
; GFX11WGP-LABEL: name: workgroup_one_as_release
@@ -562,6 +564,8 @@ define amdgpu_kernel void @workgroup_one_as_release() #0 {
562564
;
563565
; GFX11CU-LABEL: name: workgroup_one_as_release
564566
; GFX11CU: bb.0.entry:
567+
; GFX11CU-NEXT: S_WAITCNT_soft 1015
568+
; GFX11CU-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
565569
; GFX11CU-NEXT: S_ENDPGM 0
566570
entry:
567571
fence syncscope("workgroup-one-as") release
@@ -587,7 +591,9 @@ define amdgpu_kernel void @workgroup_one_as_acq_rel() #0 {
587591
;
588592
; GFX10CU-LABEL: name: workgroup_one_as_acq_rel
589593
; GFX10CU: bb.0.entry:
594+
; GFX10CU-NEXT: S_WAITCNT_soft 16240
590595
; GFX10CU-NEXT: S_WAITCNT_lds_direct
596+
; GFX10CU-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
591597
; GFX10CU-NEXT: S_ENDPGM 0
592598
;
593599
; GFX11WGP-LABEL: name: workgroup_one_as_acq_rel
@@ -599,6 +605,8 @@ define amdgpu_kernel void @workgroup_one_as_acq_rel() #0 {
599605
;
600606
; GFX11CU-LABEL: name: workgroup_one_as_acq_rel
601607
; GFX11CU: bb.0.entry:
608+
; GFX11CU-NEXT: S_WAITCNT_soft 1015
609+
; GFX11CU-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
602610
; GFX11CU-NEXT: S_ENDPGM 0
603611
entry:
604612
fence syncscope("workgroup-one-as") acq_rel
@@ -624,7 +632,9 @@ define amdgpu_kernel void @workgroup_one_as_seq_cst() #0 {
624632
;
625633
; GFX10CU-LABEL: name: workgroup_one_as_seq_cst
626634
; GFX10CU: bb.0.entry:
635+
; GFX10CU-NEXT: S_WAITCNT_soft 16240
627636
; GFX10CU-NEXT: S_WAITCNT_lds_direct
637+
; GFX10CU-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
628638
; GFX10CU-NEXT: S_ENDPGM 0
629639
;
630640
; GFX11WGP-LABEL: name: workgroup_one_as_seq_cst
@@ -636,6 +646,8 @@ define amdgpu_kernel void @workgroup_one_as_seq_cst() #0 {
636646
;
637647
; GFX11CU-LABEL: name: workgroup_one_as_seq_cst
638648
; GFX11CU: bb.0.entry:
649+
; GFX11CU-NEXT: S_WAITCNT_soft 1015
650+
; GFX11CU-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
639651
; GFX11CU-NEXT: S_ENDPGM 0
640652
entry:
641653
fence syncscope("workgroup-one-as") seq_cst
@@ -1305,8 +1317,9 @@ define amdgpu_kernel void @workgroup_release() #0 {
13051317
;
13061318
; GFX10CU-LABEL: name: workgroup_release
13071319
; GFX10CU: bb.0.entry:
1308-
; GFX10CU-NEXT: S_WAITCNT_soft 49279
1320+
; GFX10CU-NEXT: S_WAITCNT_soft 112
13091321
; GFX10CU-NEXT: S_WAITCNT_lds_direct
1322+
; GFX10CU-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
13101323
; GFX10CU-NEXT: S_ENDPGM 0
13111324
;
13121325
; GFX11WGP-LABEL: name: workgroup_release
@@ -1317,7 +1330,8 @@ define amdgpu_kernel void @workgroup_release() #0 {
13171330
;
13181331
; GFX11CU-LABEL: name: workgroup_release
13191332
; GFX11CU: bb.0.entry:
1320-
; GFX11CU-NEXT: S_WAITCNT_soft 64519
1333+
; GFX11CU-NEXT: S_WAITCNT_soft 7
1334+
; GFX11CU-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
13211335
; GFX11CU-NEXT: S_ENDPGM 0
13221336
entry:
13231337
fence syncscope("workgroup") release
@@ -1345,8 +1359,9 @@ define amdgpu_kernel void @workgroup_acq_rel() #0 {
13451359
;
13461360
; GFX10CU-LABEL: name: workgroup_acq_rel
13471361
; GFX10CU: bb.0.entry:
1348-
; GFX10CU-NEXT: S_WAITCNT_soft 49279
1362+
; GFX10CU-NEXT: S_WAITCNT_soft 112
13491363
; GFX10CU-NEXT: S_WAITCNT_lds_direct
1364+
; GFX10CU-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
13501365
; GFX10CU-NEXT: S_ENDPGM 0
13511366
;
13521367
; GFX11WGP-LABEL: name: workgroup_acq_rel
@@ -1358,7 +1373,8 @@ define amdgpu_kernel void @workgroup_acq_rel() #0 {
13581373
;
13591374
; GFX11CU-LABEL: name: workgroup_acq_rel
13601375
; GFX11CU: bb.0.entry:
1361-
; GFX11CU-NEXT: S_WAITCNT_soft 64519
1376+
; GFX11CU-NEXT: S_WAITCNT_soft 7
1377+
; GFX11CU-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
13621378
; GFX11CU-NEXT: S_ENDPGM 0
13631379
entry:
13641380
fence syncscope("workgroup") acq_rel
@@ -1386,8 +1402,9 @@ define amdgpu_kernel void @workgroup_seq_cst() #0 {
13861402
;
13871403
; GFX10CU-LABEL: name: workgroup_seq_cst
13881404
; GFX10CU: bb.0.entry:
1389-
; GFX10CU-NEXT: S_WAITCNT_soft 49279
1405+
; GFX10CU-NEXT: S_WAITCNT_soft 112
13901406
; GFX10CU-NEXT: S_WAITCNT_lds_direct
1407+
; GFX10CU-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
13911408
; GFX10CU-NEXT: S_ENDPGM 0
13921409
;
13931410
; GFX11WGP-LABEL: name: workgroup_seq_cst
@@ -1399,7 +1416,8 @@ define amdgpu_kernel void @workgroup_seq_cst() #0 {
13991416
;
14001417
; GFX11CU-LABEL: name: workgroup_seq_cst
14011418
; GFX11CU: bb.0.entry:
1402-
; GFX11CU-NEXT: S_WAITCNT_soft 64519
1419+
; GFX11CU-NEXT: S_WAITCNT_soft 7
1420+
; GFX11CU-NEXT: S_WAITCNT_VSCNT_soft undef $sgpr_null, 0
14031421
; GFX11CU-NEXT: S_ENDPGM 0
14041422
entry:
14051423
fence syncscope("workgroup") seq_cst

llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -150,7 +150,6 @@ define amdgpu_kernel void @barrier_release(<4 x i32> inreg %rsrc,
150150
; GFX10CU-NEXT: buffer_load_dword v0, s[8:11], 0 offen lds
151151
; GFX10CU-NEXT: v_mov_b32_e32 v0, s13
152152
; GFX10CU-NEXT: s_waitcnt vmcnt(0)
153-
; GFX10CU-NEXT: s_waitcnt_depctr 0xffe3
154153
; GFX10CU-NEXT: s_barrier
155154
; GFX10CU-NEXT: ds_read_b32 v0, v0
156155
; GFX10CU-NEXT: s_waitcnt lgkmcnt(0)

llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll

Lines changed: 8 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,6 @@ define amdgpu_kernel void @test_s_barrier() {
1515
;
1616
; GFX10-CU-LABEL: test_s_barrier:
1717
; GFX10-CU: ; %bb.0: ; %entry
18-
; GFX10-CU-NEXT: s_waitcnt_depctr 0xffe3
1918
; GFX10-CU-NEXT: s_barrier
2019
; GFX10-CU-NEXT: s_endpgm
2120
;
@@ -26,7 +25,6 @@ define amdgpu_kernel void @test_s_barrier() {
2625
;
2726
; GFX11-CU-LABEL: test_s_barrier:
2827
; GFX11-CU: ; %bb.0: ; %entry
29-
; GFX11-CU-NEXT: s_waitcnt_depctr 0xffe3
3028
; GFX11-CU-NEXT: s_barrier
3129
; GFX11-CU-NEXT: s_endpgm
3230
;
@@ -38,7 +36,6 @@ define amdgpu_kernel void @test_s_barrier() {
3836
;
3937
; GFX12-CU-LABEL: test_s_barrier:
4038
; GFX12-CU: ; %bb.0: ; %entry
41-
; GFX12-CU-NEXT: s_wait_alu 0xffe3
4239
; GFX12-CU-NEXT: s_barrier_signal -1
4340
; GFX12-CU-NEXT: s_barrier_wait -1
4441
; GFX12-CU-NEXT: s_endpgm
@@ -63,8 +60,8 @@ define amdgpu_kernel void @test_s_barrier_workgroup_fence() {
6360
;
6461
; GFX10-CU-LABEL: test_s_barrier_workgroup_fence:
6562
; GFX10-CU: ; %bb.0: ; %entry
66-
; GFX10-CU-NEXT: s_waitcnt lgkmcnt(0)
67-
; GFX10-CU-NEXT: s_waitcnt_depctr 0xffe3
63+
; GFX10-CU-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
64+
; GFX10-CU-NEXT: s_waitcnt_vscnt null, 0x0
6865
; GFX10-CU-NEXT: s_barrier
6966
; GFX10-CU-NEXT: s_endpgm
7067
;
@@ -77,8 +74,8 @@ define amdgpu_kernel void @test_s_barrier_workgroup_fence() {
7774
;
7875
; GFX11-CU-LABEL: test_s_barrier_workgroup_fence:
7976
; GFX11-CU: ; %bb.0: ; %entry
80-
; GFX11-CU-NEXT: s_waitcnt lgkmcnt(0)
81-
; GFX11-CU-NEXT: s_waitcnt_depctr 0xffe3
77+
; GFX11-CU-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
78+
; GFX11-CU-NEXT: s_waitcnt_vscnt null, 0x0
8279
; GFX11-CU-NEXT: s_barrier
8380
; GFX11-CU-NEXT: s_endpgm
8481
;
@@ -94,8 +91,10 @@ define amdgpu_kernel void @test_s_barrier_workgroup_fence() {
9491
;
9592
; GFX12-CU-LABEL: test_s_barrier_workgroup_fence:
9693
; GFX12-CU: ; %bb.0: ; %entry
97-
; GFX12-CU-NEXT: s_wait_dscnt 0x0
98-
; GFX12-CU-NEXT: s_wait_alu 0xffe3
94+
; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
95+
; GFX12-CU-NEXT: s_wait_samplecnt 0x0
96+
; GFX12-CU-NEXT: s_wait_storecnt 0x0
97+
; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
9998
; GFX12-CU-NEXT: s_barrier_signal -1
10099
; GFX12-CU-NEXT: s_barrier_wait -1
101100
; GFX12-CU-NEXT: s_endpgm
@@ -125,7 +124,6 @@ define amdgpu_kernel void @test_s_barrier_agent_fence() {
125124
; GFX10-CU: ; %bb.0: ; %entry
126125
; GFX10-CU-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
127126
; GFX10-CU-NEXT: s_waitcnt_vscnt null, 0x0
128-
; GFX10-CU-NEXT: s_waitcnt_depctr 0xffe3
129127
; GFX10-CU-NEXT: s_barrier
130128
; GFX10-CU-NEXT: s_endpgm
131129
;
@@ -140,7 +138,6 @@ define amdgpu_kernel void @test_s_barrier_agent_fence() {
140138
; GFX11-CU: ; %bb.0: ; %entry
141139
; GFX11-CU-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
142140
; GFX11-CU-NEXT: s_waitcnt_vscnt null, 0x0
143-
; GFX11-CU-NEXT: s_waitcnt_depctr 0xffe3
144141
; GFX11-CU-NEXT: s_barrier
145142
; GFX11-CU-NEXT: s_endpgm
146143
;
@@ -160,7 +157,6 @@ define amdgpu_kernel void @test_s_barrier_agent_fence() {
160157
; GFX12-CU-NEXT: s_wait_samplecnt 0x0
161158
; GFX12-CU-NEXT: s_wait_storecnt 0x0
162159
; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
163-
; GFX12-CU-NEXT: s_wait_alu 0xffe3
164160
; GFX12-CU-NEXT: s_barrier_signal -1
165161
; GFX12-CU-NEXT: s_barrier_wait -1
166162
; GFX12-CU-NEXT: s_endpgm

0 commit comments

Comments
 (0)