@Pierre-vh

The previous implementation only waited on vm_vsrc(0), which works to
make sure all threads in the workgroup see the stores done before the barrier.
However, despite seeing the stores, the threads were unable to release them
at a wider scope. This caused failures in Vulkan CTS tests.

To correctly fulfill the memory model semantics, which require happens-before
to be transitive, we must wait for the stores to actually complete before the barrier,
so that another thread can release them.

Note that we still don't need to do anything for WGP mode because release fences
are strong enough in that mode. This only applies to CU mode because CU release
fences do not emit any code.

Solves SC1-6454

@Pierre-vh Pierre-vh marked this pull request as ready for review September 24, 2025 11:10

llvmbot commented Sep 24, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Full diff: https://github.com/llvm/llvm-project/pull/160501.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+29-8)
  • (modified) llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll (-1)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll (+6-10)
diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index c85d2bb9fe9ae..ba6c29a855ebf 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -637,6 +637,8 @@ class SIGfx12CacheControl : public SIGfx11CacheControl {
                      SIAtomicAddrSpace AddrSpace, bool IsCrossAddrSpaceOrdering,
                      Position Pos) const override;
 
+  bool insertBarrierStart(MachineBasicBlock::iterator &MI) const override;
+
   bool enableLoadCacheBypass(const MachineBasicBlock::iterator &MI,
                              SIAtomicScope Scope,
                              SIAtomicAddrSpace AddrSpace) const override {
@@ -2174,17 +2176,19 @@ bool SIGfx10CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
 
 bool SIGfx10CacheControl::insertBarrierStart(
     MachineBasicBlock::iterator &MI) const {
-  // We need to wait on vm_vsrc so barriers can pair with fences in GFX10+ CU
-  // mode. This is because a CU mode release fence does not emit any wait, which
-  // is fine when only dealing with vmem, but isn't sufficient in the presence
-  // of barriers which do not go through vmem.
-  // GFX12.5 does not require this additional wait.
-  if (!ST.isCuModeEnabled() || ST.hasGFX1250Insts())
+  if (!ST.isCuModeEnabled())
     return false;
 
+  // GFX10/11 CU MODE Workgroup fences do not emit anything.
+  // In the presence of barriers, we want to make sure previous memory
+  // operations are actually visible and can be released at a wider scope by
+  // another thread upon exiting the barrier. To make this possible, we must
+  // wait on previous stores.
+
   BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),
-          TII->get(AMDGPU::S_WAITCNT_DEPCTR))
-      .addImm(AMDGPU::DepCtr::encodeFieldVmVsrc(0));
+          TII->get(AMDGPU::S_WAITCNT_VSCNT_soft))
+      .addReg(AMDGPU::SGPR_NULL, RegState::Undef)
+      .addImm(0);
   return true;
 }
 
@@ -2570,6 +2574,23 @@ bool SIGfx12CacheControl::insertRelease(MachineBasicBlock::iterator &MI,
   return Changed;
 }
 
+bool SIGfx12CacheControl::insertBarrierStart(
+    MachineBasicBlock::iterator &MI) const {
+  if (!ST.isCuModeEnabled() || ST.hasGFX1250Insts())
+    return false;
+
+  // GFX12 CU MODE Workgroup fences do not emit anything (except in GFX12.5).
+  // In the presence of barriers, we want to make sure previous memory
+  // operations are actually visible and can be released at a wider scope by
+  // another thread upon exiting the barrier. To make this possible, we must
+  // wait on previous stores.
+
+  BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),
+          TII->get(AMDGPU::S_WAIT_STORECNT_soft))
+      .addImm(0);
+  return true;
+}
+
 bool SIGfx12CacheControl::enableVolatileAndOrNonTemporal(
     MachineBasicBlock::iterator &MI, SIAtomicAddrSpace AddrSpace, SIMemOp Op,
     bool IsVolatile, bool IsNonTemporal, bool IsLastUse = false) const {
diff --git a/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll b/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll
index b91963f08681c..d23509b5aa812 100644
--- a/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll
+++ b/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll
@@ -150,7 +150,6 @@ define amdgpu_kernel void @barrier_release(<4 x i32> inreg %rsrc,
 ; GFX10CU-NEXT:    buffer_load_dword v0, s[8:11], 0 offen lds
 ; GFX10CU-NEXT:    v_mov_b32_e32 v0, s13
 ; GFX10CU-NEXT:    s_waitcnt vmcnt(0)
-; GFX10CU-NEXT:    s_waitcnt_depctr 0xffe3
 ; GFX10CU-NEXT:    s_barrier
 ; GFX10CU-NEXT:    ds_read_b32 v0, v0
 ; GFX10CU-NEXT:    s_waitcnt lgkmcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll
index 516c3946f63dc..b5c3cce160a22 100644
--- a/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll
@@ -15,7 +15,7 @@ define amdgpu_kernel void @test_s_barrier() {
 ;
 ; GFX10-CU-LABEL: test_s_barrier:
 ; GFX10-CU:       ; %bb.0: ; %entry
-; GFX10-CU-NEXT:    s_waitcnt_depctr 0xffe3
+; GFX10-CU-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-CU-NEXT:    s_barrier
 ; GFX10-CU-NEXT:    s_endpgm
 ;
@@ -26,7 +26,7 @@ define amdgpu_kernel void @test_s_barrier() {
 ;
 ; GFX11-CU-LABEL: test_s_barrier:
 ; GFX11-CU:       ; %bb.0: ; %entry
-; GFX11-CU-NEXT:    s_waitcnt_depctr 0xffe3
+; GFX11-CU-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-CU-NEXT:    s_barrier
 ; GFX11-CU-NEXT:    s_endpgm
 ;
@@ -38,7 +38,7 @@ define amdgpu_kernel void @test_s_barrier() {
 ;
 ; GFX12-CU-LABEL: test_s_barrier:
 ; GFX12-CU:       ; %bb.0: ; %entry
-; GFX12-CU-NEXT:    s_wait_alu 0xffe3
+; GFX12-CU-NEXT:    s_wait_storecnt 0x0
 ; GFX12-CU-NEXT:    s_barrier_signal -1
 ; GFX12-CU-NEXT:    s_barrier_wait -1
 ; GFX12-CU-NEXT:    s_endpgm
@@ -64,7 +64,7 @@ define amdgpu_kernel void @test_s_barrier_workgroup_fence() {
 ; GFX10-CU-LABEL: test_s_barrier_workgroup_fence:
 ; GFX10-CU:       ; %bb.0: ; %entry
 ; GFX10-CU-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX10-CU-NEXT:    s_waitcnt_depctr 0xffe3
+; GFX10-CU-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-CU-NEXT:    s_barrier
 ; GFX10-CU-NEXT:    s_endpgm
 ;
@@ -78,7 +78,7 @@ define amdgpu_kernel void @test_s_barrier_workgroup_fence() {
 ; GFX11-CU-LABEL: test_s_barrier_workgroup_fence:
 ; GFX11-CU:       ; %bb.0: ; %entry
 ; GFX11-CU-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-CU-NEXT:    s_waitcnt_depctr 0xffe3
+; GFX11-CU-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-CU-NEXT:    s_barrier
 ; GFX11-CU-NEXT:    s_endpgm
 ;
@@ -94,8 +94,7 @@ define amdgpu_kernel void @test_s_barrier_workgroup_fence() {
 ;
 ; GFX12-CU-LABEL: test_s_barrier_workgroup_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
-; GFX12-CU-NEXT:    s_wait_dscnt 0x0
-; GFX12-CU-NEXT:    s_wait_alu 0xffe3
+; GFX12-CU-NEXT:    s_wait_storecnt_dscnt 0x0
 ; GFX12-CU-NEXT:    s_barrier_signal -1
 ; GFX12-CU-NEXT:    s_barrier_wait -1
 ; GFX12-CU-NEXT:    s_endpgm
@@ -125,7 +124,6 @@ define amdgpu_kernel void @test_s_barrier_agent_fence() {
 ; GFX10-CU:       ; %bb.0: ; %entry
 ; GFX10-CU-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX10-CU-NEXT:    s_waitcnt_vscnt null, 0x0
-; GFX10-CU-NEXT:    s_waitcnt_depctr 0xffe3
 ; GFX10-CU-NEXT:    s_barrier
 ; GFX10-CU-NEXT:    s_endpgm
 ;
@@ -140,7 +138,6 @@ define amdgpu_kernel void @test_s_barrier_agent_fence() {
 ; GFX11-CU:       ; %bb.0: ; %entry
 ; GFX11-CU-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX11-CU-NEXT:    s_waitcnt_vscnt null, 0x0
-; GFX11-CU-NEXT:    s_waitcnt_depctr 0xffe3
 ; GFX11-CU-NEXT:    s_barrier
 ; GFX11-CU-NEXT:    s_endpgm
 ;
@@ -160,7 +157,6 @@ define amdgpu_kernel void @test_s_barrier_agent_fence() {
 ; GFX12-CU-NEXT:    s_wait_samplecnt 0x0
 ; GFX12-CU-NEXT:    s_wait_storecnt 0x0
 ; GFX12-CU-NEXT:    s_wait_loadcnt_dscnt 0x0
-; GFX12-CU-NEXT:    s_wait_alu 0xffe3
 ; GFX12-CU-NEXT:    s_barrier_signal -1
 ; GFX12-CU-NEXT:    s_barrier_wait -1
 ; GFX12-CU-NEXT:    s_endpgm

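The choice made by the two `insertBarrierStart` overrides in the diff above can be summarized in a small standalone model. This is only a sketch for reading the test changes: `Gen`, `Mode` and `barrierStartWait` are hypothetical names invented here, not LLVM APIs, and the returned strings are simply the assembly lines that appear in the updated `GFX10-CU`, `GFX11-CU` and `GFX12-CU` checks.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for the subtarget queries used in the patch
// (ST.isCuModeEnabled(), ST.hasGFX1250Insts()); not real LLVM types.
enum class Gen { GFX10, GFX11, GFX12, GFX1250 };
enum class Mode { CU, WGP };

// Returns the wait this patch places right before a barrier, or an empty
// string when nothing is inserted.
std::string barrierStartWait(Gen G, Mode M) {
  // WGP mode: release fences already emit the required waits.
  if (M != Mode::CU)
    return "";

  switch (G) {
  case Gen::GFX10:
  case Gen::GFX11:
    // S_WAITCNT_VSCNT_soft with a null SGPR and an immediate of 0.
    return "s_waitcnt_vscnt null, 0x0";
  case Gen::GFX12:
    // S_WAIT_STORECNT_soft with an immediate of 0.
    return "s_wait_storecnt 0x0";
  case Gen::GFX1250:
    // The patch returns early for GFX12.5.
    return "";
  }
  return "";
}
```

Calling `barrierStartWait(Gen::GFX12, Mode::CU)` from a `main` yields the `s_wait_storecnt 0x0` line that replaces `s_wait_alu 0xffe3` in the `test_s_barrier` checks.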
@perlfu perlfu left a comment

LGTM - as far as an immediate fix for the issue goes.

Beyond the scope of this change... It would be nice if we could define where vm_vsrc(0) would be sufficient, and be able to apply that as an optimization. My suspicion is that it is sufficient in the majority of graphics scenarios.

jayfoad commented Sep 24, 2025

Does this need a change in the memory model docs?

@nhaehnle nhaehnle left a comment

I don't think we should ever unconditionally insert waits purely based on a barrier. This pessimizes LDS-only barriers too much.

Quite frankly, I'd rather admit that trying to reduce CU-mode release fences to only VM_VSRC waits was a mistake, and go back to using the same waits there as for WGP-mode.

@nhaehnle

> Beyond the scope of this change... It would be nice if we could define where vm_vsrc(0) would be sufficient, and be able to apply that as an optimization. My suspicion is that it is sufficient in the majority of graphics scenarios.

I'm not so sure about that, unfortunately. The example you showed offline showed a problem when wave A has a workgroup-scope release fence and then wave B has an agent-scope release fence that should push out the data from A as well.

Since we can't do long-distance static analysis to understand what happens in other waves, we can only reduce a workgroup-scope release fence to vm_vsrc(0) if we change the code sequence for agent-scope release fences in a way that establishes this guarantee, and I'm not convinced that the hardware we have guarantees that. We can follow up offline.

@Pierre-vh

> Quite frankly, I'd rather admit that trying to reduce CU-mode release fences to only VM_VSRC waits was a mistake, and go back to using the same waits there as for WGP-mode.

We never emitted a vm_vsrc(0) wait for workgroup release fences in CU mode.

> This pessimizes LDS-only barriers too much.

Which LDS-only barriers? The check only includes S_BARRIER.

Do you think we should instead pessimize all workgroup release fences in CU mode so they have a wait on storecnt?

@Pierre-vh

> Does this need a change in the memory model docs?

We don't have the documentation for that upstream yet; I'll fix the downstream one.

@Pierre-vh Pierre-vh requested a review from nhaehnle September 25, 2025 09:21

nhaehnle commented Oct 1, 2025

> This pessimizes LDS-only barriers too much.

> Which LDS-only barriers? The check only includes S_BARRIER.

Right, and an S_BARRIER by itself is not (or shouldn't be) evidence of any desired memory ordering. We need a fence for that, and if the fence is limited to LDS, then there should not be any loadcnt or storecnt waits.

> Do you think we should instead pessimize all workgroup release fences in CU mode so they have a wait on storecnt?

Is it a pessimization? I don't think so. Isn't the example @perlfu gave offline evidence that if a release fence intends to fence global memory, then a storecnt wait is pretty much unavoidable?

Edit for context: The example was:

image store N
fence release workgroup scope
workgroup barrier
fence acquire workgroup scope
(conditional) fence release device (agent) scope
(conditional) atomic store to X
atomic load from Y
fence acquire device (agent) scope
image load M

... where Y/M of thread B is X/N of thread A. The image load in thread B should see the store in A (assuming the atomic load in B is after the atomic store in A in the memory location order of X == Y).

For that to be guaranteed, the workgroup scope release fence has to imply a storecnt wait.

But it shouldn't really matter that the synchronization mechanism in question is a barrier: the same would be true if the barrier was replaced e.g. by LDS atomics.
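The transitivity requirement in this example can be illustrated with a host-side C++ analogy. This is a sketch under loud assumptions: `std::atomic` has no notion of `syncscope`, so the workgroup-level and agent-level synchronization are both modeled with ordinary release/acquire, the barrier is replaced by an atomic flag (as the comment above suggests is equivalent), and the names `relayObservedPayload`, `inner` and `outer` are made up for illustration. It shows what the memory model promises, not what any GPU code sequence does.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Three-thread relay: the writer publishes to a relay thread, and the relay
// thread republishes to a reader that never synchronizes with the writer
// directly. Happens-before must be transitive for the reader to see the data.
int relayObservedPayload() {
  int payload = 0;            // plain data, standing in for "image store N"
  std::atomic<int> inner{0};  // stands in for the workgroup-level sync
  std::atomic<int> outer{0};  // stands in for the agent-scope atomic (X == Y)
  int seen = -1;

  std::thread writer([&] {
    payload = 42;
    inner.store(1, std::memory_order_release);  // "release, workgroup scope"
  });

  std::thread relay([&] {
    while (inner.load(std::memory_order_acquire) != 1)
      std::this_thread::yield();
    // This release must also publish the writer's store to `payload`.
    outer.store(1, std::memory_order_release);  // "release, agent scope"
  });

  std::thread reader([&] {
    while (outer.load(std::memory_order_acquire) != 1)
      std::this_thread::yield();
    seen = payload;  // ordered after the write through writer -> relay -> reader
  });

  writer.join();
  relay.join();
  reader.join();
  return seen;
}
```

In the C++ abstract machine the reader is guaranteed to observe `42`. The bug discussed in this thread is the GPU counterpart of the relay step failing to publish the writer's store, because that store had not actually completed when the relay thread performed its wider-scope release.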


ssahasra commented Oct 1, 2025

> Do you think we should instead pessimize all workgroup release fences in CU mode so they have a wait on storecnt?

> Is it a pessimization? I don't think so. Isn't the example @perlfu gave offline evidence that if a release fence intends to fence global memory, then a storecnt wait is pretty much unavoidable?

I agree. It's a bug fix, not a pessimization. On the other hand, the programmer may know that a certain part of the program only cares about synchronization within the workgroup. For such a program, opting out of transitivity is an optimization, which needs a way to be expressed in LLVM IR.

@Pierre-vh

> Do you think we should instead pessimize all workgroup release fences in CU mode so they have a wait on storecnt?

> Is it a pessimization? I don't think so. Isn't the example @perlfu gave offline evidence that if a release fence intends to fence global memory, then a storecnt wait is pretty much unavoidable?

> I agree. It's a bug fix, not a pessimization. On the other hand, the programmer may know that a certain part of the program only cares about synchronization within the workgroup. For such a program, opting out of transitivity is an optimization, which needs a way to be expressed in LLVM IR.

As I understand it, it's fine to not wait because the release only occurs when the other thread observes the atomic store done as part of the release sequence. This is why we need to spin (loop) on an acquire when we don't have a barrier, for example: we know the release didn't occur until we load the right value.
So if we take barriers out of the picture, it is fine to not wait because when the store is seen, all previous stores are seen as well (for CU mode workgroup scope).

The problem here is very barrier specific because we're introducing a model where we synchronize without the classic release/acquire sequences that rely on an atomic store. Instead we're adding a barrier + fence pairing, and we synchronize when leaving the barrier. We remove the requirement to spin on the acquire when a barrier is present.

@Pierre-vh

After thinking about this a bit more, I think you're both right. There's a bug. I think in the absence of waits, we could have the following situation:

Thread 0:
  store atomic A relaxed
  store atomic B release syncscope("workgroup")

Then another thread in the workgroup could see the store to B without having the certainty that the store to A is done as well.
The store to A could be held in another memory channel, for example.

It'd require some unfortunate timing to see that happen without barriers (hence why we never observed it until now), but as proved here it's possible to see it when using barriers.

We can fix it by using the same waits for WGP/CU mode.
Does everyone agree with that?
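The two-store scenario in the comment above can be written down as host C++ to make the expected guarantee explicit. This is a hedged analogy only: `std::atomic` has no `syncscope("workgroup")`, so plain release/acquire stands in for it, and the function name `observedAAfterSeeingB` is invented for this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Thread 0 stores A (relaxed) and then B (release). Thread 1 spins on B with
// acquire loads and, once it has seen B, reads A.
int observedAAfterSeeingB() {
  std::atomic<int> A{0};
  std::atomic<int> B{0};
  int seenA = -1;

  std::thread t0([&] {
    A.store(1, std::memory_order_relaxed);
    B.store(1, std::memory_order_release);
  });

  std::thread t1([&] {
    // Without a barrier, the acquire side has to spin until it observes the
    // released value; only then has synchronization taken place.
    while (B.load(std::memory_order_acquire) != 1)
      std::this_thread::yield();
    seenA = A.load(std::memory_order_relaxed);
  });

  t0.join();
  t1.join();
  return seenA;
}
```

The memory model requires `seenA` to be `1` here. The concern raised in the comment is that a CU-mode code sequence with no wait before the release store might let B become visible while the store to A is still in flight.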


ssahasra commented Oct 1, 2025

Being safe sounds good to me. I honestly haven't thought about the various combinations of a store that precedes an atomic store-release operation, in particular about what hardware memory model the programming guide implies for the different counters.

@Pierre-vh Pierre-vh closed this Oct 2, 2025
Pierre-vh added a commit that referenced this pull request Oct 21, 2025
…161638)

They were previously optimized to not emit any waitcnt, which is
technically correct because there is no reordering of operations at
workgroup scope in CU mode for GFX10+.

This breaks transitivity however, for example if we have the following
sequence of events in one thread:

- some stores
- store atomic release syncscope("workgroup")
- barrier

then another thread follows with

- barrier
- load atomic acquire
- store atomic release syncscope("agent")

It does not work because, while the other thread sees the stores, it
cannot release them at the wider scope. Our release fences aren't strong
enough to "wait" on stores from other waves.

We also cannot strengthen our release fences any further to allow for
releasing other waves' stores because only GFX12 can do that with
`global_wb`. GFX10-11 do not have the writeback instruction.
It'd also add yet another level of complexity to code sequences, with
both acquire/release having CU-mode only alternatives.
Lastly, acq/rel are always used together. The price for synchronization
has to be paid either at the acq, or the rel. Strengthening the releases
would just make the memory model more complex but wouldn't help
performance.

So the choice here is to streamline the code sequences by making CU and
WGP mode emit almost identical (vL0 inv is not needed in CU mode) code
for release (or stronger) atomic ordering.

This also removes the `vm_vsrc(0)` wait before barriers. Now that the
release fence in CU mode is strong enough, it is no longer needed.

Supersedes #160501
Solves SC1-6454
ronlieb added a commit to ROCm/llvm-project that referenced this pull request Oct 21, 2025
* [flang] Fix standalone build regression from llvm#161179 (llvm#164309)

Fix incorrect linking and dependencies introduced in llvm#161179 that break
standalone builds of Flang.

Signed-off-by: Michał Górny <[email protected]>

* [AMDGPU] Remove magic constants from V_PK_ADD_F32 pattern. NFC (llvm#164335)

* [AMDGPU] Update code sequence for CU-mode Release Fences in GFX10+ (llvm#161638)

* [InstSimplify] Support ptrtoaddr in simplifyGEPInst() (llvm#164262)

This adds support for ptrtoaddr in the `ptradd p, ptrtoaddr(p2) -
ptrtoaddr(p) -> p2` fold.

This fold requires that p and p2 have the same underlying object
(otherwise the provenance may not be the same).

The argument I would like to make here is that because the underlying
objects are the same (and the pointers in the same address space), the
non-address bits of the pointer must be the same. Looking at some
specific cases of underlying object relationship:

* phi/select: Trivially true.
* getelementptr: Only modifies address bits, non-address bits must
remain the same.
* addrspacecast round-trip cast: Must preserve all bits because we
optimize such round-trip casts away.
* non-interposable global alias: I'm a bit unsure about this one, but I
guess the alias and the aliasee must have the same non-address bits?
* various intrinsics like launder.invariant.group, ptrmask. I think
these all either preserve all pointer bits (like the invariant.group
ones) or at least the non-address bits (like ptrmask). There are some
interesting cases like amdgcn.make.buffer.rsrc, but those are cross
address-space.

-----

There is a second `gep (gep p, C), (sub 0, ptrtoint(p)) -> C` transform
in this function, which I am not extending to handle ptrtoaddr, adding
negative tests instead. This transform is overall dubious for provenance
reasons, but especially dubious with ptrtoaddr, as then we don't have
the guarantee that provenance of `p` has been exposed.

* [Hexagon] Add REQUIRES: asserts to test

This test uses -debug-only, so needs an assertion-enabled build.

* [AArch64] Combing scalar_to_reg into DUP if the DUP already exists (llvm#160499)

If we already have a dup(x) as part of the DAG along with a
scalar_to_vec(x), we can re-use the result of the dup to the
scalar_to_vec(x).

* [CAS] OnDiskGraphDB - fix MSVC "not all control paths return a value" warnings. NFC. (llvm#164369)

* Reapply "[libc++] Optimize __hash_table::erase(iterator, iterator)" (llvm#162850)

This reapplication fixes the use after free caused by not properly
updating the bucket list in one case.

Original commit message:
Instead of just calling the single element `erase` on every element of
the range, we can combine some of the operations in a custom
implementation. Specifically, we don't need to search for the previous
node or re-link the list every iteration. Removing this unnecessary work
results in some nice performance improvements:
```
-----------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                             old           new
-----------------------------------------------------------------------------------------------------------------------
std::unordered_set<int>::erase(iterator, iterator) (erase half the container)/0                    457 ns        459 ns
std::unordered_set<int>::erase(iterator, iterator) (erase half the container)/32                   995 ns        626 ns
std::unordered_set<int>::erase(iterator, iterator) (erase half the container)/1024               18196 ns       7995 ns
std::unordered_set<int>::erase(iterator, iterator) (erase half the container)/8192              124722 ns      70125 ns
std::unordered_set<std::string>::erase(iterator, iterator) (erase half the container)/0            456 ns        461 ns
std::unordered_set<std::string>::erase(iterator, iterator) (erase half the container)/32          1183 ns        769 ns
std::unordered_set<std::string>::erase(iterator, iterator) (erase half the container)/1024       27827 ns      18614 ns
std::unordered_set<std::string>::erase(iterator, iterator) (erase half the container)/8192      266681 ns     226107 ns
std::unordered_map<int, int>::erase(iterator, iterator) (erase half the container)/0               455 ns        462 ns
std::unordered_map<int, int>::erase(iterator, iterator) (erase half the container)/32              996 ns        659 ns
std::unordered_map<int, int>::erase(iterator, iterator) (erase half the container)/1024          15963 ns       8108 ns
std::unordered_map<int, int>::erase(iterator, iterator) (erase half the container)/8192         136493 ns      71848 ns
std::unordered_multiset<int>::erase(iterator, iterator) (erase half the container)/0               454 ns        455 ns
std::unordered_multiset<int>::erase(iterator, iterator) (erase half the container)/32              985 ns        703 ns
std::unordered_multiset<int>::erase(iterator, iterator) (erase half the container)/1024          16277 ns       9085 ns
std::unordered_multiset<int>::erase(iterator, iterator) (erase half the container)/8192         125736 ns      82710 ns
std::unordered_multimap<int, int>::erase(iterator, iterator) (erase half the container)/0          457 ns        454 ns
std::unordered_multimap<int, int>::erase(iterator, iterator) (erase half the container)/32        1091 ns        646 ns
std::unordered_multimap<int, int>::erase(iterator, iterator) (erase half the container)/1024     17784 ns       7664 ns
std::unordered_multimap<int, int>::erase(iterator, iterator) (erase half the container)/8192    127098 ns      72806 ns
```


This reverts commit acc3a62.

* [TableGen] List the indices of sub-operands (llvm#163723)

Some instances of the `Operand` class used in Tablegen instruction
definitions expand to a cluster of multiple operands at the MC layer,
such as complex addressing modes involving base + offset + shift, or
clusters of operands describing conditional Arm instructions or
predicated MVE instructions. There's currently no convenient way for C++
code to know the offset of one of those sub-operands from the start of
the cluster: instead it just hard-codes magic numbers like `index+2`,
which is hard to read and fragile.

This patch adds an extra piece of output to `InstrInfoEmitter` to define
those instruction offsets, based on the name of the `Operand` class
instance in Tablegen, and the names assigned to the sub-operands in the
`MIOperandInfo` field. For example, if target Foo were to define

  def Bar : Operand {
    let MIOperandInfo = (ops GPR:$first, i32imm:$second);
    // ...
  }

then the new constants would be `Foo::SUBOP_Bar_first` and
`Foo::SUBOP_Bar_second`, defined as 0 and 1 respectively.

As an example, I've converted some magic numbers related to the MVE
predication operand types (`vpred_n` and its superset `vpred_r`) to use
the new named constants in place of the integer literals they previously
used. This is more verbose, but also clearer, because it explains why
the integer is chosen instead of what its value is.

* [lldb] Add bidirectional packetLog to gdbclientutils.py (llvm#162176)

While debugging the tests for llvm#155000 I found it helpful to have both
sides of the simulated gdb-rsp traffic rather than just the responses, so
I've extended the packetLog in MockGDBServerResponder to record traffic in
both directions. Tests have been updated accordingly.

* [MLIR] [Vector] Added canonicalizer for folding from_elements + transpose (llvm#161841)

## Description
Adds a new canonicalizer that folds
`vector.from_elements(vector.transpose))` => `vector.from_elements`.
This canonicalization reorders the input elements for
`vector.from_elements`, adjusts the output shape to match the effect of
the transpose op and eliminating its need.

## Testing
Added a 2D vector lit test that verifies the working of the rewrite.

---------

Signed-off-by: Keshav Vinayak Jha <[email protected]>

* [DA] Add initial support for monotonicity check (llvm#162280)

The dependence testing functions in DA assume that the analyzed AddRec
does not wrap over the entire iteration space. For AddRecs that may
wrap, DA should conservatively return unknown dependence. However, no
validation is currently performed to ensure that this condition holds,
which can lead to incorrect results in some cases.

This patch introduces the notion of *monotonicity* and a validation
logic to check whether a SCEV is monotonic. The monotonicity check
classifies the SCEV into one of the following categories:

- Unknown: Nothing is known about the monotonicity of the SCEV.
- Invariant: The SCEV is loop-invariant.
- MultivariateSignedMonotonic: The SCEV doesn't wrap in a signed sense
for any iteration of the loops in the loop nest.

The current validation logic basically searches an affine AddRec
recursively and checks whether the `nsw` flag is present. Notably, it is
still unclear whether we should also have a category for unsigned
monotonicity.
The monotonicity check is still under development and disabled by
default for now. Since such a check is necessary to make DA sound, it
should be enabled by default once the functionality is sufficient.

Split off from llvm#154527.

* [VPlan] Use VPlan::getRegion to shorten code (NFC) (llvm#164287)

* [VPlan] Improve code using m_APInt (NFC) (llvm#161683)

* [SystemZ] Avoid trunc(add(X,X)) patterns (llvm#164378)

Replace with trunc(add(X,Y)) to avoid premature folding in upcoming patch llvm#164227

* [clang][CodeGen] Emit `llvm.tbaa.errno` metadata during module creation

Let Clang emit `llvm.tbaa.errno` metadata so that LLVM can
carry out optimizations around errno-writing libcalls, as
long as it is proved that the involved memory location does not alias
`errno`.

Previous discussion: https://discourse.llvm.org/t/rfc-modelling-errno-memory-effects/82972.

* [LV][NFC] Remove undef from phi incoming values (llvm#163762)

Split off from PR llvm#163525, this standalone patch replaces
use of undef as incoming PHI values with zero, in order
to reduce the likelihood of contributors hitting the
`undef deprecator` warning in GitHub.

* [DA] Add option to enable specific dependence test only (llvm#164245)

PR llvm#157084 added an option `da-run-siv-routines-only` to run only SIV
routines in DA. This PR replaces that option with a more fine-grained
one that also allows selecting routines other than SIV. This option
is useful for regression testing of individual DA routines. This patch
also reorganizes regression tests that use `da-run-siv-routines-only`.

* [libcxx] Optimize `std::generate_n` for segmented iterators (llvm#164266)

Part of llvm#102817.

This is a natural follow-up to llvm#163006. We are forwarding
`std::generate_n` to `std::__for_each_n` (`std::for_each_n` needs
c++17), resulting in improved performance for segmented iterators.

before:

```
std::generate_n(deque<int>)/32          17.5 ns         17.3 ns     40727273
std::generate_n(deque<int>)/50          25.7 ns         25.5 ns     26352941
std::generate_n(deque<int>)/1024         490 ns          487 ns      1445161
std::generate_n(deque<int>)/8192        3908 ns         3924 ns       179200
```

after:

```
std::generate_n(deque<int>)/32          11.1 ns         11.0 ns     64000000
std::generate_n(deque<int>)/50          16.1 ns         16.0 ns     44800000
std::generate_n(deque<int>)/1024         291 ns          292 ns      2357895
std::generate_n(deque<int>)/8192        2269 ns         2250 ns       298667
```
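
The forwarding idea can be sketched as follows. This is a minimal illustration, not the libc++ implementation: it uses the public C++17 `std::for_each_n` in place of the internal `std::__for_each_n`, and the names `generate_n_via_for_each_n` and `iota_deque` are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>

// Hypothetical stand-in for the forwarding described above.
// Each visited element is overwritten with the next generated value.
template <class OutputIt, class Size, class Generator>
OutputIt generate_n_via_for_each_n(OutputIt first, Size n, Generator gen) {
  // std::for_each_n returns the iterator one past the last visited element,
  // which matches the return value expected from generate_n.
  return std::for_each_n(first, n, [&gen](auto &elem) { elem = gen(); });
}

// Fills a deque (a segmented container) with 0, 1, 2, ...
inline std::deque<int> iota_deque(int count) {
  std::deque<int> d(count);
  int next = 0;
  generate_n_via_for_each_n(d.begin(), count, [&next] { return next++; });
  return d;
}
```

Any speedup for segmented iterators would come from the `for_each_n` implementation being able to walk each contiguous segment directly; the sketch itself only shows how the call is routed.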

* [BOLT] Check entry point address is not in constant island (llvm#163418)

In some cases `addEntryPointAtOffset` is called with an `Offset` that
points to an address within a constant island. This triggers
`assert(!isInConstantIsland(EntryPointAddress))` and causes BOLT
to crash. This patch adds a check that ignores functions that would add
such entry points and warns the user.
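
A simplified sketch of this kind of guard, under stated assumptions: `FunctionSketch` and its fields are hypothetical, and only the names `addEntryPointAtOffset`, `isInConstantIsland`, and `EntryPointAddress` come from the description above. The real check may live in the caller rather than in the method itself.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <set>

// Hypothetical, heavily simplified stand-in for a BOLT function object.
struct FunctionSketch {
  uint64_t Address = 0;
  std::set<uint64_t> IslandOffsets; // offsets occupied by constant-island data
  std::set<uint64_t> EntryOffsets;  // accepted secondary entry points
  bool Ignored = false;

  bool isInConstantIsland(uint64_t Addr) const {
    return Addr >= Address && IslandOffsets.count(Addr - Address) != 0;
  }

  // Returns true if the entry point was added. Otherwise warns, marks the
  // function as ignored, and returns false instead of asserting.
  bool addEntryPointAtOffset(uint64_t Offset) {
    const uint64_t EntryPointAddress = Address + Offset;
    if (isInConstantIsland(EntryPointAddress)) {
      std::cerr << "warning: entry point at offset " << Offset
                << " lies in a constant island; ignoring function\n";
      Ignored = true;
      return false;
    }
    EntryOffsets.insert(Offset);
    return true;
  }
};
```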

* [llvm][dwarfdump] Pretty-print DW_AT_language_version (llvm#164222)

In both verbose and non-verbose mode we will now use
`llvm::dwarf::LanguageDescription` to turn the version into a
human-readable string. In verbose mode we also display the raw version
code (similar to how we display addresses in verbose mode). To make the
version code and the prettified name easier to distinguish, we print the
prettified name in colour (if available), which is consistent with how
`DW_AT_language` is printed in colour.

Before:
```
0x0000000c: DW_TAG_compile_unit
              DW_AT_language_name       (DW_LNAME_C)
              DW_AT_language_version    (201112)
```
After:
```
0x0000000c: DW_TAG_compile_unit
              DW_AT_language_name       (DW_LNAME_C)
              DW_AT_language_version    (201112 C11)
```
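
The verbose-mode formatting shown above can be approximated by the sketch below. The helpers `describeCVersion` and `formatVersionVerbose` are hypothetical; llvm-dwarfdump relies on `llvm::dwarf::LanguageDescription` rather than a local table. Only the `201112` to `C11` mapping is confirmed by the output above; the other two table entries are assumptions based on the usual `YYYYMM` C standard version codes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical lookup for illustration; returns "" for unknown codes.
inline std::string describeCVersion(uint32_t Version) {
  switch (Version) {
  case 199901: return "C99"; // assumed entry
  case 201112: return "C11"; // matches the output shown above
  case 201710: return "C17"; // assumed entry
  default:     return "";
  }
}

// Verbose-mode style: raw code first, then the readable name when known.
inline std::string formatVersionVerbose(uint32_t Version) {
  std::string Out = "(" + std::to_string(Version);
  const std::string Name = describeCVersion(Version);
  if (!Name.empty())
    Out += " " + Name;
  return Out + ")";
}
```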

---------

Signed-off-by: Michał Górny <[email protected]>
Signed-off-by: Keshav Vinayak Jha <[email protected]>
Co-authored-by: Michał Górny <[email protected]>
Co-authored-by: Stanislav Mekhanoshin <[email protected]>
Co-authored-by: Pierre van Houtryve <[email protected]>
Co-authored-by: Nikita Popov <[email protected]>
Co-authored-by: David Green <[email protected]>
Co-authored-by: Simon Pilgrim <[email protected]>
Co-authored-by: Nikolas Klauser <[email protected]>
Co-authored-by: Simon Tatham <[email protected]>
Co-authored-by: Daniel Sanders <[email protected]>
Co-authored-by: Keshav Vinayak Jha <[email protected]>
Co-authored-by: Ryotaro Kasuga <[email protected]>
Co-authored-by: Ramkumar Ramachandra <[email protected]>
Co-authored-by: Antonio Frighetto <[email protected]>
Co-authored-by: David Sherwood <[email protected]>
Co-authored-by: Connector Switch <[email protected]>
Co-authored-by: Asher Dobrescu <[email protected]>
Co-authored-by: Michael Buch <[email protected]>
Lukacma pushed a commit to Lukacma/llvm-project that referenced this pull request Oct 29, 2025
…lvm#161638)

They were previously optimized to not emit any waitcnt, which is
technically correct because there is no reordering of operations at
workgroup scope in CU mode for GFX10+.

However, this breaks transitivity. For example, suppose one thread
performs the following sequence of events:

- some stores
- store atomic release syncscope("workgroup")
- barrier

then another thread follows with

- barrier
- load atomic acquire
- store atomic release syncscope("agent")

It does not work because, while the other thread sees the stores, it
cannot release them at the wider scope. Our release fences aren't strong
enough to "wait" on stores from other waves.
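
The transitivity requirement can be illustrated on the host with standard C++ atomics. This is only an analogy under stated assumptions: standard C++ has no sync scopes, so the comments mark which scope each step would use in the sequence above, and the acquire spin loop stands in for the "barrier, then load atomic acquire" step. The function name `run_release_chain` is made up for the example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Thread A publishes data with a release; thread B acquires it and then
// releases to a third party; thread C acquires from B and must see A's data.
int run_release_chain() {
  int data = 0;                   // plain, non-atomic payload
  std::atomic<int> wg_flag{0};    // would be syncscope("workgroup")
  std::atomic<int> agent_flag{0}; // would be syncscope("agent")
  int seen = -1;

  std::thread a([&] {
    data = 42;                                   // "some stores"
    wg_flag.store(1, std::memory_order_release); // release, workgroup scope
  });
  std::thread b([&] {
    // Stand-in for "barrier" followed by "load atomic acquire".
    while (wg_flag.load(std::memory_order_acquire) != 1)
      std::this_thread::yield();
    agent_flag.store(1, std::memory_order_release); // release, agent scope
  });
  std::thread c([&] {
    while (agent_flag.load(std::memory_order_acquire) != 1)
      std::this_thread::yield();
    seen = data; // happens-before is transitive: A -> B -> C
  });

  a.join();
  b.join();
  c.join();
  return seen;
}
```

In the C++ memory model, thread C is guaranteed to read 42 even though it never synchronized with thread A directly. The failure described above is the hardware-level counterpart: thread B's wider-scope release did not cover stores made by thread A.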

We also cannot strengthen our release fences any further to allow for
releasing other waves' stores because only GFX12 can do that with
`global_wb`. GFX10-11 do not have the writeback instruction.
It would also add yet another level of complexity to code sequences, with
both acquire and release having CU-mode-only alternatives.
Lastly, acq/rel are always used together. The price for synchronization
has to be paid either at the acq, or the rel. Strengthening the releases
would just make the memory model more complex but wouldn't help
performance.

So the choice here is to streamline the code sequences by making CU and
WGP mode emit almost identical (vL0 inv is not needed in CU mode) code
for release (or stronger) atomic ordering.

This also removes the `vm_vsrc(0)` wait before barriers. Now that the
release fence in CU mode is strong enough, it is no longer needed.

Supersedes llvm#160501
Solves SC1-6454
aokblast pushed a commit to aokblast/llvm-project that referenced this pull request Oct 30, 2025
…lvm#161638)
