Conversation

Collaborator

@RKSimon RKSimon commented Nov 4, 2025

If the bit-twiddled value has additional uses besides the rmw store, we can replace them with a (re)loaded copy of the full-width integer value after the store.

Some memory op chain handling is needed here: the additional (re)load is chained after the new store, and any dependencies of the original store are then chained after the (re)load.
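As a hypothetical standalone illustration of the shape of this pattern (not from the patch - the combine itself targets wide integers such as the i128/i256 tests below, while this C++ sketch uses 64 bits so it stays portable):

#include <cstdint>
#include <cstdio>

// Complement one bit of a wide integer, store it back (the rmw store),
// and also consume the updated value - the "additional use".
bool complement_and_test(uint64_t *word, unsigned position) {
  uint64_t bit = 1ull << (position & 63);
  uint64_t updated = *word ^ bit; // the bit-twiddled value
  *word = updated;                // the rmw store
  return updated != 0;            // extra use: can be fed by reloading *word
}

int main() {
  uint64_t w = 2;
  printf("%d\n", complement_and_test(&w, 1)); // 2 ^ 2 == 0, prints 0
  return 0;
}

At i128 and wider, the patch lets the backend narrow the bit complement to a single small in-memory XOR and satisfy the extra use with a reload of the stored value, as the updated complement_cmpz_i128 CHECK lines below show.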

Member

llvmbot commented Nov 4, 2025

@llvm/pr-subscribers-backend-x86

Author: Simon Pilgrim (RKSimon)

Changes

If the bit-twiddled value has additional uses besides the rmw store, we can replace them with a (re)loaded copy of the full-width integer value after the store.

Some memory op chain handling is needed here: the additional (re)load is chained after the new store, and any dependencies of the original store are then chained after the (re)load.


Full diff: https://github.com/llvm/llvm-project/pull/166376.diff

2 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+15-4)
  • (modified) llvm/test/CodeGen/X86/bittest-big-integer.ll (+74-327)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 6edf0185df813..4da1bb0c81db1 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -53369,8 +53369,7 @@ static SDValue narrowBitOpRMW(StoreSDNode *St, const SDLoc &DL,
   //
   // BitInsert: (X & ~(1 << ShAmt)) | (InsertBit << ShAmt)
   SDValue SrcVal, InsertBit, ShAmt;
-  if (!StoredVal.hasOneUse() ||
-      !(sd_match(StoredVal, m_And(m_Value(SrcVal),
+  if (!(sd_match(StoredVal, m_And(m_Value(SrcVal),
                                   m_Not(m_Shl(m_One(), m_Value(ShAmt))))) ||
         sd_match(StoredVal,
                  m_Or(m_Value(SrcVal), m_Shl(m_One(), m_Value(ShAmt)))) ||
@@ -53441,8 +53440,20 @@ static SDValue narrowBitOpRMW(StoreSDNode *St, const SDLoc &DL,
     Res = DAG.getNode(StoredVal.getOpcode(), DL, MVT::i32, X, Mask);
   }
 
-  return DAG.getStore(St->getChain(), DL, Res, NewPtr, St->getPointerInfo(),
-                      Align(), St->getMemOperand()->getFlags());
+  SDValue NewStore =
+      DAG.getStore(St->getChain(), DL, Res, NewPtr, St->getPointerInfo(),
+                   Align(), St->getMemOperand()->getFlags());
+
+  // If there are other uses of StoredVal, replace with a new load of the
+  // whole (updated) value and ensure that any chained dependencies on the
+  // original store are updated to come AFTER the new load.
+  if (!StoredVal.hasOneUse()) {
+    SDValue NewLoad =
+        DAG.getLoad(VT, DL, NewStore, Ld->getBasePtr(), Ld->getMemOperand());
+    DAG.ReplaceAllUsesWith(StoredVal, NewLoad);
+    DAG.ReplaceAllUsesWith(SDValue(St, 0), NewLoad.getValue(1));
+  }
+  return NewStore;
 }
 
 static SDValue combineStore(SDNode *N, SelectionDAG &DAG,
diff --git a/llvm/test/CodeGen/X86/bittest-big-integer.ll b/llvm/test/CodeGen/X86/bittest-big-integer.ll
index bcb14fd25b975..32d225273a6e1 100644
--- a/llvm/test/CodeGen/X86/bittest-big-integer.ll
+++ b/llvm/test/CodeGen/X86/bittest-big-integer.ll
@@ -906,115 +906,46 @@ define i1 @test_ne_i4096(ptr %word, i32 %position) nounwind {
 define i1 @complement_cmpz_i128(ptr %word, i32 %position) nounwind {
 ; X86-LABEL: complement_cmpz_i128:
 ; X86:       # %bb.0:
-; X86-NEXT:    pushl %ebp
-; X86-NEXT:    movl %esp, %ebp
-; X86-NEXT:    pushl %ebx
-; X86-NEXT:    pushl %edi
-; X86-NEXT:    pushl %esi
-; X86-NEXT:    andl $-16, %esp
-; X86-NEXT:    subl $64, %esp
-; X86-NEXT:    movzbl 12(%ebp), %ecx
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $1, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl %ecx, %eax
-; X86-NEXT:    shrb $3, %al
-; X86-NEXT:    andb $12, %al
-; X86-NEXT:    negb %al
-; X86-NEXT:    movsbl %al, %esi
-; X86-NEXT:    movl 36(%esp,%esi), %eax
-; X86-NEXT:    movl 40(%esp,%esi), %edi
-; X86-NEXT:    movl %edi, %edx
-; X86-NEXT:    shldl %cl, %eax, %edx
-; X86-NEXT:    movl 32(%esp,%esi), %ebx
-; X86-NEXT:    movl %ebx, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Spill
-; X86-NEXT:    movl 44(%esp,%esi), %esi
-; X86-NEXT:    shldl %cl, %edi, %esi
-; X86-NEXT:    movl %ebx, %edi
-; X86-NEXT:    shll %cl, %edi
-; X86-NEXT:    movl {{[-0-9]+}}(%e{{[sb]}}p), %ebx # 4-byte Reload
-; X86-NEXT:    shldl %cl, %ebx, %eax
-; X86-NEXT:    movl 8(%ebp), %ecx
-; X86-NEXT:    xorl 12(%ecx), %esi
-; X86-NEXT:    xorl 8(%ecx), %edx
-; X86-NEXT:    xorl 4(%ecx), %eax
-; X86-NEXT:    xorl (%ecx), %edi
-; X86-NEXT:    movl %edx, 8(%ecx)
-; X86-NEXT:    movl %esi, 12(%ecx)
-; X86-NEXT:    movl %edi, (%ecx)
-; X86-NEXT:    movl %eax, 4(%ecx)
-; X86-NEXT:    orl %esi, %eax
-; X86-NEXT:    orl %edx, %edi
-; X86-NEXT:    orl %eax, %edi
+; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
+; X86-NEXT:    movl {{[0-9]+}}(%esp), %ecx
+; X86-NEXT:    movl $1, %edx
+; X86-NEXT:    shll %cl, %edx
+; X86-NEXT:    andl $96, %ecx
+; X86-NEXT:    shrl $3, %ecx
+; X86-NEXT:    xorl %edx, (%eax,%ecx)
+; X86-NEXT:    movl (%eax), %ecx
+; X86-NEXT:    movl 4(%eax), %edx
+; X86-NEXT:    orl 12(%eax), %edx
+; X86-NEXT:    orl 8(%eax), %ecx
+; X86-NEXT:    orl %edx, %ecx
 ; X86-NEXT:    setne %al
-; X86-NEXT:    leal -12(%ebp), %esp
-; X86-NEXT:    popl %esi
-; X86-NEXT:    popl %edi
-; X86-NEXT:    popl %ebx
-; X86-NEXT:    popl %ebp
 ; X86-NEXT:    retl
 ;
 ; SSE-LABEL: complement_cmpz_i128:
 ; SSE:       # %bb.0:
 ; SSE-NEXT:    movl %esi, %ecx
 ; SSE-NEXT:    movl $1, %eax
-; SSE-NEXT:    xorl %edx, %edx
-; SSE-NEXT:    shldq %cl, %rax, %rdx
-; SSE-NEXT:    shlq %cl, %rax
-; SSE-NEXT:    xorl %esi, %esi
-; SSE-NEXT:    testb $64, %cl
-; SSE-NEXT:    cmovneq %rax, %rdx
-; SSE-NEXT:    cmovneq %rsi, %rax
-; SSE-NEXT:    xorq 8(%rdi), %rdx
-; SSE-NEXT:    xorq (%rdi), %rax
-; SSE-NEXT:    movq %rax, (%rdi)
-; SSE-NEXT:    movq %rdx, 8(%rdi)
-; SSE-NEXT:    orq %rdx, %rax
+; SSE-NEXT:    shll %cl, %eax
+; SSE-NEXT:    andl $96, %ecx
+; SSE-NEXT:    shrl $3, %ecx
+; SSE-NEXT:    xorl %eax, (%rdi,%rcx)
+; SSE-NEXT:    movq (%rdi), %rax
+; SSE-NEXT:    orq 8(%rdi), %rax
 ; SSE-NEXT:    setne %al
 ; SSE-NEXT:    retq
 ;
-; AVX2-LABEL: complement_cmpz_i128:
-; AVX2:       # %bb.0:
-; AVX2-NEXT:    movl %esi, %ecx
-; AVX2-NEXT:    movl $1, %eax
-; AVX2-NEXT:    xorl %edx, %edx
-; AVX2-NEXT:    shldq %cl, %rax, %rdx
-; AVX2-NEXT:    xorl %esi, %esi
-; AVX2-NEXT:    shlxq %rcx, %rax, %rax
-; AVX2-NEXT:    testb $64, %cl
-; AVX2-NEXT:    cmovneq %rax, %rdx
-; AVX2-NEXT:    cmovneq %rsi, %rax
-; AVX2-NEXT:    xorq 8(%rdi), %rdx
-; AVX2-NEXT:    xorq (%rdi), %rax
-; AVX2-NEXT:    movq %rax, (%rdi)
-; AVX2-NEXT:    movq %rdx, 8(%rdi)
-; AVX2-NEXT:    orq %rdx, %rax
-; AVX2-NEXT:    setne %al
-; AVX2-NEXT:    retq
-;
-; AVX512-LABEL: complement_cmpz_i128:
-; AVX512:       # %bb.0:
-; AVX512-NEXT:    movl %esi, %ecx
-; AVX512-NEXT:    xorl %eax, %eax
-; AVX512-NEXT:    movl $1, %edx
-; AVX512-NEXT:    xorl %esi, %esi
-; AVX512-NEXT:    shldq %cl, %rdx, %rsi
-; AVX512-NEXT:    shlxq %rcx, %rdx, %rdx
-; AVX512-NEXT:    testb $64, %cl
-; AVX512-NEXT:    cmovneq %rdx, %rsi
-; AVX512-NEXT:    cmovneq %rax, %rdx
-; AVX512-NEXT:    xorq 8(%rdi), %rsi
-; AVX512-NEXT:    xorq (%rdi), %rdx
-; AVX512-NEXT:    movq %rdx, (%rdi)
-; AVX512-NEXT:    movq %rsi, 8(%rdi)
-; AVX512-NEXT:    orq %rsi, %rdx
-; AVX512-NEXT:    setne %al
-; AVX512-NEXT:    retq
+; AVX-LABEL: complement_cmpz_i128:
+; AVX:       # %bb.0:
+; AVX-NEXT:    # kill: def $esi killed $esi def $rsi
+; AVX-NEXT:    movl $1, %eax
+; AVX-NEXT:    shlxl %esi, %eax, %eax
+; AVX-NEXT:    andl $96, %esi
+; AVX-NEXT:    shrl $3, %esi
+; AVX-NEXT:    xorl %eax, (%rdi,%rsi)
+; AVX-NEXT:    movq (%rdi), %rax
+; AVX-NEXT:    orq 8(%rdi), %rax
+; AVX-NEXT:    setne %al
+; AVX-NEXT:    retq
   %rem = and i32 %position, 127
   %ofs = zext nneg i32 %rem to i128
   %bit = shl nuw i128 1, %ofs
@@ -1088,247 +1019,63 @@ define i32 @chain_reset_i256(ptr %p0, ptr %p1, ptr %p2, i32 %position) nounwind
 ; X86-LABEL: chain_reset_i256:
 ; X86:       # %bb.0:
 ; X86-NEXT:    pushl %ebp
-; X86-NEXT:    movl %esp, %ebp
 ; X86-NEXT:    pushl %ebx
 ; X86-NEXT:    pushl %edi
 ; X86-NEXT:    pushl %esi
-; X86-NEXT:    andl $-16, %esp
-; X86-NEXT:    subl $112, %esp
-; X86-NEXT:    movzbl 20(%ebp), %ecx
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $1, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl $0, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl %ecx, %eax
-; X86-NEXT:    shrb $3, %al
-; X86-NEXT:    andb $28, %al
-; X86-NEXT:    negb %al
-; X86-NEXT:    movsbl %al, %eax
-; X86-NEXT:    movl 72(%esp,%eax), %edx
-; X86-NEXT:    movl 76(%esp,%eax), %edi
-; X86-NEXT:    movl %edi, %esi
-; X86-NEXT:    shldl %cl, %edx, %esi
-; X86-NEXT:    movl %esi, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Spill
-; X86-NEXT:    movl 68(%esp,%eax), %esi
-; X86-NEXT:    movl %esi, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Spill
-; X86-NEXT:    shldl %cl, %esi, %edx
-; X86-NEXT:    movl %edx, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Spill
-; X86-NEXT:    movl 80(%esp,%eax), %edx
-; X86-NEXT:    movl 84(%esp,%eax), %ebx
-; X86-NEXT:    movl %ebx, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Spill
-; X86-NEXT:    shldl %cl, %edx, %ebx
-; X86-NEXT:    movl %ebx, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Spill
-; X86-NEXT:    shldl %cl, %edi, %edx
-; X86-NEXT:    movl 64(%esp,%eax), %edi
-; X86-NEXT:    movl 88(%esp,%eax), %esi
-; X86-NEXT:    movl 92(%esp,%eax), %eax
-; X86-NEXT:    shldl %cl, %esi, %eax
-; X86-NEXT:    movl {{[-0-9]+}}(%e{{[sb]}}p), %ebx # 4-byte Reload
-; X86-NEXT:    shldl %cl, %ebx, %esi
-; X86-NEXT:    shldl %cl, %edi, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Folded Spill
-; X86-NEXT:    shll %cl, %edi
-; X86-NEXT:    movl %edi, %ecx
-; X86-NEXT:    movl %esi, %ebx
-; X86-NEXT:    notl %ebx
-; X86-NEXT:    notl %eax
-; X86-NEXT:    notl %edx
-; X86-NEXT:    notl {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Folded Spill
-; X86-NEXT:    notl {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Folded Spill
-; X86-NEXT:    movl {{[-0-9]+}}(%e{{[sb]}}p), %edi # 4-byte Reload
-; X86-NEXT:    notl %edi
-; X86-NEXT:    notl %ecx
-; X86-NEXT:    movl %ecx, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Spill
-; X86-NEXT:    movl {{[-0-9]+}}(%e{{[sb]}}p), %esi # 4-byte Reload
-; X86-NEXT:    notl %esi
-; X86-NEXT:    movl 8(%ebp), %ecx
-; X86-NEXT:    andl 12(%ecx), %edi
-; X86-NEXT:    movl %edi, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Spill
-; X86-NEXT:    movl {{[-0-9]+}}(%e{{[sb]}}p), %edi # 4-byte Reload
-; X86-NEXT:    andl 8(%ecx), %edi
-; X86-NEXT:    movl %edi, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Spill
-; X86-NEXT:    movl {{[-0-9]+}}(%e{{[sb]}}p), %edi # 4-byte Reload
-; X86-NEXT:    andl 20(%ecx), %edi
-; X86-NEXT:    movl %edi, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Spill
-; X86-NEXT:    andl 16(%ecx), %edx
-; X86-NEXT:    andl 28(%ecx), %eax
-; X86-NEXT:    andl 24(%ecx), %ebx
-; X86-NEXT:    andl 4(%ecx), %esi
-; X86-NEXT:    movl %esi, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Spill
-; X86-NEXT:    movl {{[-0-9]+}}(%e{{[sb]}}p), %esi # 4-byte Reload
-; X86-NEXT:    andl (%ecx), %esi
-; X86-NEXT:    movl %esi, {{[-0-9]+}}(%e{{[sb]}}p) # 4-byte Spill
-; X86-NEXT:    movl %ebx, 24(%ecx)
-; X86-NEXT:    movl %eax, 28(%ecx)
-; X86-NEXT:    movl %edx, 16(%ecx)
-; X86-NEXT:    movl %edi, 20(%ecx)
-; X86-NEXT:    movl {{[-0-9]+}}(%e{{[sb]}}p), %edi # 4-byte Reload
-; X86-NEXT:    movl %edi, 8(%ecx)
-; X86-NEXT:    movl {{[-0-9]+}}(%e{{[sb]}}p), %edi # 4-byte Reload
-; X86-NEXT:    movl %edi, 12(%ecx)
-; X86-NEXT:    movl {{[-0-9]+}}(%e{{[sb]}}p), %esi # 4-byte Reload
-; X86-NEXT:    movl %esi, (%ecx)
-; X86-NEXT:    movl {{[-0-9]+}}(%e{{[sb]}}p), %esi # 4-byte Reload
-; X86-NEXT:    movl %esi, 4(%ecx)
-; X86-NEXT:    orl %edi, %eax
-; X86-NEXT:    orl {{[-0-9]+}}(%e{{[sb]}}p), %esi # 4-byte Folded Reload
-; X86-NEXT:    orl %eax, %esi
-; X86-NEXT:    orl {{[-0-9]+}}(%e{{[sb]}}p), %ebx # 4-byte Folded Reload
-; X86-NEXT:    movl 12(%ebp), %eax
-; X86-NEXT:    movl (%eax), %ecx
-; X86-NEXT:    movl {{[-0-9]+}}(%e{{[sb]}}p), %edi # 4-byte Reload
-; X86-NEXT:    movl %edi, (%eax)
-; X86-NEXT:    orl %edi, %edx
-; X86-NEXT:    orl %ebx, %edx
-; X86-NEXT:    orl %esi, %edx
-; X86-NEXT:    movl 16(%ebp), %eax
+; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
+; X86-NEXT:    movl {{[0-9]+}}(%esp), %edx
+; X86-NEXT:    movl {{[0-9]+}}(%esp), %esi
+; X86-NEXT:    movl {{[0-9]+}}(%esp), %ecx
+; X86-NEXT:    movl $-2, %edi
+; X86-NEXT:    roll %cl, %edi
+; X86-NEXT:    shrl $3, %ecx
+; X86-NEXT:    andl $28, %ecx
+; X86-NEXT:    andl %edi, (%esi,%ecx)
+; X86-NEXT:    movl 8(%esi), %ebx
+; X86-NEXT:    movl (%esi), %edi
+; X86-NEXT:    movl 4(%esi), %ecx
+; X86-NEXT:    movl 12(%esi), %ebp
+; X86-NEXT:    orl 28(%esi), %ebp
+; X86-NEXT:    orl 20(%esi), %ecx
+; X86-NEXT:    orl %ebp, %ecx
+; X86-NEXT:    orl 24(%esi), %ebx
+; X86-NEXT:    movl 16(%esi), %ebp
+; X86-NEXT:    orl %edi, %ebp
+; X86-NEXT:    orl %ebx, %ebp
+; X86-NEXT:    movl (%edx), %esi
+; X86-NEXT:    movl %edi, (%edx)
 ; X86-NEXT:    movl (%eax), %eax
+; X86-NEXT:    orl %ecx, %ebp
 ; X86-NEXT:    jne .LBB23_2
 ; X86-NEXT:  # %bb.1:
-; X86-NEXT:    addl %ecx, %eax
+; X86-NEXT:    addl %esi, %eax
 ; X86-NEXT:  .LBB23_2:
-; X86-NEXT:    leal -12(%ebp), %esp
 ; X86-NEXT:    popl %esi
 ; X86-NEXT:    popl %edi
 ; X86-NEXT:    popl %ebx
 ; X86-NEXT:    popl %ebp
 ; X86-NEXT:    retl
 ;
-; SSE-LABEL: chain_reset_i256:
-; SSE:       # %bb.0:
-; SSE-NEXT:    xorps %xmm0, %xmm0
-; SSE-NEXT:    movups %xmm0, -{{[0-9]+}}(%rsp)
-; SSE-NEXT:    movaps %xmm0, -{{[0-9]+}}(%rsp)
-; SSE-NEXT:    movaps %xmm0, -{{[0-9]+}}(%rsp)
-; SSE-NEXT:    movq $0, -{{[0-9]+}}(%rsp)
-; SSE-NEXT:    movq $1, -{{[0-9]+}}(%rsp)
-; SSE-NEXT:    movl %ecx, %eax
-; SSE-NEXT:    shrb $3, %al
-; SSE-NEXT:    andb $24, %al
-; SSE-NEXT:    negb %al
-; SSE-NEXT:    movsbq %al, %r10
-; SSE-NEXT:    movq -24(%rsp,%r10), %r8
-; SSE-NEXT:    movq -16(%rsp,%r10), %rax
-; SSE-NEXT:    shldq %cl, %r8, %rax
-; SSE-NEXT:    movq -32(%rsp,%r10), %r9
-; SSE-NEXT:    shldq %cl, %r9, %r8
-; SSE-NEXT:    movq -40(%rsp,%r10), %r10
-; SSE-NEXT:    shldq %cl, %r10, %r9
-; SSE-NEXT:    # kill: def $cl killed $cl killed $ecx
-; SSE-NEXT:    shlq %cl, %r10
-; SSE-NEXT:    notq %r8
-; SSE-NEXT:    notq %rax
-; SSE-NEXT:    notq %r10
-; SSE-NEXT:    notq %r9
-; SSE-NEXT:    andq 24(%rdi), %rax
-; SSE-NEXT:    andq 16(%rdi), %r8
-; SSE-NEXT:    andq 8(%rdi), %r9
-; SSE-NEXT:    andq (%rdi), %r10
-; SSE-NEXT:    movq %r8, 16(%rdi)
-; SSE-NEXT:    movq %rax, 24(%rdi)
-; SSE-NEXT:    movq %r10, (%rdi)
-; SSE-NEXT:    movq %r9, 8(%rdi)
-; SSE-NEXT:    orq %rax, %r9
-; SSE-NEXT:    orq %r10, %r8
-; SSE-NEXT:    movl (%rsi), %eax
-; SSE-NEXT:    movl %r10d, (%rsi)
-; SSE-NEXT:    movl (%rdx), %ecx
-; SSE-NEXT:    addl %ecx, %eax
-; SSE-NEXT:    orq %r9, %r8
-; SSE-NEXT:    cmovnel %ecx, %eax
-; SSE-NEXT:    retq
-;
-; AVX2-LABEL: chain_reset_i256:
-; AVX2:       # %bb.0:
-; AVX2-NEXT:    # kill: def $ecx killed $ecx def $rcx
-; AVX2-NEXT:    vxorps %xmm0, %xmm0, %xmm0
-; AVX2-NEXT:    vmovups %ymm0, -{{[0-9]+}}(%rsp)
-; AVX2-NEXT:    vmovss {{.*#+}} xmm0 = [1,0,0,0]
-; AVX2-NEXT:    vmovups %ymm0, -{{[0-9]+}}(%rsp)
-; AVX2-NEXT:    movl %ecx, %eax
-; AVX2-NEXT:    shrb $3, %al
-; AVX2-NEXT:    andb $24, %al
-; AVX2-NEXT:    negb %al
-; AVX2-NEXT:    movsbq %al, %rax
-; AVX2-NEXT:    movq -32(%rsp,%rax), %r8
-; AVX2-NEXT:    movq -24(%rsp,%rax), %r9
-; AVX2-NEXT:    movq %r9, %r10
-; AVX2-NEXT:    shldq %cl, %r8, %r10
-; AVX2-NEXT:    movq -40(%rsp,%rax), %r11
-; AVX2-NEXT:    movq -16(%rsp,%rax), %rax
-; AVX2-NEXT:    shldq %cl, %r9, %rax
-; AVX2-NEXT:    shldq %cl, %r11, %r8
-; AVX2-NEXT:    andnq 24(%rdi), %rax, %rax
-; AVX2-NEXT:    andnq 16(%rdi), %r10, %r9
-; AVX2-NEXT:    andnq 8(%rdi), %r8, %r8
-; AVX2-NEXT:    shlxq %rcx, %r11, %rcx
-; AVX2-NEXT:    andnq (%rdi), %rcx, %rcx
-; AVX2-NEXT:    movq %r9, 16(%rdi)
-; AVX2-NEXT:    movq %rax, 24(%rdi)
-; AVX2-NEXT:    movq %rcx, (%rdi)
-; AVX2-NEXT:    movq %r8, 8(%rdi)
-; AVX2-NEXT:    orq %rax, %r8
-; AVX2-NEXT:    orq %rcx, %r9
-; AVX2-NEXT:    movl (%rsi), %eax
-; AVX2-NEXT:    movl %ecx, (%rsi)
-; AVX2-NEXT:    movl (%rdx), %ecx
-; AVX2-NEXT:    addl %ecx, %eax
-; AVX2-NEXT:    orq %r8, %r9
-; AVX2-NEXT:    cmovnel %ecx, %eax
-; AVX2-NEXT:    vzeroupper
-; AVX2-NEXT:    retq
-;
-; AVX512-LABEL: chain_reset_i256:
-; AVX512:       # %bb.0:
-; AVX512-NEXT:    vxorps %xmm0, %xmm0, %xmm0
-; AVX512-NEXT:    vmovups %ymm0, -{{[0-9]+}}(%rsp)
-; AVX512-NEXT:    vmovaps {{.*#+}} xmm0 = [1,0,0,0]
-; AVX512-NEXT:    vmovups %ymm0, -{{[0-9]+}}(%rsp)
-; AVX512-NEXT:    # kill: def $ecx killed $ecx def $rcx
-; AVX512-NEXT:    movl %ecx, %eax
-; AVX512-NEXT:    shrb $3, %al
-; AVX512-NEXT:    andb $24, %al
-; AVX512-NEXT:    negb %al
-; AVX512-NEXT:    movsbq %al, %rax
-; AVX512-NEXT:    movq -40(%rsp,%rax), %r8
-; AVX512-NEXT:    movq -32(%rsp,%rax), %r9
-; AVX512-NEXT:    movq -24(%rsp,%rax), %r10
-; AVX512-NEXT:    movq %r10, %r11
-; AVX512-NEXT:    shldq %cl, %r9, %r11
-; AVX512-NEXT:    movq -16(%rsp,%rax), %rax
-; AVX512-NEXT:    shldq %cl, %r10, %rax
-; AVX512-NEXT:    shlxq %rcx, %r8, %r10
-; AVX512-NEXT:    # kill: def $cl killed $cl killed $rcx
-; AVX512-NEXT:    shldq %cl, %r8, %r9
-; AVX512-NEXT:    andnq 24(%rdi), %rax, %rax
-; AVX512-NEXT:    andnq 16(%rdi), %r11, %rcx
-; AVX512-NEXT:    andnq 8(%rdi), %r9, %r8
-; AVX512-NEXT:    andnq (%rdi), %r10, %r9
-; AVX512-NEXT:    movq %rcx, 16(%rdi)
-; AVX512-NEXT:    movq %rax, 24(%rdi)
-; AVX512-NEXT:    movq %r9, (%rdi)
-; AVX512-NEXT:    movq %r8, 8(%rdi)
-; AVX512-NEXT:    orq %rax, %r8
-; AVX512-NEXT:    orq %r9, %rcx
-; AVX512-NEXT:    movl (%rsi), %eax
-; AVX512-NEXT:    movl %r9d, (%rsi)
-; AVX512-NEXT:    movl (%rdx), %edx
-; AVX512-NEXT:    addl %edx, %eax
-; AVX512-NEXT:    orq %r8, %rcx
-; AVX512-NEXT:    cmovnel %edx, %eax
-; AVX512-NEXT:    vzeroupper
-; AVX512-NEXT:    retq
+; X64-LABEL: chain_reset_i256:
+; X64:       # %bb.0:
+; X64-NEXT:    # kill: def $ecx killed $ecx def $rcx
+; X64-NEXT:    movl $-2, %eax
+; X64-NEXT:    roll %cl, %eax
+; X64-NEXT:    shrl $3, %ecx
+; X64-NEXT:    andl $28, %ecx
+; X64-NEXT:    andl %eax, (%rdi,%rcx)
+; X64-NEXT:    movq (%rdi), %rcx
+; X64-NEXT:    movq 8(%rdi), %r8
+; X64-NEXT:    orq 24(%rdi), %r8
+; X64-NEXT:    movq 16(%rdi), %rdi
+; X64-NEXT:    orq %rcx, %rdi
+; X64-NEXT:    movl (%rsi), %eax
+; X64-NEXT:    movl %ecx, (%rsi)
+; X64-NEXT:    movl (%rdx), %ecx
+; X64-NEXT:    addl %ecx, %eax
+; X64-NEXT:    orq %r8, %rdi
+; X64-NEXT:    cmovnel %ecx, %eax
+; X64-NEXT:    retq
   %rem = and i32 %position, 255
   %ofs = zext nneg i32 %rem to i256
   %bit = shl nuw i256 1, %ofs

SDValue NewLoad =
DAG.getLoad(VT, DL, NewStore, Ld->getBasePtr(), Ld->getMemOperand());
DAG.ReplaceAllUsesWith(StoredVal, NewLoad);
DAG.ReplaceAllUsesWith(SDValue(St, 0), NewLoad.getValue(1));
Contributor

Is this necessary?

Collaborator Author

I've been burnt by previous attempts at this code, so I may be being overcautious, but I want to ensure (1) that the reload occurs after the replacement store, and (2) that anything (load/store) that came after the old store now comes after the reload as well. WDYT?

Contributor

The NewLoad has an input chain to NewStore, so (1) is not a problem; the load side is already guaranteed by the data dependence, but the store side is dangerous. Maybe we can replace it with NewStore here?

DAG.ReplaceAllUsesWith(SDValue(St, 0), NewStore.getValue(1));

Collaborator Author

OK. I'll drop the line - we're replacing the old store with NewStore so that chain should be replaced automatically.
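With that line dropped, the new block presumably reduces to the following (a sketch inferred from this thread, not quoted from the merged commit):

SDValue NewLoad =
    DAG.getLoad(VT, DL, NewStore, Ld->getBasePtr(), Ld->getMemOperand());
DAG.ReplaceAllUsesWith(StoredVal, NewLoad);

The old store's chain uses are then rewired automatically when St itself is replaced by NewStore.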

Contributor

@phoebewang phoebewang left a comment

LGTM.

@RKSimon RKSimon merged commit 28e024f into llvm:main Nov 5, 2025
10 checks passed
@RKSimon RKSimon deleted the x86-bittest-multistore branch November 5, 2025 10:33
Collaborator

llvm-ci commented Nov 5, 2025

LLVM Buildbot has detected a new failure on builder ppc64le-mlir-rhel-clang running on ppc64le-mlir-rhel-test while building llvm at step 6 "test-build-check-mlir-build-only-check-mlir".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/129/builds/32491

Here is the relevant piece of the build log for reference:
Step 6 (test-build-check-mlir-build-only-check-mlir) failure: 1200 seconds without output running [b'ninja', b'check-mlir'], attempting to kill
...
PASS: MLIR-Unit :: IR/./MLIRIRTests/39/130 (3625 of 3634)
PASS: MLIR :: Dialect/SCF/loop-unroll.mlir (3626 of 3634)
PASS: MLIR-Unit :: IR/./MLIRIRTests/38/130 (3627 of 3634)
PASS: MLIR :: Transforms/invalid-parallel-loop-collapsing.mlir (3628 of 3634)
PASS: MLIR :: mlir-reduce/multiple-function.mlir (3629 of 3634)
PASS: MLIR :: mlir-reduce/crashop-reduction.mlir (3630 of 3634)
PASS: MLIR :: mlir-runner/utils.mlir (3631 of 3634)
PASS: MLIR :: mlir-runner/verify-entry-point.mlir (3632 of 3634)
PASS: MLIR :: mlir-tblgen/op-error.td (3633 of 3634)
PASS: MLIR :: Pass/pipeline-options-parsing.mlir (3634 of 3634)
command timed out: 1200 seconds without output running [b'ninja', b'check-mlir'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1601.412414
