
Conversation

@zhaoqi5
Contributor

@zhaoqi5 zhaoqi5 commented Sep 9, 2025

Checking `isOperationLegalOrCustom` instead of `isOperationLegal` allows more optimization opportunities. In particular, if a target wants to mark `extract_vector_elt` as `Custom` rather than `Legal` in order to optimize certain cases, this combiner would otherwise miss some improvements.

Previously, `isOperationLegalOrCustom` was avoided here due to the risk of getting stuck in infinite loops (as noted in 61ec738). In my testing the issue no longer reproduces, but the coverage is limited to the regression/unit tests and the test-suite.

Would it make sense to relax this condition to enable more optimizations? And what would be the best way to ensure that doing so does not reintroduce infinite-loop regressions? Any suggestions would be appreciated.
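
For reference, the two queries differ only in whether a `Custom` action is accepted. A minimal sketch of their semantics, paraphrasing `TargetLowering` rather than quoting the header verbatim:

// Paraphrased sketch of the two TargetLowering queries (see
// llvm/include/llvm/CodeGen/TargetLowering.h for the real definitions).
bool isOperationLegal(unsigned Op, EVT VT) const {
  // Accepts only operations the target handles natively.
  return (VT == MVT::Other || isTypeLegal(VT)) &&
         getOperationAction(Op, VT) == Legal;
}

bool isOperationLegalOrCustom(unsigned Op, EVT VT) const {
  // Additionally accepts operations the target lowers itself via a Custom
  // hook, which is what this patch starts trusting in the combiner.
  return (VT == MVT::Other || isTypeLegal(VT)) &&
         (getOperationAction(Op, VT) == Legal ||
          getOperationAction(Op, VT) == Custom);
}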

@llvmbot
Member

llvmbot commented Sep 9, 2025

@llvm/pr-subscribers-backend-x86
@llvm/pr-subscribers-backend-webassembly

@llvm/pr-subscribers-llvm-selectiondag

Author: ZhaoQi (zhaoqi5)

Changes

Checking `isOperationLegalOrCustom` instead of `isOperationLegal` allows more optimization opportunities. In particular, if a target wants to mark `extract_vector_elt` as `Custom` rather than `Legal` in order to optimize certain cases, this combiner would otherwise miss some improvements.

Previously, `isOperationLegalOrCustom` was avoided here due to the risk of getting stuck in infinite loops (as noted in 61ec738). In my testing the issue no longer reproduces, but the coverage is limited to the regression/unit tests and the test-suite.

Would it make sense to relax this condition to enable more optimizations? And what would be the best way to ensure that doing so does not reintroduce infinite-loop regressions? Any suggestions would be appreciated.


Patch is 91.00 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/157658.diff

23 Files Affected:

  • (modified) llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/shufflevector.ll (+5-6)
  • (modified) llvm/test/CodeGen/Thumb2/active_lane_mask.ll (+4-6)
  • (modified) llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-i16-add.ll (+12-17)
  • (modified) llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-i8-add.ll (+12-17)
  • (modified) llvm/test/CodeGen/Thumb2/mve-fptosi-sat-vector.ll (+6-9)
  • (modified) llvm/test/CodeGen/Thumb2/mve-fptoui-sat-vector.ll (+6-9)
  • (modified) llvm/test/CodeGen/Thumb2/mve-laneinterleaving-cost.ll (+108-139)
  • (modified) llvm/test/CodeGen/Thumb2/mve-laneinterleaving.ll (+108-128)
  • (modified) llvm/test/CodeGen/Thumb2/mve-satmul-loops.ll (+22-30)
  • (modified) llvm/test/CodeGen/Thumb2/mve-sext-masked-load.ll (+13-18)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vabdus.ll (+31-41)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vld2.ll (+33-53)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vld3.ll (+97-208)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vld4-post.ll (+22-33)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vld4.ll (+87-128)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vmull-splat.ll (+77-95)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vst3.ll (+10-10)
  • (modified) llvm/test/CodeGen/WebAssembly/vector-reduce.ll (+22-24)
  • (modified) llvm/test/CodeGen/X86/avx512fp16-mov.ll (+25-29)
  • (modified) llvm/test/CodeGen/X86/test-shrink-bug.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/vec_smulo.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/vec_umulo.ll (+2-2)
diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index d130efe96b56b..97a3d36a67103 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -23933,8 +23933,7 @@ SDValue DAGCombiner::visitEXTRACT_VECTOR_ELT(SDNode *N) {
     // scalar_to_vector here as well.
 
     if (!LegalOperations ||
-        // FIXME: Should really be just isOperationLegalOrCustom.
-        TLI.isOperationLegal(ISD::EXTRACT_VECTOR_ELT, VecVT) ||
+        TLI.isOperationLegalOrCustom(ISD::EXTRACT_VECTOR_ELT, VecVT) ||
         TLI.isOperationExpand(ISD::VECTOR_SHUFFLE, VecVT)) {
       return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, ScalarVT, SVInVec,
                          DAG.getVectorIdxConstant(OrigElt, DL));
diff --git a/llvm/test/CodeGen/AArch64/shufflevector.ll b/llvm/test/CodeGen/AArch64/shufflevector.ll
index 9fd5e65086782..b47c077ccf1c5 100644
--- a/llvm/test/CodeGen/AArch64/shufflevector.ll
+++ b/llvm/test/CodeGen/AArch64/shufflevector.ll
@@ -286,10 +286,11 @@ define i32 @shufflevector_v2i16(<2 x i16> %a, <2 x i16> %b){
 ; CHECK-SD:       // %bb.0:
 ; CHECK-SD-NEXT:    sub sp, sp, #16
 ; CHECK-SD-NEXT:    .cfi_def_cfa_offset 16
-; CHECK-SD-NEXT:    ext v0.8b, v0.8b, v1.8b, #4
-; CHECK-SD-NEXT:    mov s1, v0.s[1]
-; CHECK-SD-NEXT:    str h0, [sp, #12]
+; CHECK-SD-NEXT:    // kill: def $d0 killed $d0 def $q0
+; CHECK-SD-NEXT:    // kill: def $d1 killed $d1 def $q1
 ; CHECK-SD-NEXT:    str h1, [sp, #14]
+; CHECK-SD-NEXT:    mov s0, v0.s[1]
+; CHECK-SD-NEXT:    str h0, [sp, #12]
 ; CHECK-SD-NEXT:    ldr w0, [sp, #12]
 ; CHECK-SD-NEXT:    add sp, sp, #16
 ; CHECK-SD-NEXT:    ret
@@ -491,10 +492,8 @@ define i32 @shufflevector_v2i16_zeroes(<2 x i16> %a, <2 x i16> %b){
 ; CHECK-SD-NEXT:    sub sp, sp, #16
 ; CHECK-SD-NEXT:    .cfi_def_cfa_offset 16
 ; CHECK-SD-NEXT:    // kill: def $d0 killed $d0 def $q0
-; CHECK-SD-NEXT:    dup v1.2s, v0.s[0]
+; CHECK-SD-NEXT:    str h0, [sp, #14]
 ; CHECK-SD-NEXT:    str h0, [sp, #12]
-; CHECK-SD-NEXT:    mov s1, v1.s[1]
-; CHECK-SD-NEXT:    str h1, [sp, #14]
 ; CHECK-SD-NEXT:    ldr w0, [sp, #12]
 ; CHECK-SD-NEXT:    add sp, sp, #16
 ; CHECK-SD-NEXT:    ret
diff --git a/llvm/test/CodeGen/Thumb2/active_lane_mask.ll b/llvm/test/CodeGen/Thumb2/active_lane_mask.ll
index bcd92f81911b2..cae8d6e3deaeb 100644
--- a/llvm/test/CodeGen/Thumb2/active_lane_mask.ll
+++ b/llvm/test/CodeGen/Thumb2/active_lane_mask.ll
@@ -107,6 +107,7 @@ define <7 x i32> @v7i32(i32 %index, i32 %TC, <7 x i32> %V1, <7 x i32> %V2) {
 ; CHECK-NEXT:    vstrw.32 q0, [r0]
 ; CHECK-NEXT:    vldrw.u32 q0, [r2]
 ; CHECK-NEXT:    ldr r2, [sp, #48]
+; CHECK-NEXT:    adds r0, #16
 ; CHECK-NEXT:    vqadd.u32 q0, q0, r1
 ; CHECK-NEXT:    ldr r1, [sp, #52]
 ; CHECK-NEXT:    vcmp.u32 hi, q3, q0
@@ -119,12 +120,9 @@ define <7 x i32> @v7i32(i32 %index, i32 %TC, <7 x i32> %V1, <7 x i32> %V2) {
 ; CHECK-NEXT:    ldr r1, [sp, #24]
 ; CHECK-NEXT:    vmov q1[2], q1[0], r2, r1
 ; CHECK-NEXT:    vpsel q0, q1, q0
-; CHECK-NEXT:    vmov r1, s2
-; CHECK-NEXT:    vmov.f32 s2, s1
-; CHECK-NEXT:    vmov r3, s0
-; CHECK-NEXT:    vmov r2, s2
-; CHECK-NEXT:    strd r3, r2, [r0, #16]
-; CHECK-NEXT:    str r1, [r0, #24]
+; CHECK-NEXT:    vmov r1, r2, d0
+; CHECK-NEXT:    vmov r3, s2
+; CHECK-NEXT:    stm r0!, {r1, r2, r3}
 ; CHECK-NEXT:    bx lr
 ; CHECK-NEXT:    .p2align 4
 ; CHECK-NEXT:  @ %bb.1:
diff --git a/llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-i16-add.ll b/llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-i16-add.ll
index 37f6bbeffd027..de508e67a7a77 100644
--- a/llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-i16-add.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-i16-add.ll
@@ -31,24 +31,19 @@ entry:
 define arm_aapcs_vfpcc <4 x i16> @complex_add_v4i16(<4 x i16> %a, <4 x i16> %b) {
 ; CHECK-LABEL: complex_add_v4i16:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    vrev64.32 q2, q0
-; CHECK-NEXT:    vmov r1, s6
-; CHECK-NEXT:    vmov r0, s10
-; CHECK-NEXT:    vrev64.32 q3, q1
-; CHECK-NEXT:    vmov r2, s4
-; CHECK-NEXT:    subs r0, r1, r0
-; CHECK-NEXT:    vmov r1, s8
+; CHECK-NEXT:    .save {r4, lr}
+; CHECK-NEXT:    push {r4, lr}
+; CHECK-NEXT:    vmov r12, r1, d1
+; CHECK-NEXT:    vmov r2, lr, d3
+; CHECK-NEXT:    vmov r3, r4, d2
 ; CHECK-NEXT:    subs r1, r2, r1
-; CHECK-NEXT:    vmov r2, s0
-; CHECK-NEXT:    vmov q2[2], q2[0], r1, r0
-; CHECK-NEXT:    vmov r0, s14
-; CHECK-NEXT:    vmov r1, s2
-; CHECK-NEXT:    add r0, r1
-; CHECK-NEXT:    vmov r1, s12
-; CHECK-NEXT:    add r1, r2
-; CHECK-NEXT:    vmov q2[3], q2[1], r1, r0
-; CHECK-NEXT:    vmov q0, q2
-; CHECK-NEXT:    bx lr
+; CHECK-NEXT:    vmov r2, r0, d0
+; CHECK-NEXT:    subs r0, r3, r0
+; CHECK-NEXT:    vmov q0[2], q0[0], r0, r1
+; CHECK-NEXT:    add.w r0, lr, r12
+; CHECK-NEXT:    adds r1, r4, r2
+; CHECK-NEXT:    vmov q0[3], q0[1], r1, r0
+; CHECK-NEXT:    pop {r4, pc}
 entry:
   %a.real = shufflevector <4 x i16> %a, <4 x i16> zeroinitializer, <2 x i32> <i32 0, i32 2>
   %a.imag = shufflevector <4 x i16> %a, <4 x i16> zeroinitializer, <2 x i32> <i32 1, i32 3>
diff --git a/llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-i8-add.ll b/llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-i8-add.ll
index 794894def9265..e11b3c773adf6 100644
--- a/llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-i8-add.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-complex-deinterleaving-i8-add.ll
@@ -31,24 +31,19 @@ entry:
 define arm_aapcs_vfpcc <4 x i8> @complex_add_v4i8(<4 x i8> %a, <4 x i8> %b) {
 ; CHECK-LABEL: complex_add_v4i8:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    vrev64.32 q2, q0
-; CHECK-NEXT:    vmov r1, s6
-; CHECK-NEXT:    vmov r0, s10
-; CHECK-NEXT:    vrev64.32 q3, q1
-; CHECK-NEXT:    vmov r2, s4
-; CHECK-NEXT:    subs r0, r1, r0
-; CHECK-NEXT:    vmov r1, s8
+; CHECK-NEXT:    .save {r4, lr}
+; CHECK-NEXT:    push {r4, lr}
+; CHECK-NEXT:    vmov r12, r1, d1
+; CHECK-NEXT:    vmov r2, lr, d3
+; CHECK-NEXT:    vmov r3, r4, d2
 ; CHECK-NEXT:    subs r1, r2, r1
-; CHECK-NEXT:    vmov r2, s0
-; CHECK-NEXT:    vmov q2[2], q2[0], r1, r0
-; CHECK-NEXT:    vmov r0, s14
-; CHECK-NEXT:    vmov r1, s2
-; CHECK-NEXT:    add r0, r1
-; CHECK-NEXT:    vmov r1, s12
-; CHECK-NEXT:    add r1, r2
-; CHECK-NEXT:    vmov q2[3], q2[1], r1, r0
-; CHECK-NEXT:    vmov q0, q2
-; CHECK-NEXT:    bx lr
+; CHECK-NEXT:    vmov r2, r0, d0
+; CHECK-NEXT:    subs r0, r3, r0
+; CHECK-NEXT:    vmov q0[2], q0[0], r0, r1
+; CHECK-NEXT:    add.w r0, lr, r12
+; CHECK-NEXT:    adds r1, r4, r2
+; CHECK-NEXT:    vmov q0[3], q0[1], r1, r0
+; CHECK-NEXT:    pop {r4, pc}
 entry:
   %a.real = shufflevector <4 x i8> %a, <4 x i8> zeroinitializer, <2 x i32> <i32 0, i32 2>
   %a.imag = shufflevector <4 x i8> %a, <4 x i8> zeroinitializer, <2 x i32> <i32 1, i32 3>
diff --git a/llvm/test/CodeGen/Thumb2/mve-fptosi-sat-vector.ll b/llvm/test/CodeGen/Thumb2/mve-fptosi-sat-vector.ll
index 77548b49d77f2..d535c64289d4f 100644
--- a/llvm/test/CodeGen/Thumb2/mve-fptosi-sat-vector.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-fptosi-sat-vector.ll
@@ -185,11 +185,10 @@ define arm_aapcs_vfpcc <6 x i32> @test_signed_v6f32_v6i32(<6 x float> %f) {
 ; CHECK-MVEFP:       @ %bb.0:
 ; CHECK-MVEFP-NEXT:    vcvt.s32.f32 q1, q1
 ; CHECK-MVEFP-NEXT:    vcvt.s32.f32 q0, q0
-; CHECK-MVEFP-NEXT:    vmov.f32 s6, s5
-; CHECK-MVEFP-NEXT:    vmov r2, s4
-; CHECK-MVEFP-NEXT:    vmov r1, s6
-; CHECK-MVEFP-NEXT:    strd r2, r1, [r0, #16]
+; CHECK-MVEFP-NEXT:    vmov r1, r2, d2
+; CHECK-MVEFP-NEXT:    str r2, [r0, #20]
 ; CHECK-MVEFP-NEXT:    vstrw.32 q0, [r0]
+; CHECK-MVEFP-NEXT:    str r1, [r0, #16]
 ; CHECK-MVEFP-NEXT:    bx lr
     %x = call <6 x i32> @llvm.fptosi.sat.v6f32.v6i32(<6 x float> %f)
     ret <6 x i32> %x
@@ -221,13 +220,11 @@ define arm_aapcs_vfpcc <7 x i32> @test_signed_v7f32_v7i32(<7 x float> %f) {
 ; CHECK-MVEFP:       @ %bb.0:
 ; CHECK-MVEFP-NEXT:    vcvt.s32.f32 q1, q1
 ; CHECK-MVEFP-NEXT:    vcvt.s32.f32 q0, q0
-; CHECK-MVEFP-NEXT:    vmov.f32 s10, s5
-; CHECK-MVEFP-NEXT:    vmov r2, s4
 ; CHECK-MVEFP-NEXT:    vmov r3, s6
-; CHECK-MVEFP-NEXT:    vmov r1, s10
-; CHECK-MVEFP-NEXT:    strd r2, r1, [r0, #16]
-; CHECK-MVEFP-NEXT:    str r3, [r0, #24]
+; CHECK-MVEFP-NEXT:    vmov r1, r2, d2
+; CHECK-MVEFP-NEXT:    strd r2, r3, [r0, #20]
 ; CHECK-MVEFP-NEXT:    vstrw.32 q0, [r0]
+; CHECK-MVEFP-NEXT:    str r1, [r0, #16]
 ; CHECK-MVEFP-NEXT:    bx lr
     %x = call <7 x i32> @llvm.fptosi.sat.v7f32.v7i32(<7 x float> %f)
     ret <7 x i32> %x
diff --git a/llvm/test/CodeGen/Thumb2/mve-fptoui-sat-vector.ll b/llvm/test/CodeGen/Thumb2/mve-fptoui-sat-vector.ll
index ee040feca4240..61f05347d511d 100644
--- a/llvm/test/CodeGen/Thumb2/mve-fptoui-sat-vector.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-fptoui-sat-vector.ll
@@ -172,11 +172,10 @@ define arm_aapcs_vfpcc <6 x i32> @test_unsigned_v6f32_v6i32(<6 x float> %f) {
 ; CHECK-MVEFP:       @ %bb.0:
 ; CHECK-MVEFP-NEXT:    vcvt.u32.f32 q1, q1
 ; CHECK-MVEFP-NEXT:    vcvt.u32.f32 q0, q0
-; CHECK-MVEFP-NEXT:    vmov.f32 s6, s5
-; CHECK-MVEFP-NEXT:    vmov r2, s4
-; CHECK-MVEFP-NEXT:    vmov r1, s6
-; CHECK-MVEFP-NEXT:    strd r2, r1, [r0, #16]
+; CHECK-MVEFP-NEXT:    vmov r1, r2, d2
+; CHECK-MVEFP-NEXT:    str r2, [r0, #20]
 ; CHECK-MVEFP-NEXT:    vstrw.32 q0, [r0]
+; CHECK-MVEFP-NEXT:    str r1, [r0, #16]
 ; CHECK-MVEFP-NEXT:    bx lr
     %x = call <6 x i32> @llvm.fptoui.sat.v6f32.v6i32(<6 x float> %f)
     ret <6 x i32> %x
@@ -208,13 +207,11 @@ define arm_aapcs_vfpcc <7 x i32> @test_unsigned_v7f32_v7i32(<7 x float> %f) {
 ; CHECK-MVEFP:       @ %bb.0:
 ; CHECK-MVEFP-NEXT:    vcvt.u32.f32 q1, q1
 ; CHECK-MVEFP-NEXT:    vcvt.u32.f32 q0, q0
-; CHECK-MVEFP-NEXT:    vmov.f32 s10, s5
-; CHECK-MVEFP-NEXT:    vmov r2, s4
 ; CHECK-MVEFP-NEXT:    vmov r3, s6
-; CHECK-MVEFP-NEXT:    vmov r1, s10
-; CHECK-MVEFP-NEXT:    strd r2, r1, [r0, #16]
-; CHECK-MVEFP-NEXT:    str r3, [r0, #24]
+; CHECK-MVEFP-NEXT:    vmov r1, r2, d2
+; CHECK-MVEFP-NEXT:    strd r2, r3, [r0, #20]
 ; CHECK-MVEFP-NEXT:    vstrw.32 q0, [r0]
+; CHECK-MVEFP-NEXT:    str r1, [r0, #16]
 ; CHECK-MVEFP-NEXT:    bx lr
     %x = call <7 x i32> @llvm.fptoui.sat.v7f32.v7i32(<7 x float> %f)
     ret <7 x i32> %x
diff --git a/llvm/test/CodeGen/Thumb2/mve-laneinterleaving-cost.ll b/llvm/test/CodeGen/Thumb2/mve-laneinterleaving-cost.ll
index 7be08b04c5957..0f71653afa408 100644
--- a/llvm/test/CodeGen/Thumb2/mve-laneinterleaving-cost.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-laneinterleaving-cost.ll
@@ -4,54 +4,45 @@
 define arm_aapcs_vfpcc <4 x i32> @loads_i32(ptr %A, ptr %B, ptr %C) {
 ; CHECK-LABEL: loads_i32:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    .save {r4, r5, r6, lr}
-; CHECK-NEXT:    push {r4, r5, r6, lr}
-; CHECK-NEXT:    vldrw.u32 q2, [r1]
-; CHECK-NEXT:    vmov.i64 q1, #0xffffffff
-; CHECK-NEXT:    vmov.f32 s0, s10
-; CHECK-NEXT:    vmov.f32 s2, s11
-; CHECK-NEXT:    vand q0, q0, q1
-; CHECK-NEXT:    vmov.f32 s10, s9
-; CHECK-NEXT:    vmov r1, r3, d0
-; CHECK-NEXT:    vand q2, q2, q1
-; CHECK-NEXT:    vmov r4, r5, d1
-; CHECK-NEXT:    vldrw.u32 q0, [r0]
+; CHECK-NEXT:    .save {r4, r5, r6, r7, r8, lr}
+; CHECK-NEXT:    push.w {r4, r5, r6, r7, r8, lr}
+; CHECK-NEXT:    vldrw.u32 q3, [r1]
+; CHECK-NEXT:    vldrw.u32 q1, [r0]
+; CHECK-NEXT:    vmov.i64 q2, #0xffffffff
+; CHECK-NEXT:    vmov.f32 s0, s12
+; CHECK-NEXT:    vmov.f32 s2, s13
+; CHECK-NEXT:    vmov lr, r0, d2
+; CHECK-NEXT:    vand q0, q0, q2
+; CHECK-NEXT:    vmov r1, r5, d1
+; CHECK-NEXT:    vmov.f32 s12, s14
+; CHECK-NEXT:    vmov.f32 s14, s15
+; CHECK-NEXT:    vand q2, q3, q2
+; CHECK-NEXT:    vmov r4, r3, d5
+; CHECK-NEXT:    asrs r6, r0, #31
+; CHECK-NEXT:    adds.w r12, r0, r1
+; CHECK-NEXT:    adc.w r1, r6, r5
+; CHECK-NEXT:    vmov r6, r5, d3
 ; CHECK-NEXT:    vldrw.u32 q1, [r2]
-; CHECK-NEXT:    vmov lr, r12, d5
-; CHECK-NEXT:    vmov.f32 s12, s2
-; CHECK-NEXT:    vmov.f32 s2, s3
-; CHECK-NEXT:    vmov r0, s12
-; CHECK-NEXT:    vmov.f32 s12, s6
-; CHECK-NEXT:    vmov.f32 s6, s7
-; CHECK-NEXT:    asrs r2, r0, #31
-; CHECK-NEXT:    adds r0, r0, r1
-; CHECK-NEXT:    adc.w r1, r2, r3
-; CHECK-NEXT:    vmov r2, s12
-; CHECK-NEXT:    asrl r0, r1, r2
-; CHECK-NEXT:    vmov r1, s2
-; CHECK-NEXT:    vmov.f32 s2, s1
-; CHECK-NEXT:    adds r2, r1, r4
-; CHECK-NEXT:    asr.w r3, r1, #31
-; CHECK-NEXT:    adc.w r1, r3, r5
-; CHECK-NEXT:    vmov r3, s6
-; CHECK-NEXT:    asrl r2, r1, r3
-; CHECK-NEXT:    vmov r4, r5, d4
-; CHECK-NEXT:    vmov r1, s2
-; CHECK-NEXT:    vmov.f32 s2, s5
-; CHECK-NEXT:    adds.w r6, r1, lr
-; CHECK-NEXT:    asr.w r3, r1, #31
-; CHECK-NEXT:    adc.w r1, r3, r12
-; CHECK-NEXT:    vmov r3, s2
-; CHECK-NEXT:    asrl r6, r1, r3
-; CHECK-NEXT:    vmov r1, s0
-; CHECK-NEXT:    adds r4, r4, r1
-; CHECK-NEXT:    asr.w r3, r1, #31
-; CHECK-NEXT:    adc.w r1, r3, r5
-; CHECK-NEXT:    vmov r3, s4
-; CHECK-NEXT:    asrl r4, r1, r3
-; CHECK-NEXT:    vmov q0[2], q0[0], r4, r0
-; CHECK-NEXT:    vmov q0[3], q0[1], r6, r2
-; CHECK-NEXT:    pop {r4, r5, r6, pc}
+; CHECK-NEXT:    vmov r2, r8, d3
+; CHECK-NEXT:    adds r0, r5, r4
+; CHECK-NEXT:    asr.w r4, r5, #31
+; CHECK-NEXT:    adc.w r5, r4, r3
+; CHECK-NEXT:    vmov r4, r7, d4
+; CHECK-NEXT:    asrs r3, r6, #31
+; CHECK-NEXT:    asrl r0, r5, r8
+; CHECK-NEXT:    adds r4, r4, r6
+; CHECK-NEXT:    adcs r3, r7
+; CHECK-NEXT:    asrl r4, r3, r2
+; CHECK-NEXT:    asr.w r2, lr, #31
+; CHECK-NEXT:    vmov r3, r7, d0
+; CHECK-NEXT:    adds.w r6, lr, r3
+; CHECK-NEXT:    adc.w r3, r2, r7
+; CHECK-NEXT:    vmov r2, r7, d2
+; CHECK-NEXT:    asrl r6, r3, r2
+; CHECK-NEXT:    asrl r12, r1, r7
+; CHECK-NEXT:    vmov q0[2], q0[0], r6, r4
+; CHECK-NEXT:    vmov q0[3], q0[1], r12, r0
+; CHECK-NEXT:    pop.w {r4, r5, r6, r7, r8, pc}
 entry:
   %a = load <4 x i32>, ptr %A, align 4
   %b = load <4 x i32>, ptr %B, align 4
@@ -136,55 +127,42 @@ define arm_aapcs_vfpcc void @load_store_i32(ptr %A, ptr %B, ptr %C, ptr %D) {
 ; CHECK:       @ %bb.0: @ %entry
 ; CHECK-NEXT:    .save {r4, r5, r6, r7, r8, lr}
 ; CHECK-NEXT:    push.w {r4, r5, r6, r7, r8, lr}
-; CHECK-NEXT:    .vsave {d8, d9}
-; CHECK-NEXT:    vpush {d8, d9}
 ; CHECK-NEXT:    vldrw.u32 q1, [r1]
+; CHECK-NEXT:    vldrw.u32 q3, [r0]
 ; CHECK-NEXT:    vmov.i64 q0, #0xffffffff
 ; CHECK-NEXT:    vmov.f32 s8, s6
 ; CHECK-NEXT:    vmov.f32 s10, s7
+; CHECK-NEXT:    vand q2, q2, q0
+; CHECK-NEXT:    vmov r5, r0, d7
+; CHECK-NEXT:    vmov r1, r7, d5
+; CHECK-NEXT:    vmov r12, lr, d4
+; CHECK-NEXT:    vldrw.u32 q2, [r2]
 ; CHECK-NEXT:    vmov.f32 s6, s5
-; CHECK-NEXT:    vand q4, q2, q0
-; CHECK-NEXT:    vand q2, q1, q0
-; CHECK-NEXT:    vldrw.u32 q0, [r0]
-; CHECK-NEXT:    vmov r4, r5, d9
-; CHECK-NEXT:    vldrw.u32 q1, [r2]
-; CHECK-NEXT:    vmov.f32 s12, s2
-; CHECK-NEXT:    vmov.f32 s2, s3
-; CHECK-NEXT:    vmov lr, r12, d8
-; CHECK-NEXT:    vmov.f32 s16, s6
-; CHECK-NEXT:    vmov.f32 s6, s7
-; CHECK-NEXT:    vmov r6, r1, d5
-; CHECK-NEXT:    vmov.f32 s10, s1
-; CHECK-NEXT:    vmov r0, s2
-; CHECK-NEXT:    vmov.f32 s2, s5
-; CHECK-NEXT:    adds.w r8, r0, r4
+; CHECK-NEXT:    vand q0, q1, q0
+; CHECK-NEXT:    adds.w r8, r0, r1
 ; CHECK-NEXT:    asr.w r2, r0, #31
-; CHECK-NEXT:    adcs r5, r2
-; CHECK-NEXT:    vmov r2, s6
-; CHECK-NEXT:    asrl r8, r5, r2
-; CHECK-NEXT:    vmov r2, s10
-; CHECK-NEXT:    vmov r5, r7, d4
-; CHECK-NEXT:    asrs r4, r2, #31
-; CHECK-NEXT:    adds r2, r2, r6
-; CHECK-NEXT:    adcs r1, r4
-; CHECK-NEXT:    vmov r4, s2
-; CHECK-NEXT:    asrl r2, r1, r4
-; CHECK-NEXT:    vmov r1, s12
-; CHECK-NEXT:    adds.w r6, r1, lr
-; CHECK-NEXT:    asr.w r4, r1, #31
-; CHECK-NEXT:    adc.w r1, r4, r12
-; CHECK-NEXT:    vmov r4, s16
-; CHECK-NEXT:    asrl r6, r1, r4
-; CHECK-NEXT:    vmov r1, s0
-; CHECK-NEXT:    adds r0, r1, r5
-; CHECK-NEXT:    asr.w r4, r1, #31
-; CHECK-NEXT:    adc.w r1, r4, r7
-; CHECK-NEXT:    vmov r7, s4
-; CHECK-NEXT:    asrl r0, r1, r7
-; CHECK-NEXT:    vmov q0[2], q0[0], r0, r6
-; CHECK-NEXT:    vmov q0[3], q0[1], r2, r8
-; CHECK-NEXT:    vstrw.32 q0, [r3]
-; CHECK-NEXT:    vpop {d8, d9}
+; CHECK-NEXT:    adcs r7, r2
+; CHECK-NEXT:    asrs r4, r5, #31
+; CHECK-NEXT:    adds.w r2, r5, r12
+; CHECK-NEXT:    vmov r6, r1, d6
+; CHECK-NEXT:    adc.w r5, r4, lr
+; CHECK-NEXT:    vmov r4, r12, d5
+; CHECK-NEXT:    asrl r2, r5, r4
+; CHECK-NEXT:    asrl r8, r7, r12
+; CHECK-NEXT:    vmov r5, r4, d0
+; CHECK-NEXT:    asrs r7, r1, #31
+; CHECK-NEXT:    adds r0, r6, r5
+; CHECK-NEXT:    asr.w r6, r6, #31
+; CHECK-NEXT:    adc.w r5, r6, r4
+; CHECK-NEXT:    vmov r6, r4, d4
+; CHECK-NEXT:    asrl r0, r5, r6
+; CHECK-NEXT:    vmov q1[2], q1[0], r0, r2
+; CHECK-NEXT:    vmov r0, r2, d1
+; CHECK-NEXT:    adds r0, r0, r1
+; CHECK-NEXT:    adc.w r1, r7, r2
+; CHECK-NEXT:    asrl r0, r1, r4
+; CHECK-NEXT:    vmov q1[3], q1[1], r0, r8
+; CHECK-NEXT:    vstrw.32 q1, [r3]
 ; CHECK-NEXT:    pop.w {r4, r5, r6, r7, r8, pc}
 entry:
   %a = load <4 x i32>, ptr %A, align 4
@@ -268,36 +246,31 @@ entry:
 define arm_aapcs_vfpcc void @load_one_store_i32(ptr %A, ptr %D) {
 ; CHECK-LABEL: load_one_store_i32:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    .save {r4, r5, r6, lr}
-; CHECK-NEXT:    push {r4, r5, r6, lr}
+; CHECK-NEXT:    .save {r4, r5, r6, r7, r9, lr}
+; CHECK-NEXT:    push.w {r4, r5, r6, r7, r9, lr}
 ; CHECK-NEXT:    vldrw.u32 q0, [r0]
-; CHECK-NEXT:    vmov.f32 s4, s2
-; CHECK-NEXT:    vmov.f32 s2, s3
-; CHECK-NEXT:    vmov r2, s2
-; CHECK-NEXT:    vmov.f32 s2, s1
-; CHECK-NEXT:    adds.w r12, r2, r2
-; CHECK-NEXT:    asr.w r3, r2, #31
-; CHECK-NEXT:    adc.w r3, r3, r2, asr #31
-; CHECK-NEXT:    asrl r12, r3, r2
-; CHECK-NEXT:    vmov r3, s2
-; CHECK-NEXT:    adds r2, r3, r3
-; CHECK-NEXT:    asr.w r0, r3, #31
-; CHECK-NEXT:    adc.w r5, r0, r3, asr #31
-; CHECK-NEXT:    vmov r0, s4
-; CHECK-NEXT:    asrl r2, r5, r3
+; CHECK-NEXT:    vmov r2, r3, d1
+; CHECK-NEXT:    vmov r5, r0, d0
+; CHECK-NEXT:    adds r6, r3, r3
+; CHECK-NEXT:    asr.w r12, r3, #31
+; CHECK-NEXT:    adc.w r9, r12, r3, asr #31
+; CHECK-NEXT:    adds r4, r2, r2
+; CHECK-NEXT:    asr.w r12, r2, #31
+; CHECK-NEXT:    adc.w r7, r12, r2, asr #31
+; CHECK-NEXT:    asrl r6, r9, r3
+; CHECK-NEXT:    asrl r4, r7, r2
+; CHECK-NEXT:    adds r2, r5, r5
+; CHECK-NEXT:    asr.w r7, r5, #31
+; CHECK-NEXT:    adc.w r7, r7, r5, asr #31
+; CHECK-NEXT:    asrl r2, r7, r5
+; CHECK-NEXT:    vmov q0[2], q0[0], r2, r4
 ; CHECK-NEXT:    adds r4, r0, r0
-; CHECK-NEXT:    asr.w r3, r0, #31
-; CHECK-NEXT:    adc.w r3, r3, r0, asr #31
+; CHECK-NEXT:    asr.w r2, r0, #31
+; CHECK-NEXT:    adc.w r3, r2, r0, asr #31
 ; CHECK-NEXT:    asrl r4, r3, r0
-; CHECK-NEXT:    vmov r0, s0
-; CHECK-NEXT:    adds r6, r0, r0
-; CHECK-NEXT:    asr.w r3, r0, #31
-; CHECK-NEXT:    adc.w r3, r3, r0, asr #31
-; CHECK-NEXT:    asrl r6, r3, r0
-; CHECK-NEXT:    vmov q0[2], q0[0], r6, r4
-; CHECK-NEXT:    vmov q0[3], q0[1], r2, r12
+; CHECK-NEXT:    vmov q0[3], q0[1], r4, r6
 ; CHECK-NEXT:    vstrw.32 q0, [r1]
-; CHECK-NEXT:    pop {r4, r5, r6, pc}
+; CHECK-NEXT:    pop.w {r4, r5, r6, r7, r9, pc}
 entry:
   %a = load <4 x i32>, ptr %A, align 4
   %sa = sext <4 x i32> %a to <4 x i64>
@@ -360,34 +333,30 @@ entry:
 define arm_aapcs_vfpcc void @mul_i32(ptr %A, ptr %B, i64 %C, ptr %D) {
 ; CHECK-LABEL: mul_i32:
 ; CHECK:       @ %bb.0: @ %entry
-; CHECK-NEXT:    .save {r4, r5, r6, r7, lr}
-; CHECK-NEXT:    push {r4, r5, r6, r7, lr}
-; CHECK-NEXT:    vldrw.u32 q1, [r0]
+; CHECK-NEXT:    .save {r4, r5, r6, r7, r8, lr}
+; CHECK-NEXT:    push.w {r4, r5, r6, r7, r8, lr}
 ; CHECK-NEXT:    vldrw.u32 q0, [r1]
-; CHECK-NEXT:    ldr.w lr, [sp, #20]
-; CHECK-NEXT:    vmov.f32 s10, s1
-; CHECK-NEXT:    vmov.f32 s14, s5
-; CHECK-NEXT:    vmov r5, s4
-; CHECK-NEXT:    vmov.f32 s4, s6
-; CHECK-NEXT:    vmov.f32 s6, s7
-; CHECK-NEXT:    vmov r0, s10
-; CHECK-NEXT:    vmov r1, s14
-; CHECK-NEXT:    smull r12, r3, r1, r0
-; CHECK-NEXT:    vmov r0, s0
+; CHECK-NEXT:    vldrw.u32 q1, [r0]
+; CHECK-NEXT:    ldr.w r12, [sp, #24]
+; CHECK-NEXT:    vmov r3, lr, d0
+; CHECK-NEXT:    vmov r0, r1, d2
 ; CHECK-NEXT:    vmov.f32 s0, s2
 ; CHECK-NEXT:    vmov.f32 s2, s3
+; CHECK-NEXT:    vmov.f32 s4, s6
+; CHECK-NEXT:    vmov.f32 s6, s7
 ; CHECK-NEXT:    vmullb.s32 q2, q1, q0
-; CHECK-NEXT:    asrl r12, r3, r2
-; CHECK-NEXT:    vmov r6, r1, d4
-; CHECK-NEXT:    vmov r4, r7, d5
+; CHECK-NEXT:    vmov r4, r5, d5
+; CHECK-NEXT:    asrl r4, r5, r2
+; CHECK-NEXT:    smull r8, r3, r0, r3
+; CHECK-NEXT:    vmov r0, r7, d4
+; CHECK-NEXT:    asrl r0, r7, r2
+; CHECK-NEXT:    smull r6, r1, r1, lr
+; CHECK-NEXT:    asrl r8, r3, r2
+; CHECK-NEXT:    vmov q...
[truncated]

@llvmbot
Member

llvmbot commented Sep 9, 2025

@llvm/pr-subscribers-backend-aarch64

Author: ZhaoQi (zhaoqi5)

Changes

(Same patch description and truncated diff as the comment above.)

Collaborator

@RKSimon RKSimon left a comment


LGTM

@zhaoqi5 zhaoqi5 merged commit 4621e17 into main Sep 10, 2025
14 checks passed
@zhaoqi5 zhaoqi5 deleted the users/zhaoqi5/relax-extractelt-combine-condition branch September 10, 2025 07:51
@aeubanks
Contributor

this is causing hangs on the following IR:

$ cat /tmp/a.ll
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-ios17.0.0-simulator"

declare void @llvm.memset.p0.i64(ptr writeonly captures(none), i8, i64, i1 immarg)

declare void @llvm.memcpy.p0.p0.i64(ptr noalias writeonly captures(none), ptr noalias readonly captures(none), i64, i1 immarg)

define ptr @_ZN5SkM449setRotateE4SkV3f(ptr noundef returned writeonly align 4 captures(ret: address, provenance) dereferenceable_or_null(64) initializes((0, 64)) %this, <2 x float> %axis.coerce0, float %axis.coerce1, float noundef %radians) {
entry:
  %0 = fmul <2 x float> %axis.coerce0, %axis.coerce0
  %shift = shufflevector <2 x float> %0, <2 x float> poison, <2 x i32> <i32 1, i32 poison>
  %foldExtExtBinop = fadd <2 x float> %0, %shift
  %add.i.i = extractelement <2 x float> %foldExtExtBinop, i64 0
  %mul5.i.i = fmul float %axis.coerce1, %axis.coerce1
  %add6.i.i = fadd float %mul5.i.i, %add.i.i
  %1 = tail call noundef float @llvm.sqrt.f32(float %add6.i.i)
  %cmp = fcmp ogt float %add6.i.i, 0.000000e+00
  %sub.i = fsub float %1, %1
  %cmp.i = fcmp ord float %sub.i, 0.000000e+00
  %or.cond = and i1 %cmp, %cmp.i
  %div = fdiv float 1.000000e+00, %1
  %mul5.i = fmul float %axis.coerce1, %div
  %2 = tail call noundef float @llvm.sin.f32(float %radians)
  %3 = tail call noundef float @llvm.cos.f32(float %radians)
  %sub.i.i = fsub float 1.000000e+00, %3
  %mul8.i.i = fmul float %2, %mul5.i
  %mul33.i.i = fmul float %sub.i.i, %mul5.i
  %mul34.i.i = fmul float %mul5.i, %mul33.i.i
  %add35.i.i = fadd float %3, %mul34.i.i
  %4 = insertelement <2 x float> poison, float %div, i64 0
  %5 = shufflevector <2 x float> %4, <2 x float> poison, <2 x i32> zeroinitializer
  %6 = fmul <2 x float> %axis.coerce0, %5
  %7 = extractelement <2 x float> %6, i64 0
  %mul.i.i8 = fmul float %sub.i.i, %7
  %8 = insertelement <2 x float> poison, float %mul.i.i8, i64 0
  %9 = shufflevector <2 x float> %8, <2 x float> poison, <2 x i32> zeroinitializer
  %10 = fmul <2 x float> %6, %9
  %11 = extractelement <2 x float> %10, i64 1
  %sub9.i.i = fsub float %11, %mul8.i.i
  %mul11.i.i = fmul float %mul5.i, %mul.i.i8
  %12 = extractelement <2 x float> %6, i64 1
  %mul12.i.i = fmul float %2, %12
  %add13.i.i = fadd float %mul12.i.i, %mul11.i.i
  %13 = insertelement <2 x float> poison, float %3, i64 0
  %14 = insertelement <2 x float> %13, float %mul8.i.i, i64 1
  %15 = fadd <2 x float> %14, %10
  %mul18.i.i = fmul float %sub.i.i, %12
  %mul22.i.i = fmul float %mul5.i, %mul18.i.i
  %sub28.i.i = fsub float %mul11.i.i, %mul12.i.i
  store <2 x float> %15, ptr %this, align 4
  %ref.tmp.sroa.5.0.this.sroa_idx.i.i = getelementptr inbounds nuw i8, ptr %this, i64 8
  store float %sub28.i.i, ptr %ref.tmp.sroa.5.0.this.sroa_idx.i.i, align 4
  %ref.tmp.sroa.6.0.this.sroa_idx.i.i = getelementptr inbounds nuw i8, ptr %this, i64 12
  store float 0.000000e+00, ptr %ref.tmp.sroa.6.0.this.sroa_idx.i.i, align 4
  %ref.tmp.sroa.7.0.this.sroa_idx.i.i = getelementptr inbounds nuw i8, ptr %this, i64 16
  store float %sub9.i.i, ptr %ref.tmp.sroa.7.0.this.sroa_idx.i.i, align 4
  %ref.tmp.sroa.8.0.this.sroa_idx.i.i = getelementptr inbounds nuw i8, ptr %this, i64 20
  %16 = insertelement <2 x float> poison, float %2, i64 0
  %17 = insertelement <2 x float> %16, float %mul18.i.i, i64 1
  %18 = fmul <2 x float> %6, %17
  %19 = extractelement <2 x float> %18, i64 0
  %sub24.i.i = fsub float %mul22.i.i, %19
  %20 = insertelement <2 x float> poison, float %mul22.i.i, i64 0
  %21 = insertelement <2 x float> %20, float %3, i64 1
  %22 = fadd <2 x float> %21, %18
  %23 = shufflevector <2 x float> %22, <2 x float> poison, <2 x i32> <i32 1, i32 0>
  store <2 x float> %23, ptr %ref.tmp.sroa.8.0.this.sroa_idx.i.i, align 4
  %ref.tmp.sroa.10.0.this.sroa_idx.i.i = getelementptr inbounds nuw i8, ptr %this, i64 28
  store float 0.000000e+00, ptr %ref.tmp.sroa.10.0.this.sroa_idx.i.i, align 4
  %ref.tmp.sroa.11.0.this.sroa_idx.i.i = getelementptr inbounds nuw i8, ptr %this, i64 32
  store float %add13.i.i, ptr %ref.tmp.sroa.11.0.this.sroa_idx.i.i, align 4
  %ref.tmp.sroa.12.0.this.sroa_idx.i.i = getelementptr inbounds nuw i8, ptr %this, i64 36
  store float %sub24.i.i, ptr %ref.tmp.sroa.12.0.this.sroa_idx.i.i, align 4
  %ref.tmp.sroa.13.0.this.sroa_idx.i.i = getelementptr inbounds nuw i8, ptr %this, i64 40
  store float %add35.i.i, ptr %ref.tmp.sroa.13.0.this.sroa_idx.i.i, align 4
  %ref.tmp.sroa.14.0.this.sroa_idx.i.i = getelementptr inbounds nuw i8, ptr %this, i64 44
  %ref.tmp.sroa.18.0.this.sroa_idx.i.i = getelementptr inbounds nuw i8, ptr %this, i64 60
  tail call void @llvm.memset.p0.i64(ptr noundef nonnull align 4 dereferenceable(16) %ref.tmp.sroa.14.0.this.sroa_idx.i.i, i8 0, i64 16, i1 false)
  store float 1.000000e+00, ptr %ref.tmp.sroa.18.0.this.sroa_idx.i.i, align 4
  ret ptr null
}

declare float @llvm.sqrt.f32(float)

declare float @llvm.sin.f32(float)

declare float @llvm.cos.f32(float)

$ llc -o /dev/null /tmp/a.ll
hang ... 

I'll revert this in the meantime
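
One way to picture the failure mode (a toy model only — the actual nodes involved in this reproducer are not diagnosed in the thread): the combiner rewrites one form into another, a Custom hook rewrites it back, and the worklist never reaches a fixed point. A self-contained C++ illustration, deliberately bounded so it terminates:

// Toy model (not LLVM code) of a combine/lowering ping-pong.
#include <cstdio>

enum class Form { ExtractOfShuffle, PlainExtract };

// Models the relaxed combine: extract(shuffle) -> extract.
Form combine(Form F) {
  return F == Form::ExtractOfShuffle ? Form::PlainExtract : F;
}

// Models a hypothetical Custom lowering that reintroduces the shuffle.
Form customLower(Form F) {
  return F == Form::PlainExtract ? Form::ExtractOfShuffle : F;
}

int main() {
  Form F = Form::ExtractOfShuffle;
  for (int Step = 0; Step < 6; ++Step) { // bounded here; llc has no such bound
    F = (Step % 2 == 0) ? combine(F) : customLower(F);
    std::printf("step %d: %s\n", Step,
                F == Form::PlainExtract ? "plain extract"
                                        : "extract(shuffle)");
  }
  // The node oscillates between the two forms indefinitely.
  return 0;
}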

@zhaoqi5
Contributor Author

zhaoqi5 commented Sep 11, 2025

> this is causing hangs on the following IR: […]
>
> $ llc -o /dev/null /tmp/a.ll
> hang ...
>
> I'll revert this in the meantime

Thank you for pointing out this issue and providing an example.

If possible, I think it would be better to address this in the targets, since this change enables broader optimization opportunities across all targets.

If anyone is willing to take a look, that would be great. I'll also continue to study this issue further when time permits.
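
To make the "address this in the targets" suggestion concrete, one hypothetical shape of such a fix (illustrative only; `MyTargetLowering` is not an existing class, and the real X86 cycle behind this reproducer has not been diagnosed here) is for the Custom handler to bail out of any case it would lower back into a pattern this combine consumes:

// Hypothetical sketch. Returning an empty SDValue from a Custom handler
// lets the legalizer fall back to its default expansion instead of
// emitting a pattern the combiner will immediately undo.
SDValue MyTargetLowering::LowerEXTRACT_VECTOR_ELT(SDValue Op,
                                                  SelectionDAG &DAG) const {
  SDValue Vec = Op.getOperand(0);
  auto *IdxC = dyn_cast<ConstantSDNode>(Op.getOperand(1));
  // Handle only the cases this target actually improves; anything that
  // would be lowered back into extract(vector_shuffle) must bail out,
  // or the relaxed combine and this hook can ping-pong forever.
  if (!IdxC /* || !isProfitableCase(Vec, IdxC->getZExtValue()) */)
    return SDValue();
  // ... emit the target-specific sequence for the profitable cases ...
  return SDValue();
}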
