Conversation

@mark-sed (Contributor) commented Oct 9, 2025

This patch adds a profitability check to loop rotation: rotate the loop if doing so makes the latch exit count computable.

This form is beneficial to runtime loop unrolling as well as loop vectorization, which requires the loop to be bottom-tested.

I have tried different approaches to improve runtime unrolling (#146540 and #148243), none of which seemed the right way to go.

After discussion with @annamthomas, we now think this additional heuristic of checking for a computable latch exit count is a stronger condition worth adding for the rotated loop form -- a good canonical form to rotate a loop into. This form helps with runtime loop unrolling (see test), and LoopVectorization already requires rotation to make the loop bottom-tested (unconditional latch -> rotate the loop to make the latch conditional). There are other passes as well that prefer this "bottom-tested" form for loops (a small sketch of such a loop follows the list below):

  1. EarlyExitVectorization:
    PSE.getSE()->getPredicatedExitCount(TheLoop, LatchBB, &Predicates))) {
  2. LoopConstrainer:
    if (isa<SCEVCouldNotCompute>(MaxBETakenCount)) {
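
To illustrate, here is a C++ sketch (not taken from the patch; it mirrors the IR test added below). The header exit has a SCEV-computable exit count, while the in-body exit depends on loaded data and does not; rotation turns the counted test into the conditional latch:

  // Sketch only: C++ equivalent of the IR test in this patch. Before
  // rotation, the counted condition is the header exit and the
  // data-dependent break is the latch exit, so SCEV cannot compute the
  // latch exit count. After rotation, the counted test becomes the latch.
  void test(unsigned long n, int *p) {
    while (n != 0) { // header exit: exit count computable by SCEV
      ++n;
      if (*p == 0)   // latch exit: data-dependent, not computable
        return;
    }
  }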

@llvmbot (Member) commented Oct 9, 2025

@llvm/pr-subscribers-llvm-transforms

Author: Marek Sedláček (mark-sed)

Review request: @fhahn @nikic @davemgreen


Full diff: https://github.com/llvm/llvm-project/pull/162654.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Utils/LoopRotationUtils.cpp (+15-1)
  • (added) llvm/test/Transforms/LoopUnroll/X86/runtime-unroll-after-rotate-if-computable.ll (+122)
diff --git a/llvm/lib/Transforms/Utils/LoopRotationUtils.cpp b/llvm/lib/Transforms/Utils/LoopRotationUtils.cpp
index 0c8d6fa47b9ae..2abc56d95a393 100644
--- a/llvm/lib/Transforms/Utils/LoopRotationUtils.cpp
+++ b/llvm/lib/Transforms/Utils/LoopRotationUtils.cpp
@@ -200,6 +200,19 @@ static bool profitableToRotateLoopExitingLatch(Loop *L) {
   return false;
 }
 
+// Checks whether rotating the loop would make the latch exit count computable.
+// This form is beneficial to runtime loop unrolling as well as loop
+// vectorization, which requires the loop to be bottom-tested.
+static bool rotationMakesLoopComputable(Loop *L, ScalarEvolution *SE) {
+  BasicBlock *Header = L->getHeader();
+  BranchInst *BI = dyn_cast<BranchInst>(Header->getTerminator());
+  assert(BI && BI->isConditional() && "need header with conditional exit");
+  if (SE && isa<SCEVCouldNotCompute>(SE->getExitCount(L, L->getLoopLatch())) &&
+      !isa<SCEVCouldNotCompute>(SE->getExitCount(L, Header)))
+    return true;
+  return false;
+}
+
 static void updateBranchWeights(BranchInst &PreHeaderBI, BranchInst &LoopBI,
                                 bool HasConditionalPreHeader,
                                 bool SuccsSwapped) {
@@ -364,7 +377,8 @@ bool LoopRotate::rotateLoop(Loop *L, bool SimplifiedLatch) {
   // Rotate if the loop latch was just simplified. Or if it makes the loop exit
   // count computable. Or if we think it will be profitable.
   if (L->isLoopExiting(OrigLatch) && !SimplifiedLatch && IsUtilMode == false &&
-      !profitableToRotateLoopExitingLatch(L))
+      !profitableToRotateLoopExitingLatch(L) &&
+      !rotationMakesLoopComputable(L, SE))
     return Rotated;
 
   // Check size of original header and reject loop if it is very big or we can't
diff --git a/llvm/test/Transforms/LoopUnroll/X86/runtime-unroll-after-rotate-if-computable.ll b/llvm/test/Transforms/LoopUnroll/X86/runtime-unroll-after-rotate-if-computable.ll
new file mode 100644
index 0000000000000..2a408fbb364da
--- /dev/null
+++ b/llvm/test/Transforms/LoopUnroll/X86/runtime-unroll-after-rotate-if-computable.ll
@@ -0,0 +1,122 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt --passes='loop(loop-rotate),loop-unroll' -unroll-runtime=true -unroll-runtime-other-exit-predictable=1 -S %s | FileCheck %s
+; RUN: opt --passes='loop-unroll' -unroll-runtime=true -unroll-runtime-other-exit-predictable=1 -S %s | FileCheck %s -check-prefix=NO-ROTATE
+
+target triple = "x86_64-unknown-linux-gnu"
+
+; Test that loop gets unrolled if rotated (becomes computable after rotation).
+define void @test(i64 %0, ptr %1) {
+; CHECK-LABEL: define void @test(
+; CHECK-SAME: i64 [[TMP0:%.*]], ptr [[TMP1:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[B1:%.*]] = icmp eq i64 [[TMP0]], 0
+; CHECK-NEXT:    br i1 [[B1]], label %[[AFTER:.*]], label %[[BODY_LR_PH:.*]]
+; CHECK:       [[BODY_LR_PH]]:
+; CHECK-NEXT:    [[TMP5:%.*]] = sub i64 0, [[TMP0]]
+; CHECK-NEXT:    [[TMP2:%.*]] = freeze i64 [[TMP5]]
+; CHECK-NEXT:    [[TMP3:%.*]] = add i64 [[TMP2]], -1
+; CHECK-NEXT:    [[XTRAITER:%.*]] = and i64 [[TMP2]], 7
+; CHECK-NEXT:    [[LCMP_MOD:%.*]] = icmp ne i64 [[XTRAITER]], 0
+; CHECK-NEXT:    br i1 [[LCMP_MOD]], label %[[BODY_PROL_PREHEADER:.*]], label %[[BODY_PROL_LOOPEXIT:.*]]
+; CHECK:       [[BODY_PROL_PREHEADER]]:
+; CHECK-NEXT:    br label %[[BODY_PROL:.*]]
+; CHECK:       [[BODY_PROL]]:
+; CHECK-NEXT:    [[A2_PROL:%.*]] = phi i64 [ [[TMP0]], %[[BODY_PROL_PREHEADER]] ], [ [[A_PROL:%.*]], %[[HEADER_PROL:.*]] ]
+; CHECK-NEXT:    [[PROL_ITER:%.*]] = phi i64 [ 0, %[[BODY_PROL_PREHEADER]] ], [ [[PROL_ITER_NEXT:%.*]], %[[HEADER_PROL]] ]
+; CHECK-NEXT:    [[C_PROL:%.*]] = add i64 [[A2_PROL]], 1
+; CHECK-NEXT:    [[D_PROL:%.*]] = load i32, ptr [[TMP1]], align 4
+; CHECK-NEXT:    [[E_PROL:%.*]] = icmp eq i32 [[D_PROL]], 0
+; CHECK-NEXT:    br i1 [[E_PROL]], label %[[END_LOOPEXIT3:.*]], label %[[HEADER_PROL]]
+; CHECK:       [[HEADER_PROL]]:
+; CHECK-NEXT:    [[A_PROL]] = phi i64 [ [[C_PROL]], %[[BODY_PROL]] ]
+; CHECK-NEXT:    [[B_PROL:%.*]] = icmp eq i64 [[A_PROL]], 0
+; CHECK-NEXT:    [[PROL_ITER_NEXT]] = add i64 [[PROL_ITER]], 1
+; CHECK-NEXT:    [[PROL_ITER_CMP:%.*]] = icmp ne i64 [[PROL_ITER_NEXT]], [[XTRAITER]]
+; CHECK-NEXT:    br i1 [[PROL_ITER_CMP]], label %[[BODY_PROL]], label %[[BODY_PROL_LOOPEXIT_UNR_LCSSA:.*]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[BODY_PROL_LOOPEXIT_UNR_LCSSA]]:
+; CHECK-NEXT:    [[A2_UNR_PH:%.*]] = phi i64 [ [[A_PROL]], %[[HEADER_PROL]] ]
+; CHECK-NEXT:    br label %[[BODY_PROL_LOOPEXIT]]
+; CHECK:       [[BODY_PROL_LOOPEXIT]]:
+; CHECK-NEXT:    [[A2_UNR:%.*]] = phi i64 [ [[TMP0]], %[[BODY_LR_PH]] ], [ [[A2_UNR_PH]], %[[BODY_PROL_LOOPEXIT_UNR_LCSSA]] ]
+; CHECK-NEXT:    [[TMP6:%.*]] = icmp ult i64 [[TMP3]], 7
+; CHECK-NEXT:    br i1 [[TMP6]], label %[[HEADER_AFTER_CRIT_EDGE:.*]], label %[[BODY_LR_PH_NEW:.*]]
+; CHECK:       [[BODY_LR_PH_NEW]]:
+; CHECK-NEXT:    br label %[[BODY:.*]]
+; CHECK:       [[HEADER:.*]]:
+; CHECK-NEXT:    br i1 false, label %[[END_LOOPEXIT:.*]], label %[[HEADER_1:.*]]
+; CHECK:       [[HEADER_1]]:
+; CHECK-NEXT:    br i1 false, label %[[END_LOOPEXIT]], label %[[HEADER_2:.*]]
+; CHECK:       [[HEADER_2]]:
+; CHECK-NEXT:    br i1 false, label %[[END_LOOPEXIT]], label %[[HEADER_3:.*]]
+; CHECK:       [[HEADER_3]]:
+; CHECK-NEXT:    br i1 false, label %[[END_LOOPEXIT]], label %[[HEADER_4:.*]]
+; CHECK:       [[HEADER_4]]:
+; CHECK-NEXT:    br i1 false, label %[[END_LOOPEXIT]], label %[[HEADER_5:.*]]
+; CHECK:       [[HEADER_5]]:
+; CHECK-NEXT:    br i1 false, label %[[END_LOOPEXIT]], label %[[HEADER_6:.*]]
+; CHECK:       [[HEADER_6]]:
+; CHECK-NEXT:    [[C_7:%.*]] = add i64 [[A2:%.*]], 8
+; CHECK-NEXT:    br i1 false, label %[[END_LOOPEXIT]], label %[[HEADER_7:.*]]
+; CHECK:       [[HEADER_7]]:
+; CHECK-NEXT:    [[B_7:%.*]] = icmp eq i64 [[C_7]], 0
+; CHECK-NEXT:    br i1 [[B_7]], label %[[HEADER_AFTER_CRIT_EDGE_UNR_LCSSA:.*]], label %[[BODY]]
+; CHECK:       [[BODY]]:
+; CHECK-NEXT:    [[A2]] = phi i64 [ [[A2_UNR]], %[[BODY_LR_PH_NEW]] ], [ [[C_7]], %[[HEADER_7]] ]
+; CHECK-NEXT:    [[D:%.*]] = load i32, ptr [[TMP1]], align 4
+; CHECK-NEXT:    [[E:%.*]] = icmp eq i32 [[D]], 0
+; CHECK-NEXT:    br i1 [[E]], label %[[END_LOOPEXIT]], label %[[HEADER]]
+; CHECK:       [[END_LOOPEXIT]]:
+; CHECK-NEXT:    br label %[[END:.*]]
+; CHECK:       [[END_LOOPEXIT3]]:
+; CHECK-NEXT:    br label %[[END]]
+; CHECK:       [[END]]:
+; CHECK-NEXT:    ret void
+; CHECK:       [[HEADER_AFTER_CRIT_EDGE_UNR_LCSSA]]:
+; CHECK-NEXT:    br label %[[HEADER_AFTER_CRIT_EDGE]]
+; CHECK:       [[HEADER_AFTER_CRIT_EDGE]]:
+; CHECK-NEXT:    br label %[[AFTER]]
+; CHECK:       [[AFTER]]:
+; CHECK-NEXT:    ret void
+;
+; NO-ROTATE-LABEL: define void @test(
+; NO-ROTATE-SAME: i64 [[TMP0:%.*]], ptr [[TMP1:%.*]]) {
+; NO-ROTATE-NEXT:  [[ENTRY:.*]]:
+; NO-ROTATE-NEXT:    br label %[[HEADER:.*]]
+; NO-ROTATE:       [[HEADER]]:
+; NO-ROTATE-NEXT:    [[A_PROL:%.*]] = phi i64 [ [[TMP0]], %[[ENTRY]] ], [ [[C:%.*]], %[[BODY:.*]] ]
+; NO-ROTATE-NEXT:    [[B_PROL:%.*]] = icmp eq i64 [[A_PROL]], 0
+; NO-ROTATE-NEXT:    br i1 [[B_PROL]], label %[[AFTER:.*]], label %[[BODY]]
+; NO-ROTATE:       [[BODY]]:
+; NO-ROTATE-NEXT:    [[C]] = add i64 [[A_PROL]], 1
+; NO-ROTATE-NEXT:    [[D:%.*]] = load i32, ptr [[TMP1]], align 4
+; NO-ROTATE-NEXT:    [[E:%.*]] = icmp eq i32 [[D]], 0
+; NO-ROTATE-NEXT:    br i1 [[E]], label %[[END:.*]], label %[[HEADER]]
+; NO-ROTATE:       [[END]]:
+; NO-ROTATE-NEXT:    ret void
+; NO-ROTATE:       [[AFTER]]:
+; NO-ROTATE-NEXT:    ret void
+;
+entry:
+  br label %header
+
+header:
+  %a = phi i64 [ %0, %entry ], [ %c, %body ]
+  %b = icmp eq i64 %a, 0
+  br i1 %b, label %after, label %body
+
+body:
+  %c = add i64 %a, 1
+  %d = load i32, ptr %1, align 4
+  %e = icmp eq i32 %d, 0
+  br i1 %e, label %end, label %header
+
+end:
+  ret void
+
+after:
+  ret void
+}
+;.
+; CHECK: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]]}
+; CHECK: [[META1]] = !{!"llvm.loop.unroll.disable"}
+;.

@annamthomas (Contributor) commented

this form helps with runtime loop unrolling (see test) and also LoopVectorization requires rotation to make the loop bottom-tested (unconditional latch -> rotate the loop to make latch conditional). There are other passes as well which prefers this "bottom-tested" notation for loops

Just to clarify, the main reason for loop rotation has always been to make the loop bottom-tested. What is being proposed here is a stronger heuristic: make the bottom-tested loop countable (if possible).

As shown in the testcase, this helps loop unrolling. We can also potentially vectorize more early-exit loops (as seen in the code referenced in the description); one such loop is sketched below.
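
For instance, a hypothetical early-exit search loop of this shape (illustration only, not from the PR) has a countable bound plus a data-dependent exit, and the early-exit vectorization work needs to be able to compute the exit count of the counted exit:

  // Illustration only: a typical early-exit loop. The counted bound
  // `i != n` is SCEV-computable; the data-dependent match exit is not.
  int find(const int *a, int n, int key) {
    for (int i = 0; i != n; ++i) // counted exit
      if (a[i] == key)           // early, data-dependent exit
        return i;
    return -1;
  }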

@nikic (Contributor) commented Oct 9, 2025

It looks like this has a large impact on compile-time: https://llvm-compile-time-tracker.com/compare.php?from=5841319aca0f2596cc00ab83d54ec07c9b70da3c&to=08496a966850e06788de4bbf75e521ee5ed1363c&stat=instructions:u
The clang thin link is up +0.5%.

Don't know whether that's due to the transform itself or whether it's because this enables expensive followup transforms.

@mark-sed (Contributor, Author) commented

Don't know whether that's due to the transform itself or whether it's because this enables expensive followup transforms.

@nikic That was unexpected. Do you have any recommendation on what to do about this?
We can see that this brings potential speedups from unrolling and vectorization where they were not possible before, and I am not sure what trade-off of compile time for performance is justifiable.

One option I could propose is enabling this only at higher optimization levels. Alternatively, it could be controlled by an off-by-default flag.

@mark-sed (Contributor, Author) commented

Kind ping @nikic @fhahn

@nikic (Contributor) commented Oct 23, 2025

To add some more data here, this is the impact of just performing the analysis without actually doing the rotation: https://llvm-compile-time-tracker.com/compare.php?from=5841319aca0f2596cc00ab83d54ec07c9b70da3c&to=d60a14f066e38e572d5266b7526c575c391cf657&stat=instructions%3Au
And this is the impact of doing the rotation (likely not the rotation itself but second-order effects from later transforms): https://llvm-compile-time-tracker.com/compare.php?stat=instructions%3Au&from=d60a14f066e38e572d5266b7526c575c391cf657&to=08496a966850e06788de4bbf75e521ee5ed1363c

So the main cost here is the analysis part. I believe the problem is that LoopRotate is an LPM1 pass, and LPM1 does not use SCEV; all the SCEV-based passes are in LPM2. This means we're constructing SCEV just for LoopRotate.

The heuristic itself does seem pretty sensible to me.

Regarding unrolling specifically, I wonder whether it would make sense to support non-latch runtime unrolling? IIRC the non-runtime unrolling used to be limited to latch exits as well, but I generalized that to arbitrary exits at some point. (I haven't looked into whether this is feasible...)

…cking in LPM1 to avoid computation of SCEV just for rotate.
@mark-sed (Contributor, Author) commented Nov 5, 2025

@nikic I have enabled the computability check only in LPM2 to avoid the SCEV computation you mentioned. Hopefully this will lower the compile-time impact (a sketch of the wiring is below).

I have copied this branch to perf/loop-rotate-if-computable to get compile time measurements.
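
A minimal sketch of the intended wiring (assumed, not the exact upstream code; the parameter names come from the LoopRotatePass constructor shown in the clang-format diff below, and the actual call sites in PassBuilderPipelines.cpp may differ):

  // LPM1 disables the SCEV-based check so ScalarEvolution is not
  // constructed just for LoopRotate; LPM2, whose passes already use
  // SCEV, keeps it enabled.
  LPM1.addPass(LoopRotatePass(EnableHeaderDuplication, PrepareForLTO,
                              /*RotateComputable=*/false));
  LPM2.addPass(LoopRotatePass(EnableHeaderDuplication, PrepareForLTO,
                              /*RotateComputable=*/true));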

@github-actions bot commented Nov 5, 2025

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff origin/main HEAD --extensions cpp,h -- llvm/include/llvm/Transforms/Scalar/LoopRotation.h llvm/include/llvm/Transforms/Utils/LoopRotationUtils.h llvm/lib/Passes/PassBuilderPipelines.cpp llvm/lib/Transforms/Scalar/LoopRotation.cpp llvm/lib/Transforms/Utils/LoopRotationUtils.cpp --diff_from_common_commit

⚠️ The reproduction instructions above might return results for more than one PR in a stack if you are using a stacked PR workflow. You can limit the results by changing origin/main to the base branch/commit you want to compare against. ⚠️

View the diff from clang-format here.
diff --git a/llvm/include/llvm/Transforms/Scalar/LoopRotation.h b/llvm/include/llvm/Transforms/Scalar/LoopRotation.h
index a5299ad62..4117c1505 100644
--- a/llvm/include/llvm/Transforms/Scalar/LoopRotation.h
+++ b/llvm/include/llvm/Transforms/Scalar/LoopRotation.h
@@ -24,8 +24,7 @@ class Loop;
 class LoopRotatePass : public PassInfoMixin<LoopRotatePass> {
 public:
   LoopRotatePass(bool EnableHeaderDuplication = true,
-                 bool PrepareForLTO = false,
-                 bool RotateComputable = true);
+                 bool PrepareForLTO = false, bool RotateComputable = true);
   PreservedAnalyses run(Loop &L, LoopAnalysisManager &AM,
                         LoopStandardAnalysisResults &AR, LPMUpdater &U);
 
diff --git a/llvm/lib/Transforms/Scalar/LoopRotation.cpp b/llvm/lib/Transforms/Scalar/LoopRotation.cpp
index 4fd092908..8cf0fa1ec 100644
--- a/llvm/lib/Transforms/Scalar/LoopRotation.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopRotation.cpp
@@ -75,10 +75,10 @@ PreservedAnalyses LoopRotatePass::run(Loop &L, LoopAnalysisManager &AM,
   std::optional<MemorySSAUpdater> MSSAU;
   if (AR.MSSA)
     MSSAU = MemorySSAUpdater(AR.MSSA);
-  bool Changed = LoopRotation(&L, &AR.LI, &AR.TTI, &AR.AC, &AR.DT, &AR.SE,
-                              MSSAU ? &*MSSAU : nullptr, SQ, false, Threshold,
-                              false, PrepareForLTO || PrepareForLTOOption,
-                              RotateComputable);
+  bool Changed =
+      LoopRotation(&L, &AR.LI, &AR.TTI, &AR.AC, &AR.DT, &AR.SE,
+                   MSSAU ? &*MSSAU : nullptr, SQ, false, Threshold, false,
+                   PrepareForLTO || PrepareForLTOOption, RotateComputable);
 
   if (!Changed)
     return PreservedAnalyses::all();
diff --git a/llvm/lib/Transforms/Utils/LoopRotationUtils.cpp b/llvm/lib/Transforms/Utils/LoopRotationUtils.cpp
index 30c9e1028..06aae0d4e 100644
--- a/llvm/lib/Transforms/Utils/LoopRotationUtils.cpp
+++ b/llvm/lib/Transforms/Utils/LoopRotationUtils.cpp
@@ -379,8 +379,8 @@ bool LoopRotate::rotateLoop(Loop *L, bool SimplifiedLatch) {
   // Rotate if the loop latch was just simplified. Or if it makes the loop exit
   // count computable. Or if we think it will be profitable.
   if (L->isLoopExiting(OrigLatch) && !SimplifiedLatch && IsUtilMode == false &&
-      !profitableToRotateLoopExitingLatch(L) && !(RotateComputable &&
-      rotationMakesLoopComputable(L, SE)))
+      !profitableToRotateLoopExitingLatch(L) &&
+      !(RotateComputable && rotationMakesLoopComputable(L, SE)))
     return Rotated;
 
   // Check size of original header and reject loop if it is very big or we can't
