
Conversation

charithaintc
Contributor

In some cases, the loop bounds (lower bound, upper bound, and step) of an scf.for can come locally from the parent warp op of the scf.for. The current logic does not yield those loop bounds in the new warp op generated during lowering, which causes the sunk scf.for to have non-dominating uses.

This PR adds logic to yield the loop bounds by default (treating them like the other operands of the scf.for), which fixes this bug.
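
For illustration, here is a condensed sketch of the distributed IR that the new test below checks for (SSA names shortened, the inner per-iteration warp op elided). The bound remains visible outside the warp op, so the sunk scf.for can use it and all of its operands dominate the loop:

%w = gpu.warp_execute_on_lane_0(%laneid)[32] args(%bound : index) -> (vector<4xf32>) {
^bb0(%arg1: index):
  %init = "some_def"() : () -> vector<128xf32>
  gpu.yield %init : vector<128xf32>
}
// During the rewrite the bounds are yielded by the new warp op; later cleanup
// (presumably WarpOpForwardOperand, see the review discussion) folds the
// yielded bound back to the original %bound.
%r = scf.for %iv = %c0 to %bound step %c1 iter_args(%acc = %w) -> (vector<4xf32>) {
  // inner warp op wrapping the original loop body goes here
}
"some_use"(%r) : (vector<4xf32>) -> ()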

@llvmbot
Member

llvmbot commented Oct 14, 2025

@llvm/pr-subscribers-mlir

Author: Charitha Saumya (charithaintc)

Changes

In some cases, the loop bounds (lower bound, upper bound, and step) of an scf.for can come locally from the parent warp op of the scf.for. The current logic does not yield those loop bounds in the new warp op generated during lowering, which causes the sunk scf.for to have non-dominating uses.

This PR adds logic to yield the loop bounds by default (treating them like the other operands of the scf.for), which fixes this bug.


Full diff: https://github.com/llvm/llvm-project/pull/163443.diff

2 Files Affected:

  • (modified) mlir/lib/Dialect/Vector/Transforms/VectorDistribute.cpp (+18-7)
  • (modified) mlir/test/Dialect/Vector/vector-warp-distribute.mlir (+35)
diff --git a/mlir/lib/Dialect/Vector/Transforms/VectorDistribute.cpp b/mlir/lib/Dialect/Vector/Transforms/VectorDistribute.cpp
index e95338f7d18be..2ee65dc0f902a 100644
--- a/mlir/lib/Dialect/Vector/Transforms/VectorDistribute.cpp
+++ b/mlir/lib/Dialect/Vector/Transforms/VectorDistribute.cpp
@@ -2038,11 +2038,19 @@ struct WarpOpScfForOp : public WarpDistributionPattern {
     }
 
     // Newly created `WarpOp` will yield values in following order:
-    // 1. All init args of the `ForOp`.
-    // 2. All escaping values.
-    // 3. All non-`ForOp` yielded values.
+    // 1. Loop bounds.
+    // 2. All init args of the `ForOp`.
+    // 3. All escaping values.
+    // 4. All non-`ForOp` yielded values.
     SmallVector<Value> newWarpOpYieldValues;
     SmallVector<Type> newWarpOpDistTypes;
+    newWarpOpYieldValues.insert(
+        newWarpOpYieldValues.end(),
+        {forOp.getLowerBound(), forOp.getUpperBound(), forOp.getStep()});
+    newWarpOpDistTypes.insert(newWarpOpDistTypes.end(),
+                              {forOp.getLowerBound().getType(),
+                               forOp.getUpperBound().getType(),
+                               forOp.getStep().getType()});
     for (auto [i, initArg] : llvm::enumerate(forOp.getInitArgs())) {
       newWarpOpYieldValues.push_back(initArg);
       // Compute the distributed type for this init arg.
@@ -2081,20 +2089,23 @@ struct WarpOpScfForOp : public WarpDistributionPattern {
 
     // Next, we create a new `ForOp` with the init args yielded by the new
     // `WarpOp`.
+    const unsigned initArgsStartIdx = 3; // After loop bounds.
     const unsigned escapingValuesStartIdx =
+        initArgsStartIdx +
         forOp.getInitArgs().size(); // `ForOp` init args are positioned before
                                     // escaping values in the new `WarpOp`.
     SmallVector<Value> newForOpOperands;
-    for (size_t i = 0; i < escapingValuesStartIdx; ++i)
+    for (size_t i = initArgsStartIdx; i < escapingValuesStartIdx; ++i)
       newForOpOperands.push_back(newWarpOp.getResult(i));
 
     // Create a new `ForOp` outside the new `WarpOp` region.
     OpBuilder::InsertionGuard g(rewriter);
     rewriter.setInsertionPointAfter(newWarpOp);
     auto newForOp = scf::ForOp::create(
-        rewriter, forOp.getLoc(), forOp.getLowerBound(), forOp.getUpperBound(),
-        forOp.getStep(), newForOpOperands, /*bodyBuilder=*/nullptr,
-        forOp.getUnsignedCmp());
+        rewriter, forOp.getLoc(), /*lowerBound=*/newWarpOp.getResult(0),
+        /*upperBound=*/newWarpOp.getResult(1),
+        /*step=*/newWarpOp.getResult(2), newForOpOperands,
+        /*bodyBuilder=*/nullptr, forOp.getUnsignedCmp());
     // Next, we insert a new `WarpOp` (called inner `WarpOp`) inside the
     // newly created `ForOp`. This `WarpOp` will contain all ops that were
     // contained within the original `ForOp` body.
diff --git a/mlir/test/Dialect/Vector/vector-warp-distribute.mlir b/mlir/test/Dialect/Vector/vector-warp-distribute.mlir
index bb7639204022f..ab87684dbb01a 100644
--- a/mlir/test/Dialect/Vector/vector-warp-distribute.mlir
+++ b/mlir/test/Dialect/Vector/vector-warp-distribute.mlir
@@ -473,6 +473,41 @@ func.func @warp_scf_for_use_from_above(%arg0: index) {
   return
 }
 
+// -----
+// CHECK-PROP-LABEL:  func.func @warp_scf_for_local_loop_bounds
+// CHECK-PROP:          (%{{.*}}: index, %[[ARG1:[a-zA-Z0-9]+]]: index) {
+// CHECK-PROP:          %[[W:.*]] = gpu.warp_execute_on_lane_0(%{{.*}})[32] args(%[[ARG1]] : index) -> (vector<4xf32>) {
+// CHECK-PROP:          ^bb0(%{{.*}}: index):
+// CHECK-PROP:            %[[T2:.*]] = "some_def"() : () -> vector<128xf32>
+// CHECK-PROP:            gpu.yield %[[T2]] : vector<128xf32>
+// CHECK-PROP:          }
+// CHECK-PROP:          %[[FOR:.*]] = scf.for %{{.*}} to %[[ARG1]] step %{{.*}} iter_args(%{{.*}}) -> (vector<4xf32>) {
+// CHECK-PROP:            %[[W2:.*]] = gpu.warp_execute_on_lane_0(%{{.*}})[32]
+// CHECK-PROP-SAME:          args(%{{.*}} : vector<4xf32>) -> (vector<4xf32>) {
+// CHECK-PROP:            ^bb0(%{{.*}}: vector<128xf32>):
+// CHECK-PROP:              gpu.yield %{{.*}} : vector<128xf32>
+// CHECK-PROP:            }
+// CHECK-PROP:            scf.yield %[[W2]] : vector<4xf32>
+// CHECK-PROP:          }
+// CHECK-PROP:          "some_use"(%[[FOR]]) : (vector<4xf32>) -> ()
+// CHECK-PROP:          return
+func.func @warp_scf_for_local_loop_bounds(%arg0: index, %bound: index) {
+  %c1 = arith.constant 1 : index
+  %c0 = arith.constant 0 : index
+  %0 = gpu.warp_execute_on_lane_0(%arg0)[32]
+    args(%bound : index) -> (vector<4xf32>) {
+    ^bb0(%arg1: index):
+    %ini = "some_def"() : () -> (vector<128xf32>)
+    %3 = scf.for %arg3 = %c0 to %arg1 step %c1 iter_args(%arg4 = %ini) -> (vector<128xf32>) {
+      %acc = "some_def"(%arg4) : (vector<128xf32>) -> (vector<128xf32>)
+      scf.yield %acc : vector<128xf32>
+    }
+    gpu.yield %3 : vector<128xf32>
+  }
+  "some_use"(%0) : (vector<4xf32>) -> ()
+  return
+}
+
 // -----
 
 // CHECK-PROP-LABEL:   func @warp_scf_for_swap(

@llvmbot
Member

llvmbot commented Oct 14, 2025

@llvm/pr-subscribers-mlir-vector


Contributor

@akroviakov akroviakov left a comment

I suppose it uses WarpOpForwardOperand to clean up the warp result usage. The case is valid, although generally I'd expect scalar usages inside warpOp to be "forwarded" to outer definitions beforehand, similar to how scalar definitions are hoisted.

@charithaintc
Contributor Author

The case is valid, although generally I'd expect scalar usages inside warpOp to be "forwarded" to outer definitions beforehand, similar to how scalar definitions are hoisted.

I get your point, but the above is not always true. In flash attention, the loop bounds come from kernel arguments, so they cannot be hoisted like scalar constants; hence compilation fails for flash attention. Even the (much simpler) test case I have in this PR fails.
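
To make that concrete, here is a compressed version of the failing shape (essentially the test added in this PR, with the body elided). The bound reaches the loop only as a block argument of the warp region, so constant/uniform hoisting has nothing to fold:

%0 = gpu.warp_execute_on_lane_0(%laneid)[32] args(%bound : index) -> (vector<4xf32>) {
^bb0(%arg1: index):  // %bound is visible inside only as this block argument
  %ini = "some_def"() : () -> vector<128xf32>
  %3 = scf.for %iv = %c0 to %arg1 step %c1 iter_args(%acc = %ini) -> (vector<128xf32>) {
    // loop body elided
    scf.yield %acc : vector<128xf32>
  }
  gpu.yield %3 : vector<128xf32>
}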

@akroviakov
Contributor

Yeah, sure, I did not mean to hoist anything. I meant that scalar values coming from the outside may be referenced directly, not necessarily via the warpOp arguments. Meaning that some pass, similar to hoisting in terms of when it should be called, would clean up things beforehand, simplifying the work for the actual distribution patterns.

@charithaintc
Contributor Author

Yeah, sure, I did not mean to hoist anything. I meant that scalar values coming from the outside may be referenced directly, not necessarily via the warpOp arguments. Meaning that some pass, similar to hoisting in terms of when it should be called, would clean up things beforehand, simplifying the work for the actual distribution patterns.

Agreed. Uniform scalars/vectors are hoisted before we even begin distribution. I guess this is to reduce compile time (no need for repeated application of WarpForward + WarpDead), but unfortunately that does not help us here.

@akroviakov
Contributor

but unfortunately that does not help us here.

So the test fails if the loop op references %bound directly:

func.func @warp_scf_for_local_loop_bounds(%arg0: index, %bound: index) {
...
  %0 = gpu.warp_execute_on_lane_0(%arg0)[32] -> (vector<4xf32>) {
    %ini = "some_def"() : () -> (vector<128xf32>)
    %3 = scf.for %arg3 = %c0 to %bound step %c1 iter_args(%arg4 = %ini) -> (vector<128xf32>) {

?

@charithaintc
Contributor Author

but unfortunately that does not help us here.

So the test fails if the loop op references %bound directly:

func.func @warp_scf_for_local_loop_bounds(%arg0: index, %bound: index) {
...
  %0 = gpu.warp_execute_on_lane_0(%arg0)[32] -> (vector<4xf32>) {
    %ini = "some_def"() : () -> (vector<128xf32>)
    %3 = scf.for %arg3 = %c0 to %bound step %c1 iter_args(%arg4 = %ini) -> (vector<128xf32>) {

?

No, it fails only if %bound is a warp op argument. I guess it is sensitive to pattern application ordering, but with this fix all the loop bounds are yielded regardless of where they come from, so it should always work.
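
For context, a rough reconstruction (based on the PR description, not actual pass output) of what the old lowering produced in that case: the sunk scf.for kept the warp region's block argument as its upper bound, which does not dominate a use outside the region.

// Hypothetical sketch of the pre-fix result.
%w = gpu.warp_execute_on_lane_0(%laneid)[32] args(%bound : index) -> (vector<4xf32>) {
^bb0(%arg1: index):
  ...
}
// Invalid: %arg1 is defined inside the warp region, so this use does not dominate.
%r = scf.for %iv = %c0 to %arg1 step %c1 iter_args(%acc = %w) -> (vector<4xf32>) { ... }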

@akroviakov
Contributor

Yes, the fix is fine. I was just confused by your reply

but unfortunately that does not help us here.

to my comment

I'd expect scalar usages inside warpOp to be "forwarded" to outer definitions beforehand

I meant that scalar values coming from the outside may be referenced directly, not necessarily via the warpOp arguments. Meaning that some pass, similar to hoisting in terms of when it should be called, would clean up things beforehand

@charithaintc
Contributor Author

Yes, the fix is fine. I was just confused by your reply

but unfortunately that does not help us here.

to my comment

I'd expect scalar usages inside warpOp to be "forwarded" to outer definitions beforehand

I meant that scalar values coming from the outside may be referenced directly, not necessarily via the warpOp arguments. Meaning that some pass, similar to hoisting in terms of when it should be called, would clean up things beforehand

I see. Sorry about the confusion.

Contributor

@Jianhui-Li Jianhui-Li left a comment

LGTM

@dcaballe dcaballe requested a review from Groverkss October 15, 2025 21:33
@dcaballe
Contributor

Not sure why @Groverkss wasn't added to this PR automatically. We may have messed up the CODEOWNERS file again?

@charithaintc
Contributor Author

Not sure why @Groverkss wasn't added to this PR automatically. We may have messed up the CODEOWNERS file again?

Hi @dcaballe, we made several changes to this file over the last few months, and only these reviewers were added by default.

@hanhanW
Contributor

hanhanW commented Oct 16, 2025

Not sure why @Groverkss wasn't added to this PR automatically. We may have messed up the CODEOWNERS file again?

https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners#example-of-a-codeowners-file

# Order is important; the last matching pattern takes the most
# precedence. When someone opens a pull request that only
# modifies JS files, only @js-owner and not the global
# owner(s) will be requested for a review.
*.js    @js-owner #This is an inline comment.

I think it is because @Groverkss is not on the /mlir/lib/Dialect/Vector/Transforms/* list, so the "global" owners do not get the request. This is a known issue; the workaround is to add your handle to every path you care about.

# Vector Dialect in MLIR.
/mlir/**/*AMX* @aartbik @dcaballe
/mlir/**/*Neon* @banach-space @dcaballe @nicolasvasilache
/mlir/**/*SME* @banach-space @dcaballe @nicolasvasilache
/mlir/**/*SVE* @banach-space @dcaballe @nicolasvasilache
/mlir/**/*VectorInterfaces* @dcaballe @nicolasvasilache
/mlir/**/*VectorToSCF* @banach-space @dcaballe @matthias-springer @nicolasvasilache
/mlir/**/*VectorToLLVM* @banach-space @dcaballe @nicolasvasilache
/mlir/**/*X86Vector* @aartbik @dcaballe @nicolasvasilache
/mlir/include/mlir/Dialect/Vector @banach-space @dcaballe @nicolasvasilache @Groverkss
/mlir/include/mlir/Dialect/Vector/IR @kuhar
/mlir/lib/Dialect/Vector @banach-space @dcaballe @nicolasvasilache @Groverkss
/mlir/lib/Dialect/Vector/Transforms/* @banach-space @dcaballe @hanhanW @nicolasvasilache
/mlir/lib/Dialect/Vector/Transforms/VectorEmulateNarrowType.cpp @banach-space @dcaballe @MaheshRavishankar @nicolasvasilache
/mlir/**/*EmulateNarrowType* @dcaballe @hanhanW

@charithaintc
Contributor Author

Not sure why @Groverkss wasn't added to this PR automatically. We may have messed up the CODEOWNERS file again?

Hi @Groverkss, do you have any comments/concerns regarding this change? Otherwise I would like to merge this PR. Thanks!
