[LLVM][CodeGen][SVE] Only use unpredicated bfloat instructions when all lanes are in use. #168387
Conversation
While SVE support for exception-safe floating-point code generation is bare bones, we try to ensure inactive lanes remain inert. I mistakenly broke this rule when adding support for SVE-B16B16 by lowering some bfloat operations on unpacked vectors to unpredicated instructions.
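To illustrate the affected case (this mirrors the tests in the diff below, not new behaviour): an unpacked <vscale x 2 x bfloat> keeps one defined 16-bit element per 64-bit container, so an unpredicated bfadd also operates on the undefined lanes.

; Unpacked bfloat vector: each element occupies a 64-bit container, so only
; one in four .h lanes of the Z register holds defined data.
define <vscale x 2 x bfloat> @fadd_unpacked(<vscale x 2 x bfloat> %a,
                                            <vscale x 2 x bfloat> %b) {
  ; Before this patch: selected the unpredicated "bfadd z0.h, z0.h, z1.h",
  ; which also adds the undefined lanes and can raise spurious fp exceptions.
  ; After this patch: "ptrue p0.d" + "bfadd z0.h, p0/m, z0.h, z1.h", keeping
  ; the inactive lanes inert.
  %res = fadd <vscale x 2 x bfloat> %a, %b
  ret <vscale x 2 x bfloat> %res
}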
@llvm/pr-subscribers-backend-aarch64
Author: Paul Walker (paulwalker-arm)
Changes: While SVE support for exception-safe floating-point code generation is bare bones, we try to ensure inactive lanes remain inert. I mistakenly broke this rule when adding support for SVE-B16B16 by lowering some bfloat operations on unpacked vectors to unpredicated instructions.
Full diff: https://github.com/llvm/llvm-project/pull/168387.diff (4 files affected)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 35836af3c874b..c6fed015f08fa 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -1774,14 +1774,14 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
if (Subtarget->hasSVEB16B16() &&
Subtarget->isNonStreamingSVEorSME2Available()) {
- setOperationAction(ISD::FADD, VT, Legal);
+ setOperationAction(ISD::FADD, VT, Custom);
setOperationAction(ISD::FMA, VT, Custom);
setOperationAction(ISD::FMAXIMUM, VT, Custom);
setOperationAction(ISD::FMAXNUM, VT, Custom);
setOperationAction(ISD::FMINIMUM, VT, Custom);
setOperationAction(ISD::FMINNUM, VT, Custom);
- setOperationAction(ISD::FMUL, VT, Legal);
- setOperationAction(ISD::FSUB, VT, Legal);
+ setOperationAction(ISD::FMUL, VT, Custom);
+ setOperationAction(ISD::FSUB, VT, Custom);
}
}
diff --git a/llvm/lib/Target/AArch64/SVEInstrFormats.td b/llvm/lib/Target/AArch64/SVEInstrFormats.td
index 1664f4ad0c8fa..ebaea477d698f 100644
--- a/llvm/lib/Target/AArch64/SVEInstrFormats.td
+++ b/llvm/lib/Target/AArch64/SVEInstrFormats.td
@@ -2439,8 +2439,6 @@ multiclass sve_fp_3op_u_zd_bfloat<bits<3> opc, string asm, SDPatternOperator op>
def NAME : sve_fp_3op_u_zd<0b00, opc, asm, ZPR16>;
def : SVE_2_Op_Pat<nxv8bf16, op, nxv8bf16, nxv8bf16, !cast<Instruction>(NAME)>;
- def : SVE_2_Op_Pat<nxv4bf16, op, nxv4bf16, nxv4bf16, !cast<Instruction>(NAME)>;
- def : SVE_2_Op_Pat<nxv2bf16, op, nxv2bf16, nxv2bf16, !cast<Instruction>(NAME)>;
}
multiclass sve_fp_3op_u_zd_ftsmul<bits<3> opc, string asm, SDPatternOperator op> {
diff --git a/llvm/test/CodeGen/AArch64/sve-bf16-arith.ll b/llvm/test/CodeGen/AArch64/sve-bf16-arith.ll
index 582e8456c05b3..2103bc30b8381 100644
--- a/llvm/test/CodeGen/AArch64/sve-bf16-arith.ll
+++ b/llvm/test/CodeGen/AArch64/sve-bf16-arith.ll
@@ -56,7 +56,8 @@ define <vscale x 2 x bfloat> @fadd_nxv2bf16(<vscale x 2 x bfloat> %a, <vscale x
;
; B16B16-LABEL: fadd_nxv2bf16:
; B16B16: // %bb.0:
-; B16B16-NEXT: bfadd z0.h, z0.h, z1.h
+; B16B16-NEXT: ptrue p0.d
+; B16B16-NEXT: bfadd z0.h, p0/m, z0.h, z1.h
; B16B16-NEXT: ret
%res = fadd <vscale x 2 x bfloat> %a, %b
ret <vscale x 2 x bfloat> %res
@@ -74,7 +75,8 @@ define <vscale x 4 x bfloat> @fadd_nxv4bf16(<vscale x 4 x bfloat> %a, <vscale x
;
; B16B16-LABEL: fadd_nxv4bf16:
; B16B16: // %bb.0:
-; B16B16-NEXT: bfadd z0.h, z0.h, z1.h
+; B16B16-NEXT: ptrue p0.s
+; B16B16-NEXT: bfadd z0.h, p0/m, z0.h, z1.h
; B16B16-NEXT: ret
%res = fadd <vscale x 4 x bfloat> %a, %b
ret <vscale x 4 x bfloat> %res
@@ -525,7 +527,8 @@ define <vscale x 2 x bfloat> @fmul_nxv2bf16(<vscale x 2 x bfloat> %a, <vscale x
;
; B16B16-LABEL: fmul_nxv2bf16:
; B16B16: // %bb.0:
-; B16B16-NEXT: bfmul z0.h, z0.h, z1.h
+; B16B16-NEXT: ptrue p0.d
+; B16B16-NEXT: bfmul z0.h, p0/m, z0.h, z1.h
; B16B16-NEXT: ret
%res = fmul <vscale x 2 x bfloat> %a, %b
ret <vscale x 2 x bfloat> %res
@@ -543,7 +546,8 @@ define <vscale x 4 x bfloat> @fmul_nxv4bf16(<vscale x 4 x bfloat> %a, <vscale x
;
; B16B16-LABEL: fmul_nxv4bf16:
; B16B16: // %bb.0:
-; B16B16-NEXT: bfmul z0.h, z0.h, z1.h
+; B16B16-NEXT: ptrue p0.s
+; B16B16-NEXT: bfmul z0.h, p0/m, z0.h, z1.h
; B16B16-NEXT: ret
%res = fmul <vscale x 4 x bfloat> %a, %b
ret <vscale x 4 x bfloat> %res
@@ -672,7 +676,8 @@ define <vscale x 2 x bfloat> @fsub_nxv2bf16(<vscale x 2 x bfloat> %a, <vscale x
;
; B16B16-LABEL: fsub_nxv2bf16:
; B16B16: // %bb.0:
-; B16B16-NEXT: bfsub z0.h, z0.h, z1.h
+; B16B16-NEXT: ptrue p0.d
+; B16B16-NEXT: bfsub z0.h, p0/m, z0.h, z1.h
; B16B16-NEXT: ret
%res = fsub <vscale x 2 x bfloat> %a, %b
ret <vscale x 2 x bfloat> %res
@@ -690,7 +695,8 @@ define <vscale x 4 x bfloat> @fsub_nxv4bf16(<vscale x 4 x bfloat> %a, <vscale x
;
; B16B16-LABEL: fsub_nxv4bf16:
; B16B16: // %bb.0:
-; B16B16-NEXT: bfsub z0.h, z0.h, z1.h
+; B16B16-NEXT: ptrue p0.s
+; B16B16-NEXT: bfsub z0.h, p0/m, z0.h, z1.h
; B16B16-NEXT: ret
%res = fsub <vscale x 4 x bfloat> %a, %b
ret <vscale x 4 x bfloat> %res
diff --git a/llvm/test/CodeGen/AArch64/sve-bf16-combines.ll b/llvm/test/CodeGen/AArch64/sve-bf16-combines.ll
index 16e8feb0dc5bb..86d4f61316446 100644
--- a/llvm/test/CodeGen/AArch64/sve-bf16-combines.ll
+++ b/llvm/test/CodeGen/AArch64/sve-bf16-combines.ll
@@ -311,8 +311,7 @@ define <vscale x 8 x bfloat> @fadd_sel_nxv8bf16(<vscale x 8 x bfloat> %a, <vscal
;
; SVE-B16B16-LABEL: fadd_sel_nxv8bf16:
; SVE-B16B16: // %bb.0:
-; SVE-B16B16-NEXT: bfadd z1.h, z0.h, z1.h
-; SVE-B16B16-NEXT: mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT: bfadd z0.h, p0/m, z0.h, z1.h
; SVE-B16B16-NEXT: ret
%sel = select <vscale x 8 x i1> %mask, <vscale x 8 x bfloat> %b, <vscale x 8 x bfloat> zeroinitializer
%fadd = fadd nsz <vscale x 8 x bfloat> %a, %sel
@@ -341,8 +340,7 @@ define <vscale x 8 x bfloat> @fsub_sel_nxv8bf16(<vscale x 8 x bfloat> %a, <vscal
;
; SVE-B16B16-LABEL: fsub_sel_nxv8bf16:
; SVE-B16B16: // %bb.0:
-; SVE-B16B16-NEXT: bfsub z1.h, z0.h, z1.h
-; SVE-B16B16-NEXT: mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT: bfsub z0.h, p0/m, z0.h, z1.h
; SVE-B16B16-NEXT: ret
%sel = select <vscale x 8 x i1> %mask, <vscale x 8 x bfloat> %b, <vscale x 8 x bfloat> zeroinitializer
%fsub = fsub <vscale x 8 x bfloat> %a, %sel
@@ -371,8 +369,7 @@ define <vscale x 8 x bfloat> @fadd_sel_negzero_nxv8bf16(<vscale x 8 x bfloat> %a
;
; SVE-B16B16-LABEL: fadd_sel_negzero_nxv8bf16:
; SVE-B16B16: // %bb.0:
-; SVE-B16B16-NEXT: bfadd z1.h, z0.h, z1.h
-; SVE-B16B16-NEXT: mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT: bfadd z0.h, p0/m, z0.h, z1.h
; SVE-B16B16-NEXT: ret
%nz = fneg <vscale x 8 x bfloat> zeroinitializer
%sel = select <vscale x 8 x i1> %mask, <vscale x 8 x bfloat> %b, <vscale x 8 x bfloat> %nz
@@ -402,8 +399,7 @@ define <vscale x 8 x bfloat> @fsub_sel_negzero_nxv8bf16(<vscale x 8 x bfloat> %a
;
; SVE-B16B16-LABEL: fsub_sel_negzero_nxv8bf16:
; SVE-B16B16: // %bb.0:
-; SVE-B16B16-NEXT: bfsub z1.h, z0.h, z1.h
-; SVE-B16B16-NEXT: mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT: bfsub z0.h, p0/m, z0.h, z1.h
; SVE-B16B16-NEXT: ret
%nz = fneg <vscale x 8 x bfloat> zeroinitializer
%sel = select <vscale x 8 x i1> %mask, <vscale x 8 x bfloat> %b, <vscale x 8 x bfloat> %nz
@@ -490,9 +486,7 @@ define <vscale x 8 x bfloat> @fsub_sel_fmul_nxv8bf16(<vscale x 8 x bfloat> %a, <
;
; SVE-B16B16-LABEL: fsub_sel_fmul_nxv8bf16:
; SVE-B16B16: // %bb.0:
-; SVE-B16B16-NEXT: bfmul z1.h, z1.h, z2.h
-; SVE-B16B16-NEXT: bfsub z1.h, z0.h, z1.h
-; SVE-B16B16-NEXT: mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT: bfmls z0.h, p0/m, z1.h, z2.h
; SVE-B16B16-NEXT: ret
%fmul = fmul <vscale x 8 x bfloat> %b, %c
%sel = select <vscale x 8 x i1> %mask, <vscale x 8 x bfloat> %fmul, <vscale x 8 x bfloat> zeroinitializer
@@ -532,9 +526,7 @@ define <vscale x 8 x bfloat> @fadd_sel_fmul_nsz_nxv8bf16(<vscale x 8 x bfloat> %
;
; SVE-B16B16-LABEL: fadd_sel_fmul_nsz_nxv8bf16:
; SVE-B16B16: // %bb.0:
-; SVE-B16B16-NEXT: bfmul z1.h, z1.h, z2.h
-; SVE-B16B16-NEXT: bfadd z1.h, z0.h, z1.h
-; SVE-B16B16-NEXT: mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT: bfmla z0.h, p0/m, z1.h, z2.h
; SVE-B16B16-NEXT: ret
%fmul = fmul <vscale x 8 x bfloat> %b, %c
%sel = select <vscale x 8 x i1> %mask, <vscale x 8 x bfloat> %fmul, <vscale x 8 x bfloat> zeroinitializer
@@ -574,9 +566,7 @@ define <vscale x 8 x bfloat> @fsub_sel_fmul_nsz_nxv8bf16(<vscale x 8 x bfloat> %
;
; SVE-B16B16-LABEL: fsub_sel_fmul_nsz_nxv8bf16:
; SVE-B16B16: // %bb.0:
-; SVE-B16B16-NEXT: bfmul z1.h, z1.h, z2.h
-; SVE-B16B16-NEXT: bfsub z1.h, z0.h, z1.h
-; SVE-B16B16-NEXT: mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT: bfmls z0.h, p0/m, z1.h, z2.h
; SVE-B16B16-NEXT: ret
%fmul = fmul <vscale x 8 x bfloat> %b, %c
%sel = select <vscale x 8 x i1> %mask, <vscale x 8 x bfloat> %fmul, <vscale x 8 x bfloat> zeroinitializer
@@ -616,9 +606,7 @@ define <vscale x 8 x bfloat> @fadd_sel_fmul_negzero_nxv8bf16(<vscale x 8 x bfloa
;
; SVE-B16B16-LABEL: fadd_sel_fmul_negzero_nxv8bf16:
; SVE-B16B16: // %bb.0:
-; SVE-B16B16-NEXT: bfmul z1.h, z1.h, z2.h
-; SVE-B16B16-NEXT: bfadd z1.h, z0.h, z1.h
-; SVE-B16B16-NEXT: mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT: bfmla z0.h, p0/m, z1.h, z2.h
; SVE-B16B16-NEXT: ret
%fmul = fmul <vscale x 8 x bfloat> %b, %c
%nz = fneg <vscale x 8 x bfloat> zeroinitializer
@@ -711,9 +699,7 @@ define <vscale x 8 x bfloat> @fadd_sel_fmul_negzero_nsz_nxv8bf16(<vscale x 8 x b
;
; SVE-B16B16-LABEL: fadd_sel_fmul_negzero_nsz_nxv8bf16:
; SVE-B16B16: // %bb.0:
-; SVE-B16B16-NEXT: bfmul z1.h, z1.h, z2.h
-; SVE-B16B16-NEXT: bfadd z1.h, z0.h, z1.h
-; SVE-B16B16-NEXT: mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT: bfmla z0.h, p0/m, z1.h, z2.h
; SVE-B16B16-NEXT: ret
%fmul = fmul <vscale x 8 x bfloat> %b, %c
%nz = fneg <vscale x 8 x bfloat> zeroinitializer
@@ -754,9 +740,7 @@ define <vscale x 8 x bfloat> @fsub_sel_fmul_negzero_nsz_nxv8bf16(<vscale x 8 x b
;
; SVE-B16B16-LABEL: fsub_sel_fmul_negzero_nsz_nxv8bf16:
; SVE-B16B16: // %bb.0:
-; SVE-B16B16-NEXT: bfmul z1.h, z1.h, z2.h
-; SVE-B16B16-NEXT: bfsub z1.h, z0.h, z1.h
-; SVE-B16B16-NEXT: mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT: bfmls z0.h, p0/m, z1.h, z2.h
; SVE-B16B16-NEXT: ret
%fmul = fmul <vscale x 8 x bfloat> %b, %c
%nz = fneg <vscale x 8 x bfloat> zeroinitializer
MacDue
left a comment
LGTM
Hi - If this is specifically for a fadd node, then the fp-exception flags are not observable outside of strict-fp and we assume the operation cannot trap; otherwise fp operations would have side effects and not be speculatable. For fp exceptions to be observable, you are required to use the strict-fp llvm.constrained.fadd intrinsics, which should turn into a STRICT_FADD ISD node. There are details in https://llvm.org/docs/LangRef.html#constrained-floating-point-intrinsics. Strict-fp functions require a function attribute. For the target-predicated llvm.aarch64.sve.fadd intrinsics, my understanding is that the behaviour depends on the function: they should correctly preserve/honour fp exceptions and rounding modes in strict-fp functions, but need not do so outside of them. So if this is just for fadd, I think the code was fine and it is OK to do whatever in the off lanes, including setting fp exception bits. Some of this might change - there are discussions on Discourse - but my understanding is the basics wouldn't change: there is still a difference between ISD::FADD and ISD::STRICT_FADD, and fp exceptions by default are not observable.
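For reference, a minimal sketch of the constrained form described above (per the LangRef link; the scalar f32 type is chosen purely for illustration): the call uses the llvm.experimental.constrained.* intrinsic and the enclosing function carries the strictfp attribute.

; Strict-fp form: fp exceptions are observable, so this should select a
; STRICT_FADD ISD node rather than a plain FADD.
define float @strict_fadd(float %a, float %b) strictfp {
  %r = call float @llvm.experimental.constrained.fadd.f32(
           float %a, float %b,
           metadata !"round.dynamic", metadata !"fpexcept.strict")
  ret float %r
}

declare float @llvm.experimental.constrained.fadd.f32(float, float, metadata, metadata)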
I have been following the floating-point environment support for a while, although I kind of lost track after #109798 stalled. The constrained-intrinsic approach is not viable for target intrinsics, and there is ongoing work to redesign/replace/reimplement constrained intrinsics using operand bundles. A previous attempt of mine to guide selection using a function's strict-fp attribute was rejected. I do need to extend the SVE support for the STRICT_* ISD nodes, but today they simply lower to their non-strict counterparts. Given all of the above, we chose at least not to aggravate the issue; the effect on code generation is minimal because the required predicate is likely always available.
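For contrast with the constrained form above, a sketch of the target-predicated intrinsic form mentioned in this thread: the governing predicate is an explicit IR operand, so the behaviour of the inactive lanes is visible to and controlled by the caller (nxv8f16 is used for illustration; the bfloat variants follow the same shape).

; The governing predicate %pg is an explicit operand, independent of
; strict-fp mode.
define <vscale x 8 x half> @predicated_fadd(<vscale x 8 x i1> %pg,
                                            <vscale x 8 x half> %a,
                                            <vscale x 8 x half> %b) {
  %r = call <vscale x 8 x half> @llvm.aarch64.sve.fadd.nxv8f16(
           <vscale x 8 x i1> %pg, <vscale x 8 x half> %a, <vscale x 8 x half> %b)
  ret <vscale x 8 x half> %r
}

declare <vscale x 8 x half> @llvm.aarch64.sve.fadd.nxv8f16(<vscale x 8 x i1>, <vscale x 8 x half>, <vscale x 8 x half>)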
Yeah - everything I talked about is the way things did/currently work, not how they might work in the future. Adding support for strict-fp certainly sounds good. I had seen the strict-fp intrinsics being converted to standard nodes for unsupported operations in the past and was surprised by it; it did not seem sound. It looks like most vector operations will currently try to unroll: https://godbolt.org/z/Y7vbq7Tj3
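A fixed-length example one can feed to llc to reproduce the unrolling behaviour linked above (a sketch; the exact output depends on the target and LLVM version):

; Expected to scalarize (unroll) on most targets today rather than select a
; single vector strict operation.
define <4 x float> @strict_fadd_v4f32(<4 x float> %a, <4 x float> %b) strictfp {
  %r = call <4 x float> @llvm.experimental.constrained.fadd.v4f32(
           <4 x float> %a, <4 x float> %b,
           metadata !"round.dynamic", metadata !"fpexcept.strict")
  ret <4 x float> %r
}

declare <4 x float> @llvm.experimental.constrained.fadd.v4f32(<4 x float>, <4 x float>, metadata, metadata)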
Thanks @davemgreen. It seems the situation is worse than I thought, so I'll bump SVE support for the STRICT_* ISD nodes higher up my list.