[LLVM][CodeGen][SVE] Only use unpredicated bfloat instructions when all lanes are in use. #168387
Conversation
While SVE support for exception-safe floating-point code generation is bare bones, we try to ensure inactive lanes remain inert. I mistakenly broke this rule when adding support for SVE-B16B16 by lowering some bfloat operations on unpacked vectors to unpredicated instructions.
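To illustrate the affected case (this mirrors the tests in the diff below, not new behaviour): an unpacked <vscale x 2 x bfloat> keeps one defined 16-bit element per 64-bit container, so an unpredicated bfadd also operates on the undefined lanes.

; Unpacked bfloat vector: each element occupies a 64-bit container, so only
; one in four .h lanes of the Z register holds defined data.
define <vscale x 2 x bfloat> @fadd_unpacked(<vscale x 2 x bfloat> %a,
                                            <vscale x 2 x bfloat> %b) {
  ; Before this patch: selected the unpredicated "bfadd z0.h, z0.h, z1.h",
  ; which also adds the undefined lanes and can raise spurious fp exceptions.
  ; After this patch: "ptrue p0.d" + "bfadd z0.h, p0/m, z0.h, z1.h", keeping
  ; the inactive lanes inert.
  %res = fadd <vscale x 2 x bfloat> %a, %b
  ret <vscale x 2 x bfloat> %res
}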
@llvm/pr-subscribers-backend-aarch64
Author: Paul Walker (paulwalker-arm)
Changes: While SVE support for exception-safe floating-point code generation is bare bones, we try to ensure inactive lanes remain inert. I mistakenly broke this rule when adding support for SVE-B16B16 by lowering some bfloat operations on unpacked vectors to unpredicated instructions.
Full diff: https://github.com/llvm/llvm-project/pull/168387.diff (4 files affected)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 35836af3c874b..c6fed015f08fa 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -1774,14 +1774,14 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
if (Subtarget->hasSVEB16B16() &&
Subtarget->isNonStreamingSVEorSME2Available()) {
- setOperationAction(ISD::FADD, VT, Legal);
+ setOperationAction(ISD::FADD, VT, Custom);
setOperationAction(ISD::FMA, VT, Custom);
setOperationAction(ISD::FMAXIMUM, VT, Custom);
setOperationAction(ISD::FMAXNUM, VT, Custom);
setOperationAction(ISD::FMINIMUM, VT, Custom);
setOperationAction(ISD::FMINNUM, VT, Custom);
- setOperationAction(ISD::FMUL, VT, Legal);
- setOperationAction(ISD::FSUB, VT, Legal);
+ setOperationAction(ISD::FMUL, VT, Custom);
+ setOperationAction(ISD::FSUB, VT, Custom);
}
}
diff --git a/llvm/lib/Target/AArch64/SVEInstrFormats.td b/llvm/lib/Target/AArch64/SVEInstrFormats.td
index 1664f4ad0c8fa..ebaea477d698f 100644
--- a/llvm/lib/Target/AArch64/SVEInstrFormats.td
+++ b/llvm/lib/Target/AArch64/SVEInstrFormats.td
@@ -2439,8 +2439,6 @@ multiclass sve_fp_3op_u_zd_bfloat<bits<3> opc, string asm, SDPatternOperator op>
def NAME : sve_fp_3op_u_zd<0b00, opc, asm, ZPR16>;
def : SVE_2_Op_Pat<nxv8bf16, op, nxv8bf16, nxv8bf16, !cast<Instruction>(NAME)>;
- def : SVE_2_Op_Pat<nxv4bf16, op, nxv4bf16, nxv4bf16, !cast<Instruction>(NAME)>;
- def : SVE_2_Op_Pat<nxv2bf16, op, nxv2bf16, nxv2bf16, !cast<Instruction>(NAME)>;
}
multiclass sve_fp_3op_u_zd_ftsmul<bits<3> opc, string asm, SDPatternOperator op> {
diff --git a/llvm/test/CodeGen/AArch64/sve-bf16-arith.ll b/llvm/test/CodeGen/AArch64/sve-bf16-arith.ll
index 582e8456c05b3..2103bc30b8381 100644
--- a/llvm/test/CodeGen/AArch64/sve-bf16-arith.ll
+++ b/llvm/test/CodeGen/AArch64/sve-bf16-arith.ll
@@ -56,7 +56,8 @@ define <vscale x 2 x bfloat> @fadd_nxv2bf16(<vscale x 2 x bfloat> %a, <vscale x
;
; B16B16-LABEL: fadd_nxv2bf16:
; B16B16: // %bb.0:
-; B16B16-NEXT: bfadd z0.h, z0.h, z1.h
+; B16B16-NEXT: ptrue p0.d
+; B16B16-NEXT: bfadd z0.h, p0/m, z0.h, z1.h
; B16B16-NEXT: ret
%res = fadd <vscale x 2 x bfloat> %a, %b
ret <vscale x 2 x bfloat> %res
@@ -74,7 +75,8 @@ define <vscale x 4 x bfloat> @fadd_nxv4bf16(<vscale x 4 x bfloat> %a, <vscale x
;
; B16B16-LABEL: fadd_nxv4bf16:
; B16B16: // %bb.0:
-; B16B16-NEXT: bfadd z0.h, z0.h, z1.h
+; B16B16-NEXT: ptrue p0.s
+; B16B16-NEXT: bfadd z0.h, p0/m, z0.h, z1.h
; B16B16-NEXT: ret
%res = fadd <vscale x 4 x bfloat> %a, %b
ret <vscale x 4 x bfloat> %res
@@ -525,7 +527,8 @@ define <vscale x 2 x bfloat> @fmul_nxv2bf16(<vscale x 2 x bfloat> %a, <vscale x
;
; B16B16-LABEL: fmul_nxv2bf16:
; B16B16: // %bb.0:
-; B16B16-NEXT: bfmul z0.h, z0.h, z1.h
+; B16B16-NEXT: ptrue p0.d
+; B16B16-NEXT: bfmul z0.h, p0/m, z0.h, z1.h
; B16B16-NEXT: ret
%res = fmul <vscale x 2 x bfloat> %a, %b
ret <vscale x 2 x bfloat> %res
@@ -543,7 +546,8 @@ define <vscale x 4 x bfloat> @fmul_nxv4bf16(<vscale x 4 x bfloat> %a, <vscale x
;
; B16B16-LABEL: fmul_nxv4bf16:
; B16B16: // %bb.0:
-; B16B16-NEXT: bfmul z0.h, z0.h, z1.h
+; B16B16-NEXT: ptrue p0.s
+; B16B16-NEXT: bfmul z0.h, p0/m, z0.h, z1.h
; B16B16-NEXT: ret
%res = fmul <vscale x 4 x bfloat> %a, %b
ret <vscale x 4 x bfloat> %res
@@ -672,7 +676,8 @@ define <vscale x 2 x bfloat> @fsub_nxv2bf16(<vscale x 2 x bfloat> %a, <vscale x
;
; B16B16-LABEL: fsub_nxv2bf16:
; B16B16: // %bb.0:
-; B16B16-NEXT: bfsub z0.h, z0.h, z1.h
+; B16B16-NEXT: ptrue p0.d
+; B16B16-NEXT: bfsub z0.h, p0/m, z0.h, z1.h
; B16B16-NEXT: ret
%res = fsub <vscale x 2 x bfloat> %a, %b
ret <vscale x 2 x bfloat> %res
@@ -690,7 +695,8 @@ define <vscale x 4 x bfloat> @fsub_nxv4bf16(<vscale x 4 x bfloat> %a, <vscale x
;
; B16B16-LABEL: fsub_nxv4bf16:
; B16B16: // %bb.0:
-; B16B16-NEXT: bfsub z0.h, z0.h, z1.h
+; B16B16-NEXT: ptrue p0.s
+; B16B16-NEXT: bfsub z0.h, p0/m, z0.h, z1.h
; B16B16-NEXT: ret
%res = fsub <vscale x 4 x bfloat> %a, %b
ret <vscale x 4 x bfloat> %res
diff --git a/llvm/test/CodeGen/AArch64/sve-bf16-combines.ll b/llvm/test/CodeGen/AArch64/sve-bf16-combines.ll
index 16e8feb0dc5bb..86d4f61316446 100644
--- a/llvm/test/CodeGen/AArch64/sve-bf16-combines.ll
+++ b/llvm/test/CodeGen/AArch64/sve-bf16-combines.ll
@@ -311,8 +311,7 @@ define <vscale x 8 x bfloat> @fadd_sel_nxv8bf16(<vscale x 8 x bfloat> %a, <vscal
;
; SVE-B16B16-LABEL: fadd_sel_nxv8bf16:
; SVE-B16B16: // %bb.0:
-; SVE-B16B16-NEXT: bfadd z1.h, z0.h, z1.h
-; SVE-B16B16-NEXT: mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT: bfadd z0.h, p0/m, z0.h, z1.h
; SVE-B16B16-NEXT: ret
%sel = select <vscale x 8 x i1> %mask, <vscale x 8 x bfloat> %b, <vscale x 8 x bfloat> zeroinitializer
%fadd = fadd nsz <vscale x 8 x bfloat> %a, %sel
@@ -341,8 +340,7 @@ define <vscale x 8 x bfloat> @fsub_sel_nxv8bf16(<vscale x 8 x bfloat> %a, <vscal
;
; SVE-B16B16-LABEL: fsub_sel_nxv8bf16:
; SVE-B16B16: // %bb.0:
-; SVE-B16B16-NEXT: bfsub z1.h, z0.h, z1.h
-; SVE-B16B16-NEXT: mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT: bfsub z0.h, p0/m, z0.h, z1.h
; SVE-B16B16-NEXT: ret
%sel = select <vscale x 8 x i1> %mask, <vscale x 8 x bfloat> %b, <vscale x 8 x bfloat> zeroinitializer
%fsub = fsub <vscale x 8 x bfloat> %a, %sel
@@ -371,8 +369,7 @@ define <vscale x 8 x bfloat> @fadd_sel_negzero_nxv8bf16(<vscale x 8 x bfloat> %a
;
; SVE-B16B16-LABEL: fadd_sel_negzero_nxv8bf16:
; SVE-B16B16: // %bb.0:
-; SVE-B16B16-NEXT: bfadd z1.h, z0.h, z1.h
-; SVE-B16B16-NEXT: mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT: bfadd z0.h, p0/m, z0.h, z1.h
; SVE-B16B16-NEXT: ret
%nz = fneg <vscale x 8 x bfloat> zeroinitializer
%sel = select <vscale x 8 x i1> %mask, <vscale x 8 x bfloat> %b, <vscale x 8 x bfloat> %nz
@@ -402,8 +399,7 @@ define <vscale x 8 x bfloat> @fsub_sel_negzero_nxv8bf16(<vscale x 8 x bfloat> %a
;
; SVE-B16B16-LABEL: fsub_sel_negzero_nxv8bf16:
; SVE-B16B16: // %bb.0:
-; SVE-B16B16-NEXT: bfsub z1.h, z0.h, z1.h
-; SVE-B16B16-NEXT: mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT: bfsub z0.h, p0/m, z0.h, z1.h
; SVE-B16B16-NEXT: ret
%nz = fneg <vscale x 8 x bfloat> zeroinitializer
%sel = select <vscale x 8 x i1> %mask, <vscale x 8 x bfloat> %b, <vscale x 8 x bfloat> %nz
@@ -490,9 +486,7 @@ define <vscale x 8 x bfloat> @fsub_sel_fmul_nxv8bf16(<vscale x 8 x bfloat> %a, <
;
; SVE-B16B16-LABEL: fsub_sel_fmul_nxv8bf16:
; SVE-B16B16: // %bb.0:
-; SVE-B16B16-NEXT: bfmul z1.h, z1.h, z2.h
-; SVE-B16B16-NEXT: bfsub z1.h, z0.h, z1.h
-; SVE-B16B16-NEXT: mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT: bfmls z0.h, p0/m, z1.h, z2.h
; SVE-B16B16-NEXT: ret
%fmul = fmul <vscale x 8 x bfloat> %b, %c
%sel = select <vscale x 8 x i1> %mask, <vscale x 8 x bfloat> %fmul, <vscale x 8 x bfloat> zeroinitializer
@@ -532,9 +526,7 @@ define <vscale x 8 x bfloat> @fadd_sel_fmul_nsz_nxv8bf16(<vscale x 8 x bfloat> %
;
; SVE-B16B16-LABEL: fadd_sel_fmul_nsz_nxv8bf16:
; SVE-B16B16: // %bb.0:
-; SVE-B16B16-NEXT: bfmul z1.h, z1.h, z2.h
-; SVE-B16B16-NEXT: bfadd z1.h, z0.h, z1.h
-; SVE-B16B16-NEXT: mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT: bfmla z0.h, p0/m, z1.h, z2.h
; SVE-B16B16-NEXT: ret
%fmul = fmul <vscale x 8 x bfloat> %b, %c
%sel = select <vscale x 8 x i1> %mask, <vscale x 8 x bfloat> %fmul, <vscale x 8 x bfloat> zeroinitializer
@@ -574,9 +566,7 @@ define <vscale x 8 x bfloat> @fsub_sel_fmul_nsz_nxv8bf16(<vscale x 8 x bfloat> %
;
; SVE-B16B16-LABEL: fsub_sel_fmul_nsz_nxv8bf16:
; SVE-B16B16: // %bb.0:
-; SVE-B16B16-NEXT: bfmul z1.h, z1.h, z2.h
-; SVE-B16B16-NEXT: bfsub z1.h, z0.h, z1.h
-; SVE-B16B16-NEXT: mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT: bfmls z0.h, p0/m, z1.h, z2.h
; SVE-B16B16-NEXT: ret
%fmul = fmul <vscale x 8 x bfloat> %b, %c
%sel = select <vscale x 8 x i1> %mask, <vscale x 8 x bfloat> %fmul, <vscale x 8 x bfloat> zeroinitializer
@@ -616,9 +606,7 @@ define <vscale x 8 x bfloat> @fadd_sel_fmul_negzero_nxv8bf16(<vscale x 8 x bfloa
;
; SVE-B16B16-LABEL: fadd_sel_fmul_negzero_nxv8bf16:
; SVE-B16B16: // %bb.0:
-; SVE-B16B16-NEXT: bfmul z1.h, z1.h, z2.h
-; SVE-B16B16-NEXT: bfadd z1.h, z0.h, z1.h
-; SVE-B16B16-NEXT: mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT: bfmla z0.h, p0/m, z1.h, z2.h
; SVE-B16B16-NEXT: ret
%fmul = fmul <vscale x 8 x bfloat> %b, %c
%nz = fneg <vscale x 8 x bfloat> zeroinitializer
@@ -711,9 +699,7 @@ define <vscale x 8 x bfloat> @fadd_sel_fmul_negzero_nsz_nxv8bf16(<vscale x 8 x b
;
; SVE-B16B16-LABEL: fadd_sel_fmul_negzero_nsz_nxv8bf16:
; SVE-B16B16: // %bb.0:
-; SVE-B16B16-NEXT: bfmul z1.h, z1.h, z2.h
-; SVE-B16B16-NEXT: bfadd z1.h, z0.h, z1.h
-; SVE-B16B16-NEXT: mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT: bfmla z0.h, p0/m, z1.h, z2.h
; SVE-B16B16-NEXT: ret
%fmul = fmul <vscale x 8 x bfloat> %b, %c
%nz = fneg <vscale x 8 x bfloat> zeroinitializer
@@ -754,9 +740,7 @@ define <vscale x 8 x bfloat> @fsub_sel_fmul_negzero_nsz_nxv8bf16(<vscale x 8 x b
;
; SVE-B16B16-LABEL: fsub_sel_fmul_negzero_nsz_nxv8bf16:
; SVE-B16B16: // %bb.0:
-; SVE-B16B16-NEXT: bfmul z1.h, z1.h, z2.h
-; SVE-B16B16-NEXT: bfsub z1.h, z0.h, z1.h
-; SVE-B16B16-NEXT: mov z0.h, p0/m, z1.h
+; SVE-B16B16-NEXT: bfmls z0.h, p0/m, z1.h, z2.h
; SVE-B16B16-NEXT: ret
%fmul = fmul <vscale x 8 x bfloat> %b, %c
%nz = fneg <vscale x 8 x bfloat> zeroinitializer
MacDue
left a comment
LGTM
Hi - If this is specifically for a fadd node, then the fp-exception flags are not observable outside of strict-fp and we assume the operation cannot trap; otherwise fp operations would have side effects and not be speculatable. For fp exceptions to be observable, you are required to use the strict-fp llvm.constrained.fadd intrinsics, which should turn into a STRICT_FADD ISD node. There are details in https://llvm.org/docs/LangRef.html#constrained-floating-point-intrinsics. Strict-fp functions require a function attribute. For the target-predicated llvm.aarch64.sve.fadd intrinsics, my understanding is that the behaviour depends on the function: they should correctly preserve/honour fp exceptions and rounding modes in strict-fp functions, but need not do so outside of them. So if this is just for fadd, I think the code was fine and it is OK to do whatever in the off lanes, including setting fp exception bits. Some of this might change - there are discussions on Discourse - but my understanding is the basics wouldn't change: there is still a difference between ISD::FADD and ISD::STRICT_FADD, and fp exceptions by default are not observable.
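For reference, a minimal sketch of the constrained form described above (per the LangRef link; the scalar f32 type is chosen purely for illustration): the call uses the llvm.experimental.constrained.* intrinsic and the enclosing function carries the strictfp attribute.

; Strict-fp form: fp exceptions are observable, so this should select a
; STRICT_FADD ISD node rather than a plain FADD.
define float @strict_fadd(float %a, float %b) strictfp {
  %r = call float @llvm.experimental.constrained.fadd.f32(
           float %a, float %b,
           metadata !"round.dynamic", metadata !"fpexcept.strict")
  ret float %r
}

declare float @llvm.experimental.constrained.fadd.f32(float, float, metadata, metadata)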
I have been following the floating-point environment support for a while, although I kind of lost track after #109798 stalled. The constrained-intrinsic approach is not viable for target intrinsics, and there is ongoing work to redesign/replace/reimplement constrained intrinsics using operand bundles. A previous attempt of mine to guide selection using a function's strict-fp attribute was rejected. I do need to extend the SVE support for the STRICT_* ISD nodes, but today they simply lower to their non-strict counterparts. Given all of the above, we chose at least not to aggravate the issue; the effect on code generation is minimal because the required predicate is likely always available.
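For contrast with the constrained form above, a sketch of the target-predicated intrinsic form mentioned in this thread: the governing predicate is an explicit IR operand, so the behaviour of the inactive lanes is visible to and controlled by the caller (nxv8f16 is used for illustration; the bfloat variants follow the same shape).

; The governing predicate %pg is an explicit operand, independent of
; strict-fp mode.
define <vscale x 8 x half> @predicated_fadd(<vscale x 8 x i1> %pg,
                                            <vscale x 8 x half> %a,
                                            <vscale x 8 x half> %b) {
  %r = call <vscale x 8 x half> @llvm.aarch64.sve.fadd.nxv8f16(
           <vscale x 8 x i1> %pg, <vscale x 8 x half> %a, <vscale x 8 x half> %b)
  ret <vscale x 8 x half> %r
}

declare <vscale x 8 x half> @llvm.aarch64.sve.fadd.nxv8f16(<vscale x 8 x i1>, <vscale x 8 x half>, <vscale x 8 x half>)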
Yeah - everything I talked about is the way things did/currently work, not how they might work in the future. Adding support for strict-fp certainly sounds good. I had seen the strict-fp intrinsics being converted to standard nodes for unsupported operations in the past and was surprised by it; it did not seem sound. It looks like most vector operations will currently try to unroll: https://godbolt.org/z/Y7vbq7Tj3
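A fixed-length example one can feed to llc to reproduce the unrolling behaviour linked above (a sketch; the exact output depends on the target and LLVM version):

; Expected to scalarize (unroll) on most targets today rather than select a
; single vector strict operation.
define <4 x float> @strict_fadd_v4f32(<4 x float> %a, <4 x float> %b) strictfp {
  %r = call <4 x float> @llvm.experimental.constrained.fadd.v4f32(
           <4 x float> %a, <4 x float> %b,
           metadata !"round.dynamic", metadata !"fpexcept.strict")
  ret <4 x float> %r
}

declare <4 x float> @llvm.experimental.constrained.fadd.v4f32(<4 x float>, <4 x float>, metadata, metadata)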
Thanks @davemgreen. It seems the situation is worse than I thought, so I'll bump SVE support for the STRICT_* ISD nodes higher up my list.