Description
I have a working version of decimal.DecCalc which uses MultiplyNoFlags from dotnet/coreclr#21480 in a branch, but I discovered two issues.
If MultiplyNoFlags is called without having its result used, then it's assumed to be a no-op, even if the low part is used. While such use would be sub-optimal, it should still be valid (fixed in dotnet/coreclr#21928).
The second problem is that performance improves by at most 3% for some methods, while others suffer a penalty of up to 20%! This is primarily caused by the low result being forced out to memory and by excessive temporary register use, compounded by forced zero-init of the locals (even with no .locals init), which affects all code paths of the function.
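For checking what the generated code is supposed to compute, here is a minimal portable sketch of the same operation as the mulx repro that follows: the high 64 bits of the 128-bit product plus the low 64 bits, with wraparound. The names MulxReference, BigMul and Mulx are hypothetical helpers introduced only for illustration; they are not part of DecCalc or of the intrinsics API.

```csharp
using System;

static class MulxReference
{
    // Hypothetical helper: portable 64x64 -> 128-bit unsigned multiply.
    // Returns the high 64 bits and writes the low 64 bits to 'low',
    // mirroring the shape of MultiplyNoFlags without using intrinsics.
    public static ulong BigMul(ulong a, ulong b, out ulong low)
    {
        unchecked
        {
            ulong aLo = (uint)a, aHi = a >> 32;
            ulong bLo = (uint)b, bHi = b >> 32;

            // Four 32x32 -> 64-bit partial products.
            ulong ll = aLo * bLo;
            ulong lh = aLo * bHi;
            ulong hl = aHi * bLo;
            ulong hh = aHi * bHi;

            // Sum of the terms that land in bits 32..63; at most 3 * 2^32,
            // so it cannot overflow a ulong.
            ulong mid = (ll >> 32) + (uint)lh + (uint)hl;

            low = (mid << 32) | (uint)ll;
            return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
        }
    }

    // Same result as the repro: high part plus low part, wrapping on overflow.
    public static ulong Mulx(ulong a, ulong b)
    {
        ulong hi = BigMul(a, b, out ulong lo);
        return unchecked(hi + lo);
    }
}
```

Comparing MulxReference.Mulx against the intrinsic-based version on a few inputs gives a quick correctness check that is independent of how the JIT allocates registers.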
static unsafe ulong mulx(ulong a, ulong b)
{
    ulong r;
    return X86.Bmi2.X64.MultiplyNoFlags(a, b, &r) + r;
}
; Assembly listing for method DecCalc:mulx(long,long):long
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) long -> rcx
; V01 arg1 [V01,T01] ( 3, 3 ) long -> [rsp+0x18]
; V02 loc0 [V02 ] ( 2, 2 ) long -> [rsp+0x00] do-not-enreg[X] must-init addr-exposed ld-addr-op
;# V03 OutArgs [V03 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace"
;
; Lcl frame size = 8
G_M42317_IG01:
push rax
xor rax, rax
mov qword ptr [rsp], rax
mov qword ptr [rsp+18H], rdx
G_M42317_IG02:
lea rax, bword ptr [rsp]
mov rdx, rcx
mulx rdx, r8, qword ptr [rsp+18H]
mov qword ptr [rax], r8
mov rax, rdx
add rax, qword ptr [rsp]
G_M42317_IG03:
add rsp, 8
ret
; Total bytes of code 41, prolog size 7 for method DecCalc:mulx(long,long):long
While ideally this should be just:
mulx rax, rcx, rcx
add rax, rcx
ret
category:cq
theme:vector-codegen
skill-level:expert
cost:medium
impact:small