I tested how many independent adds llvm-mca thinks each core can decode per cycle and compared that with the limit documented for the real hardware:
CPU: llvm-mca vs Arm-Software-Optimization-Guide "4.1 Dispatch constraints"
Neoverse-V1: 15 vs 8
Neoverse-V2: 16 vs 8
Neoverse-V3: 16 vs 10
Neoverse-N1: 8 vs 4
Neoverse-N2: 10 vs 5
Neoverse-N3: 10 vs 5
The decode/issue width currently used in the scheduling models seems to correspond to the number of uops that can be processed per cycle, not to the number of MOPs that can be decoded or read from the opcache.
Still, unless the cores are capable of fusing independent additions, they shouldn't be able to decode the instructions this quickly.
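For anyone who wants to repeat this kind of measurement, a kernel of independent adds could be generated with something like the sketch below. The helper name `independent_adds` and the register choice are assumptions made for illustration, not the exact input behind the numbers above.

```python
def independent_adds(count: int) -> str:
    """Return AArch64 assembly containing `count` mutually independent adds.

    Hypothetical helper: every add reads only x0 and x1 (which are never
    written) and writes its own destination register, so no add depends on
    the result of another.
    """
    first_dest = 2   # x2 is the first destination register
    last_dest = 28   # stop before x29/x30 (frame pointer / link register)
    if not 1 <= count <= last_dest - first_dest + 1:
        raise ValueError("count must be between 1 and 27")
    lines = [f"add x{first_dest + i}, x0, x1" for i in range(count)]
    return "\n".join(lines)


if __name__ == "__main__":
    # 16 adds is enough to exceed every decode width listed above.
    print(independent_adds(16))
```

The output can be piped into `llvm-mca -mtriple=aarch64 -mcpu=neoverse-v1` (or another `-mcpu` value), and the reported IPC then compared with the width from the optimization guide.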
Here is a code snippet where the additional decode capabilities cause an impossible result: https://godbolt.org/z/GbGrKWxsq
Here llvm-mca reports that the V1 executes a 13-instruction loop at 13 IPC, even though the core should only be able to decode up to 8 instructions per cycle.
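The arithmetic behind "impossible" can be sketched as follows. This is a rough bound, assuming each instruction decodes to exactly one MOP and nothing is fused; the function names are hypothetical.

```python
from fractions import Fraction


def min_cycles_per_iteration(instructions, backend_cycles, decode_width):
    """Lower bound on cycles per loop iteration.

    Assumes one MOP per instruction and no fusion, so the front end needs
    at least instructions / decode_width cycles no matter how fast the
    back end is.
    """
    frontend_cycles = Fraction(instructions, decode_width)
    return max(Fraction(backend_cycles), frontend_cycles)


def ipc_bound(instructions, backend_cycles, decode_width):
    """Upper bound on IPC implied by the cycle bound above."""
    cycles = min_cycles_per_iteration(instructions, backend_cycles, decode_width)
    return Fraction(instructions) / cycles


# Neoverse V1 example from this issue: 13 instructions, a back end that
# could retire them in 1 cycle, and a documented width of 8 per cycle.
print(float(min_cycles_per_iteration(13, 1, 8)))  # 1.625
print(float(ipc_bound(13, 1, 8)))                 # 8.0
```

Under these assumptions the loop needs at least 1.625 cycles per iteration, so any reported IPC above 8 points at the modelled front-end width rather than at the back end.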