[AArch64] Neoverse V2 FeatureDisableLatencySchedHeuristic #140897

Merged

Conversation

sjoerdmeijer
Collaborator

This adds FeatureDisableLatencySchedHeuristic to the Neoverse V2 core tuning description. This gives us a 20% improvement on a key workload, some other minor improvements here and there, and no real regressions; nothing outside the noise levels.

Earlier attempts to solve this problem included disabling the MI scheduler entirely (#127784), and #139557 proposed a heuristic to avoid scheduling hand-written vector code. This solution is preferred because it avoids another heuristic and achieves what we want, and, for what it's worth, there is a lot of precedent for setting this feature.

Thanks to:

  • Ricardo Jesus for pointing out this subtarget feature, and
  • Cameron McInally for the extensive performance testing.
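
For readers not already familiar with the subtarget feature: it is defined alongside the other AArch64 tuning features (AArch64Features.td in current trees), roughly along these lines. This is a paraphrased sketch, not the verbatim in-tree definition:

def FeatureDisableLatencySchedHeuristic
    : SubtargetFeature<"disable-latency-sched-heuristic",
                       "DisableLatencySchedHeuristic", "true",
                       "Disable latency scheduling heuristic">;

As far as I understand the mechanism, a tune feature like this flips DisableLatencyHeuristic in the machine scheduler policy via the subtarget's scheduling-policy override, so the generic pre-RA scheduler stops using latency/stall estimates as a tie-breaker and leans more on register pressure and source order.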

@llvmbot
Member

llvmbot commented May 21, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Sjoerd Meijer (sjoerdmeijer)

Changes

This adds FeatureDisableLatencySchedHeuristic to the Neoverse V2 core tuning description. This gives us a 20% improvement on a key workload, some other minor improvements here and there, and no real regressions; nothing outside the noise levels.

Earlier attempts to solve this problem included disabling the MI scheduler entirely (#127784), and #139557 proposed a heuristic to avoid scheduling hand-written vector code. This solution is preferred because it avoids another heuristic and achieves what we want, and, for what it's worth, there is a lot of precedent for setting this feature.

Thanks to:

  • Ricardo Jesus for pointing out this subtarget feature, and
  • Cameron McInally for the extensive performance testing.

Full diff: https://github.com/llvm/llvm-project/pull/140897.diff

1 File Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64Processors.td (+2-1)
diff --git a/llvm/lib/Target/AArch64/AArch64Processors.td b/llvm/lib/Target/AArch64/AArch64Processors.td
index a5f0f6a2eb150..e7a3527202f6a 100644
--- a/llvm/lib/Target/AArch64/AArch64Processors.td
+++ b/llvm/lib/Target/AArch64/AArch64Processors.td
@@ -561,7 +561,8 @@ def TuneNeoverseV2 : SubtargetFeature<"neoversev2", "ARMProcFamily", "NeoverseV2
                                       FeatureEnableSelectOptimize,
                                       FeatureUseFixedOverScalableIfEqualCost,
                                       FeatureAvoidLDAPUR,
-                                      FeaturePredictableSelectIsExpensive]>;
+                                      FeaturePredictableSelectIsExpensive,
+                                      FeatureDisableLatencySchedHeuristic]>;
 
 def TuneNeoverseV3 : SubtargetFeature<"neoversev3", "ARMProcFamily", "NeoverseV3",
                                       "Neoverse V3 ARM processors", [

Contributor

@rj-jesus rj-jesus left a comment

We have looked at this and it looks good to me, but maybe give it a day or so before merging so that everyone can share their thoughts. :)

@davemgreen davemgreen requested a review from c-rhodes May 21, 2025 15:23
Collaborator

@davemgreen davemgreen left a comment

Yeah SGTM. (And I don't know a good way to test it either).

Collaborator

@c-rhodes c-rhodes left a comment

Also LGTM, can revisit if we notice any fallout for workloads we're interested in.

@sjoerdmeijer sjoerdmeijer merged commit 1540ed5 into llvm:main Jun 6, 2025
13 checks passed
@llvm-ci
Collaborator

llvm-ci commented Jun 6, 2025

LLVM Buildbot has detected a new failure on builder llvm-x86_64-debian-dylib running on gribozavr4 while building llvm at step 7 "test-build-unified-tree-check-llvm".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/60/builds/29556

Here is the relevant piece of the build log for reference:
Step 7 (test-build-unified-tree-check-llvm) failure: test (failure)
******************** TEST 'LLVM :: tools/llvm-exegesis/RISCV/rvv/filter.test' FAILED ********************
Exit Code: 2

Command Output (stderr):
--
/b/1/llvm-x86_64-debian-dylib/build/bin/llvm-exegesis -mtriple=riscv64 -mcpu=sifive-x280 -benchmark-phase=assemble-measured-code --mode=inverse_throughput --opcode-name=PseudoVNCLIPU_WX_M1_MASK     --riscv-filter-config='vtype = {VXRM: rod, AVL: VLMAX, SEW: e(8|16), Policy: ta/mu}' --max-configs-per-opcode=1000 --min-instructions=10 | /b/1/llvm-x86_64-debian-dylib/build/bin/FileCheck /b/1/llvm-x86_64-debian-dylib/llvm-project/llvm/test/tools/llvm-exegesis/RISCV/rvv/filter.test # RUN: at line 1
+ /b/1/llvm-x86_64-debian-dylib/build/bin/FileCheck /b/1/llvm-x86_64-debian-dylib/llvm-project/llvm/test/tools/llvm-exegesis/RISCV/rvv/filter.test
+ /b/1/llvm-x86_64-debian-dylib/build/bin/llvm-exegesis -mtriple=riscv64 -mcpu=sifive-x280 -benchmark-phase=assemble-measured-code --mode=inverse_throughput --opcode-name=PseudoVNCLIPU_WX_M1_MASK '--riscv-filter-config=vtype = {VXRM: rod, AVL: VLMAX, SEW: e(8|16), Policy: ta/mu}' --max-configs-per-opcode=1000 --min-instructions=10
PseudoVNCLIPU_WX_M1_MASK: Failed to produce any snippet via: instruction has tied variables, avoiding Read-After-Write issue, picking random def and use registers not aliasing each other, for uses, one unique register for each position
FileCheck error: '<stdin>' is empty.
FileCheck command line:  /b/1/llvm-x86_64-debian-dylib/build/bin/FileCheck /b/1/llvm-x86_64-debian-dylib/llvm-project/llvm/test/tools/llvm-exegesis/RISCV/rvv/filter.test

--

********************


rorth pushed a commit to rorth/llvm-project that referenced this pull request Jun 11, 2025
DhruvSrivastavaX pushed a commit to DhruvSrivastavaX/lldb-for-aix that referenced this pull request Jun 12, 2025
tomtor pushed a commit to tomtor/llvm-project that referenced this pull request Jun 14, 2025
@igogo-x86
Contributor

Hi, @sjoerdmeijer! There's a bit of a story here: PR #136329 caused about a 1.8% performance regression in 525.x264 on Neoverse V2, concentrated in x264_pixel_satd_16x16 and x264_pixel_satd_8x8. I put up a patch to address that (PR #146694), but in the meantime, your patch seems to have made codegen a bit less efficient in those functions.

With my current patch, we’re only recovering around 0.3–0.4% of the lost performance, while reverting your patch on top gives an additional 1.4–1.5% improvement. I’m not sure if this is something we can address easily - the scheduler is tricky territory - but I just wanted to bring it to your attention in case you have any thoughts.

@sjoerdmeijer
Collaborator Author

sjoerdmeijer commented Jul 11, 2025

Hi @igogo-x86, sorry for the late reply, and thanks for sharing this.
I have also benchmarked this patch on the V2. I don't think I saw this regression in x264; I'm not sure whether I missed it or whether something else is going on. I trust your numbers, but I am going to double-check them at my end out of curiosity. I will report back if I find anything interesting.

We have certainly seen some up-and-down behaviour too, but there were a lot more ups than downs (and a big up), making this a good change overall.

We do have a follow-up task on our backlog: we are interested to see whether we can claw back one of the regressions. But like you say, this is tricky business because of the scheduler and regalloc interaction, and there are different ways of solving this, so we will have to see. I don't think I can get more concrete than that at this point. The x264 problem sounds a bit different, though, so I am going to have a look at #146694 and see if I can help with that one.

@sjoerdmeijer
Collaborator Author

> I have also benchmarked this patch on the V2. I don't think I saw this regression in x264; I'm not sure whether I missed it or whether something else is going on. I trust your numbers, but I am going to double-check them at my end out of curiosity. I will report back if I find anything interesting.

Ah, sorry, I misunderstood. It was the SLP patch that caused a slowdown first, and then you found that this patch interacts with it and is not really helping. Do you know why, by any chance?

@igogo-x86
Contributor

LLVM IR is exactly the same before the SLP patch (#136329) and after my fix (#140897). However, the performance was still worse. So I looked at the assembly and saw that the instructions were identical, but the order was different. I then bisected the issue down to your change.

Here are the perf sample data:

Function                      Old Good Times   After SLP Patch   My Fix with Your Patch   My Fix without Your Patch
sym.x264_pixel_satd_16x16     11488            12236             12044                    11582
sym.x264_pixel_satd_8x8       7192             7884              6949                     6644

P.S. Something improved sym.x264_pixel_satd_8x8 in the meantime.
