[AArch64] Neoverse V2 FeatureDisableLatencySchedHeuristic #140897

Merged

Conversation

sjoerdmeijer
Collaborator

This adds FeatureDisableLatencySchedHeuristic to the Neoverse V2 core tuning description. This gives us a 20% improvement on a key workload, some other minor improvements here and there, and no real regressions; nothing outside the noise levels.

Earlier attempts to solve this problem included disabling the MI scheduler entirely (#127784), and #139557 proposed a heuristic to avoid scheduling hand-written vector code. This solution is preferred because it avoids another heuristic and achieves what we want, and, for what it's worth, there is a lot of precedent for setting this feature.

Thanks to:

  • Ricardo Jesus for pointing out this subtarget feature, and
  • Cameron McInally for the extensive performance testing.
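
For readers not already familiar with the subtarget feature: it is defined alongside the other AArch64 tuning features (AArch64Features.td in current trees), roughly along these lines. This is a paraphrased sketch, not the verbatim in-tree definition:

def FeatureDisableLatencySchedHeuristic
    : SubtargetFeature<"disable-latency-sched-heuristic",
                       "DisableLatencySchedHeuristic", "true",
                       "Disable latency scheduling heuristic">;

As far as I understand the mechanism, a tune feature like this flips DisableLatencyHeuristic in the machine scheduler policy via the subtarget's scheduling-policy override, so the generic pre-RA scheduler stops using latency/stall estimates as a tie-breaker and leans more on register pressure and source order.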

@llvmbot
Member

llvmbot commented May 21, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Sjoerd Meijer (sjoerdmeijer)

Changes

This adds FeatureDisableLatencySchedHeuristic to the Neoverse V2 core tuning description. This gives us a 20% improvement on a key workload, some other minor improvements here and there, and no real regressions; nothing outside the noise levels.

Earlier attempts to solve this problem included disabling the MI scheduler entirely (#127784), and #139557 proposed a heuristic to avoid scheduling hand-written vector code. This solution is preferred because it avoids another heuristic and achieves what we want, and, for what it's worth, there is a lot of precedent for setting this feature.

Thanks to:

  • Ricardo Jesus for pointing out this subtarget feature, and
  • Cameron McInally for the extensive performance testing.

Full diff: https://github.com/llvm/llvm-project/pull/140897.diff

1 File Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64Processors.td (+2-1)
diff --git a/llvm/lib/Target/AArch64/AArch64Processors.td b/llvm/lib/Target/AArch64/AArch64Processors.td
index a5f0f6a2eb150..e7a3527202f6a 100644
--- a/llvm/lib/Target/AArch64/AArch64Processors.td
+++ b/llvm/lib/Target/AArch64/AArch64Processors.td
@@ -561,7 +561,8 @@ def TuneNeoverseV2 : SubtargetFeature<"neoversev2", "ARMProcFamily", "NeoverseV2
                                       FeatureEnableSelectOptimize,
                                       FeatureUseFixedOverScalableIfEqualCost,
                                       FeatureAvoidLDAPUR,
-                                      FeaturePredictableSelectIsExpensive]>;
+                                      FeaturePredictableSelectIsExpensive,
+                                      FeatureDisableLatencySchedHeuristic]>;
 
 def TuneNeoverseV3 : SubtargetFeature<"neoversev3", "ARMProcFamily", "NeoverseV3",
                                       "Neoverse V3 ARM processors", [

Contributor

@rj-jesus rj-jesus left a comment

We have looked at this and it looks good to me, but maybe give it a day or so before merging so that everyone can share their thoughts. :)

@davemgreen davemgreen requested a review from c-rhodes May 21, 2025 15:23
Collaborator

@davemgreen davemgreen left a comment

Yeah SGTM. (And I don't know a good way to test it either).

Collaborator

@c-rhodes c-rhodes left a comment

Also LGTM, can revisit if we notice any fallout for workloads we're interested in.

@sjoerdmeijer sjoerdmeijer merged commit 1540ed5 into llvm:main Jun 6, 2025
13 checks passed
@llvm-ci
Collaborator

llvm-ci commented Jun 6, 2025

LLVM Buildbot has detected a new failure on builder llvm-x86_64-debian-dylib running on gribozavr4 while building llvm at step 7 "test-build-unified-tree-check-llvm".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/60/builds/29556

Here is the relevant piece of the build log for reference:
Step 7 (test-build-unified-tree-check-llvm) failure: test (failure)
******************** TEST 'LLVM :: tools/llvm-exegesis/RISCV/rvv/filter.test' FAILED ********************
Exit Code: 2

Command Output (stderr):
--
/b/1/llvm-x86_64-debian-dylib/build/bin/llvm-exegesis -mtriple=riscv64 -mcpu=sifive-x280 -benchmark-phase=assemble-measured-code --mode=inverse_throughput --opcode-name=PseudoVNCLIPU_WX_M1_MASK     --riscv-filter-config='vtype = {VXRM: rod, AVL: VLMAX, SEW: e(8|16), Policy: ta/mu}' --max-configs-per-opcode=1000 --min-instructions=10 | /b/1/llvm-x86_64-debian-dylib/build/bin/FileCheck /b/1/llvm-x86_64-debian-dylib/llvm-project/llvm/test/tools/llvm-exegesis/RISCV/rvv/filter.test # RUN: at line 1
+ /b/1/llvm-x86_64-debian-dylib/build/bin/FileCheck /b/1/llvm-x86_64-debian-dylib/llvm-project/llvm/test/tools/llvm-exegesis/RISCV/rvv/filter.test
+ /b/1/llvm-x86_64-debian-dylib/build/bin/llvm-exegesis -mtriple=riscv64 -mcpu=sifive-x280 -benchmark-phase=assemble-measured-code --mode=inverse_throughput --opcode-name=PseudoVNCLIPU_WX_M1_MASK '--riscv-filter-config=vtype = {VXRM: rod, AVL: VLMAX, SEW: e(8|16), Policy: ta/mu}' --max-configs-per-opcode=1000 --min-instructions=10
PseudoVNCLIPU_WX_M1_MASK: Failed to produce any snippet via: instruction has tied variables, avoiding Read-After-Write issue, picking random def and use registers not aliasing each other, for uses, one unique register for each position
FileCheck error: '<stdin>' is empty.
FileCheck command line:  /b/1/llvm-x86_64-debian-dylib/build/bin/FileCheck /b/1/llvm-x86_64-debian-dylib/llvm-project/llvm/test/tools/llvm-exegesis/RISCV/rvv/filter.test

--

********************


rorth pushed a commit to rorth/llvm-project that referenced this pull request Jun 11, 2025
DhruvSrivastavaX pushed a commit to DhruvSrivastavaX/lldb-for-aix that referenced this pull request Jun 12, 2025
tomtor pushed a commit to tomtor/llvm-project that referenced this pull request Jun 14, 2025
@igogo-x86
Contributor

Hi, @sjoerdmeijer! There's a bit of a story here: PR #136329 caused about a 1.8% performance regression in 525.x264 on Neoverse V2, concentrated in x264_pixel_satd_16x16 and x264_pixel_satd_8x8. I put up a patch to address that (PR #146694), but in the meantime, your patch seems to have made codegen a bit less efficient in those functions.

With my current patch, we’re only recovering around 0.3–0.4% of the lost performance, while reverting your patch on top gives an additional 1.4–1.5% improvement. I’m not sure if this is something we can address easily - the scheduler is tricky territory - but I just wanted to bring it to your attention in case you have any thoughts.

@sjoerdmeijer
Collaborator Author

sjoerdmeijer commented Jul 11, 2025

Hi @igogo-x86, sorry for the late reply, and thanks for sharing this.
I have also benchmarked this patch on the V2. I don't think I saw this regression in x264; I'm not sure whether I missed it or whether something else is going on. I trust your numbers, but I am going to double-check them at my end out of curiosity. I will report back if I find anything interesting.

We have certainly seen some up-and-down behaviour too, but there were a lot more ups than downs (and a big up), making this a good change overall.

We do have a follow-up task on our backlog: we are interested to see whether we can claw back one of the regressions. But like you say, this is tricky business because of the scheduler and regalloc interaction, and there are different ways of solving this, so we will have to see. I don't think I can get more concrete than that at this point. The x264 problem sounds a bit different, though, so I am going to have a look at #146694 and see if I can help with that one.

@sjoerdmeijer
Collaborator Author

> I have also benchmarked this patch on the V2. I don't think I saw this regression in x264; I'm not sure whether I missed it or whether something else is going on. I trust your numbers, but I am going to double-check them at my end out of curiosity. I will report back if I find anything interesting.

Ah, sorry, I misunderstood. It was the SLP patch that caused a slowdown first, and then you found that this patch interacts with it and is not really helping. Do you know why, by any chance?

@igogo-x86
Contributor

LLVM IR is exactly the same before the SLP patch (#136329) and after my fix (#140897). However, the performance was still worse. So I looked at the assembly and saw that the instructions were identical, but the order was different. I then bisected the issue down to your change.

Here are the perf sample data:

Function                      Old Good Times   After SLP Patch   My Fix with Your Patch   My Fix without Your Patch
sym.x264_pixel_satd_16x16     11488            12236             12044                    11582
sym.x264_pixel_satd_8x8       7192             7884              6949                     6644

P.S. Something improved sym.x264_pixel_satd_8x8 in the meantime.
