8308606: C2 SuperWord: remove alignment checks when not required #14096
Conversation
```java
@IR(applyIfCPUFeatureOr = {"asimd", "true", "sse2", "true"},
    counts = {IRNode.STORE_VECTOR, ">0"})
public int[] indexWithDifferentConstants() {
    // No true dependency in read-forward case.
```
You may need to add `applyIf = {"AlignVector", "false"}` for these newly added IR check rules.
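To make the suggestion concrete, here is a hedged sketch of what the adjusted rule could look like (the annotation names follow the HotSpot IR test framework already used in this test; the method body is elided and illustrative only):

```java
// Sketch only: restrict the new IR rule to runs where AlignVector is off,
// since with -XX:+AlignVector the misaligned case may legitimately not vectorize.
@Test
@IR(applyIf = {"AlignVector", "false"},
    applyIfCPUFeatureOr = {"asimd", "true", "sse2", "true"},
    counts = {IRNode.STORE_VECTOR, ">0"})
public int[] indexWithDifferentConstants() {
    // ...
}
```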
@fg1417 you are right!
I think we should also add AlignVector to the IR whitelist. It makes sense to add it with this change here, because only from now on can we actually have misaligned loads / stores on the same memory-slice! So we should also test things more thoroughly now.
I also see that in test/hotspot/jtreg/compiler/vectorization/runner/ a lot of tests have @requires vm.flagless. That means we actually do not check any flag combinations with those tests. I think we should file an RFE to make them more general, and add the requirements to the IR rules.
Ok, there are some bugs around this, so I cannot yet directly add AlignVector to the IR framework whitelist. We can do it in a follow-up RFE: https://bugs.openjdk.org/browse/JDK-8309662
Hi @eme64, I'd like to explain more about the `@requires vm.flagless`. Vladimir Kozlov had suggested removing those annotations. I didn't do that before because they cannot simply be removed. All tests under `compiler/vectorization/runner/` are used for both correctness checks and vectorizability (IR) checks. For the correctness check, each test method is invoked twice, and the return results from the interpreter and from C2-compiled code are compared. We use compiler control via the WhiteBox API from the test runner to force these methods to run in the interpreter and in C2 (see the logic in `VectorizationTestRunner.java`). The forced compilation would fail if some extra compiler-control VM option (such as `-Xint`) were specified.

One way of removing `@requires vm.flagless` that I can think of is to skip the correctness check in the vectorization test runner if compiler control fails. I just filed JDK-8309697 for this. Please let me know if you have any better ideas or suggestions.
@pfustc If I run it with -Xint, then it says "Test results: no tests selected". I think that is because of @requires vm.compiler2.enabled. But sure, there may be some other flags that mess with the compiler controls.
But I think it is important to remove the `@requires vm.flagless`; there are always bugs lurking with more flag combinations. Plus, we don't have all the hardware that exists out there. That is why it is crucial that we can run with flags like `AlignVector` (some ARM machines have it set to true) or `UseKNLSetting` (Intel), for example.
I'll temporarily add the @requires vm.flagless back in for test/hotspot/jtreg/compiler/vectorization/runner/LoopArrayIndexComputeTest.java.
Hi @eme64, nice rewrite! May I ask if you have any benchmark data for misaligned load-store cases with other data types?
@fg1417 Ok, I will expand the misaligned load-store case to some other data types and test it on my two machines again!
Hi @eme64, I don't see any problem after going through everything you wrote. But as I'm not an official reviewer and don't have enough confidence with complex changes, I hope someone more familiar with this part can also review it. BTW: have you generated and run some JavaFuzzer tests for this patch? Based on my personal experience, it's quite helpful for finding hidden bugs in SuperWord or other complex loop optimizations. And a few more comments on this:

What we are currently doing for https://bugs.openjdk.org/browse/JDK-8308994 is like a traditional loop vectorizer - it can vectorize loops without unrolling. It can also support strided accesses (gather/scatter) with a few updates. But our current implementation is outside SuperWord and for post loops only. Perhaps a hybrid vectorizer implemented in SuperWord is a better idea. We will push our draft patch to GitHub soon for your feedback. Currently I'm finishing some routines before I can push the code; it's expected to be done in a few days.
I'm collecting the new benchmark results here, so that we can see the effect of misaligned load-stores. Machine: 11th Gen Intel® Core™ i7-11850H @ 2.50GHz × 16, with AVX512 support. (Benchmark tables for "with patch" and "master", on this machine and in comparison on an `aarch64` machine with `asimd` support, are omitted here.) Discussion: I'm only discussing the new results from …
Conclusion

From these experiments it is clearly profitable to make the change. The question is whether there are any other examples where the misalignment leads to such a bad penalty that vectorization is not profitable. Question for aarch64 specialists: how bad does the misalignment penalty really get? Do you think it would ever make vectorization unprofitable?
@pfustc By default we run the fuzzer for a few runs, but I'm running it a bit more now just to get a bit more confidence. I'm looking forward to your draft PR. Maybe we can work together towards a hybrid vectorizer. I plan to keep working on SuperWord and vectorization in general.
Hi @eme64, thanks for your perf data! I also tried your new benchmark on some of the latest hardware. I believe that the perf gap between the vectorized misaligned cases and the vectorized aligned cases may become smaller, and sometimes may even disappear, on newer hardware. Also, I strongly agree with your conclusion: it is clearly profitable to vectorize these misaligned cases. Thanks!
@fg1417 perfect, thanks for looking into that!
fg1417
left a comment
LGTM. Thanks for your work.
vnkozlov
left a comment
Nice re-write. Looks good to me.
@eme64 This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements; please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be as proposed; you can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time this comment was updated, 82 new commits had been pushed to the target branch.

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message, use the /integrate command.
Going to push as commit 886ac1c.
Your commit was automatically rebased without conflicts.
@pfustc Thanks for the info, I'll look at it soon! |
This change should strictly expand the set of vectorized loops, and it also makes `SuperWord` conceptually simpler.

As discussed in #12350, we should remove the alignment checks when alignment is not actually required (neither by the hardware nor explicitly requested with `-XX:+AlignVector`). We did not do it directly in the same task, to avoid too many behavior changes at once.

This alignment check was originally there in place of a proper dependency checker. Requiring alignment of the packs per memory slice meant that all vector lanes were aligned, so there could be no cross-iteration dependencies that lead to cycles. But this is not general enough (we may, for example, allow the vector lanes to cross at some point). And we now have proper independence checks in `SuperWord::combine_packs`, as well as the cycle check in `SuperWord::schedule`.

Alignment is nice when we can make it happen, as it ensures that we do not have memory accesses across cache lines. But we should not prevent vectorization just because we cannot align all memory accesses of the same memory slice. As the benchmark below shows, we get a good speedup from vectorizing unaligned memory accesses.
Note: this reduces the `CompileCommand Option Vectorize` flag to now only controlling whether we use the `CloneMap` or not. Read more about that in PR #13930. In the benchmarks below you can find some examples that only vectorize with, or only vectorize without, the `Vectorize` flag. My goal is to eventually try out both approaches and pick the better one, removing the need for the flag entirely (see "Unifying multiple SuperWord Strategies and beyond" below).

Changes to Tests
I could remove the `CompileCommand Option Vectorize` from `TestDependencyOffsets.java`, which means that those loops now vectorize without the need for the flag.

`LoopArrayIndexComputeTest.java` had a few "negative" tests that expected no vectorization because of "dependencies". But they were not real dependencies, since they were "read forward" cases. I now check that those do vectorize, and added symmetric tests with "read backward" cases which should currently not vectorize. However, these are still not "real dependencies" either: the arrays that are used could in theory be proven to be unequal, and then the dependencies could be dropped. But I think it is ok to leave them as "negative" tests for now, until we add such optimizations.
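To illustrate the distinction, here is a minimal sketch (assumed loop shapes and names, not the actual test code). A "read forward" loop only reads elements that a later iteration may overwrite, so every read happens before the aliasing write and vectorization is safe; the "read backward" mirror reads an element a previous iteration may have written, which is a true cross-iteration dependence when the arrays alias.

```java
import java.util.Arrays;

public class ReadDirectionSketch {
    // Read forward: b[i] reads a[i + 1]. Even if a == b, the write to index i
    // never affects a later read, so scalar and vector execution agree.
    public static void readForward(int[] a, int[] b) {
        for (int i = 0; i < a.length - 1; i++) {
            b[i] = a[i + 1];
        }
    }

    // Read backward: if a == b, iteration i reads the value iteration i - 1
    // just wrote — a true cross-iteration dependence that forbids naive
    // vectorization (a vector load would see the pre-store values).
    public static void readBackward(int[] a, int[] b) {
        for (int i = 1; i < a.length; i++) {
            b[i] = a[i - 1];
        }
    }

    public static void main(String[] args) {
        int[] x = {1, 2, 3, 4};
        readBackward(x, x); // aliasing: the first element propagates
        System.out.println(Arrays.toString(x)); // [1, 1, 1, 1]
    }
}
```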
Passes tier6 and stress testing.
No significant change in performance testing.
You can find some `x64` and `aarch64` benchmarks below, together with analysis and explanations. There is a lot of information below; feel free to read as little or as much as you find helpful.
Benchmark Data
Machine: 11th Gen Intel® Core™ i7-11850H @ 2.50GHz × 16, with `AVX512` support. Executed like this:
I have 4 flag combinations:
With patch:
On master:
Generally, we can easily see the difference between vectorization and non-vectorization: without vectorization the runtime is over 2000 ns/op, with vectorization it is under 600 ns/op.

In comparison, on an `aarch64` machine with `asimd` support:
aarch64machine withasimdsupport:With the patch:
On master:
Also on `aarch64` we can see a clear difference between vectorization and non-vectorization. The pattern is the same, even though the concrete numbers differ.

Benchmark Discussion: 0xx control
These are simple examples that vectorize unless `SuperWord` is disabled, just to make sure the benchmark works.

jdk/test/micro/org/openjdk/bench/vm/compiler/VectorAlignment.java
Lines 70 to 77 in e0658a4
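A minimal sketch of such a control kernel (assumed to mirror the 0xx shape; the real benchmark lives in `VectorAlignment.java`, lines 70 to 77): a plain loop with same-index accesses that any SuperWord configuration should vectorize.

```java
import java.util.Arrays;

public class ControlKernelSketch {
    // Load and store at the same index i: trivially alignable, so this
    // vectorizes under every flag combination except -XX:-UseSuperWord.
    public static int[] control(int[] a) {
        int[] b = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            b[i] = a[i] * 2;
        }
        return b;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(control(new int[]{1, 2, 3}))); // [2, 4, 6]
    }
}
```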
Benchmark Discussion: 1xx load and store misaligned
This vectorizes with the patch, but does not vectorize on master. It does not vectorize with `AlignVector` because of the misalignment (misaligned by 1 int = 4 bytes). On master, we require all vectors to align with all other vectors of the same memory slice.

jdk/test/micro/org/openjdk/bench/vm/compiler/VectorAlignment.java
Lines 79 to 85 in e0658a4
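A sketch of the misaligned shape (assumed to match the 1xx kernel's intent; names are illustrative): the store index is shifted by one int (4 bytes) relative to the load, so the load and store streams on the same memory slice cannot both be aligned to the vector width.

```java
import java.util.Arrays;

public class MisalignedKernelSketch {
    // The store b[i + 1] is offset by one element (4 bytes) from the load
    // a[i]. With the patch this still vectorizes; with -XX:+AlignVector,
    // or on master, the mutual-alignment requirement rejects it.
    public static void misaligned(int[] a, int[] b) {
        for (int i = 0; i < a.length - 1; i++) {
            b[i + 1] = a[i] * 2;
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = new int[4];
        misaligned(a, b);
        System.out.println(Arrays.toString(b)); // [0, 2, 4, 6]
    }
}
```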
Benchmark Discussion: 2xx vectorizes only without Vectorize
Hand-unrolling confuses SuperWord with the `Vectorize` flag. The issue is that adjacent memops do not come from the same original single-iteration node; rather, they come from two different source lines.

jdk/test/micro/org/openjdk/bench/vm/compiler/VectorAlignment.java
Lines 87 to 94 in e0658a4
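A sketch of the hand-unrolled shape (assumed to match the 2xx kernel; names are illustrative): the two statements in the body come from different source lines, so with the `Vectorize` flag SuperWord cannot match the adjacent memops back to one single-iteration node, while regular unrolling-based SuperWord still can.

```java
import java.util.Arrays;

public class HandUnrolledSketch {
    // Manual 2x unroll: b[i] and b[i + 1] are written by two distinct
    // statements, i.e. two distinct nodes in the loop body.
    public static void handUnrolled(int[] a, int[] b) {
        for (int i = 0; i + 1 < a.length; i += 2) {
            b[i]     = a[i]     * 2; // "first iteration" of the pair
            b[i + 1] = a[i + 1] * 2; // "second iteration", a separate line
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = new int[4];
        handUnrolled(a, b);
        System.out.println(Arrays.toString(b)); // [2, 4, 6, 8]
    }
}
```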
Here are the relevant checks in `SuperWord::find_adjacent_refs`:

jdk/src/hotspot/share/opto/superword.cpp
Lines 717 to 718 in e0658a4
jdk/src/hotspot/share/opto/superword.cpp
Line 743 in e0658a4
Benchmark Discussion: 3xx vectorizes only with Vectorize
Regular SuperWord fails in these cases for 2 reasons: the modulo computation of alignment, and the confusion of multiple loads that look identical (e.g. `aI[5]` with `aI[4+1]`).

jdk/test/micro/org/openjdk/bench/vm/compiler/VectorAlignment.java
Lines 96 to 110 in e0658a4
In `SuperWord::memory_alignment` we compute the alignment modulo `vw` (the vector width).

jdk/src/hotspot/share/opto/superword.cpp
Lines 3717 to 3720 in e0658a4
Now assume we have two load vectors, with offsets `[0,4,8,12]` and `[4,8,12,16]`. If we have `vw = 16`, then we get the `off_mod` to be `[0,4,8,12]` and `[4,8,12,0]`. The second vector thus has its last element wrap in the modulo space, and it does not pass the alignment check (`align1 + data_size == align2`):

jdk/src/hotspot/share/opto/superword.cpp
Lines 1302 to 1303 in e0658a4
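The wrap-around can be shown with a tiny numeric illustration (this mimics the described computation; it is not the actual C2 code): with `vw = 16` and 4-byte ints, the offsets `[4,8,12,16]` map to `[4,8,12,0]`, so the last adjacency check `align1 + data_size == align2` compares `12 + 4` against `0` and fails.

```java
public class AlignModSketch {
    // offMod mimics the modulo-vw alignment value described above
    // (illustrative only, not the actual SuperWord::memory_alignment code).
    public static int offMod(int offsetBytes, int vw) {
        return Math.floorMod(offsetBytes, vw);
    }

    public static void main(String[] args) {
        int vw = 16, dataSize = 4; // 16-byte vectors of 4-byte ints
        int[] second = {4, 8, 12, 16};
        for (int off : second) {
            System.out.println(off + " -> " + offMod(off, vw));
        }
        // Adjacency check for the last pair: 12 + 4 == 16, but modulo vw the
        // second offset wraps to 0, so the check fails.
        System.out.println(offMod(12, vw) + dataSize == offMod(16, vw)); // false
    }
}
```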
The consequence is that we only pack 3 of the 4 memops. And then the pack gets filtered out here, and vectorization fails:
jdk/src/hotspot/share/opto/superword.cpp
Line 1855 in e0658a4
One solution for this is to compute the alignment without the modulo. The modulo computation of alignment comes from a time when we could only have strictly aligned memory accesses, and so we never wanted to pack pairs that cross an alignment boundary, where the modulo wraps around. We could address this in a future RFE.

The second issue for vectorization is confusing multiple loads that look identical, e.g. `aI[5]` and `aI[4+1]`. A loop like `b[i] = a[i] + a[i+1]` that was unrolled 4 times will have these loads: `a[i], a[i+1], a[i+1], a[i+2], a[i+2], a[i+3], a[i+3], a[i+4]`. The SuperWord (SLP) algorithm just greedily picks pairs that are adjacent, and has no mechanism to deal with multiple packing options: we can pair `a[i]` with either of the two `a[i+1]`. If we do not pack them perfectly, then the packs will not line up with the other packs, and vectorization fails.
In the literature I have seen people who solve this problem with integer linear programming, but that would
most likely be too expensive for our JIT C2. We just have to accept that SuperWord (SLP) is greedy and cannot pack things
optimally in all cases.
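The source loop from the discussion above, written out (the loop body is taken verbatim from the text; the surrounding class and names are illustrative). After 4x unrolling, the load `a[i+1]` of iteration `i` and the second load of iteration `i+1` look identical, which is what makes the greedy pairing ambiguous.

```java
import java.util.Arrays;

public class DuplicateLoadSketch {
    // One single-iteration body, two loads: after unrolling, each a[i+1]
    // appears twice with an identical shape, so greedy SLP pairing can pick
    // the "wrong" copy and fail to line the packs up.
    public static void kernel(int[] a, int[] b) {
        for (int i = 0; i < a.length - 1; i++) {
            b[i] = a[i] + a[i + 1];
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5};
        int[] b = new int[4];
        kernel(a, b);
        System.out.println(Arrays.toString(b)); // [3, 5, 7, 9]
    }
}
```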
Luckily, the `Vectorize` approach does solve most of these cases, as it can separate the two loads in `b[i] = a[i] + a[i+1]`: they come from two different nodes in the single-iteration loop.

Benchmark Discussion: 4xx does not vectorize at all, even though it should be possible in principle
We combine the issues from 2xx and 3xx: hand-unrolling prevents `Vectorize` from working, and the confusion of multiple loads or the modulo alignment computation prevents non-`Vectorize` from working.

Unifying multiple SuperWord Strategies and beyond
We have now seen examples where sometimes it is better to go with the `Vectorize` flag, and sometimes it is better without it. Would it not be great if we could try out both strategies, and then pick the better one?
A very naive solution: just try without `Vectorize` first. If we get a non-empty packset, go with it. If it is empty, then try with `Vectorize`. That may work in many cases, but there will also be a few cases where without `Vectorize` we do create a non-empty packset which is just very suboptimal. Plus, in the future we may consider expanding the approaches to non-adjacent memory refs, such as strided accesses or even gather/scatter (as long as the dependency checks pass). Then it will be even more likely that both strategies create a non-empty packset, but one of the two creates a much better packset than the other.
A better solution: try out both approaches, and evaluate them with a cost model. Also compute the cost of the non-vectorized loop. Then pick the best option. This cost model will also be helpful to decide whether we should vectorize when we have `Reduction` nodes (they can be very expensive) or when introducing vector shuffles (we probably want to introduce them to allow reverse-order loops, where we need to reverse the vector lane order with a shuffle).
My suggestion is this: run both SuperWord approaches until we have the `PacksetGraph`. At this point we know if we could vectorize, including if any dependency checks fail (independence checks, cycle check). Then we evaluate the cost if we were to apply this `PacksetGraph`. We pick the cheapest `PacksetGraph` and apply it. This approach is also extensible (I got a bit inspired by LLVM talks about VPlan).
We can rename the `PacksetGraph` to a more general `VectorTransformGraph`, in the following steps:

- Create the `VectorTransformGraph` through multiple SuperWord strategies (with and without `Vectorize`). With a cost model, pick the best one.
- Widen the `VectorTransformGraph`, so that we can indeed make use of the whole vector.
- Create the `VectorTransformGraph` from a single-iteration loop, and try to widen the instructions there. If this succeeds, we do not have to unroll before vectorizing. This is essentially a traditional loop vectorizer, except that we can also run the SuperWord algorithm over it first to see if we already have any parallelism in the single-iteration loop, and then widen that. This makes it a hybrid vectorizer. Not having to unroll means direct time savings, but also that we could vectorize larger loops in the first place, since we would not hit the node limit for unrolling.
- Vectorize loops with control flow, using `VectorMaskCmp` and `VectorBlend`; or, if the branch is highly likely to take one side for all vector elements, we can also use `test_all_zeros` / `test_all_ones` to still branch.

Maybe there are even more vectorization approaches that could fit into this `VectorTransformGraph` scheme. The advantage is that it is modular, and we do not affect the C2 graph until we have decided on the best vectorization option via a cost model.
One item I have to spend more time learning about and integrating into this plan is `PostLoopMultiversioning`. It seems to use the widening approach. Maybe we can just extend the widening to a vector-masked version.

Example of large loop that is not vectorized
We have a limit of about 50 or 60 nodes for unrolling (`LoopUnrollLimit`). The loop only vectorizes if we raise that limit. Vectorizing before unrolling could help here; or partially unroll, apply SuperWord, and widen more.
Re-reviewing TestPickLastMemoryState.java
We once had some "collateral damage" in `TestPickLastMemoryState.java`, where we had to accept that some cases would not vectorize anymore, to ensure correctness in all other cases (#12350 (comment)). Let's re-assess how many of them now vectorize. `f` has a cyclic dependency in the graph (because we do not know that `a != b`). Run it with either:
In either case, we do not vectorize. Actually, we already do not create the pair packs for any memops except the `b[i-1]` store, since they detect that we do not have `independent(s1, s2)` for adjacent memop pairs. If we change the loop to:
Now we do not detect the dependence at distance 1, but only later when we check for dependence at further distances. We see lots of warnings because of pack removal: `WARNING: Found dependency at distance greater than 1.`. Without the `Vectorize` flag we somehow still manage to vectorize with 2 elements per vector, but that is hardly a success, as my machine would allow packing 16 ints in a 512-bit register. That just seems to be an artifact of not having a dependence at distance 1. It is not very interesting to add IR verification for that kind of vectorization.

`test1-6` are also relatively complex, and have cyclic dependencies of different kinds. I think we should just keep them as correctness tests, but not extend them to IR verification tests.
Reviewing

Using git

Checkout this PR locally:

$ git fetch https://git.openjdk.org/jdk.git pull/14096/head:pull/14096
$ git checkout pull/14096

Update a local copy of the PR:

$ git checkout pull/14096
$ git pull https://git.openjdk.org/jdk.git pull/14096/head

Using Skara CLI tools

Checkout this PR locally:

$ git pr checkout 14096

View PR using the GUI difftool:

$ git pr show -t 14096

Using diff file

Download this PR as a diff file:

https://git.openjdk.org/jdk/pull/14096.diff