-
Notifications
You must be signed in to change notification settings - Fork 6.2k
8298935: fix independence bug in create_pack logic in SuperWord::find_adjacent_refs #12350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
👋 Welcome back epeter! A progress list of the required criteria for merging this PR into |
|
I think it would be helpful if you could include the Maybe folks that recently worked on superword (@jbhateja, @sviswanathan, @fg1417) know more. |
Hi @eme64, superword combines the results from jdk/src/hotspot/share/opto/superword.cpp Line 646 in 4c9de87
and jdk/src/hotspot/share/opto/superword.cpp Line 665 in 4c9de87
For example, in the case like it supposes that, if it holds But the case you proposed hits the bug of the algorithm. In your case, Anyway, never mind. Your fix looks good to me! Thanks. |
|
Sorry to clarify: if it holds |
|
@fg1417 Thanks for your response. I was trying to understand the same code yesterday. And your explanations match with what I have found. The issue is that we find I have think we need to change the order of the if statements. When we have If we have There is the additional complicating factor of It passes with My analysis is this: we rely on the I discussed with @chhagedorn and @vnkozlov : The plan is to swap the if statements in this change, and fix the issues with @fg1417 thanks again for taking the time to investigate this, I'm now more sure I'm going into the right direction. |
|
Generally, I am wondering about this though: This would then allow things like this: (We only do this if we have But it would prevent things like this: (We currently vectorize this if we have |
Yes, that's really a good idea to help vectorize more scenarios about reading forward. My concern is that we need to call once it holds Thanks. |
|
@fg1417 right, if we have A first idea: I have some more ideas, but I would need to better understand what we allow to be parallelized, to know what assumptions we can make. I guess this would become a larger project, designing the algorithm and implementing, testing and benchmarking it. For now I will just fix the broken issues, before diving more into extending vectorization. |
…ignment not required
Webrevs
|
|
@tstuefe @reinrich Would you mind reviewing / testing testing this for PPC? Maybe you would also want to port some tests to PPC? I also wonder: |
|
There was no big meaning in my question "Does not matter if they are in an other pack or not?" |
vnkozlov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
|
@eme64 This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been no new commits pushed to the ➡️ To integrate this PR with the above commit message to the |
jatin-bhateja
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @eme64 for your very informative and detailed explanations.
@vnkozlov I now found an example that reveals this Bug 4. I want to fix it in a separate Bug JDK-8304042. This is the method: Note: Lines 1 and 4 are As explained in the previous #12350 (comment), we have to verify that there are no cyclic dependencies between the packs, just before we schedule. The SuperWord paper states this in "3.7 Scheduling". |
…_framework/IRNode.java
Agree. |
|
@eme64, I ran with With these changes we almost don't generate vectors (may be 2 per |
|
@vnkozlov I'm looking into These are relevant vectorized methods:
Explanation:
My first guess is that it means that we reject the misaligned slice during I am running the test like this: On master, I see for Now let's see what would vectorize using the I generally see everything vectorize again. But actually, more vectorization, because we do not only re-introduce the rejected memops that are connected to the "surviving" memops. We keep all memops, do not reject any, and see later that it is safe to keep them. There is one exception: With However, with my patch and I think that the issue with Conclusion: What you saw in
There are some odd cases, where that remedy does not work. We saw an example above. The issue is that if we have different types with different element-byte-sizes, we may pack different numbers of elements into the packs. At the conversion-points, this leads to a vectorization-blocker, because we cannot have multiple vectors feed into a single one, or the other way around. We could try to address this with a follow-up RFE. For example, we could split vectors if we have multiple vectors as input/output. But most likely this is an edge case that is not very worrying. as I explain above. This exercise here also shows that we need even more IR tests, to catch these cases. Or we need to convert more of the SuperWord tests to IR tests. @vnkozlov What do you think about this? |
The test was added by 8290910 fix in JDK 20. Before that superword produces wrong results in these tests. The issue was found with fuzzer testing. I don't know where the test in 8293216 comes from. So you are right about this be corner case. Based on this I agree with not allowing vectorization in such cases for now. But file RFE to look on these cases much later. If vectorization produces valid result we should allow it. I understand that we are missing more precise checks which separate valid from invalid misaligned operations. I am not suggesting adding back code which extend memory ops without any checks but may be improve find_adjacent_refs when we can accept such cases. It is very complex case and fix could be also complex. |
|
Ok, great. I have the follwing follow-up RFE's: JDK-8303113 [SuperWord] investigate if enabling _do_vector_loop by default creates speedup JDK-8260943 Revisit vectorization optimization added by 8076284 JDK-8303827 C2 SuperWord: allow more fine grained alignment for +AlignVector JDK-8304042 C2 SuperWord: schedule must remove packs with cyclic dependencies Thanks again to all the involved! |
|
Going to push as commit 01e6920.
Your commit was automatically rebased without conflicts. |
openjdk#8877 introduced the global option `SuperWordMaxVectorSize` as a temporary solution to fix the performance regression on some x86 machines. Currently, SuperWordMaxVectorSize behaves differently between x86 and other platforms[1]. For example, if the current machine only supports `MaxVectorSize <= 32`, but we set `SuperWordMaxVectorSize = 64`, then `SuperWordMaxVectorSize` will be kept at 64 on other platforms while x86 machine would change `SuperWordMaxVectorSize` to `MaxVectorSize`. Other platforms except x86 miss similar implementations like [2]. Also, `SuperWordMaxVectorSize` limits the max vector size of auto-vectorization as `64`, which is fine for current aarch64 hardware but SVE architecture supports larger than 512 bits. The patch is to drop the global option and use an architecture- dependent interface to consult the max vector size for auto- vectorization, fixing the performance issue on x86 and reducing side effects for other platforms. After the patch, auto- vectorization is still limited to 32-byte vectors by default on Cascade Lake and users can override this by either setting `-XX:UseAVX=3` or `-XX:MaxVectorSize=64` on JVM command line. So my question is: Before the patch, we could have a smaller max vector size for auto-vectorization than `MaxVectorSize` on x86. For example, users could have `MaxVectorSize=64` and `SuperWordMaxVectorSize=32`. But after the change, if we set `-XX:MaxVectorSize=64` explicitly, then the max vector size for auto-vectorization would be `MaxVectorSize`, i.e. 64 bytes, which I believe is more reasonable. @sviswa7 @jatin-bhateja, are you happy about the change? [1] openjdk#12350 (comment) [2] https://github.com/openjdk/jdk/blob/33bec207103acd520eb99afb093cfafa44aecfda/src/hotspot/cpu/x86/vm_version_x86.cpp#L1314-L1333
As @eme64 said in [1], JDK-8298935 introduced some "collateral damage", disabling the vectorization of some conversions when `+AlignVector`. That affects IR checks of TestVectorizeTypeConversion.java and ArrayTypeConvertTest.java on some aarch64 platforms like ThunderX and ThunderX2 [2]. This trivial patch is to allow IR check only when we have "-AlignVector". [1] openjdk#12350 (comment) [2] https://github.com/openjdk/jdk/blob/7239150f8aff0e3dc07c5b27f6b7fb07237bfc55/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp#L154
List of important things below
dependency_graphandDepPreds8298935: fix independence bug in create_pack logic in SuperWord::find_adjacent_refs #12350 (comment)find_dependency. Arguments aboutindependenceof packs and cyclic dependencies between packs 8298935: fix independence bug in create_pack logic in SuperWord::find_adjacent_refs #12350 (comment)Original RFE description:
Cyclic dependencies are not handled correctly in all cases. Three examples:
jdk/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java
Lines 270 to 277 in 0a834cd
And this, compiled with
-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize:jdk/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java
Lines 173 to 180 in 0a834cd
And for
vmIntrinsics::_forEachRemainingcompile optionVectorizeis always enabled:jdk/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java
Lines 69 to 73 in 0a834cd
All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values.
Analysis
The
create_packlogic inSuperWord::find_adjacent_refsis broken in two ways:Vectorizeis on, or we compilevmIntrinsics::_forEachRemainingwe have_do_vector_loop == true. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by theindependence(s1, s2)checks we do for all adjacent memops. But for larger distances, we rely onmemory_alignment == 0. But the compile directive avoids these checks.best_align_to_mem_refis of a different type, and we havememory_alignment(mem_ref, best_align_to_mem_ref) == 0, we do not check ifmem_refhasmemory_alignment == 0for all other refs of the same type. In the exampleTestCyclicDependency::test2, we havebest_align_to_mem_refas theStoreF. Then we assess theStoreI, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look atLoadI, which has perfect alignment with theStoreF, so we accept it too (even though it is in conflict with theStoreI).Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code.
I also propose to only allow the compile directive
Vectorizeonly ifvectors_should_be_aligned() == false. If all vector operations have to bevector_widthaligned, then they also have to be mutually aligned, and we cannot have patterns likev[i] = v[i] + v[i+1]for which the compile directive was introduced in the first place c7d33de.Update: I found a Test.java that lead to a crash (
SIGBUS) on a ARM32 on master. The example bypassed the alignment requirement because of_do_vector_loop, and allowed unaligned vector loads to be generated, on a platform that requires alignment. Thanks @fg1417 for running that test for me!Solution
First, I implemented
SuperWord::verify_packswhich catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (seeSuperWord::find_dependence). With this verification in place we at least get an assert instead of wrong execution.I then refactored and fixed the
create_packcode, and put the logic all inSuperWord::is_mem_ref_alignment_ok. With the added comments, I hope the logic is more straight forward and readable. If_do_vector_loop == true, then I filter the vector packs again inSuperWord::combine_packs, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent.Another change I have made:
Disallow
extend_packlistfrom addingMemNodesback in. Because if we have rejected some memops, we do not want them to be added back in later.Testing
I added a few more regression tests, and am running tier1-3, plus some stress testing.
However, I need help from someone who can test this on ARM32 and PPC, basically machines that have
vectors_should_be_aligned() == true. I would love to have additional testing on those machine, and some reviews.Update: @fg1417 did testing on ARM32, @reinrich did testing on PPC.
Discussion / Future Work
I wonder if we should have
_do_vector_loop == trueby default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? JDK-8303113Progress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/12350/head:pull/12350$ git checkout pull/12350Update a local copy of the PR:
$ git checkout pull/12350$ git pull https://git.openjdk.org/jdk pull/12350/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 12350View PR using the GUI difftool:
$ git pr show -t 12350Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/12350.diff