-
Notifications
You must be signed in to change notification settings - Fork 6.2k
8318446: C2: optimize stores into primitive arrays by combining values into larger store #16245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
👋 Welcome back epeter! A progress list of the required criteria for merging this PR into |
|
@eme64 this pull request can not be integrated into git checkout JDK-8318446
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push |
|
I imagine it would be beneficial if we could merge stores to fields and stores from loads, which are common in object constructions. Thanks. |
|
@merykitty do you have examples for both? Maybe stores to fields already works. Merging loads and stores may be out of scope. That sounds a little much like SLP. We can still try to do that in a future RFE. We could even try to use (masked) vector instructions. |
|
@eme64 I have tried your patch, it seems that there are some limitations:
Regarding the final point, fields may be of different types with different sizes and there may be padding between them. This means that for load-store sequence merges, I think SLP cannot handle these cases. Thanks. |
|
@vnkozlov @rwestrel @TobiHartmann I refactored the code significantly, I think it is now much more well structured. Feel free to re-review. I'm out of the office next week, and will return to this then, and re-run the benchmarks. |
| @@ -0,0 +1,696 @@ | |||
| /* | |||
| * Copyright (c) 2023, Oracle and/or its affiliates. All rights reserved. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New Year ;)
| // The 4 StoreB are merged into a single StoreI node. We have to be careful with RangeCheck[i+1]: before | ||
| // the optimization, if this RangeCheck[i+1] fails, then we execute only StoreB[i+0], and then trap. After | ||
| // the optimization, the new StoreI[i+0] is on the passing path of RangeCheck[i+1], and StoreB[i+0] on the | ||
| // failing path. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we detect presence of RangeCheck which may cause us to move some stores on fail path and bailout the optimization. I don't think it is frequent case. I assume you will get RC on each store or not at all ("main" part of counted loop). Am I wrong here?
I don't remember, does C2 optimize RangeCheck nodes in linear code (it does in loops)?
I know about 2 relevant optimizations that remove / move RangeChecks:
becomes: I think the use-cases from @cl4es are often in straight-line code. Therefore we should cover the "smearing" case where exactly 1 RC remains in the sequence. What you can also see in Does that make sense? I should probably add this information in the comments, so that it is clear why we worry about a single RC at all. People are probably going to wonder like you: "I assume you will get RC on each store or not at all". |
src/hotspot/share/opto/memnode.cpp
Outdated
| // Thus, it is a common pattern that in a long chain of adjacent stores there | ||
| // remains exactly one RangeCheck, between the first and the second store. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remains exactly one RangeCheck is confusing because you still have RC for [i + 0]. So you have 2 RCs.
|
@eme64 thank you for looking on C2 RC optimizations. Now it is clear why you need to check for RC. |
|
New comment is good now. Thanks! |
|
@vnkozlov so you are approving of the current state of the code? Just asking because you have not explicitly re-approved the code ; @rwestrel @TobiHartmann Would you mind re-reviewing? |
TobiHartmann
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
|
Thanks @vnkozlov @TobiHartmann for the reviews! |
|
Going to push as commit 3ccb64c.
Your commit was automatically rebased without conflicts. |
|
The case where only stores of constant values are merged wouldn't be difficult to get working also on big endian platforms I think. |
|
@reinrich feel free to implement and thest the big-endian version. I just wanted to limit the scope of the PR, and I don't really have a big-endian machine to test on. I think one could surely get both, the constant and variable case implemented, in analogy to what I did. But maybe it would require some refactoring, to make sure the two versions live together nicely. |
|
Thanks for the quick answer. I might play a little bit with the version that stores constants. I expect the effort to be small there. |
This is a feature requiested by @RogerRiggs and @cl4es .
Idea
Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup. Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g.
Unsafe.putLongUnaligned), or ByteArrayLittleEndian (e.g.ByteArrayLittleEndian.setLong). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code.This patch here supports a few simple use-cases, like these:
Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant:
jdk/test/hotspot/jtreg/compiler/c2/TestMergeStores.java
Lines 383 to 395 in adca9e2
Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e. shifting and truncation), and directly store the variable:
jdk/test/hotspot/jtreg/compiler/c2/TestMergeStores.java
Lines 444 to 456 in adca9e2
The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian):
jdk/test/hotspot/jtreg/compiler/c2/TestMergeStores.java
Lines 327 to 338 in adca9e2
jdk/test/hotspot/jtreg/compiler/c2/TestMergeStores.java
Lines 467 to 472 in adca9e2
Details
This draft currently implements the optimization in an additional special IGVN phase:
jdk/src/hotspot/share/opto/compile.cpp
Lines 2479 to 2485 in adca9e2
We first collect all
StoreB|C|I, and put them in the IGVN worklist (seeCompile::gather_nodes_for_merge_stores). During IGVN, we callStoreNode::Ideal_merge_storesat the end (after all other optimizations) ofStoreNode::Ideal. We essentially try to establish a chain of mergable stores:jdk/src/hotspot/share/opto/memnode.cpp
Lines 2802 to 2806 in adca9e2
Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either both store constants, or adjacent segments of a larger value (i.e. a larger value right-shifted by a constant offset, see
is_con_RShift). Further, we must be able to prove that the stores reference adjacent memory (i.e. the address is shifted by the element size). For two mergable stores (oneuse, onedef), the def-store should not have any other use than the use-store, so that we only merge stores that are in the same basic block. With the only exceptions of merging through RangeChecks (which can have MergeMem nodes on the memory path, and hence such MergeMem are allowed as secondary uses of the def-node).I made this optimization a new phase, and placed it after loop-opts for these reasons:
CastIInodes have disappeared, and the address expression becomes much simpler (in particular, the constants from the integer index can only sink through the CastI2L after loop-opts). We could do adjacency checking with a more complicated algorithm, such asVPointerin the current autovectorizer.Performance
Performance (unit: ns/op) before (without patch) and after (with patch):

Legend:
ByteArrayLittleEndian.storeLongLE). Essentially the same as direct, except wrapped in funciton.Unsafe.Comments:
0.3) and only allocation (8.5).15%speedup. With that we reach equal performance with unsafe and bale.25-40%speedup. With that, direct reaches equal performance as unsafe or bale. But leapi still lags behind.14%speedup. Sadly, that only gets us half-way closer to unsafe and bale performance.34-50%speedup. With that, direct and bale are equally the fastest, and leapi and unsafe lag a bit behind.150%speedup. But only for direct, leapi somehow did not work out (TODO investigate). Still, this slightly lags behind bale and unsafe performance, about10%.1%. But in all those cases the error is larger than the regression, hence this is down to noise, i.e. insignifficant.66%speedup. With that, direct, leapi and unsafe are equally fastest, and bale lags more than10%behind.16-52%speedup.Conclusions:
perfasmmay help explain the remaining gaps.nonalloc). However, there is still some speedup for the benchmarks with allocation, however it is harder to see in the noise (larger errors).Related Bug
During development and testing of this changeset here, I found an independent bug: JDK-8319690. A PR was suggested, but now dropped. From my understanding, the assert in question is the problem, and product is unaffected. However, it is blocking the integration of this changeset here, since this changeset here triggers patterns that hit the problematic assert.
Testing
Tier 1-6 + stress-testing.
Performance testing: no significant difference.
Progress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16245/head:pull/16245$ git checkout pull/16245Update a local copy of the PR:
$ git checkout pull/16245$ git pull https://git.openjdk.org/jdk.git pull/16245/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 16245View PR using the GUI difftool:
$ git pr show -t 16245Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16245.diff
Webrev
Link to Webrev Comment