
Conversation

@eme64
Contributor

eme64 commented Oct 18, 2023

This is a feature requested by @RogerRiggs and @cl4es.

Idea

Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedups. Recently, @cl4es and @RogerRiggs had to review a few PRs where people tried to get speedups by using Unsafe (e.g. Unsafe.putLongUnaligned) or ByteArrayLittleEndian (e.g. ByteArrayLittleEndian.setLong). They asked if we could do such an optimization in C2, rather than in the Java library code, or even user code.

This patch here supports a few simple use-cases, like these:

Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant:

@Test
@IR(counts = {IRNode.STORE_L_OF_CLASS, "byte\\[int:>=0] \\(java/lang/Cloneable,java/io/Serializable\\)", "1"})
static Object[] test1a(byte[] a) {
    a[0] = (byte)0xbe;
    a[1] = (byte)0xba;
    a[2] = (byte)0xad;
    a[3] = (byte)0xba;
    a[4] = (byte)0xef;
    a[5] = (byte)0xbe;
    a[6] = (byte)0xad;
    a[7] = (byte)0xde;
    return new Object[]{ a };
}
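
On a little-endian platform, the eight constant byte stores above combine into a single long store of the constant 0xdeadbeefbaadbabeL (low byte first).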

Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e. shifting and truncation), and directly store the variable:

@Test
@IR(counts = {IRNode.STORE_L_OF_CLASS, "byte\\[int:>=0] \\(java/lang/Cloneable,java/io/Serializable\\)", "1"})
static Object[] test2a(byte[] a, int offset, long v) {
    a[offset + 0] = (byte)(v >> 0);
    a[offset + 1] = (byte)(v >> 8);
    a[offset + 2] = (byte)(v >> 16);
    a[offset + 3] = (byte)(v >> 24);
    a[offset + 4] = (byte)(v >> 32);
    a[offset + 5] = (byte)(v >> 40);
    a[offset + 6] = (byte)(v >> 48);
    a[offset + 7] = (byte)(v >> 56);
    return new Object[]{ a };
}
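
After merging, these eight byte stores behave like a single (possibly unaligned) little-endian long store of v at a[offset .. offset+7], which is what the IR rule above checks for (exactly one StoreL).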

The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian):

// Store a long into a byte array in little-endian order, using plain byte stores
@ForceInline
static void storeLongLE(byte[] bytes, int offset, long value) {
    storeBytes(bytes, offset, (byte)(value >> 0 ),
                              (byte)(value >> 8 ),
                              (byte)(value >> 16),
                              (byte)(value >> 24),
                              (byte)(value >> 32),
                              (byte)(value >> 40),
                              (byte)(value >> 48),
                              (byte)(value >> 56));
}

@Test
@IR(counts = {IRNode.STORE_L_OF_CLASS, "byte\\[int:>=0] \\(java/lang/Cloneable,java/io/Serializable\\)", "1"})
static Object[] test2c(byte[] a, int offset, long v) {
    storeLongLE(a, offset, v);
    return new Object[]{ a };
}

Details

This draft currently implements the optimization in an additional special IGVN phase:

if (true) {
  assert(!C->merge_stores_phase(), "merge store phase not yet set");
  C->gather_nodes_for_merge_stores(igvn);
  C->set_merge_stores_phase(true);
  igvn.optimize();
  C->set_merge_stores_phase(false);
}

We first collect all StoreB|C|I nodes and put them on the IGVN worklist (see Compile::gather_nodes_for_merge_stores). During IGVN, we call StoreNode::Ideal_merge_stores at the end of StoreNode::Ideal (i.e. after all other Ideal optimizations). We essentially try to establish a chain of mergeable stores:

// Collect list of stores
while (def != nullptr && merge_list.size() <= merge_list_max_size) {
  merge_list.push(def);
  def = def->can_merge_with_def(phase, true);
}

Mergeable stores must satisfy all of the following conditions (see the non-example sketch below):

  • They have the same Opcode (which implies the same element type, and hence the same size).
  • They have the same control (or are separated only by a RangeCheck).
  • They either all store constants, or they store adjacent segments of the same larger value (i.e. the larger value right-shifted by constant offsets, see is_con_RShift).
  • We must be able to prove that the stores reference adjacent memory (i.e. the addresses differ by exactly the element size).
  • For two mergeable stores (one use, one def), the def-store must have no use other than the use-store, so that we only merge stores that are in the same basic block. The only exception is merging through RangeChecks, which can have MergeMem nodes on the memory path; such MergeMem nodes are allowed as secondary uses of the def-store.
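
For contrast, a sketch (hypothetical code, not from the patch's tests) of store sequences that violate these conditions and are therefore left alone:

// Assumed behavior based on the conditions above: neither pair of stores
// below forms a mergeable chain.
static void notMerged(byte[] a, long v) {
    a[0] = (byte)(v >> 0);
    a[1] = (byte)(v >> 16); // not adjacent segments of v: bits 8..15 are skipped
    a[3] = (byte)(v >> 24); // not adjacent memory: a[2] is skipped
}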

I made this optimization a new phase, and placed it after loop-opts for these reasons:

  • I do not want it to interfere with loop-opts, in particular with the autovectorizer (SuperWord).
  • I don't want it to interfere with any other memory optimizations; this should just improve things if nothing else worked.
  • Checking if two memory addresses are adjacent is much simpler after loop-opts, when some of the CastII nodes have disappeared, and the address expression becomes much simpler (in particular, the constants from the integer index can only sink through the CastI2L after loop-opts). We could do adjacency checking with a more complicated algorithm, such as VPointer in the current autovectorizer.

Performance

Performance (unit: ns/op) before (without patch) and after (with patch):
[image: benchmark results table]

Legend:

  • bale: ByteArrayLittleEndian.
  • direct: Java code with multiple stores (and possibly shifting the variable to split its value).
  • leapi: custom "little-endian" API (e.g. storeLongLE). Essentially the same as direct, except wrapped in a function.
  • unsafe: Unsafe.

Comments:

  1. Baseline performance of empty method (0.3) and only allocation (8.5).
  2. Storing two constant bytes: 15% speedup. With that, we reach performance equal to unsafe and bale.
  3. Storing an int to 4 bytes: 25-40% speedup. With that, direct reaches performance equal to unsafe and bale. But leapi still lags behind.
  4. Storing 4 constant bytes: 14% speedup. Sadly, that only gets us half-way closer to unsafe and bale performance.
  5. Storing an int twice to bytes: 34-50% speedup. With that, direct and bale are equally the fastest, and leapi and unsafe lag a bit behind.
  6. Storing a long to bytes: 150% speedup. But only for direct; leapi somehow did not work out (TODO: investigate). Still, this slightly lags behind bale and unsafe performance, by about 10%.
  7. Apparent regressions of about 1%. But in all these cases the error is larger than the regression, so this is down to noise, i.e. insignificant.
  8. Storing 8 constant bytes: 66% speedup. With that, direct, leapi and unsafe are equally fastest, and bale lags more than 10% behind.
  9. A few examples with stores to short/int arrays. 16-52% speedup.

Conclusions:

  • There are significant speedups. They close the gap between direct/leapi and unsafe/bale, with a few exceptions where a smaller gap remains. A closer investigation with perfasm may help explain the remaining gaps.
  • In particular, the occasional gap between leapi and direct requires investigation.
  • The clearest wins can be seen if the benchmark does not have any allocation (nonalloc). There is still some speedup for the benchmarks with allocation, but it is harder to see in the noise (larger errors).
  • Generally: the more stores can be merged into one larger store, the clearer the speedup.

Related Bug

During development and testing of this changeset, I found an independent bug: JDK-8319690. A PR was suggested, but has since been dropped. From my understanding, the assert in question is the problem, and product builds are unaffected. However, it is blocking the integration of this changeset, since this changeset triggers patterns that hit the problematic assert.

Testing

Tier 1-6 + stress-testing.
Performance testing: no significant difference.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8318446: C2: optimize stores into primitive arrays by combining values into larger store (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16245/head:pull/16245
$ git checkout pull/16245

Update a local copy of the PR:
$ git checkout pull/16245
$ git pull https://git.openjdk.org/jdk.git pull/16245/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 16245

View PR using the GUI difftool:
$ git pr show -t 16245

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16245.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Oct 18, 2023

👋 Welcome back epeter! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot changed the title from 8318446 to 8318446: C2: implement StoreNode::Ideal_merge_stores Oct 18, 2023
@openjdk

openjdk bot commented Oct 18, 2023

@eme64 this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8318446
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Oct 18, 2023
@openjdk

openjdk bot commented Oct 18, 2023

@eme64 The following labels will be automatically applied to this pull request:

  • build
  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Oct 23, 2023
@merykitty
Member

I imagine it would be beneficial if we could merge stores to fields, and stores of values that come from loads, which are common in object construction.

Thanks.

@eme64
Contributor Author

eme64 commented Oct 25, 2023

@merykitty do you have examples for both? Maybe stores to fields already work. Merging loads and stores may be out of scope; that sounds a little too much like SLP. We can still try to do that in a future RFE. We could even try to use (masked) vector instructions.

@merykitty
Member

@eme64 I have tried your patch; it seems that there are some limitations:

  • The stores are not merged if the order is not right (e.g. a[2] = 2; a[1] = 1;).
  • The stores are not merged if they are floating-point constants.
  • The stores are not merged if they are consecutive fields in an object. E.g:
    class Point {
        int x; int y;
    }

    p.x = 1;
    p.y = 2; // Cannot merge into mov [p.x], 0x200000001

Regarding the final point, fields may be of different types with different sizes, and there may be padding between them. This means that, as with load-store sequence merges, I think SLP cannot handle these cases.

Thanks.

@openjdk openjdk bot added ready Pull request is ready to be integrated rfr Pull request is ready for review labels Apr 5, 2024
@eme64
Contributor Author

eme64 commented Apr 5, 2024

@vnkozlov @rwestrel @TobiHartmann I refactored the code significantly; I think it is now much better structured.
I also only allow a single RangeCheck now, which makes sure that the "first" store floats towards the uncommon trap.

Feel free to re-review. I'm out of the office next week, and will return to this then, and re-run the benchmarks.

@@ -0,0 +1,696 @@
/*
* Copyright (c) 2023, Oracle and/or its affiliates. All rights reserved.
Contributor

New Year ;)

Comment on lines +2877 to +2880
// The 4 StoreB are merged into a single StoreI node. We have to be careful with RangeCheck[i+1]: before
// the optimization, if this RangeCheck[i+1] fails, then we execute only StoreB[i+0], and then trap. After
// the optimization, the new StoreI[i+0] is on the passing path of RangeCheck[i+1], and StoreB[i+0] on the
// failing path.
Contributor

Can we detect the presence of a RangeCheck which may cause us to move some stores onto the fail path, and bail out of the optimization? I don't think it is a frequent case. I assume you will get an RC on each store, or not at all ("main" part of a counted loop). Am I wrong here?
I don't remember: does C2 optimize RangeCheck nodes in linear code (it does in loops)?

@eme64
Contributor Author

eme64 commented Apr 15, 2024

@vnkozlov

Can we detect the presence of a RangeCheck which may cause us to move some stores onto the fail path, and bail out of the optimization? I don't think it is a frequent case. I assume you will get an RC on each store, or not at all ("main" part of a counted loop). Am I wrong here? I don't remember: does C2 optimize RangeCheck nodes in linear code (it does in loops)?

I know about 2 relevant optimizations that remove / move RangeChecks:

  • RCE (RangeCheck Elimination from loops): hoist all RangeCheck before the loop. That way, there are no RangeChecks left in the loop, and there would be no RangeChecks between the stores we are merging.
  • RangeCheck Smearing: this also applies in straight-line code, outside of loops. See RangeCheckNode::Ideal. Example:
RangeCheck[i+0]
Store[i+0]
RangeCheck[i+1]  <--- replaced with i+3 ("smearing" to cover all RC below)
Store[i+1]
RangeCheck[i+2]  <--- removed
Store[i+2]
RangeCheck[i+3]  <--- removed
Store[i+3]

becomes:

RangeCheck[i+0]
Store[i+0]
RangeCheck[i+3]  <--- the RangeCheck that remains between the first and the rest of the consecutive (and adjacent) stores.
Store[i+1]
Store[i+2]
Store[i+3]

I think the use-cases from @cl4es are often in straight-line code. Therefore we should cover the "smearing" case where exactly 1 RC remains in the sequence.

What you can also see in RangeCheckNode::Ideal: if we ever trap (or trap often enough, I don't remember) in one of the RangeChecks, then we disable phase->C->allow_range_check_smearing(). Then we don't do the smearing, and all the RCs remain in the sequence. At that point, my optimization would fail, since it sees more than 1 RC in the sequence.

Does that make sense? I should probably add this information in the comments, so that it is clear why we worry about a single RC at all. People are probably going to wonder, like you: "I assume you will get RC on each store or not at all".
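
For illustration, a straight-line source shape (hypothetical) that ends up in the smeared form above, with exactly one RC left between the first and the second store:

// Four adjacent byte stores with a variable index: RC smearing keeps the
// check for i+0, widens the next check to cover i+3, and removes the rest.
static void write4(byte[] a, int i, int v) {
    a[i + 0] = (byte)(v >> 0);
    a[i + 1] = (byte)(v >> 8);
    a[i + 2] = (byte)(v >> 16);
    a[i + 3] = (byte)(v >> 24);
}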

Comment on lines 2897 to 2898
// Thus, it is a common pattern that in a long chain of adjacent stores there
// remains exactly one RangeCheck, between the first and the second store.
Contributor

"remains exactly one RangeCheck" is confusing, because you still have the RC for [i + 0]. So you have 2 RCs.

@vnkozlov
Contributor

@eme64 thank you for looking into C2 RC optimizations. Now it is clear why you need to check for RCs.
I would only suggest adjusting your new comment about the RC optimization to avoid confusion.

@vnkozlov
Contributor

New comment is good now. Thanks!

@eme64
Contributor Author

eme64 commented Apr 23, 2024

@vnkozlov so you are approving the current state of the code? Just asking because you have not explicitly re-approved it ;)

@rwestrel @TobiHartmann Would you mind re-reviewing?

@TobiHartmann
Member

TobiHartmann left a comment

Looks good to me.

@eme64
Contributor Author

eme64 commented Apr 24, 2024

Thanks @vnkozlov @TobiHartmann for the reviews!
Thanks @rwestrel for the helpful comments earlier on.
Thanks @cl4es @RogerRiggs for bringing up the idea for such an optimization, and cheering me on with it!
/integrate

@openjdk

openjdk bot commented Apr 24, 2024

Going to push as commit 3ccb64c.
Since your change was applied there have been 292 commits pushed to the master branch:

  • 5c38386: 8326541: [AArch64] ZGC C2 load barrier stub should consider the length of live registers when spilling registers
  • 438e643: 8329531: compiler/c2/irTests/TestIfMinMax.java fails with IRViolationException: There were one or multiple IR rule failures.
  • 80b381e: 8329555: Crash in intrinsifying heap-based MemorySegment Vector store/loads
  • 7a89555: 8330844: Add aliases for conditional jumps and additional instruction forms for x86
  • f60798a: 8329222: java.text.NumberFormat (and subclasses) spec updates
  • 2555166: 8329113: Deprecate -XX:+UseNotificationThread
  • 09b8809: 8327289: Remove unused PrintMethodFlushingStatistics option
  • 9cc163a: 8330178: Clean up non-standard use of /** comments in java.base
  • 88a5dce: 8330805: ARM32 build is broken after JDK-8139457
  • 7157eea: 8327290: Remove unused notproduct option TraceInvocationCounterOverflow
  • ... and 282 more: https://git.openjdk.org/jdk/compare/e3e6c2a8991fbc4f56e051e9abe004f0aa5674a0...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Apr 24, 2024
@openjdk openjdk bot closed this Apr 24, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Apr 24, 2024
@openjdk

openjdk bot commented Apr 24, 2024

@eme64 Pushed as commit 3ccb64c.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@reinrich
Member

The case where only stores of constant values are merged wouldn't be difficult to get working on big-endian platforms too, I think.
#15990 seems to be a use of this optimization, and it only makes use of this case, doesn't it?
Do you have an idea how important the second pattern in the JBS issue is?

        a[1] = (byte)v;
        a[2] = (byte)(v >> 8 );
        a[3] = (byte)(v >> 16);
        a[4] = (byte)(v >> 24); 

@eme64
Contributor Author

eme64 commented Apr 29, 2024

@reinrich feel free to implement and test the big-endian version. I just wanted to limit the scope of the PR, and I don't really have a big-endian machine to test on.
I'm currently tracking down a follow-up bug or two from this patch, so I have my hands full.

I think one could surely get both the constant and the variable case implemented, in analogy to what I did. But maybe it would require some refactoring, to make sure the two versions live together nicely.
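
For example, the variable case on a big-endian platform would presumably match the mirrored shift pattern (a hypothetical, untested sketch):

// Big-endian byte order: most significant byte stored first. A big-endian
// platform could merge these eight byte stores into one long store.
static void storeLongBE(byte[] bytes, int offset, long value) {
    bytes[offset + 0] = (byte)(value >> 56);
    bytes[offset + 1] = (byte)(value >> 48);
    bytes[offset + 2] = (byte)(value >> 40);
    bytes[offset + 3] = (byte)(value >> 32);
    bytes[offset + 4] = (byte)(value >> 24);
    bytes[offset + 5] = (byte)(value >> 16);
    bytes[offset + 6] = (byte)(value >> 8);
    bytes[offset + 7] = (byte)(value >> 0);
}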

@reinrich
Member

Thanks for the quick answer. I might play a little bit with the version that stores constants. I expect the effort to be small there.


Labels

hotspot-compiler, integrated
