8331311: C2: Big Endian Port of 8318446: optimize stores into primitive arrays by combining values into larger store #19218

reinrich · 2024-05-13T15:53:52Z

This PR adds a few tweaks to JDK-8318446 that allow enabling it on big-endian platforms as well (e.g. AIX, S390). JDK-8318446 introduced a C2 optimization that replaces consecutive stores to a primitive array with a single store.

For example (from TestMergeStores.java):

    static Object[] test2a(byte[] a, int offset, long v) {
        if (IS_BIG_ENDIAN) {
            a[offset + 0] = (byte)(v >> 56);
            a[offset + 1] = (byte)(v >> 48);
            a[offset + 2] = (byte)(v >> 40);
            a[offset + 3] = (byte)(v >> 32);
            a[offset + 4] = (byte)(v >> 24);
            a[offset + 5] = (byte)(v >> 16);
            a[offset + 6] = (byte)(v >> 8);
            a[offset + 7] = (byte)(v >> 0);
        } else {
            a[offset + 0] = (byte)(v >> 0);
            a[offset + 1] = (byte)(v >> 8);
            a[offset + 2] = (byte)(v >> 16);
            a[offset + 3] = (byte)(v >> 24);
            a[offset + 4] = (byte)(v >> 32);
            a[offset + 5] = (byte)(v >> 40);
            a[offset + 6] = (byte)(v >> 48);
            a[offset + 7] = (byte)(v >> 56);
        }
        return new Object[]{ a };
    }

Depending on the endianness, the 8 bytes are stored into the array in one of two orders. In each branch the order of the stores matches the byte order of an 8-byte store on that platform, so the eight 1-byte stores can be replaced with a single 8-byte store (if there aren't too many range checks).
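The equivalence that the optimization relies on can be checked in plain Java. The following is only a sketch of the arithmetic, not of the C2 code: the class and method names are hypothetical, and ByteBuffer with an explicit byte order stands in for the single merged store.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Illustrative sketch; class and method names are hypothetical, not part of the PR.
final class MergeStoresSketch {

    // Eight 1-byte stores, least significant byte first (little-endian branch of test2a).
    static byte[] bytewiseLE(byte[] a, int offset, long v) {
        for (int i = 0; i < 8; i++) {
            a[offset + i] = (byte) (v >> (8 * i));
        }
        return a;
    }

    // Eight 1-byte stores, most significant byte first (big-endian branch of test2a).
    static byte[] bytewiseBE(byte[] a, int offset, long v) {
        for (int i = 0; i < 8; i++) {
            a[offset + i] = (byte) (v >> (56 - 8 * i));
        }
        return a;
    }

    // A single 8-byte store with an explicit byte order.
    static byte[] merged(byte[] a, int offset, long v, ByteOrder order) {
        ByteBuffer.wrap(a).order(order).putLong(offset, v);
        return a;
    }

    public static void main(String[] args) {
        long v = 0x0102030405060708L;
        // Each comparison checks that the bytewise stores and the merged store agree.
        System.out.println(Arrays.equals(bytewiseLE(new byte[16], 3, v),
                                         merged(new byte[16], 3, v, ByteOrder.LITTLE_ENDIAN)));
        System.out.println(Arrays.equals(bytewiseBE(new byte[16], 3, v),
                                         merged(new byte[16], 3, v, ByteOrder.BIG_ENDIAN)));
    }
}
```

Each bytewise variant matches the merged store only for its own byte order, which is why the test needs one branch per endianness.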

Additionally I've fixed a few comments and a test bug.

The optimization seems to be a little bit more effective on big endian platforms.

Another example:

    static Object[] test800a(byte[] a, int offset, long v) {
        if (IS_BIG_ENDIAN) {
            a[offset + 0] = (byte)(v >> 40); // Removed from candidate list
            a[offset + 1] = (byte)(v >> 32); // Removed from candidate list
            a[offset + 2] = (byte)(v >> 24); // Merged
            a[offset + 3] = (byte)(v >> 16); // Merged
            a[offset + 4] = (byte)(v >> 8);  // Merged
            a[offset + 5] = (byte)(v >> 0);  // Merged
        } else {
            a[offset + 0] = (byte)(v >> 0);  // Removed from candidate list
            a[offset + 1] = (byte)(v >> 8);  // Removed from candidate list
            a[offset + 2] = (byte)(v >> 16); // Not merged
            a[offset + 3] = (byte)(v >> 24); // Not merged
            a[offset + 4] = (byte)(v >> 32); // Not merged
            a[offset + 5] = (byte)(v >> 40); // Not merged
        }
        return new Object[]{ a };
    }

The sequence of candidate stores begins at the lowest store (in Memory def-use order) and is trimmed to a power of 2, removing higher stores if necessary. On little-endian platforms this removes the stores of the least significant bytes, so merging the remaining stores would require a right shift of the input value. While possible, this is currently not done.
With big-endian order the stores of the more significant bytes are removed, and the merge succeeds because no shift is needed.
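The difference between the two branches of test800a can be reproduced with ordinary Java. This is a sketch under stated assumptions: the names are hypothetical, ByteBuffer stands in for the merged 4-byte store, and nothing here reflects how C2 represents the stores internally.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch; class and method names are hypothetical, not part of the PR.
final class TrimmedMergeSketch {

    // Big-endian branch of test800a: six 1-byte stores, v >> 40 first.
    static byte[] bytewiseBE(byte[] a, int offset, long v) {
        for (int i = 0; i < 6; i++) {
            a[offset + i] = (byte) (v >> (40 - 8 * i));
        }
        return a;
    }

    // Same bytes: the last four stores become one big-endian int store of the UNSHIFTED value.
    static byte[] mergedBE(byte[] a, int offset, long v) {
        a[offset + 0] = (byte) (v >> 40);
        a[offset + 1] = (byte) (v >> 32);
        ByteBuffer.wrap(a).order(ByteOrder.BIG_ENDIAN).putInt(offset + 2, (int) v);
        return a;
    }

    // Little-endian branch of test800a: six 1-byte stores, v >> 0 first.
    static byte[] bytewiseLE(byte[] a, int offset, long v) {
        for (int i = 0; i < 6; i++) {
            a[offset + i] = (byte) (v >> (8 * i));
        }
        return a;
    }

    // Same bytes: the last four stores become one little-endian int store of v >> 16,
    // i.e. the merged value needs the right shift that is currently not generated.
    static byte[] mergedLE(byte[] a, int offset, long v) {
        a[offset + 0] = (byte) (v >> 0);
        a[offset + 1] = (byte) (v >> 8);
        ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN).putInt(offset + 2, (int) (v >> 16));
        return a;
    }
}
```

In the big-endian case the merged value is simply the low 32 bits of v, while the little-endian case needs (int)(v >> 16).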

I introduced new platform attributes little-endian, big-endian to the IR testing framework to be able to adapt IR matching rules to this difference.

Testing:

TestMergeStores.java on AIX and S390.

JTReg tests: tier1-4 of hotspot and jdk. All of Langtools and jaxp. JCK, SPECjvm2008, SPECjbb2015, Renaissance Suite, and SAP specific tests.
Testing was done with fastdebug builds on the main platforms and also on Linux/PPC64le and AIX.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8331311: C2: Big Endian Port of 8318446: optimize stores into primitive arrays by combining values into larger store (Bug - P4)

Reviewers

Emanuel Peter (@eme64 - Reviewer) ⚠️ Review applies to 3169a310
Vladimir Kozlov (@vnkozlov - Reviewer) ⚠️ Review applies to 3169a310

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19218/head:pull/19218
$ git checkout pull/19218

Update a local copy of the PR:
$ git checkout pull/19218
$ git pull https://git.openjdk.org/jdk.git pull/19218/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 19218

View PR using the GUI difftool:
$ git pr show -t 19218

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19218.diff

bridgekeeper · 2024-05-13T15:54:49Z

👋 Welcome back rrich! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2024-05-13T15:55:00Z

@reinrich This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8331311: C2: Big Endian Port of 8318446: optimize stores into primitive arrays by combining values into larger store

Reviewed-by: epeter, kvn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 33 new commits pushed to the master branch:

2a37764: 8333743: Change .jcheck/conf branches property to match valid branches
75dc2f8: 8330182: Start of release updates for JDK 24
054362a: 8332550: [macos] Voice Over: java.awt.IllegalComponentStateException: component must be showing on the screen to determine its location
9b436d0: 8333674: Disable CollectorPolicy.young_min_ergo_vm for PPC64
487c477: 8333647: C2 SuperWord: some additional PopulateIndex tests
d02cb74: 8333270: HandlersOnComplexResetUpdate and HandlersOnComplexUpdate tests fail with "Unexpected reference" if timeoutFactor is less than 1/3
02f2404: 8333560: -Xlint:restricted does not work with --release
606df44: 8332670: C1 clone intrinsic needs memory barriers
33fd6ae: 8333622: ubsan: relocInfo_x86.cpp:101:56: runtime error: pointer index expression with base (-1) overflowed
8de5d20: 8332865: ubsan: os::attempt_reserve_memory_between reports overflow
... and 23 more: https://git.openjdk.org/jdk/compare/326dbb1b139dd1ec1b8605339b91697cdf49da9a...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk · 2024-05-13T15:55:23Z

@reinrich The following label will be automatically applied to this pull request:

hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

reinrich · 2024-05-13T15:58:31Z

@offamitkumar you can put this through your testing if you like. It should solve the issues with test/hotspot/jtreg/compiler/c2/TestMergeStores.java also for s390.

offamitkumar · 2024-05-13T16:45:49Z

@reinrich the test is passing on s390x with your change. tier1 tests are in progress.

Update: tier1 tests are also clean on s390x.

mlbridge · 2024-05-14T13:02:10Z

Webrevs

eme64 · 2024-05-14T16:12:40Z

@reinrich thanks for taking this up!
Just did a quick scan of the tests. I think it could be good to have both big/small endian tests run on both big/small endian machines, but only expect IR rules to pass if the test and platform are expected to optimize. This just makes sure that the logic is correct, and does not optimize the wrong cases, producing wrong results.

src/hotspot/share/opto/memnode.cpp

eme64 · 2024-05-14T16:16:39Z

test/hotspot/jtreg/compiler/c2/TestMergeStores.java

    private static final Unsafe UNSAFE = Unsafe.getUnsafe();
    private static final Random RANDOM = Utils.getRandomInstance();

+    private static final boolean IS_BIG_ENDIAN = UNSAFE.isBigEndian();


static is very important here, so that the if constant-folds in the test. Otherwise we don't know whether the IR rule passes because of the correct branch. Maybe add a comment for that.

Sure. I assumed that is clear to people looking at jit compiler tests :)
I removed IS_BIG_ENDIAN again since it wasn't needed anymore with the last commit (3169a31).

test/hotspot/jtreg/compiler/c2/TestMergeStores.java

dean-long · 2024-05-14T21:04:02Z

It's not obvious to me why something like

            a[offset + 2] = (byte)(v >> 16); // Not merged
            a[offset + 3] = (byte)(v >> 24); // Not merged
            a[offset + 4] = (byte)(v >> 32); // Not merged
            a[offset + 5] = (byte)(v >> 40); // Not merged

can't be merged. Is it because you only use v as a possible 32-bit value? Why not use something like the following pseudo-code?

int bytes2word(byte b1, byte b2, byte b3, byte b4) {
  return (b1 & 0xff) << 24 | (b2 & 0xff) << 16 | (b3 & 0xff) << 8 | (b4 & 0xff);
}
// Substituting in the values from the example:
int big_endian = bytes2word((byte)(v >> 40), (byte)(v >> 32), (byte)(v >> 24), (byte)(v >> 16));
int little_endian = bytes2word((byte)(v >> 16), (byte)(v >> 24), (byte)(v >> 32), (byte)(v >> 40));

dean-long · 2024-05-14T21:09:35Z

In other words, it seems like it could work for arbitrary byte values if the merged value was computed from those individual values. They wouldn't need to be shifted values.

            a[offset + 0] = (byte)0x1;
            a[offset + 1] = (byte)0x2;
            a[offset + 2] = (byte)0x3;
            a[offset + 3] = (byte)0x4;

The example above would either write 0x01020304 or 0x04030201 depending on the endianness.
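A small sketch of that claim, with hypothetical names: the four constant byte stores leave the same memory contents on every platform, and the 32-bit value that a single merged store would have to write depends only on the byte order used to interpret them.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch; class and method names are hypothetical.
final class ConstantBytesSketch {

    // The four constant 1-byte stores from the example.
    static byte[] storeConstants(byte[] a, int offset) {
        a[offset + 0] = (byte) 0x1;
        a[offset + 1] = (byte) 0x2;
        a[offset + 2] = (byte) 0x3;
        a[offset + 3] = (byte) 0x4;
        return a;
    }

    // The int that a single 4-byte store in the given byte order would need to write
    // to produce the same memory contents.
    static int mergedValue(byte[] a, int offset, ByteOrder order) {
        return ByteBuffer.wrap(a).order(order).getInt(offset);
    }

    public static void main(String[] args) {
        byte[] a = storeConstants(new byte[8], 2);
        System.out.println(String.format("0x%08x", mergedValue(a, 2, ByteOrder.BIG_ENDIAN)));
        System.out.println(String.format("0x%08x", mergedValue(a, 2, ByteOrder.LITTLE_ENDIAN)));
    }
}
```

The two values are byte-reversed images of each other, so Integer.reverseBytes converts one into the other.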

reinrich · 2024-05-15T07:45:31Z

It's not obvious to me why something like

            a[offset + 2] = (byte)(v >> 16); // Not merged
            a[offset + 3] = (byte)(v >> 24); // Not merged
            a[offset + 4] = (byte)(v >> 32); // Not merged
            a[offset + 5] = (byte)(v >> 40); // Not merged

can't be merged.

The stores could be merged to the following pseudo code:

  *(int*)&a[offset + 2] = (int)(v >> 16); // Merged

The current logic doesn't accept the right shift here.
I think at that location we can always accept merged_input_value asserting that it is a right shift of base_last since the is_adjacent_input_pair checks succeeded before.
I haven't tried it though.

I'll clarify the synopsis of this pr and the comment in TestMergeStores.java.

eme64 · 2024-05-15T07:50:05Z

@dean-long @reinrich
Yes, I guess that is a generalization that could be made. It would require a lot more tests to make sure all combinations are checked. I would suggest doing that in a separate RFE to keep things simple and reviewable.

reinrich · 2024-05-15T14:08:56Z

Thanks for looking at the pr.

Just did a quick scan of the tests. I think it could be good to have both big/small endian tests run on both big/small endian machines, but only expect IR rules to pass if the test and platform are expected to optimize. This just makes sure that the logic is correct, and does not optimize the wrong cases, producing wrong results.

I've done that for test2 and introduced test2BE. Is that what you mean?

reinrich · 2024-05-16T05:29:58Z

Test error is unrelated to the changes. Upload of test results failed:
Error: Failed to CreateArtifact: Failed to make request after 5 attempts: Request timeout: /twirp/github.actions.results.api.v1.ArtifactService/CreateArtifact

eme64

I'm running testing again, but the code looks good now!

I just had another idea:
Could we use some sort of "byte reverse / shuffle" operation to do these use cases for both big/little-endian?

        storeBytes(bytes, offset, (byte)(value >> 8),
                                  (byte)(value >> 0));

        storeBytes(bytes, offset, (byte)(value >> 0),
                                  (byte)(value >> 8));

Not sure if that would be profitable or even available on all platforms. Could be a future RFE someone can work on after this. What do you think? It might make performance more predictable across platforms.

eme64 · 2024-05-24T08:18:41Z

src/hotspot/share/opto/memnode.cpp

+    // `_store` and `first` are swapped in the diagram above
+    Node* hi = first->in(MemNode::ValueIn);
+    Node* lo = _store->in(MemNode::ValueIn);
+#endif // VM_LITTLE_ENDIAN


A swap could be more concise. But I leave that up to you ;)

You're right. It's better to just swap hi with lo and it matches the comment.

eme64 · 2024-05-24T08:26:17Z

@reinrich please ping me again to ask if testing is ok before you integrate ;)

reinrich · 2024-05-24T15:05:32Z

@reinrich please ping me again to ask if testing is ok before you integrate ;)

Thanks for picking this up again. I quickly wanted to let you know that I'm out of office. I will be back in a week.

vnkozlov

Good.

eme64 · 2024-06-04T16:03:48Z

@reinrich please still wait until the JDK24 fork on Thursday to integrate, so that we do not have to backport possible regression fixes - I had 3 or 4 with my original patch ;)

reinrich · 2024-06-05T07:27:04Z

I'm running testing again, but the code looks good now!

I just had another idea: Could we use some sort of "byte reverse / shuffle" operation to do these use cases for both big/little-endian?
        storeBytes(bytes, offset, (byte)(value >> 8),
                                  (byte)(value >> 0));

        storeBytes(bytes, offset, (byte)(value >> 0),
                                  (byte)(value >> 8));
Not sure if that would be profitable or even available on all platforms. Could be a future RFE someone can work on after this. What do you think? It might make performance more predictable across platforms.

You mean to combine the stores even if the explicit ordering does not match the ordering of the store instruction, adding a ReverseBytes[SIL]Node iff supported in that case, right? I've been thinking about this, too. In my opinion it would be worthwhile.
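To illustrate the idea at the Java level only (this is a sketch with hypothetical names and says nothing about how a ReverseBytes node would actually be wired into the merge): when the explicit byte order is the opposite of the store's byte order, one store of the byte-reversed value produces the same memory contents.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch; class and method names are hypothetical.
final class ReverseBytesSketch {

    // Helper modeled on the storeBytes calls quoted above: two consecutive 1-byte stores.
    static byte[] storeBytes(byte[] bytes, int offset, byte b0, byte b1) {
        bytes[offset + 0] = b0;
        bytes[offset + 1] = b1;
        return bytes;
    }

    // A single 2-byte store in the given byte order.
    static byte[] storeShort(byte[] bytes, int offset, short s, ByteOrder order) {
        ByteBuffer.wrap(bytes).order(order).putShort(offset, s);
        return bytes;
    }

    // High byte first: matches a big-endian store directly,
    // or a little-endian store of the byte-reversed value.
    static boolean highFirstMatches(int value) {
        byte[] expected = storeBytes(new byte[4], 1, (byte) (value >> 8), (byte) (value >> 0));
        byte[] direct = storeShort(new byte[4], 1, (short) value, ByteOrder.BIG_ENDIAN);
        byte[] reversed = storeShort(new byte[4], 1, Short.reverseBytes((short) value),
                                     ByteOrder.LITTLE_ENDIAN);
        return java.util.Arrays.equals(expected, direct)
            && java.util.Arrays.equals(expected, reversed);
    }

    // Low byte first: the mirror image of the case above.
    static boolean lowFirstMatches(int value) {
        byte[] expected = storeBytes(new byte[4], 1, (byte) (value >> 0), (byte) (value >> 8));
        byte[] direct = storeShort(new byte[4], 1, (short) value, ByteOrder.LITTLE_ENDIAN);
        byte[] reversed = storeShort(new byte[4], 1, Short.reverseBytes((short) value),
                                     ByteOrder.BIG_ENDIAN);
        return java.util.Arrays.equals(expected, direct)
            && java.util.Arrays.equals(expected, reversed);
    }
}
```

Whether the extra reverse operation pays off compared to separate byte stores would still have to be measured per platform.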

reinrich · 2024-06-05T07:31:22Z

@reinrich please still wait until the JDK24 fork on Thursday to integrate, so that we do not have to backport possible regression fixes - I had 3 or 4 with my original patch ;)

Thanks for the reviews @eme64 and @vnkozlov! I'll integrate after the code split if more local testing is successful.

offamitkumar · 2024-06-06T07:02:25Z

I did another round of testing on s390x. looks good.

reinrich · 2024-06-06T08:37:55Z

I did another round of testing on s390x. looks good.

Thanks Amit.

reinrich · 2024-06-07T06:15:23Z

/integrate

openjdk · 2024-06-07T06:16:04Z

Going to push as commit f7862bd.
Since your change was applied there have been 38 commits pushed to the master branch:

b4beda2: 8332537: C2: High memory usage reported for compiler/loopopts/superword/TestAlignVectorFuzzer.java
e5383d7: 8333713: C2 SuperWord: cleanup in vectornode.cpp/hpp
944aeb8: 8325155: C2 SuperWord: remove alignment boundaries
d8af589: 8026127: Deflater/Inflater documentation incomplete/misleading
6238bc8: 8333456: CompactNumberFormat integer parsing fails when string has no suffix
2a37764: 8333743: Change .jcheck/conf branches property to match valid branches
75dc2f8: 8330182: Start of release updates for JDK 24
054362a: 8332550: [macos] Voice Over: java.awt.IllegalComponentStateException: component must be showing on the screen to determine its location
9b436d0: 8333674: Disable CollectorPolicy.young_min_ergo_vm for PPC64
487c477: 8333647: C2 SuperWord: some additional PopulateIndex tests
... and 28 more: https://git.openjdk.org/jdk/compare/326dbb1b139dd1ec1b8605339b91697cdf49da9a...master

Your commit was automatically rebased without conflicts.

openjdk · 2024-06-07T06:16:10Z

@reinrich Pushed as commit f7862bd.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

8331311: C2: Big Endian Port of 8318446: optimize stores into primitive arrays by combining values into larger store

f56b8e8

openjdk bot added the hotspot-compiler label May 13, 2024

Typo

d55ebb6

reinrich added 2 commits May 14, 2024 09:04

Add bug id

63e37a1

Improve comment

9cbe964

reinrich marked this pull request as ready for review May 14, 2024 12:56

openjdk bot added the rfr Pull request is ready for review label May 14, 2024

eme64 reviewed May 14, 2024

src/hotspot/share/opto/memnode.cpp (outdated, resolved)

eme64 reviewed May 14, 2024

test/hotspot/jtreg/compiler/c2/TestMergeStores.java (resolved)

Improve comment

dc05bb0

reinrich added 2 commits May 15, 2024 16:01

Improve make_merged_input_value based on Emanuel's feedback

6ba1915

test2BE: big endian version of test2

8844c83

TheRealMDoerr mentioned this pull request May 15, 2024

8331935: Add support for primitive array C1 clone intrinsic in PPC #19250

Closed

3 tasks

Eliminate IS_BIG_ENDIAN and always execute both variants

3169a31

eme64 approved these changes May 24, 2024

openjdk bot added the ready Pull request is ready to be integrated label May 24, 2024

Feedback Emanuel

fc870e2

vnkozlov approved these changes Jun 4, 2024

Merge branch 'master' into 8331311_merge_stores_on_big_endian

f7dc0f9

openjdk bot added the integrated Pull request has been integrated label Jun 7, 2024

openjdk bot closed this Jun 7, 2024

openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jun 7, 2024

eme64 mentioned this pull request Jun 17, 2024

8334342: Add MergeStore JMH benchmarks #19734

Closed

3 tasks

reinrich deleted the 8331311_merge_stores_on_big_endian branch January 24, 2025 08:41
