
Conversation


@sviswa7 sviswa7 commented May 25, 2022

We observe a ~20% regression in the SPECjvm2008 mpegaudio sub-benchmark on Cascade Lake with the default settings vs -XX:UseAVX=2.
The performance of all the other non-startup sub-benchmarks of SPECjvm2008 is within +/- 5%.
The performance regression is due to auto-vectorization of small loops.
Auto-vectorization currently does not take AVX3Threshold into consideration.
The performance regression in mpegaudio can be recovered by limiting auto-vectorization to 32-byte vectors.

This PR limits auto-vectorization to 32-byte vectors by default on Cascade Lake. Users can override this by setting either -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 on the JVM command line.
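The intended flag interplay can be modeled as a small standalone C++ sketch (this is not HotSpot code; the function and parameter names here are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Minimal model of the proposed default: on Cascade Lake, when the user
// set neither -XX:UseAVX nor -XX:SuperWordMaxVectorSize explicitly, cap
// SLP auto-vectorization at 32 bytes while MaxVectorSize (and thus the
// Vector API) stays at 64 bytes.
int64_t superword_max_vector_size(bool is_cascade_lake,
                                  bool use_avx_is_default,
                                  bool sw_max_is_default,
                                  int64_t max_vector_size) {
  int64_t sw_max = max_vector_size;  // default: follow MaxVectorSize
  if (sw_max_is_default && use_avx_is_default && is_cascade_lake) {
    sw_max = std::min<int64_t>(max_vector_size, 32);  // 32-byte cap
  }
  return sw_max;
}
```

In this model, an explicit -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 makes the corresponding `*_is_default` argument false, which restores 64-byte auto-vectorization.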

Please review.

Best Regards,
Sandhya


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8287697: Limit auto vectorization to 32-byte vector on Cascade Lake

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/8877/head:pull/8877
$ git checkout pull/8877

Update a local copy of the PR:
$ git checkout pull/8877
$ git pull https://git.openjdk.java.net/jdk pull/8877/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 8877

View PR using the GUI difftool:
$ git pr show -t 8877

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/8877.diff


bridgekeeper bot commented May 25, 2022

👋 Welcome back sviswanathan! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.


openjdk bot commented May 25, 2022

@sviswa7 The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@vnkozlov
Contributor

You have trailing whitespace.

Comment on lines 1277 to 1280
  if (is_java_primitive(bt) &&
      (vlen > 1) && is_power_of_2(vlen) &&
-     Matcher::vector_size_supported(bt, vlen)) {
+     Matcher::vector_size_supported(bt, vlen) &&
+     (vlen * type2aelembytes(bt) <= SuperWordMaxVectorSize)) {
Contributor


Can you put this whole condition into a separate static bool VectorNode::vector_size_supported(vlen, bt) and use it in both cases?
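The suggested refactoring might look roughly like this standalone sketch (the HotSpot helpers it uses are stubbed out here, and the real signatures may differ):

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for the HotSpot helpers referenced by the real condition;
// behavior here is simplified for illustration only.
static const uint32_t SuperWordMaxVectorSize = 32;      // bytes
static bool is_java_primitive(int bt) { return bt > 0; }
static int  type2aelembytes(int bt)   { return bt; }    // pretend bt is the element size
static bool is_power_of_2(uint32_t v) { return v != 0 && (v & (v - 1)) == 0; }
static bool matcher_vector_size_supported(int /*bt*/, uint32_t vlen) {
  return vlen <= 16;  // pretend the matcher supports up to 16 lanes
}

// The whole condition factored into one predicate, as the review
// suggests, so both call sites stay in sync.
static bool vector_size_supported(int bt, uint32_t vlen) {
  return is_java_primitive(bt) &&
         vlen > 1 && is_power_of_2(vlen) &&
         matcher_vector_size_supported(bt, vlen) &&
         vlen * type2aelembytes(bt) <= SuperWordMaxVectorSize;
}
```

With 4-byte elements, 8 lanes (32 bytes) pass the predicate while 16 lanes (64 bytes) are rejected by the SuperWordMaxVectorSize cap even though the matcher itself supports them.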

Member

jatin-bhateja commented May 30, 2022

Vectorization through SLP can be controlled by constraining MaxVectorSize, and through the Vector API by using narrower SPECIES.
Can you kindly share more details on the need for a separate SuperWordMaxVectorSize here? The user already has all the necessary controls to limit the C2 vector length; it will rarely happen that one wants to emit 512-bit vector code using the Vector API while still limiting the auto-vectorizer to 256-bit vector operations, or vice versa. Maybe we should pessimistically constrain the vector size of only those loops which may result in AVX-512-heavy instructions, through a target-specific analysis pass.

@sviswa7 sviswa7 changed the title Limit auto vectorization to 32 byte vector on Cascade Lake 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake Jun 1, 2022
@sviswa7 sviswa7 marked this pull request as ready for review June 1, 2022 23:30
Author

sviswa7 commented Jun 1, 2022

/label hotspot-compiler

@openjdk openjdk bot added rfr Pull request is ready for review hotspot-compiler [email protected] labels Jun 1, 2022

openjdk bot commented Jun 1, 2022

@sviswa7
The hotspot-compiler label was successfully added.

Author

sviswa7 commented Jun 1, 2022

@vnkozlov Your review comments are resolved.
@jatin-bhateja This is a simple fix for the problem in the short time frame we have before the upcoming feature freeze. A more complex fix to enhance the auto-vectorizer is a good thought.


mlbridge bot commented Jun 1, 2022

Webrevs

Contributor

vnkozlov commented Jun 2, 2022

I think we missed testing with MaxVectorSize set to 32 (vs 64) on a Cascade Lake CPU. We should do that.

That may be a preferable "simple fix" vs the suggested changes as a "short term solution".

The objection was that a user may still want to use wide 64-byte vectors with the Vector API. But I agree with Jatin's argument about that.
Limiting MaxVectorSize would affect our intrinsics/stubs code and may affect performance. That is why we need to test it. I will ask Eric.

BTW, SuperWordMaxVectorSize should be diagnostic or experimental since it is a temporary solution.

Author

sviswa7 commented Jun 2, 2022

@vnkozlov I have made SuperWordMaxVectorSize a develop option as you suggested. As far as I know, the only intrinsics/stubs that use MaxVectorSize are for clear/copy. This is done in conjunction with AVX3Threshold, so we are OK there for Cascade Lake.

Contributor

vnkozlov commented Jun 2, 2022

> @vnkozlov I have made SuperWordMaxVectorSize a develop option as you suggested. As far as I know, the only intrinsics/stubs that use MaxVectorSize are for clear/copy. This is done in conjunction with AVX3Threshold, so we are OK there for Cascade Lake.

Thank you for checking stubs code.

We still have to run performance testing with this patch. We need only additional run with MaxVectorSize=32 to compare results.

And I want @jatin-bhateja to approve this change too. Or give better suggestion.


@vnkozlov vnkozlov left a comment


The changes look good. I will start testing them.

Comment on lines 1305 to 1306
FLAG_SET_DEFAULT(SuperWordMaxVectorSize, 32);
} else {
Member


SuperWordMaxVectorSize is set to 32 bytes by default; it should still be capped by MaxVectorSize, in case the user sets MaxVectorSize to 16 bytes.

Contributor


Yes. I submitted testing with FLAG_SET_DEFAULT(SuperWordMaxVectorSize, MIN2(MaxVectorSize, (intx)32));
And I declared the flag as DIAGNOSTIC; the product build fails otherwise.

"actual size could be less depending on elements type") \
range(0, max_jint) \
\
develop(intx, SuperWordMaxVectorSize, 64, \
Contributor


The flag can't be develop because it is used in product code. It should be diagnostic.

@jatin-bhateja
Member

> @vnkozlov Your review comments are resolved. @jatin-bhateja This is a simple fix for the problem in the short time frame we have before the upcoming feature freeze. A more complex fix to enhance the auto-vectorizer is a good thought.

Hi @sviswa7. This looks reasonable, since stubs and some macro assembly routines anyway operate under thresholds and do not strictly comply with the max vector size.


fg1417 commented Jun 2, 2022

Hi @sviswa7 , #7806 implemented an interface for auto-vectorization to disable some unprofitable cases on aarch64. Can it also be applied to your case?

Comment on lines 899 to 902
- if (use_avx_limit > 2 && is_intel_skylake() && _stepping < 5) {
-   FLAG_SET_DEFAULT(UseAVX, 2);
+ if (use_avx_limit > 2 && is_intel_skylake()) {
+   if (_stepping < 5) {
+     FLAG_SET_DEFAULT(UseAVX, 2);
+   }
Contributor


What is this change for?

Author


I had some changes in this area before; this is an artifact of that. I will set it back exactly as it was.


if (FLAG_IS_DEFAULT(SuperWordMaxVectorSize)) {
if (FLAG_IS_DEFAULT(UseAVX) && UseAVX > 2 &&
is_intel_skylake() && _stepping > 5) {
Contributor


Should you check _stepping >= 5? Otherwise _stepping == 5 is missing in all adjustments.

Contributor

vnkozlov commented Jun 2, 2022

> Hi @sviswa7 , #7806 implemented an interface for auto-vectorization to disable some unprofitable cases on aarch64. Can it also be applied to your case?

Maybe. But it would require more careful changes, and that changeset is not integrated yet.
The current changes are clean and serve their purpose well.

And, as Jatin and Sandhya said, we may do a proper fix after the JDK 19 fork. Then we can look at your proposal.

Author

sviswa7 commented Jun 2, 2022

@vnkozlov @jatin-bhateja Your review comments are implemented. Please take a look.


@vnkozlov vnkozlov left a comment


Looks good.
Please wait until regression and performance testing are finished. I will let you know the results.


openjdk bot commented Jun 2, 2022

@sviswa7 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8287697: Limit auto vectorization to 32-byte vector on Cascade Lake

Reviewed-by: kvn, jbhateja

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 38 new commits pushed to the master branch:

  • 26d2426: 8287340: Refactor old code using StringTokenizer in locale related code
  • ccec5d1: 8287704: Small logging clarification about shrunk bytes after heap shrinkage
  • 7f44f57: 8285868: x86 intrinsics for floating point method isInfinite
  • 13596cd: 8287097: (fs) Files::copy requires an undocumented permission when copying from the default file system to a non-default file system
  • 49e24f0: 8287567: AArch64: Implement post-call NOPs
  • 1fcbaa4: 8278598: AlignmentReserve is repeatedly reinitialized
  • e51ca1d: 8287171: Refactor null caller tests to a single directory
  • 3cfd38c: 8287726: Fix JVMTI tests with "requires vm.continuations" after JDK-8287496
  • c78392d: 8287606: standardize spelling of subtype and supertype etc in comments
  • 5acac22: 8286830: ~HandshakeState should not touch oops
  • ... and 28 more: https://git.openjdk.java.net/jdk/compare/97bd4c255a319ce626a316ed211ef1fd7d0f1e14...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jun 2, 2022
@jatin-bhateja
Member

Thanks @sviswa7, the changes look good to me.

Contributor

vnkozlov commented Jun 3, 2022

Regression testing results are good. Waiting for performance results.

Contributor

vnkozlov commented Jun 6, 2022

Only jbb2015 runs are left in the queue for performance testing. They may take time, and I don't expect much variation in them.

Testing also included MaxVectorSize=32 to compare with the current changes. It shows slightly (1-3%) better results in some Crypto-AESBench decrypt/encrypt sub-benchmarks, but that could be due to the variation we observed in them. On the other hand, SuperWordMaxVectorSize=32 shows better results in some Renaissance sub-benchmarks: it keeps scores similar to the current code, while MaxVectorSize=32 gives a regression in them. Based on this, I agree with the current changes vs setting MaxVectorSize=32.

Both changes give a 4-5% improvement on SPECjvm2008-MPEG.

But I also observed a 2.7% regression in SPECjvm2008-SOR.small with ParallelGC, for both types of changes.

Contributor

vnkozlov commented Jun 7, 2022

@sviswa7, please file a separate RFE for the SPECjvm2008-SOR.small issue (different unrolling factor).

I now have all the performance data from our runs and yours, and I think these changes are ready for integration. Thanks!

Author

sviswa7 commented Jun 8, 2022

@vnkozlov Thanks a lot. I have filed the RFE: https://bugs.openjdk.org/browse/JDK-8287966.

Author

sviswa7 commented Jun 8, 2022

/integrate


openjdk bot commented Jun 8, 2022

Going to push as commit 45f1b72.
Since your change was applied there have been 124 commits pushed to the master branch:

  • 39ec58b: 8287886: Further terminology updates to match JLS
  • 68c5957: 8287869: -XX:+AutoCreateSharedArchive doesn't work when JDK build is switched
  • bf439f8: 8287876: The recently de-problemlisted TestTitledBorderLeak test is unstable
  • b7a34f7: 8287927: ProblemList java/awt/GraphicsDevice/DisplayModes/UnknownRefrshRateTest.java on macosx-aarch64
  • 8e07839: 8285081: Improve XPath operators count accuracy
  • b12e7f1: 8279358: vmTestbase/nsk/jvmti/scenarios/jni_interception/JI03/ji03t003/TestDescription.java fails with usage tracker
  • 1aa87e0: 8287148: Avoid redundant HashMap.containsKey calls in ExtendedKeyCodes.getExtendedKeyCodeForChar
  • 74be2d9: 8286983: rename jdb -trackvthreads and debug agent enumeratevthreads options and clarify "Preview Feature" nature of these options
  • 8e10c2b: 8287877: Exclude vmTestbase/nsk/jvmti/AttachOnDemand/attach022/TestDescription.java until JDK-8277573 is fixed
  • 9ec27d0: 8287872: Disable concurrent execution of hotspot docker tests
  • ... and 114 more: https://git.openjdk.java.net/jdk/compare/97bd4c255a319ce626a316ed211ef1fd7d0f1e14...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jun 8, 2022
@openjdk openjdk bot closed this Jun 8, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jun 8, 2022

openjdk bot commented Jun 8, 2022

@sviswa7 Pushed as commit 45f1b72.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

fg1417 pushed a commit to fg1417/jdk that referenced this pull request Mar 21, 2023
openjdk#8877 introduced the global
option `SuperWordMaxVectorSize` as a temporary solution to fix
the performance regression on some x86 machines.

Currently, SuperWordMaxVectorSize behaves differently between
x86 and other platforms [1]. For example, if the current machine
only supports `MaxVectorSize <= 32` but we set
`SuperWordMaxVectorSize = 64`, then `SuperWordMaxVectorSize`
stays at 64 on other platforms, while an x86 machine would
clamp `SuperWordMaxVectorSize` to `MaxVectorSize`. Platforms
other than x86 lack an implementation like [2].

Also, `SuperWordMaxVectorSize` limits the max vector size of
auto-vectorization to 64 bytes, which is fine for current aarch64
hardware, but the SVE architecture supports vectors larger than
512 bits.

The patch drops the global option and uses an architecture-dependent
interface to consult the max vector size for auto-vectorization,
fixing the performance issue on x86 and reducing side effects on
other platforms. After the patch, auto-vectorization is still limited
to 32-byte vectors by default on Cascade Lake, and users can override
this by setting either `-XX:UseAVX=3` or `-XX:MaxVectorSize=64` on
the JVM command line.

So my question is:

Before the patch, we could have a smaller max vector size for
auto-vectorization than `MaxVectorSize` on x86; for example, users
could have `MaxVectorSize=64` and `SuperWordMaxVectorSize=32`. But
after the change, if we set `-XX:MaxVectorSize=64` explicitly, the
max vector size for auto-vectorization would be `MaxVectorSize`,
i.e. 64 bytes, which I believe is more reasonable.
@sviswa7 @jatin-bhateja, are you happy with the change?

[1] openjdk#12350 (comment)
[2] https://github.com/openjdk/jdk/blob/33bec207103acd520eb99afb093cfafa44aecfda/src/hotspot/cpu/x86/vm_version_x86.cpp#L1314-L1333
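The architecture-dependent hook described in this commit message can be sketched as plain C++ (the struct and method names are illustrative, not the real JDK interface; the 32-byte Cascade Lake figure follows the text above):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of a per-platform query the SLP auto-vectorizer could consult
// instead of a global SuperWordMaxVectorSize flag.
struct Matcher {
  static int64_t max_vector_size;        // stands in for -XX:MaxVectorSize
  static bool    max_vector_is_default;  // user did not set it explicitly
  static bool    is_cascade_lake;

  static int64_t superword_max_vector_size() {
    // x86 variant: cap Cascade Lake at 32 bytes by default; an explicit
    // -XX:MaxVectorSize=64 lifts the cap. Other platforms would simply
    // return max_vector_size.
    if (is_cascade_lake && max_vector_is_default && max_vector_size > 32) {
      return 32;
    }
    return max_vector_size;
  }
};

int64_t Matcher::max_vector_size       = 64;
bool    Matcher::max_vector_is_default = true;
bool    Matcher::is_cascade_lake       = true;
```

In this sketch an explicit `-XX:MaxVectorSize=64` (modeled by `max_vector_is_default = false`) yields 64 bytes for auto-vectorization too, which is the behavior change the question above asks about.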
@sviswa7 sviswa7 deleted the maxvector branch June 3, 2024 21:42