
Conversation


@sviswa7 sviswa7 commented May 25, 2022

We observe a ~20% regression in the SPECjvm2008 mpegaudio sub-benchmark on Cascade Lake with the default settings vs -XX:UseAVX=2.
The performance of all the other non-startup sub-benchmarks of SPECjvm2008 is within +/- 5%.
The performance regression is due to auto-vectorization of small loops.
Auto-vectorization currently does not take AVX3Threshold into consideration.
The performance regression in mpegaudio can be recovered by limiting auto-vectorization to 32-byte vectors.

This PR limits auto-vectorization to 32-byte vectors by default on Cascade Lake. Users can override this by setting either -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 on the JVM command line.
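The intended flag interplay can be modeled as a small standalone C++ sketch (this is not HotSpot code; the function and parameter names here are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Minimal model of the proposed default: on Cascade Lake, when the user
// set neither -XX:UseAVX nor -XX:SuperWordMaxVectorSize explicitly, cap
// SLP auto-vectorization at 32 bytes while MaxVectorSize (and thus the
// Vector API) stays at 64 bytes.
int64_t superword_max_vector_size(bool is_cascade_lake,
                                  bool use_avx_is_default,
                                  bool sw_max_is_default,
                                  int64_t max_vector_size) {
  int64_t sw_max = max_vector_size;  // default: follow MaxVectorSize
  if (sw_max_is_default && use_avx_is_default && is_cascade_lake) {
    sw_max = std::min<int64_t>(max_vector_size, 32);  // 32-byte cap
  }
  return sw_max;
}
```

In this model, an explicit -XX:UseAVX=3 or -XX:SuperWordMaxVectorSize=64 makes the corresponding `*_is_default` argument false, which restores 64-byte auto-vectorization.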

Please review.

Best Regards,
Sandhya


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8287697: Limit auto vectorization to 32-byte vector on Cascade Lake

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/8877/head:pull/8877
$ git checkout pull/8877

Update a local copy of the PR:
$ git checkout pull/8877
$ git pull https://git.openjdk.java.net/jdk pull/8877/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 8877

View PR using the GUI difftool:
$ git pr show -t 8877

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/8877.diff


bridgekeeper bot commented May 25, 2022

👋 Welcome back sviswanathan! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.


openjdk bot commented May 25, 2022

@sviswa7 The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@vnkozlov
Contributor

You have trailing whitespace.

Comment on lines 1277 to 1280
  if (is_java_primitive(bt) &&
      (vlen > 1) && is_power_of_2(vlen) &&
-     Matcher::vector_size_supported(bt, vlen)) {
+     Matcher::vector_size_supported(bt, vlen) &&
+     (vlen * type2aelembytes(bt) <= SuperWordMaxVectorSize)) {
Contributor


Can you put this whole condition into a separate static bool VectorNode::vector_size_supported(vlen, bt) and use it in both cases?
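The suggested refactoring might look roughly like this standalone sketch (the HotSpot helpers it uses are stubbed out here, and the real signatures may differ):

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for the HotSpot helpers referenced by the real condition;
// behavior here is simplified for illustration only.
static const uint32_t SuperWordMaxVectorSize = 32;      // bytes
static bool is_java_primitive(int bt) { return bt > 0; }
static int  type2aelembytes(int bt)   { return bt; }    // pretend bt is the element size
static bool is_power_of_2(uint32_t v) { return v != 0 && (v & (v - 1)) == 0; }
static bool matcher_vector_size_supported(int /*bt*/, uint32_t vlen) {
  return vlen <= 16;  // pretend the matcher supports up to 16 lanes
}

// The whole condition factored into one predicate, as the review
// suggests, so both call sites stay in sync.
static bool vector_size_supported(int bt, uint32_t vlen) {
  return is_java_primitive(bt) &&
         vlen > 1 && is_power_of_2(vlen) &&
         matcher_vector_size_supported(bt, vlen) &&
         vlen * type2aelembytes(bt) <= SuperWordMaxVectorSize;
}
```

With 4-byte elements, 8 lanes (32 bytes) pass the predicate while 16 lanes (64 bytes) are rejected by the SuperWordMaxVectorSize cap even though the matcher itself supports them.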

Member

jatin-bhateja commented May 30, 2022

Vectorization through SLP can be controlled by constraining MaxVectorSize, and through the Vector API by using narrower SPECIES.
Can you kindly share more details on the need for a separate SuperWordMaxVectorSize here? The user already has all the necessary controls to limit the C2 vector length; it will rarely happen that one wants to emit 512-bit vector code using the Vector API while still limiting the auto-vectorizer to 256-bit vector operations, or vice versa. Maybe we should pessimistically constrain the vector size of only those loops which may result in AVX-512-heavy instructions, through a target-specific analysis pass.

@sviswa7 sviswa7 changed the title Limit auto vectorization to 32 byte vector on Cascade Lake 8287697: Limit auto vectorization to 32-byte vector on Cascade Lake Jun 1, 2022
@sviswa7 sviswa7 marked this pull request as ready for review June 1, 2022 23:30
Author

sviswa7 commented Jun 1, 2022

/label hotspot-compiler

@openjdk openjdk bot added rfr Pull request is ready for review hotspot-compiler [email protected] labels Jun 1, 2022

openjdk bot commented Jun 1, 2022

@sviswa7
The hotspot-compiler label was successfully added.

Author

sviswa7 commented Jun 1, 2022

@vnkozlov Your review comments are resolved.
@jatin-bhateja This is a simple fix for the problem in the short time frame we have before the upcoming feature freeze. A more complex fix to enhance the auto-vectorizer is a good thought.


mlbridge bot commented Jun 1, 2022

Webrevs

Contributor

vnkozlov commented Jun 2, 2022

I think we missed testing with MaxVectorSize set to 32 (vs 64) on a Cascade Lake CPU. We should do that.

That may be a preferable "simple fix" vs the suggested changes as a "short term solution".

The objection was that a user may still want to use wide 64-byte vectors with the Vector API. But I agree with Jatin's argument about that.
Limiting MaxVectorSize would affect our intrinsics/stubs code and may affect performance. That is why we need to test it. I will ask Eric.

BTW, SuperWordMaxVectorSize should be diagnostic or experimental since it is a temporary solution.

Author

sviswa7 commented Jun 2, 2022

@vnkozlov I have made SuperWordMaxVectorSize a develop option as you suggested. As far as I know, the only intrinsics/stubs that use MaxVectorSize are for clear/copy. This is done in conjunction with AVX3Threshold, so we are OK there for Cascade Lake.

Contributor

vnkozlov commented Jun 2, 2022

> @vnkozlov I have made SuperWordMaxVectorSize a develop option as you suggested. As far as I know, the only intrinsics/stubs that use MaxVectorSize are for clear/copy. This is done in conjunction with AVX3Threshold, so we are OK there for Cascade Lake.

Thank you for checking stubs code.

We still have to run performance testing with this patch. We need only additional run with MaxVectorSize=32 to compare results.

And I want @jatin-bhateja to approve this change too. Or give better suggestion.


@vnkozlov vnkozlov left a comment


The changes look good. I will start testing them.

Comment on lines 1305 to 1306
FLAG_SET_DEFAULT(SuperWordMaxVectorSize, 32);
} else {
Member


SuperWordMaxVectorSize is set to 32 bytes by default; it should still be capped by MaxVectorSize, in case the user sets MaxVectorSize to 16 bytes.

Contributor


Yes. I submitted testing with FLAG_SET_DEFAULT(SuperWordMaxVectorSize, MIN2(MaxVectorSize, (intx)32));
And I declared the flag as DIAGNOSTIC; the product build fails otherwise.

"actual size could be less depending on elements type") \
range(0, max_jint) \
\
develop(intx, SuperWordMaxVectorSize, 64, \
Contributor


The flag can't be develop because it is used in product code. It should be diagnostic.

@jatin-bhateja
Member

> @vnkozlov Your review comments are resolved. @jatin-bhateja This is a simple fix for the problem in the short time frame we have before the upcoming feature freeze. A more complex fix to enhance the auto-vectorizer is a good thought.

Hi @sviswa7. This looks reasonable, since stubs and some macro assembly routines anyway operate under thresholds and do not strictly comply with the max vector size.


fg1417 commented Jun 2, 2022

Hi @sviswa7 , #7806 implemented an interface for auto-vectorization to disable some unprofitable cases on aarch64. Can it also be applied to your case?

Comment on lines 899 to 902
- if (use_avx_limit > 2 && is_intel_skylake() && _stepping < 5) {
-   FLAG_SET_DEFAULT(UseAVX, 2);
+ if (use_avx_limit > 2 && is_intel_skylake()) {
+   if (_stepping < 5) {
+     FLAG_SET_DEFAULT(UseAVX, 2);
+   }
Contributor


What is this change for?

Author


I had some changes in this area before; this is an artifact of that. I will set it back exactly as it was.


if (FLAG_IS_DEFAULT(SuperWordMaxVectorSize)) {
if (FLAG_IS_DEFAULT(UseAVX) && UseAVX > 2 &&
is_intel_skylake() && _stepping > 5) {
Contributor


Should you check _stepping >= 5? Otherwise _stepping == 5 is missing in all adjustments.

Contributor

vnkozlov commented Jun 2, 2022

> Hi @sviswa7 , #7806 implemented an interface for auto-vectorization to disable some unprofitable cases on aarch64. Can it also be applied to your case?

Maybe. But it would require more careful changes, and that changeset is not integrated yet.
The current changes are clean and serve their purpose well.

And, as Jatin and Sandhya said, we may do a proper fix after the JDK 19 fork. Then we can look at your proposal.

Author

sviswa7 commented Jun 2, 2022

@vnkozlov @jatin-bhateja Your review comments are implemented. Please take a look.


@vnkozlov vnkozlov left a comment


Looks good.
Please wait until regression and performance testing are finished. I will let you know the results.


openjdk bot commented Jun 2, 2022

@sviswa7 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8287697: Limit auto vectorization to 32-byte vector on Cascade Lake

Reviewed-by: kvn, jbhateja

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 38 new commits pushed to the master branch:

  • 26d2426: 8287340: Refactor old code using StringTokenizer in locale related code
  • ccec5d1: 8287704: Small logging clarification about shrunk bytes after heap shrinkage
  • 7f44f57: 8285868: x86 intrinsics for floating point method isInfinite
  • 13596cd: 8287097: (fs) Files::copy requires an undocumented permission when copying from the default file system to a non-default file system
  • 49e24f0: 8287567: AArch64: Implement post-call NOPs
  • 1fcbaa4: 8278598: AlignmentReserve is repeatedly reinitialized
  • e51ca1d: 8287171: Refactor null caller tests to a single directory
  • 3cfd38c: 8287726: Fix JVMTI tests with "requires vm.continuations" after JDK-8287496
  • c78392d: 8287606: standardize spelling of subtype and supertype etc in comments
  • 5acac22: 8286830: ~HandshakeState should not touch oops
  • ... and 28 more: https://git.openjdk.java.net/jdk/compare/97bd4c255a319ce626a316ed211ef1fd7d0f1e14...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jun 2, 2022
@jatin-bhateja
Member

Thanks @sviswa7, the changes look good to me.

Contributor

vnkozlov commented Jun 3, 2022

Regression testing results are good. Waiting for performance results.

Contributor

vnkozlov commented Jun 6, 2022

Only jbb2015 runs are left in the queue for performance testing. They may take time, and I don't expect much variation in them.

Testing also included MaxVectorSize=32 to compare with the current changes. It shows slightly (1-3%) better results in some Crypto-AESBench decrypt/encrypt sub-benchmarks, but that could be due to the variation we observed in them. On the other hand, SuperWordMaxVectorSize=32 shows better results in some Renaissance sub-benchmarks: it keeps scores similar to the current code, while MaxVectorSize=32 gives a regression in them. Based on this, I agree with the current changes vs setting MaxVectorSize=32.

Both changes give a 4-5% improvement on SPECjvm2008-MPEG.

But I also observed a 2.7% regression in SPECjvm2008-SOR.small with ParallelGC, for both types of changes.

Contributor

vnkozlov commented Jun 7, 2022

@sviswa7, please file a separate RFE for the SPECjvm2008-SOR.small issue (different unrolling factor).

I now have all the performance data from our runs and yours, and I think these changes are ready for integration. Thanks!

Author

sviswa7 commented Jun 8, 2022

@vnkozlov Thanks a lot. I have filed the RFE: https://bugs.openjdk.org/browse/JDK-8287966.

Author

sviswa7 commented Jun 8, 2022

/integrate


openjdk bot commented Jun 8, 2022

Going to push as commit 45f1b72.
Since your change was applied there have been 124 commits pushed to the master branch:

  • 39ec58b: 8287886: Further terminology updates to match JLS
  • 68c5957: 8287869: -XX:+AutoCreateSharedArchive doesn't work when JDK build is switched
  • bf439f8: 8287876: The recently de-problemlisted TestTitledBorderLeak test is unstable
  • b7a34f7: 8287927: ProblemList java/awt/GraphicsDevice/DisplayModes/UnknownRefrshRateTest.java on macosx-aarch64
  • 8e07839: 8285081: Improve XPath operators count accuracy
  • b12e7f1: 8279358: vmTestbase/nsk/jvmti/scenarios/jni_interception/JI03/ji03t003/TestDescription.java fails with usage tracker
  • 1aa87e0: 8287148: Avoid redundant HashMap.containsKey calls in ExtendedKeyCodes.getExtendedKeyCodeForChar
  • 74be2d9: 8286983: rename jdb -trackvthreads and debug agent enumeratevthreads options and clarify "Preview Feature" nature of these options
  • 8e10c2b: 8287877: Exclude vmTestbase/nsk/jvmti/AttachOnDemand/attach022/TestDescription.java until JDK-8277573 is fixed
  • 9ec27d0: 8287872: Disable concurrent execution of hotspot docker tests
  • ... and 114 more: https://git.openjdk.java.net/jdk/compare/97bd4c255a319ce626a316ed211ef1fd7d0f1e14...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jun 8, 2022
@openjdk openjdk bot closed this Jun 8, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jun 8, 2022

openjdk bot commented Jun 8, 2022

@sviswa7 Pushed as commit 45f1b72.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

fg1417 pushed a commit to fg1417/jdk that referenced this pull request Mar 21, 2023
openjdk#8877 introduced the global
option `SuperWordMaxVectorSize` as a temporary solution to fix
the performance regression on some x86 machines.

Currently, SuperWordMaxVectorSize behaves differently between
x86 and other platforms [1]. For example, if the current machine
only supports `MaxVectorSize <= 32` but we set
`SuperWordMaxVectorSize = 64`, then `SuperWordMaxVectorSize`
stays at 64 on other platforms, while an x86 machine would
clamp `SuperWordMaxVectorSize` to `MaxVectorSize`. Platforms
other than x86 lack an implementation like [2].

Also, `SuperWordMaxVectorSize` limits the max vector size of
auto-vectorization to 64 bytes, which is fine for current aarch64
hardware, but the SVE architecture supports vectors larger than
512 bits.

The patch drops the global option and uses an architecture-dependent
interface to consult the max vector size for auto-vectorization,
fixing the performance issue on x86 and reducing side effects on
other platforms. After the patch, auto-vectorization is still limited
to 32-byte vectors by default on Cascade Lake, and users can override
this by setting either `-XX:UseAVX=3` or `-XX:MaxVectorSize=64` on
the JVM command line.

So my question is:

Before the patch, we could have a smaller max vector size for
auto-vectorization than `MaxVectorSize` on x86; for example, users
could have `MaxVectorSize=64` and `SuperWordMaxVectorSize=32`. But
after the change, if we set `-XX:MaxVectorSize=64` explicitly, the
max vector size for auto-vectorization would be `MaxVectorSize`,
i.e. 64 bytes, which I believe is more reasonable.
@sviswa7 @jatin-bhateja, are you happy with the change?

[1] openjdk#12350 (comment)
[2] https://github.com/openjdk/jdk/blob/33bec207103acd520eb99afb093cfafa44aecfda/src/hotspot/cpu/x86/vm_version_x86.cpp#L1314-L1333
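The architecture-dependent hook described in this commit message can be sketched as plain C++ (the struct and method names are illustrative, not the real JDK interface; the 32-byte Cascade Lake figure follows the text above):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of a per-platform query the SLP auto-vectorizer could consult
// instead of a global SuperWordMaxVectorSize flag.
struct Matcher {
  static int64_t max_vector_size;        // stands in for -XX:MaxVectorSize
  static bool    max_vector_is_default;  // user did not set it explicitly
  static bool    is_cascade_lake;

  static int64_t superword_max_vector_size() {
    // x86 variant: cap Cascade Lake at 32 bytes by default; an explicit
    // -XX:MaxVectorSize=64 lifts the cap. Other platforms would simply
    // return max_vector_size.
    if (is_cascade_lake && max_vector_is_default && max_vector_size > 32) {
      return 32;
    }
    return max_vector_size;
  }
};

int64_t Matcher::max_vector_size       = 64;
bool    Matcher::max_vector_is_default = true;
bool    Matcher::is_cascade_lake       = true;
```

In this sketch an explicit `-XX:MaxVectorSize=64` (modeled by `max_vector_is_default = false`) yields 64 bytes for auto-vectorization too, which is the behavior change the question above asks about.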
@sviswa7 sviswa7 deleted the maxvector branch June 3, 2024 21:42