Skip to content

Conversation

@fg1417
Copy link

@fg1417 fg1417 commented Mar 21, 2023

#8877 introduced the global option SuperWordMaxVectorSize as a temporary solution to fix the performance regression on some x86 machines.

Currently, SuperWordMaxVectorSize behaves differently between x86 and other platforms [1]. For example, if the current machine only supports MaxVectorSize <= 32, but we set SuperWordMaxVectorSize = 64, then SuperWordMaxVectorSize will be kept at 64 on other platforms while x86 machine would change SuperWordMaxVectorSize to MaxVectorSize. Other platforms except x86 miss similar implementations like [2].

Also, SuperWordMaxVectorSize limits the max vector size of auto-vectorization as 64, which is fine for current aarch64 hardware, but SVE architecture supports larger than 512 bits.

The patch is to drop the global option and use an architecture-dependent interface to consult the max vector size for auto-vectorization, fixing the performance issue on x86 and reducing side effects for other platforms. After the patch, auto-vectorization is still limited to 32-byte vectors by default on Cascade Lake and users can override this by either setting
-XX:UseAVX=3 or -XX:MaxVectorSize=64 on JVM command line.

So my question is:

Before the patch, we could have a smaller max vector size for auto-vectorization than MaxVectorSize on x86. For example, users could have MaxVectorSize=64 and SuperWordMaxVectorSize=32. But after the change, if we set
-XX:MaxVectorSize=64 explicitly, then the max vector size for auto-vectorization would be MaxVectorSize, i.e. 64 bytes, which I believe is more reasonable. @sviswa7 @jatin-bhateja, are you happy about the change?

[1] #12350 (comment)
[2]

#if defined(COMPILER2)
if (FLAG_IS_DEFAULT(SuperWordMaxVectorSize)) {
if (FLAG_IS_DEFAULT(UseAVX) && UseAVX > 2 &&
is_intel_skylake() && _stepping >= 5) {
// Limit auto vectorization to 256 bit (32 byte) by default on Cascade Lake
FLAG_SET_DEFAULT(SuperWordMaxVectorSize, MIN2(MaxVectorSize, (intx)32));
} else {
FLAG_SET_DEFAULT(SuperWordMaxVectorSize, MaxVectorSize);
}
} else {
if (SuperWordMaxVectorSize > MaxVectorSize) {
warning("SuperWordMaxVectorSize cannot be greater than MaxVectorSize %i", (int) MaxVectorSize);
FLAG_SET_DEFAULT(SuperWordMaxVectorSize, MaxVectorSize);
}
if (!is_power_of_2(SuperWordMaxVectorSize)) {
warning("SuperWordMaxVectorSize must be a power of 2, setting to MaxVectorSize: %i", (int) MaxVectorSize);
FLAG_SET_DEFAULT(SuperWordMaxVectorSize, MaxVectorSize);
}
}
#endif


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

  • JDK-8304301: Remove the global option SuperWordMaxVectorSize

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/13112/head:pull/13112
$ git checkout pull/13112

Update a local copy of the PR:
$ git checkout pull/13112
$ git pull https://git.openjdk.org/jdk.git pull/13112/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 13112

View PR using the GUI difftool:
$ git pr show -t 13112

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/13112.diff

openjdk#8877 introduced the global
option `SuperWordMaxVectorSize` as a temporary solution to fix
the performance regression on some x86 machines.

Currently, SuperWordMaxVectorSize behaves differently between
x86 and other platforms[1]. For example, if the current machine
only supports `MaxVectorSize <= 32`, but we set
`SuperWordMaxVectorSize = 64`, then `SuperWordMaxVectorSize`
will be kept at 64 on other platforms while x86 machine would
change `SuperWordMaxVectorSize` to `MaxVectorSize`. Other
platforms except x86 miss similar implementations like [2].

Also, `SuperWordMaxVectorSize` limits the max vector size of
auto-vectorization as `64`, which is fine for current aarch64
hardware but SVE architecture supports larger than 512 bits.

The patch is to drop the global option and use an architecture-
dependent interface to consult the max vector size for auto-
vectorization, fixing the performance issue on x86 and reducing
side effects for other platforms. After the patch, auto-
vectorization is still limited to 32-byte vectors by default
on Cascade Lake and users can override this by either setting
`-XX:UseAVX=3` or `-XX:MaxVectorSize=64` on JVM command line.

So my question is:

Before the patch, we could have a smaller max vector size for
auto-vectorization than `MaxVectorSize` on x86. For example,
users could have `MaxVectorSize=64` and
`SuperWordMaxVectorSize=32`. But after the change, if we set
`-XX:MaxVectorSize=64` explicitly, then the max vector size for
auto-vectorization would be `MaxVectorSize`, i.e. 64 bytes, which
I believe is more reasonable. @sviswa7 @jatin-bhateja, are you
happy about the change?

[1] openjdk#12350 (comment)
[2] https://github.com/openjdk/jdk/blob/33bec207103acd520eb99afb093cfafa44aecfda/src/hotspot/cpu/x86/vm_version_x86.cpp#L1314-L1333
@bridgekeeper
Copy link

bridgekeeper bot commented Mar 21, 2023

👋 Welcome back fgao! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Mar 21, 2023
@openjdk
Copy link

openjdk bot commented Mar 21, 2023

@fg1417 The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@mlbridge
Copy link

mlbridge bot commented Mar 21, 2023

Webrevs

@sviswa7
Copy link

sviswa7 commented Mar 21, 2023

@fg1417 SuperWordMaxVectorSize defines the maximum vector size generated by the auto vectorization.
MaxVectorSize defines the vector width supported by the underlying platform.
For CascadeLake we are setting SuperWordMaxVectorSize=32 and MaxVectorSize=64 by default.
This allows the usage of larger 64 byte width vector instructions in places like Java Vector API and intrinsics.
We would like to keep this behavior for x86.

@fg1417
Copy link
Author

fg1417 commented Mar 21, 2023

@fg1417 SuperWordMaxVectorSize defines the maximum vector size generated by the auto vectorization. MaxVectorSize defines the vector width supported by the underlying platform. For CascadeLake we are setting SuperWordMaxVectorSize=32 and MaxVectorSize=64 by default. This allows the usage of larger 64 byte width vector instructions in places like Java Vector API and intrinsics. We would like to keep this behavior for x86.

Hi @sviswa7, thanks for your quick response! Yes, the patch keeps the special handling for auto-vectorization on Cascade Lake. For Cascade Lake, even after the patch, we still have 32 bytes for auto-vectorization and larger 64 bytes for Java Vector API and intrinsics. Can it cover your needs?

Copy link

@sviswa7 sviswa7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@sviswa7
Copy link

sviswa7 commented Mar 21, 2023

/reviewers 2

@openjdk
Copy link

openjdk bot commented Mar 21, 2023

@fg1417 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8304301: Remove the global option SuperWordMaxVectorSize

Reviewed-by: sviswanathan, kvn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 51 new commits pushed to the master branch:

  • 51035a7: 8294137: Review running times of java.math tests
  • 46cca1a: 4842457: (bf spec) Clarify meaning of "(optional operation)"
  • 6fa25cc: 8184444: The compiler error "variable not initialized in the default constructor" is not apt in case of static final variables
  • 4b8f7db: 8027682: javac wrongly accepts semicolons in package and import decls
  • c00d088: 8043179: Lambda expression can mutate final field
  • 147f347: 8219083: java/net/MulticastSocket/SetGetNetworkInterfaceTest.java failed in same binary run on windows x64
  • bf917ba: 8304687: Move add_to_hierarchy
  • 63d4afb: 8304671: javac regression: Compilation with --release 8 fails on underscore in enum identifiers
  • e2cfcfb: 6817009: Action.SELECTED_KEY not toggled when using key binding
  • af4d560: 8303951: Add asserts before record_method_not_compilable where possible
  • ... and 41 more: https://git.openjdk.org/jdk/compare/c09f83ec25749af349fb5609e3641b5bb6d34072...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Mar 21, 2023
@openjdk
Copy link

openjdk bot commented Mar 21, 2023

@sviswa7
The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Mar 21, 2023
@fg1417
Copy link
Author

fg1417 commented Mar 22, 2023

@sviswa7 thanks for your kind review!

@fg1417
Copy link
Author

fg1417 commented Mar 22, 2023

Hi @vnkozlov @TobiHartmann, could you please help review the patch?

Since I don't have sufficient x86 systems, I would appreciate it if you could help verify that the patch will not introduce any performance regression on x86, especially on Cascade Lake.

Thanks!

@vnkozlov
Copy link
Contributor

I will test it.

@vnkozlov
Copy link
Contributor

vnkozlov commented Mar 22, 2023

After the patch, auto-vectorization is still limited to 32-byte vectors by default on Cascade Lake and users can override this by either setting -XX:UseAVX=3 or -XX:MaxVectorSize=64 on JVM command line.

Correction: only -XX:MaxVectorSize=64 will affect it. -XX:UseAVX=3 does not affect superword vector size on CL because the condition for CL already includes UseAVX > 2. And with -XX:UseAVX=2 all vectors are limited to 32 bytes on all platforms.

EDIT: I am wrong and you are right. I missed that you added FLAG_IS_DEFAULT(UseAVX) check. So any UseAVX setting on command line will bypass superword vector size setting for CL.

@vnkozlov
Copy link
Contributor

I am not sure about this effect of UseAVX setting on command line.
We use UseAVX setting on command line to emulate runs on different platforms. Some our machines are Cascade Lake. Before, even if we use -XX:UseAVX=3 on command line we still limit superword vector size for CL. With this patch we will not do that. It will not match what happens in production with default setting and we miss testing coverage for CL.

Why you added FLAG_IS_DEFAULT(UseAVX) check?

@fg1417
Copy link
Author

fg1417 commented Mar 23, 2023

I am not sure about this effect of UseAVX setting on command line. We use UseAVX setting on command line to emulate runs on different platforms. Some our machines are Cascade Lake. Before, even if we use -XX:UseAVX=3 on command line we still limit superword vector size for CL. With this patch we will not do that. It will not match what happens in production with default setting and we miss testing coverage for CL.

Why you added FLAG_IS_DEFAULT(UseAVX) check?

Hi @vnkozlov, thanks for your review!

I added the FLAG_IS_DEFAULT(UseAVX) check to follow its original logic, see

if (FLAG_IS_DEFAULT(UseAVX) && UseAVX > 2 &&
. Before the patch, using UseAVX=3 alone would change superword vector size for CL to be MaxVectorSize, but we could still explicitly use -XX:SuperWordMaxVectorSize=32 to limit superword vector size for CL.

Certainly, removing FLAG_IS_DEFAULT(UseAVX) check here is also reasonable to me. Then only -XX:MaxVectorSize=64 will affect it.

@vnkozlov
Copy link
Contributor

I added the FLAG_IS_DEFAULT(UseAVX) check to follow its original logic, see

I should have looked on existing code. Yes, it also checks FLAG_IS_DEFAULT(UseAVX) and you did not change behavior. Good.

I don't have anymore questions. I am currently running specjvm2008 on Cascade Lake with default settings with and without your changes to make sure performance stays the same (I don't know why it would change based on your changes but to be safe). I will let you know results.

Copy link
Contributor

@vnkozlov vnkozlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finished running specjvm2008 on Cascade Lake. I was first surprise that I got better scores on few sub-benchmarks but after rerunning they were matching. Some kind of instability (TurboBust?, memory channel?). I did use numactl -m 1 -N 1 on dual socket system I have. Anyway, I would say results are matching as expecting.

@TobiHartmann regression testing passed too.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Mar 23, 2023
@fg1417
Copy link
Author

fg1417 commented Mar 24, 2023

Thanks for all your review and testing, @sviswa7 @vnkozlov @TobiHartmann.

@fg1417
Copy link
Author

fg1417 commented Mar 24, 2023

/integrate

@openjdk
Copy link

openjdk bot commented Mar 24, 2023

Going to push as commit 941a7ac.
Since your change was applied there have been 59 commits pushed to the master branch:

  • ac6af6a: 7176515: ExceptionInInitializerError for an enum with multiple switch statements
  • dd23ee9: 8303917: Update ISO 639 language codes table
  • 6f67abd: 8304557: java/util/concurrent/CompletableFuture/CompletableFutureOrTimeoutExceptionallyTest.java times out
  • 568dd57: 8304716: Clean up G1Policy::calc_max_old_cset_length()
  • af0504e: 8304691: Remove jlink --post-process-path option
  • 3859faf: 8231349: Move intrinsic stubs generation to compiler runtime initialization code
  • f37674a: 8304711: Combine G1 root region abort and wait into a single method
  • 7f9e691: 8304712: Only pass total number of regions into G1Policy::calc_min_old_cset_length
  • 51035a7: 8294137: Review running times of java.math tests
  • 46cca1a: 4842457: (bf spec) Clarify meaning of "(optional operation)"
  • ... and 49 more: https://git.openjdk.org/jdk/compare/c09f83ec25749af349fb5609e3641b5bb6d34072...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Mar 24, 2023
@openjdk openjdk bot closed this Mar 24, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Mar 24, 2023
@openjdk
Copy link

openjdk bot commented Mar 24, 2023

@fg1417 Pushed as commit 941a7ac.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot [email protected] integrated Pull request has been integrated

Development

Successfully merging this pull request may close these issues.

3 participants