8349138: Optimize Math.copySign API for Intel e-core targets #23386
/label add hotspot-compiler-dev |
|
👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into |
|
❗ This change is not yet ready to be integrated. |
|
@jatin-bhateja |
Webrevs
|
jaskarth
left a comment
I think this is a good improvement! Having more intrinsics available for AVX2 targets is nice. I've left some comments below.
src/hotspot/cpu/x86/x86.ad
Outdated
  case Op_CopySignD:
  case Op_CopySignF:
-   if (UseAVX < 3 || !is_LP64) {
+   if (UseAVX < 1 || !is_LP64) {
Should it be limited to just AVX2, or can the new rules work on AVX1 as well, since they only use instructions that are available on AVX1?
#endif // _LP64

instruct copySignF_reg_avx(regF dst, regF src, regF xtmp1, regF xtmp2) %{
  predicate(!VM_Version::supports_avx512vl());
- predicate(!VM_Version::supports_avx512vl());
+ predicate(UseAVX > 0 && !VM_Version::supports_avx512vl());
Just to be a bit more explicit (and same for the one below).
It's already handled by the match_rule_supported constraint.
src/hotspot/cpu/x86/x86.ad
Outdated
__ movl($tmp2$$Register, 0x7FFFFFFF);
__ movdl($tmp1$$XMMRegister, $tmp2$$Register);
__ vpternlogd($dst$$XMMRegister, 0xE4, $src$$XMMRegister, $tmp1$$XMMRegister, Assembler::AVX_128bit);
__ vpcmpeqd($xtmp1$$XMMRegister, $xtmp1$$XMMRegister, $xtmp1$$XMMRegister, Assembler::AVX_128bit);
If any of the vector operands is from the higher register bank (16-31), then we need an EVEX encoding, and in that case the result of the comparison is always written to an opmask register.
|
Could you instead do this by trying to transform |
@merykitty, this patch does not break existing IR invariants: multiple targets already emit efficient instruction sequences for it, and we have just improved the x86 backend implementation. Introducing another new IR node, "AndF", would again need changes in the auto-vectorizer.
|
@jatin-bhateja Doing the transformation to
But currently, |
Also, what invariant can be broken by transforming |
Yes, I have a follow-up patch to auto-vectorize CopySign.
Hi @merykitty , I meant that in the context of CopySign, targets emit efficient instruction sequences for existing IR (CopySignF/D), this patch simply tuned x86 backend implementation to improve performance. |
|
|
Also, the logical And mask is currently a long value. If we opt for creating new AndF/D nodes, then to preserve the IR semantics we would also need an integral to floating-point constant conversion. This would incur an additional memory load penalty, since floating-point constants are emitted into the constant table before the native method body. For the time being, taking the CopySign intrinsic route looks reasonable.
|
@jatin-bhateja let me know when this is ready for more testing / review. Quick comment: it seems you are not just optimizing Math.copySign as the PR title says, but also adding vector nodes. Maybe you should update the PR title? Have not looked at the code in detail to suggest a better one yet ;) |
That means we can improve the generation of floating-point constants. The reason I object to this approach is that it is short-sighted. It's not like we cannot generate similar machine code with the more general approach. Furthermore, after we do |
Hi @eme64 , vectorization is a form of optimization, so the title is generic enough to cover both vector and scalar performance. |
Hi @merykitty , the patch intends to absorb domain crossover penalty due to the movement of floating point arguments to GPRs, if we introduce a floating-point constant load penalty then we may degrade the performance. |
|
@jatin-bhateja This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration! |
eme64
left a comment
The non-x64-specific code looks reasonable, though I have 2 comments ;)
IntStream.range(0, SIZE - 8).forEach(i -> { fmagnitude[i] = rd.nextFloat(-Float.MAX_VALUE, Float.MAX_VALUE); });
IntStream.range(0, SIZE - 8).forEach(i -> { dmagnitude[i] = rd.nextFloat(-Float.MAX_VALUE, Float.MAX_VALUE); });
IntStream.range(0, SIZE).forEach(i -> { fsign[i] = rd.nextFloat(-Float.MAX_VALUE, Float.MAX_VALUE); });
IntStream.range(0, SIZE).forEach(i -> { dsign[i] = rd.nextFloat(-Float.MAX_VALUE, Float.MAX_VALUE); });
Why not use Generators.java ? That would also give you NaN, infinity, etc ;)
|
@jatin-bhateja This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the |
|
/open |
|
@jatin-bhateja This pull request is now open |
Two comments:
- You probably wanted to use the double generator for the double arrays, right?
- You can fill a whole array directly with e.g. Generators.G.fill(genFloat, fmagnitude).
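To illustrate the two points above without depending on the test library, here is a minimal plain-JDK sketch. It is not the Generators API from the review; the class name `SpecialInputsSketch`, the `fill` helpers and the choice of corner cases are assumptions made for illustration. It fills the float array from float bit patterns and the double array from double bit patterns, and it seeds each array with NaN, infinities and signed zeros.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical stand-in for generator-based input creation; not the OpenJDK test library.
final class SpecialInputsSketch {
    // Corner cases placed at the front of each array (an assumption of this sketch).
    static final float[] FLOAT_SPECIALS = {
        Float.NaN, Float.POSITIVE_INFINITY, Float.NEGATIVE_INFINITY,
        0.0f, -0.0f, Float.MIN_VALUE, Float.MAX_VALUE
    };
    static final double[] DOUBLE_SPECIALS = {
        Double.NaN, Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY,
        0.0, -0.0, Double.MIN_VALUE, Double.MAX_VALUE
    };

    // Fill a float array: special values first, then random 32-bit patterns.
    static void fill(float[] a, Random rd) {
        for (int i = 0; i < a.length; i++) {
            a[i] = i < FLOAT_SPECIALS.length
                    ? FLOAT_SPECIALS[i]
                    : Float.intBitsToFloat(rd.nextInt());
        }
    }

    // Fill a double array from double-width bit patterns, not from float values.
    static void fill(double[] a, Random rd) {
        for (int i = 0; i < a.length; i++) {
            a[i] = i < DOUBLE_SPECIALS.length
                    ? DOUBLE_SPECIALS[i]
                    : Double.longBitsToDouble(rd.nextLong());
        }
    }

    public static void main(String[] args) {
        Random rd = new Random(42);
        float[] fmagnitude = new float[16];
        double[] dmagnitude = new double[16];
        fill(fmagnitude, rd);
        fill(dmagnitude, rd);
        System.out.println(Arrays.toString(fmagnitude));
        System.out.println(Arrays.toString(dmagnitude));
    }
}
```

Random bit patterns also produce NaNs with varying payloads, which is why the comparison in the test has to treat differently encoded NaNs as equal.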
You could add a comment that we consider NaN with different encoding as the same value.
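A short sketch of what such a comparison could look like, assuming the test compares results through `Float.floatToIntBits`, which collapses every NaN encoding to the canonical one. The class and method names are hypothetical and not taken from the PR.

```java
// Hypothetical helper; the PR's actual verification code is not shown in this thread.
final class NanCompareSketch {
    // NaNs with different encodings are considered the same value, because
    // floatToIntBits canonicalizes every NaN to 0x7fc00000. Signed zeros stay
    // distinct, which matters for copySign results.
    static boolean sameValue(float a, float b) {
        return Float.floatToIntBits(a) == Float.floatToIntBits(b);
    }

    public static void main(String[] args) {
        float otherNaN = Float.intBitsToFloat(0x7fc00001); // NaN with a non-default payload
        System.out.println(sameValue(Float.NaN, otherNaN));
        System.out.println(sameValue(0.0f, -0.0f));
    }
}
```

Using `Float.floatToRawIntBits` instead would keep NaN payloads distinct, so a scalar and a vector path that return differently encoded NaNs would be reported as a mismatch.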
  #endif // _LP64

- instruct copySignF_reg(regF dst, regF src, regF tmp1, rRegI tmp2) %{
+ instruct copySignF_reg_avx(regF dst, regF src, regF xtmp) %{
These should be vlRegF.
  %}

- instruct copySignD_imm(regD dst, regD src, regD tmp1, rRegL tmp2, immD zero) %{
+ instruct copySignD_imm_avx(regD dst, regD src, regD xtmp, immD zero) %{
These should be vlRegD.
instruct copySignV_reg(vec dst, vec src, vec xtmp) %{
  match(Set dst (CopySignVF dst src));
  match(Set dst (CopySignVD dst src));
  effect(TEMP xtmp);
vector_copy_sign_avx needs TEMP dst so may need two different instruct rules.
if (elem_sz == 2) {
  vpsllw(dst, src, shift, vlen_enc);
} else if (elem_sz == 4) {
  vpslld(dst, src, shift, vlen_enc);
AVX1 supports 256-bit float/double vectors, but only 128-bit vpsll, vpsrl and vpor for integer vectors. So a 256-bit float/double vector copysign implementation that uses vpsll, vpsrl and vpor will have issues on AVX1 platforms.
//
// Result going from high bit to low bit is 0x11100100 = 0xe4
// ---------------------------------------
#ifdef _LP64
The _LP64 ifdef is no longer needed in the .ad file (32-bit support has been removed).
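For readers following the 0xe4 immediate in the quoted comment: read as a binary truth table it is 0b11100100. The sketch below is a scalar Java model of a three-input bitwise ternary-logic operation, assuming the usual VPTERNLOG convention that each result bit is looked up in the immediate at index (A << 2) | (B << 1) | C, with A taken from the destination, B from the source and C from the mask. The class and method names are hypothetical. With C = 0x7FFFFFFF, the table selects A where the mask bit is set (magnitude bits) and B where it is clear (the sign bit).

```java
// Scalar model of a 3-input bitwise lookup; a sketch, not HotSpot code.
final class TernlogSketch {
    // For every bit position, build a 3-bit index from a, b and c and
    // read the result bit from the 8-entry truth table in imm8.
    static int ternlog(int a, int b, int c, int imm8) {
        int result = 0;
        for (int bit = 0; bit < 32; bit++) {
            int idx = (((a >>> bit) & 1) << 2)
                    | (((b >>> bit) & 1) << 1)
                    | ((c >>> bit) & 1);
            result |= ((imm8 >>> idx) & 1) << bit;
        }
        return result;
    }

    public static void main(String[] args) {
        int magnitude = Float.floatToRawIntBits(3.5f);
        int sign = Float.floatToRawIntBits(-1.0f);
        int bits = ternlog(magnitude, sign, 0x7FFFFFFF, 0xE4);
        System.out.println(Float.intBitsToFloat(bits)); // prints -3.5
    }
}
```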
|
@jatin-bhateja This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a |
|
@jatin-bhateja This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the |

Math.copySign is currently intrinsified only on x86 targets that support AVX512.
Intel E-core Xeons support only the AVX2 feature set, so they still compile the Java implementation, which is composed of logical operations.
Copying the incoming float/double values to GPRs before the logical operations incurs a 3-cycle penalty, so there is an opportunity to optimize this with a more efficient instruction sequence.
The patch uses the ANDPS and ANDPD logical instructions to generate efficient instruction sequences that absorb the domain crossover penalty. It also performs minor tuning of the existing AVX512 instruction sequence based on the VPTERNLOG instruction.
The performance numbers are for the following existing microbenchmark:
https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/vm/compiler/Signum.java
The patch passes the following validation test:
test/jdk/java/lang/Math/IeeeRecommendedTests.java
The new instruction sequence is vector friendly and will be vectorized in a follow-up patch.
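As a rough illustration of the logical operations the description refers to, the sketch below writes copySign as two ANDs and an OR on the raw bits. It is a simplified model, not the JDK source or the generated code, and the class and constant names are assumptions. The float-to-int and int-to-float conversions stand in for the register moves between the floating-point and general-purpose domains that the patch tries to avoid.

```java
// Bit-level model of copySign; names are illustrative only.
final class CopySignSketch {
    static final int FLOAT_SIGN_MASK = 0x8000_0000;
    static final long DOUBLE_SIGN_MASK = 0x8000_0000_0000_0000L;

    // Keep exponent and mantissa of 'magnitude', take the sign bit of 'sign'.
    static float copySign(float magnitude, float sign) {
        int m = Float.floatToRawIntBits(magnitude) & ~FLOAT_SIGN_MASK;
        int s = Float.floatToRawIntBits(sign) & FLOAT_SIGN_MASK;
        return Float.intBitsToFloat(m | s);
    }

    // Same operation on 64-bit patterns for double.
    static double copySign(double magnitude, double sign) {
        long m = Double.doubleToRawLongBits(magnitude) & ~DOUBLE_SIGN_MASK;
        long s = Double.doubleToRawLongBits(sign) & DOUBLE_SIGN_MASK;
        return Double.longBitsToDouble(m | s);
    }

    public static void main(String[] args) {
        System.out.println(copySign(3.5f, -0.0f)); // prints -3.5
        System.out.println(copySign(-2.0, 1.0));   // prints 2.0
    }
}
```

Applying the same masks directly in XMM registers keeps both operands in the floating-point domain, which is the effect the ANDPS/ANDPD sequence is after.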
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23386/head:pull/23386
$ git checkout pull/23386
Update a local copy of the PR:
$ git checkout pull/23386
$ git pull https://git.openjdk.org/jdk.git pull/23386/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 23386
View PR using the GUI difftool:
$ git pr show -t 23386
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23386.diff