Skip to content

Conversation

@jatin-bhateja
Copy link
Member

@jatin-bhateja jatin-bhateja commented Jan 31, 2025

Math.copySign is only intrinsified on x86 targets supporting the AVX512 feature.
Intel E-core Xeons support only the AVX2 feature set and still compile Java implementation which is composed of logical operations.

Since there is a 3-cycle penalty for copying incoming float/double values to GPRs before being operated upon by logical operation there is an opportunity to optimize this using an efficient instruction sequence.

Patch uses ANDPS and ANDPD logical instruction to generate efficient instruction sequences to absorb domain copy over penalty. Also, performs minor tuning for existing AVX512 instruction sequence based on VPTERNLOG instruction.

Following are the performance numbers of the following existing microbenchmark
https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/vm/compiler/Signum.java

Patch passes following validation test
test/jdk/java/lang/Math/IeeeRecommendedTests.java


Granite Rapids-AP (P-core Xeon)
Baseline AVX512:
Benchmark                      Mode  Cnt     Score   Error   Units
Signum._5_copySignFloatTest   thrpt    2  1296.141          ops/ns
Signum._7_copySignDoubleTest  thrpt    2   838.954          ops/ns

Withopt :
Benchmark                      Mode  Cnt    Score   Error   Units
Signum._5_copySignFloatTest   thrpt    2  940.240          ops/ns
Signum._7_copySignDoubleTest  thrpt    2  967.370          ops/ns

Baseline AVX2:
Benchmark                      Mode  Cnt   Score   Error   Units
Signum._5_copySignFloatTest   thrpt    2  63.673          ops/ns
Signum._7_copySignDoubleTest  thrpt    2  26.898          ops/ns

Withopt :
Benchmark                      Mode  Cnt    Score   Error   Units
Signum._5_copySignFloatTest   thrpt    2  785.801          ops/ns
Signum._7_copySignDoubleTest  thrpt    2  558.710          ops/ns

Sierra Forest (E-core Xeon)
Baseline:
Benchmark                                       (seed)   Mode  Cnt        Score   Error   Units
o.o.b.vm.compiler.Signum._5_copySignFloatTest      N/A  thrpt    2       40.528          ops/ns
o.o.b.vm.compiler.Signum._7_copySignDoubleTest     N/A  thrpt    2       25.101          ops/ns

Withopt:
Benchmark                                       (seed)   Mode  Cnt        Score   Error   Units
o.o.b.vm.compiler.Signum._5_copySignFloatTest      N/A  thrpt    2      676.101          ops/ns
o.o.b.vm.compiler.Signum._7_copySignDoubleTest     N/A  thrpt    2      605.714          ops/ns


New instruction sequence is now vector friendly and will be vectorized in follow up patch.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8349138: Optimize Math.copySign API for Intel e-core targets (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23386/head:pull/23386
$ git checkout pull/23386

Update a local copy of the PR:
$ git checkout pull/23386
$ git pull https://git.openjdk.org/jdk.git pull/23386/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 23386

View PR using the GUI difftool:
$ git pr show -t 23386

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23386.diff

Using Webrev

Link to Webrev Comment

@jatin-bhateja
Copy link
Member Author

jatin-bhateja commented Jan 31, 2025

/label add hotspot-compiler-dev

@jatin-bhateja jatin-bhateja marked this pull request as ready for review January 31, 2025 11:30
@bridgekeeper
Copy link

bridgekeeper bot commented Jan 31, 2025

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Jan 31, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot added rfr Pull request is ready for review hotspot-compiler [email protected] labels Jan 31, 2025
@openjdk
Copy link

openjdk bot commented Jan 31, 2025

@jatin-bhateja
The hotspot-compiler label was successfully added.

@mlbridge
Copy link

mlbridge bot commented Jan 31, 2025

Webrevs

@jatin-bhateja jatin-bhateja marked this pull request as draft January 31, 2025 12:12
@openjdk openjdk bot removed the rfr Pull request is ready for review label Jan 31, 2025
@jatin-bhateja jatin-bhateja marked this pull request as ready for review January 31, 2025 12:16
@openjdk openjdk bot added the rfr Pull request is ready for review label Jan 31, 2025
Copy link
Member

@jaskarth jaskarth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good improvement! Having more intrinsics available for AVX2 targets is nice. I've left some comments below.

case Op_CopySignD:
case Op_CopySignF:
if (UseAVX < 3 || !is_LP64) {
if (UseAVX < 1 || !is_LP64) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be limited to just AVX2, or can the new rules work on AVX1 as well? Since they only use instructions that are available to AVX1.

#endif // _LP64

instruct copySignF_reg_avx(regF dst, regF src, regF xtmp1, regF xtmp2) %{
predicate(!VM_Version::supports_avx512vl());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
predicate(!VM_Version::supports_avx512vl());
predicate(UseAVX > 0 && !VM_Version::supports_avx512vl());

Just to be a bit more explicit (and same for the one below).

Copy link
Member Author

@jatin-bhateja jatin-bhateja Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its already handled by match_rule_supported contraint.

__ movl($tmp2$$Register, 0x7FFFFFFF);
__ movdl($tmp1$$XMMRegister, $tmp2$$Register);
__ vpternlogd($dst$$XMMRegister, 0xE4, $src$$XMMRegister, $tmp1$$XMMRegister, Assembler::AVX_128bit);
__ vpcmpeqd($xtmp1$$XMMRegister, $xtmp1$$XMMRegister, $xtmp1$$XMMRegister, Assembler::AVX_128bit);
Copy link
Member Author

@jatin-bhateja jatin-bhateja Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any of the vector operands is from a higher register bank (16-31) then we need an EVEX encoding and in such a case, the results of the comparison is always an opmask register.

#endif // _LP64

instruct copySignF_reg_avx(regF dst, regF src, regF xtmp1, regF xtmp2) %{
predicate(!VM_Version::supports_avx512vl());
Copy link
Member Author

@jatin-bhateja jatin-bhateja Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its already handled by match_rule_supported contraint.

@merykitty
Copy link
Member

Could you instead do this by trying to transform AndI(MoveF2I(x), MoveF2I(y)) into AndF(x, y) instead?

@jatin-bhateja
Copy link
Member Author

jatin-bhateja commented Feb 4, 2025

Could you instead do this by trying to transform AndI(MoveF2I(x), MoveF2I(y)) into AndF(x, y) instead?

@merykitty , this patch does not break existing IR invariants as multiple targets already emit efficient instruction sequences for it, we have just improved upon the x86-backed implementation.
image

Introducing another new IR "AndF" will again need changes in auto-vectorizer.

@merykitty
Copy link
Member

@jatin-bhateja Doing the transformation to AndF would be a more general solution and thus better.

Introducing another new IR "AndF" will again need changes in auto-vectorizer.

But currently, CopySign and MoveF2I are not vectorized anyway so we can do the vectorization of AndF in a separate patch without much hassle. AndF is vectorized into existing AndV nicely so it is not a too complicated work.

@merykitty
Copy link
Member

this patch does not break existing IR invariants

Also, what invariant can be broken by transforming AndI(MoveF2I(x), MoveF2I(y) into MoveF2I(AndF(x, y))?

@jatin-bhateja
Copy link
Member Author

jatin-bhateja commented Feb 4, 2025

@jatin-bhateja Doing the transformation to AndF would be a more general solution and thus better.

Introducing another new IR "AndF" will again need changes in auto-vectorizer.

But currently, CopySign and MoveF2I are not vectorized anyway so we can do the vectorization of AndF in a separate patch without much hassle. AndF is vectorized into existing AndV nicely so it is not a too complicated work.

Yes, I have a follow-up patch to auto-vectorized CopySign.

this patch does not break existing IR invariants

Also, what invariant can be broken by transforming AndI(MoveF2I(x), MoveF2I(y) into MoveF2I(AndF(x, y))?

Hi @merykitty , I meant that in the context of CopySign, targets emit efficient instruction sequences for existing IR (CopySignF/D), this patch simply tuned x86 backend implementation to improve performance.

@TobiHartmann
Copy link
Member

compiler/intrinsics/math/TestCopySignIntrinsic.java fails with -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:+TieredCompilation on Mac x64:

1) Method "public void compiler.intrinsics.math.TestCopySignIntrinsic.testCopySignD()" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#COPYSIGN_D#_", " >0 "}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={"avx", "true"}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(CopySignD.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!

2) Method "public void compiler.intrinsics.math.TestCopySignIntrinsic.testCopySignF()" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#COPYSIGN_F#_", " >0 "}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={"avx", "true"}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(CopySignF.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!

@jatin-bhateja
Copy link
Member Author

jatin-bhateja commented Feb 12, 2025

@jatin-bhateja Doing the transformation to AndF would be a more general solution and thus better.

Introducing another new IR "AndF" will again need changes in auto-vectorizer.

But currently, CopySign and MoveF2I are not vectorized anyway so we can do the vectorization of AndF in a separate patch without much hassle. AndF is vectorized into existing AndV nicely so it is not a too complicated work.

Yes, I have a follow-up patch to auto-vectorized CopySign.

this patch does not break existing IR invariants

Also, what invariant can be broken by transforming AndI(MoveF2I(x), MoveF2I(y) into MoveF2I(AndF(x, y))?

Hi @merykitty , I meant that in the context of CopySign, targets emit efficient instruction sequences for existing IR (CopySignF/D), this patch simply tuned x86 backend implementation to improve performance.

Also currently, logical And mask is a long value, in case we opt-in for new AndF/D node creation, to preserve the IR semantics we would also need to perform an integral to floating point constant conversion, this will incur additional memory load penalty since floating-point constants are emitted into the constant table before native method body.

For the time being, taking CopySign intrinsic route looks reasonable.

@eme64
Copy link
Contributor

eme64 commented Feb 13, 2025

@jatin-bhateja let me know when this is ready for more testing / review.

Quick comment: it seems you are not just optimizing Math.copySign as the PR title says, but also adding vector nodes. Maybe you should update the PR title? Have not looked at the code in detail to suggest a better one yet ;)

@merykitty
Copy link
Member

Also currently, logical And mask is a long value, in case we opt-in for new AndF/D node creation, to preserve the IR semantics we would also need to perform an integral to floating point constant conversion, this will incur additional memory load penalty since floating-point constants are emitted into the constant table before native method body.

That means we can improve the generation of floating-point constants.

The reason I object this approach is that it is short-sighted. It's not like we cannot generate similar machine code with the more general approach. Furthermore, after we do AndF transformations, this patch is redundant and can be removed entirely.

@jatin-bhateja
Copy link
Member Author

jatin-bhateja commented Feb 20, 2025

@jatin-bhateja let me know when this is ready for more testing / review.

Quick comment: it seems you are not just optimizing Math.copySign as the PR title says, but also adding vector nodes. Maybe you should update the PR title? Have not looked at the code in detail to suggest a better one yet ;)

Hi @eme64 , vectorization is a form of optimization, so the title is generic enough to cover both vector and scalar performance.
Let me know if you have other comments.

@jatin-bhateja
Copy link
Member Author

jatin-bhateja commented Feb 20, 2025

Also currently, logical And mask is a long value, in case we opt-in for new AndF/D node creation, to preserve the IR semantics we would also need to perform an integral to floating point constant conversion, this will incur additional memory load penalty since floating-point constants are emitted into the constant table before native method body.

That means we can improve the generation of floating-point constants.

The reason I object this approach is that it is short-sighted. It's not like we cannot generate similar machine code with the more general approach. Furthermore, after we do AndF transformations, this patch is redundant and can be removed entirely.

Hi @merykitty , the patch intends to absorb domain crossover penalty due to the movement of floating point arguments to GPRs, if we introduce a floating-point constant load penalty then we may degrade the performance.

@bridgekeeper
Copy link

bridgekeeper bot commented Mar 20, 2025

@jatin-bhateja This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

Copy link
Contributor

@eme64 eme64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then non x64 specific code looks reasonable, though I have 2 comments ;)

Comment on lines 76 to 79
IntStream.range(0, SIZE - 8).forEach(i -> { fmagnitude[i] = rd.nextFloat(-Float.MAX_VALUE, Float.MAX_VALUE); });
IntStream.range(0, SIZE - 8).forEach(i -> { dmagnitude[i] = rd.nextFloat(-Float.MAX_VALUE, Float.MAX_VALUE); });
IntStream.range(0, SIZE).forEach(i -> { fsign[i] = rd.nextFloat(-Float.MAX_VALUE, Float.MAX_VALUE); });
IntStream.range(0, SIZE).forEach(i -> { dsign[i] = rd.nextFloat(-Float.MAX_VALUE, Float.MAX_VALUE); });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use Generators.java ? That would also give you NaN, infinity, etc ;)

@bridgekeeper
Copy link

bridgekeeper bot commented Apr 30, 2025

@jatin-bhateja This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

@bridgekeeper bridgekeeper bot closed this Apr 30, 2025
@jatin-bhateja
Copy link
Member Author

jatin-bhateja commented May 1, 2025

/open

@openjdk openjdk bot reopened this May 1, 2025
@openjdk
Copy link

openjdk bot commented May 1, 2025

@jatin-bhateja This pull request is now open

Comment on lines +79 to +86
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two comments:

  • You probably wanted to use the double generator for the double arrays, right?
  • You can fill a whole array directly with e.g. Generators.G.fill(genFloat, fmagniture).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could add a comment that we consider NaN with different encoding as the same value.

#endif // _LP64

instruct copySignF_reg(regF dst, regF src, regF tmp1, rRegI tmp2) %{
instruct copySignF_reg_avx(regF dst, regF src, regF xtmp) %{
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be vlRegF.

%}

instruct copySignD_imm(regD dst, regD src, regD tmp1, rRegL tmp2, immD zero) %{
instruct copySignD_imm_avx(regD dst, regD src, regD xtmp, immD zero) %{
Copy link

@sviswa7 sviswa7 May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be vlRegD.

Comment on lines +6574 to +6577
instruct copySignV_reg(vec dst, vec src, vec xtmp) %{
match(Set dst (CopySignVF dst src));
match(Set dst (CopySignVD dst src));
effect(TEMP xtmp);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vector_copy_sign_avx needs TEMP dst so may need two different instruct rules.

if (elem_sz == 2) {
vpsllw(dst, src, shift, vlen_enc);
} else if (elem_sz == 4) {
vpslld(dst, src, shift, vlen_enc);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AVX 1 supports 256-bit float/double vector and only128-bit vpsll, vpsrl, vpor for integer vectors. So you will have issues on AVX 1 platform for 256bit float/double vector copysign implementation using vpsll, vpsrl, vpor.

//
// Result going from high bit to low bit is 0x11100100 = 0xe4
// ---------------------------------------
#ifdef _LP64
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_LP64 ifdef no more needed in .ad file (32 bit support has been removed).

@bridgekeeper
Copy link

bridgekeeper bot commented Jun 18, 2025

@jatin-bhateja This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@bridgekeeper bridgekeeper bot added the oca Needs verification of OCA signatory status label Jul 15, 2025
@openjdk openjdk bot removed the rfr Pull request is ready for review label Jul 15, 2025
@bridgekeeper bridgekeeper bot removed the oca Needs verification of OCA signatory status label Jul 17, 2025
@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 17, 2025
@bridgekeeper
Copy link

bridgekeeper bot commented Aug 14, 2025

@jatin-bhateja This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

@bridgekeeper bridgekeeper bot closed this Aug 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot-compiler [email protected] rfr Pull request is ready for review

Development

Successfully merging this pull request may close these issues.

6 participants