8349138: Optimize Math.copySign API for Intel e-core targets #23386

jatin-bhateja · 2025-01-31T11:22:47Z

Math.copySign is only intrinsified on x86 targets that support AVX512.
Intel E-core Xeons support only the AVX2 feature set, so they still compile the Java implementation, which is composed of logical operations.

Copying the incoming float/double values to GPRs before the logical operations are applied incurs a 3-cycle penalty, so there is an opportunity to optimize this with a more efficient instruction sequence.
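For readers unfamiliar with the fallback path, here is a minimal sketch of what "composed of logical operations" means. It is not the JDK source, which is not shown in this thread; it assumes the usual sign-bit masking formulation, and the class name `CopySignBits` is hypothetical. The float-to-int reinterpretation is the step that moves a value from a vector register to a GPR when compiled without an intrinsic.

```java
// Illustrative only: a bitwise copySign in the style of a non-intrinsified
// implementation. CopySignBits is a hypothetical name, not a JDK class.
final class CopySignBits {
    static float copySign(float magnitude, float sign) {
        // Reinterpret the floats as ints, then combine with logical ops.
        int magBits  = Float.floatToRawIntBits(magnitude) & 0x7FFFFFFF; // clear the sign bit
        int signBits = Float.floatToRawIntBits(sign) & 0x80000000;      // keep only the sign bit
        return Float.intBitsToFloat(magBits | signBits);
    }

    static double copySign(double magnitude, double sign) {
        long magBits  = Double.doubleToRawLongBits(magnitude) & 0x7FFFFFFFFFFFFFFFL;
        long signBits = Double.doubleToRawLongBits(sign) & 0x8000000000000000L;
        return Double.longBitsToDouble(magBits | signBits);
    }

    public static void main(String[] args) {
        System.out.println(copySign(1.5f, -0.0f)); // prints -1.5
        System.out.println(copySign(-2.0, 3.0));   // prints 2.0
    }
}
```

Both masks are integer constants, which is why the scalar code ends up operating in general-purpose registers.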

The patch uses the ANDPS and ANDPD logical instructions to generate efficient instruction sequences that absorb the domain crossover penalty. It also performs minor tuning of the existing AVX512 instruction sequence, which is based on the VPTERNLOG instruction.

Performance numbers for the following existing microbenchmark are listed below:
https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/vm/compiler/Signum.java

The patch passes the following validation test:
test/jdk/java/lang/Math/IeeeRecommendedTests.java


Granite Rapids-AP (P-core Xeon)
Baseline AVX512:
Benchmark                      Mode  Cnt     Score   Error   Units
Signum._5_copySignFloatTest   thrpt    2  1296.141          ops/ns
Signum._7_copySignDoubleTest  thrpt    2   838.954          ops/ns

With opt:
Benchmark                      Mode  Cnt    Score   Error   Units
Signum._5_copySignFloatTest   thrpt    2  940.240          ops/ns
Signum._7_copySignDoubleTest  thrpt    2  967.370          ops/ns

Baseline AVX2:
Benchmark                      Mode  Cnt   Score   Error   Units
Signum._5_copySignFloatTest   thrpt    2  63.673          ops/ns
Signum._7_copySignDoubleTest  thrpt    2  26.898          ops/ns

With opt:
Benchmark                      Mode  Cnt    Score   Error   Units
Signum._5_copySignFloatTest   thrpt    2  785.801          ops/ns
Signum._7_copySignDoubleTest  thrpt    2  558.710          ops/ns

Sierra Forest (E-core Xeon)
Baseline:
Benchmark                                       (seed)   Mode  Cnt        Score   Error   Units
o.o.b.vm.compiler.Signum._5_copySignFloatTest      N/A  thrpt    2       40.528          ops/ns
o.o.b.vm.compiler.Signum._7_copySignDoubleTest     N/A  thrpt    2       25.101          ops/ns

With opt:
Benchmark                                       (seed)   Mode  Cnt        Score   Error   Units
o.o.b.vm.compiler.Signum._5_copySignFloatTest      N/A  thrpt    2      676.101          ops/ns
o.o.b.vm.compiler.Signum._7_copySignDoubleTest     N/A  thrpt    2      605.714          ops/ns
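To make the tables easier to compare, the following sketch computes the with-opt/baseline ratio for each row. The scores are copied from the tables above; the class name `CopySignSpeedup` and the row labels are made up for illustration. With Cnt = 2 and no error column, the ratios are indicative only.

```java
// Illustrative only: speedup ratios derived from the scores reported above.
final class CopySignSpeedup {
    static double speedup(double baseline, double withOpt) {
        return withOpt / baseline;
    }

    public static void main(String[] args) {
        String[] labels = {
            "GNR AVX512 float", "GNR AVX512 double",
            "GNR AVX2 float",   "GNR AVX2 double",
            "SRF float",        "SRF double"
        };
        // Each row is {baseline, with opt}, in the units reported by the benchmark.
        double[][] scores = {
            {1296.141, 940.240}, {838.954, 967.370},
            {63.673, 785.801},   {26.898, 558.710},
            {40.528, 676.101},   {25.101, 605.714}
        };
        for (int i = 0; i < labels.length; i++) {
            System.out.printf("%-18s %6.2fx%n", labels[i], speedup(scores[i][0], scores[i][1]));
        }
    }
}
```

A ratio below 1.0 means the with-opt score is lower than the baseline for that row, which is the case for the AVX512 float row.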

The new instruction sequence is vector friendly and will be vectorized in a follow-up patch.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8349138: Optimize Math.copySign API for Intel e-core targets (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23386/head:pull/23386
$ git checkout pull/23386

Update a local copy of the PR:
$ git checkout pull/23386
$ git pull https://git.openjdk.org/jdk.git pull/23386/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 23386

View PR using the GUI difftool:
$ git pr show -t 23386

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23386.diff

Using Webrev

Link to Webrev Comment

jatin-bhateja · 2025-01-31T11:25:58Z

/label add hotspot-compiler-dev

bridgekeeper · 2025-01-31T11:31:34Z

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-01-31T11:32:01Z

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

openjdk · 2025-01-31T11:32:44Z

@jatin-bhateja
The hotspot-compiler label was successfully added.

mlbridge · 2025-01-31T11:36:31Z

Webrevs

jaskarth

I think this is a good improvement! Having more intrinsics available for AVX2 targets is nice. I've left some comments below.

jaskarth · 2025-02-01T22:19:46Z

src/hotspot/cpu/x86/x86.ad

    case Op_CopySignD:
    case Op_CopySignF:
-      if (UseAVX < 3 || !is_LP64)  {
+      if (UseAVX < 1 || !is_LP64)  {


Should this be limited to AVX2, or can the new rules work on AVX1 as well, since they only use instructions that are available on AVX1?

jaskarth · 2025-02-01T22:19:57Z

src/hotspot/cpu/x86/x86.ad

 #endif // _LP64

+instruct copySignF_reg_avx(regF dst, regF src, regF xtmp1, regF xtmp2) %{
+  predicate(!VM_Version::supports_avx512vl());


Suggested change

-  predicate(!VM_Version::supports_avx512vl());
+  predicate(UseAVX > 0 && !VM_Version::supports_avx512vl());

Just to be a bit more explicit (and same for the one below).

It's already handled by the match_rule_supported constraint.

jatin-bhateja · 2025-02-03T12:04:28Z

src/hotspot/cpu/x86/x86.ad

-    __ movl($tmp2$$Register, 0x7FFFFFFF);
-    __ movdl($tmp1$$XMMRegister, $tmp2$$Register);
-    __ vpternlogd($dst$$XMMRegister, 0xE4, $src$$XMMRegister, $tmp1$$XMMRegister, Assembler::AVX_128bit);
+    __ vpcmpeqd($xtmp1$$XMMRegister, $xtmp1$$XMMRegister, $xtmp1$$XMMRegister, Assembler::AVX_128bit);


If any of the vector operands is from the higher register bank (16-31), we need an EVEX encoding, and in that case the result of the comparison is always an opmask register.
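As background for the removed `vpternlogd(..., 0xE4, ...)` line in the hunk above, here is a small Java emulation of a three-input bitwise lookup on 32-bit values. It is a sketch under stated assumptions: the index order (first operand as the most significant index bit, third as the least) follows the instruction's documented truth-table convention, and the mapping of operands to roles (first = magnitude, second = sign, third = the 0x7FFFFFFF mask) is inferred from the hunk, not confirmed by the patch. `TernlogSelect` is a hypothetical name.

```java
// Illustrative only: emulates a per-bit 3-input lookup table on 32-bit lanes.
final class TernlogSelect {
    // For every bit position, the bits of a, b, c form a 3-bit index
    // (a is the high bit, c the low bit) that selects one bit of imm8.
    static int ternlog(int a, int b, int c, int imm8) {
        int result = 0;
        for (int bit = 0; bit < 32; bit++) {
            int idx = (((a >>> bit) & 1) << 2)
                    | (((b >>> bit) & 1) << 1)
                    |  ((c >>> bit) & 1);
            result |= ((imm8 >>> idx) & 1) << bit;
        }
        return result;
    }

    // With imm8 = 0xE4 the table computes "c ? a : b" per bit, so a mask of
    // 0x7FFFFFFF takes the low 31 bits from the magnitude and the sign bit
    // from the sign operand.
    static float copySign(float magnitude, float sign) {
        int bits = ternlog(Float.floatToRawIntBits(magnitude),
                           Float.floatToRawIntBits(sign),
                           0x7FFFFFFF, 0xE4);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        System.out.println(copySign(1.5f, -2.0f)); // prints -1.5
    }
}
```

The point of the lookup-table form is that a single instruction performs the whole select once the mask is in a register.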

jatin-bhateja · 2025-02-03T12:05:42Z

src/hotspot/cpu/x86/x86.ad

 #endif // _LP64

+instruct copySignF_reg_avx(regF dst, regF src, regF xtmp1, regF xtmp2) %{
+  predicate(!VM_Version::supports_avx512vl());


It's already handled by the match_rule_supported constraint.

merykitty · 2025-02-03T14:36:54Z

Could you instead do this by trying to transform AndI(MoveF2I(x), MoveF2I(y)) into AndF(x, y) instead?

jatin-bhateja · 2025-02-04T06:14:11Z

> Could you instead do this by trying to transform AndI(MoveF2I(x), MoveF2I(y)) into AndF(x, y) instead?

@merykitty, this patch does not break existing IR invariants, as multiple targets already emit efficient instruction sequences for it; we have just improved the x86 backend implementation.

Introducing another new IR "AndF" will again need changes in auto-vectorizer.

merykitty · 2025-02-04T16:47:03Z

@jatin-bhateja Doing the transformation to AndF would be a more general solution and thus better.

> Introducing another new IR "AndF" will again need changes in auto-vectorizer.

But currently, CopySign and MoveF2I are not vectorized anyway so we can do the vectorization of AndF in a separate patch without much hassle. AndF is vectorized into existing AndV nicely so it is not a too complicated work.

merykitty · 2025-02-04T16:51:33Z

> this patch does not break existing IR invariants

Also, what invariant can be broken by transforming AndI(MoveF2I(x), MoveF2I(y)) into MoveF2I(AndF(x, y))?

jatin-bhateja · 2025-02-04T17:41:34Z

> @jatin-bhateja Doing the transformation to AndF would be a more general solution and thus better.

> > Introducing another new IR "AndF" will again need changes in auto-vectorizer.

> But currently, CopySign and MoveF2I are not vectorized anyway so we can do the vectorization of AndF in a separate patch without much hassle. AndF is vectorized into existing AndV nicely so it is not a too complicated work.

Yes, I have a follow-up patch to auto-vectorize CopySign.

> > this patch does not break existing IR invariants

> Also, what invariant can be broken by transforming AndI(MoveF2I(x), MoveF2I(y)) into MoveF2I(AndF(x, y))?

Hi @merykitty, I meant that, in the context of CopySign, targets already emit efficient instruction sequences for the existing IR (CopySignF/D); this patch simply tunes the x86 backend implementation to improve performance.

TobiHartmann · 2025-02-05T07:37:28Z

compiler/intrinsics/math/TestCopySignIntrinsic.java fails with -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:+TieredCompilation on Mac x64:

1) Method "public void compiler.intrinsics.math.TestCopySignIntrinsic.testCopySignD()" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#COPYSIGN_D#_", " >0 "}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={"avx", "true"}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(CopySignD.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!

2) Method "public void compiler.intrinsics.math.TestCopySignIntrinsic.testCopySignF()" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#COPYSIGN_F#_", " >0 "}, applyIfPlatform={}, applyIfPlatformOr={}, failOn={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={"avx", "true"}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(CopySignF.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!

jatin-bhateja · 2025-02-12T12:07:16Z

> @jatin-bhateja Doing the transformation to AndF would be a more general solution and thus better.

> > Introducing another new IR "AndF" will again need changes in auto-vectorizer.

> But currently, CopySign and MoveF2I are not vectorized anyway so we can do the vectorization of AndF in a separate patch without much hassle. AndF is vectorized into existing AndV nicely so it is not a too complicated work.

Yes, I have a follow-up patch to auto-vectorize CopySign.

> > this patch does not break existing IR invariants

> Also, what invariant can be broken by transforming AndI(MoveF2I(x), MoveF2I(y)) into MoveF2I(AndF(x, y))?

Hi @merykitty, I meant that, in the context of CopySign, targets already emit efficient instruction sequences for the existing IR (CopySignF/D); this patch simply tunes the x86 backend implementation to improve performance.

Also currently, logical And mask is a long value, in case we opt-in for new AndF/D node creation, to preserve the IR semantics we would also need to perform an integral to floating point constant conversion, this will incur additional memory load penalty since floating-point constants are emitted into the constant table before native method body.

For the time being, taking CopySign intrinsic route looks reasonable.

eme64 · 2025-02-13T09:18:00Z

@jatin-bhateja let me know when this is ready for more testing / review.

Quick comment: it seems you are not just optimizing Math.copySign as the PR title says, but also adding vector nodes. Maybe you should update the PR title? Have not looked at the code in detail to suggest a better one yet ;)

merykitty · 2025-02-13T11:01:34Z

> Also currently, logical And mask is a long value, in case we opt-in for new AndF/D node creation, to preserve the IR semantics we would also need to perform an integral to floating point constant conversion, this will incur additional memory load penalty since floating-point constants are emitted into the constant table before native method body.

That means we can improve the generation of floating-point constants.

The reason I object to this approach is that it is short-sighted. It's not like we cannot generate similar machine code with the more general approach. Furthermore, after we do AndF transformations, this patch is redundant and can be removed entirely.

jatin-bhateja · 2025-02-20T15:47:55Z

> @jatin-bhateja let me know when this is ready for more testing / review.

> Quick comment: it seems you are not just optimizing Math.copySign as the PR title says, but also adding vector nodes. Maybe you should update the PR title? Have not looked at the code in detail to suggest a better one yet ;)

Hi @eme64 , vectorization is a form of optimization, so the title is generic enough to cover both vector and scalar performance.
Let me know if you have other comments.

jatin-bhateja · 2025-02-20T15:50:50Z

> > Also currently, logical And mask is a long value, in case we opt-in for new AndF/D node creation, to preserve the IR semantics we would also need to perform an integral to floating point constant conversion, this will incur additional memory load penalty since floating-point constants are emitted into the constant table before native method body.

> That means we can improve the generation of floating-point constants.

> The reason I object to this approach is that it is short-sighted. It's not like we cannot generate similar machine code with the more general approach. Furthermore, after we do AndF transformations, this patch is redundant and can be removed entirely.

Hi @merykitty, the patch intends to absorb the domain crossover penalty caused by moving floating-point arguments to GPRs; if we introduce a floating-point constant load penalty, we may degrade performance.

bridgekeeper · 2025-03-20T16:21:20Z

@jatin-bhateja This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

eme64

The non-x64-specific code looks reasonable, though I have 2 comments ;)

eme64 · 2025-04-02T11:30:31Z

test/hotspot/jtreg/compiler/intrinsics/math/TestCopySignIntrinsic.java

+        IntStream.range(0, SIZE - 8).forEach(i -> { fmagnitude[i] = rd.nextFloat(-Float.MAX_VALUE, Float.MAX_VALUE); });
+        IntStream.range(0, SIZE - 8).forEach(i -> { dmagnitude[i] = rd.nextFloat(-Float.MAX_VALUE, Float.MAX_VALUE); });
+        IntStream.range(0, SIZE).forEach(i -> { fsign[i] = rd.nextFloat(-Float.MAX_VALUE, Float.MAX_VALUE); });
+        IntStream.range(0, SIZE).forEach(i -> { dsign[i] = rd.nextFloat(-Float.MAX_VALUE, Float.MAX_VALUE); });


Why not use Generators.java? That would also give you NaN, infinity, etc. ;)

test/hotspot/jtreg/compiler/intrinsics/math/TestCopySignIntrinsic.java

bridgekeeper · 2025-04-30T15:55:57Z

@jatin-bhateja This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

jatin-bhateja · 2025-05-01T11:22:31Z

/open

openjdk · 2025-05-01T11:23:08Z

@jatin-bhateja This pull request is now open

eme64 · 2025-05-19T06:54:05Z

test/hotspot/jtreg/compiler/intrinsics/math/TestCopySignIntrinsic.java

Two comments:

You probably wanted to use the double generator for the double arrays, right?

You can fill a whole array directly with e.g. Generators.G.fill(genFloat, fmagnitude).

eme64 · 2025-05-19T06:56:08Z

test/hotspot/jtreg/compiler/intrinsics/math/TestCopySignIntrinsic.java

You could add a comment that we consider NaN with different encoding as the same value.
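One way to express "NaNs with different encodings count as the same value" in plain Java is to compare canonicalized bit patterns. This is a sketch, not the test's actual verification code (which is not shown in this thread); `NaNAwareEquals` and `sameValue` are hypothetical names. It relies on `Float.floatToIntBits` collapsing every NaN to one canonical pattern while still distinguishing 0.0f from -0.0f.

```java
// Illustrative only: NaN-insensitive, sign-of-zero-sensitive comparison.
final class NaNAwareEquals {
    static boolean sameValue(float expected, float actual) {
        // floatToIntBits maps all NaN encodings to the canonical NaN,
        // so NaNs with different sign or payload bits compare equal here.
        return Float.floatToIntBits(expected) == Float.floatToIntBits(actual);
    }

    static boolean sameValue(double expected, double actual) {
        return Double.doubleToLongBits(expected) == Double.doubleToLongBits(actual);
    }

    public static void main(String[] args) {
        float negNaN = Math.copySign(Float.NaN, -1.0f);     // still a NaN
        System.out.println(sameValue(Float.NaN, negNaN));   // prints true
        System.out.println(sameValue(0.0f, -0.0f));         // prints false
    }
}
```

Comparing raw bits instead (`floatToRawIntBits`) would be stricter and could flag two NaN results as different even though both are valid.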

sviswa7 · 2025-05-20T22:36:44Z

src/hotspot/cpu/x86/x86.ad

+#endif // _LP64

-instruct copySignF_reg(regF dst, regF src, regF tmp1, rRegI tmp2) %{
+instruct copySignF_reg_avx(regF dst, regF src, regF xtmp) %{


These should be vlRegF.

sviswa7 · 2025-05-20T22:38:32Z

src/hotspot/cpu/x86/x86.ad

 %}

-instruct copySignD_imm(regD dst, regD src, regD tmp1, rRegL tmp2, immD zero) %{
+instruct copySignD_imm_avx(regD dst, regD src, regD xtmp, immD zero) %{


These should be vlRegD.

sviswa7 · 2025-05-20T22:41:15Z

src/hotspot/cpu/x86/x86.ad

+instruct copySignV_reg(vec dst, vec src, vec xtmp) %{
+  match(Set dst (CopySignVF dst src));
+  match(Set dst (CopySignVD dst src));
+  effect(TEMP xtmp);


vector_copy_sign_avx needs TEMP dst, so this may need two different instruct rules.

sviswa7 · 2025-05-20T23:01:29Z

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp

+  if (elem_sz == 2) {
+    vpsllw(dst, src, shift, vlen_enc);
+  } else if (elem_sz == 4) {
+    vpslld(dst, src, shift, vlen_enc);


AVX1 supports 256-bit float/double vectors, but only 128-bit vpsll, vpsrl, and vpor for integer vectors. So a 256-bit float/double vector copysign implementation that uses vpsll, vpsrl, and vpor will have issues on AVX1 platforms.
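To see why integer shift and OR support matters here, consider a copySign built only from shifts and an OR, with no mask constant. This is a scalar Java sketch of that formulation; the exact instruction sequence in the patch is not shown in this thread, so treat the shape below as an assumption. `ShiftCopySign` is a hypothetical name.

```java
// Illustrative only: copySign expressed with shifts and OR, no constants.
final class ShiftCopySign {
    static float copySign(float magnitude, float sign) {
        int mag = (Float.floatToRawIntBits(magnitude) << 1) >>> 1; // shift the sign bit out and back in as 0
        int sgn = (Float.floatToRawIntBits(sign) >>> 31) << 31;    // keep only the top bit
        return Float.intBitsToFloat(mag | sgn);
    }

    static double copySign(double magnitude, double sign) {
        long mag = (Double.doubleToRawLongBits(magnitude) << 1) >>> 1;
        long sgn = (Double.doubleToRawLongBits(sign) >>> 63) << 63;
        return Double.longBitsToDouble(mag | sgn);
    }

    public static void main(String[] args) {
        System.out.println(copySign(4.0f, -1.0f)); // prints -4.0
        System.out.println(copySign(-4.0, 1.0));   // prints 4.0
    }
}
```

Every step operates on the integer view of the lanes, so a vector version needs integer shifts and OR at the full vector width.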

sviswa7 · 2025-05-20T23:13:46Z

src/hotspot/cpu/x86/x86.ad

-//
-// Result going from high bit to low bit is 0x11100100 = 0xe4
-// ---------------------------------------
+#ifdef _LP64


The _LP64 ifdef is no longer needed in the .ad file (32-bit support has been removed).

bridgekeeper · 2025-06-18T04:59:30Z

@jatin-bhateja This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper · 2025-08-14T20:02:15Z

@jatin-bhateja This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

8349138: Optimize Math.copySign API for Intel e-core and p-core targets

d620eb6

jatin-bhateja marked this pull request as ready for review January 31, 2025 11:30

openjdk bot added the rfr (Pull request is ready for review) and hotspot-compiler labels Jan 31, 2025

jatin-bhateja marked this pull request as draft January 31, 2025 12:12

openjdk bot removed the rfr Pull request is ready for review label Jan 31, 2025

jatin-bhateja marked this pull request as ready for review January 31, 2025 12:16

openjdk bot added the rfr Pull request is ready for review label Jan 31, 2025

jaskarth reviewed Feb 1, 2025

Adding IR framework verification test

2181850

jatin-bhateja commented Feb 3, 2025

Adding vector support along with some refactoring.

a254873

eme64 suggested changes Apr 2, 2025

bridgekeeper bot closed this Apr 30, 2025

openjdk bot reopened this May 1, 2025

Jatin Bhateja added 2 commits May 6, 2025 07:27

Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8349138

d3c3f1d

Review comments resolutions

ecc3658

eme64 suggested changes May 19, 2025

sviswa7 reviewed May 20, 2025

bridgekeeper bot added the oca Needs verification of OCA signatory status label Jul 15, 2025

openjdk bot removed the rfr Pull request is ready for review label Jul 15, 2025

bridgekeeper bot removed the oca Needs verification of OCA signatory status label Jul 17, 2025

openjdk bot added the rfr Pull request is ready for review label Jul 17, 2025

bridgekeeper bot closed this Aug 14, 2025
