Skip to content

Conversation

@shqking
Copy link
Contributor

@shqking shqking commented Mar 7, 2022

Implementation

In AArch64 NEON, vector shift right is implemented by vector shift left
instructions (SSHL[1] and USHL[2]) with negative shift count value. In
C2 backend, we generate a neg to given shift value followed by sshl
or ushl instruction.

For vector shift right, the vector shift count has two origins:

  1. it can be duplicated from scalar variable/immediate(case-1),
  2. it can be loaded directly from one vector(case-2).

This patch aims to optimize case-1. Specifically, we move the negate
from RShiftV* rules to RShiftCntV rule. As a result, the negate can be
hoisted outside of the loop if it's a loop invariant.

In this patch,

  1. we split vshiftcnt* rules into vslcnt* and vsrcnt* rules to handle
    shift left and shift right respectively. Compared to vslcnt* rules, the
    negate is conducted in vsrcnt*.
  2. for each vsra* and vsrl* rules, we create one variant, i.e. vsra*_var
    and vsrl*_var. We use vsra* and vsrl* rules to handle case-1, and use
    vsra*_var and vsrl*_var rules to handle case-2. Note that
    ShiftVNode::is_var_shift() can be used to distinguish case-1 from
    case-2.
  3. we add one assertion for the vs*_imm rules as we have done on
    ARM32[3].
  4. several style issues are resolved.

Example

Take function rShiftInt() in the newly added micro benchmark
VectorShiftRight.java as an example.

public void rShiftInt() {
    for (int i = 0; i < SIZE; i++) {
        intsB[i] = intsA[i] >> count;
    }
}

Arithmetic shift right is conducted inside a big loop. The following
code snippet shows the disassembly code generated by auto-vectorization
before we apply current patch. We can see that neg is conducted in the
loop body.

0x0000ffff89057a64:   dup     v16.16b, w13              <-- dup
0x0000ffff89057a68:   mov     w12, #0x7d00                    // #32000
0x0000ffff89057a6c:   sub     w13, w2, w10
0x0000ffff89057a70:   cmp     w2, w10
0x0000ffff89057a74:   csel    w13, wzr, w13, lt
0x0000ffff89057a78:   mov     w8, #0x7d00                     // #32000
0x0000ffff89057a7c:   cmp     w13, w8
0x0000ffff89057a80:   csel    w13, w12, w13, hi
0x0000ffff89057a84:   add     w14, w13, w10
0x0000ffff89057a88:   nop
0x0000ffff89057a8c:   nop
0x0000ffff89057a90:   sbfiz   x13, x10, #2, #32         <-- loop entry
0x0000ffff89057a94:   add     x15, x17, x13
0x0000ffff89057a98:   ldr     q17, [x15,#16]
0x0000ffff89057a9c:   add     x13, x0, x13
0x0000ffff89057aa0:   neg     v18.16b, v16.16b          <-- neg
0x0000ffff89057aa4:   sshl    v17.4s, v17.4s, v18.4s    <-- shift right
0x0000ffff89057aa8:   str     q17, [x13,#16]
0x0000ffff89057aac:   ...
0x0000ffff89057b1c:   add     w10, w10, #0x20
0x0000ffff89057b20:   cmp     w10, w14
0x0000ffff89057b24:   b.lt    0x0000ffff89057a90        <-- loop end

Here is the disassembly code after we apply current patch. We can see
that the negate is no longer conducted inside the loop, and it is
hoisted to the outside.

0x0000ffff8d053a68:   neg     w14, w13                  <---- neg
0x0000ffff8d053a6c:   dup     v16.16b, w14              <---- dup
0x0000ffff8d053a70:   sub     w14, w2, w10
0x0000ffff8d053a74:   cmp     w2, w10
0x0000ffff8d053a78:   csel    w14, wzr, w14, lt
0x0000ffff8d053a7c:   mov     w8, #0x7d00                     // #32000
0x0000ffff8d053a80:   cmp     w14, w8
0x0000ffff8d053a84:   csel    w14, w12, w14, hi
0x0000ffff8d053a88:   add     w13, w14, w10
0x0000ffff8d053a8c:   nop
0x0000ffff8d053a90:   sbfiz   x14, x10, #2, #32         <-- loop entry
0x0000ffff8d053a94:   add     x15, x17, x14
0x0000ffff8d053a98:   ldr     q17, [x15,#16]
0x0000ffff8d053a9c:   sshl    v17.4s, v17.4s, v16.4s    <-- shift right
0x0000ffff8d053aa0:   add     x14, x0, x14
0x0000ffff8d053aa4:   str     q17, [x14,#16]
0x0000ffff8d053aa8:   ...
0x0000ffff8d053afc:   add     w10, w10, #0x20
0x0000ffff8d053b00:   cmp     w10, w13
0x0000ffff8d053b04:   b.lt    0x0000ffff8d053a90        <-- loop end

Testing

Tier1~3 tests passed on Linux/AArch64 platform.

Performance Evaluation

  • Auto-vectorization

One micro benchmark, i.e. VectorShiftRight.java, is added by this patch
in order to evaluate the optimization on vector shift right.

The following table shows the result. Column Score-1 shows the score
before we apply current patch, and column Score-2 shows the score when
we apply current patch.

We witness about 30% ~ 53% improvement on microbenchmarks.

Benchmark                      Units    Score-1    Score-2
VectorShiftRight.rShiftByte   ops/ms  10601.980  13816.353
VectorShiftRight.rShiftInt    ops/ms   3592.831   5502.941
VectorShiftRight.rShiftLong   ops/ms   1584.012   2425.247
VectorShiftRight.rShiftShort  ops/ms   6643.414   9728.762
VectorShiftRight.urShiftByte  ops/ms   2066.965   2048.336 (*)
VectorShiftRight.urShiftChar  ops/ms   6660.805   9728.478
VectorShiftRight.urShiftInt   ops/ms   3592.909   5514.928
VectorShiftRight.urShiftLong  ops/ms   1583.995   2422.991

*: Logical shift right for Byte type(urShiftByte) is not vectorized, as
disscussed in [4].
  • VectorAPI

Furthermore, we also evaluate the impact of this patch on VectorAPI
benchmarks, e.g., [5]. Details can be found in the table below. Columns
Score-1 and Score-2 show the scores before and after applying
current patch.

Benchmark                  Units    Score-1    Score-2
Byte128Vector.LSHL        ops/ms  10867.666  10873.993
Byte128Vector.LSHLShift   ops/ms  10945.729  10945.741
Byte128Vector.LSHR        ops/ms   8629.305   8629.343
Byte128Vector.LSHRShift   ops/ms   8245.864  10303.521   <--
Byte128Vector.ASHR        ops/ms   8619.691   8629.438
Byte128Vector.ASHRShift   ops/ms   8245.860  10305.027   <--
Int128Vector.LSHL         ops/ms   3104.213   3103.702
Int128Vector.LSHLShift    ops/ms   3114.354   3114.371
Int128Vector.LSHR         ops/ms   2380.717   2380.693
Int128Vector.LSHRShift    ops/ms   2312.871   2992.377   <--
Int128Vector.ASHR         ops/ms   2380.668   2380.647
Int128Vector.ASHRShift    ops/ms   2312.894   2992.332   <--
Long128Vector.LSHL        ops/ms   1586.907   1587.591
Long128Vector.LSHLShift   ops/ms   1589.469   1589.540
Long128Vector.LSHR        ops/ms   1209.754   1209.687
Long128Vector.LSHRShift   ops/ms   1174.718   1527.502   <--
Long128Vector.ASHR        ops/ms   1209.713   1209.669
Long128Vector.ASHRShift   ops/ms   1174.712   1527.174   <--
Short128Vector.LSHL       ops/ms   5945.542   5943.770
Short128Vector.LSHLShift  ops/ms   5984.743   5984.640
Short128Vector.LSHR       ops/ms   4613.378   4613.577
Short128Vector.LSHRShift  ops/ms   4486.023   5746.466   <--
Short128Vector.ASHR       ops/ms   4613.389   4613.478
Short128Vector.ASHRShift  ops/ms   4486.019   5746.368   <--
  1. For logical shift left(LSHL and LSHLShift), and shift right with
    variable vector shift count(LSHR and ASHR) cases, we didn't find much
    changes, which is expected.

  2. For shift right with scalar shift count(LSHRShift and ASHRShift)
    case, about 25% ~ 30% improvement can be observed, and this benefit is
    introduced by current patch.

[1] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/SSHL--Signed-Shift-Left--register--
[2] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/USHL--Unsigned-Shift-Left--register--
[3] openjdk/jdk18#41
[4] #1087
[5] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L509


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8265263: AArch64: Combine vneg with right shift count

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/7724/head:pull/7724
$ git checkout pull/7724

Update a local copy of the PR:
$ git checkout pull/7724
$ git pull https://git.openjdk.java.net/jdk pull/7724/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 7724

View PR using the GUI difftool:
$ git pr show -t 7724

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/7724.diff

*** Implementation

In AArch64 NEON, vector shift right is implemented by vector shift left
instructions (SSHL[1] and USHL[2]) with negative shift count value. In
C2 backend, we generate a `neg` to given shift value followed by `sshl`
or `ushl` instruction.

For vector shift right, the vector shift count has two origins:
1) it can be duplicated from scalar variable/immediate(case-1),
2) it can be loaded directly from one vector(case-2).

This patch aims to optimize case-1. Specifically, we move the negate
from RShiftV* rules to RShiftCntV rule. As a result, the negate can be
hoisted outside of the loop if it's a loop invariant.

In this patch,
1) we split vshiftcnt* rules into vslcnt* and vsrcnt* rules to handle
shift left and shift right respectively. Compared to vslcnt* rules, the
negate is conducted in vsrcnt*.
2) for each vsra* and vsrl* rules, we create one variant, i.e. vsra*_var
and vsrl*_var. We use vsra* and vsrl* rules to handle case-1, and use
vsra*_var and vsrl*_var rules to handle case-2. Note that
ShiftVNode::is_var_shift() can be used to distinguish case-1 from
case-2.
3) we add one assertion for the vs*_imm rules as we have done on
ARM32[3].
4) several style issues are resolved.

*** Example

Take function `rShiftInt()` in the newly added micro benchmark
VectorShiftRight.java as an example.

```
public void rShiftInt() {
    for (int i = 0; i < SIZE; i++) {
        intsB[i] = intsA[i] >> count;
    }
}
```

Arithmetic shift right is conducted inside a big loop. The following
code snippet shows the disassembly code generated by auto-vectorization
before we apply current patch. We can see that `neg` is conducted in the
loop body.

```
0x0000ffff89057a64:   dup     v16.16b, w13              <-- dup
0x0000ffff89057a68:   mov     w12, #0x7d00                    // #32000
0x0000ffff89057a6c:   sub     w13, w2, w10
0x0000ffff89057a70:   cmp     w2, w10
0x0000ffff89057a74:   csel    w13, wzr, w13, lt
0x0000ffff89057a78:   mov     w8, #0x7d00                     // #32000
0x0000ffff89057a7c:   cmp     w13, w8
0x0000ffff89057a80:   csel    w13, w12, w13, hi
0x0000ffff89057a84:   add     w14, w13, w10
0x0000ffff89057a88:   nop
0x0000ffff89057a8c:   nop
0x0000ffff89057a90:   sbfiz   x13, x10, openjdk#2, openjdk#32         <-- loop entry
0x0000ffff89057a94:   add     x15, x17, x13
0x0000ffff89057a98:   ldr     q17, [x15,openjdk#16]
0x0000ffff89057a9c:   add     x13, x0, x13
0x0000ffff89057aa0:   neg     v18.16b, v16.16b          <-- neg
0x0000ffff89057aa4:   sshl    v17.4s, v17.4s, v18.4s    <-- shift right
0x0000ffff89057aa8:   str     q17, [x13,openjdk#16]
0x0000ffff89057aac:   ...
0x0000ffff89057b1c:   add     w10, w10, #0x20
0x0000ffff89057b20:   cmp     w10, w14
0x0000ffff89057b24:   b.lt    0x0000ffff89057a90        <-- loop end
```

Here is the disassembly code after we apply current patch. We can see
that the negate is no longer conducted inside the loop, and it is
hoisted to the outside.

```
0x0000ffff8d053a68:   neg     w14, w13                  <---- neg
0x0000ffff8d053a6c:   dup     v16.16b, w14              <---- dup
0x0000ffff8d053a70:   sub     w14, w2, w10
0x0000ffff8d053a74:   cmp     w2, w10
0x0000ffff8d053a78:   csel    w14, wzr, w14, lt
0x0000ffff8d053a7c:   mov     w8, #0x7d00                     // #32000
0x0000ffff8d053a80:   cmp     w14, w8
0x0000ffff8d053a84:   csel    w14, w12, w14, hi
0x0000ffff8d053a88:   add     w13, w14, w10
0x0000ffff8d053a8c:   nop
0x0000ffff8d053a90:   sbfiz   x14, x10, openjdk#2, openjdk#32         <-- loop entry
0x0000ffff8d053a94:   add     x15, x17, x14
0x0000ffff8d053a98:   ldr     q17, [x15,openjdk#16]
0x0000ffff8d053a9c:   sshl    v17.4s, v17.4s, v16.4s    <-- shift right
0x0000ffff8d053aa0:   add     x14, x0, x14
0x0000ffff8d053aa4:   str     q17, [x14,openjdk#16]
0x0000ffff8d053aa8:   ...
0x0000ffff8d053afc:   add     w10, w10, #0x20
0x0000ffff8d053b00:   cmp     w10, w13
0x0000ffff8d053b04:   b.lt    0x0000ffff8d053a90        <-- loop end
```

*** Testing

Tier1~3 tests passed on Linux/AArch64 platform.

*** Performance Evaluation

- Auto-vectorization

One micro benchmark, i.e. VectorShiftRight.java, is added by this patch
in order to evaluate the optimization on vector shift right.

The following table shows the result. Column `Score-1` shows the score
before we apply current patch, and column `Score-2` shows the score when
we apply current patch.

We witness about 30% ~ 53% improvement on microbenchmarks.

```
Benchmark                      Units    Score-1    Score-2
VectorShiftRight.rShiftByte   ops/ms  10601.980  13816.353
VectorShiftRight.rShiftInt    ops/ms   3592.831   5502.941
VectorShiftRight.rShiftLong   ops/ms   1584.012   2425.247
VectorShiftRight.rShiftShort  ops/ms   6643.414   9728.762
VectorShiftRight.urShiftByte  ops/ms   2066.965   2048.336 (*)
VectorShiftRight.urShiftChar  ops/ms   6660.805   9728.478
VectorShiftRight.urShiftInt   ops/ms   3592.909   5514.928
VectorShiftRight.urShiftLong  ops/ms   1583.995   2422.991

*: Logical shift right for Byte type(urShiftByte) is not vectorized, as
disscussed in [4].
```

- VectorAPI

Furthermore, we also evaluate the impact of this patch on VectorAPI
benchmarks, e.g., [5]. Details can be found in the table below. Columns
`Score-1` and `Score-2` show the scores before and after applying
current patch.

```
Benchmark                  Units    Score-1    Score-2
Byte128Vector.LSHL        ops/ms  10867.666  10873.993
Byte128Vector.LSHLShift   ops/ms  10945.729  10945.741
Byte128Vector.LSHR        ops/ms   8629.305   8629.343
Byte128Vector.LSHRShift   ops/ms   8245.864  10303.521   <--
Byte128Vector.ASHR        ops/ms   8619.691   8629.438
Byte128Vector.ASHRShift   ops/ms   8245.860  10305.027   <--
Int128Vector.LSHL         ops/ms   3104.213   3103.702
Int128Vector.LSHLShift    ops/ms   3114.354   3114.371
Int128Vector.LSHR         ops/ms   2380.717   2380.693
Int128Vector.LSHRShift    ops/ms   2312.871   2992.377   <--
Int128Vector.ASHR         ops/ms   2380.668   2380.647
Int128Vector.ASHRShift    ops/ms   2312.894   2992.332   <--
Long128Vector.LSHL        ops/ms   1586.907   1587.591
Long128Vector.LSHLShift   ops/ms   1589.469   1589.540
Long128Vector.LSHR        ops/ms   1209.754   1209.687
Long128Vector.LSHRShift   ops/ms   1174.718   1527.502   <--
Long128Vector.ASHR        ops/ms   1209.713   1209.669
Long128Vector.ASHRShift   ops/ms   1174.712   1527.174   <--
Short128Vector.LSHL       ops/ms   5945.542   5943.770
Short128Vector.LSHLShift  ops/ms   5984.743   5984.640
Short128Vector.LSHR       ops/ms   4613.378   4613.577
Short128Vector.LSHRShift  ops/ms   4486.023   5746.466   <--
Short128Vector.ASHR       ops/ms   4613.389   4613.478
Short128Vector.ASHRShift  ops/ms   4486.019   5746.368   <--
```

1) For logical shift left(LSHL and LSHLShift), and shift right with
variable vector shift count(LSHR and ASHR) cases, we didn't find much
changes, which is expected.

2) For shift right with scalar shift count(LSHRShift and ASHRShift)
case, about 25% ~ 30% improvement can be observed, and this benefit is
introduced by current patch.

[1] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/SSHL--Signed-Shift-Left--register--
[2] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/USHL--Unsigned-Shift-Left--register--
[3] openjdk/jdk18#41
[4] openjdk#1087
[5] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L509
@bridgekeeper
Copy link

bridgekeeper bot commented Mar 7, 2022

👋 Welcome back haosun! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Mar 7, 2022
@openjdk
Copy link

openjdk bot commented Mar 7, 2022

@shqking The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@mlbridge
Copy link

mlbridge bot commented Mar 7, 2022

Webrevs

Copy link
Contributor

@adinn adinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine. Very nice work.

@openjdk
Copy link

openjdk bot commented Mar 7, 2022

@shqking This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8265263: AArch64: Combine vneg with right shift count

Reviewed-by: adinn, dlong

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 16 new commits pushed to the master branch:

  • 3f0684d: 8275775: Add jcmd VM.classes to print details of all classes
  • cde923d: 8282690: runtime/CommandLine/VMDeprecatedOptions.java fails after JDK-8281181
  • 50eb915: 8282632: Cleanup unnecessary calls to Throwable.initCause() in java.security.jgss
  • ccad392: 8282657: Code cleanup: removing double semicolons at the end of lines
  • 5d5bf16: 8282567: Improve source-date handling in build system
  • 3996782: 8281093: Violating Attribute-Value Normalization in the XML specification 1.0
  • 2e298b8: 8272691: Fix HotSpot style guide terminology for "non-local variables"
  • 5953b22: 8257589: HotSpot Style Guide should link to rfc7282
  • 1faa5c8: 8282686: Add constructors taking a cause to SocketException
  • 7194097: 8252577: HotSpot Style Guide should link to One-True-Brace-Style description
  • ... and 6 more: https://git.openjdk.java.net/jdk/compare/894ffb098c80bfeb4209038c017d01dbf53fac0f...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@adinn, @dean-long) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Mar 7, 2022
@shqking
Copy link
Contributor Author

shqking commented Mar 7, 2022

Hi @dean-long this is the follow-up patch to our previous commit on ARM32. Could you help to review it when you got a chance? Thanks.

@shqking
Copy link
Contributor Author

shqking commented Mar 8, 2022

Thanks a lot for your review!
/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Mar 8, 2022
@openjdk
Copy link

openjdk bot commented Mar 8, 2022

@shqking
Your change (at version 20b6762) is now ready to be sponsored by a Committer.

@pfustc
Copy link
Member

pfustc commented Mar 9, 2022

/sponsor

@openjdk
Copy link

openjdk bot commented Mar 9, 2022

Going to push as commit 4924513.
Since your change was applied there have been 30 commits pushed to the master branch:

  • ea19114: 8282832: Update file path for HostnameMatcher/cert5.crt in test sun/security/util/Pem/encoding.sh
  • 72e987e: 7192189: Support endpoint identification algorithm in RFC 6125
  • 288d1af: 8282715: typo compileony in test Test8005033.java
  • 6b34884: 8282234: Create a regression test for JDK-4532513
  • 3fc009b: 8281560: Matcher.hitEnd returns unexpected results in presence of CANON_EQ flag.
  • 2549e55: 8275640: (win) java.net.NetworkInterface issues with IPv6-only environments
  • 3e4dfc6: 8282295: SymbolPropertyEntry::set_method_type fails with assert
  • 0cbc4b8: 8281266: [JVMCI] MetaUtil.toInternalName() doesn't handle hidden classes correctly
  • 0f88fc1: 8282769: BSD date cannot handle all ISO 8601 formats
  • c6d743f: 8282770: Set source date in jib profiles from buildId
  • ... and 20 more: https://git.openjdk.java.net/jdk/compare/894ffb098c80bfeb4209038c017d01dbf53fac0f...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Mar 9, 2022
@openjdk openjdk bot closed this Mar 9, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Mar 9, 2022
@openjdk
Copy link

openjdk bot commented Mar 9, 2022

@pfustc @shqking Pushed as commit 4924513.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@openjdk openjdk bot removed the sponsor Pull request is ready to be sponsored label Mar 9, 2022
@shqking shqking deleted the neon-vshr branch March 9, 2022 01:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot-compiler [email protected] integrated Pull request has been integrated

Development

Successfully merging this pull request may close these issues.

4 participants