Commit 20b6762

8265263: AArch64: Combine vneg with right shift count
*** Implementation

In AArch64 NEON, vector shift right is implemented by the vector shift left instructions (SSHL[1] and USHL[2]) with a negative shift count. In the C2 backend, we generate a `neg` of the given shift value followed by an `sshl` or `ushl` instruction.

For vector shift right, the vector shift count has two origins:
1) it can be duplicated from a scalar variable or immediate (case-1);
2) it can be loaded directly from a vector (case-2).
(An illustrative Vector API sketch contrasting the two cases is appended after the references below.)

This patch optimizes case-1. Specifically, we move the negate from the RShiftV* rules to the RShiftCntV rule, so that the negate can be hoisted out of the loop when it is a loop invariant.

In this patch:
1) We split the vshiftcnt* rules into vslcnt* and vsrcnt* rules to handle shift left and shift right respectively. Unlike the vslcnt* rules, the vsrcnt* rules perform the negate.
2) For each vsra* and vsrl* rule, we create one variant, i.e. vsra*_var and vsrl*_var. The vsra* and vsrl* rules handle case-1, while the vsra*_var and vsrl*_var rules handle case-2. Note that ShiftVNode::is_var_shift() can be used to distinguish case-1 from case-2.
3) We add an assertion to the vs*_imm rules, as was done on ARM32 [3].
4) Several style issues are resolved.

*** Example

Take the function `rShiftInt()` in the newly added micro benchmark VectorShiftRight.java as an example.

```
public void rShiftInt() {
  for (int i = 0; i < SIZE; i++) {
    intsB[i] = intsA[i] >> count;
  }
}
```

The arithmetic shift right is performed inside a large loop. The following snippet shows the disassembly generated by auto-vectorization before applying this patch. Note that the `neg` is executed inside the loop body.

```
0x0000ffff89057a64: dup   v16.16b, w13              <-- dup
0x0000ffff89057a68: mov   w12, #0x7d00              // #32000
0x0000ffff89057a6c: sub   w13, w2, w10
0x0000ffff89057a70: cmp   w2, w10
0x0000ffff89057a74: csel  w13, wzr, w13, lt
0x0000ffff89057a78: mov   w8, #0x7d00               // #32000
0x0000ffff89057a7c: cmp   w13, w8
0x0000ffff89057a80: csel  w13, w12, w13, hi
0x0000ffff89057a84: add   w14, w13, w10
0x0000ffff89057a88: nop
0x0000ffff89057a8c: nop
0x0000ffff89057a90: sbfiz x13, x10, #2, #32         <-- loop entry
0x0000ffff89057a94: add   x15, x17, x13
0x0000ffff89057a98: ldr   q17, [x15,#16]
0x0000ffff89057a9c: add   x13, x0, x13
0x0000ffff89057aa0: neg   v18.16b, v16.16b          <-- neg
0x0000ffff89057aa4: sshl  v17.4s, v17.4s, v18.4s    <-- shift right
0x0000ffff89057aa8: str   q17, [x13,#16]
0x0000ffff89057aac: ...
0x0000ffff89057b1c: add   w10, w10, #0x20
0x0000ffff89057b20: cmp   w10, w14
0x0000ffff89057b24: b.lt  0x0000ffff89057a90        <-- loop end
```

Here is the disassembly after applying this patch. The negate is no longer executed inside the loop; it has been hoisted out.

```
0x0000ffff8d053a68: neg   w14, w13                  <-- neg
0x0000ffff8d053a6c: dup   v16.16b, w14              <-- dup
0x0000ffff8d053a70: sub   w14, w2, w10
0x0000ffff8d053a74: cmp   w2, w10
0x0000ffff8d053a78: csel  w14, wzr, w14, lt
0x0000ffff8d053a7c: mov   w8, #0x7d00               // #32000
0x0000ffff8d053a80: cmp   w14, w8
0x0000ffff8d053a84: csel  w14, w12, w14, hi
0x0000ffff8d053a88: add   w13, w14, w10
0x0000ffff8d053a8c: nop
0x0000ffff8d053a90: sbfiz x14, x10, #2, #32         <-- loop entry
0x0000ffff8d053a94: add   x15, x17, x14
0x0000ffff8d053a98: ldr   q17, [x15,#16]
0x0000ffff8d053a9c: sshl  v17.4s, v17.4s, v16.4s    <-- shift right
0x0000ffff8d053aa0: add   x14, x0, x14
0x0000ffff8d053aa4: str   q17, [x14,#16]
0x0000ffff8d053aa8: ...
0x0000ffff8d053afc: add   w10, w10, #0x20
0x0000ffff8d053b00: cmp   w10, w13
0x0000ffff8d053b04: b.lt  0x0000ffff8d053a90        <-- loop end
```

*** Testing

Tier 1~3 tests passed on the Linux/AArch64 platform.

*** Performance Evaluation

- Auto-vectorization

One micro benchmark, VectorShiftRight.java, is added by this patch to evaluate the optimization of vector shift right. The table below shows the result. Column `Score-1` shows the score before applying this patch, and column `Score-2` shows the score after applying it. We observe roughly 30% ~ 53% improvement on these microbenchmarks.

```
Benchmark                     Units   Score-1    Score-2
VectorShiftRight.rShiftByte   ops/ms  10601.980  13816.353
VectorShiftRight.rShiftInt    ops/ms   3592.831   5502.941
VectorShiftRight.rShiftLong   ops/ms   1584.012   2425.247
VectorShiftRight.rShiftShort  ops/ms   6643.414   9728.762
VectorShiftRight.urShiftByte  ops/ms   2066.965   2048.336 (*)
VectorShiftRight.urShiftChar  ops/ms   6660.805   9728.478
VectorShiftRight.urShiftInt   ops/ms   3592.909   5514.928
VectorShiftRight.urShiftLong  ops/ms   1583.995   2422.991

*: Logical shift right for the Byte type (urShiftByte) is not vectorized,
   as discussed in [4].
```

- VectorAPI

We also evaluated the impact of this patch on Vector API benchmarks, e.g. [5]. Details can be found in the table below. Columns `Score-1` and `Score-2` show the scores before and after applying this patch.

```
Benchmark                 Units   Score-1    Score-2
Byte128Vector.LSHL        ops/ms  10867.666  10873.993
Byte128Vector.LSHLShift   ops/ms  10945.729  10945.741
Byte128Vector.LSHR        ops/ms   8629.305   8629.343
Byte128Vector.LSHRShift   ops/ms   8245.864  10303.521  <--
Byte128Vector.ASHR        ops/ms   8619.691   8629.438
Byte128Vector.ASHRShift   ops/ms   8245.860  10305.027  <--
Int128Vector.LSHL         ops/ms   3104.213   3103.702
Int128Vector.LSHLShift    ops/ms   3114.354   3114.371
Int128Vector.LSHR         ops/ms   2380.717   2380.693
Int128Vector.LSHRShift    ops/ms   2312.871   2992.377  <--
Int128Vector.ASHR         ops/ms   2380.668   2380.647
Int128Vector.ASHRShift    ops/ms   2312.894   2992.332  <--
Long128Vector.LSHL        ops/ms   1586.907   1587.591
Long128Vector.LSHLShift   ops/ms   1589.469   1589.540
Long128Vector.LSHR        ops/ms   1209.754   1209.687
Long128Vector.LSHRShift   ops/ms   1174.718   1527.502  <--
Long128Vector.ASHR        ops/ms   1209.713   1209.669
Long128Vector.ASHRShift   ops/ms   1174.712   1527.174  <--
Short128Vector.LSHL       ops/ms   5945.542   5943.770
Short128Vector.LSHLShift  ops/ms   5984.743   5984.640
Short128Vector.LSHR       ops/ms   4613.378   4613.577
Short128Vector.LSHRShift  ops/ms   4486.023   5746.466  <--
Short128Vector.ASHR       ops/ms   4613.389   4613.478
Short128Vector.ASHRShift  ops/ms   4486.019   5746.368  <--
```

1) For logical shift left (LSHL and LSHLShift) and for shift right with a variable vector shift count (LSHR and ASHR), we did not observe significant changes, which is expected.
2) For shift right with a scalar shift count (LSHRShift and ASHRShift), about 25% ~ 30% improvement can be observed; this benefit is introduced by the current patch.

[1] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/SSHL--Signed-Shift-Left--register--
[2] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/USHL--Unsigned-Shift-Left--register--
[3] openjdk/jdk18#41
[4] #1087
[5] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L509
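To make the case-1 / case-2 distinction concrete, here is a minimal Java sketch using the incubating Vector API. It is illustrative only and not part of this patch; the class and method names are invented, and the C2-node comments restate the description above rather than a guaranteed compilation result.

```
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical helper class, for illustration of the two shift-count origins only.
public class ShiftCountOrigins {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;

    // Case-1: scalar shift count. C2 broadcasts `count` into a vector
    // (RShiftCntV); with this patch the negate is folded into that
    // broadcast, so it can be hoisted out of a surrounding loop.
    static IntVector ashrByScalar(IntVector v, int count) {
        return v.lanewise(VectorOperators.ASHR, count);
    }

    // Case-2: per-lane shift counts loaded from another vector. Here the
    // negate stays with the shift itself (the new vsra*_var / vsrl*_var rules).
    static IntVector ashrByVector(IntVector v, IntVector counts) {
        return v.lanewise(VectorOperators.ASHR, counts);
    }
}
```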
1 parent 894ffb0 commit 20b6762

File tree

4 files changed (+809, -374 lines)


src/hotspot/cpu/aarch64/aarch64.ad

Lines changed: 9 additions & 0 deletions
```diff
@@ -1311,6 +1311,9 @@ public:
   // predicate controlling translation of CompareAndSwapX
   bool needs_acquiring_load_exclusive(const Node *load);
 
+  // Assert that the given node is not a variable shift.
+  bool assert_not_var_shift(const Node* n);
+
   // predicate controlling addressing modes
   bool size_fits_all_mem_uses(AddPNode* addp, int shift);
 %}
@@ -1725,6 +1728,12 @@ bool needs_acquiring_load_exclusive(const Node *n)
   return true;
 }
 
+// Assert that the given node is not a variable shift.
+bool assert_not_var_shift(const Node* n) {
+  assert(!n->as_ShiftV()->is_var_shift(), "illegal variable shift");
+  return true;
+}
+
 #define __ _masm.
 
 // advance declarations for helper functions to convert register
```
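For context, item 3 of the commit message adds an assertion to the `vs*_imm` rules, which can call the `assert_not_var_shift()` helper defined above from their encoding blocks. The fragment below is only a hand-written sketch in the AD syntax used by this file; the rule name, predicate, operands, pipeline class, and emitted instruction are assumptions for illustration and do not reproduce the patch's actual rules.

```
// Sketch only: an immediate arithmetic-shift-right rule that asserts it has
// not been selected for a variable-shift node (all details illustrative).
instruct vsra4I_imm_sketch(vecX dst, vecX src, immI shift) %{
  predicate(n->as_Vector()->length() == 4);
  match(Set dst (RShiftVI src (RShiftCntV shift)));
  format %{ "sshr  $dst, $src, $shift\t# vector (4S), sketch" %}
  ins_encode %{
    assert_not_var_shift(this);  // the kind of assertion this patch introduces
    __ sshr(as_FloatRegister($dst$$reg), __ T4S,
            as_FloatRegister($src$$reg), (int)$shift$$constant);
  %}
  ins_pipe(vshift128_imm);
%}
```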
