@eme64 commented May 23, 2023

This change should strictly expand the set of vectorized loops. And this change also makes SuperWord conceptually simpler.

As discussed in #12350, we should remove the alignment checks when alignment is actually not required (either by the hardware or explicitly asked for with -XX:+AlignVector). We did not do it directly in the same task to avoid too many changes of behavior.

This alignment check was originally there instead of a proper dependency checker. Requiring alignments on the packs per memory slice meant that all vector lanes were aligned, and there could be no cross-iteration dependencies that lead to cycles. But this is not general enough (we may for example allow the vector lanes to cross at some point). And we now have proper independence checks in SuperWord::combine_packs, as well as the cycle check in SuperWord::schedule.

Alignment is nice when we can make it happen, as it ensures that we do not have memory accesses across cache lines. But we should not prevent vectorization just because we cannot align all memory accesses for the same memory slice. As the benchmark shows below, we get a good speedup from vectorizing unaligned memory accesses.

Note: this reduces the CompileCommand Option Vectorize flag to only controlling whether we use the CloneMap or not. Read more about that in PR #13930. In the benchmarks below you can find some examples that vectorize only with, or only without, the Vectorize flag. My goal is to eventually try out both approaches and pick the better one, removing the need for the flag entirely (see "Unifying multiple SuperWord Strategies and beyond" below).

Changes to Tests
I could remove the CompileCommand Option Vectorize from TestDependencyOffsets.java, which means that those loops now vectorize without the need of the flag.

LoopArrayIndexComputeTest.java had a few "negative" tests that expected no vectorization because of "dependencies". But these were not real dependencies, since they were "read forward" cases. I now check that those do vectorize, and added symmetric "read backward" tests, which should currently not vectorize. However, these are still not "real dependencies" either: the arrays that are used could in theory be proven to be not equal, and then the dependencies could be dropped. But I think it is ok to leave them as "negative" tests for now, until we add such optimizations.
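To illustrate the difference between "read forward" and "read backward", here is a minimal Java sketch. It is not taken from LoopArrayIndexComputeTest.java: the class `ReadDirectionDemo`, the lane count of 4, and the `* 2` operation are assumptions made for illustration. It also uses a single array for loads and stores, so the backward dependency is real here, unlike the tests above where the arrays could in theory be proven distinct. Vector execution is only simulated by loading all lanes of a chunk before storing any of them.

```java
import java.util.Arrays;

// Hypothetical illustration; not taken from LoopArrayIndexComputeTest.java.
class ReadDirectionDemo {
    static final int LANES = 4; // assumed vector width, in elements

    // Scalar reference loop: a[i + writeOff] = a[i + readOff] * 2.
    // readOff and writeOff are expected to be 0 or 1.
    static int[] scalar(int[] in, int readOff, int writeOff) {
        int[] a = in.clone();
        for (int i = 0; i < a.length - 1; i++) {
            a[i + writeOff] = a[i + readOff] * 2;
        }
        return a;
    }

    // Simulated vector execution: per chunk, load all lanes first, then store all lanes.
    static int[] chunked(int[] in, int readOff, int writeOff) {
        int[] a = in.clone();
        int limit = a.length - 1;
        int i = 0;
        for (; i + LANES <= limit; i += LANES) {
            int[] lanes = new int[LANES];
            for (int l = 0; l < LANES; l++) lanes[l] = a[i + l + readOff] * 2;
            for (int l = 0; l < LANES; l++) a[i + l + writeOff] = lanes[l];
        }
        // Scalar tail for the remaining iterations.
        for (; i < limit; i++) {
            a[i + writeOff] = a[i + readOff] * 2;
        }
        return a;
    }

    // Read forward: a[i] = a[i+1] * 2.
    static boolean forwardMatches(int[] in) {
        return Arrays.equals(scalar(in, 1, 0), chunked(in, 1, 0));
    }

    // Read backward: a[i+1] = a[i] * 2.
    static boolean backwardMatches(int[] in) {
        return Arrays.equals(scalar(in, 0, 1), chunked(in, 0, 1));
    }

    public static void main(String[] args) {
        int[] in = {1, 1, 1, 1, 1, 1, 1, 1, 1};
        System.out.println("read forward matches scalar:  " + forwardMatches(in));
        System.out.println("read backward matches scalar: " + backwardMatches(in));
    }
}
```

In the forward case every load reads a value that no earlier iteration has written, so chunking does not change the result. In the backward case each iteration reads what the previous one just wrote, so loading a whole chunk up front reads stale values.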

Testing

Passes tier6 and stress testing.

No significant change in performance testing.

You can find some x64 and aarch64 benchmarks below, together with analysis and explanations.

There is a lot of information below. Feel free to read as little or as much of it as you find helpful.


Benchmark Data

Machine: 11th Gen Intel® Core™ i7-11850H @ 2.50GHz × 16. With AVX512 support.

Executed like this:

make test TEST="micro:vm.compiler.VectorAlignment" CONF=linux-x64

I have 4 flag combinations:

NoSuperWord: -XX:-UseSuperWord     (expect no vectorization)
SuperWord: -XX:+UseSuperWord     (normal mode)
SuperWordAlignVector: -XX:+UseSuperWord -XX:+AlignVector   (normal mode on machine with strict alignment)
SuperWordWithVectorize: -XX:+UseSuperWord -XX:CompileCommand=Option,*::*,Vectorize    (Vectorize flag enabled)

With patch:

VectorAlignment.VectorAlignmentNoSuperWord.bench000_control                                             2048       0  avgt       2465.937          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench001_control                                             2048       0  avgt       2509.747          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100_misaligned_load                                     2048       0  avgt       2484.883          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench200_hand_unrolled_aligned                               2048       0  avgt       2489.044          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench300_multiple_misaligned_loads                           2048       0  avgt       2463.388          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench301_multiple_misaligned_loads                           2048       0  avgt       2464.048          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench302_multiple_misaligned_loads_and_stores                2048       0  avgt       2476.954          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench400_hand_unrolled_misaligned                            2048       0  avgt       2592.562          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench401_hand_unrolled_misaligned                            2048       0  avgt       2563.649          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000_control                                               2048       0  avgt        315.926          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench001_control                                               2048       0  avgt        327.533          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100_misaligned_load                                       2048       0  avgt        319.991          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench200_hand_unrolled_aligned                                 2048       0  avgt        318.550          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench300_multiple_misaligned_loads                             2048       0  avgt       2504.033          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench301_multiple_misaligned_loads                             2048       0  avgt       2455.425          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench302_multiple_misaligned_loads_and_stores                  2048       0  avgt       2545.703          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench400_hand_unrolled_misaligned                              2048       0  avgt       2499.617          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench401_hand_unrolled_misaligned                              2048       0  avgt       2473.191          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000_control                                    2048       0  avgt        313.877          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench001_control                                    2048       0  avgt        341.554          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100_misaligned_load                            2048       0  avgt       2465.338          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench200_hand_unrolled_aligned                      2048       0  avgt        312.662          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench300_multiple_misaligned_loads                  2048       0  avgt       2455.039          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench301_multiple_misaligned_loads                  2048       0  avgt       2456.872          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench302_multiple_misaligned_loads_and_stores       2048       0  avgt       2604.665          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench400_hand_unrolled_misaligned                   2048       0  avgt       2456.425          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench401_hand_unrolled_misaligned                   2048       0  avgt       2507.887          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000_control                                  2048       0  avgt        312.670          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench001_control                                  2048       0  avgt        328.561          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100_misaligned_load                          2048       0  avgt        314.785          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench200_hand_unrolled_aligned                    2048       0  avgt       2454.712          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench300_multiple_misaligned_loads                2048       0  avgt        320.622          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench301_multiple_misaligned_loads                2048       0  avgt        341.595          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench302_multiple_misaligned_loads_and_stores     2048       0  avgt        516.716          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench400_hand_unrolled_misaligned                 2048       0  avgt       2469.011          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench401_hand_unrolled_misaligned                 2048       0  avgt       2542.513          ns/op

On master:

VectorAlignment.VectorAlignmentNoSuperWord.bench000_control                                             2048       0  avgt       2467.072          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench001_control                                             2048       0  avgt       2476.239          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100_misaligned_load                                     2048       0  avgt       2467.182          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench200_hand_unrolled_aligned                               2048       0  avgt       2460.985          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench300_multiple_misaligned_loads                           2048       0  avgt       2564.807          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench301_multiple_misaligned_loads                           2048       0  avgt       2568.871          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench302_multiple_misaligned_loads_and_stores                2048       0  avgt       2498.102          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench400_hand_unrolled_misaligned                            2048       0  avgt       2492.498          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench401_hand_unrolled_misaligned                            2048       0  avgt       2473.459          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000_control                                               2048       0  avgt        320.142          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench001_control                                               2048       0  avgt        328.415          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100_misaligned_load                                       2048       0  avgt       2464.787          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench200_hand_unrolled_aligned                                 2048       0  avgt        313.505          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench300_multiple_misaligned_loads                             2048       0  avgt       2459.245          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench301_multiple_misaligned_loads                             2048       0  avgt       2500.698          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench302_multiple_misaligned_loads_and_stores                  2048       0  avgt       2579.449          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench400_hand_unrolled_misaligned                              2048       0  avgt       2465.709          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench401_hand_unrolled_misaligned                              2048       0  avgt       2470.722          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000_control                                    2048       0  avgt        312.058          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench001_control                                    2048       0  avgt        329.024          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100_misaligned_load                            2048       0  avgt       2472.375          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench200_hand_unrolled_aligned                      2048       0  avgt        309.370          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench300_multiple_misaligned_loads                  2048       0  avgt       2468.434          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench301_multiple_misaligned_loads                  2048       0  avgt       2477.122          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench302_multiple_misaligned_loads_and_stores       2048       0  avgt       2561.528          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench400_hand_unrolled_misaligned                   2048       0  avgt       2478.820          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench401_hand_unrolled_misaligned                   2048       0  avgt       2462.620          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000_control                                  2048       0  avgt        313.276          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench001_control                                  2048       0  avgt        331.348          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100_misaligned_load                          2048       0  avgt        314.130          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench200_hand_unrolled_aligned                    2048       0  avgt       2465.140          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench300_multiple_misaligned_loads                2048       0  avgt        335.176          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench301_multiple_misaligned_loads                2048       0  avgt        335.492          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench302_multiple_misaligned_loads_and_stores     2048       0  avgt        550.598          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench400_hand_unrolled_misaligned                 2048       0  avgt       2511.170          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench401_hand_unrolled_misaligned                 2048       0  avgt       2468.112          ns/op

Generally: we can see the difference between vectorization and non-vectorization easily: without vectorization the runtime is over 2000 ns/op, with vectorization it is under 600 ns/op.

In comparison, on an aarch64 machine with asimd support:

With the patch:

VectorAlignment.VectorAlignmentNoSuperWord.bench000_control                                             2048       0  avgt       2058.132          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench001_control                                             2048       0  avgt       2071.570          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100_misaligned_load                                     2048       0  avgt       2063.994          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench200_hand_unrolled_aligned                               2048       0  avgt       2051.104          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench300_multiple_misaligned_loads                           2048       0  avgt       2058.493          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench301_multiple_misaligned_loads                           2048       0  avgt       2060.856          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench302_multiple_misaligned_loads_and_stores                2048       0  avgt       2213.880          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench400_hand_unrolled_misaligned                            2048       0  avgt       2060.412          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench401_hand_unrolled_misaligned                            2048       0  avgt       2055.939          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000_control                                               2048       0  avgt       1032.666          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench001_control                                               2048       0  avgt       1034.138          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100_misaligned_load                                       2048       0  avgt       1031.412          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench200_hand_unrolled_aligned                                 2048       0  avgt       1030.791          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench300_multiple_misaligned_loads                             2048       0  avgt       2057.689          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench301_multiple_misaligned_loads                             2048       0  avgt       2057.009          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench302_multiple_misaligned_loads_and_stores                  2048       0  avgt       1465.270          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench400_hand_unrolled_misaligned                              2048       0  avgt       2053.011          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench401_hand_unrolled_misaligned                              2048       0  avgt       2055.820          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000_control                                    2048       0  avgt       1032.645          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench001_control                                    2048       0  avgt       1034.199          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100_misaligned_load                            2048       0  avgt       2064.206          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench200_hand_unrolled_aligned                      2048       0  avgt       1026.581          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench300_multiple_misaligned_loads                  2048       0  avgt       2057.236          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench301_multiple_misaligned_loads                  2048       0  avgt       2057.276          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench302_multiple_misaligned_loads_and_stores       2048       0  avgt       1465.736          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench400_hand_unrolled_misaligned                   2048       0  avgt       2056.355          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench401_hand_unrolled_misaligned                   2048       0  avgt       2064.056          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000_control                                  2048       0  avgt       1033.816          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench001_control                                  2048       0  avgt       1034.002          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100_misaligned_load                          2048       0  avgt       1032.607          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench200_hand_unrolled_aligned                    2048       0  avgt       2052.119          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench300_multiple_misaligned_loads                2048       0  avgt       1026.828          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench301_multiple_misaligned_loads                2048       0  avgt       1027.582          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench302_multiple_misaligned_loads_and_stores     2048       0  avgt       1034.751          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench400_hand_unrolled_misaligned                 2048       0  avgt       2052.453          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench401_hand_unrolled_misaligned                 2048       0  avgt       2058.007          ns/op

On master:

VectorAlignment.VectorAlignmentNoSuperWord.bench000_control                                             2048       0  avgt       2058.009          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench001_control                                             2048       0  avgt       2070.553          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100_misaligned_load                                     2048       0  avgt       2064.553          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench200_hand_unrolled_aligned                               2048       0  avgt       2053.390          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench300_multiple_misaligned_loads                           2048       0  avgt       2058.187          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench301_multiple_misaligned_loads                           2048       0  avgt       2060.125          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench302_multiple_misaligned_loads_and_stores                2048       0  avgt       2208.483          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench400_hand_unrolled_misaligned                            2048       0  avgt       2058.145          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench401_hand_unrolled_misaligned                            2048       0  avgt       2056.145          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000_control                                               2048       0  avgt       1032.566          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench001_control                                               2048       0  avgt       1033.856          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100_misaligned_load                                       2048       0  avgt       2065.720          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench200_hand_unrolled_aligned                                 2048       0  avgt       1026.648          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench300_multiple_misaligned_loads                             2048       0  avgt       2057.476          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench301_multiple_misaligned_loads                             2048       0  avgt       2058.508          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench302_multiple_misaligned_loads_and_stores                  2048       0  avgt       1465.702          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench400_hand_unrolled_misaligned                              2048       0  avgt       2053.303          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench401_hand_unrolled_misaligned                              2048       0  avgt       2052.170          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000_control                                    2048       0  avgt       1032.788          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench001_control                                    2048       0  avgt       1033.912          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100_misaligned_load                            2048       0  avgt       2064.447          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench200_hand_unrolled_aligned                      2048       0  avgt       1027.305          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench300_multiple_misaligned_loads                  2048       0  avgt       2058.339          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench301_multiple_misaligned_loads                  2048       0  avgt       2057.675          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench302_multiple_misaligned_loads_and_stores       2048       0  avgt       1465.643          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench400_hand_unrolled_misaligned                   2048       0  avgt       2055.289          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench401_hand_unrolled_misaligned                   2048       0  avgt       2052.978          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000_control                                  2048       0  avgt       1032.738          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench001_control                                  2048       0  avgt       1034.188          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100_misaligned_load                          2048       0  avgt       1031.948          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench200_hand_unrolled_aligned                    2048       0  avgt       2051.954          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench300_multiple_misaligned_loads                2048       0  avgt       1027.746          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench301_multiple_misaligned_loads                2048       0  avgt       1028.121          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench302_multiple_misaligned_loads_and_stores     2048       0  avgt       1035.034          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench400_hand_unrolled_misaligned                 2048       0  avgt       2054.449          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench401_hand_unrolled_misaligned                 2048       0  avgt       2053.339          ns/op

Also with aarch64 we can see a clear difference between vectorization and non-vectorization. The pattern is the same, even though the concrete numbers are a bit different.

Benchmark Discussion: 0xx control
These are simple examples that vectorize unless SuperWord is disabled. They are only there to make sure the benchmark works.

@Benchmark
// Control: should always vectorize with SuperWord
public void bench001_control() {
    for (int i = 0; i < COUNT; i++) {
        // Have multiple MUL operations to make loop compute bound (more compute than load/store)
        rI[i] = aI[i] * aI[i] * aI[i] * aI[i] + bI[i];
    }
}

Benchmark Discussion: 1xx load and store misaligned

This vectorizes with the patch, but does not vectorize on master (unless the Vectorize flag is used). It does not vectorize with AlignVector because of the misalignment (misaligned by 1 int = 4 bytes). On master, we require all vectors to align with all other vectors of the same memory slice.

@Benchmark
// Vectorizes without AlignVector
public void bench100_misaligned_load() {
    for (int i = 0; i < COUNT-1; i++) {
        rI[i] = aI[i+1] * aI[i+1] * aI[i+1] * aI[i+1];
    }
}

Benchmark Discussion: 2xx vectorizes only without Vectorize

Hand-unrolling confuses SuperWord when the Vectorize flag is enabled. The issue is that adjacent memops are not from the same original same-iteration node. Rather, they are from two different lines.

@Benchmark
// Only without "Vectorize" (confused by hand-unrolling)
public void bench200_hand_unrolled_aligned() {
    for (int i = 0; i < COUNT-10; i+=2) {
        rI[i+0] = aI[i+0] * aI[i+0] * aI[i+0] * aI[i+0];
        rI[i+1] = aI[i+1] * aI[i+1] * aI[i+1] * aI[i+1];
    }
}

Here are the relevant checks in SuperWord::find_adjacent_refs:

if (isomorphic(s, mem_ref) &&
(!_do_vector_loop || same_origin_idx(s, mem_ref))) {

if (!_do_vector_loop || same_origin_idx(s1, s2)) {

Benchmark Discussion: 3xx vectorizes only with Vectorize

Regular SuperWord fails in these cases for 2 reasons:

  • 300 fails because of the modulo computation of the alignment
  • 301 fails because we can confuse multiple loads (aI[5] with aI[4+1]).

@Benchmark
// Only with "Vectorize", without we get issues with modulo computation of alignment for bI
public void bench300_multiple_misaligned_loads() {
    for (int i = 0; i < COUNT-10; i++) {
        rI[i] = aI[i] * aI[i] * aI[i] * aI[i] + bI[i+1];
    }
}

@Benchmark
// Only with "Vectorize", without we may confuse aI[5] with aI[4+1] and pack loads in wrong pack
public void bench301_multiple_misaligned_loads() {
    for (int i = 0; i < COUNT-10; i++) {
        rI[i] = aI[i] * aI[i] * aI[i] * aI[i] + aI[i+1];
    }
}

In SuperWord::memory_alignment we compute the alignment, modulo the vw (vector width).

int offset = p.offset_in_bytes();
offset += iv_adjust*p.memory_size();
int off_rem = offset % vw;
int off_mod = off_rem >= 0 ? off_rem : off_rem + vw;

Now assume we have two load vectors, with offsets [0,4,8,12] and [4,8,12,16]. If we have vw=16, then we get the off_mod to be [0,4,8,12] and [4,8,12,0]. The second vector thus has the last element wrap in the modulo space, and it does not pass the alignment checks (align1 + data_size == align2):

if (s1_align == top_align || s1_align == align) {
if (s2_align == top_align || s2_align == align + data_size(s1)) {

The consequence is that we only pack 3 of the 4 memops. And then the pack gets filtered out here, and vectorization fails:

if (!is_power_of_2(psize)) {

One solution for this is to compute the alignment without modulo. The modulo computation of alignment comes from a time when we could only have strictly aligned memory accesses, so we never wanted to pack pairs that cross an alignment boundary, where the modulo wraps around. We could address this in a future RFE.
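The wrap-around can be reproduced with a few lines of Java. This is only a simplified model: `offMod` mirrors the arithmetic quoted from SuperWord::memory_alignment (with iv_adjust left out), while `chainLength` and `isPowerOf2` are hypothetical helpers that stand in for the pairing check and the pack-size filter. The real C2 code is more involved.

```java
// Simplified model of the modulo alignment check; helper names are assumptions.
class ModuloAlignmentDemo {
    // Same arithmetic as off_rem / off_mod in the quoted snippet (iv_adjust omitted).
    static int offMod(int offsetInBytes, int vw) {
        int offRem = offsetInBytes % vw;
        return offRem >= 0 ? offRem : offRem + vw;
    }

    // Counts how many memops, starting at the first, form an unbroken chain
    // where align(next) == align(prev) + dataSize.
    static int chainLength(int[] offsets, int vw, int dataSize) {
        int len = 1;
        for (int k = 1; k < offsets.length; k++) {
            if (offMod(offsets[k], vw) != offMod(offsets[k - 1], vw) + dataSize) {
                break;
            }
            len++;
        }
        return len;
    }

    static boolean isPowerOf2(int x) {
        return x > 0 && (x & (x - 1)) == 0;
    }

    public static void main(String[] args) {
        int vw = 16;      // vector width in bytes
        int dataSize = 4; // size of an int in bytes
        int[] aligned = {0, 4, 8, 12};
        int[] shifted = {4, 8, 12, 16};
        for (int[] offsets : new int[][] {aligned, shifted}) {
            int len = chainLength(offsets, vw, dataSize);
            System.out.println("pack size " + len + ", power of 2: " + isPowerOf2(len));
        }
    }
}
```

For the offsets [4,8,12,16] the last element maps to 0 in modulo space, so the chain stops after three memops, and a pack of size 3 is rejected by the power-of-2 filter.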

The second issue for vectorization is the confusion of multiple loads that look identical, e.g. aI[5] and aI[4+1].
A loop like b[i] = a[i] + a[i+1] that was unrolled 4 times will have these loads:
a[i], a[i+1], a[i+1], a[i+2], a[i+2], a[i+3], a[i+3], a[i+4].
The SuperWord (SLP) algorithm just greedily picks pairs that are adjacent, and has no mechanism to deal with multiple packing
options: we can pair a[i] with either of the two a[i+1].
If we do not pack them perfectly, then the packs will not line up with other packs, and vectorization fails.
In the literature I have seen people solve this problem with integer linear programming, but that would
most likely be too expensive for our JIT C2. We just have to accept that SuperWord (SLP) is greedy and cannot pack things
optimally in all cases.
Luckily, the Vectorize approach does solve most of these cases, as it can separate the two loads in
b[i] = a[i] + a[i+1]: they come from two different nodes in the single-iteration loop.
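The ambiguity can be made concrete with a small Java sketch. It only enumerates the load offsets listed above and counts the candidates a greedy pairing step would face. The class `AmbiguousLoadsDemo` and its helpers are hypothetical and do not model the actual SLP implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the pairing ambiguity; not the C2 algorithm.
class AmbiguousLoadsDemo {
    // Offsets (relative to i) of the loads from a[] after unrolling
    // b[i] = a[i] + a[i+1] the given number of times.
    static List<Integer> unrolledLoadOffsets(int unroll) {
        List<Integer> offsets = new ArrayList<>();
        for (int u = 0; u < unroll; u++) {
            offsets.add(u);     // a[i+u]
            offsets.add(u + 1); // a[i+u+1]
        }
        return offsets;
    }

    // Number of loads that could be paired as the right neighbor of a load at 'offset'.
    static int adjacentCandidates(List<Integer> offsets, int offset) {
        int count = 0;
        for (int o : offsets) {
            if (o == offset + 1) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Integer> offsets = unrolledLoadOffsets(4);
        System.out.println("load offsets: " + offsets);
        System.out.println("candidates to pair with a[i]: " + adjacentCandidates(offsets, 0));
    }
}
```

The load a[i] has two equally adjacent partners, and nothing in the offsets alone tells a greedy algorithm which of the two belongs to the same original load.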

Benchmark Discussion: 4xx does not vectorize at all, even though it should be possible in principle

We combine the issues from 2xx and 3xx: hand-unrolling prevents Vectorize from working, and the confusion of multiple loads or the modulo alignment computation prevents non-Vectorize from working.


Unifying multiple SuperWord Strategies and beyond

We have now seen examples where sometimes it is better to go with the Vectorize flag, and sometimes it is better without it.
Would it not be great if we could try out both strategies, and then pick the better one?

A very naive solution: just try without Vectorize first. If we get a non-empty packset, go with it. If it is empty, then try with Vectorize. That may work in many cases, but there will also be a few cases where without Vectorize we do create a non-empty packset that is just very suboptimal. Plus: in the future we may consider expanding the approaches
to non-adjacent memory refs, such as strided accesses or even gather/scatter (as long as the dependency checks pass).
Then it becomes even more likely that both strategies create a non-empty packset, but one of the two
creates a much better packset than the other.

A better solution: try out both approaches, and evaluate them with a cost-model. Also compute the cost of the
non-vectorized loop. Then pick the best option. This cost-model will also be helpful to decide if we should vectorize
when we have Reduction nodes (they can be very expensive) or when introducing vector-shuffles (we probably want
to introduce them to allow reverse-order loops, where we need to reverse the vector lane order with a shuffle).
My suggestion is this: run both SuperWord approaches until we have the PacksetGraph. At this point we know if
we could vectorize, including if any dependency-checks fail (independence checks, cycle check).
Then, we evaluate the cost if we were to apply this PacksetGraph.
We pick the cheapest PacksetGraph and apply it.
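The selection step could look roughly like the sketch below. This is plain Java used only to illustrate the idea, not C2 code: the `Candidate` record, its fields, and the cost numbers are all invented for the example. A candidate is only considered if it passed the dependency checks, and it is only applied if it is estimated to be cheaper than the non-vectorized loop.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of cost-based strategy selection; not a C2 data structure.
public class StrategyPicker {
    // Summary of one vectorization attempt (all fields are assumptions).
    public record Candidate(String strategy, boolean passesDependencyChecks, double estimatedCost) {}

    // Returns the cheapest feasible candidate that beats the scalar loop,
    // or empty if the loop should stay non-vectorized.
    public static Optional<Candidate> pick(List<Candidate> candidates, double scalarCost) {
        return candidates.stream()
                .filter(Candidate::passesDependencyChecks)
                .filter(c -> c.estimatedCost() < scalarCost)
                .min(Comparator.comparingDouble(Candidate::estimatedCost));
    }

    public static void main(String[] args) {
        // Made-up costs, purely for illustration.
        List<Candidate> candidates = List.of(
                new Candidate("without Vectorize", true, 40.0),
                new Candidate("with Vectorize", true, 12.0));
        String choice = pick(candidates, 100.0)
                .map(Candidate::strategy)
                .orElse("keep scalar loop");
        System.out.println(choice);
    }
}
```

Compared to the naive "first non-empty packset wins" approach, this keeps a suboptimal packset from shadowing a better one, and it also covers the case where not vectorizing at all is the cheapest option.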

This approach is also extensible (I got a bit inspired by LLVM talks about VPlan).
We can rename the PacksetGraph to a more general VectorTransformGraph in the following steps:

  1. Create multiple VectorTransformGraph through multiple SuperWord strategies (with and without Vectorize). With a cost model pick the best one.
  2. Sometimes, there are too many nodes in the loop and we cannot unroll enough times to ensure there are enough parallel operations to fill all elements in the vector registers. That way, we lose a lot of performance. We could consider widening the operations in the VectorTransformGraph, so that we can indeed make use of the whole vector.
  3. We could even create a VectorTransformGraph from a single iteration loop, and try to widen the instructions there. If this succeeds we do not have to unroll before vectorizing. This is essentially a traditional loop vectorizer. Except that we can also run the SuperWord algorithm over it first to see if we already have any parallelism in the single iteration loop. And then widen that. This makes it a hybrid vectorizer. Not having to unroll means direct time savings, but also that we could vectorize larger loops in the first place, since we would not hit the node limit for unrolling.
  4. Later, we can also incorporate if-conversion into this approach. Let the previous points all allow packing/widening control flow. Now, we do if-conversion: either flatten the CFG with the use of VectorMaskCmp and VectorBlend, or if the branch is highly likely to take one side for all vector elements, we can also use test_all_zeros / test_all_ones to still branch.
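The if-conversion in step 4 can be pictured with a scalar emulation. In this sketch plain arrays stand in for vector registers, `LANES` is an assumed vector length, and the loop body is an invented example. The mask computation mimics what a VectorMaskCmp would produce, the per-lane select mimics a VectorBlend, and the "all lanes agree" cases mimic branching on test_all_ones / test_all_zeros. The text above describes flattening and branching as alternatives; the sketch shows both in one routine only for brevity.

```java
import java.util.Arrays;

// Hypothetical scalar emulation of if-conversion; not C2 code.
public class IfConversionSketch {
    static final int LANES = 4; // assumed vector length, chosen only for this sketch

    // Original loop with a branch in the body.
    public static void scalar(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] > 0) {
                b[i] = a[i] * 2;
            } else {
                b[i] = -a[i];
            }
        }
    }

    // Processes LANES elements at a time using a mask instead of a per-element branch.
    public static void ifConverted(int[] a, int[] b) {
        int i = 0;
        for (; i + LANES <= a.length; i += LANES) {
            // Mask compare: one boolean per lane.
            boolean[] mask = new boolean[LANES];
            int taken = 0;
            for (int l = 0; l < LANES; l++) {
                mask[l] = a[i + l] > 0;
                if (mask[l]) {
                    taken++;
                }
            }
            for (int l = 0; l < LANES; l++) {
                if (taken == LANES) {
                    // All lanes take the "then" side: only that side is computed.
                    b[i + l] = a[i + l] * 2;
                } else if (taken == 0) {
                    // All lanes take the "else" side.
                    b[i + l] = -a[i + l];
                } else {
                    // Mixed: compute both sides, then blend by mask.
                    int thenValue = a[i + l] * 2;
                    int elseValue = -a[i + l];
                    b[i + l] = mask[l] ? thenValue : elseValue;
                }
            }
        }
        // Scalar tail for the remaining elements.
        for (; i < a.length; i++) {
            b[i] = (a[i] > 0) ? a[i] * 2 : -a[i];
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, -1, -2, -3, -4, 5, -6, 0, 8, 9, -10};
        int[] expected = new int[a.length];
        int[] actual = new int[a.length];
        scalar(a, expected);
        ifConverted(a, actual);
        System.out.println(Arrays.equals(expected, actual));
    }
}
```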

Maybe there are even more vectorization approaches that could fit into this VectorTransformGraph scheme.
The advantage is that it is modular, and we do not affect the C2-graph until we have decided on the best vectorization option
via a cost-model.

One item I have to spend more time learning and integrating into this plan is PostLoopMultiversioning. It seems to use the widening approach. Maybe we can just extend the widening to a vector-masked version.


Example of large loop that is not vectorized

We have a limit of about 50 or 60 nodes for unrolling (LoopUnrollLimit). The loop below only vectorizes if we raise the limit.
Vectorizing before unrolling could help here. Or partially unroll, SuperWord, and widen more.

java -Xbatch -XX:CompileCommand=compileonly,Test::test -XX:+TraceNewVectors -XX:+TraceLoopOpts -XX:LoopUnrollLimit=1000 Test.java
class Test {
    static final int RANGE = 1024*2;
    static final int ITER  = 10_000;

    static void init(int[] data) {
        for (int i = 0; i < RANGE; i++) {
            data[i] = i + 1;
        }
    }

    static void test(int[] a, int[] b) {
        for (int i = 10; i < RANGE-10; i++) {
            int aa = a[i];
            aa = aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa;
            aa = aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa;
            aa = aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa;
            aa = aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa;
            b[i] = aa;
        }
    }

    public static void main(String[] args) {
        int[] a = new int[RANGE];
        int[] b = new int[RANGE];
        init(a);
        init(b);
        for (int i = 0; i < ITER; i++) {
            test(a, b);
        }
    }
}

Re-reviewing TestPickLastMemoryState.java

We once had some "collateral damage" in TestPickLastMemoryState.java, where we had to accept some cases that would not vectorize anymore to ensure correctness in all other cases (#12350 (comment)). Let's re-assess how many of them now vectorize:

  • f has a cyclic dependency in the graph (because we do not know that a != b):
class Test {
    static final int RANGE = 1024;
    static final int ITER  = 10_000;

    static void init(int[] data) {
        for (int i = 0; i < RANGE; i++) {
            data[i] = i + 1;
        }
    }

    static void test(int[] a, int[] b) {
        for (int i = 10; i < RANGE-10; i++) {
            a[i] = b[i - 1]--;   // store a[i]  ->  load b[i]
            b[i]--; // store b[i] must happen before load b[i - 1] of next iteration
        }
    }

    public static void main(String[] args) {
        int[] a = new int[RANGE];
        int[] b = new int[RANGE];
        init(a);
        init(b);
        for (int i = 0; i < ITER; i++) {
            test(a, b);
        }
    }
}

Run it with either:

java -Xbatch -XX:CompileCommand=compileonly,Test::test -XX:CompileCommand=Option,Test::test,Vectorize -XX:+TraceNewVectors -XX:+TraceSuperWord -XX:+Verbose Test.java
java -Xbatch -XX:CompileCommand=compileonly,Test::test -XX:+TraceNewVectors -XX:+TraceSuperWord -XX:+Verbose Test.java

In either case, we do not vectorize. In fact, we already fail to create pair packs for any memops except the b[i-1] store, since we detect that independent(s1,s2) does not hold for the adjacent memop pairs.
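A plain-Java way to see why the compiler has to be conservative here: the loop produces different results depending on whether the two arguments are the same array. The sketch below reuses the loop body from above; the class name, the smaller `RANGE`, and the `init` helper returning a fresh array are assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical demo class; the loop body is the one from test(a, b) above.
public class AliasingDemo {
    static final int RANGE = 64; // smaller than in the PR example, for readability

    public static int[] init() {
        int[] data = new int[RANGE];
        for (int i = 0; i < RANGE; i++) {
            data[i] = i + 1;
        }
        return data;
    }

    public static void test(int[] a, int[] b) {
        for (int i = 10; i < RANGE - 10; i++) {
            a[i] = b[i - 1]--;
            b[i]--;
        }
    }

    public static void main(String[] args) {
        int[] a = init();
        int[] b = init();
        test(a, b);          // distinct arrays

        int[] x = init();
        test(x, x);          // same array passed twice

        // The store to a[i] feeds the following load of b[i] only in the
        // aliased call, so the two runs do not produce the same contents.
        System.out.println("a[11] = " + a[11] + ", x[11] = " + x[11]);
        System.out.println("equal: " + Arrays.equals(a, x));
    }
}
```

Since C2 cannot prove `a != b` here, it has to keep the ordering that is only needed in the aliased case, which is what closes the cycle in the dependency graph.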

If we change the loop to:

    static void test(int[] a, int[] b) {
        for (int i = 10; i < RANGE-10; i++) {
            a[i] = b[i - 2]--;   // store a[i]  ->  load b[i]
            b[i]--; // store b[i] must happen before load b[i - 2] two iterations later
        }
    }

Now we do not detect the dependence at distance 1, but only later when we check for dependence at further distances. Because of pack removal, we see lots of warnings of the form "WARNING: Found dependency at distance greater than 1.". Without the Vectorize flag we somehow still manage to vectorize with 2-element vectors, but that is hardly a success, as my machine would allow packing 16 ints in a 512-bit register. That just seems to be an artefact of there being no dependence at distance 1. It is not very interesting to add IR verification for that kind of vectorization.
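The effect of a dependence at distance 2 can be reproduced with a toy model. The sketch below deliberately uses a simpler loop than the one above (`b[i] = b[i - 2] + 1`, an assumption made to keep the example short) and emulates a pack of `width` iterations by doing all loads first and all stores afterwards. With 2 lanes the result matches the scalar loop; with 4 lanes the later lanes read stale values.

```java
import java.util.Arrays;

// Hypothetical toy model of packing iterations; not how C2 represents packs.
public class DistanceTwoDemo {
    static int[] init(int n) {
        int[] b = new int[n];
        for (int i = 0; i < n; i++) {
            b[i] = i;
        }
        return b;
    }

    // Reference: loop-carried dependence at distance 2.
    public static int[] scalar(int n) {
        int[] b = init(n);
        for (int i = 2; i < n; i++) {
            b[i] = b[i - 2] + 1;
        }
        return b;
    }

    // Emulates executing `width` consecutive iterations as one pack:
    // all loads happen before any store of the same pack.
    public static int[] packed(int n, int width) {
        int[] b = init(n);
        int i = 2;
        for (; i + width <= n; i += width) {
            int[] lanes = new int[width];
            for (int l = 0; l < width; l++) {
                lanes[l] = b[i - 2 + l];      // "vector load"
            }
            for (int l = 0; l < width; l++) {
                b[i + l] = lanes[l] + 1;      // "vector add and store"
            }
        }
        for (; i < n; i++) {                  // scalar tail
            b[i] = b[i - 2] + 1;
        }
        return b;
    }

    public static void main(String[] args) {
        int n = 18;
        System.out.println("width 2 matches scalar: " + Arrays.equals(scalar(n), packed(n, 2)));
        System.out.println("width 4 matches scalar: " + Arrays.equals(scalar(n), packed(n, 4)));
    }
}
```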

  • test1-6 are also relatively complex, and have cyclic dependencies of different kinds. I think we should just keep them as correctness tests for correct results, but not extend them to IR verification tests.

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8308606: C2 SuperWord: remove alignment checks when not required (Sub-task - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14096/head:pull/14096
$ git checkout pull/14096

Update a local copy of the PR:
$ git checkout pull/14096
$ git pull https://git.openjdk.org/jdk.git pull/14096/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 14096

View PR using the GUI difftool:
$ git pr show -t 14096

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14096.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented May 23, 2023

👋 Welcome back epeter! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented May 23, 2023

@eme64 The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@eme64 eme64 changed the title 8308606: C2 SuperWord: remove alignment checks where not required 8308606: C2 SuperWord: remove alignment checks when not required May 25, 2023
@eme64 eme64 marked this pull request as ready for review May 30, 2023 07:11
@openjdk openjdk bot added the rfr Pull request is ready for review label May 30, 2023
@mlbridge

mlbridge bot commented May 30, 2023

Webrevs

public int[] indexWithDifferentConstants() {
// No true dependency in read-forward case.
@IR(applyIfCPUFeatureOr = {"asimd", "true", "sse2", "true"},
counts = {IRNode.STORE_VECTOR, ">0"})

You may need to add applyIf = {"AlignVector", "false"} for these newly added IR check rules.

Contributor Author

@eme64 eme64 Jun 8, 2023

@fg1417 you are right!

I think we should also add AlignVector to the IR whitelist. It makes sense to add it with this change here, because only from now on can we actually have misaligned loads / stores on the same memory-slice! So we should also test things more thoroughly now.

I also see that in test/hotspot/jtreg/compiler/vectorization/runner/ a lot of tests have @requires vm.flagless. That means we actually do not check any flag combinations with those tests. I think we should file an RFE to make them more general, and add the requirements to the IR rules.

Contributor Author

Ok, there are some bugs around, I cannot yet directly add AlignVector to the IR framework whitelist. We can do it in a follow up RFE https://bugs.openjdk.org/browse/JDK-8309662

Member

Hi @eme64, I'd like to explain more about the @requires vm.flagless. Vladimir Kozlov had suggested removing those annotations. I didn't do that before because those annotations cannot simply be removed. All tests under compiler/vectorization/runner/ are used for both the correctness check and the vectorizability (IR) check. For the correctness check, each test method is invoked twice, and the return results from the interpreter and from C2 compiled code are compared. We use compiler control via the WhiteBox API from the test runner to force these methods to run in the interpreter and in C2 (see the logic in VectorizationTestRunner.java). The forced compilation would fail if some extra VM option that affects compiler control (such as -Xint) is specified.

A way of removing @requires vm.flagless I can think of may be skipping the correctness check in the vectorization test runner if the compiler control fails. I just filed JDK-8309697 for this. Please let me know if you have any better ideas or suggestions.

Contributor Author

@pfustc If I run it with -Xint, then it says "Test results: no tests selected". I think that is because of @requires vm.compiler2.enabled. But sure, there may be some other flags that mess with the compiler controls.

But I think it is important to remove the @requires vm.flagless, there are always bugs lurking around with more flag combinations. Plus, we don't have all the hardware that exists out there. That is why it is crucial that we can run with flags AlignVector (some ARM machines have it on true) or UseKNLSetting (intel), for example.

I'll temporarily add the @requires vm.flagless back in for test/hotspot/jtreg/compiler/vectorization/runner/LoopArrayIndexComputeTest.java.

@eme64
Contributor Author

eme64 commented Jun 12, 2023

@fg1417 @pfustc I think I have addressed your concerns. Can you please re-review ;)

@fg1417

fg1417 commented Jun 14, 2023

Alignment is nice when we can make it happen, as it ensures that we do not have memory accesses across cache lines. But we should not prevent vectorization just because we cannot align all memory accesses for the same memory slice. As the benchmark shows below, we get a good speedup from vectorizing unaligned memory accesses.

Hi @eme64 , nice rewrite!

May I ask if you have any benchmark data of misaligned-load-store cases for other data types? For example, Double or Long on 128-bit machines (maybe aarch64 asimd).

@eme64
Contributor Author

eme64 commented Jun 14, 2023

@fg1417 Ok, I will expand the misaligned load-store case for some other data types and test it on my two machines again!

@pfustc
Member

pfustc commented Jun 14, 2023

Hi @eme64 , I don't see any problem after going through everything you wrote. But as I'm not an official reviewer and don't have enough confidence on complex changes, I hope someone who is more familiar with this part can have a review.

BTW: Have you generated and run some JavaFuzzer tests for this patch? Based on my personal experience, it's quite helpful for finding hidden bugs in SuperWord or other complex loop optimizations.

And a few more comments on this:

  1. We could even create a VectorTransformGraph from a single iteration loop, and try to widen the instructions there. If this succeeds we do not have to unroll before vectorizing. This is essentially a traditional loop vectorizer. Except that we can also run the SuperWord algorithm over it first to see if we already have any parallelism in the single iteration loop. And then widen that. This makes it a hybrid vectorizer. Not having to unroll means direct time savings, but also that we could vectorize larger loops in the first place, since we would not hit the node limit for unrolling.

What we are currently doing for https://bugs.openjdk.org/browse/JDK-8308994 is like a traditional loop vectorizer: it can vectorize loops without unrolling. It can also support strided accesses (gather/scatter) with a few updates. But our current implementation is outside SuperWord and for post loops only. Perhaps a hybrid vectorizer implemented in SuperWord is a better idea. We will push our draft patch to GitHub soon for your feedback. Currently I'm finishing some routines before I can push the code. It's expected to be done in a few days.

@eme64
Contributor Author

eme64 commented Jun 14, 2023

I'm collecting the new benchmark results here, so that we see the effect of misaligned load-stores.
I have a series of control cases (aligned), and a series of misaligned cases.


Machine: 11th Gen Intel® Core™ i7-11850H @ 2.50GHz × 16. With AVX512 support.

With patch:

Benchmark                                                                                            (COUNT)  (seed)  Mode  Cnt     Score   Error  Units
VectorAlignment.VectorAlignmentNoSuperWord.bench000B_control                                            2048       0  avgt       2465.281          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000C_control                                            2048       0  avgt       2467.440          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000D_control                                            2048       0  avgt       1276.895          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000F_control                                            2048       0  avgt       1313.390          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000I_control                                            2048       0  avgt       2465.260          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000L_control                                            2048       0  avgt       2469.814          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000S_control                                            2048       0  avgt       2466.305          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench001_control                                             2048       0  avgt       2470.130          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100B_misaligned_load                                    2048       0  avgt       2463.569          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100C_misaligned_load                                    2048       0  avgt       2467.426          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100D_misaligned_load                                    2048       0  avgt       1244.256          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100F_misaligned_load                                    2048       0  avgt       1268.847          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100I_misaligned_load                                    2048       0  avgt       2465.870          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100L_misaligned_load                                    2048       0  avgt       2473.035          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100S_misaligned_load                                    2048       0  avgt       2467.638          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench200_hand_unrolled_aligned                               2048       0  avgt       2452.871          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench300_multiple_misaligned_loads                           2048       0  avgt       2467.560          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench301_multiple_misaligned_loads                           2048       0  avgt       2503.790          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench302_multiple_misaligned_loads_and_stores                2048       0  avgt       2475.180          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench400_hand_unrolled_misaligned                            2048       0  avgt       2503.802          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench401_hand_unrolled_misaligned                            2048       0  avgt       2459.743          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000B_control                                              2048       0  avgt        325.126          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000C_control                                              2048       0  avgt         97.953          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000D_control                                              2048       0  avgt        330.818          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000F_control                                              2048       0  avgt        174.710          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000I_control                                              2048       0  avgt        313.795          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000L_control                                              2048       0  avgt        928.760          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000S_control                                              2048       0  avgt        108.428          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench001_control                                               2048       0  avgt        345.589          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100B_misaligned_load                                      2048       0  avgt        321.662          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100C_misaligned_load                                      2048       0  avgt        103.066          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100D_misaligned_load                                      2048       0  avgt        327.455          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100F_misaligned_load                                      2048       0  avgt        177.764          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100I_misaligned_load                                      2048       0  avgt        316.122          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100L_misaligned_load                                      2048       0  avgt        925.852          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100S_misaligned_load                                      2048       0  avgt        111.269          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench200_hand_unrolled_aligned                                 2048       0  avgt        315.747          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench300_multiple_misaligned_loads                             2048       0  avgt       2453.563          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench301_multiple_misaligned_loads                             2048       0  avgt       2453.191          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench302_multiple_misaligned_loads_and_stores                  2048       0  avgt       2582.489          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench400_hand_unrolled_misaligned                              2048       0  avgt       2454.593          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench401_hand_unrolled_misaligned                              2048       0  avgt       2495.125          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000B_control                                   2048       0  avgt        322.621          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000C_control                                   2048       0  avgt        105.592          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000D_control                                   2048       0  avgt        332.315          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000F_control                                   2048       0  avgt        178.128          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000I_control                                   2048       0  avgt        320.263          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000L_control                                   2048       0  avgt       2379.473          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000S_control                                   2048       0  avgt        103.485          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench001_control                                    2048       0  avgt        336.568          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100B_misaligned_load                           2048       0  avgt       2467.802          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100C_misaligned_load                           2048       0  avgt       2468.560          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100D_misaligned_load                           2048       0  avgt       1280.718          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100F_misaligned_load                           2048       0  avgt       1238.107          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100I_misaligned_load                           2048       0  avgt       2502.481          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100L_misaligned_load                           2048       0  avgt       2515.784          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100S_misaligned_load                           2048       0  avgt       2574.165          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench200_hand_unrolled_aligned                      2048       0  avgt        323.674          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench300_multiple_misaligned_loads                  2048       0  avgt       2483.910          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench301_multiple_misaligned_loads                  2048       0  avgt       2476.093          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench302_multiple_misaligned_loads_and_stores       2048       0  avgt       2539.519          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench400_hand_unrolled_misaligned                   2048       0  avgt       2456.025          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench401_hand_unrolled_misaligned                   2048       0  avgt       2455.324          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000B_control                                 2048       0  avgt        324.876          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000C_control                                 2048       0  avgt        104.053          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000D_control                                 2048       0  avgt        329.380          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000F_control                                 2048       0  avgt        178.084          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000I_control                                 2048       0  avgt        312.267          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000L_control                                 2048       0  avgt        928.972          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000S_control                                 2048       0  avgt        103.146          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench001_control                                  2048       0  avgt        335.605          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100B_misaligned_load                         2048       0  avgt        321.806          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100C_misaligned_load                         2048       0  avgt         96.239          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100D_misaligned_load                         2048       0  avgt        335.705          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100F_misaligned_load                         2048       0  avgt        177.388          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100I_misaligned_load                         2048       0  avgt        314.159          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100L_misaligned_load                         2048       0  avgt        929.355          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100S_misaligned_load                         2048       0  avgt        103.515          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench200_hand_unrolled_aligned                    2048       0  avgt       2452.942          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench300_multiple_misaligned_loads                2048       0  avgt        324.258          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench301_multiple_misaligned_loads                2048       0  avgt        316.609          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench302_multiple_misaligned_loads_and_stores     2048       0  avgt        496.744          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench400_hand_unrolled_misaligned                 2048       0  avgt       2482.812          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench401_hand_unrolled_misaligned                 2048       0  avgt       2456.151          ns/op

Master:

Benchmark                                                                                            (COUNT)  (seed)  Mode  Cnt     Score   Error  Units
VectorAlignment.VectorAlignmentNoSuperWord.bench000B_control                                            2048       0  avgt       2465.111          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000C_control                                            2048       0  avgt       2465.792          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000D_control                                            2048       0  avgt       1299.166          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000F_control                                            2048       0  avgt       1276.829          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000I_control                                            2048       0  avgt       2464.928          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000L_control                                            2048       0  avgt       2470.452          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000S_control                                            2048       0  avgt       2467.766          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench001_control                                             2048       0  avgt       2477.214          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100B_misaligned_load                                    2048       0  avgt       2472.792          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100C_misaligned_load                                    2048       0  avgt       2466.156          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100D_misaligned_load                                    2048       0  avgt       1249.213          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100F_misaligned_load                                    2048       0  avgt       1271.666          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100I_misaligned_load                                    2048       0  avgt       2469.197          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100L_misaligned_load                                    2048       0  avgt       5048.249          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100S_misaligned_load                                    2048       0  avgt       2570.552          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench200_hand_unrolled_aligned                               2048       0  avgt       2477.658          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench300_multiple_misaligned_loads                           2048       0  avgt       2461.930          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench301_multiple_misaligned_loads                           2048       0  avgt       2464.809          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench302_multiple_misaligned_loads_and_stores                2048       0  avgt       2469.329          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench400_hand_unrolled_misaligned                            2048       0  avgt       2460.789          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench401_hand_unrolled_misaligned                            2048       0  avgt       2465.124          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000B_control                                              2048       0  avgt        328.358          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000C_control                                              2048       0  avgt        101.450          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000D_control                                              2048       0  avgt        336.524          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000F_control                                              2048       0  avgt        181.461          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000I_control                                              2048       0  avgt        312.179          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000L_control                                              2048       0  avgt        926.032          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000S_control                                              2048       0  avgt        108.438          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench001_control                                               2048       0  avgt        330.634          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100B_misaligned_load                                      2048       0  avgt       2478.893          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100C_misaligned_load                                      2048       0  avgt       2467.903          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100D_misaligned_load                                      2048       0  avgt       1303.476          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100F_misaligned_load                                      2048       0  avgt       1271.854          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100I_misaligned_load                                      2048       0  avgt       2464.187          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100L_misaligned_load                                      2048       0  avgt       2469.901          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100S_misaligned_load                                      2048       0  avgt       2468.698          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench200_hand_unrolled_aligned                                 2048       0  avgt        321.711          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench300_multiple_misaligned_loads                             2048       0  avgt       2456.548          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench301_multiple_misaligned_loads                             2048       0  avgt       2455.254          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench302_multiple_misaligned_loads_and_stores                  2048       0  avgt       2586.609          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench400_hand_unrolled_misaligned                              2048       0  avgt       2454.082          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench401_hand_unrolled_misaligned                              2048       0  avgt       2459.663          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000B_control                                   2048       0  avgt        328.260          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000C_control                                   2048       0  avgt        110.001          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000D_control                                   2048       0  avgt        347.260          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000F_control                                   2048       0  avgt        175.408          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000I_control                                   2048       0  avgt        319.980          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000L_control                                   2048       0  avgt        926.987          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000S_control                                   2048       0  avgt         96.883          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench001_control                                    2048       0  avgt        331.807          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100B_misaligned_load                           2048       0  avgt       2469.823          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100C_misaligned_load                           2048       0  avgt       2467.840          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100D_misaligned_load                           2048       0  avgt       1277.344          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100F_misaligned_load                           2048       0  avgt       1256.043          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100I_misaligned_load                           2048       0  avgt       2464.069          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100L_misaligned_load                           2048       0  avgt       2464.847          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100S_misaligned_load                           2048       0  avgt       2511.085          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench200_hand_unrolled_aligned                      2048       0  avgt        309.642          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench300_multiple_misaligned_loads                  2048       0  avgt       2508.013          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench301_multiple_misaligned_loads                  2048       0  avgt       2455.004          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench302_multiple_misaligned_loads_and_stores       2048       0  avgt       2549.172          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench400_hand_unrolled_misaligned                   2048       0  avgt       2454.518          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench401_hand_unrolled_misaligned                   2048       0  avgt       2454.899          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000B_control                                 2048       0  avgt        333.081          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000C_control                                 2048       0  avgt         96.798          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000D_control                                 2048       0  avgt        341.050          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000F_control                                 2048       0  avgt        175.195          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000I_control                                 2048       0  avgt        312.981          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000L_control                                 2048       0  avgt        926.601          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000S_control                                 2048       0  avgt        101.604          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench001_control                                  2048       0  avgt        326.318          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100B_misaligned_load                         2048       0  avgt        321.232          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100C_misaligned_load                         2048       0  avgt        108.472          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100D_misaligned_load                         2048       0  avgt        366.364          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100F_misaligned_load                         2048       0  avgt        363.415          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100I_misaligned_load                         2048       0  avgt        316.285          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100L_misaligned_load                         2048       0  avgt        953.423          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100S_misaligned_load                         2048       0  avgt        102.263          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench200_hand_unrolled_aligned                    2048       0  avgt       2453.593          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench300_multiple_misaligned_loads                2048       0  avgt        317.559          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench301_multiple_misaligned_loads                2048       0  avgt        335.746          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench302_multiple_misaligned_loads_and_stores     2048       0  avgt        509.121          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench400_hand_unrolled_misaligned                 2048       0  avgt       2463.076          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench401_hand_unrolled_misaligned                 2048       0  avgt       2459.950          ns/op

In comparison, on an aarch64 machine with asimd support:

With patch:

Benchmark                                                                                            (COUNT)  (seed)  Mode  Cnt     Score   Error  Units
VectorAlignment.VectorAlignmentNoSuperWord.bench000B_control                                            2048       0  avgt       2073.898          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000C_control                                            2048       0  avgt       2069.891          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000D_control                                            2048       0  avgt       1530.363          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000F_control                                            2048       0  avgt       1531.315          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000I_control                                            2048       0  avgt       2058.755          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000L_control                                            2048       0  avgt       6162.654          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000S_control                                            2048       0  avgt       2066.975          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench001_control                                             2048       0  avgt       2064.712          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100B_misaligned_load                                    2048       0  avgt       2077.995          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100C_misaligned_load                                    2048       0  avgt       2070.731          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100D_misaligned_load                                    2048       0  avgt       1585.590          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100F_misaligned_load                                    2048       0  avgt       1577.224          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100I_misaligned_load                                    2048       0  avgt       2065.937          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100L_misaligned_load                                    2048       0  avgt       6154.060          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100S_misaligned_load                                    2048       0  avgt       2077.245          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench200_hand_unrolled_aligned                               2048       0  avgt       2052.183          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench300_multiple_misaligned_loads                           2048       0  avgt       2057.574          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench301_multiple_misaligned_loads                           2048       0  avgt       2056.507          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench302_multiple_misaligned_loads_and_stores                2048       0  avgt       2229.080          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench400_hand_unrolled_misaligned                            2048       0  avgt       2057.722          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench401_hand_unrolled_misaligned                            2048       0  avgt       2057.490          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000B_control                                              2048       0  avgt        277.949          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000C_control                                              2048       0  avgt        524.459          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000D_control                                              2048       0  avgt        761.802          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000F_control                                              2048       0  avgt        397.975          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000I_control                                              2048       0  avgt       1031.366          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000L_control                                              2048       0  avgt       6160.167          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000S_control                                              2048       0  avgt        524.403          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench001_control                                               2048       0  avgt       1033.700          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100B_misaligned_load                                      2048       0  avgt        296.661          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100C_misaligned_load                                      2048       0  avgt        534.637          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100D_misaligned_load                                      2048       0  avgt        896.924          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100F_misaligned_load                                      2048       0  avgt        444.249          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100I_misaligned_load                                      2048       0  avgt       1034.407          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100L_misaligned_load                                      2048       0  avgt       6159.291          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100S_misaligned_load                                      2048       0  avgt        535.201          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench200_hand_unrolled_aligned                                 2048       0  avgt       1026.480          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench300_multiple_misaligned_loads                             2048       0  avgt       2058.960          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench301_multiple_misaligned_loads                             2048       0  avgt       2057.263          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench302_multiple_misaligned_loads_and_stores                  2048       0  avgt       1469.346          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench400_hand_unrolled_misaligned                              2048       0  avgt       2051.330          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench401_hand_unrolled_misaligned                              2048       0  avgt       2053.124          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000B_control                                   2048       0  avgt        277.969          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000C_control                                   2048       0  avgt        524.459          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000D_control                                   2048       0  avgt        762.614          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000F_control                                   2048       0  avgt        398.871          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000I_control                                   2048       0  avgt       1031.324          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000L_control                                   2048       0  avgt       6159.809          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000S_control                                   2048       0  avgt        524.543          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench001_control                                    2048       0  avgt       1033.951          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100B_misaligned_load                           2048       0  avgt       2070.282          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100C_misaligned_load                           2048       0  avgt       2072.930          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100D_misaligned_load                           2048       0  avgt       1571.295          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100F_misaligned_load                           2048       0  avgt       1575.858          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100I_misaligned_load                           2048       0  avgt       2064.058          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100L_misaligned_load                           2048       0  avgt       6158.596          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100S_misaligned_load                           2048       0  avgt       2066.354          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench200_hand_unrolled_aligned                      2048       0  avgt       1026.634          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench300_multiple_misaligned_loads                  2048       0  avgt       2059.959          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench301_multiple_misaligned_loads                  2048       0  avgt       2056.331          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench302_multiple_misaligned_loads_and_stores       2048       0  avgt       1468.871          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench400_hand_unrolled_misaligned                   2048       0  avgt       2053.207          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench401_hand_unrolled_misaligned                   2048       0  avgt       2054.207          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000B_control                                 2048       0  avgt        278.020          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000C_control                                 2048       0  avgt        523.834          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000D_control                                 2048       0  avgt        763.723          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000F_control                                 2048       0  avgt        398.785          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000I_control                                 2048       0  avgt       1032.582          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000L_control                                 2048       0  avgt       6165.589          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000S_control                                 2048       0  avgt        524.599          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench001_control                                  2048       0  avgt       1034.263          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100B_misaligned_load                         2048       0  avgt        296.969          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100C_misaligned_load                         2048       0  avgt        527.155          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100D_misaligned_load                         2048       0  avgt        887.883          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100F_misaligned_load                         2048       0  avgt        450.608          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100I_misaligned_load                         2048       0  avgt       1032.089          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100L_misaligned_load                         2048       0  avgt       6156.948          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100S_misaligned_load                         2048       0  avgt        534.860          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench200_hand_unrolled_aligned                    2048       0  avgt       2052.838          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench300_multiple_misaligned_loads                2048       0  avgt       1034.799          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench301_multiple_misaligned_loads                2048       0  avgt       1026.710          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench302_multiple_misaligned_loads_and_stores     2048       0  avgt       1040.495          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench400_hand_unrolled_misaligned                 2048       0  avgt       2052.050          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench401_hand_unrolled_misaligned                 2048       0  avgt       2051.943          ns/op

Master:

Benchmark                                                                                            (COUNT)  (seed)  Mode  Cnt     Score   Error  Units
VectorAlignment.VectorAlignmentNoSuperWord.bench000B_control                                            2048       0  avgt       2073.172          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000C_control                                            2048       0  avgt       2069.024          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000D_control                                            2048       0  avgt       1530.282          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000F_control                                            2048       0  avgt       1530.620          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000I_control                                            2048       0  avgt       2057.784          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000L_control                                            2048       0  avgt       6162.051          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench000S_control                                            2048       0  avgt       2067.469          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench001_control                                             2048       0  avgt       2064.734          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100B_misaligned_load                                    2048       0  avgt       2077.420          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100C_misaligned_load                                    2048       0  avgt       2069.712          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100D_misaligned_load                                    2048       0  avgt       1567.685          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100F_misaligned_load                                    2048       0  avgt       1588.181          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100I_misaligned_load                                    2048       0  avgt       2063.875          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100L_misaligned_load                                    2048       0  avgt       6156.420          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100S_misaligned_load                                    2048       0  avgt       2071.831          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench200_hand_unrolled_aligned                               2048       0  avgt       2051.050          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench300_multiple_misaligned_loads                           2048       0  avgt       2058.347          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench301_multiple_misaligned_loads                           2048       0  avgt       2059.587          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench302_multiple_misaligned_loads_and_stores                2048       0  avgt       2226.472          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench400_hand_unrolled_misaligned                            2048       0  avgt       2059.083          ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench401_hand_unrolled_misaligned                            2048       0  avgt       2057.710          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000B_control                                              2048       0  avgt        278.330          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000C_control                                              2048       0  avgt        523.515          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000D_control                                              2048       0  avgt        762.488          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000F_control                                              2048       0  avgt        397.938          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000I_control                                              2048       0  avgt       1033.271          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000L_control                                              2048       0  avgt       6159.941          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000S_control                                              2048       0  avgt        523.527          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench001_control                                               2048       0  avgt       1034.109          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100B_misaligned_load                                      2048       0  avgt       2070.728          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100C_misaligned_load                                      2048       0  avgt       2071.798          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100D_misaligned_load                                      2048       0  avgt       1573.162          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100F_misaligned_load                                      2048       0  avgt       1576.619          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100I_misaligned_load                                      2048       0  avgt       2063.931          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100L_misaligned_load                                      2048       0  avgt       6154.563          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100S_misaligned_load                                      2048       0  avgt       2070.740          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench200_hand_unrolled_aligned                                 2048       0  avgt       1026.360          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench300_multiple_misaligned_loads                             2048       0  avgt       2058.098          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench301_multiple_misaligned_loads                             2048       0  avgt       2057.420          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench302_multiple_misaligned_loads_and_stores                  2048       0  avgt       1468.739          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench400_hand_unrolled_misaligned                              2048       0  avgt       2057.625          ns/op
VectorAlignment.VectorAlignmentSuperWord.bench401_hand_unrolled_misaligned                              2048       0  avgt       2052.928          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000B_control                                   2048       0  avgt        278.293          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000C_control                                   2048       0  avgt        524.414          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000D_control                                   2048       0  avgt        762.892          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000F_control                                   2048       0  avgt        398.785          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000I_control                                   2048       0  avgt       1031.174          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000L_control                                   2048       0  avgt       6161.848          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000S_control                                   2048       0  avgt        524.493          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench001_control                                    2048       0  avgt       1033.969          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100B_misaligned_load                           2048       0  avgt       2070.382          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100C_misaligned_load                           2048       0  avgt       2071.898          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100D_misaligned_load                           2048       0  avgt       1571.107          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100F_misaligned_load                           2048       0  avgt       1577.159          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100I_misaligned_load                           2048       0  avgt       2068.130          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100L_misaligned_load                           2048       0  avgt       6154.740          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100S_misaligned_load                           2048       0  avgt       2066.249          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench200_hand_unrolled_aligned                      2048       0  avgt       1026.455          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench300_multiple_misaligned_loads                  2048       0  avgt       2057.934          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench301_multiple_misaligned_loads                  2048       0  avgt       2056.206          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench302_multiple_misaligned_loads_and_stores       2048       0  avgt       1469.518          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench400_hand_unrolled_misaligned                   2048       0  avgt       2054.248          ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench401_hand_unrolled_misaligned                   2048       0  avgt       2052.423          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000B_control                                 2048       0  avgt        278.086          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000C_control                                 2048       0  avgt        524.605          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000D_control                                 2048       0  avgt        762.109          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000F_control                                 2048       0  avgt        398.979          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000I_control                                 2048       0  avgt       1033.048          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000L_control                                 2048       0  avgt       6159.353          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000S_control                                 2048       0  avgt        524.595          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench001_control                                  2048       0  avgt       1034.222          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100B_misaligned_load                         2048       0  avgt        279.900          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100C_misaligned_load                         2048       0  avgt        534.811          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100D_misaligned_load                         2048       0  avgt        864.836          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100F_misaligned_load                         2048       0  avgt        455.771          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100I_misaligned_load                         2048       0  avgt       1034.040          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100L_misaligned_load                         2048       0  avgt       6158.096          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100S_misaligned_load                         2048       0  avgt        527.090          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench200_hand_unrolled_aligned                    2048       0  avgt       2051.502          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench300_multiple_misaligned_loads                2048       0  avgt       1032.550          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench301_multiple_misaligned_loads                2048       0  avgt       1027.912          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench302_multiple_misaligned_loads_and_stores     2048       0  avgt       1041.101          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench400_hand_unrolled_misaligned                 2048       0  avgt       2051.179          ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench401_hand_unrolled_misaligned                 2048       0  avgt       2052.682          ns/op

Discussion

I'm only discussing the new results from bench000X (aligned) and bench100X (misaligned).
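For readers who have not opened the benchmark file, the difference between the two families can be sketched roughly as below. This is only an illustration under assumptions: the class name, method names and loop bodies are hypothetical and are not copied from the benchmark source. The "control" loop loads and stores at the same index, while the "misaligned load" loop loads at an offset of one element, so the load and the store cannot both be aligned to the same vector boundary.

```java
import java.util.Arrays;

public class MisalignedLoadSketch {
    // Hypothetical "control" kernel: load and store use the same index,
    // so both accesses can share one alignment.
    static void alignedKernel(int[] src, int[] dst) {
        for (int i = 0; i < src.length - 1; i++) {
            dst[i] = src[i] + 1;
        }
    }

    // Hypothetical "misaligned load" kernel: the load is shifted by one
    // element relative to the store, so at most one of them can be aligned.
    static void misalignedLoadKernel(int[] src, int[] dst) {
        for (int i = 0; i < src.length - 1; i++) {
            dst[i] = src[i + 1] + 1;
        }
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4};
        int[] a = new int[src.length];
        int[] b = new int[src.length];
        alignedKernel(src, a);
        misalignedLoadKernel(src, b);
        System.out.println(Arrays.toString(a));
        System.out.println(Arrays.toString(b));
    }
}
```

Whether C2 actually vectorizes either loop depends on the JVM build, the hardware and the flags in use; the sketch only shows the access pattern under discussion.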

avx512:
NoSuperWord: all take 2500, except float/double take 1300 (about half). Alignment makes no difference.
SuperWord: On master, the misaligned case performs like NoSuperWord. On master aligned, and with the patch in both cases: B 330, C 100, S 100, I 310, L 930, F 180, D 340 (±10%), which is a clear speedup.
AlignVector: Only vectorizes for the aligned case (good numbers from SuperWord), misaligned does not vectorize (bad numbers from NoSuperWord).
WithVectorize: all vectorize, with all the good numbers from "SuperWord".

aarch64 asimd:
NoSuperWord: B/C/S/I 2070, L 6160, F/D 1530 -> 1580 (slightly worse for misaligned?).
SuperWord: With the patch we have B 280 -> 290, C 525 -> 535, S 525 -> 535, I 1030 -> 1035, L 6160, F 400 -> 440, D 760 -> 900 (aligned -> misaligned; misaligned is a bit slower, and long does not vectorize). Master is the same for aligned, but has the "bad" numbers from "NoSuperWord" for misaligned (no vectorization).
AlignVector: same for patch and master. Only aligned vectorizes.
WithVectorize: same for patch and master. All vectorize. But aligned slightly faster than misaligned, just like for patch with "SuperWord".
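To put the aarch64 numbers in perspective, the speedup of the patched "SuperWord" run over "NoSuperWord" for the misaligned case can be worked out from the rounded values quoted above. The sketch below is plain arithmetic on those figures; the class and method names are made up for illustration.

```java
public class SpeedupSketch {
    // Speedup = time without vectorization / time with vectorization.
    static double speedup(double scalarNsPerOp, double vectorNsPerOp) {
        return scalarNsPerOp / vectorNsPerOp;
    }

    public static void main(String[] args) {
        String[] types = {"B", "C", "S", "I", "L", "F", "D"};
        // Rounded misaligned numbers (ns/op) from the aarch64 discussion.
        double[] noSuperWord = {2070, 2070, 2070, 2070, 6160, 1580, 1580};
        double[] patched     = { 290,  535,  535, 1035, 6160,  440,  900};
        for (int i = 0; i < types.length; i++) {
            System.out.printf("%s: %.2fx%n", types[i],
                              speedup(noSuperWord[i], patched[i]));
        }
    }
}
```

With these inputs, int comes out at exactly 2x and long at 1x (it does not vectorize in either configuration), while the smaller element types gain more.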

Conclusion

avx512: no speed difference between aligned and misaligned case.
aarch64 asimd: vectorizing the misaligned cases leads to a clear performance win compared to non-vectorization. However, we can see that the vectorized misaligned cases are consistently a bit slower than the vectorized aligned cases.

So from these experiments it is clearly profitable to make the change. The question is if there are any other examples where the misalignment leads to such a bad penalty that vectorization is not profitable.

Question for aarch64 specialists: How bad does the misalignment penalty really get? Do you think it would ever make it unprofitable to vectorize?

Contributor Author

eme64 commented Jun 14, 2023

@pfustc By default we run the fuzzer for only a few runs, but I'm now running it for longer to gain more confidence.

I'm looking forward to your draft PR. Maybe we can work together towards a hybrid-vectorizer. I plan to keep working on SuperWord and vectorization in general.

fg1417 commented Jun 16, 2023

aarch64 asimd: vectorizing the misaligned cases leads to clear performance win compared to non-vectorization. However, we can see that the vectorized misaligned cases are consistently a bit slower than the vectorized aligned cases.

Hi @eme64, thanks for your perf data! I also tried your new benchmark on some of the latest aarch64 machines using asimd. Here is part of the results:

  VectorAlignment.VectorAlignmentSuperWord.bench000B_control                                   2048       0  avgt        152.831          ns/op
  VectorAlignment.VectorAlignmentSuperWord.bench000C_control                                   2048       0  avgt        285.819          ns/op
  VectorAlignment.VectorAlignmentSuperWord.bench000D_control                                   2048       0  avgt        749.996          ns/op
  VectorAlignment.VectorAlignmentSuperWord.bench000F_control                                   2048       0  avgt        396.433          ns/op
  VectorAlignment.VectorAlignmentSuperWord.bench000I_control                                   2048       0  avgt        560.767          ns/op
  VectorAlignment.VectorAlignmentSuperWord.bench000L_control                                   2048       0  avgt       1131.909          ns/op
  VectorAlignment.VectorAlignmentSuperWord.bench000S_control                                   2048       0  avgt        285.215          ns/op
  VectorAlignment.VectorAlignmentSuperWord.bench001_control                                    2048       0  avgt        562.436          ns/op
  VectorAlignment.VectorAlignmentSuperWord.bench100B_misaligned_load                           2048       0  avgt        152.459          ns/op
  VectorAlignment.VectorAlignmentSuperWord.bench100C_misaligned_load                           2048       0  avgt        290.888          ns/op
  VectorAlignment.VectorAlignmentSuperWord.bench100D_misaligned_load                           2048       0  avgt        754.443          ns/op
  VectorAlignment.VectorAlignmentSuperWord.bench100F_misaligned_load                           2048       0  avgt        386.633          ns/op
  VectorAlignment.VectorAlignmentSuperWord.bench100I_misaligned_load                           2048       0  avgt        560.587          ns/op
  VectorAlignment.VectorAlignmentSuperWord.bench100L_misaligned_load                           2048       0  avgt       1134.492          ns/op
  VectorAlignment.VectorAlignmentSuperWord.bench100S_misaligned_load                           2048       0  avgt        284.768          ns/op

I believe that the perf gap between the vectorized misaligned cases and the vectorized aligned cases may become smaller, and in some cases may disappear entirely, on newer aarch64 machines.
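The size of that gap can be read off the table above by computing the relative difference between each misaligned result and its aligned control. The following sketch does this for the numbers posted in this comment; the class and method names are hypothetical.

```java
public class AlignmentGapSketch {
    // Relative slowdown of the misaligned case versus the aligned case, in percent.
    // Negative values mean the misaligned run was faster.
    static double gapPercent(double alignedNs, double misalignedNs) {
        return (misalignedNs - alignedNs) / alignedNs * 100.0;
    }

    public static void main(String[] args) {
        String[] types = {"B", "C", "D", "F", "I", "L", "S"};
        // bench000X_control (aligned) results in ns/op, from the table above.
        double[] aligned    = {152.831, 285.819, 749.996, 396.433, 560.767, 1131.909, 285.215};
        // bench100X_misaligned_load results in ns/op, from the table above.
        double[] misaligned = {152.459, 290.888, 754.443, 386.633, 560.587, 1134.492, 284.768};
        for (int i = 0; i < types.length; i++) {
            System.out.printf("%s: %+.2f%%%n", types[i],
                              gapPercent(aligned[i], misaligned[i]));
        }
    }
}
```

On these numbers every gap stays within roughly 2.5% in either direction, and several misaligned runs are marginally faster than their aligned controls, which is consistent with the difference being close to measurement noise on this machine.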

Also, I strongly agree with your conclusion: it is clearly profitable to vectorize these misaligned cases.

Thanks!

Contributor Author

eme64 commented Jun 19, 2023

@fg1417 perfect, thanks for looking into that!
Is there something you still want me to change on this RFE?

@fg1417 fg1417 left a comment

LGTM. Thanks for your work.

Contributor

@vnkozlov vnkozlov left a comment

Nice re-write. Looks good to me.

openjdk bot commented Jun 20, 2023

@eme64 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8308606: C2 SuperWord: remove alignment checks when not required

Reviewed-by: fgao, kvn, pli

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 82 new commits pushed to the master branch:

  • e022e87: 8310053: VarHandle and slice handle derived from layout are lacking alignment check
  • 45eaf5e: 8298443: Remove expired flags in JDK 22
  • 28415ad: 8310225: Reduce inclusion of oopMapCache.hpp and generateOopMap.hpp
  • 4c3efb3: 8309034: NoClassDefFoundError when initializing Long$LongCache
  • 1120106: 8310458: Fix build failure caused by JDK-8310049
  • 09174e0: 8310049: Refactor Charset tests to use JUnit
  • 99d2a9a: 8310330: HttpClient: debugging interestOps/readyOps could cause exceptions and smaller cleanup
  • 31b6fd7: 8309258: RISC-V: Add riscv_hwprobe syscall
  • 4a9cc8a: 8309266: C2: assert(final_con == (jlong)final_int) failed: final value should be integer
  • 4e4e586: 8310194: Generational ZGC: Lock-order asserts in JVMTI IterateThroughHeap
  • ... and 72 more: https://git.openjdk.org/jdk/compare/bd79db3930f192f6742e29a63a6d1c3bc3dd3385...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk bot added the ready label Jun 20, 2023
Contributor Author

eme64 commented Jun 21, 2023

Thanks @vnkozlov @pfustc @fg1417 for the suggestions and reviews!
/integrate

openjdk bot commented Jun 21, 2023

Going to push as commit 886ac1c.
Since your change was applied there have been 83 commits pushed to the master branch:

  • 47d00a4: 8310265: (process) jspawnhelper should not use argv[0]
  • e022e87: 8310053: VarHandle and slice handle derived from layout are lacking alignment check
  • 45eaf5e: 8298443: Remove expired flags in JDK 22
  • 28415ad: 8310225: Reduce inclusion of oopMapCache.hpp and generateOopMap.hpp
  • 4c3efb3: 8309034: NoClassDefFoundError when initializing Long$LongCache
  • 1120106: 8310458: Fix build failure caused by JDK-8310049
  • 09174e0: 8310049: Refactor Charset tests to use JUnit
  • 99d2a9a: 8310330: HttpClient: debugging interestOps/readyOps could cause exceptions and smaller cleanup
  • 31b6fd7: 8309258: RISC-V: Add riscv_hwprobe syscall
  • 4a9cc8a: 8309266: C2: assert(final_con == (jlong)final_int) failed: final value should be integer
  • ... and 73 more: https://git.openjdk.org/jdk/compare/bd79db3930f192f6742e29a63a6d1c3bc3dd3385...master

Your commit was automatically rebased without conflicts.

openjdk bot added the integrated label Jun 21, 2023
openjdk bot closed this Jun 21, 2023
openjdk bot removed the ready and rfr labels Jun 21, 2023
openjdk bot commented Jun 21, 2023

@eme64 Pushed as commit 886ac1c.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Member

pfustc commented Jun 21, 2023

Hi @eme64, I have just pushed our post loop patch to GitHub for review. I also attached some documents in a reply on that PR to help reviewers understand the code. See #14581

Contributor Author

eme64 commented Jun 21, 2023

@pfustc Thanks for the info, I'll look at it soon!

Labels: hotspot-compiler, integrated
