
Conversation

@eme64
Contributor

@eme64 eme64 commented Jan 31, 2023

List of important things below

Original RFE description:
Cyclic dependencies are not handled correctly in all cases. Three examples:

static void test2(int[] dataI, float[] dataF) {
    for (int i = 0; i < RANGE - 2; i++) {
        // dataI has cyclic dependency of distance 2, cannot vectorize
        int v = dataI[i];
        dataI[i + 2] = v;
        dataF[i] = v; // let's not get confused by another type
    }
}

And this, compiled with -XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize:

static void test5(int[] data) {
    for (int j = 0; j < RANGE - 3; j++) {
        // write forward -> cyclic dependency -> cannot vectorize
        // independent(s1, s2) for adjacent loads cannot detect this
        // Checks with memory_alignment are disabled via compile option
        data[j + 2] = data[j];
    }
}

And for vmIntrinsics::_forEachRemaining, the compile option Vectorize is always enabled:

static void test5(int[] data) {
    IntStream.range(0, RANGE - 2).forEach(j -> {
        data[j + 2] = data[j];
    });
}

All of these examples are vectorized despite the cyclic dependency of distance 2. The dependency is dropped, and the emitted vector code implements a shift by 2 instead of repeating the same 2 values.
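
For illustration, here is a small hypothetical demo (my own sketch, not code from the PR or its tests) that contrasts the scalar semantics with what a naive 4-lane vector copy computes for data[j + 2] = data[j]:

import java.util.Arrays;

public class ShiftVsRepeatDemo {
    // Scalar semantics: the store at j + 2 is read again two iterations later,
    // so data[0] and data[1] are repeated across the whole array.
    static void scalar(int[] data) {
        for (int j = 0; j < data.length - 2; j++) {
            data[j + 2] = data[j];
        }
    }

    // Simulated 4-lane vector copy: load data[j..j+3], then store to data[j+2..j+5].
    // The whole load happens before the overlapping store, so the block is shifted by 2.
    static void naiveVector(int[] data) {
        for (int j = 0; j + 5 < data.length; j += 4) {
            int[] lanes = Arrays.copyOfRange(data, j, j + 4);       // "vector load"
            System.arraycopy(lanes, 0, data, j + 2, lanes.length);  // "vector store"
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7, 8};
        int[] b = a.clone();
        scalar(a);      // a = [1, 2, 1, 2, 1, 2, 1, 2]
        naiveVector(b); // b = [1, 2, 1, 2, 3, 4, 7, 8] -> shifted, not repeated (tail not covered by the vector loop)
        System.out.println(Arrays.toString(a) + " vs " + Arrays.toString(b));
    }
}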

Analysis

The create_pack logic in SuperWord::find_adjacent_refs is broken in two ways:

  • When the compile directive Vectorize is on, or we compile vmIntrinsics::_forEachRemaining, we have _do_vector_loop == true. When that is the case, we blindly trust that there is no cyclic dependency of distance larger than 1. Distance 1 would already be detected by the independence(s1, s2) checks we do for all adjacent memops. For larger distances, we rely on memory_alignment == 0, but the compile directive bypasses these alignment checks.
  • If best_align_to_mem_ref is of a different type, and we have memory_alignment(mem_ref, best_align_to_mem_ref) == 0, we do not check if mem_ref has memory_alignment == 0 for all other refs of the same type. In the example TestCyclicDependency::test2, we have best_align_to_mem_ref as the StoreF. Then we assess the StoreI, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at LoadI, which has perfect alignment with the StoreF, so we accept it too (even though it is in conflict with the StoreI).

Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code.

I also propose to allow the compile directive Vectorize only if vectors_should_be_aligned() == false. If all vector operations have to be vector_width aligned, then they also have to be mutually aligned, and we cannot have patterns like v[i] = v[i] + v[i+1], for which the compile directive was introduced in the first place c7d33de.
Update: I found a Test.java that led to a crash (SIGBUS) on ARM32 on master. The example bypassed the alignment requirement because of _do_vector_loop, and allowed unaligned vector loads to be generated on a platform that requires alignment. Thanks @fg1417 for running that test for me!

Solution

First, I implemented SuperWord::verify_packs which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see SuperWord::find_dependence). With this verification in place we at least get an assert instead of wrong execution.

I then refactored and fixed the create_pack code, and put the logic all in SuperWord::is_mem_ref_alignment_ok. With the added comments, I hope the logic is more straightforward and readable. If _do_vector_loop == true, then I filter the vector packs again in SuperWord::combine_packs, since at that point we are not sure that the packs are actually independent; we only know that adjacent memops are independent.

Another change I have made:
I disallow extend_packlist from adding MemNodes back in, because if we have rejected some memops, we do not want them to be added back in later.

Testing

I added a few more regression tests, and am running tier1-3, plus some stress testing.

However, I need help from someone who can test this on ARM32 and PPC, basically machines that have vectors_should_be_aligned() == true. I would love to have additional testing on those machines, and some reviews.
Update: @fg1417 did testing on ARM32, @reinrich did testing on PPC.

Discussion / Future Work

I wonder if we should have _do_vector_loop == true by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think? JDK-8303113


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8298935: fix independence bug in create_pack logic in SuperWord::find_adjacent_refs

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/12350/head:pull/12350
$ git checkout pull/12350

Update a local copy of the PR:
$ git checkout pull/12350
$ git pull https://git.openjdk.org/jdk pull/12350/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 12350

View PR using the GUI difftool:
$ git pr show -t 12350

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/12350.diff

@bridgekeeper

bridgekeeper bot commented Jan 31, 2023

👋 Welcome back epeter! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jan 31, 2023

@eme64 The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@TobiHartmann
Member

I think it would be helpful if you could include the TraceSuperWord output in your description. I'm not an expert in this code but it seems that this has always been wrong.

Maybe folks that recently worked on superword (@jbhateja, @sviswanathan, @fg1417) know more.

@fg1417

fg1417 commented Feb 3, 2023

We initially generate pairs in find_adjacent_refs . These are guaranteed to be independent.
But how do we guarantee that the packs stay independent when we do combine_packs?
Would we not have to check independent for each additional memop we add to a pack?
Because if we combine (a,b) with (b,c), we only know that a indep b and b indep c, but how would we know that a indep c?

Hi @eme64, superword combines the results from

if (memory_alignment(mem_ref, best_iv_adjustment) == 0 || _do_vector_loop) {

and

if (same_velt_type(mem_ref, best_align_to_mem_ref)) {

to guarantee there is no data dependency within one combined pack.

For example, in the case like

// int[] a, b;
for (int i = start; i < limit; i++) {
  b[i+OFFSET] = a[i];
}

it supposes that, if OFFSET * element_size_in_bytes % MaxVectorSize == 0 holds, vector load and vector store won't access overlapped memory within one vector execution. That also means that there won't be a dependency among nodes in the future combined pack, because the dependency between a and c you mentioned above actually comes from a memory dependency between load and store nodes. The algorithm works for all cases in the form above, where arrays a and b have the same data type.
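
In code form, the check is essentially the following (a standalone sketch with a made-up helper name, not the C2 code):

// Standalone sketch of the condition above (helper name made up, not C2 code).
static boolean offsetIsSafe(int offsetInElements, int elementSizeInBytes, int maxVectorSizeInBytes) {
    return (offsetInElements * elementSizeInBytes) % maxVectorSizeInBytes == 0;
}

// With int elements (4 bytes) and MaxVectorSize = 32:
// offsetIsSafe(8, 4, 32) == true   -> b[i+8] = a[i] is accepted
// offsetIsSafe(2, 4, 32) == false  -> b[i+2] = a[i] is rejected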

But the case you proposed hits the bug of the algorithm. In your case, find_adjacent_refs() visits in the order: StoreF nodes, StoreI nodes and then LoadI nodes. So it naturally takes a StoreF node as best_align_to_mem_ref here and the pointer won't be updated. Then, it takes all memory nodes as potential packs because it thinks that both LoadI and StoreI are independent of the StoreF accesses, which is obviously true since arrays of different data types can't be the same one. In fact, it should also decide if there is a potential overlapped access between LoadI and StoreI. And we know the overlapping does exist. I'm wondering if we can improve the algorithm by further comparing memory nodes with a special best_align_to_mem_ref_of_the_same_type for the same data type, if one exists.

Anyway, never mind. Your fix looks good to me!

Thanks.

@fg1417

fg1417 commented Feb 3, 2023

Sorry, to clarify:

When I wrote "if it holds OFFSET * element_size_in_bytes % MaxVectorSize == 0, vector load and vector store won't access overlapped memory within one vector execution", I meant that vector load and vector store won't access partially overlapped memory within one vector execution. They're still allowed to access completely overlapped memory within one vector execution, namely b[i] = a[i].

@eme64
Contributor Author

eme64 commented Feb 3, 2023

@fg1417 Thanks for your response. I was trying to understand the same code yesterday. And your explanations match what I have found.

The issue is that we find best_align_to_mem_ref to be the StoreF, and then we look at StoreI, which does not have the same alignment memory_alignment(mem_ref, best_iv_adjustment) != 0, but also not the same type, so we accept it too. Then, when we look at the LoadI, we check again for memory_alignment(mem_ref, best_iv_adjustment) == 0 and get true, so we also generate the pack.

I think we need to change the order of the if statements. When we have same_velt_type(mem_ref, best_align_to_mem_ref) == false, we should always check if the mem_ref is in conflict with any other pack that we already have - no matter if the alignment with best_align_to_mem_ref is perfect or not.

If we have same_velt_type(mem_ref, best_align_to_mem_ref) == true, then we can create the pack iff we have memory_alignment(mem_ref, best_iv_adjustment) == 0.
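
In pseudo-code, the proposed order of checks would look roughly like this (a sketch only, with made-up boolean inputs, not the actual create_pack code):

// Sketch of the proposed check order; the three inputs are made-up stand-ins
// for the real C2 queries, this is not the actual implementation.
static boolean acceptMemRef(boolean sameVeltTypeAsBest,
                            boolean memoryAlignmentIsZero,
                            boolean conflictsWithExistingPacks) {
    if (sameVeltTypeAsBest) {
        // Same velt type as best_align_to_mem_ref: require perfect alignment.
        return memoryAlignmentIsZero;
    }
    // Different velt type: alignment with best_align_to_mem_ref proves nothing,
    // so always check for conflicts with the packs we already accepted.
    return !conflictsWithExistingPacks;
}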

There is the additional complicating factor of _do_vector_loop: We want to bypass the alignment check memory_alignment(mem_ref, best_iv_adjustment) == 0, as well as the check against already existing packs. We set _do_vector_loop iff we are in the compilation of vmIntrinsics::_forEachRemaining, or if the compile directive is set like this: -XX:CompileCommand=vectorize,Test::test. The problem with this is that it removes the dependency checks if we have distances larger than 1. For example, I am getting wrong results for this example (Test2.java):

import java.util.stream.IntStream;

class Test2 {
    static final int RANGE = 512;
    static final int ITER  = 100;

    static void init(int[] data) {
       IntStream.range(0, RANGE).parallel().forEach(j -> {
           data[j] = j + 1;
       });
    }

    static void test(int[] data) {
       IntStream.range(0, RANGE - 2).forEach(j -> {
           data[j + 2] = data[j]; // distance 2, cyclic dependency, vectorization leads to wrong results
       });
    }

    static void verify(String name, int[] data, int[] gold) {
        for (int i = 0; i < RANGE; i++) {
            if (data[i] != gold[i]) {
                throw new RuntimeException(" Invalid " + name + " result: data[" + i + "]: " + data[i] + " != " + gold[i]);
            }
        }
    }

    public static void main(String[] args) {
        int[] data = new int[RANGE];
        int[] gold = new int[RANGE];
        init(gold);
        test(gold);
        for (int i = 0; i < ITER; i++) {
            init(data);
            test(data);
        }
        verify("test", data, gold);
    }
}

It passes with java -Xint Test2.java, but leads to a RuntimeException for java -Xbatch Test2.java, because of wrong results.

My analysis is this: we rely on the memory_alignment being == 0 for all packs of the same type. Otherwise, the independent(s1, s2) checks only guarantee independence for distance 1, but not any distance larger than 1.

I discussed with @chhagedorn and @vnkozlov : The plan is to swap the if statements in this change, and fix the issues with _do_vector_loop possibly in a next RFE. I will also add IR tests that ensure we do indeed vectorize if we have _do_vector_loop set.

@fg1417 thanks again for taking the time to investigate this, I'm now more sure I'm going in the right direction.

@eme64
Contributor Author

eme64 commented Feb 3, 2023

Generally, I am wondering about this though:
Why do we force the loads / stores of the same type to be completely overlapped (as @fg1417 calls it), i.e. have memory_alignment(p1, p2) == 0 for all p1, p2 of the same type? This seems to be more constrained than necessary. Why do we not just rely on packs being internally independent, i.e. independent(s1, s2) for all s1, s2 in the same pack?

This would then allow things like this:

    private static void test() {
        for (int i = 0; i < 100; i++) {
            iArr[i] = iArr[i+2]; // read forward is ok, vectorization leads to correct results
        }
    }

(We only do this if we have _do_vector_loop == true -> basically we need to give C2 the hint that vectorization is ok)

But it would prevent things like this:

    private static void test() {
        for (int i = 0; i < 100; i++) {
            iArr[i+2] = iArr[i]; // write forward leads to cyclic dependencies -> should not vectorize
        }
    }

(We currently vectorize this if we have _do_vector_loop == true, but that is a bug and leads to wrong results. C2 trusts us blindly.)

@fg1417
Copy link

fg1417 commented Feb 3, 2023

Generally, I am wondering about this though: Why do we force the loads / stores of the same type to be completely overlapped (as @fg1417 calls it), i.e. have memory_alignment(p1, p2) == 0 for all p1, p2 of the same type? This seems to be more constrained than necessary. Why do we not just rely on packs being internally independent, i.e. independent(s1, s2) for all s1, s2 in the same pack?

Yes, that's really a good idea to help vectorize more read-forward scenarios. My concern is that we would need to call independent(s1, s2) for all nodes in the same pack, and thus the number of calls would increase rapidly as MaxVectorSize increases. For example, we have 64 nodes for one byte pack when MaxVectorSize=64. For the current algorithm, memory_alignment(), the complexity is low. Besides, currently, we partially support reading forward. For the case like

// int[] a, b;
for (int i = start; i < limit; i++) {
  b[i] = a[i+OFFSET];
}

once OFFSET * element_size_in_bytes % MaxVectorSize == 0 holds, which covers the completely overlapped case, the loop can be vectorized successfully.

Thanks.

@eme64
Contributor Author

eme64 commented Feb 3, 2023

@fg1417 right, if we have 64 memops in a vector, we would have to check all-to-all that they are independent. Maybe that is too expensive in some cases.
However, maybe there is a way to reduce this overhead.

A first idea:
After combine_packs, we verify that all packs are independent. For each pack we do this:
For each memop in the pack, start a BFS over the nodes in the block, following the edges in DepPreds recursively. If we ever find another memop of the same pack, we have a cyclic dependency and reject the pack.
We can actually do this BFS in parallel, starting at all memops of a pack at the same time. Thus, per pack we traverse all nodes in the block at most once. So we would traverse the graph m times if we have m packs.
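
A standalone sketch of that idea (plain Java over an adjacency list, not the C2 data structures or the final verify_packs code):

import java.util.*;

public class PackIndependenceSketch {
    // deps.get(v) lists the DepPreds-style predecessors of node v in the block.
    // A pack is independent iff no member is reachable from another member.
    static boolean packIsIndependent(List<List<Integer>> deps, Set<Integer> pack) {
        Deque<Integer> worklist = new ArrayDeque<>();
        Set<Integer> visited = new HashSet<>();
        // Start the BFS at all members of the pack at once.
        for (int memop : pack) {
            for (int pred : deps.get(memop)) {
                if (visited.add(pred)) {
                    worklist.add(pred);
                }
            }
        }
        while (!worklist.isEmpty()) {
            int n = worklist.poll();
            if (pack.contains(n)) {
                return false; // reached another pack member -> dependence inside the pack
            }
            for (int pred : deps.get(n)) {
                if (visited.add(pred)) {
                    worklist.add(pred);
                }
            }
        }
        return true; // every block node was visited at most once for this pack
    }
}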

I have some more ideas, but I would need to better understand what we allow to be parallelized, to know what assumptions we can make.

I guess this would become a larger project, designing the algorithm and implementing, testing and benchmarking it. For now I will just fix the broken issues, before diving more into extending vectorization.

@eme64 eme64 changed the title 8298935: Superword: handle cyclic dependencies with offset larger than 1 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs Feb 8, 2023
@eme64 eme64 marked this pull request as ready for review February 9, 2023 08:38
@openjdk openjdk bot added the rfr Pull request is ready for review label Feb 9, 2023
@mlbridge

mlbridge bot commented Feb 9, 2023

@eme64
Contributor Author

eme64 commented Feb 9, 2023

@tstuefe @reinrich Would you mind reviewing / testing this for PPC? Maybe you would also want to port some tests to PPC?

I also wonder:
Would it be possible to have some verification added to the SuperWord code before scheduling, that verifies that all packs are memory aligned, if vectors_should_be_aligned() == false? Because currently I can do -XX:+AlignVector, but if we make a mistake and still emit unaligned vector loads/stores, we would not detect this on an Intel machine, for example.

@vnkozlov
Contributor

vnkozlov commented Mar 9, 2023

There was no big meaning in my question "Does not matter if they are in an other pack or not?"
As you explained, we go through memory and data inputs. In the simple case they would be in another pack (since we are looking only inside the block). But in the _do_vector_loop case (and maybe other cases) some packs could be eliminated, leaving nodes not in packs. But that does not hinder the search for dependence. That is what I wanted to say and ask for confirmation.

Contributor

@vnkozlov vnkozlov left a comment


Looks good to me.

@openjdk

openjdk bot commented Mar 9, 2023

@eme64 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8298935: fix independence bug in create_pack logic in SuperWord::find_adjacent_refs

Reviewed-by: kvn, jbhateja

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Mar 9, 2023
Member

@jatin-bhateja jatin-bhateja left a comment


Thanks @eme64 for your very informative and detailed explanations.

@eme64 eme64 changed the title 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs 8298935: fix independence bug in create_pack logic in SuperWord::find_adjacent_refs Mar 13, 2023
@openjdk openjdk bot added merge-conflict Pull request has merge conflict with target branch and removed ready Pull request is ready to be integrated labels Mar 13, 2023
@eme64
Contributor Author

eme64 commented Mar 13, 2023

#12350 (comment)

So is there a 4th Bug lurking here?

@vnkozlov I now found an example that reveals this Bug 4. I want to fix it in a separate Bug JDK-8304042.

This is the method:

    static void test(int[] dataI1, int[] dataI2, float[] dataF1, float[] dataF2) {
        for (int i = 0; i < RANGE/2; i+=2) {
            dataF1[i+0] = dataI1[i+0] + 0.33f;            // 1
            dataI2[i+1] = (int)(11.0f * dataF2[i+1]);     // 2

            dataI2[i+0] = (int)(11.0f * dataF2[i+0]);     // 3
            dataF1[i+1] = dataI1[i+1] + 0.33f;            // 4
        }
    }

Note: dataI1 == dataI2 and dataF1 == dataF2. I only had to use two references so that C2 does not know this, and does not optimize away load after store.

Lines 1 and 4 are isomorphic and independent. The same holds for lines 2 and 3. We create the packs [1,4] and [2,3], and vectorize (with and without my patch). However, we have the following dependencies: 1->3 and 2->4. This creates a cyclic dependency between the two packs.

As explained in the previous #12350 (comment), we have to verify that there are no cyclic dependencies between the packs, just before we schedule. The SuperWord paper states this in "3.7 Scheduling".

@openjdk openjdk bot added ready Pull request is ready to be integrated and removed merge-conflict Pull request has merge conflict with target branch labels Mar 13, 2023
@vnkozlov
Contributor

@vnkozlov I now found an example that reveals this Bug 4. I want to fix it in a separate Bug JDK-8304042.

Agree.

@vnkozlov
Contributor

vnkozlov commented Mar 15, 2023

@eme64, I ran our vector jtreg tests (including jdk/incubator/vector) with -XX:-TieredCompilation -Xbatch -XX:CICompilerCount=1 -XX:+TraceNewVectors on an AVX512 Linux machine, and everything is fine except one test:

compiler/loopopts/superword/TestPickLastMemoryState.java

With these changes we almost don't generate vectors (maybe 2 per @run). Without the changes we got about 160 new vectors (50 per @run). The test has several @run commands for different vector sizes.

@eme64
Contributor Author

eme64 commented Mar 15, 2023

@vnkozlov I'm looking into TestPickLastMemoryState.java

These are the relevant vectorized methods:

  • f: vectorizes with master, nothing with my patch.
  • reset: both vectorize.
  • test1 - 6: vectorize with master, nothing with the patch.

Explanation:

  • f: b[h - 1] and b[h] misaligned. On master, it vectorizes for MaxVectorSize 16 and 32, but not 64. No vectorization with patch. That is because the loop range is too small in f (32 iterations).
  • reset: just a simple VectorStore with all zeros. Not much can go wrong here.
  • test1: iArr[i1] and iArr[i1+1] misaligned. Only vectorizes on master.
  • test2: iArr[i1] and iArr[i1+2] misaligned. Only vectorizes on master.
  • test3: iArr[i1] and iArr[i1-2] misaligned. Only vectorizes on master.
  • test4: iArr[i1] and iArr[i1-3] misaligned. Only vectorizes on master.
  • test5: iArr[i1] and iArr[i1+2] misaligned. Only vectorizes on master.
  • test6: iArr[i1] and iArr[i1+1] misaligned. Only vectorizes on master.

My first guess is that it means that we reject the misaligned slice during find_adjacent_refs. But then we re-introduce (some of) the memops during extend_packlist, because we do not only follow non-memops, but also memops. I call this a "happy accident" before my fix, which should not be allowed: it can indeed lead to bugs on very similar examples. I now disallow these "happy accidents" by forbidding the re-introduction of memops during extend_packlist.

I am running the test like this:

./java -Xcomp -XX:CompileCommand=compileonly,*TestPickLastMemoryState::test1 -XX:-TieredCompilation -Xbatch -XX:CICompilerCount=1 -XX:+TraceNewVectors -XX:CompileCommand=printcompilation,*TestPickLastMemoryState::* -XX:+TraceSuperWord /home/emanuel/Documents/fork6-jdk/open/test/hotspot/jtreg/compiler/loopopts/superword/TestPickLastMemoryState.java

On master, I see for test1 that we only have the LoadL, StoreL after find_adjacent_refs. During extend_packlist, we then extend also to LoadI, StoreI. The same seems to be true for test2 - 6. For f, I only see StoreF after find_adjacent_refs, but the StoreI, LoadI are re-introduced during extend_packlist. That perfectly matches the "misalignments" I named above. If I am correct, it does not always vectorize everything: only the stores that are somehow "connected" are re-introduced. Those that are not connected to any memop that "survives" stay excluded, as we have no way to extend to them.

Now let's see what would vectorize using the -XX:CompileCommand=Vectorize,*TestPickLastMemoryState::*.

I generally see everything vectorize again. In fact, there is more vectorization, because we do not only re-introduce the rejected memops that are connected to the "surviving" memops: we keep all memops, reject none, and later verify that it is safe to keep them.

There is one exception: With -XX:MaxVectorSize=16, test1-6 still do not vectorize. On master, it vectorizes, because we find the LoadL, StoreL during find_adjacent_refs. However, it seems we create two [LoadL,LoadL] pairs, and two [StoreL,StoreL] pairs. There is no "bridging" pack that would later bind them together. Why is there none? The reason is that two longs already fill a vector, and we do not create packs that go over the "alignment boundary". So when we technically have 4 adjacent ops 1, 2, 3, 4, then SuperWord::stmts_can_pack will only allow [1,2] and [3,4] to be paired, and [2,3] is rejected, because it crosses the 16-byte "alignment boundary". This is because the align(s) of both 1 and 3 is 0, and for 2 and 4 it is 8, modulo the 16 bytes. But of course if we had 32 byte vectors, they could have align(s) 0, 8, 16, 24, and then the pair [2,3] would be allowed by SuperWord::stmts_can_pack. On master, this leads to vectorization, because now all packs only ever have two elements, including the int packs. Of course we could fit 4 elements, but we will never create the "bridging" pairs, since we do not have them from the "seed" pairs given by find_adjacent_refs.
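
The alignment arithmetic, as I understand it, is essentially the following (hypothetical helper for illustration, not C2 code):

// Hypothetical helper, not C2 code: alignment is the byte offset modulo the vector width.
static int align(int byteOffset, int vectorWidthInBytes) {
    return byteOffset % vectorWidthInBytes;
}

// 4 adjacent long ops at byte offsets 0, 8, 16, 24:
// vector width 16 -> align = 0, 8, 0, 8:   only [1,2] and [3,4] can be paired,
//                                          the pair [2,3] crosses the alignment boundary.
// vector width 32 -> align = 0, 8, 16, 24: the pair [2,3] would also be allowed.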

However, with my patch and Vectorize, we still only create packs for long with two elements, but for int we can create 4-element packs (alignment 0, 4, 8, 12 fits in 16 bytes without modulo wrapping). So it looks like we might be able to vectorize, even after combine_packs. However, during filter_packs, we run into issues at the long-int conversion: we have two 2-element long packs feeding into a single 4-element int pack. And that is not implemented. So we reject those packs, and the vectorization falls apart.

I think that the issue with MaxVectorSize=16 is that we should only unroll 2x. But we unroll 4x at minimum somehow. This gives us the 4 int ops, that fit into the vector_width. For larger MaxVectorSize=32, unrolling 4x is ok, and for MaxVectorSize=64 we even unroll 8x, but not more. Then, we only fill half of the vector_width with int elements, but the full vectors with the long elements. If I am correct about this, the issue may really only show up on small vector sizes with large types. So maybe this is not too worrying.

Conclusion:

What you saw in TestPickLastMemoryState.java was expected; it is one of the "happy accidents" -> "collateral damage" cases of re-introducing memops during extend_packlist. It is unsafe, so we have to forbid it. Further, it only allows partial vectorization where full vectorization would be possible: we only re-introduce memops that are connected to memops of other memory slices which did not get rejected during find_adjacent_refs.

CompileCommand Vectorize can remedy most of these cases. And I hope to bring that functionality to all -AlignVector cases in the future.

There are some odd cases, where that remedy does not work. We saw an example above. The issue is that if we have different types with different element-byte-sizes, we may pack different numbers of elements into the packs. At the conversion-points, this leads to a vectorization-blocker, because we cannot have multiple vectors feed into a single one, or the other way around.

We could try to address this with a follow-up RFE. For example, we could split vectors if we have multiple vectors as input/output. But most likely this is an edge case that is not very worrying, as I explained above.

This exercise here also shows that we need even more IR tests, to catch these cases. Or we need to convert more of the SuperWord tests to IR tests.

@vnkozlov What do you think about this?

@vnkozlov
Contributor

I call this "happy accident" before my fix, which should not be allowed. It can indeed lead to bugs on very similar examples. I now disallow these "happy accidents", I forbid the re-introduction of memops during extend_packlist.

The test was added by the 8290910 fix in JDK 20. Before that, superword produced wrong results in these tests. The issue was found with fuzzer testing. I don't know where the test in 8293216 comes from. So you are right about this being a corner case.

Based on this I agree with not allowing vectorization in such cases for now.

But please file an RFE to look at these cases later. If vectorization produces valid results, we should allow it. I understand that we are missing more precise checks which separate valid from invalid misaligned operations. I am not suggesting adding back code which extends memory ops without any checks, but maybe we can improve find_adjacent_refs so that we can accept such cases. It is a very complex case and the fix could also be complex.

@eme64
Contributor Author

eme64 commented Mar 15, 2023

Ok, great.
Thanks everybody for the help!
@vnkozlov @jatin-bhateja thanks for the reviews!
@fg1417 @reinrich thanks for the testing and feedback!
@TobiHartmann thanks for the review suggestions!

I have the following follow-up RFEs:

JDK-8303113 [SuperWord] investigate if enabling _do_vector_loop by default creates speedup
I want to see if we can vectorize more, using the Vectorize approach: given -AlignVector, do no alignment checks, create packs optimistically. Filter out packs that are not independent later. This will remedy lots of the "collateral damage" of this Bug-fix here.

JDK-8260943 Revisit vectorization optimization added by 8076284
This is an old one. But we should either delete the dead code that is not hard-coded to be false, or fix it. This is related to _do_vector_loop (JDK-8076284 first introduced it).

JDK-8303827 C2 SuperWord: allow more fine grained alignment for +AlignVector
We should fix the "collateral damage" for the +AlignVector case. We can do that by relaxing the strict alignment requirement a bit, to 4/8-byte. The vector_width alignment was required on SPARC. But even there, the vectors were not longer than 8 bytes (@vnkozlov ). Some collateral damage will happen, for example some conversions will not be vectorized after this fix here. That will for example affect TestVectorizeTypeConversion.java on some platforms with +AlignVector.
But I will not do this myself, and probably nobody from Oracle will, as all our machines have -AlignVector. If anybody is interested in fixing this, you are free to take over the bug!

JDK-8304042 C2 SuperWord: schedule must remove packs with cyclic dependencies
The Bug 4 mentioned above, where we have cyclic dependencies on the packs, even when all packs are independent.

Thanks again to all the involved!
/integrate

@openjdk

openjdk bot commented Mar 15, 2023

Going to push as commit 01e6920.
Since your change was applied there have been 38 commits pushed to the master branch:

  • 3d77e21: 8301308: Remove version conditionalization for gcc/clang PRAGMA_DIAG_PUSH/POP
  • e3777b0: 8270865: Print process ID with -Xlog:os
  • 349139b: 8304030: Configure call fails on AIX when using --with-gtest option.
  • 714b5f0: 8294962: Convert java.base/jdk.internal.module package to use the Classfile API to modify and write module-info.class
  • 065d3e0: 8304171: Fix layout of JCov instrumented bundle on Mac OS
  • cd41c69: 8303705: Field sleeper.started should be volatile JdbLockTestTarg.java
  • f5c8b68: 8301998: Update HarfBuzz to 7.0.1
  • 617c15f: 8304172: ProblemList serviceability/sa/UniqueVtableTest.java
  • f81e1de: 8303882: Refactor some iterators in jdk.compiler
  • 45809fd: 8295884: Implement IDE support for Eclipse
  • ... and 28 more: https://git.openjdk.org/jdk/compare/25e7ac226a3be9c064c0a65c398a8165596150f7...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Mar 15, 2023
@openjdk openjdk bot closed this Mar 15, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Mar 15, 2023
@openjdk

openjdk bot commented Mar 15, 2023

@eme64 Pushed as commit 01e6920.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

fg1417 pushed a commit to fg1417/jdk that referenced this pull request Mar 21, 2023
openjdk#8877 introduced the global
option `SuperWordMaxVectorSize` as a temporary solution to fix
the performance regression on some x86 machines.

Currently, SuperWordMaxVectorSize behaves differently between
x86 and other platforms[1]. For example, if the current machine
only supports `MaxVectorSize <= 32`, but we set
`SuperWordMaxVectorSize = 64`, then `SuperWordMaxVectorSize`
will be kept at 64 on other platforms, while an x86 machine would
change `SuperWordMaxVectorSize` to `MaxVectorSize`. Platforms
other than x86 lack a similar implementation to [2].

Also, `SuperWordMaxVectorSize` limits the max vector size of
auto-vectorization to `64` bytes, which is fine for current aarch64
hardware, but the SVE architecture supports vectors larger than 512 bits.

The patch is to drop the global option and use an architecture-
dependent interface to consult the max vector size for auto-
vectorization, fixing the performance issue on x86 and reducing
side effects for other platforms. After the patch, auto-
vectorization is still limited to 32-byte vectors by default
on Cascade Lake and users can override this by either setting
`-XX:UseAVX=3` or `-XX:MaxVectorSize=64` on JVM command line.

So my question is:

Before the patch, we could have a smaller max vector size for
auto-vectorization than `MaxVectorSize` on x86. For example,
users could have `MaxVectorSize=64` and
`SuperWordMaxVectorSize=32`. But after the change, if we set
`-XX:MaxVectorSize=64` explicitly, then the max vector size for
auto-vectorization would be `MaxVectorSize`, i.e. 64 bytes, which
I believe is more reasonable. @sviswa7 @jatin-bhateja, are you
happy with the change?

[1] openjdk#12350 (comment)
[2] https://github.com/openjdk/jdk/blob/33bec207103acd520eb99afb093cfafa44aecfda/src/hotspot/cpu/x86/vm_version_x86.cpp#L1314-L1333
fg1417 pushed a commit to fg1417/jdk that referenced this pull request Mar 30, 2023
As @eme64 said in [1], JDK-8298935 introduced some "collateral
damage", disabling the vectorization of some conversions when
`+AlignVector`. That affects IR checks of
TestVectorizeTypeConversion.java and ArrayTypeConvertTest.java
on some aarch64 platforms like ThunderX and ThunderX2 [2].

This trivial patch allows the IR check only when we have
"-AlignVector".

[1] openjdk#12350 (comment)
[2] https://github.com/openjdk/jdk/blob/7239150f8aff0e3dc07c5b27f6b7fb07237bfc55/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp#L154