8286941: Add mask IR for partial vector operations for ARM SVE #9037
Conversation
vnkozlov
left a comment
Changes are significant. I suggest waiting for JDK 20 (next week).
Sure. It's fine for me to wait for JDK 20. Thanks a lot for the advice!
case Op_FmaVD:
case Op_FmaVF:
case Op_MacroLogicV:
case Op_LoadVectorMasked:
Why is it removed?
Thanks for looking at this PR! This is removed because we added two combined rules in aarch64_sve.ad (line 2383) like:
match(Set dst (VectorLoadMask (LoadVectorMasked mem pg)));
Keeping Op_LoadVectorMasked as a root in the matcher prevents such rules from being matched. I'm also curious why this op was added here. Does removing it have any influence? Thanks!
@jatin-bhateja, could you please help check the influence of removing this? I'd kindly like to know your feedback. Thanks so much!
It was added specifically for #302.
It is used for the ArrayCopyPartialInlineSize code (macroArrayCopy.cpp#L250).
Yes, it is strange that it is forced to always be in a register. The instruction in x86.ad should be enough to put it into a register.
> @jatin-bhateja, could you please help check the influence of removing this?

I think this can be modified; GVN-based sharing should be sufficient here.
Thanks a lot for looking at this part!
Node* MaskAllNode::Ideal(PhaseGVN* phase, bool can_reshape) {
  // Transform (MaskAll m1 (VectorMaskGen len)) ==> (VectorMaskGen len)
  // if the vector length in bytes is lower than the MaxVectorSize.
  if (is_con_M1(in(1)) && length_in_bytes() < MaxVectorSize) {
Due to #8877, such a length check may not be correct here.
And I don't see an in(2)->Opcode() == Op_VectorMaskGen check.
I think the changes in #8877 only influence the max vector length in SuperWord? And since MaskAll is used for the VectorAPI, MaxVectorSize should still be the right reference? @jatin-bhateja, could you please help check whether this has any influence on x86 AVX-512 systems? Thanks so much!

> And I don't see an in(2)->Opcode() == Op_VectorMaskGen check.

Yes, Op_VectorMaskGen is not generated for MaskAll when its input is a constant. We directly transform the MaskAll to a VectorMaskGen here, since the two have the same meaning. Thanks!
> And I don't see an in(2)->Opcode() == Op_VectorMaskGen check.

I'm sorry, my comment at line 1819 is not right and misunderstood you. I will change it later. Thanks!
I prefer not to transform MaskAll to VectorMaskGen for now, since there are match rules using (MaskAll m1) in both the SVE and AVX-512 backends. Doing the transformation may influence those rules.
> I think the changes in #8877 only influence the max vector length in SuperWord? And since MaskAll is used for the VectorAPI, MaxVectorSize should still be the right reference? @jatin-bhateja, could you please help check whether this has any influence on x86 AVX-512 systems? Thanks so much!

All these transforms are guarded and currently enabled only for AArch64; I do not think it will impact AVX-512.

> And I don't see an in(2)->Opcode() == Op_VectorMaskGen check.
>
> Yes, Op_VectorMaskGen is not generated for MaskAll when its input is a constant. We directly transform the MaskAll to a VectorMaskGen here, since the two have the same meaning. Thanks!
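For context, at the Java level an all-true mask on a small species is typically written with VectorSpecies.maskAll(true); a minimal sketch, assuming a 256-bit species on 512-bit SVE hardware (class, method and array names are illustrative):

```java
import jdk.incubator.vector.*;

public class MaskAllSketch {
    // A 256-bit species; on 512-bit SVE hardware the mask covers only half a register.
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256;

    static int sumEightLanes(int[] a, int offset) {
        // An all-true mask over the species; on targets with predicate registers this
        // constant mask broadcast is roughly what the MaskAll IR node represents.
        VectorMask<Integer> all = SPECIES.maskAll(true);
        IntVector v = IntVector.fromArray(SPECIES, a, offset, all);
        return v.reduceLanes(VectorOperators.ADD, all);
    }
}
```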
Hi @vnkozlov, all your review comments are resolved. Could you please take a look at this PR again? Thanks so much!
vnkozlov
left a comment
Thank you for addressing my comments.
You need review from Jatin and someone familiar with the aarch64 code (I did not look at it).
Sure, thanks a lot for the review!
Hi @jatin-bhateja, could you please help take a look at this PR, especially the changes to the …
Node* mask = NULL;
// Generate a vector mask for vector operation whose vector length is lower than the
// hardware supported max vector length.
if (vt->length_in_bytes() < MaxVectorSize) {
For completeness, the length comparison check can be done against MIN(SuperWordMaxVectorSize, MaxVectorSize), even though SuperWordMaxVectorSize differs from MaxVectorSize only for certain x86 targets and this control flow is currently only executed for AArch64 SVE targets.
Yes, I agree with adding the SuperWordMaxVectorSize reference. Thanks! I will change it.
I'm afraid we cannot directly use MIN(SuperWordMaxVectorSize, MaxVectorSize) here. As far as I know, SuperWordMaxVectorSize is used to control the max vector limit specifically for auto-vectorization; it should not have any influence on the VectorAPI max vector size. So if the supported max vector size for the VectorAPI is larger than the one for auto-vectorization, the transformation will be affected. For example, if an SVE machine supports a 128-byte max vector size but SuperWordMaxVectorSize is 64, the predicate would not be generated for VectorAPI vectors that are at least 64 bytes but still smaller than the 128-byte hardware maximum. I think x86 has a similar issue. We'd better have a unified method to compute the max vector size that handles the difference between SuperWord and the VectorAPI, and then replace all the original max_vector_size usages with it. WDYT?
BTW, the max vector size used here should be the hardware-supported max vector size, which is MaxVectorSize for SVE. For vectors whose size equals SuperWordMaxVectorSize but is smaller than the hardware-supported max size, we still need to generate a predicate to make sure the results are correct. So using MaxVectorSize is necessary here.
The mask is generated based on the real vector length, which guarantees that only the vector lanes inside the valid vector length perform the operation; the higher lanes do nothing. So if a VectorStore has a vector length equal to SuperWordMaxVectorSize while MaxVectorSize is larger, a mask is generated based on the real length SuperWordMaxVectorSize, and only the lanes under SuperWordMaxVectorSize are stored. So I cannot see the influence on the out-of-memory issue. Could you please give an example apart from the x86 cases? Thanks so much!

BTW, I think the flag SuperWordMaxVectorSize should default to MaxVectorSize on other platforms instead of 64. If a platform (like SVE) supports a max vector size larger than 64 bytes for auto-vectorization, it can currently only generate vectors with a 64-byte vector size, which seems unreasonable. Do you have any plan to refactor this flag, or to remove its usages in the future? Thanks!
> The mask is generated based on the real vector length, which guarantees that only the vector lanes inside the valid vector length perform the operation; the higher lanes do nothing. […]

Correct, got it; I removed my review comment above.

> BTW, I think the flag SuperWordMaxVectorSize should default to MaxVectorSize on other platforms instead of 64. […]
  }
}
- return NULL;
+ return LoadVectorNode::Ideal(phase, can_reshape);
These predicated nodes are concrete ones with fixed species and carry a user-specified mask; I am not clear why we need a mask re-computation for predicated nodes.
The higher lanes of a predicated operand should already be zero, and the mask attached to a predicated node should be correct by construction, since the mask lane count is always equal to the vector lane count.
Actually we don't need to add an additional mask for these masked nodes for SVE. Please refer to the partial_op_sve_needed method in aarch64_sve.ad, which lists all the ops that need this transformation. Thanks! The call to LoadVectorNode::Ideal(phase, can_reshape) is just because LoadVectorMaskedNode derives from LoadVectorNode, so it may inherit other transformations in the future or on other platforms.
  }
}
- return NULL;
+ return StoreVectorNode::Ideal(phase, can_reshape);
Same as above.
Same as LoadVectorMaskedNode: StoreVectorMaskedNode doesn't need the transformation either.
if (Matcher::vector_needs_partial_operations(this, vt)) {
  return VectorNode::try_to_gen_masked_vector(phase, this, vt);
}
This is a parent node of TrueCount/FirstTrue/LastTrue and MaskToLong, which perform mask-querying operations on concrete predicate operands; a transformation here looks redundant to me.
The main reason to add the transformation here is that FirstTrue needs a reference to the real vector length for SVE, so we need to generate a predicate when the vector length is smaller than the max vector size. Please check the changes to partial_op_sve_needed in aarch64_sve.ad.
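For reference, the Java-level source of this node is VectorMask.firstTrue(); a minimal sketch of a search loop where the result has to respect the real (smaller-than-hardware) vector length, assuming a 256-bit species (class and array names are illustrative):

```java
import jdk.incubator.vector.*;

public class FirstTrueSketch {
    // A 256-bit species; on 512-bit SVE hardware every mask operation below is partial.
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256;

    static int indexOf(int[] a, int key) {
        for (int i = 0; i < a.length; i += SPECIES.length()) {
            VectorMask<Integer> inRange = SPECIES.indexInRange(i, a.length);
            IntVector v = IntVector.fromArray(SPECIES, a, i, inRange);
            // firstTrue() maps to the FirstTrue mask query discussed above; its result
            // must be bounded by the real 256-bit vector length, not the register size.
            int lane = v.compare(VectorOperators.EQ, key, inRange).firstTrue();
            if (lane < SPECIES.length()) {
                return i + lane;
            }
        }
        return -1;
    }
}
```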
The real vector length can be obtained from the incoming input mask node.
jatin-bhateja
left a comment
Hi @XiaohongGong, thanks for your explanations; the common IR changes look good to me.
nsjian
left a comment
AArch64 changes look good to me.
Thanks for the review @nsjian!
I submitted testing. Will let you know the results before approval.
Thanks a lot for doing this!
I got a failure in tier1:
I put the output into the RFE comment.
Thanks for pointing this out! My fault: I called the wrong assembler function in the newly added tests. I will fix it soon!
Hi @vnkozlov, the fix is pushed to this PR. Would you mind running the tests again? Thanks so much!
Hi @nick-arm, could you please take a look at the AArch64 part of the changes? Thanks so much for your time!
Another test failed in tier2: compiler/loopopts/superword/TestPickFirstMemoryState.java
Thanks for the tests again! It seems this is the same issue as #2867, where the type of …
Hi @vnkozlov, the fix is pushed to the PR. I tested the failing case on an AVX-512 machine and the failure is gone. Could you please run the tests again? Thanks a lot!
I started new testing.
vnkozlov
left a comment
Results are good.
@XiaohongGong This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements; please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: […] You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time this comment was updated, 207 new commits had been pushed to the target branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the target branch, use the /integrate command.
Thanks a lot for your time again!
@nick-arm, could you please take a look at this change again? Thanks so much!
Thanks for the review @nick-arm!
/integrate |
Going to push as commit a79ce4e.
Your commit was automatically rebased without conflicts.
@XiaohongGong Pushed as commit a79ce4e. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
The VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware supports. We call them partial vector operations. For some partial operations, like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operation so that the results are correct.
For example, if the user defines an IntVector with a 256-bit species and runs it on SVE hardware whose max vector size is 512 bits, all the 256-bit int vector operations are partial, and for some ops a mask is generated in which all lanes above the real vector length are set to 0.
Currently the mask is generated in the backend, together with the code generation for each op in its match rule. This produces many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant and could be hoisted out of the loop.
Here is an example of a vector load and add reduction inside a loop:
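A Java-level sketch of the kind of loop being described, assuming a 256-bit IntVector species running on wider SVE hardware (class, method and array names are illustrative):

```java
import jdk.incubator.vector.*;

public class PartialReductionSketch {
    // A 256-bit species; on 512-bit SVE hardware both the load and the
    // add reduction below are partial vector operations.
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256;

    static int sum(int[] a) {
        int sum = 0;
        int i = 0;
        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            IntVector v = IntVector.fromArray(SPECIES, a, i);  // partial vector load
            sum += v.reduceLanes(VectorOperators.ADD);         // partial add reduction
        }
        for (; i < a.length; i++) {                            // scalar tail
            sum += a[i];
        }
        return sum;
    }
}
```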
In the generated code, the mask generation instruction ptrue is duplicated. To improve this, the patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can then be optimized out by GVN and hoisted outside of the loop.
Note that for masked vector operations there is no need to generate an additional mask, even though the vector length is smaller than the max vector register size, because the higher bits of the original input mask have already been cleared out.
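For example, in a masked (predicated) kernel like the sketch below, the user-supplied mask already limits the active lanes, so no extra length-based predicate has to be attached to the masked load, add and store (a 256-bit species is assumed; class and array names are illustrative):

```java
import jdk.incubator.vector.*;

public class MaskedKernelSketch {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256;

    static void add(int[] a, int[] b, int[] c) {
        for (int i = 0; i < a.length; i += SPECIES.length()) {
            // The in-range mask is defined on the 256-bit species, so all mask bits
            // beyond the real vector length are already clear.
            VectorMask<Integer> m = SPECIES.indexInRange(i, a.length);
            IntVector va = IntVector.fromArray(SPECIES, a, i, m);   // masked load
            IntVector vb = IntVector.fromArray(SPECIES, b, i, m);
            va.lanewise(VectorOperators.ADD, vb, m)
              .intoArray(c, i, m);                                  // masked store
        }
    }
}
```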
Here is the performance gain for 256-bit vector reduction workloads on a 512-bit SVE system:
This patch also adds 32-bit variants of the SVE whileXX instructions, with one more matching rule for VectorMaskGen (ConvI2L src). So after this patch, we save one sxtw instruction for most VectorMaskGen cases, like below:

Progress
Issue
Reviewers
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9037/head:pull/9037
$ git checkout pull/9037
Update a local copy of the PR:
$ git checkout pull/9037
$ git pull https://git.openjdk.org/jdk pull/9037/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 9037
View PR using the GUI difftool:
$ git pr show -t 9037
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9037.diff