8335444: Generalize implementation of AndNode mul_ring #20066
Conversation
@jaskarth This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 736 new commits pushed to the master branch.

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.
Webrevs
The common cases I saw in java.base and other real-world programs that aren't currently handled are positive ranges, negative constants, and ranges that cross 0. For positive ranges it's possible to generalize the existing case with a positive constant, and for negative constants and ranges that cross 0 I came up with an estimation. For a positive constant

In the case of negative constants and ranges that cross 0, it's more difficult to numerically derive a solution due to two's complement, as the minimum value can be lower than both of the ranges' minimum values. Since a negative two's complement number always begins with a sequence of 1s, and that sequence of 1s grows shorter as the number becomes lower, the lower bound is the number that contains the sequence of leading 1s of the smaller of the two ranges' minimum values. For the upper bound, the result is the higher of the two ranges' maximum values, as bitwise-and with a negative value can return a result up to the positive value.

While these methods produce much more precise types for many cases, they are by no means optimal. For that I think proper bitwise propagation is needed, such as with #17508. However, I think this is still worthwhile to get in, since it's an easy win for most cases and a pretty simple patch.
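The leading-ones lower bound described above can be sketched as follows. This is an illustrative reimplementation, not the actual HotSpot code; `leading_ones` and `and_neg_lower_bound` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Count the run of leading 1 bits in a 32-bit two's complement value.
static int leading_ones(int32_t v) {
  int n = 0;
  for (uint32_t u = (uint32_t)v; (u & 0x80000000u) != 0u; u <<= 1) {
    n++;
  }
  return n;
}

// Safe lower bound for (x & y) with x in [lo0, -1] and y in [lo1, -1]:
// every value >= lo in a range [lo, -1] has at least as many leading 1s
// as lo itself, and AND preserves the 1 bits common to both operands,
// so the result keeps the leading 1s of the smaller (more negative) low.
int32_t and_neg_lower_bound(int32_t lo0, int32_t lo1) {
  int32_t smaller = lo0 < lo1 ? lo0 : lo1;
  int ones = leading_ones(smaller);           // >= 1 because smaller < 0
  // Smallest value that still has 'ones' leading 1 bits: 1...1 then 0s.
  return (int32_t)(INT32_MIN >> (ones - 1));  // arithmetic right shift
}
```

For example, with r0 = [-3, -1] and r1 = [-7, -1] this gives -8, which covers the worst case -3 & -6 = -8 discussed below.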
Not a review, but I quickly ran it through our testing and the following test fails with

Test: compiler/vectorization/runner/BasicBooleanOpTest.java
Output:
src/hotspot/share/opto/mulnode.cpp (Outdated)

    int shift_bits = sel_val == 0 ? std::numeric_limits<NativeType>::digits : count_leading_zeros(sel_val) - 1;
    ...
    // Since the result of bitwise-and can be as high as the largest value of each range, the result should find the maximum.
    return IntegerType::make(std::numeric_limits<NativeType>::min() >> shift_bits, MAX2(r0->_hi, r1->_hi), widen);
I was just wondering if it would be OK to use (r0->_lo & r1->_lo) as the minimum value instead of finding the number of leading 1s and shifting. It seems to be a bit more constrained and compact (and seemingly correct). What do you think?
I actually did experiment with this before coming up with the current approach, but I found that there are cases where it produces incorrect bounds. With a small example where r0 = [-3, -1] and r1 = [-7, -1], using r0->_lo & r1->_lo results in -3 & -7, which equals -7. However, in the case where the value in r0 is -3 and r1 is -6, we can get an out of bounds result as -3 & -6 equals -8. Because of that I thought the best way to fix this was to find the common leading 1s. This approach does lead to us losing some information in the lower-order bits but I thought it was an acceptable compromise since the code to handle finding common bits across a range of integers becomes quite a bit more complicated. I hope this clarifies!
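The counterexample can be checked directly; a minimal sketch with the values from the discussion above (`lo_and_lo_is_unsound` is a made-up name):

```cpp
#include <cassert>

// Demonstrates why r0->_lo & r1->_lo is not a valid lower bound for
// r0 = [-3, -1] and r1 = [-7, -1]: the candidate bound is -3 & -7 = -7,
// but the in-range pair (-3, -6) ANDs to -8, which is below that bound.
bool lo_and_lo_is_unsound() {
  int candidate = -3 & -7;  // -7
  int worst     = -3 & -6;  // -8
  return worst < candidate;
}
```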
You're totally right! It is not even related to the LSB (-14 & -6 would have the same problem with -12). Finding the leading 1s is the right solution. Thanks a lot for the clarification!
Can the upper limit be improved similar to what you added for the "both ranges are positive" case if we know that both ranges are negative?
In the positive case, we have values from:
011...1
000...0
while in the negative case, we have values from:
111...1
100...0
It suggests that we can then use the same argument as for the positive case and say that the maximum will be the maximum of the smaller range (i.e. MIN2(r0->_hi, r1->_hi))?
This is a great observation! Since bitwise-and can only remove bits, the largest possible value is the smaller of each range's hi value so I think it's correct to use the minimum here rather than the maximum. I didn't look into this case too deeply initially since I didn't find any bitwise-and nodes of two negative ranges in my investigation, but I think we should include it since it's a simple enough condition to check.
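A quick brute-force sketch of this property (illustrative only; `and_le_min_for_negatives` is a made-up helper): since both values are negative the sign bit survives the AND, and clearing any other bit only decreases the value, so x & y <= MIN2(x, y).

```cpp
#include <algorithm>

// Exhaustively checks x & y <= min(x, y) for all pairs drawn from two
// negative ranges, which justifies MIN2(r0->_hi, r1->_hi) as the upper
// bound when both input ranges are entirely negative.
bool and_le_min_for_negatives(int lo0, int hi0, int lo1, int hi1) {
  for (int x = lo0; x <= hi0; x++) {
    for (int y = lo1; y <= hi1; y++) {
      if ((x & y) > std::min(x, y)) {
        return false;
      }
    }
  }
  return true;
}
```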
Another point of view: you can decompose [lo, hi] into [lo, -1] v [0, hi]. Then [lo1, hi1] & [lo2, hi2] can be calculated as:

([lo1, -1] & [lo2, -1]) v ([lo1, -1] & [0, hi2]) v ([0, hi1] & [lo2, -1]) v ([0, hi1] & [0, hi2]) =
[lo, -1] v [0, hi2] v [0, hi1] v [0, min(hi1, hi2)] =
[lo, max(hi1, hi2)]

with lo being the lower bound you calculated right above.
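The decomposition can be sanity-checked by brute force over small ranges; a sketch (with `and_type_covers` a made-up helper), assuming lo = -8 is the leading-ones lower bound for lows -3 and -7 as worked out earlier in the thread:

```cpp
// Checks that every x & y with x in [lo1, hi1] and y in [lo2, hi2] lands
// in [lo, max(hi1, hi2)], the interval derived from the decomposition.
bool and_type_covers(int lo1, int hi1, int lo2, int hi2, int lo) {
  int hi = hi1 > hi2 ? hi1 : hi2;
  for (int x = lo1; x <= hi1; x++) {
    for (int y = lo2; y <= hi2; y++) {
      int r = x & y;
      if (r < lo || r > hi) {
        return false;
      }
    }
  }
  return true;
}
```

For [-3, 5] & [-7, 2] the interval [-8, max(5, 2)] covers every result, while tightening the lower bound to -7 does not (the pair (-3, -6) escapes it).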
Just to be clear I think MAX2 is the correct thing here.
Thank you for running testing, @chhagedorn! I think I didn't run into this because my device doesn't support AVX-512. Does the failure have an ideal node printout as well? I think that could help in diagnosing the issue. Thanks!
Sure, here is the log file for linux-x64-debug: test_failure.log
@jaskarth out of curiosity: could you by chance notice any measurable performance difference (e.g. for specific/ad-hoc benchmarks)?
@chhagedorn I've had a chance to investigate the IR test failure, and it seems it's because with the patch we can find a sharper type during IGVN than with the baseline. After parsing, the code shape inside the loop looks like this:

Since the IR output of the operation has changed, I've updated the IR checks, bringing it in line with the other operations in the test.
@dafedafe I added a microbenchmark based on the case I saw above, and got these results:

    Benchmark                                                   (COUNT)  (seed)  Mode  Cnt         Baseline            Patch  Improvement
    TypeVectorOperations.TypeVectorOperationsNonSuperWord.andZ      512       0  avgt    8  155.288 ± 1.175  188.844 ± 4.189     (+19.5%)
    TypeVectorOperations.TypeVectorOperationsNonSuperWord.andZ     2048       0  avgt    8  629.098 ± 7.489  732.558 ± 3.983     (+15.2%)
    TypeVectorOperations.TypeVectorOperationsSuperWord.andZ         512       0  avgt    8   22.917 ± 0.338   23.578 ± 1.003      (+2.8%)
    TypeVectorOperations.TypeVectorOperationsSuperWord.andZ        2048       0  avgt    8   35.683 ± 0.232   37.546 ± 1.063      (+5.1%)
    (all scores in ns/op)

In general, though, I've found that it's unfortunately pretty difficult to identify specific places where performance is improved: rather than improving nodes locally, this analysis strengthens other idealizations that use int types. By improving the type we might be able to find more operations that evaluate to constants, or prune out redundant comparisons, either directly or through another node that transforms the type further. I've been wanting to make our type analysis stronger, so that we can find more nontrivial optimizations without needing specialized idealization rules.
Thanks for the investigation and the update! That sounds reasonable. Let me submit some testing again for that test. |
The test is still failing, with
Hmm, that failure is quite peculiar, since now it seems we're failing because we're creating extra
Even if you don't have access to real hardware, you could try to emulate it with Intel SDE to see what's going on. Might be worth a shot. Nevertheless, I will have a look at your patch again in closer detail tomorrow or hopefully on Monday. |
Oh I hadn't heard of that tool before, I'll give it a try. Thank you for the link! |
chhagedorn
left a comment
Interesting findings! I like that you unified the integer and long case with a template. A few comments.
merykitty
left a comment
This is really nice. Given Christian's suggestion, what do you think about inferring the unsigned bounds and known bits from the signed bounds, do the calculation there and work back the signed bounds in the end?
Thanks for taking a look, @merykitty! With regard to unsigned and bitwise bounds, I was thinking that for this patch it might be better to keep everything in the signed domain to keep the patch simple, since AFAIK the analysis to go between the signed domain and the unsigned/bitwise domain can become quite complicated. I think it might be better to do that sort of analysis more generally in the type system, such as what you're doing with #17508. I was thinking this PR could be a sort of baseline for any future improvements using that system. Let me know what you think about this approach, or if you think I should explore unsigned and bitwise bounds in this patch. Thanks!
merykitty
left a comment
I agree, I think your math is correct.
dafedafe
left a comment
Thanks @jaskarth!
How will this mesh with the changes that @merykitty is doing with the KnownBits in the work of #17508, etc.?
I think there are still quite a few unaddressed review comments above, so I'll hold off on reviewing myself. Feel free to ping me if you need another review.
@eme64 I was thinking that this PR and the associated IR tests could serve as a baseline for future work in this area, such as #17508, since it just uses the signed bounds we already have to do its analysis. My goal was to make the changes as simple as possible, so that we can see some benefits short term while we work on the long-term solution for better type inference. Originally I just had the
I agree that we should keep this RFE simple, and we are just using things that we already have. So we could just go with the optimizations that you currently have (if you would like to apply the few simple improvement suggestions, you can already do that) and follow up with future RFEs to cover more cases.
@chhagedorn I've pushed a new version that should address all the comments from code review. I ended up handling the case with two negative ranges as well, because it didn't add too much code complexity, but I agree that we should just use the optimizations we have and not add more to this PR.
chhagedorn
left a comment
Thanks for the update, looks good to me! I'll give this another spin in our testing over the weekend (I will only be able to report back on Tuesday since Monday is a public holiday here).
Testing looked good!
Thanks a lot for the testing, and thank you everyone for the reviews!

/integrate
Going to push as commit 9243104.
Your commit was automatically rebased without conflicts.
Hi all,
I've written this patch which improves type calculation for bitwise-and operations. Previously, the only cases that were considered were when one of the inputs to the node was a positive constant. I've generalized this behavior, as well as added a case to better estimate the result for arbitrary ranges. Since these are very common patterns, this can help propagate more precise types throughout the ideal graph for a large number of methods, making other optimizations and analyses stronger. I was interested in where this patch improves types, so I ran CTW for java_base and java_base_2 and printed out the differences in this gist here. While I don't think it's particularly complicated, I've also added some discussion of the mathematics below, mostly because I thought it was interesting to work through :)

This patch passes tier1-3 testing on my Linux x64 machine. Thoughts and reviews would be very appreciated!
Reviewing
Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20066/head:pull/20066
$ git checkout pull/20066

Update a local copy of the PR:
$ git checkout pull/20066
$ git pull https://git.openjdk.org/jdk.git pull/20066/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 20066

View PR using the GUI difftool:
$ git pr show -t 20066

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20066.diff
Webrev
Link to Webrev Comment