
Conversation

@jaskarth
Member

@jaskarth jaskarth commented Jul 8, 2024

Hi all,
I've written this patch which improves type calculation for bitwise-and nodes. Previously, the only case that was considered was when one of the inputs to the node was a positive constant. I've generalized this behavior, as well as added a case to better estimate the result for arbitrary ranges. Since these are very common patterns to see, this can help propagate more precise types throughout the ideal graph for a large number of methods, making other optimizations and analyses stronger. I was interested in where this patch improves types, so I ran CTW for java_base and java_base_2 and printed out the differences in this gist here. While I don't think it's particularly complicated, I've also added some discussion of the mathematics below, mostly because I thought it was interesting to work through :)

This patch passes tier1-3 testing on my linux x64 machine. Thoughts and reviews would be very appreciated!


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8335444: Generalize implementation of AndNode mul_ring (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20066/head:pull/20066
$ git checkout pull/20066

Update a local copy of the PR:
$ git checkout pull/20066
$ git pull https://git.openjdk.org/jdk.git pull/20066/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 20066

View PR using the GUI difftool:
$ git pr show -t 20066

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20066.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Jul 8, 2024

👋 Welcome back jkarthikeyan! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jul 8, 2024

@jaskarth This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8335444: Generalize implementation of AndNode mul_ring

Reviewed-by: chagedorn, qamai, dfenacci

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 736 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 8, 2024
@openjdk

openjdk bot commented Jul 8, 2024

@jaskarth The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@mlbridge

mlbridge bot commented Jul 8, 2024

Webrevs

@jaskarth
Member Author

jaskarth commented Jul 8, 2024

The common cases I saw in java.base and other real world programs that aren't handled currently are with positive ranges, negative constants, and ranges that cross 0. For positive ranges it's possible to generalize the existing case with a positive constant, and for negative constants and ranges that cross 0 I came up with an estimation.

For a positive constant $C$, we can say that for any integer type $I$ the result of the operation $I \mathbin{\&} C$ will be bounded by the range $[0, C]$. This can then be generalized to positive ranges, as for any constant $c_{1}$ such that $0 \le c_{1} \le C$ the result of $I \mathbin{\&} c_{1}$ will also be bounded by $[0, C]$. Therefore, we can say that $I \mathbin{\&} [c_{1}, C]$ for any $0 \le c_{1} \le C$ will be in the range $[0, C]$. This case can be further generalized if we know both ranges are positive. For the operation $[L_{1}, U_{1}] \mathbin{\&} [L_{2}, U_{2}]$ where $L_{1} \ge 0$ and $L_{2} \ge 0$ the result will be bounded by $[0, \text{min}(U_{1}, U_{2})]$. Since bitwise-and can only clear bits, the result can never exceed either operand, so the maximum possible value is bounded by the smaller of the two upper bounds.

In the case of negative constants and ranges that cross 0, it's more difficult to derive a bound numerically due to two's complement, as the minimum value of the result can be lower than both ranges' minimum values. Since a negative two's complement number always begins with a run of 1s, and that run grows shorter as the number decreases, the lower bound is the value formed by keeping the leading 1s of the smaller of the two ranges' minimums and clearing the remaining low-order bits. For the upper bound, the result is bounded by the higher of the two ranges' maximums, as bitwise-and with a negative value can return a result as high as the positive operand.

While these methods produce much more precise types for many cases, they are by no means optimal. For that I think proper bitwise propagation is needed, such as with #17508. However, I think this is still worthwhile to get in since it's an easy win for most cases and is a pretty simple patch.

@chhagedorn
Member

Not a review but I quickly ran it through our testing and the following test fails with -XX:UseAVX=3 on linux-x64-debug and windows-x64-debug and without any flags on windows-x64-debug which seems to be related to your patch:

Test: compiler/vectorization/runner/BasicBooleanOpTest.java

Output:

Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "public boolean[] compiler.vectorization.runner.BasicBooleanOpTest.vectorAnd()" - [Failed IR rules: 1]:
   * @IR rule 3: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#MACRO_LOGIC_V#_", ">0"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={"avx512f", "true"}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "MacroLogicV"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!

int shift_bits = sel_val == 0 ? std::numeric_limits<NativeType>::digits : count_leading_zeros(sel_val) - 1;

// Since the result of bitwise-and can be as high as the largest value of each range, the result should find the maximum.
return IntegerType::make(std::numeric_limits<NativeType>::min() >> shift_bits, MAX2(r0->_hi, r1->_hi), widen);
Contributor

I was just wondering if it would be OK to use (r0->_lo & r1->_lo) as the minimum value instead of finding the number of leading 1s and shifting. It seems to be a bit more constrained and compact (and seemingly correct). What do you think?

Member Author

@jaskarth jaskarth Jul 31, 2024

I actually did experiment with this before coming up with the current approach, but I found that there are cases where it produces incorrect bounds. With a small example where r0 = [-3, -1] and r1 = [-7, -1], using r0->_lo & r1->_lo results in -3 & -7, which equals -7. However, in the case where the value in r0 is -3 and r1 is -6, we can get an out of bounds result as -3 & -6 equals -8. Because of that I thought the best way to fix this was to find the common leading 1s. This approach does lead to us losing some information in the lower-order bits but I thought it was an acceptable compromise since the code to handle finding common bits across a range of integers becomes quite a bit more complicated. I hope this clarifies!

Contributor

You're totally right! It is not even related to the LSB (-14 & -6 would have the same problem with -12). Finding the leading 1s is the right solution. Thanks a lot for the clarification!

Member

Can the upper limit be improved similar to what you added for the "both ranges are positive" case if we know that both ranges are negative?
In the positive case, we have values from:

011...1
000...0

while in the negative case, we have values from:

111...1
100...0

It suggests that we can then use the same argument as for the positive case and say that the maximum will be the maximum of the smaller range (i.e. MIN2(r0->_hi, r1->_hi))?

Member Author

This is a great observation! Since bitwise-and can only remove bits, the largest possible value is the smaller of each range's hi value so I think it's correct to use the minimum here rather than the maximum. I didn't look into this case too deeply initially since I didn't find any bitwise-and nodes of two negative ranges in my investigation, but I think we should include it since it's a simple enough condition to check.

Member

Another point of view, you can decompose [lo, hi] into [lo, -1] v [0, hi]. Then [lo1, hi1] & [lo2, hi2] can be calculated as:

([lo1, -1] & [lo2, -1]) v ([lo1, -1] & [0, hi2]) v ([0, hi1] & [lo2, -1]) v ([0, hi1] & [0, hi2]) =
[lo, -1] v [0, hi2] v [0, hi1] v [0, min(hi1, hi2)] =
[lo, max(hi1, hi2)]

with lo being the lower bound you calculated right above.

Member

Just to be clear I think MAX2 is the correct thing here.

@jaskarth
Member Author

jaskarth commented Jul 31, 2024

Thank you for running testing @chhagedorn! I think I didn't run into this because my device doesn't support AVX-512. Does the failure have an ideal node printout as well? I think that could help in diagnosing the issue. Thanks!

@chhagedorn
Member

> Thank you for running testing @chhagedorn! I think I didn't run into this because my device doesn't support AVX-512. Does the failure have an ideal node printout as well? I think that could help in diagnosing the issue. Thanks!

Sure, here is the log file for linux-x64-debug: test_failure.log

@dafedafe
Contributor

dafedafe commented Aug 5, 2024

@jaskarth out of curiosity: could you by chance notice any measurable performance difference (e.g. for specific/ad-hoc benchmarks)?

@jaskarth
Member Author

jaskarth commented Aug 6, 2024

@chhagedorn I've had a chance to investigate the IR test failure, and it seems it's because with the patch we can find a sharper type during IGVN than with the baseline. After parsing, the code shape inside the loop looks like this: StoreB[idx] = (a & b) & 1;. Since the StoreB is storing a boolean type, we perform a bitwise-and with 1 to ensure the boolean value stays in bounds. With the patch, the type of (a & b) is bool, so the & 1 is removed in AndINode::Identity. In the baseline, the type of (a & b) is int, so the redundant AndNode isn't removed. Later on, the logic chain is transformed into a MacroLogicVNode.

Since the IR output of the operation has changed I've updated the IR checks, bringing it in line with the other operations in the test.

@jaskarth
Member Author

jaskarth commented Aug 6, 2024

@dafedafe I added a microbenchmark based on the case I saw above, and got these results:

Benchmark                                                   (COUNT)  (seed)  Mode  Cnt  Baseline Score ± Error  Units   Patch Score ± Error  Units  Improvement
TypeVectorOperations.TypeVectorOperationsNonSuperWord.andZ      512       0  avgt    8  155.288 ± 1.175  ns/op   188.844 ± 4.189  ns/op  (+ 19.5%)
TypeVectorOperations.TypeVectorOperationsNonSuperWord.andZ     2048       0  avgt    8  629.098 ± 7.489  ns/op   732.558 ± 3.983  ns/op  (+ 15.2%)
TypeVectorOperations.TypeVectorOperationsSuperWord.andZ         512       0  avgt    8   22.917 ± 0.338  ns/op    23.578 ± 1.003  ns/op  (+ 2.8%)
TypeVectorOperations.TypeVectorOperationsSuperWord.andZ        2048       0  avgt    8   35.683 ± 0.232  ns/op    37.546 ± 1.063  ns/op  (+ 5.1%)

In general though I've found that unfortunately it's pretty difficult to identify specific places where performance is improved, since rather than improving nodes locally this analysis strengthens other idealizations that use int types. By improving the type we might be able to find more operations that evaluate to constants or prune out redundant comparisons, either directly or through another node that transforms the type further. I've been wanting to make our type analysis stronger, so that we can find more nontrivial optimizations without needing specialized idealization rules.

@chhagedorn
Member

> @chhagedorn I've had a chance to investigate the IR test failure, and it seems it's because with the patch we can find a sharper type during IGVN than with the baseline. After parsing, the code shape inside the loop looks like this: StoreB[idx] = (a & b) & 1; As the StoreB is storing a boolean type, we perform a bitwise-and of 1 to ensure the boolean value is in bounds. With the patch, the type of (a & b) is bool so the & 1 is removed in AndINode::Identity. In the baseline, the type of (a & b) is int, so the redundant AndNode isn't removed. Later on, the logic chain is transformed into a MacroLogicVNode.
>
> Since the IR output of the operation has changed I've updated the IR checks, bringing it in line with the other operations in the test.

Thanks for the investigation and the update! That sounds reasonable. Let me submit some testing again for that test.

@chhagedorn
Member

chhagedorn commented Aug 6, 2024

The test is still failing, with UseAVX=3 and without. It seems that MacroLogicVNode is present in both logs (see used flags in command line dump):
normal.log
avx3.log

@jaskarth
Member Author

jaskarth commented Aug 7, 2024

Hmm, that failure is quite peculiar, since now it seems we're failing because we're creating extra MacroLogicV nodes rather than failing because we're not creating any. Unfortunately, I didn't have much luck debugging the root cause since I don't have access to AVX-512 hardware. I've changed the IR check to a phase before MacroLogicV nodes are created, which should hopefully fix the failure.

@chhagedorn
Member

> Hmm, that failure is quite peculiar, since now it seems we're failing because we're creating extra MacroLogicV nodes rather than failing because we're not creating any. Unfortunately, I didn't have much luck debugging the root cause since I don't have access to AVX-512 hardware. I've changed the IR check to a phase before MacroLogicV nodes are created, which should hopefully fix the failure.

Even if you don't have access to real hardware, you could try to emulate it with Intel SDE to see what's going on. Might be worth a shot. Nevertheless, I will have a look at your patch again in closer detail tomorrow or hopefully on Monday.

@jaskarth
Member Author

> you could try to emulate it with Intel SDE to see what's going on

Oh I hadn't heard of that tool before, I'll give it a try. Thank you for the link!

Member

@chhagedorn chhagedorn left a comment

Interesting findings! I like that you unified the integer and long case with a template. A few comments.


Member

@merykitty merykitty left a comment

This is really nice. Given Christian's suggestion, what do you think about inferring the unsigned bounds and known bits from the signed bounds, do the calculation there and work back the signed bounds in the end?

@jaskarth
Member Author

Thanks for taking a look, @merykitty! With regard to unsigned and bitwise bounds I was thinking for this patch it might be better to keep everything in the signed domain to keep the patch simple, since AFAIK the analysis to go between the signed domain and the unsigned/bitwise domain can become quite complicated. I think it might be better to do that sort of analysis more generally in the type system, such as what you're doing with #17508. I was thinking this PR could be a sort of baseline for any future improvements using that system. Let me know what you think about this approach, or if you think I should explore unsigned and bitwise bounds in this patch. Thanks!

Member

@merykitty merykitty left a comment

I agree, I think your math is correct.

Contributor

@dafedafe dafedafe left a comment

Thanks @jaskarth!

@eme64
Contributor

eme64 commented Sep 3, 2024

How will this mesh with the changes that @merykitty is doing with the KnownBits, in the work of #17508 etc?

@eme64
Contributor

eme64 commented Sep 3, 2024

I think there are still quite a few unaddressed review comments above, so I'll hold off with reviewing myself. Feel free to ping me if you need another review.

@jaskarth
Member Author

jaskarth commented Sep 3, 2024

@eme64 I was thinking that this PR and the associated IR tests could serve as a baseline for future work in this area, such as #17508, since it just uses the signed bounds we already have to do its analysis. My goal was to make the changes as simple as possible, so that we can see some benefits short term while we work on the long-term solution for better type inference. Originally I just had the [0..C] case since I saw that it showed up a lot in real-world code, but I kept finding generalizations and eventually it snowballed 😅

@chhagedorn
Member

I agree that we should keep this RFE simple. And we are just using things that we already have. So, we could go with the optimizations that you currently have (if you'd like to apply the few simple improvement suggestions, you can already do that) and follow up with future RFEs to cover more cases.

@jaskarth
Member Author

jaskarth commented Sep 6, 2024

@chhagedorn I've pushed a new version that should address all the comments from code review. I ended up handling the case with two negative ranges as well because it didn't add too much code complexity, but I agree that we should just use the optimizations we have and not add more to this PR.

Member

@chhagedorn chhagedorn left a comment

Thanks for the update, looks good to me! I'll give this another spin in our testing over the weekend (I will only be able to report back on Tuesday since Monday is a public holiday here).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Sep 6, 2024
@chhagedorn
Member

Testing looked good!

@jaskarth
Member Author

Thanks a lot for the testing, and thank you everyone for the reviews!

/integrate

@openjdk

openjdk bot commented Sep 10, 2024

Going to push as commit 9243104.
Since your change was applied there have been 770 commits pushed to the master branch:

  • 3352522: 8338894: Deprecate jhsdb debugd for removal
  • be0dca0: 8339698: x86 unused andw/orw/xorw/addw encoding could be removed
  • 64a79d8: 8335625: Update Javadoc for GetCpuLoad
  • c246ede: 8339799: Reduce work done in j.l.invoke bytecode generators
  • 38441b3: 8339677: [vectorapi] YYYXXXVector::withLaneHelper and laneHelper should use Double::doubleToRawLongBits/Float::floatToRawIntBits
  • fb51c1e: 8339837: Remove unused BootstrapMethodsInvokers.isLambdaMetafactoryCondyBSM
  • 4d597de: 8338930: StringConcatFactory hardCoded string concatenation strategy
  • ad10493: 8338526: Don't store abstract and interface Klasses in class metaspace
  • 0d8e52b: 8339800: Prefer invokeBasic in BootstrapMethodInvokers
  • 64de781: 8339587: runtime/reflect/ReflectOutOfMemoryError.java fails with "bootstrap method initialization exception"
  • ... and 760 more: https://git.openjdk.org/jdk/compare/bb18498d71dddf49db9bdfac886aed9ae123651d...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Sep 10, 2024
@openjdk openjdk bot closed this Sep 10, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Sep 10, 2024
@openjdk

openjdk bot commented Sep 10, 2024

@jaskarth Pushed as commit 9243104.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
