
Conversation

@jaskarth
Member

@jaskarth jaskarth commented Jul 8, 2024

Hi all,
I've written this patch which improves type calculation for bitwise-and nodes. Previously, the only case that was considered was when one of the inputs to the node was a positive constant. I've generalized this behavior, as well as added a case to better estimate the result for arbitrary ranges. Since these are very common patterns to see, this can help propagate more precise types throughout the ideal graph for a large number of methods, making other optimizations and analyses stronger. I was interested in where this patch improves types, so I ran CTW for java_base and java_base_2 and printed out the differences in this gist here. While I don't think it's particularly complicated, I've also added some discussion of the mathematics below, mostly because I thought it was interesting to work through :)

This patch passes tier1-3 testing on my linux x64 machine. Thoughts and reviews would be very appreciated!


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8335444: Generalize implementation of AndNode mul_ring (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20066/head:pull/20066
$ git checkout pull/20066

Update a local copy of the PR:
$ git checkout pull/20066
$ git pull https://git.openjdk.org/jdk.git pull/20066/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 20066

View PR using the GUI difftool:
$ git pr show -t 20066

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20066.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Jul 8, 2024

👋 Welcome back jkarthikeyan! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jul 8, 2024

@jaskarth This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8335444: Generalize implementation of AndNode mul_ring

Reviewed-by: chagedorn, qamai, dfenacci

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 736 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 8, 2024
@openjdk

openjdk bot commented Jul 8, 2024

@jaskarth The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@mlbridge

mlbridge bot commented Jul 8, 2024

Webrevs

@jaskarth
Member Author

jaskarth commented Jul 8, 2024

The common cases I saw in java.base and other real world programs that aren't handled currently are with positive ranges, negative constants, and ranges that cross 0. For positive ranges it's possible to generalize the existing case with a positive constant, and for negative constants and ranges that cross 0 I came up with an estimation.

For a positive constant $C$, we can say that for any integer type $I$ the result of the operation $I \mathbin{\&} C$ will be bounded by the range $[0, C]$. This can then be generalized to positive ranges, as for any constant $c_{1}$ such that $0 \le c_{1} \le C$ the result of $I \mathbin{\&} c_{1}$ will also be bounded by $[0, C]$. Therefore, we can say that $I \mathbin{\&} [c_{1}, C]$ for any $0 \le c_{1} \le C$ will be in the range $[0, C]$. This case can be further generalized if we know both ranges are positive. For the operation $[L_{1}, U_{1}] \mathbin{\&} [L_{2}, U_{2}]$ where $L_{1} \ge 0$ and $L_{2} \ge 0$ the result will be bounded by $[0, \text{min}(U_{1}, U_{2})]$. Since bitwise-and can only clear bits, the result can never exceed either operand, so the maximum possible value is bounded by the smaller of the two upper bounds.

In the case of negative constants and ranges that cross 0, it's more difficult to derive a bound numerically due to two's complement, as the minimum value of the result can be lower than both ranges' minimum values. Since a negative two's complement number always begins with a run of 1s, and that run grows shorter as the number decreases, the lower bound is the value formed by keeping the leading 1s of the smaller of the two ranges' minimums and clearing the remaining low-order bits. For the upper bound, the result is bounded by the higher of the two ranges' maximums, as bitwise-and with a negative value can return a result as high as the positive operand.

While these methods produce much more precise types for many cases, they are by no means optimal. For that I think proper bitwise propagation is needed, such as with #17508. However, I think this is still worthwhile to get in since it's an easy win for most cases and is a pretty simple patch.

@chhagedorn
Member

Not a review but I quickly ran it through our testing and the following test fails with -XX:UseAVX=3 on linux-x64-debug and windows-x64-debug and without any flags on windows-x64-debug which seems to be related to your patch:

Test: compiler/vectorization/runner/BasicBooleanOpTest.java

Output:

Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "public boolean[] compiler.vectorization.runner.BasicBooleanOpTest.vectorAnd()" - [Failed IR rules: 1]:
   * @IR rule 3: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#MACRO_LOGIC_V#_", ">0"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={"avx512f", "true"}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "MacroLogicV"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!

int shift_bits = sel_val == 0 ? std::numeric_limits<NativeType>::digits : count_leading_zeros(sel_val) - 1;

// Since the result of bitwise-and can be as high as the largest value of each range, the result should find the maximum.
return IntegerType::make(std::numeric_limits<NativeType>::min() >> shift_bits, MAX2(r0->_hi, r1->_hi), widen);
Contributor

I was just wondering if it would be OK to use (r0->_lo & r1->_lo) as the minimum value instead of finding the number of leading 1s and shifting. It seems to be a bit more constrained and compact (and seemingly correct). What do you think?

Member Author

@jaskarth jaskarth Jul 31, 2024

I actually did experiment with this before coming up with the current approach, but I found that there are cases where it produces incorrect bounds. With a small example where r0 = [-3, -1] and r1 = [-7, -1], using r0->_lo & r1->_lo results in -3 & -7, which equals -7. However, in the case where the value in r0 is -3 and r1 is -6, we can get an out of bounds result as -3 & -6 equals -8. Because of that I thought the best way to fix this was to find the common leading 1s. This approach does lead to us losing some information in the lower-order bits but I thought it was an acceptable compromise since the code to handle finding common bits across a range of integers becomes quite a bit more complicated. I hope this clarifies!

Contributor

You're totally right! It is not even related to the LSB (-14 & -6 would have the same problem with -12). Finding the leading 1s is the right solution. Thanks a lot for the clarification!

Member

Can the upper limit be improved similar to what you added for the "both ranges are positive" case if we know that both ranges are negative?
In the positive case, we have values from:

011...1
000...0

while in the negative case, we have values from:

111...1
100...0

It suggests that we can then use the same argument as for the positive case and say that the maximum will be the maximum of the smaller range (i.e. MIN2(r0->_hi, r1->_hi))?

Member Author

This is a great observation! Since bitwise-and can only remove bits, the largest possible value is the smaller of each range's hi value so I think it's correct to use the minimum here rather than the maximum. I didn't look into this case too deeply initially since I didn't find any bitwise-and nodes of two negative ranges in my investigation, but I think we should include it since it's a simple enough condition to check.

Member

Another point of view, you can decompose [lo, hi] into [lo, -1] v [0, hi]. Then [lo1, hi1] & [lo2, hi2] can be calculated as:

([lo1, -1] & [lo2, -1]) v ([lo1, -1] & [0, hi2]) v ([0, hi1] & [lo2, -1]) v ([0, hi1] & [0, hi2]) =
[lo, -1] v [0, hi2] v [0, hi1] v [0, min(hi1, hi2)] =
[lo, max(hi1, hi2)]

with lo being the lower bound you calculated right above.

Member

Just to be clear I think MAX2 is the correct thing here.

@jaskarth
Member Author

jaskarth commented Jul 31, 2024

Thank you for running testing @chhagedorn! I think I didn't run into this because my device doesn't support AVX-512. Does the failure have an ideal node printout as well? I think that could help in diagnosing the issue. Thanks!

@chhagedorn
Member

> Thank you for running testing @chhagedorn! I think I didn't run into this because my device doesn't support AVX-512. Does the failure have an ideal node printout as well? I think that could help in diagnosing the issue. Thanks!

Sure, here is the log file for linux-x64-debug: test_failure.log

@dafedafe
Contributor

dafedafe commented Aug 5, 2024

@jaskarth out of curiosity: could you by chance notice any measurable performance difference (e.g. for specific/ad-hoc benchmarks)?

@jaskarth
Member Author

jaskarth commented Aug 6, 2024

@chhagedorn I've had a chance to investigate the IR test failure, and it seems it's because with the patch we can find a sharper type during IGVN than with the baseline. After parsing, the code shape inside the loop looks like this: StoreB[idx] = (a & b) & 1;. Since the StoreB is storing a boolean type, we perform a bitwise-and with 1 to ensure the boolean value stays in bounds. With the patch, the type of (a & b) is bool, so the & 1 is removed in AndINode::Identity. In the baseline, the type of (a & b) is int, so the redundant AndNode isn't removed. Later on, the logic chain is transformed into a MacroLogicVNode.

Since the IR output of the operation has changed I've updated the IR checks, bringing it in line with the other operations in the test.

@jaskarth
Member Author

jaskarth commented Aug 6, 2024

@dafedafe I added a microbenchmark based on the case I saw above, and got these results:

Benchmark                                                   (COUNT)  (seed)  Mode  Cnt  Baseline Score ± Error  Units   Patch Score ± Error  Units  Improvement
TypeVectorOperations.TypeVectorOperationsNonSuperWord.andZ      512       0  avgt    8  155.288 ± 1.175  ns/op   188.844 ± 4.189  ns/op  (+ 19.5%)
TypeVectorOperations.TypeVectorOperationsNonSuperWord.andZ     2048       0  avgt    8  629.098 ± 7.489  ns/op   732.558 ± 3.983  ns/op  (+ 15.2%)
TypeVectorOperations.TypeVectorOperationsSuperWord.andZ         512       0  avgt    8   22.917 ± 0.338  ns/op    23.578 ± 1.003  ns/op  (+ 2.8%)
TypeVectorOperations.TypeVectorOperationsSuperWord.andZ        2048       0  avgt    8   35.683 ± 0.232  ns/op    37.546 ± 1.063  ns/op  (+ 5.1%)

In general though I've found that unfortunately it's pretty difficult to identify specific places where performance is improved, since rather than improving nodes locally this analysis strengthens other idealizations that use int types. By improving the type we might be able to find more operations that evaluate to constants or prune out redundant comparisons, either directly or through another node that transforms the type further. I've been wanting to make our type analysis stronger, so that we can find more nontrivial optimizations without needing specialized idealization rules.

@chhagedorn
Member

> @chhagedorn I've had a chance to investigate the IR test failure, and it seems it's because with the patch we can find a sharper type during IGVN than with the baseline. After parsing, the code shape inside the loop looks like this: StoreB[idx] = (a & b) & 1; As the StoreB is storing a boolean type, we perform a bitwise-and of 1 to ensure the boolean value is in bounds. With the patch, the type of (a & b) is bool so the & 1 is removed in AndINode::Identity. In the baseline, the type of (a & b) is int, so the redundant AndNode isn't removed. Later on, the logic chain is transformed into a MacroLogicVNode.
>
> Since the IR output of the operation has changed I've updated the IR checks, bringing it in line with the other operations in the test.

Thanks for the investigation and the update! That sounds reasonable. Let me submit some testing again for that test.

@chhagedorn
Member

chhagedorn commented Aug 6, 2024

The test is still failing, with UseAVX=3 and without. It seems that MacroLogicVNode is present in both logs (see used flags in command line dump):
normal.log
avx3.log

@jaskarth
Member Author

jaskarth commented Aug 7, 2024

Hmm, that failure is quite peculiar, since now it seems we're failing because we're creating extra MacroLogicV nodes rather than failing because we're not creating any. Unfortunately, I didn't have much luck debugging the root cause since I don't have access to AVX-512 hardware. I've changed the IR check to a phase before MacroLogicV nodes are created, which should hopefully fix the failure.

@chhagedorn
Member

> Hmm, that failure is quite peculiar, since now it seems we're failing because we're creating extra MacroLogicV nodes rather than failing because we're not creating any. Unfortunately, I didn't have much luck debugging the root cause since I don't have access to AVX-512 hardware. I've changed the IR check to a phase before MacroLogicV nodes are created, which should hopefully fix the failure.

Even if you don't have access to real hardware, you could try to emulate it with Intel SDE to see what's going on. Might be worth a shot. Nevertheless, I will have a look at your patch again in closer detail tomorrow or hopefully on Monday.

@jaskarth
Member Author

> you could try to emulate it with Intel SDE to see what's going on

Oh I hadn't heard of that tool before, I'll give it a try. Thank you for the link!

Member

@chhagedorn chhagedorn left a comment

Interesting findings! I like that you unified the integer and long case with a template. A few comments.


Member

@merykitty merykitty left a comment

This is really nice. Given Christian's suggestion, what do you think about inferring the unsigned bounds and known bits from the signed bounds, do the calculation there and work back the signed bounds in the end?

@jaskarth
Member Author

Thanks for taking a look, @merykitty! With regard to unsigned and bitwise bounds I was thinking for this patch it might be better to keep everything in the signed domain to keep the patch simple, since AFAIK the analysis to go between the signed domain and the unsigned/bitwise domain can become quite complicated. I think it might be better to do that sort of analysis more generally in the type system, such as what you're doing with #17508. I was thinking this PR could be a sort of baseline for any future improvements using that system. Let me know what you think about this approach, or if you think I should explore unsigned and bitwise bounds in this patch. Thanks!

Member

@merykitty merykitty left a comment

I agree, I think your math is correct.

Contributor

@dafedafe dafedafe left a comment

Thanks @jaskarth!

@eme64
Contributor

eme64 commented Sep 3, 2024

How will this mesh with the changes that @merykitty is doing with the KnownBits, in the work of #17508 etc?

@eme64
Contributor

eme64 commented Sep 3, 2024

I think there are still quite a few unaddressed review comments above, so I'll hold off with reviewing myself. Feel free to ping me if you need another review.

@jaskarth
Member Author

jaskarth commented Sep 3, 2024

@eme64 I was thinking that this PR and the associated IR tests could serve as a baseline for future work in this area, such as #17508, since it just uses the signed bounds we already have to do its analysis. My goal was to make the changes as simple as possible, so that we can see some benefits short term while we work on the long-term solution for better type inference. Originally I just had the [0..C] case since I saw that it showed up a lot in real-world code, but I kept finding generalizations and eventually it snowballed 😅

@chhagedorn
Member

I agree that we should keep this RFE simple. And we are just using things that we already have. So, we could go with the optimizations that you currently have (if you'd like to apply the few simple improvement suggestions, you can already do that) and follow up with future RFEs to cover more cases.

@jaskarth
Member Author

jaskarth commented Sep 6, 2024

@chhagedorn I've pushed a new version that should address all the comments from code review. I ended up handling the case with two negative ranges as well because it didn't add too much code complexity, but I agree that we should just use the optimizations we have and not add more to this PR.

Member

@chhagedorn chhagedorn left a comment

Thanks for the update, looks good to me! I'll give this another spin in our testing over the weekend (I will only be able to report back on Tuesday since Monday is a public holiday here).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Sep 6, 2024
@chhagedorn
Member

Testing looked good!

@jaskarth
Member Author

Thanks a lot for the testing, and thank you everyone for the reviews!

/integrate

@openjdk

openjdk bot commented Sep 10, 2024

Going to push as commit 9243104.
Since your change was applied there have been 770 commits pushed to the master branch:

  • 3352522: 8338894: Deprecate jhsdb debugd for removal
  • be0dca0: 8339698: x86 unused andw/orw/xorw/addw encoding could be removed
  • 64a79d8: 8335625: Update Javadoc for GetCpuLoad
  • c246ede: 8339799: Reduce work done in j.l.invoke bytecode generators
  • 38441b3: 8339677: [vectorapi] YYYXXXVector::withLaneHelper and laneHelper should use Double::doubleToRawLongBits/Float::floatToRawIntBits
  • fb51c1e: 8339837: Remove unused BootstrapMethodsInvokers.isLambdaMetafactoryCondyBSM
  • 4d597de: 8338930: StringConcatFactory hardCoded string concatenation strategy
  • ad10493: 8338526: Don't store abstract and interface Klasses in class metaspace
  • 0d8e52b: 8339800: Prefer invokeBasic in BootstrapMethodInvokers
  • 64de781: 8339587: runtime/reflect/ReflectOutOfMemoryError.java fails with "bootstrap method initialization exception"
  • ... and 760 more: https://git.openjdk.org/jdk/compare/bb18498d71dddf49db9bdfac886aed9ae123651d...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Sep 10, 2024
@openjdk openjdk bot closed this Sep 10, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Sep 10, 2024
@openjdk

openjdk bot commented Sep 10, 2024

@jaskarth Pushed as commit 9243104.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
