Skip to content

Conversation

@MaxXSoft
Copy link
Member

@MaxXSoft MaxXSoft commented Jun 23, 2025

The result of count leading/trailing zeros is always non-negative, and the maximum value is integer type's size in bits. In previous versions, when C2 can not know the operand value of a CLZ/CTZ node at compile time, it will generate a full-width integer type for its result. This can significantly affect the efficiency of code in some cases.

This patch makes the type of CLZ/CTZ nodes more precise, to make C2 generate better code. For example, the following implementation runs ~115% faster on x86-64 with this patch:

public static int numberOfNibbles(int i) {
  int mag = Integer.SIZE - Integer.numberOfLeadingZeros(i);
  return Math.max((mag + 3) / 4, 1);
}

Testing: tier1, IR test


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8360192: C2: Make the type of count leading/trailing zero nodes more precise (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/25928/head:pull/25928
$ git checkout pull/25928

Update a local copy of the PR:
$ git checkout pull/25928
$ git pull https://git.openjdk.org/jdk.git pull/25928/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 25928

View PR using the GUI difftool:
$ git pr show -t 25928

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/25928.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Jun 23, 2025

👋 Welcome back qxing! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Jun 23, 2025

@MaxXSoft This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8360192: C2: Make the type of count leading/trailing zero nodes more precise

Reviewed-by: qamai, epeter, jbhateja

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 333 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@jaskarth, @merykitty, @jatin-bhateja, @eme64) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the rfr Pull request is ready for review label Jun 23, 2025
@openjdk
Copy link

openjdk bot commented Jun 23, 2025

@MaxXSoft The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@mlbridge
Copy link

mlbridge bot commented Jun 23, 2025

@merykitty
Copy link
Member

A stricter bound would be TypeInt::make(~t._bits._zeros, t._bits._ones, t._widen)

@MaxXSoft
Copy link
Member Author

A stricter bound would be TypeInt::make(~t._bits._zeros, t._bits._ones, t._widen)

@merykitty Thanks for your review, did you mean TypeInt::make(clz(~t._bits._zeros), clz(t._bits._ones), t._widen)?

@merykitty
Copy link
Member

@MaxXSoft Yes you are right, my mistake

Comment on lines 29 to 46
#include "utilities/count_leading_zeros.hpp"
#include "utilities/count_trailing_zeros.hpp"

static int count_leading_zeros_int(jint i) {
return i == 0 ? BitsPerInt : count_leading_zeros(i);
}

static int count_leading_zeros_long(jlong l) {
return l == 0 ? BitsPerLong : count_leading_zeros(l);
}

static int count_trailing_zeros_int(jint i) {
return i == 0 ? BitsPerInt : count_trailing_zeros(i);
}

static int count_trailing_zeros_long(jlong l) {
return l == 0 ? BitsPerLong : count_trailing_zeros(l);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why you need this?
Why is count_trailing_zeros and count_leading_zeros not enough, when you cast at the use-site?

Copy link
Member

@merykitty merykitty Jun 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because our implementation does not accept 0 as an input. I suggest doing this at count_leading_zeros, it makes more sense and also aligns our behaviour with the well-known countr_zero and countl_zero

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why you need this? Why is count_trailing_zeros and count_leading_zeros not enough, when you cast at the use-site?

@eme64 The explanation of @merykitty is right, the implementation of count_leading_zeros and count_trailing_zeros reject zero as the input.

Perhaps we could open another PR to add zero support for these functions, since it's less relevant to this node type change and might require other changes to the code that calls them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In src/hotspot/share/utilities/count_leading_zeros.hpp, it says that 0 behavior is undefined. Ok... but why do we do that? Is that a performance optimization ? If yes, is it really worth it? If there is no good reason not to handle 0, we should just handle it.

We have some tests in test/hotspot/gtest/utilities/test_count_leading_zeros.cpp.

It would be interesting to quickly check if any use of these methods could ever encounter zero, and then hit the assert. I would not be surprised if we found a bug here.

I think this would be a worth while cleanup task. I would prefer if we clean things up now, and don't just let more special handling code get integrated.

@Test
@IR(failOn = IRNode.COUNT_LEADING_ZEROS_I)
public boolean clzCompareInt() {
return Integer.numberOfLeadingZeros(i) < 0 || Integer.numberOfLeadingZeros(i) > 32;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You limited the type to check the ones/zeroes of the input value, count_leading_zeros_int(~ti->_bits._zeros), count_leading_zeros_int(ti->_bits._ones),. I think we need test cases to cover those too, given they are more complex and thus more error prone.

Copy link
Member Author

@MaxXSoft MaxXSoft Jun 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea how the KnownBits works. I tried masking i with i & 0x00ffffff | 0x0000ffff, and evaluating CLZ on it. But C2 still seems to return an integer type with a trivial KnownBits, instead of KnownBits with _zeros=0xff000000 and _ones=0x0000ffff.

I also checked the PR which introduced KnownBits (#17508). @merykitty said "there is no node taking advantage of the additional information yet", and there's no IR tests in that PR. So it's probably impossible to write a test for this at the moment?

@merykitty Could you provide more help on this please? Thanks.

@bridgekeeper
Copy link

bridgekeeper bot commented Jul 22, 2025

@MaxXSoft This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@MaxXSoft
Copy link
Member Author

Hi all,

This patch has now passed all GHA tests and is ready for further reviews.

If there are any other suggestions for this PR, please let me know.

Thanks!

Copy link
Member

@jaskarth jaskarth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice! I just have a comment on the unit test.

Comment on lines 83 to 89
const TypeLong* tl = t->is_long();
if (tl->is_con()) {
return TypeInt::make(count_leading_zeros_long(tl->get_con()));
}
return TypeInt::make(count_leading_zeros_long(~tl->_bits._zeros),
count_leading_zeros_long(tl->_bits._ones),
tl->_widen);
Copy link
Member

@jatin-bhateja jatin-bhateja Aug 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @MaxXSoft, Taking the liberty to add some comments related to the proof of the above assumptions, hope you won't mind :-)

Z3 proof for [KnownBits.ONES, ~KnownBits.ZEROS] >= [LB, UB] where >= is a superset relation.


from z3 import *
from functools import reduce

# 64-bit symbolic unsigned integers
UB = BitVec('UB', 64)
LB = BitVec('LB', 64)

# XOR for detecting differing bits
xor_val = UB ^ LB

# COUNT_LEADING_ZEROS
def count_leading_zeros(x):
    """Returns number of leading zeros in a 64-bit word."""
    clauses = []
    for i in range(64):
        bit_is_one = LShR(x, 63 - i) & 1 == 1
        bits_before_zero = And([LShR(x, 63 - j) & 1 == 0 for j in range(i)])
        clauses.append(If(And(bits_before_zero, bit_is_one), BitVecVal(i, 64), BitVecVal(64, 64)))
    return reduce(lambda a, b: If(a == 64, b, a), clauses)

# Step 1: Compute common prefix length
CPL = count_leading_zeros(xor_val)

# Step 2: COMMON_PREFIX_MASK = ((1 << CPL) - 1) << (64 - CPL)
one_shifted = (BitVecVal(1, 64) << CPL)
mask = one_shifted - 1
COMMON_PREFIX_MASK = mask << (64 - CPL)

# Step 3: COMMON_PREFIX
COMMON_PREFIX = UB & COMMON_PREFIX_MASK

# Step 4: ZEROS and ONES
ZEROS = COMMON_PREFIX_MASK & (~COMMON_PREFIX)
ONES = COMMON_PREFIX

# Step 5: Prove that [ONES, ~ZEROS] ⊇ [LB, UB]
prop = And(ULE(ONES, LB), ULE(UB, ~ZEROS))

# Step 6: Try to disprove (i.e., check if any UB, LB violates the above)
s = Solver()
s.add(Not(prop))  # Look for counterexamples

# Check the result
if s.check() == sat:
    m = s.model()
    ub_val = m.eval(UB).as_long()
    lb_val = m.eval(LB).as_long()
    ones_val = m.eval(ONES).as_long()
    zeros_val = m.eval(ZEROS).as_long()
    not_zeros_val = (~zeros_val) & 0xFFFFFFFFFFFFFFFF

    print("❌ Property does NOT hold. Counterexample found:")
    print(f"LB     = {lb_val:#018x}")
    print(f"UB     = {ub_val:#018x}")
    print(f"ONES   = {ones_val:#018x}")
    print(f"~ZEROS = {not_zeros_val:#018x}")
    print(f"ONES <= LB? {ones_val <= lb_val}")
    print(f"UB <= ~ZEROS? {ub_val <= not_zeros_val}")
else:
    print("✅ Property holds: [ONES, ~ZEROS] always covers [LB, UB] (UNSAT)")		 

image

Manually worked out Example 1:-
UB = 0xFFFF00FF
LB = 0xFFFF0000
COMMON_PREFIX = 0xFFFF0000
ONES = 0xFFFF0000
ZEROS = 0x00000000
---------------------------
MAX = ~ZEROS
= 0xFFFFFFFF
MIN = ONES
= 0xFFFF0000

Thus, it's evident that MAX >= UB AND MIN <= LB

Manually worked out Example 2:-
UB = 0xFF0F00FF
LB = 0xFF0F0000
COMMON_PREFIX = 0xFF0F0000
ONES = 0xFF0F0000
ZEROS = 0x00F00000

Since logical OR b/w ONES and ZEROS forms the COMMON_PREFIX MASK i.e. both ZEROS and ONES are derived from COMMON_PREFIX therefore COMMON_PREFIX_MASK & ~ZERO == COMMON_PREFIX_MASK & ONES, apart from COMMON_PREFIX there are other lower order bits which are not included in COMMON_PREFIX, when we flip the ZEROS then we get a VALUE which equals ONES + we set all other lower order bits, thus this value is guaranteed to be greater than UB since it also set other non-set bits of UB.

Thus, UB = COMMON_PREFIX + OTHER_SET_LOWER_ORDER_BITS_IN_UB
~ZEROS = COMMON_PREGIX + REMAINING_BITS_SET_TO_ONE.

While ONES is just the common prefix, which is guaranteed to be less than UB since we don’t account for set bits of UB, which are not part of the common prefix.

Thus, the following inequality holds good.

~ZEROS >= UB >= LB >= ONES

Proof for the KnownBits application to CTZ / CLZ

  • In both these cases, we are only concerned about the number of ZEROs which are either present at the start or towards the end of the integral bits sequence.
  • KnownBits.ZEROS and KnownBits.ONES captures the known set bits in value range, i.e., if a bit is set on ONES it means it's guaranteed to be one in all the values in the value range; similarly, a set bit in ZEROS signifies that the corresponding bit remains unset in all the values in the range. All unset bits in ONES/ZEROS signify not known at compile time.

Since the flipped value of ZEROS gives us the maximum value in the range, and given that number of leading zeros of maximum value will never be greater than the number of leading zeros of minimum value of the range hence, value range of CLZ varies b/w CLZ(~ZEROS) and CLZ(ONES) as its lower and upper bounds.

While counting leading zeros is relatively straightforward to prove given that each integral value is the sum of 2^position of set bits and given that TypeInt/TypeLong comply with the invariant (LB <= UB)
Hence highest set bits position of UB can never be less than the LB.

For CTZ, consider the following example
UB = 0xFF800000
LB = 0xFF8FF0F0
COMMON_PREFIX = 0xFF800000
ONES = 0xFF800000
ZEROS = 0x00700000

Again, both ONES and ZEROS bits are extracted from the common prefix of the delimiting bounds of the value range.

MAX = ~ZEROS = 0xFF8FFFFF
MIN = ONES = 0xFF800000

Thus, the number of trailing zeros will be [lb:0, ub:20], given that the lower order bits of knowbits which are not part of the common prefix will always be set to zero, hence CTZ(~ZERO) will always be 0. This is also in line with the fact that the number of leading/trailing zeros will never be a non-negative value. Therefore, delimiting lower and upper bounds of CTZ are CTZ(~ZEROS) and CTZ(ONES).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the addition! Very nice and solid proof.

Copy link
Member

@jatin-bhateja jatin-bhateja Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @MaxXSoft, Constant folding optimizations like these qualify for addition of a new benchmark. Please add one; I see a significant gain with following micro kernel on my Intel 13th Gen Intel(R) Core(TM) i3-1315U laptop.

    public static long micro(long param) {
        long constrained_param = Math.min(175, Math.max(param, 160));
        return Long.numberOfLeadingZeros(constrained_param);
    }

PROMPT>#baseline
PROMPT>java -Xbatch -XX:-TieredCompilation -cp . test_clz
[time] 17ms [res] 11200000

PROMPT>#withopt
PROMPT>java -Xbatch -XX:-TieredCompilation -cp . test_clz
[time] 5ms [res] 11200000

Copy link
Contributor

@eme64 eme64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll run some testing and review afterward.

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Aug 19, 2025
Copy link
Contributor

@eme64 eme64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a good patch, thanks for the work and patience with the review - it's been a bit slow over summer with vacation/travel.

Comment on lines 29 to 46
#include "utilities/count_leading_zeros.hpp"
#include "utilities/count_trailing_zeros.hpp"

static int count_leading_zeros_int(jint i) {
return i == 0 ? BitsPerInt : count_leading_zeros(i);
}

static int count_leading_zeros_long(jlong l) {
return l == 0 ? BitsPerLong : count_leading_zeros(l);
}

static int count_trailing_zeros_int(jint i) {
return i == 0 ? BitsPerInt : count_trailing_zeros(i);
}

static int count_trailing_zeros_long(jlong l) {
return l == 0 ? BitsPerLong : count_trailing_zeros(l);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In src/hotspot/share/utilities/count_leading_zeros.hpp, it says that 0 behavior is undefined. Ok... but why do we do that? Is that a performance optimization ? If yes, is it really worth it? If there is no good reason not to handle 0, we should just handle it.

We have some tests in test/hotspot/gtest/utilities/test_count_leading_zeros.cpp.

It would be interesting to quickly check if any use of these methods could ever encounter zero, and then hit the assert. I would not be surprised if we found a bug here.

I think this would be a worth while cleanup task. I would prefer if we clean things up now, and don't just let more special handling code get integrated.

Comment on lines +56 to +57
return TypeInt::make(count_leading_zeros_int(~ti->_bits._zeros),
count_leading_zeros_int(ti->_bits._ones),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is correct, but I would like to see a short comment why it is correct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same in other cases below

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@eme64
Copy link
Contributor

eme64 commented Sep 9, 2025

@MaxXSoft Feel free to just ping me again when you want another review :)
FYI: I'll be on a longer vacation starting in about a week, so don't expect me to respond then.

@MaxXSoft
Copy link
Member Author

@eme64 Thank you for your patience and kind reviews! I've updated this patch based on your suggestions.

This patch is now ready for further review.

Copy link
Contributor

@eme64 eme64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. Though I think you modified the test example so far that it does not work any more, i.e. it would not constant fold if the output range was wrong. I've identified 2 sources that would prevent constant folding:

  • It is unclear if getResultChecksum would get inlined. That way, the result loses the type information about the ranges.
  • The comparisons themselves would not constant fold, because the values you compare with are not constants, but array element loads. You need to compare result with a compile time constant.

Maybe the idea is not 100% clear for you:
Imagine result should be in some range 2..10. But with a bug, we now return 3..10. This means the output of numberOfLeadingZeros is still variable, and it does not constant fold. But: if there is something below it, like a result < 3 ... this would now constant fold to false, even though we could have had a 2 at runtime.
But for this to work, it all needs to be in the same compilation unit, and the value we compare to result must be a compile time constant.

Does that make sense?

}
}

int getResultChecksum(int result, int[] LIMITS) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would put a @ForceInlinie before this. You are using it in many methods, and so it may not get inlined reliably. And if it does not get inlined, then the result verifcation would not constant-fold, and so it would be kind of useless. Because we rely on the fact that if the range is wrong, we could get bad constant folding ;)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added @ForceInline.

Comment on lines 518 to 521
for (int i = 0; i < LIMITS.length; i += 2) {
if (result < LIMITS[i]) sum += 1 << i;
if (result > LIMITS[i + 1]) sum += 1 << (i + 1);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doublt that this works, because the test would not constant fold if the range was too narrow.
I think you need to manually unroll the loop, and load the constants from static final values, or another method that allows it to be a compile time constant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@bridgekeeper
Copy link

bridgekeeper bot commented Oct 10, 2025

@MaxXSoft This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@MaxXSoft
Copy link
Member Author

Hi @eme64, thanks for your review. I've updated the IR test to ensure that range checks work with the constant folding.

Do you have any other suggestions for this patch?

Copy link
Contributor

@eme64 eme64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxXSoft Ok, now it looks really good. I have one minor nit.

Running some internal testing now.

Comment on lines 68 to 84
LIMITS_32_0 = INTS_32.next();
LIMITS_32_1 = INTS_32.next();
LIMITS_32_2 = INTS_32.next();
LIMITS_32_3 = INTS_32.next();
LIMITS_32_4 = INTS_32.next();
LIMITS_32_5 = INTS_32.next();
LIMITS_32_6 = INTS_32.next();
LIMITS_32_7 = INTS_32.next();

LIMITS_64_0 = INTS_64.next();
LIMITS_64_1 = INTS_64.next();
LIMITS_64_2 = INTS_64.next();
LIMITS_64_3 = INTS_64.next();
LIMITS_64_4 = INTS_64.next();
LIMITS_64_5 = INTS_64.next();
LIMITS_64_6 = INTS_64.next();
LIMITS_64_7 = INTS_64.next();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not assign them directly? You just need to declare the generators first. Would save us a couple of lines.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Copy link
Contributor

@eme64 eme64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing passed, code looks good to me too now :)

Thanks for all the work you put into this, really appreciated ☺️

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Oct 28, 2025
Copy link
Contributor

@eme64 eme64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, even better :)

@openjdk openjdk bot added ready Pull request is ready to be integrated and removed ready Pull request is ready to be integrated labels Oct 29, 2025
@MaxXSoft
Copy link
Member Author

@eme64 Thank you for the review!

@merykitty @jatin-bhateja Do you have any other suggestions regarding the latest changes in this patch?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot-compiler [email protected] ready Pull request is ready to be integrated rfr Pull request is ready for review

Development

Successfully merging this pull request may close these issues.

6 participants