8310843: Reimplement ByteArray and ByteArrayLittleEndian with Unsafe #14636
Conversation
👋 Welcome back Glavo! A progress list of the required criteria for merging this PR into …

I removed …
The reimplementation allows the parts that the invoke package depends on to utilize faster byte array access, namely bytecode generation and the Classfile API; IMO this is more important than the reduction in startup time.
Created an issue at https://bugs.openjdk.org/browse/JDK-8310843. |
Webrevs
liach
left a comment
I recommend documenting that this class is intended to be usable at early startup so we don't accidentally introduce features like lambda into the code.
I deleted some incorrect comments. The original author of these two classes misunderstood the behavior of … I deleted those comments because conversions from … The conversion methods in the …

@Glavo This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be: …

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been no new commits pushed to the … branch.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@JimLaskey, @RogerRiggs) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment.
RogerRiggs
left a comment
LGTM
/integrate

/reviewers 2

@AlanBateman
I don't think this change should be integrated without further discussion on what problems you are running into. Code that runs in initPhase1 needs to have as few dependencies as possible and can use Unsafe directly when needed.
This patch merely allows …
* but it's not feasible in practices, because {@code ByteArray} and {@code ByteArrayLittleEndian}
* can be used in fundamental classes, {@code VarHandle} exercise many other
* code at VM startup, this could lead a recursive calls when fundamental
* classes is used in {@code VarHandle}.
This comment is confusing, esp. "not feasible in practices". If this code is changed then the comment can be very simple: say that it uses Unsafe to allow it to be used in early startup and in the implementation of classes such as VarHandle.
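If the Unsafe implementation stays, a simpler comment along those lines (my wording, purely a suggestion, not from the patch) could read:

```java
/*
 * Implemented with Unsafe rather than VarHandle so that these methods can be
 * used during early VM startup and in the implementation of classes such as
 * VarHandle itself, without introducing a circular dependency.
 */
```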
static final Unsafe UNSAFE = Unsafe.getUnsafe();
@ForceInline
static long arrayOffset(byte[] array, int typeBytes, int offset) {
IMHO, this is the really interesting thing that this class does - e.g. it introduces a way to translate a (logical) offset into a byte array into a physical offset that can be used with Unsafe. After you have a helper method like this, it seems like the client can just do what it wants by using Unsafe directly (which would remove the need for having this class)? Was some experiment of that kind done (e.g. replacing usage of ByteArray with Unsafe + helpers) - or does it lead to code that is too cumbersome on the client?
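For illustration, here is a minimal, self-contained sketch of that idea: a bounds-checking helper that turns a logical offset into a physical one, plus a read method on top of it. The class name, getIntLE, and the use of the reflectively obtained sun.misc.Unsafe (instead of the JDK-internal jdk.internal.misc.Unsafe) are my assumptions, not code from the PR:

```java
import java.lang.reflect.Field;
import java.nio.ByteOrder;
import java.util.Objects;
import sun.misc.Unsafe;

public class ArrayOffsetSketch {
    static final Unsafe UNSAFE;
    static {
        try {
            // Outside the JDK we have to fish Unsafe out via reflection.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Hypothetical helper mirroring the reviewed arrayOffset: bounds-check the
    // logical offset, then translate it into a physical offset for Unsafe.
    static long arrayOffset(byte[] array, int typeBytes, int offset) {
        Objects.checkIndex(offset, array.length - typeBytes + 1);
        return Unsafe.ARRAY_BYTE_BASE_OFFSET + offset;
    }

    static int getIntLE(byte[] array, int offset) {
        int v = UNSAFE.getInt(array, arrayOffset(array, Integer.BYTES, offset));
        // Unsafe reads in native order; normalize to little-endian.
        return ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN
                ? v : Integer.reverseBytes(v);
    }

    public static void main(String[] args) {
        byte[] a = {0x78, 0x56, 0x34, 0x12};
        if (getIntLE(a, 0) != 0x12345678) throw new AssertionError();
        System.out.println("ok");
    }
}
```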
Also, were ByteBuffers considered as an alternative? (I'm not suggesting MemorySegment as those depend on VarHandle again, but a heap ByteBuffer is just a thin wrapper around an array which uses Unsafe). ByteBuffer will have a bound check, but so does your code (which calls checkIndex). I believe that, at least in hot code, wrapping a ByteBuffer around a byte array should be routinely scalarized, as there's no control flow inside these little methods.
Actually, a byte buffer is big endian, so some extra code would be required. But maybe that's another helper function:
@ForceInline
ByteBuffer asBuffer(byte[] array) { return ByteBuffer.wrap(array).order(ByteOrder.nativeOrder()); }
And then replace:
ByteArray.getChar(array, 42)
With
asBuffer(array).getChar(42);
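As a self-contained illustration of that wrapping idiom (the class name and the write path are mine; the reviewer's sketch covers only the read side, and the view is native-order rather than ByteArray's big-endian):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class AsBufferExample {
    // The helper suggested above: a fresh native-order view over the array.
    static ByteBuffer asBuffer(byte[] array) {
        return ByteBuffer.wrap(array).order(ByteOrder.nativeOrder());
    }

    public static void main(String[] args) {
        byte[] array = new byte[64];
        asBuffer(array).putChar(42, 'x');     // absolute write through one view
        char c = asBuffer(array).getChar(42); // stands in for ByteArray.getChar(array, 42)
        System.out.println(c);                // prints: x
    }
}
```

If C2 scalarizes the short-lived wrapper as hoped, the two asBuffer calls cost nothing beyond the bound checks.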
Also... in a lot of cases where ByteArray is used (DataXYZStream, ObjectXYZStream) the array being used is a field in the class. So the byte buffer creation can definitely be amortized (or the code changed to work on buffers instead of arrays).
The Unsafe-based writing will be used by Integer.toString and Long.toString as well; in those cases, will creating a ByteBuffer wrapper be overkill?
The Unsafe-based writing will be used by Integer.toString and Long.toString as well; in those cases, will creating a ByteBuffer wrapper be overkill?
Integer/Long are very core classes so I assume they can use Unsafe if needed; they probably want as few dependencies as possible.
By the way, I ran LoopOverNonConstantHeap on the 3700x platform, and the performance of ByteBuffer was also poor:
I finally see it.
Benchmark                           (polluteProfile)  Mode  Cnt  Score   Error  Units
LoopOverNonConstantHeap.BB_get                 false  avgt   30  1.801 ± 0.020  ns/op
LoopOverNonConstantHeap.unsafe_get             false  avgt   30  0.567 ± 0.007  ns/op
It seems that, between updating JMH and rebuilding the JDK from scratch, something did the trick.
I knew that random access on a BB is slower than Unsafe (as there's an extra check), whereas looped access is as fast (as C2 is good at hoisting the checks outside the loop, as shown in the benchmark). Note also that we are in the nanosecond realm, so each instruction here counts.
Is there any benchmark for DataInput/Output stream that can be used? I mean, it would be interesting to understand how these numbers translate when running the stuff that is built on top.
Is there any benchmark for DataInput/Output stream that can be used? I mean, it would be interesting to understand how these numbers translate when running the stuff that is built on top.
I've tried to run the benchmark in test/micro/java/io/DataInputStream.java. This is the baseline:
Benchmark Mode Cnt Score Error Units
DataInputStreamTest.readChar avgt 20 7.583 ± 0.026 us/op
DataInputStreamTest.readInt avgt 20 3.804 ± 0.045 us/op
And this is with a patch similar to the one I shared above, to use ByteBuffer internally:
Benchmark Mode Cnt Score Error Units
DataInputStreamTest.readChar avgt 20 7.594 ± 0.106 us/op
DataInputStreamTest.readInt avgt 20 3.795 ± 0.030 us/op
There does not seem to be any extra overhead. That said, access occurs in a counted loop, and in these cases we know buffer/segment access is optimized quite well.
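I have not seen the exact patch, but the general idea it describes can be sketched like this (all names are hypothetical): the stream keeps one cached ByteBuffer view over its internal buffer instead of assembling values from individual bytes, so no wrapper is allocated per read.

```java
import java.nio.ByteBuffer;

public class BufferedIntReader {
    private final ByteBuffer view; // one cached view; wrap() defaults to big-endian, like DataInput

    BufferedIntReader(byte[] buffered) {
        this.view = ByteBuffer.wrap(buffered);
    }

    int readInt() {
        return view.getInt();  // relative read advances the buffer position
    }

    char readChar() {
        return view.getChar();
    }

    public static void main(String[] args) {
        BufferedIntReader r = new BufferedIntReader(
                new byte[]{0, 0, 0, 42, 0, (byte) 'x'});
        System.out.println(r.readInt());  // prints: 42
        System.out.println(r.readChar()); // prints: x
    }
}
```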
I believe the question here is: do we have benchmarks which are representative of the kind of gain that would be introduced by micro-optimizing ByteArray? It can be quite tricky to estimate real benefits from synthetic benchmarks on the ByteArray class, especially when fetching a single element outside of a loop - as those are not representative of how the clients will use this. I note that the original benchmark made by Per used a loop with two iterations to assess the cost of the ByteArray operations:
http://minborgsjavapot.blogspot.com/2023/01/java-21-performance-improvements.html
If I change the benchmark to do 2 iterations, I see this:
Benchmark Mode Cnt Score Error Units
ByteArray.readByte thrpt 5 704199.172 ± 34101.508 ops/ms
ByteArray.readByteFromBuffer thrpt 5 474321.828 ± 6588.471 ops/ms
ByteArray.readInt thrpt 5 662411.181 ± 4470.951 ops/ms
ByteArray.readIntFromBuffer thrpt 5 496900.429 ± 3705.737 ops/ms
ByteArray.readLong thrpt 5 665138.063 ± 5944.814 ops/ms
ByteArray.readLongFromBuffer thrpt 5 517781.548 ± 27106.331 ops/ms
The more the iterations, the less the cost (and you don't need many iterations to break even). This probably explains why the DataInputStream benchmark doesn't change - there are 1024 iterations in there.
I guess all this is to say that excessively focussing on microbenchmarks of a simple class such as ByteArray in conditions that are likely unrealistic (e.g. single access) is IMHO the wrong way to look at things, as ByteArray is mostly used by classes that most definitely will read more than one value at a time (including the classfile API).
So, also IMHO, we should try to measure the use cases we care about in the higher-level APIs (I/O streams, classfile) and then see if adding Unsafe/VarHandle/ByteBuffer access in here is going to lead to any benefit at all.
Just some feedback about this discussion:
- I agree that the DataInput/OutputStreams should maybe use ByteBuffer directly as they use buffering already. So the patch above looks fine. In my project Apache Lucene (which has many performance critical methods like this), we have already implemented ByteBuffer based access like this for all IO-stream-based classes (we call them DataInput/DataOutput). I don't know why you have seen differences in using a ByteBuffer as final field in the class. That's common and used in most frameworks out there (Netty, ...) and is bullet proof (unless there's a bug in the optimizer, which sometimes happened in the past).
- We noticed that wrapping a byte array on each access by ByteBuffer causes a lot of overhead and GC activity if used in hot loops. In addition we have seen cases where it is not optimized anymore (not sure why). @mcimadamore: You remember the similar discussions about the MemorySegment slices and copying them around between heap/foreign? Maybe inside the JDK you can do better by using @ForceInline. Our code can't do this, so we try to avoid creating instances of classes in such low-level code.
- The original VarHandle approach is now used in Lucene's code in all places (basically the idea to use VarHandles for this class was suggested by me a while back). We often have byte arrays and can't wrap them as ByteBuffer on each call (because it's not always inlined). For code outside of the JDK this looks like the best approach to have fast access to short/int/long values at specific (not necessarily aligned) positions. We have seen LZ4 compression getting much faster after changing the code from manually constructing longs/floats from bytes like in the reference code. With ByteBuffer it was often getting slower (depending on how it was called, I think because we can't do @ForceInline in code outside the JDK).
Generally: a class like this is very nice and also very much needed in code outside the JDK. A lot of code like encoding/decoding network bytes or compression algorithms often has the pattern of wanting to read primitive types from byte arrays. The wrapping overhead looks bad in code and also causes long startup times and sometimes also OOM (if used multithreaded from different threads hammering the byte array accessors). Also, you do not want to write an LZ4 decompressor using ByteBuffer as its only source of data... :-(
So have you thought of making these low-level classes public so we outside users no longer need to deal with VarHandles?
Maybe java.util.ByteArrays with solely static methods. The internal implementation of such a useful basic utility class could definitely use Unsafe internally, so I would leave that out of the discussion here. If you use Unsafe there are no surprises! Personally I have no problem with the current implementation in this PR! I would just put the little/big endian impls in the same class and move it to java.util (this is just my comment about this, coming from a library which does this low level stuff all the time).
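A hypothetical shape for such a public utility (the class name follows the java.util.ByteArrays suggestion above; shown with the VarHandle implementation, in the spirit of Lucene's BitUtil, since library code cannot use Unsafe):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public final class ByteArrays {
    // byteArrayViewVarHandle gives bounds-checked, possibly-unaligned access.
    private static final VarHandle INT_LE =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);
    private static final VarHandle INT_BE =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.BIG_ENDIAN);
    private static final VarHandle LONG_LE =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    private ByteArrays() {}

    // Both endiannesses in one class, as suggested above.
    public static int getIntLE(byte[] b, int off)         { return (int) INT_LE.get(b, off); }
    public static int getIntBE(byte[] b, int off)         { return (int) INT_BE.get(b, off); }
    public static long getLongLE(byte[] b, int off)       { return (long) LONG_LE.get(b, off); }
    public static void setIntLE(byte[] b, int off, int v) { INT_LE.set(b, off, v); }

    public static void main(String[] args) {
        byte[] b = new byte[8];
        setIntLE(b, 0, 0x12345678);
        System.out.println(Integer.toHexString(getIntLE(b, 0))); // prints: 12345678
        System.out.println(Integer.toHexString(getIntBE(b, 0))); // prints: 78563412
    }
}
```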
So have you thought of making these low-level classes public so we outside users no longer need to deal with VarHandles?
I believe this is beyond the scope of this PR.
As for what we do in the JDK, I can see a few options:
- We keep things as they are in current mainline.
- We keep changes in this PR.
- We rewrite most uses of ByteArray in java.io to use BB and remove ByteArray
- We remove ByteArray and provide some static helper function to generate an unsafe offset from an array
I agree with @uschindler that wrapping stuff in ByteBuffer "on the fly" might be problematic for code that is not inlined, so I don't think we should do that.
I have to admit that I'm a little unclear as to what the goal of this PR is. Initially, it started as an "improve startup" effort, which then morphed into a "let's make ByteArray more usable" effort, even for other clients (like the classfile API, or Long::toString). I'm unsure about the latter use cases, because (a) Long/Integer are core classes and should probably use Unsafe directly, where needed and (b) for the classfile API, using ByteBuffer seems a good candidate on paper (of course there is the unknown of how well the byte buffer access will optimize in the classfile API code - but if there's more than one access on the same buffer, we should be more than ok).
I'd like to add some more words of caution against the synthetic benchmarks that we tried above. These benchmarks are quite peculiar, for at least two reasons:
- we only ever access one element
- the accessed offset is always zero
No general API can equal Unsafe under this set of conditions. When playing with the benchmark I realized that every little thing mattered (we're really measuring the number of instructions emitted by C2) - for instance, the fact that when access occurs with a byte buffer, the underlying array and limit have to be fetched from their fields has a cost. Also, the fact that ByteBuffer has a hierarchy has an even bigger cost (as C2 has to make sure you are really invoking HeapByteBuffer). The mutable endianness state in byte buffer also adds up to the noise. The above is what ends up in a big fat "2x slower" label.
That said, all these "factors" are only relevant because we're looking at a single buffer operation. In fact, all such costs can easily be amortized as soon as there's more than one access. Or as soon as you start accessing offsets that are not known statically (unlike in the benchmark).
So, there's a question of what's the code idiom that leads to the absolute fastest code (and I agree that Unsafe + static wrappers seems the best here). And then there's the question of "but, what do we need to get the performance number/startup behavior we want". I feel the important question is the second, but we keep arguing about the former.
And, to assess that second question, we need to understand better what the goals are (which, so far, seems a bit fuzzy).
So have you thought of making these low-level classes public so we outside users no longer need to deal with VarHandles?
I believe this is beyond the scope of this PR.
Sure, I brought this up here but yes, it is not really the scope of this PR. It is just another idea: this class could be of wider use, although outside of this PR and also outside of Lucene. Actually it would be nice to have it public, but I know this involves creating a JEP and so on. If there's interest I could start by proposing something like this on the mailing list, later creating a JEP or whatever else is needed.
P.S.: Actually for a 3rd party user the whole thing is not that complicated. You only need a class to allocate the VarHandles and then use them from code; you don't even need the wrapper methods (although they make it nicer to read and you don't need the cast of the return value). As there is no security involved, one can have those VarHandles as public static fields in some utility class: https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/util/BitUtil.html; usage of them is quite simple then: https://github.com/apache/lucene/blob/59c56a0aed9a43d24c676376b5d50c5c6518e3bc/lucene/core/src/java/org/apache/lucene/store/ByteArrayDataInput.java#L96 (there are many of those throughout Lucene's code)
So I agree with your ideas; we have to decide what is best for this PR. I tend to think that these two options are good:
- Use ByteBuffer in classfile API
- commit the PR as proposed here (looks fine to me).
Is …

Good point. Not only that, …
RogerRiggs
left a comment
Until the comments settle down, don't integrate or sponsor.
These classes are used where numbers are being formatted into existing byte arrays and may be assembled with other strings.

@Glavo This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

/open

@Glavo This pull request is already open

In my opinion, reducing startup time can be achieved in other ways, both short-term (sharing VHs) and long-term (pre-generating VHs at build time using condensers). I think having a supported API carries some advantages over using Unsafe directly.

@Glavo This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@Glavo This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.
ByteArray and ByteArrayLittleEndian are very useful tool classes that can be used in many places for performance tuning. Currently they are implemented with VarHandle, so using them may have some impact on startup time. This PR reimplements them using Unsafe, which reduces the impact on startup time.

Progress

Issue

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14636/head:pull/14636
$ git checkout pull/14636

Update a local copy of the PR:
$ git checkout pull/14636
$ git pull https://git.openjdk.org/jdk.git pull/14636/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 14636

View PR using the GUI difftool:
$ git pr show -t 14636

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14636.diff

Webrev

Link to Webrev Comment