Conversation

@Glavo
Contributor

@Glavo Glavo commented Jun 24, 2023

ByteArray and ByteArrayLittleEndian are very useful utility classes that can be used in many places for performance tuning.

Currently they are implemented with VarHandle, so using them may have some impact on startup time.

This PR reimplements them using Unsafe, which reduces the impact on startup time.
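For context, the current implementation is built on byte-array view VarHandles. A minimal sketch of that pattern (the class and method names here are illustrative, not the JDK's internal code):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Sketch of a VarHandle-based big-endian reader, the approach this PR replaces.
public class ByteArrayReadSketch {
    static final VarHandle INT =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.BIG_ENDIAN);

    // Reads a big-endian int starting at 'offset'; the VarHandle performs the bounds check.
    static int getInt(byte[] b, int offset) {
        return (int) INT.get(b, offset);
    }

    public static void main(String[] args) {
        byte[] b = {0x12, 0x34, 0x56, 0x78};
        System.out.println(Integer.toHexString(getInt(b, 0))); // prints 12345678
    }
}
```

The startup cost mentioned above comes from bootstrapping the VarHandle machinery (method handles, LambdaForms) the first time such a class is touched, which is what an Unsafe-based implementation avoids.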


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

  • JDK-8310843: Reimplement ByteArray and ByteArrayLittleEndian with Unsafe (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14636/head:pull/14636
$ git checkout pull/14636

Update a local copy of the PR:
$ git checkout pull/14636
$ git pull https://git.openjdk.org/jdk.git pull/14636/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 14636

View PR using the GUI difftool:
$ git pr show -t 14636

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14636.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Jun 24, 2023

👋 Welcome back Glavo! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jun 24, 2023

@Glavo The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@Glavo
Contributor Author

Glavo commented Jun 24, 2023

I removed getFloatRaw/getDoubleRaw because they are exactly the same as getFloat/getDouble, so they are redundant.

@liach
Member

liach commented Jun 24, 2023

The reimplementation allows code that VarHandle itself depends on to use faster byte array access, namely bytecode generation and the Classfile API; IMO that is more important than the reduction in startup time.

@liach
Member

liach commented Jun 24, 2023

Created an issue at https://bugs.openjdk.org/browse/JDK-8310843.
I look forward to this improvement, for it can potentially be used by the Classfile API to speed up Classfile parsing and writing.

@Glavo Glavo changed the title Reimplement ByteArray and ByteArrayLittleEndian with Unsafe 8310843: Reimplement ByteArray and ByteArrayLittleEndian with Unsafe Jun 24, 2023
@openjdk openjdk bot added the rfr Pull request is ready for review label Jun 24, 2023
@mlbridge

mlbridge bot commented Jun 24, 2023

Member

@liach liach left a comment

I recommend documenting that this class is intended to be usable at early startup so we don't accidentally introduce features like lambda into the code.

@openjdk openjdk bot removed the rfr Pull request is ready for review label Jun 24, 2023
@openjdk openjdk bot added the rfr Pull request is ready for review label Jun 24, 2023
@openjdk openjdk bot removed the rfr Pull request is ready for review label Jun 24, 2023
@openjdk openjdk bot added the rfr Pull request is ready for review label Jun 24, 2023
@liach liach mentioned this pull request Jun 24, 2023
3 tasks
@Glavo Glavo mentioned this pull request Jun 25, 2023
3 tasks
@Glavo
Contributor Author

Glavo commented Jun 26, 2023

I deleted some incorrect comments.

The original author of these two classes misunderstood the behavior of intBitsToFloat and longBitsToDouble. These two methods never collapse NaN values to a "canonical" NaN value, so the comments were incorrect.

I deleted those comments because conversions from int/long to float/double don't need to distinguish that behavior; the difference only exists when converting from float/double to int/long.

The conversion methods in the Float/Double classes also reflect this: there is only one method for converting from int/long to float/double, but there are two methods for converting from float/double to int/long.
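The asymmetry can be demonstrated directly with the public Float API (a small self-contained example, not code from this PR):

```java
// Demonstrates that intBitsToFloat never canonicalizes NaN bit patterns,
// while float -> int needs two methods: floatToIntBits (canonicalizing)
// and floatToRawIntBits (raw).
public class NanBitsDemo {
    public static void main(String[] args) {
        int nonCanonicalNaN = 0x7fc00001; // a quiet NaN that is not the canonical 0x7fc00000

        float f = Float.intBitsToFloat(nonCanonicalNaN);
        // The raw round trip preserves the pattern: int -> float loses nothing.
        System.out.println(Integer.toHexString(Float.floatToRawIntBits(f))); // 7fc00001
        // floatToIntBits collapses every NaN to the canonical pattern.
        System.out.println(Integer.toHexString(Float.floatToIntBits(f)));    // 7fc00000
    }
}
```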

@openjdk

openjdk bot commented Jul 19, 2023

@Glavo This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8310843: Reimplement ByteArray and ByteArrayLittleEndian with Unsafe

Reviewed-by: liach, jlaskey

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@JimLaskey, @RogerRiggs) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 19, 2023
Contributor

@RogerRiggs RogerRiggs left a comment

LGTM

@Glavo
Contributor Author

Glavo commented Jul 19, 2023

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Jul 19, 2023
@openjdk

openjdk bot commented Jul 19, 2023

@Glavo
Your change (at version cb56e73) is now ready to be sponsored by a Committer.

@AlanBateman
Contributor

/reviewers 2

@openjdk

openjdk bot commented Jul 20, 2023

@AlanBateman
The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).

@AlanBateman
Contributor

I don't think this change should be integrated without further discussion on what problems you are running into. Code that runs in initPhase1 needs to have as few dependencies as possible and can use Unsafe directly when needed.

@liach
Member

liach commented Jul 20, 2023

This patch merely allows ByteArray and ByteArrayLittleEndian to be used before VarHandle is ready. An example is the Classfile API (which will be used to generate LambdaForms for VarHandle), as shown in #14637.

* but it's not feasible in practices, because {@code ByteArray} and {@code ByteArrayLittleEndian}
* can be used in fundamental classes, {@code VarHandle} exercise many other
* code at VM startup, this could lead a recursive calls when fundamental
* classes is used in {@code VarHandle}.
Contributor

@AlanBateman AlanBateman Jul 20, 2023

This comment is confusing, esp. "not feasible in practices". If this code is changed then the comment can simply say that it uses Unsafe so that it can be used at early startup and in the implementation of classes such as VarHandle.

static final Unsafe UNSAFE = Unsafe.getUnsafe();

@ForceInline
static long arrayOffset(byte[] array, int typeBytes, int offset) {
Contributor

IMHO, this is the really interesting thing that this class does - e.g. it introduces a way to translate a (logical) offset into a byte array into a physical offset that can be used with Unsafe. Once you have a helper method like this, it seems like the client can just do what it wants by using Unsafe directly (which would remove the need for this class)? Was some experiment of that kind done (e.g. replacing usage of ByteArray with Unsafe + helpers) - or does it lead to code that is too cumbersome for the client?

Also, were ByteBuffers considered as an alternative? (I'm not suggesting MemorySegment, as those depend on VarHandle again, but a heap ByteBuffer is just a thin wrapper around an array which uses Unsafe). ByteBuffer will have a bounds check, but so does your code (which calls checkIndex). I believe that, at least in hot code, wrapping a ByteBuffer around a byte array should be routinely scalarized, as there's no control flow inside these little methods.
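A rough sketch of such a logical-to-physical offset helper. This uses the unsupported sun.misc.Unsafe via reflection purely for illustration (the PR itself uses the internal jdk.internal.misc.Unsafe, which ordinary code cannot access); the names are hypothetical:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical helper: translate a logical byte-array offset into a physical
// offset usable with Unsafe, with an explicit bounds check up front.
public class UnsafeOffsetSketch {
    static final Unsafe U;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static long arrayOffset(byte[] array, int typeBytes, int offset) {
        if (offset < 0 || offset > array.length - typeBytes)
            throw new IndexOutOfBoundsException("offset " + offset);
        return Unsafe.ARRAY_BYTE_BASE_OFFSET + offset;
    }

    public static void main(String[] args) {
        byte[] b = new byte[8];
        // Native-endian write and read through the translated offset.
        U.putInt(b, arrayOffset(b, Integer.BYTES, 4), 0xCAFEBABE);
        System.out.println(Integer.toHexString(U.getInt(b, arrayOffset(b, Integer.BYTES, 4))));
    }
}
```

Once such a helper exists, a client could indeed call getInt/putInt on Unsafe directly, which is exactly the question raised above.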

Contributor

Actually, a byte buffer is big-endian by default, so some extra code would be required. But maybe that's another helper function:

@ForceInline
ByteBuffer asBuffer(byte[] array) { return ByteBuffer.wrap(array).order(ByteOrder.nativeOrder()); }

And then replace:

ByteArray.getChar(array, 42)

With

asBuffer(array).getChar(42);

Contributor

Also... in a lot of cases where ByteArray is used (DataXYZStream, ObjectXYZStream) the array being used is a field in the class. So the byte buffer creation can definitely be amortized (or the code changed to work on buffers instead of arrays).
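A sketch of that amortized pattern, assuming a hypothetical stream-like class that already keeps its backing array in a field (DataInputStream's format is big-endian):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical reader that wraps its backing array once, so the ByteBuffer
// creation cost is amortized over many reads.
public class BufferedReads {
    private final ByteBuffer buf;

    BufferedReads(byte[] array) {
        this.buf = ByteBuffer.wrap(array).order(ByteOrder.BIG_ENDIAN);
    }

    int readInt(int pos)   { return buf.getInt(pos); }
    char readChar(int pos) { return buf.getChar(pos); }

    public static void main(String[] args) {
        BufferedReads r = new BufferedReads(new byte[]{0, 0, 0, 42, 0, 0x41});
        System.out.println(r.readInt(0));  // prints 42
        System.out.println(r.readChar(4)); // prints A
    }
}
```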

Member

The Unsafe-based writing will be used by Integer.toString and Long.toString as well; in those cases, will creating a ByteBuffer wrapper be overkill?

Contributor

The Unsafe-based writing will be used by Integer.toString and Long.toString as well; in those cases, will creating a ByteBuffer wrapper be overkill?

Integer/Long are very core classes so I assume they can use Unsafe if needed; they probably want as few dependencies as possible.

Contributor

By the way, I ran LoopOverNonConstantHeap on the 3700x platform, and the performance of ByteBuffer was also poor:

I finally see it.

Benchmark                           (polluteProfile)  Mode  Cnt  Score   Error  Units
LoopOverNonConstantHeap.BB_get                 false  avgt   30  1.801 ± 0.020  ns/op
LoopOverNonConstantHeap.unsafe_get             false  avgt   30  0.567 ± 0.007  ns/op

It seems that, between updating JMH and rebuilding the JDK from scratch, something did the trick.

I knew that random access on a BB is slower than Unsafe (as there's an extra check), whereas looped access is as fast (as C2 is good at hoisting the checks outside the loop, as shown in the benchmark). Note also that we are in the nanosecond realm, so each instruction here counts.

Is there any benchmark for DataInput/Output stream that can be used? I mean, it would be interesting to understand how these numbers translate when running the stuff that is built on top.

Contributor

Is there any benchmark for DataInput/Output stream that can be used? I mean, it would be interesting to understand how these numbers translate when running the stuff that is built on top.

I've tried to run the benchmark in test/micro/java/io/DataInputStream.java. This is the baseline:

Benchmark                     Mode  Cnt  Score   Error  Units
DataInputStreamTest.readChar  avgt   20  7.583 ± 0.026  us/op
DataInputStreamTest.readInt   avgt   20  3.804 ± 0.045  us/op

And this is with a patch similar to the one I shared above, to use ByteBuffer internally:

Benchmark                     Mode  Cnt  Score   Error  Units
DataInputStreamTest.readChar  avgt   20  7.594 ± 0.106  us/op
DataInputStreamTest.readInt   avgt   20  3.795 ± 0.030  us/op

There does not seem to be any extra overhead. That said, access occurs in a counted loop, and in these cases we know buffer/segment access is optimized quite well.

I believe the question here is: do we have benchmarks which are representative of the kind of gain that would be introduced by micro-optimizing ByteArray? It can be quite tricky to estimate real benefits from synthetic benchmarks on the ByteArray class, especially when fetching a single element outside of a loop - as those are not representative of how the clients will use this. I note that the original benchmark made by Per used a loop with two iterations to assess the cost of the ByteArray operations:

http://minborgsjavapot.blogspot.com/2023/01/java-21-performance-improvements.html

If I change the benchmark to do 2 iterations, I see this:

Benchmark                      Mode  Cnt       Score       Error   Units
ByteArray.readByte            thrpt    5  704199.172 ± 34101.508  ops/ms
ByteArray.readByteFromBuffer  thrpt    5  474321.828 ±  6588.471  ops/ms
ByteArray.readInt             thrpt    5  662411.181 ±  4470.951  ops/ms
ByteArray.readIntFromBuffer   thrpt    5  496900.429 ±  3705.737  ops/ms
ByteArray.readLong            thrpt    5  665138.063 ±  5944.814  ops/ms
ByteArray.readLongFromBuffer  thrpt    5  517781.548 ± 27106.331  ops/ms

The more the iterations, the lower the cost (and you don't need many iterations to break even). This probably explains why the DataInputStream benchmark doesn't change - there are 1024 iterations in there.

I guess all this is to say that excessively focusing on microbenchmarks of a simple class such as ByteArray in conditions that are likely unrealistic (e.g. single access) is IMHO the wrong way to look at things, as ByteArray is mostly used by classes that will almost certainly read more than one value at a time (including the classfile API).

So, also IMHO, we should try to measure the use cases we care about in the higher-level APIs we care about (I/O streams, classfile) and then see if adding Unsafe/VarHandle/ByteBuffer access in here is going to lead to any benefit at all.

Member

Just some feedback about this discussion:

  • I agree that the DataInput/OutputStreams should maybe use ByteBuffer directly as they use buffering already. So the patch above looks fine. In my project Apache Lucene (which has many performance-critical methods like this), we have already implemented ByteBuffer-based access like this for all IO-stream-based classes (we call them DataInput/DataOutput). I don't know why you have seen differences in using a ByteBuffer as a final field in the class. That's common and used in most frameworks out there (Netty, ...) and is bulletproof (unless there's a bug in the optimizer, which sometimes happened in the past).
  • We noticed that wrapping a byte array on each access with a ByteBuffer causes a lot of overhead and GC activity if used in hot loops. In addition we have seen cases where it is not optimized anymore (not sure why). @mcimadamore: You remember the similar discussions about the MemorySegment slices and copying them around between heap/foreign? Maybe inside the JDK you can do better by using @ForceInline. Our code can't do this, so we try to avoid creating instances of classes in such low-level code.
  • The original VarHandle approach is now used in Lucene's code in all places (basically the idea to use VarHandles for this class was suggested by me a while back). We often have byte arrays and can't wrap them in a ByteBuffer on each call (because it's not always inlined). For code outside of the JDK this looks like the best approach to have fast access to short/int/long values at specific (not necessarily aligned) positions. We have seen LZ4 compression getting much faster after changing the code from manually constructing longs/floats from bytes like in the reference code. With ByteBuffer it was often getting slower (depending on how it was called; I think because we can't do @ForceInline in code outside the JDK).

Generally: A class like this is very nice and also very much needed in code outside the JDK. A lot of code like encoding/decoding network bytes or compression algorithms often has the pattern of wanting to read primitive types from byte arrays. The overhead of wrapping looks bad in code and also causes long startup times and sometimes also OOM (if used multithreaded from different threads hammering the byte array accessors). Also you do not want to write an LZ4 decompressor using ByteBuffer as its only source of data... :-(

So have you thought of making these low-level classes public so we outside users no longer need to deal with VarHandles?

Maybe java.util.ByteArrays with solely static methods. The internal implementation of such a useful basic utility class could definitely use Unsafe internally, so I would leave that out of the discussion here. If you use Unsafe there are no surprises! Personally I have no problem with the current implementation in this PR! I would just put the little/big endian impls in the same class and move it to java.util (this is just my comment about this, coming from a library which does this low-level stuff all the time).

Contributor

So have you thought of making these low-level classes public so we outside users no longer need to deal with VarHandles?

I believe this is beyond the scope of this PR.

As for what do we do in the JDK, I can see few options:

  1. We keep things as they are in current mainline.
  2. We keep changes in this PR.
  3. We rewrite most uses of ByteArray in java.io to use BB and remove ByteArray
  4. We remove ByteArray and provide some static helper function to generate an unsafe offset from an array

I agree with @uschindler that wrapping stuff in ByteBuffer "on the fly" might be problematic for code that is not inlined, so I don't think we should do that.

I have to admit that I'm a little unclear as to what the goal of this PR is. Initially, it started as an "improve startup" effort, which then morphed into a "let's make ByteArray more usable" effort, even for other clients (like the classfile API, or Long::toString). I'm unsure about the latter use cases, because (a) Long/Integer are core classes and should probably use Unsafe directly, where needed and (b) for the classfile API, using ByteBuffer seems a good candidate on paper (of course there is the unknown of how well the byte buffer access will optimize in the classfile API code - but if there's more than one access on the same buffer, we should be more than ok).

I'd like to add some more words of caution against the synthetic benchmarks that we tried above. These benchmarks are quite peculiar, for at least two reasons:

  • we only ever access one element
  • the accessed offset is always zero

No general API can equal Unsafe under this set of conditions. When playing with the benchmark I realized that every little thing mattered (we're really measuring the number of instructions emitted by C2) - for instance, the fact that when access occurs with a byte buffer, the underlying array and limit have to be fetched from their fields has a cost. Also, the fact that ByteBuffer has a hierarchy has an even bigger cost (as C2 has to make sure you are really invoking HeapByteBuffer). The mutable endianness state in byte buffer also adds to the noise. The above is what ends up in a big fat "2x slower" label.

That said, all these "factors" are only relevant because we're looking at a single buffer operation. In fact, all such costs can easily be amortized as soon as there is more than one access. Or as soon as you start accessing offsets that are not known statically (unlike in the benchmark).

So, there's a question of what's the code idiom that leads to the absolute fastest code (and I agree that Unsafe + static wrappers seems the best here). And then there's the question of "what do we need to get the performance numbers/startup behavior we want?". I feel the important question is the second, but we keep arguing about the first.

And, to assess that second question, we need to understand better what the goals are (which, so far, seems a bit fuzzy).

Member

@uschindler uschindler Jul 21, 2023

So have you thought of making these low-level classes public so we outside users no longer need to deal with VarHandles?

I believe this is beyond the scope of this PR.

Sure, I brought this up here, but yes, it is not really in the scope of this PR. It is just another idea: this class could be of wider use, outside of this PR and also outside of Lucene. Actually it would be nice to have it public, but I know this involves creating a JEP and so on. If there's interest I could start by proposing something like this on the mailing list, later creating a JEP or whatever else is needed.

P.S.: Actually for a 3rd-party user the whole thing is not very complicated. You only need a class to allocate the VarHandles and then use them from code; you don't even need the wrapper methods (although they make it nicer to read and you don't need the cast of the return value). As there is no security involved, one can have those VarHandles as public static fields in some utility class: https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/util/BitUtil.html; Usage of them is quite simple then: https://github.com/apache/lucene/blob/59c56a0aed9a43d24c676376b5d50c5c6518e3bc/lucene/core/src/java/org/apache/lucene/store/ByteArrayDataInput.java#L96 (there are many of those throughout Lucene's code)

So I agree with your ideas; we have to decide what is best for this PR. I tend to think that these 2 options are good:

  • Use ByteBuffer in classfile API
  • commit the PR as proposed here (looks fine to me).

@mcimadamore
Contributor

The reimplementation allows code that VarHandle itself depends on to use faster byte array access, namely bytecode generation and the Classfile API; IMO that is more important than the reduction in startup time.

Is BufWriterImpl what you are looking to refactor using this class? Any reason why that implementation doesn't internally rely on a ByteBuffer, rather than a plain array?

@liach
Member

liach commented Jul 20, 2023

Is BufWriterImpl what you are looking to refactor using this class? Any reason why that implementation doesn't internally rely on a ByteBuffer, rather than a plain array?

Good point. Not only that, ClassReaderImpl should be migrated to read U2/S2 and U4/S4 via ByteBuffer as well.
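For illustration, reading classfile u2/u4 fields through a ByteBuffer is straightforward, since classfiles are big-endian, which is ByteBuffer's default order (a standalone sketch, not ClassReaderImpl code; in hot code the wrapper would be kept in a field rather than created per read):

```java
import java.nio.ByteBuffer;

// Standalone sketch: classfile u2/u4 reads via ByteBuffer (big-endian by default).
public class ClassfileHeader {
    // u2 values are unsigned, so mask the sign-extended short.
    static int readU2(byte[] b, int off) {
        return ByteBuffer.wrap(b).getShort(off) & 0xFFFF;
    }

    static int readU4(byte[] b, int off) {
        return ByteBuffer.wrap(b).getInt(off);
    }

    public static void main(String[] args) {
        byte[] header = {
            (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, // magic (u4)
            0x00, 0x00,                                         // minor_version (u2)
            0x00, 0x41                                          // major_version (u2); 65 = JDK 21
        };
        System.out.println(Integer.toHexString(readU4(header, 0))); // prints cafebabe
        System.out.println(readU2(header, 6));                      // prints 65
    }
}
```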

Contributor

@RogerRiggs RogerRiggs left a comment

Til the comments settle down; don't integrate or sponsor.

@RogerRiggs
Contributor

These classes are used where numbers are being formatted into existing byte arrays and may be assembled with other strings.
For example, StringConcatHelper and AbstractStringBuilder, both optimized for assembling LATIN1 and UTF16 strings.
Switching to a different implementation would result in more allocation and/or copying of the strings.
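The pattern being described, sketched minimally (hypothetical code, not StringConcatHelper's actual implementation): digits are written directly into an existing LATIN1 byte array, so no intermediate String or copy is needed:

```java
// Hypothetical sketch of in-place number formatting into an existing byte array,
// the usage pattern that StringConcatHelper/AbstractStringBuilder rely on.
public class InPlaceFormat {
    // Writes the decimal digits of a non-negative value ending at index 'end'
    // (exclusive) and returns the index of the first digit written.
    static int writeInt(byte[] buf, int end, int value) {
        int i = end;
        do {
            buf[--i] = (byte) ('0' + value % 10);
            value /= 10;
        } while (value != 0);
        return i;
    }

    public static void main(String[] args) {
        byte[] latin1 = new byte[8];
        int start = writeInt(latin1, latin1.length, 31843);
        System.out.println(new String(latin1, start, latin1.length - start)); // prints 31843
    }
}
```

A wrapper-based implementation would instead have to format into its own buffer and copy the result into place, which is the extra allocation/copying mentioned above.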

@bridgekeeper

bridgekeeper bot commented Aug 18, 2023

@Glavo This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@Glavo
Contributor Author

Glavo commented Aug 18, 2023

/open

@openjdk

openjdk bot commented Aug 18, 2023

@Glavo This pull request is already open

@minborg
Contributor

minborg commented Aug 22, 2023

In my opinion, reducing startup time can be achieved in other ways, both short-term (sharing VHs) and long-term (pre-generating VHs at build time using condensers). I think having a supported API carries some advantages over using Unsafe directly.

@bridgekeeper

bridgekeeper bot commented Sep 20, 2023

@Glavo This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@openjdk openjdk bot removed the sponsor Pull request is ready to be sponsored label Sep 21, 2023
@bridgekeeper

bridgekeeper bot commented Oct 20, 2023

@Glavo This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.
