Skip to content

Conversation

@cl4es
Copy link
Member

@cl4es cl4es commented Sep 6, 2023

This PR seeks to improve formatting of hex digits using java.util.HexFormat somewhat.

This is achieved getting rid of a couple of lookup tables, caching the result of HexFormat.of().withUpperCase(), and removing tiny allocation that happens in the formatHex(A, byte) method. Improvements range from 20-40% on throughput, and some operations allocate less:

Name                               Cnt   Base   Error   Test   Error   Unit   Diff%
HexFormatBench.appenderLower        15  1,330 ± 0,021  1,065 ± 0,067  us/op   19,9% (p = 0,000*)
  :gc.alloc.rate                    15 11,481 ± 0,185  0,007 ± 0,000 MB/sec  -99,9% (p = 0,000*)
  :gc.alloc.rate.norm               15 16,009 ± 0,000  0,007 ± 0,000   B/op -100,0% (p = 0,000*)
  :gc.count                         15  3,000          0,000         counts
  :gc.time                           3  2,000                            ms
HexFormatBench.appenderLowerCached  15  1,317 ± 0,013  1,065 ± 0,054  us/op   19,1% (p = 0,000*)
  :gc.alloc.rate                    15 11,590 ± 0,111  0,007 ± 0,000 MB/sec  -99,9% (p = 0,000*)
  :gc.alloc.rate.norm               15 16,009 ± 0,000  0,007 ± 0,000   B/op -100,0% (p = 0,000*)
  :gc.count                         15  3,000          0,000         counts
  :gc.time                           3  2,000                            ms
HexFormatBench.appenderUpper        15  1,330 ± 0,022  1,065 ± 0,036  us/op   19,9% (p = 0,000*)
  :gc.alloc.rate                    15 34,416 ± 0,559  0,007 ± 0,000 MB/sec -100,0% (p = 0,000*)
  :gc.alloc.rate.norm               15 48,009 ± 0,000  0,007 ± 0,000   B/op -100,0% (p = 0,000*)
  :gc.count                         15  0,000          0,000         counts
HexFormatBench.appenderUpperCached  15  1,353 ± 0,009  1,033 ± 0,014  us/op   23,6% (p = 0,000*)
  :gc.alloc.rate                    15 11,284 ± 0,074  0,007 ± 0,000 MB/sec  -99,9% (p = 0,000*)
  :gc.alloc.rate.norm               15 16,009 ± 0,000  0,007 ± 0,000   B/op -100,0% (p = 0,000*)
  :gc.count                         15  3,000          0,000         counts
  :gc.time                           3  2,000                            ms
HexFormatBench.toHexLower           15  0,198 ± 0,001  0,119 ± 0,008  us/op   40,1% (p = 0,000*)
  :gc.alloc.rate                    15  0,007 ± 0,000  0,007 ± 0,000 MB/sec   -0,0% (p = 0,816 )
  :gc.alloc.rate.norm               15  0,001 ± 0,000  0,001 ± 0,000   B/op  -40,1% (p = 0,000*)
  :gc.count                         15  0,000          0,000         counts
HexFormatBench.toHexLowerCached     15  0,201 ± 0,002  0,114 ± 0,001  us/op   43,0% (p = 0,000*)
  :gc.alloc.rate                    15  0,007 ± 0,000  0,007 ± 0,000 MB/sec   -0,2% (p = 0,116 )
  :gc.alloc.rate.norm               15  0,001 ± 0,000  0,001 ± 0,000   B/op  -43,1% (p = 0,000*)
  :gc.count                         15  0,000          0,000         counts
HexFormatBench.toHexUpper           15  0,146 ± 0,002  0,114 ± 0,001  us/op   21,6% (p = 0,000*)
  :gc.alloc.rate                    15  0,007 ± 0,000  0,007 ± 0,000 MB/sec    0,0% (p = 0,668 )
  :gc.alloc.rate.norm               15  0,001 ± 0,000  0,001 ± 0,000   B/op  -21,5% (p = 0,000*)
  :gc.count                         15  0,000          0,000         counts
HexFormatBench.toHexUpperCached     15  0,199 ± 0,002  0,116 ± 0,003  us/op   41,7% (p = 0,000*)
  :gc.alloc.rate                    15  0,007 ± 0,000  0,007 ± 0,000 MB/sec    0,0% (p = 0,684 )
  :gc.alloc.rate.norm               15  0,001 ± 0,000  0,001 ± 0,000   B/op  -41,7% (p = 0,000*)
  :gc.count                         15  0,000          0,000         counts
  * = significant

Invariant parameters used by above microbenchmarks:
size:   512

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8315789: Minor HexFormat performance improvements (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/15591/head:pull/15591
$ git checkout pull/15591

Update a local copy of the PR:
$ git checkout pull/15591
$ git pull https://git.openjdk.org/jdk.git pull/15591/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 15591

View PR using the GUI difftool:
$ git pr show -t 15591

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/15591.diff

Webrev

Link to Webrev Comment

@cl4es cl4es mentioned this pull request Sep 6, 2023
3 tasks
@bridgekeeper
Copy link

bridgekeeper bot commented Sep 6, 2023

👋 Welcome back redestad! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 6, 2023
@openjdk
Copy link

openjdk bot commented Sep 6, 2023

@cl4es The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@mlbridge
Copy link

mlbridge bot commented Sep 6, 2023

Webrevs

Comment on lines +638 to +644
if (value < 10) {
return (char)('0' + value);
}
if (digitCase == Case.LOWERCASE) {
return (char)('a' - 10 + value);
}
return (char)('A' - 10 + value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this would adversely impact performance, but what about factoring out these lines in a private method? They are repeated in toHighHexDigit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surprisingly this does carry some cost in the microbenchmarks on my M1. Might be noise, but it seems rather consistent so I'll need to investigate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This:

diff --git a/src/java.base/share/classes/java/util/HexFormat.java b/src/java.base/share/classes/java/util/HexFormat.java
index 107c362cbc2..177548c03f7 100644
--- a/src/java.base/share/classes/java/util/HexFormat.java
+++ b/src/java.base/share/classes/java/util/HexFormat.java
@@ -634,14 +634,7 @@ private static String escapeNL(String string) {
      * @return the hexadecimal character for the low 4 bits {@code 0-3} of the value
      */
     public char toLowHexDigit(int value) {
-        value = value & 0xf;
-        if (value < 10) {
-            return (char)('0' + value);
-        }
-        if (digitCase == Case.LOWERCASE) {
-            return (char)('a' - 10 + value);
-        }
-        return (char)('A' - 10 + value);
+        return toHexDigit(value & 0xf);
     }

     /**
@@ -655,7 +648,10 @@ public char toLowHexDigit(int value) {
      * @return the hexadecimal character for the bits {@code 4-7} of the value
      */
     public char toHighHexDigit(int value) {
-        value = (value >> 4) & 0xf;
+        return toHexDigit((value >> 4) & 0xf);
+    }
+
+    private char toHexDigit(int value) {
         if (value < 10) {
             return (char)('0' + value);
         }

.. clearly increase cost across all micros:


Name                               Cnt  Base   Error   Test   Error  Unit   Diff%
HexFormatBench.appenderLower        15 1,046 ± 0,041  1,301 ± 0,017 us/op  -24,4% (p = 0,000*)
HexFormatBench.appenderLowerCached  15 1,056 ± 0,055  1,175 ± 0,115 us/op  -11,3% (p = 0,001*)
HexFormatBench.appenderUpper        15 1,059 ± 0,055  1,303 ± 0,012 us/op  -23,0% (p = 0,000*)
HexFormatBench.appenderUpperCached  15 1,099 ± 0,014  1,451 ± 0,267 us/op  -32,0% (p = 0,000*)
HexFormatBench.toHexLower           15 0,322 ± 0,002  0,338 ± 0,005 us/op   -4,8% (p = 0,000*)
HexFormatBench.toHexLowerCached     15 0,324 ± 0,003  0,411 ± 0,005 us/op  -27,0% (p = 0,000*)
HexFormatBench.toHexUpper           15 0,324 ± 0,003  0,340 ± 0,003 us/op   -4,9% (p = 0,000*)
HexFormatBench.toHexUpperCached     15 0,322 ± 0,001  0,411 ± 0,004 us/op  -27,6% (p = 0,000*)

Comment on lines +178 to +183
private final Case digitCase;

private enum Case {
LOWERCASE,
UPPERCASE
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of an enum cache the "A" or "a" for direct use.
private final char caseBase;
Initialize it in the constructor.

Comment on lines +641 to +644
if (digitCase == Case.LOWERCASE) {
return (char)('a' - 10 + value);
}
return (char)('A' - 10 + value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would caching the upper/lower case base avoid a branch?

Suggested change
if (digitCase == Case.LOWERCASE) {
return (char)('a' - 10 + value);
}
return (char)('A' - 10 + value);
return (char)(caseBase - 10 + value);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this but it looks like it is marginally slower - plausibly the code I have means the JIT eliminates the untaken branch and constant folds this neatly. I'll do some digging..

@wenshao
Copy link
Contributor

wenshao commented Sep 7, 2023

You can refer to the PR #14745 I submitted. Using a table with a length of 256 in HexDigits, you can get two digits in one lookup table, and use ByteArrayLittleEndian.setShort to write two digits at a time.

like this:

   public String toHexDigits(byte value) {
        byte[] rep = new byte[2];
        ByteArrayLittleEndian.setShort(rep, 0, HexDigits.DIGITS[value & 0xff]); // DIGITS shuld changed to little-endian
        try {
            return jla.newStringNoRepl(rep, StandardCharsets.ISO_8859_1);
        } catch (CharacterCodingException cce) {
            throw new AssertionError(cce);
        }
    }

@cl4es
Copy link
Member Author

cl4es commented Sep 7, 2023

I took ByteArrayLittleEndian for a spin on a (new) micro for String toHexDigits(byte), and that gives us ~4% (MacBook M1):

Name                       Cnt      Base    Error       Test    Error   Unit   Diff%
HexFormatBench.toHexDigits  15     1,943 ±  0,014      1,862 ±  0,009  us/op    4,1% (p = 0,000*)
  * = significant

I'm not sure 4% is a large enough win on a micro to motivate the lookup-table based approach. I'd rather see us investigate if we could consolidate these overlapping utility classes (HexFormat and HexDigits) on a lookup-table free solution.

@wenshao
Copy link
Contributor

wenshao commented Sep 8, 2023

Should we test the performance comparison of toHexDigits(long)?

I've done some work before, and I didn't find a way to perform better than a lookup table without using a lookup table.

@rgiulietti
Copy link
Contributor

I'm not sure that micro-benchmarks are very indicative on whether a lookup table performs better than short and straightforward code.
The reason is that, once in the CPU caches, a lookup table in micro-benchmarks stays there, whereas in more realistic situations, where access is more spread out in time, it is often evicted to make room for other, more lively data.
A micro-benchmark using a lookup table shows good results because of a high rate of cache hits, whereas in other real-world workloads a lookup table might result in many cache misses on access.

@cl4es
Copy link
Member Author

cl4es commented Sep 8, 2023

As @rgiulietti says lookup-table algorithms may outperform in microbenchmarks but lose out in real world scenarios, so we need to stay clear unless there's major benefit.

And as it turns out, the relative benefit seem to come mainly from the use of ByteArrayLittleEndian, not from using the lookup table in HexDigits.DIGITS:

     public String toHexDigits(byte value) {
         byte[] rep = new byte[2];
-        rep[0] = (byte)toHighHexDigit(value);
-        rep[1] = (byte)toLowHexDigit(value);
+        ByteArrayLittleEndian.setChar(rep, 0, (char)(toHighHexDigit(value) << 8 | toLowHexDigit(value)));
         try {
             return jla.newStringNoRepl(rep, StandardCharsets.ISO_8859_1);
         } catch (CharacterCodingException cce) {

This gets the same speed-up (4%) as calling HexDigits.DIGITS:

Name                       Cnt      Base    Error       Test    Error   Unit   Diff%
HexFormatBench.toHexDigits  15     1,943 ±  0,014      1,860 ±  0,014  us/op    4,2% (p = 0,000*)
  * = significant

@cl4es
Copy link
Member Author

cl4es commented Sep 8, 2023

I also tried variants of this for toHexDigits(short|int|long) but it always ends up worse. Both when chunking 2 digits at a time:

        ByteArrayLittleEndian.setChar(rep, 2, (char)(toHighHexDigit(value) << 8 | toLowHexDigit(value)));
        value >>= 8;
        ByteArrayLittleEndian.setChar(rep, 0, (char)(toHighHexDigit(value) << 8 | toLowHexDigit(value)));

and when doing it all in one go (this code is toHexDigits(short), which writes four nibbles):

        ByteArrayLittleEndian.setInt(rep, 0, toHighHexDigit(value >> 8) << 24 | toLowHexDigit(value >> 8) << 16 | toHighHexDigit(value) << 8 | toLowHexDigit(value)));

Only a win on toHexDigits(byte):

Name                            Cnt  Base   Error   Test   Error  Unit   Diff%
HexFormatBench.toHexDigitsByte   15 1,992 ± 0,008  1,908 ± 0,053 us/op    4,2% (p = 0,000*)
HexFormatBench.toHexDigitsInt    15 2,476 ± 0,018  2,896 ± 0,023 us/op  -16,9% (p = 0,000*)
HexFormatBench.toHexDigitsLong   15 3,992 ± 0,052  4,229 ± 0,036 us/op   -5,9% (p = 0,000*)
HexFormatBench.toHexDigitsShort  15 2,183 ± 0,018  2,800 ± 0,162 us/op  -28,3% (p = 0,000*)
  * = significant

This is indicative that any win here comes from tickling the JIT the right way, rather than some intrinsic property of ByteArrayLittleEndian. I'll leave the code unchanged but add these microbenchmarks if anyone wants to attempt other improvements.

@wenshao
Copy link
Contributor

wenshao commented Sep 8, 2023

I'm not sure that micro-benchmarks are very indicative on whether a lookup table performs better than short and straightforward code.
The reason is that, once in the CPU caches, a lookup table in micro-benchmarks stays there, whereas in more realistic situations, where access is more spread out in time, it is often evicted to make room for other, more lively data.
A micro-benchmark using a lookup table shows good results because of a high rate of cache hits, whereas in other real-world workloads a lookup table might result in many cache misses on access.

In the HexFormat scenario, if the length of the input byte[] is larger, the performance of using the lookup table will be better.

If the length of byte[] is greater than 1 in most scenarios, using lookup table will have better performance.

@cl4es
Copy link
Member Author

cl4es commented Sep 13, 2023

I ran some experiments with a lookup-table approach (based on HexDigits and I can get some of these to be marginally faster when combining a lookup-table approach with the ByteArray hack, but there's no win when using one or the other in isolation. So I think much of the win is actually not from using a lookup-table, but from tickling the JIT to inline more and optimize a bit more aggressively. So I think this might be a case of micros telling us sweet little lies, and we should favor the intuition that lookup tables should be avoided unless absolutely necessary.

I prefer the simplicity of this PR as it stands and think we should backtrack on some of the lookup tables we've recently added in jdk.internal.util.Hex|Decimal|OctalDigits.

Copy link
Contributor

@RogerRiggs RogerRiggs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This non-lookup version looks fine; it won't invalidate caches when used and the code is easy to understand. Thanks for the cleanup and re-checking the performance benefits.

@openjdk
Copy link

openjdk bot commented Sep 13, 2023

@cl4es This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8315789: Minor HexFormat performance improvements

Reviewed-by: rriggs

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 219 new commits pushed to the master branch:

  • ce93d27: 6228794: java.text.ChoiceFormat pattern behavior is not well documented.
  • 3b0a6d2: 8314226: Series of colon-style fallthrough switch cases with guards compiled incorrectly
  • ff240a9: 8316087: Test SignedLoggerFinderTest.java is still failing
  • a731a24: 8315934: RISC-V: Disable conservative fences per vendor
  • b3dad24: 8316021: Serial: Remove unused Generation::post_compact
  • f9ab115: 8316050: Use hexadecimal encoding in MemorySegment::toString
  • f804f86: 8314612: TestUnorderedReduction.java fails with -XX:MaxVectorSize=32 and -XX:+AlignVector
  • f8df754: 8311207: Cleanup for Optimization for UUID.toString
  • fecd2fd: 8315898: Open source swing JMenu tests
  • bb6b3f2: 8315761: Open source few swing JList and JMenuBar tests
  • ... and 209 more: https://git.openjdk.org/jdk/compare/12de9b0225363377e9a76729b11698221d4f29f2...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Sep 13, 2023
@cl4es
Copy link
Member Author

cl4es commented Sep 13, 2023

/integrate

@openjdk
Copy link

openjdk bot commented Sep 13, 2023

Going to push as commit 92ad4a2.
Since your change was applied there have been 219 commits pushed to the master branch:

  • ce93d27: 6228794: java.text.ChoiceFormat pattern behavior is not well documented.
  • 3b0a6d2: 8314226: Series of colon-style fallthrough switch cases with guards compiled incorrectly
  • ff240a9: 8316087: Test SignedLoggerFinderTest.java is still failing
  • a731a24: 8315934: RISC-V: Disable conservative fences per vendor
  • b3dad24: 8316021: Serial: Remove unused Generation::post_compact
  • f9ab115: 8316050: Use hexadecimal encoding in MemorySegment::toString
  • f804f86: 8314612: TestUnorderedReduction.java fails with -XX:MaxVectorSize=32 and -XX:+AlignVector
  • f8df754: 8311207: Cleanup for Optimization for UUID.toString
  • fecd2fd: 8315898: Open source swing JMenu tests
  • bb6b3f2: 8315761: Open source few swing JList and JMenuBar tests
  • ... and 209 more: https://git.openjdk.org/jdk/compare/12de9b0225363377e9a76729b11698221d4f29f2...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Sep 13, 2023
@openjdk openjdk bot closed this Sep 13, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Sep 13, 2023
@openjdk
Copy link

openjdk bot commented Sep 13, 2023

@cl4es Pushed as commit 92ad4a2.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@cl4es cl4es deleted the hex branch September 15, 2023 20:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core-libs [email protected] integrated Pull request has been integrated

Development

Successfully merging this pull request may close these issues.

4 participants