8282664: Unroll by hand StringUTF16, StringLatin1, and Arrays polynomial hash loops #7700
👋 Welcome back luhenry! A progress list of the required criteria for merging this PR into |
…loops Despite the hash value being cached for Strings, computing the hash still represents a significant CPU usage for applications handling lots of text. Even though it would be generally better to do it through an enhancement to the autovectorizer, the complexity of doing it by hand is trivial and the gain is sizable (2x speedup) even without the Vector API. The algorithm has been proposed by Richard Startin and Paul Sandoz [1]. At Datadog, we handle a great amount of text (through logs management for example), and hashing String represents a large part of our CPU usage. It's very unlikely that we are the only one as String.hashCode is such a core feature of the JVM-based languages with its use in HashMap for example. Having even only a 2x speedup would allow us to save thousands of CPU cores per month and improve correspondingly the energy/carbon impact. [1] https://static.rainfocus.com/oracle/oow18/sess/1525822677955001tLqU/PF/codeone18-vector-API-DEV5081_1540354883936001Q3Sv.pdf
This is straightforward enough. But I wonder if this should be a HotSpot intrinsic and be able to take advantage of vector machine instructions.
I'd also expect there is an existing test that checks the value of String.hashCode.
My only comment is that I'd like to see some benchmark verifying also the UTF-16 code path. It should see a similar speed-up, but adding plenty of calls might mess up compilation and inlining. |
@RogerRiggs I would be happy to do such work. As it would be a bigger change, are you suggesting that it could be done or that it should be done as an intrinsic? @cl4es adding that right now. |
@cl4es I've added the UTF-16 benchmarks. I'm running them on my machine and should have the results in ~5 hours. I'll update the PR description once I have these numbers.
Could you print out the generated code? My guess is that the updated version is still scalar and that the improvement comes from breaking up the dependency chain. I suggest hoisting the accumulation out of the main loop to achieve maximal scalar throughput. A small experiment on my machine shows an improvement over your approach.
Thanks a lot. |
@cl4es I've updated the description with the results. StringUTF16 isn't adversely impacted by additional method calls. @merykitty The way to use an accumulator is to do it like https://richardstartin.github.io/posts/vectorised-polynomial-hash-codes. I'll implement the |
I think we should explore an intrinsic that also covers
then used as:
Allowing for the intrinsic to return a "remainder" may simplify the implementations (avoid having to do the tail). |
Great to see this taken up. As it’s implemented here, it’s still scalar, but the unroll prevents a strength reduction of the multiplication in the loop from result = 31 * result + element; to: result = (result << 5) - result + element which creates a data dependency and slows the loop down. This was first reported by Peter Levart here: http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-September/028898.html |
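To make the strength reduction mentioned above concrete, here is a minimal sketch (not code from this PR; the class and method names are hypothetical). It shows that `31 * h` and `(h << 5) - h` produce the same `int` result, including on overflow, so the two loop bodies compute the same hash. Each iteration still depends on the previous value of `h`, which is the serial dependency discussed in this thread.

```java
// Illustrative only: class and method names are assumptions, not JDK code.
final class StrengthReductionSketch {

    // Reference form: h = 31 * h + element, as in the scalar hash loop.
    static int hashMul(byte[] a) {
        int h = 0;
        for (byte b : a) {
            h = 31 * h + (b & 0xff);
        }
        return h;
    }

    // Strength-reduced form: 31 * h rewritten as (h << 5) - h.
    // In wrapping 32-bit arithmetic, (h << 5) - h == 32 * h - h == 31 * h.
    static int hashShift(byte[] a) {
        int h = 0;
        for (byte b : a) {
            h = (h << 5) - h + (b & 0xff);
        }
        return h;
    }
}
```

Both methods should agree with `String.hashCode()` for a Latin-1 string whose bytes are passed in; which form is faster depends on what the JIT does with the surrounding loop.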
An awkward effect of this implementation is that it perturbs results on small Strings a bit: it adds a branch in the trivial case and also regresses on certain lengths (try size=7). The added complexity seems to be no issue for JITs in these microbenchmarks, but I worry that the increased code size might play tricks with inlining heuristics in real applications. After chatting a bit with @richardstartin about the observation that preventing a strength reduction on the constant 31 is part of the improvement, I devised an experiment which simply makes the 31 non-constant so as to disable the strength reduction:
    private static int base = 31;
    @Benchmark
    public int scalarLatin1_NoStrengthReduction() {
        int h = 0;
        int i = 0, len = latin1.length;
        for (; i < len; i++) {
            h = base * h + (latin1[i] & 0xff);
        }
        return h;
    }
Interestingly, results of that get planted in the middle of the baseline on large inputs, while avoiding most of the irregularities on small inputs compared to manually unrolled versions:
I wonder if this might be a safer play while we investigate intrinsification and other possible enhancements? |
Can we change the optimizer so that the strength reduction happens only after all transformations have settled? Carelessly changing a multiplication to a shift as today may hurt a lot of potential optimisations. |
@cl4es Yes, we would need to carefully measure the impact for small array sizes (similar to what we had to do when the array mismatch intrinsic was implemented and applied to array equals). My sense is to focus on the intrinsic and also look for potential opportunities like @merykitty points out, as that is where the larger impact is, although it is more work! |
Yes, it's troubling that making a constant non-foldable can lead the JIT down a path that ultimately pessimizes the end result (as observed here). If we could train the JIT to avoid this pitfall and get to the improvement observed in my experiment here without any changes to |
Right, I'm not too thrilled about the prospect of moving ahead with the de-constantification as an alternative patch here. It's such a crutch, but it's also simple and has no obvious downsides as of right now. I think it was a useful experiment to see where some of the gain observed in the unroll might be coming from. The degradation on many smaller Feels like none of the alternatives seen here so far is really it. |
@richardstartin - does that strength reduction actually happen? The bit-shift transformation valid only if the original |
Yes.
    @State(Scope.Benchmark)
    public class StringHashCode {
        @Param({"sdjhfklashdfklashdflkashdflkasdhf", "締国件街徹条覧野武鮮覧横営績難比兵州催色"})
        String string;
        @CompilerControl(CompilerControl.Mode.DONT_INLINE)
        @Benchmark
        public int stringHashCode() {
            return new String(string).hashCode();
        }
    }
....[Hottest Region 1]..............................................................................
c2, level 4, StringHashCode::stringHashCode, version 507 (384 bytes)
0x00007f2df0142da4: shl $0x3,%r10
0x00007f2df0142da8: movabs $0x800000000,%r12
0x00007f2df0142db2: add %r12,%r10
0x00007f2df0142db5: xor %r12,%r12
0x00007f2df0142db8: cmp %r10,%rax
0x00007f2df0142dbb: jne 0x00007f2de8696080 ; {runtime_call ic_miss_stub}
0x00007f2df0142dc1: data16 xchg %ax,%ax
0x00007f2df0142dc4: nopl 0x0(%rax,%rax,1)
0x00007f2df0142dcc: data16 data16 xchg %ax,%ax
[Verified Entry Point]
0.12% 0x00007f2df0142dd0: mov %eax,-0x14000(%rsp)
0.84% 0x00007f2df0142dd7: push %rbp
0.22% 0x00007f2df0142dd8: sub $0x30,%rsp ;*synchronization entry
; - StringHashCode::stringHashCode@-1 (line 14)
0x00007f2df0142ddc: mov 0xc(%rsi),%r8d ;*getfield string {reexecute=0 rethrow=0 return_oop=0}
; - StringHashCode::stringHashCode@5 (line 14)
0.73% 0x00007f2df0142de0: mov 0x10(%r12,%r8,8),%eax ; implicit exception: dispatches to 0x00007f2df0142fc4
0.10% 0x00007f2df0142de5: test %eax,%eax
╭ 0x00007f2df0142de7: je 0x00007f2df0142df9 ;*synchronization entry
│ ; - StringHashCode::stringHashCode@-1 (line 14)
0.16% │ 0x00007f2df0142de9: add $0x30,%rsp
│ 0x00007f2df0142ded: pop %rbp
│ 0x00007f2df0142dee: mov 0x108(%r15),%r10
0.88% │ 0x00007f2df0142df5: test %eax,(%r10) ; {poll_return}
0.18% │ 0x00007f2df0142df8: retq
↘ 0x00007f2df0142df9: mov 0xc(%r12,%r8,8),%ecx ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.String::<init>@6 (line 236)
; - StringHashCode::stringHashCode@8 (line 14)
0x00007f2df0142dfe: mov 0xc(%r12,%rcx,8),%r10d ;*arraylength {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.String::hashCode@13 (line 1503)
; - StringHashCode::stringHashCode@11 (line 14)
; implicit exception: dispatches to 0x00007f2df0142fd0
0.83% 0x00007f2df0142e03: test %r10d,%r10d
0x00007f2df0142e06: jbe 0x00007f2df0142f86 ;*ifle {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.String::hashCode@14 (line 1503)
; - StringHashCode::stringHashCode@11 (line 14)
0.14% 0x00007f2df0142e0c: movsbl 0x14(%r12,%r8,8),%r8d ;*getfield coder {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.String::<init>@14 (line 237)
; - StringHashCode::stringHashCode@8 (line 14)
0.02% 0x00007f2df0142e12: test %r8d,%r8d
0x00007f2df0142e15: jne 0x00007f2df0142fac ;*ifne {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.String::isLatin1@10 (line 3266)
; - java.lang.String::hashCode@19 (line 1504)
; - StringHashCode::stringHashCode@11 (line 14)
0x00007f2df0142e1b: mov %r10d,%edi
1.14% 0x00007f2df0142e1e: dec %edi
0.10% 0x00007f2df0142e20: cmp %r10d,%edi
0x00007f2df0142e23: jae 0x00007f2df0142f8d
0x00007f2df0142e29: movzbl 0x10(%r12,%rcx,8),%r9d ;*iand {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.StringLatin1::hashCode@31 (line 196)
; - java.lang.String::hashCode@29 (line 1504)
; - StringHashCode::stringHashCode@11 (line 14)
0x00007f2df0142e2f: lea (%r12,%rcx,8),%rbx ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.String::<init>@6 (line 236)
; - StringHashCode::stringHashCode@8 (line 14)
0.77% 0x00007f2df0142e33: mov %r10d,%edx
0.22% 0x00007f2df0142e36: add $0xfffffff9,%edx
0x00007f2df0142e39: mov $0x80000000,%r11d
0x00007f2df0142e3f: cmp %edx,%edi
0.84% 0x00007f2df0142e41: cmovl %r11d,%edx
0.10% 0x00007f2df0142e45: mov $0x1,%ebp
0x00007f2df0142e4a: cmp $0x1,%edx
╭ 0x00007f2df0142e4d: jle 0x00007f2df0142f55
│ 0x00007f2df0142e53: mov %r9d,%r11d
1.08% │ 0x00007f2df0142e56: shl $0x5,%r11d
0.08% │ 0x00007f2df0142e5a: sub %r9d,%r11d ;*putfield value {reexecute=0 rethrow=0 return_oop=0}
│ ; - java.lang.String::<init>@9 (line 236)
│ ; - StringHashCode::stringHashCode@8 (line 14)
0.02% │╭ 0x00007f2df0142e5d: jmp 0x00007f2df0142e6d
││ ↗ 0x00007f2df0142e5f: vmovd %xmm0,%ecx
││ │ 0x00007f2df0142e63: vmovd %xmm2,%r10d
││ │ 0x00007f2df0142e68: vmovd %xmm1,%r8d
│↘ │ 0x00007f2df0142e6d: mov %edx,%esi
0.92% │ │ 0x00007f2df0142e6f: sub %ebp,%esi
0.16% │ │ 0x00007f2df0142e71: mov $0x1f40,%r9d
0.02% │ │ 0x00007f2df0142e77: cmp %r9d,%esi
│ │ 0x00007f2df0142e7a: mov $0x1f40,%edi
0.94% │ │ 0x00007f2df0142e7f: cmovg %edi,%esi
0.12% │ │ 0x00007f2df0142e82: add %ebp,%esi
│ │ 0x00007f2df0142e84: vmovd %ecx,%xmm0
│ │ 0x00007f2df0142e88: vmovd %r10d,%xmm2
0.83% │ │ 0x00007f2df0142e8d: vmovd %r8d,%xmm1
0.10% │ │ 0x00007f2df0142e92: data16 nopw 0x0(%rax,%rax,1)
│ │ 0x00007f2df0142e9c: data16 data16 xchg %ax,%ax ;*imul {reexecute=0 rethrow=0 return_oop=0}
│ │ ; - java.lang.StringLatin1::hashCode@25 (line 196)
│ │ ; - java.lang.String::hashCode@29 (line 1504)
│ │ ; - StringHashCode::stringHashCode@11 (line 14)
0.16% │ ↗│ 0x00007f2df0142ea0: movslq %ebp,%r13 ;*baload {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - java.lang.StringLatin1::hashCode@19 (line 195)
│ ││ ; - java.lang.String::hashCode@29 (line 1504)
│ ││ ; - StringHashCode::stringHashCode@11 (line 14)
1.08% │ ││ 0x00007f2df0142ea3: movzbl 0x10(%rbx,%r13,1),%r9d
2.08% │ ││ 0x00007f2df0142ea9: movzbl 0x17(%rbx,%r13,1),%r10d
1.39% │ ││ 0x00007f2df0142eaf: movzbl 0x11(%rbx,%r13,1),%ecx
0.20% │ ││ 0x00007f2df0142eb5: add %r9d,%r11d
1.04% │ ││ 0x00007f2df0142eb8: movzbl 0x15(%rbx,%r13,1),%r8d
1.59% │ ││ 0x00007f2df0142ebe: mov %r11d,%edi
1.26% │ ││ 0x00007f2df0142ec1: shl $0x5,%edi
0.12% │ ││ 0x00007f2df0142ec4: sub %r11d,%edi
1.81% │ ││ 0x00007f2df0142ec7: add %ecx,%edi
2.77% │ ││ 0x00007f2df0142ec9: movzbl 0x14(%rbx,%r13,1),%r11d
0.84% │ ││ 0x00007f2df0142ecf: mov %edi,%ecx
0.16% │ ││ 0x00007f2df0142ed1: shl $0x5,%ecx
1.77% │ ││ 0x00007f2df0142ed4: sub %edi,%ecx
2.28% │ ││ 0x00007f2df0142ed6: movzbl 0x13(%rbx,%r13,1),%r9d
0.67% │ ││ 0x00007f2df0142edc: movzbl 0x12(%rbx,%r13,1),%edi
0.02% │ ││ 0x00007f2df0142ee2: add %edi,%ecx
2.51% │ ││ 0x00007f2df0142ee4: movzbl 0x16(%rbx,%r13,1),%edi
1.00% │ ││ 0x00007f2df0142eea: mov %ecx,%r14d
0.79% │ ││ 0x00007f2df0142eed: shl $0x5,%r14d
1.61% │ ││ 0x00007f2df0142ef1: sub %ecx,%r14d
6.01% │ ││ 0x00007f2df0142ef4: add %r9d,%r14d
1.73% │ ││ 0x00007f2df0142ef7: mov %r14d,%r9d
0.29% │ ││ 0x00007f2df0142efa: shl $0x5,%r9d
0.24% │ ││ 0x00007f2df0142efe: sub %r14d,%r9d
6.09% │ ││ 0x00007f2df0142f01: add %r11d,%r9d
2.28% │ ││ 0x00007f2df0142f04: mov %r9d,%r11d
0.29% │ ││ 0x00007f2df0142f07: shl $0x5,%r11d
0.28% │ ││ 0x00007f2df0142f0b: sub %r9d,%r11d
5.30% │ ││ 0x00007f2df0142f0e: add %r8d,%r11d
2.50% │ ││ 0x00007f2df0142f11: mov %r11d,%ecx
0.24% │ ││ 0x00007f2df0142f14: shl $0x5,%ecx
0.37% │ ││ 0x00007f2df0142f17: sub %r11d,%ecx
6.50% │ ││ 0x00007f2df0142f1a: add %edi,%ecx
2.71% │ ││ 0x00007f2df0142f1c: mov %ecx,%r9d
0.26% │ ││ 0x00007f2df0142f1f: shl $0x5,%r9d
0.18% │ ││ 0x00007f2df0142f23: sub %ecx,%r9d
5.93% │ ││ 0x00007f2df0142f26: add %r10d,%r9d ;*iadd {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - java.lang.StringLatin1::hashCode@32 (line 196)
│ ││ ; - java.lang.String::hashCode@29 (line 1504)
│ ││ ; - StringHashCode::stringHashCode@11 (line 14)
2.85% │ ││ 0x00007f2df0142f29: mov %r9d,%r11d
0.10% │ ││ 0x00007f2df0142f2c: shl $0x5,%r11d
0.20% │ ││ 0x00007f2df0142f30: sub %r9d,%r11d ;*imul {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - java.lang.StringLatin1::hashCode@25 (line 196)
│ ││ ; - java.lang.String::hashCode@29 (line 1504)
│ ││ ; - StringHashCode::stringHashCode@11 (line 14)
2.57% │ ││ 0x00007f2df0142f33: add $0x8,%ebp ;*iinc {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - java.lang.StringLatin1::hashCode@34 (line 195)
│ ││ ; - java.lang.String::hashCode@29 (line 1504)
│ ││ ; - StringHashCode::stringHashCode@11 (line 14)
1.36% │ ││ 0x00007f2df0142f36: cmp %esi,%ebp
│ ╰│ 0x00007f2df0142f38: jl 0x00007f2df0142ea0 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
│ │ ; - java.lang.StringLatin1::hashCode@13 (line 195)
│ │ ; - java.lang.String::hashCode@29 (line 1504)
│ │ ; - StringHashCode::stringHashCode@11 (line 14)
│ │ 0x00007f2df0142f3e: mov 0x108(%r15),%r10 ; ImmutableOopMap{rbx=Oop xmm0=NarrowOop }
│ │ ;*goto {reexecute=1 rethrow=0 return_oop=0}
│ │ ; - java.lang.StringLatin1::hashCode@37 (line 195)
│ │ ; - java.lang.String::hashCode@29 (line 1504)
│ │ ; - StringHashCode::stringHashCode@11 (line 14)
│ │ 0x00007f2df0142f45: test %eax,(%r10) ;*goto {reexecute=0 rethrow=0 return_oop=0}
│ │ ; - java.lang.StringLatin1::hashCode@37 (line 195)
│ │ ; - java.lang.String::hashCode@29 (line 1504)
│ │ ; - StringHashCode::stringHashCode@11 (line 14)
│ │ ; {poll}
1.00% │ │ 0x00007f2df0142f48: cmp %edx,%ebp
│ ╰ 0x00007f2df0142f4a: jl 0x00007f2df0142e5f
0.16% │ 0x00007f2df0142f50: vmovd %xmm2,%r10d
↘ 0x00007f2df0142f55: cmp %r10d,%ebp
0x00007f2df0142f58: jge 0x00007f2df0142f7e
0x00007f2df0142f5a: xchg %ax,%ax ;*aload_2 {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.StringLatin1::hashCode@16 (line 195)
; - java.lang.String::hashCode@29 (line 1504)
; - StringHashCode::stringHashCode@11 (line 14)
0x00007f2df0142f5c: movzbl 0x10(%rbx,%rbp,1),%r8d
0x00007f2df0142f62: mov %r9d,%eax
0x00007f2df0142f65: shl $0x5,%eax
0x00007f2df0142f68: sub %r9d,%eax
c2, level 4, StringHashCode::stringHashCode, version 505 (435 bytes)
0x00007fd05f2c0ba4: shl $0x3,%r10
0x00007fd05f2c0ba8: movabs $0x800000000,%r12
0x00007fd05f2c0bb2: add %r12,%r10
0x00007fd05f2c0bb5: xor %r12,%r12
0x00007fd05f2c0bb8: cmp %r10,%rax
0x00007fd05f2c0bbb: jne 0x00007fd057814080 ; {runtime_call ic_miss_stub}
0x00007fd05f2c0bc1: data16 xchg %ax,%ax
0x00007fd05f2c0bc4: nopl 0x0(%rax,%rax,1)
0x00007fd05f2c0bcc: data16 data16 xchg %ax,%ax
[Verified Entry Point]
1.14% 0x00007fd05f2c0bd0: mov %eax,-0x14000(%rsp)
0.50% 0x00007fd05f2c0bd7: push %rbp
0.22% 0x00007fd05f2c0bd8: sub $0x30,%rsp ;*synchronization entry
; - StringHashCode::stringHashCode@-1 (line 14)
1.58% 0x00007fd05f2c0bdc: mov 0xc(%rsi),%r11d ;*getfield string {reexecute=0 rethrow=0 return_oop=0}
; - StringHashCode::stringHashCode@5 (line 14)
0x00007fd05f2c0be0: mov 0x10(%r12,%r11,8),%ecx ;*synchronization entry
; - StringHashCode::stringHashCode@-1 (line 14)
; implicit exception: dispatches to 0x00007fd05f2c0efc
0.34% 0x00007fd05f2c0be5: test %ecx,%ecx
╭ 0x00007fd05f2c0be7: jne 0x00007fd05f2c0d84 ;*ifne {reexecute=0 rethrow=0 return_oop=0}
│ ; - java.lang.String::hashCode@6 (line 1503)
│ ; - StringHashCode::stringHashCode@11 (line 14)
1.04% │ 0x00007fd05f2c0bed: mov 0xc(%r12,%r11,8),%edx ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
│ ; - java.lang.String::<init>@6 (line 236)
│ ; - StringHashCode::stringHashCode@8 (line 14)
0.50% │ 0x00007fd05f2c0bf2: mov 0xc(%r12,%rdx,8),%r14d ;*arraylength {reexecute=0 rethrow=0 return_oop=0}
│ ; - java.lang.String::hashCode@13 (line 1503)
│ ; - StringHashCode::stringHashCode@11 (line 14)
│ ; implicit exception: dispatches to 0x00007fd05f2c0f08
│ 0x00007fd05f2c0bf7: xor %eax,%eax
0.36% │ 0x00007fd05f2c0bf9: test %r14d,%r14d
│╭ 0x00007fd05f2c0bfc: jbe 0x00007fd05f2c0d74 ;*ifle {reexecute=0 rethrow=0 return_oop=0}
││ ; - java.lang.String::hashCode@14 (line 1503)
││ ; - StringHashCode::stringHashCode@11 (line 14)
1.08% ││ 0x00007fd05f2c0c02: movsbl 0x14(%r12,%r11,8),%ebp ;*getfield coder {reexecute=0 rethrow=0 return_oop=0}
││ ; - java.lang.String::<init>@14 (line 237)
││ ; - StringHashCode::stringHashCode@8 (line 14)
0.50% ││ 0x00007fd05f2c0c08: lea (%r12,%rdx,8),%rdi ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
││ ; - java.lang.String::<init>@6 (line 236)
││ ; - StringHashCode::stringHashCode@8 (line 14)
││ 0x00007fd05f2c0c0c: mov $0x1,%r10d
0.18% ││ 0x00007fd05f2c0c12: mov $0x1f40,%esi
1.20% ││ 0x00007fd05f2c0c17: mov $0x80000000,%r11d ;*putfield value {reexecute=0 rethrow=0 return_oop=0}
││ ; - java.lang.String::<init>@9 (line 236)
││ ; - StringHashCode::stringHashCode@8 (line 14)
0.50% ││ 0x00007fd05f2c0c1d: test %ebp,%ebp
││╭ 0x00007fd05f2c0c1f: je 0x00007fd05f2c0d88 ;*ifeq {reexecute=0 rethrow=0 return_oop=0}
│││ ; - java.lang.String::hashCode@22 (line 1504)
│││ ; - StringHashCode::stringHashCode@11 (line 14)
│││ 0x00007fd05f2c0c25: sar %r14d ;*ishr {reexecute=0 rethrow=0 return_oop=0}
│││ ; - java.lang.StringUTF16::hashCode@5 (line 348)
│││ ; - java.lang.String::hashCode@39 (line 1505)
│││ ; - StringHashCode::stringHashCode@11 (line 14)
0.20% │││ 0x00007fd05f2c0c28: test %r14d,%r14d
│││╭ 0x00007fd05f2c0c2b: jle 0x00007fd05f2c0d74 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
││││ ; - java.lang.StringUTF16::hashCode@11 (line 349)
││││ ; - java.lang.String::hashCode@39 (line 1505)
││││ ; - StringHashCode::stringHashCode@11 (line 14)
1.14% ││││ 0x00007fd05f2c0c31: movzwl 0x10(%r12,%rdx,8),%r9d ;*invokestatic getChar {reexecute=0 rethrow=0 return_oop=0}
││││ ; - java.lang.StringUTF16::hashCode@20 (line 350)
││││ ; - java.lang.String::hashCode@39 (line 1505)
││││ ; - StringHashCode::stringHashCode@11 (line 14)
0.40% ││││ 0x00007fd05f2c0c37: mov %r14d,%r13d
││││ 0x00007fd05f2c0c3a: dec %r13d
0.16% ││││ 0x00007fd05f2c0c3d: mov %r9d,%r8d
0.86% ││││ 0x00007fd05f2c0c40: shl $0x5,%r8d
0.46% ││││ 0x00007fd05f2c0c44: mov %r14d,%ebx
││││ 0x00007fd05f2c0c47: add $0xfffffff9,%ebx
0.16% ││││ 0x00007fd05f2c0c4a: cmp %ebx,%r13d
0.98% ││││ 0x00007fd05f2c0c4d: cmovl %r11d,%ebx
0.46% ││││ 0x00007fd05f2c0c51: cmp $0x1,%ebx
││││ 0x00007fd05f2c0c54: jle 0x00007fd05f2c0edb
││││ 0x00007fd05f2c0c5a: sub %r9d,%r8d ;*imul {reexecute=0 rethrow=0 return_oop=0}
││││ ; - java.lang.StringUTF16::hashCode@17 (line 350)
││││ ; - java.lang.String::hashCode@39 (line 1505)
││││ ; - StringHashCode::stringHashCode@11 (line 14)
0.28% ││││╭ 0x00007fd05f2c0c5d: jmp 0x00007fd05f2c0c8e ;*bipush {reexecute=0 rethrow=0 return_oop=0}
│││││ ; - java.lang.StringUTF16::hashCode@14 (line 350)
│││││ ; - java.lang.String::hashCode@39 (line 1505)
│││││ ; - StringHashCode::stringHashCode@11 (line 14)
1.22% │││││ ↗ ↗ 0x00007fd05f2c0c5f: movzwl 0x10(%rdi,%r10,2),%r11d
1.54% │││││ │ │ 0x00007fd05f2c0c65: sub %r9d,%eax
1.58% │││││ │ │ 0x00007fd05f2c0c68: add %r11d,%eax ;*iadd {reexecute=0 rethrow=0 return_oop=0}
│││││ │ │ ; - java.lang.StringUTF16::hashCode@23 (line 350)
│││││ │ │ ; - java.lang.String::hashCode@39 (line 1505)
│││││ │ │ ; - StringHashCode::stringHashCode@11 (line 14)
1.93% │││││ │ │ 0x00007fd05f2c0c6b: inc %r10d ;*iinc {reexecute=0 rethrow=0 return_oop=0}
│││││ │ │ ; - java.lang.StringUTF16::hashCode@25 (line 349)
│││││ │ │ ; - java.lang.String::hashCode@39 (line 1505)
│││││ │ │ ; - StringHashCode::stringHashCode@11 (line 14)
0.78% │││││ │ │ 0x00007fd05f2c0c6e: cmp %r14d,%r10d
│││││╭│ │ 0x00007fd05f2c0c71: jge 0x00007fd05f2c0d74 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
│││││││ │ ; - java.lang.StringUTF16::hashCode@11 (line 349)
│││││││ │ ; - java.lang.String::hashCode@39 (line 1505)
│││││││ │ ; - StringHashCode::stringHashCode@11 (line 14)
0.80% │││││││ │ 0x00007fd05f2c0c77: mov %eax,%r8d
0.72% │││││││ │ 0x00007fd05f2c0c7a: shl $0x5,%r8d
1.56% │││││││ │ 0x00007fd05f2c0c7e: mov %eax,%r9d
0.68% │││││││ │ 0x00007fd05f2c0c81: mov %r8d,%eax
0.62% ││││││╰ │ 0x00007fd05f2c0c84: jmp 0x00007fd05f2c0c5f
││││││ ↗│ 0x00007fd05f2c0c86: vmovd %xmm1,%ecx
││││││ ││ 0x00007fd05f2c0c8a: vmovd %xmm2,%edx
1.12% ││││↘│ ││ 0x00007fd05f2c0c8e: mov %ebx,%r13d
0.46% ││││ │ ││ 0x00007fd05f2c0c91: sub %r10d,%r13d
││││ │ ││ 0x00007fd05f2c0c94: cmp %esi,%r13d
0.18% ││││ │ ││ 0x00007fd05f2c0c97: cmovg %esi,%r13d
1.14% ││││ │ ││ 0x00007fd05f2c0c9b: add %r10d,%r13d
0.46% ││││ │ ││ 0x00007fd05f2c0c9e: vmovd %ecx,%xmm1
││││ │ ││ 0x00007fd05f2c0ca2: vmovd %edx,%xmm2
0.30% ││││ │ ││ 0x00007fd05f2c0ca6: data16 nopw 0x0(%rax,%rax,1) ;*imul {reexecute=0 rethrow=0 return_oop=0}
││││ │ ││ ; - java.lang.StringUTF16::hashCode@17 (line 350)
││││ │ ││ ; - java.lang.String::hashCode@39 (line 1505)
││││ │ ││ ; - StringHashCode::stringHashCode@11 (line 14)
1.22% ││││ │ ↗││ 0x00007fd05f2c0cb0: movzwl 0x1e(%rdi,%r10,2),%eax
1.91% ││││ │ │││ 0x00007fd05f2c0cb6: movzwl 0x1c(%rdi,%r10,2),%ecx
0.42% ││││ │ │││ 0x00007fd05f2c0cbc: movzwl 0x10(%rdi,%r10,2),%r9d
0.16% ││││ │ │││ 0x00007fd05f2c0cc2: movzwl 0x12(%rdi,%r10,2),%r11d
1.16% ││││ │ │││ 0x00007fd05f2c0cc8: add %r9d,%r8d
1.72% ││││ │ │││ 0x00007fd05f2c0ccb: movzwl 0x14(%rdi,%r10,2),%r9d
0.50% ││││ │ │││ 0x00007fd05f2c0cd1: mov %r8d,%edx
0.26% ││││ │ │││ 0x00007fd05f2c0cd4: shl $0x5,%edx
1.54% ││││ │ │││ 0x00007fd05f2c0cd7: sub %r8d,%edx
1.68% ││││ │ │││ 0x00007fd05f2c0cda: add %r11d,%edx
0.44% ││││ │ │││ 0x00007fd05f2c0cdd: movzwl 0x16(%rdi,%r10,2),%r8d
0.26% ││││ │ │││ 0x00007fd05f2c0ce3: mov %edx,%r11d
1.10% ││││ │ │││ 0x00007fd05f2c0ce6: shl $0x5,%r11d
1.38% ││││ │ │││ 0x00007fd05f2c0cea: sub %edx,%r11d
0.46% ││││ │ │││ 0x00007fd05f2c0ced: add %r9d,%r11d
0.38% ││││ │ │││ 0x00007fd05f2c0cf0: movzwl 0x18(%rdi,%r10,2),%edx
1.10% ││││ │ │││ 0x00007fd05f2c0cf6: mov %r11d,%r9d
1.44% ││││ │ │││ 0x00007fd05f2c0cf9: shl $0x5,%r9d
0.54% ││││ │ │││ 0x00007fd05f2c0cfd: sub %r11d,%r9d
0.38% ││││ │ │││ 0x00007fd05f2c0d00: add %r8d,%r9d
1.64% ││││ │ │││ 0x00007fd05f2c0d03: movzwl 0x1a(%rdi,%r10,2),%r8d
1.40% ││││ │ │││ 0x00007fd05f2c0d09: mov %r9d,%r11d
0.44% ││││ │ │││ 0x00007fd05f2c0d0c: shl $0x5,%r11d
0.56% ││││ │ │││ 0x00007fd05f2c0d10: sub %r9d,%r11d
1.58% ││││ │ │││ 0x00007fd05f2c0d13: add %edx,%r11d
1.97% ││││ │ │││ 0x00007fd05f2c0d16: mov %r11d,%edx
0.22% ││││ │ │││ 0x00007fd05f2c0d19: shl $0x5,%edx
1.02% ││││ │ │││ 0x00007fd05f2c0d1c: sub %r11d,%edx
3.41% ││││ │ │││ 0x00007fd05f2c0d1f: add %r8d,%edx
2.03% ││││ │ │││ 0x00007fd05f2c0d22: mov %edx,%r11d
0.12% ││││ │ │││ 0x00007fd05f2c0d25: shl $0x5,%r11d
1.24% ││││ │ │││ 0x00007fd05f2c0d29: sub %edx,%r11d
2.97% ││││ │ │││ 0x00007fd05f2c0d2c: add %ecx,%r11d
1.83% ││││ │ │││ 0x00007fd05f2c0d2f: mov %r11d,%r9d
0.06% ││││ │ │││ 0x00007fd05f2c0d32: shl $0x5,%r9d
1.16% ││││ │ │││ 0x00007fd05f2c0d36: sub %r11d,%r9d
3.89% ││││ │ │││ 0x00007fd05f2c0d39: add %eax,%r9d ;*iadd {reexecute=0 rethrow=0 return_oop=0}
││││ │ │││ ; - java.lang.StringUTF16::hashCode@23 (line 350)
││││ │ │││ ; - java.lang.String::hashCode@39 (line 1505)
││││ │ │││ ; - StringHashCode::stringHashCode@11 (line 14)
1.44% ││││ │ │││ 0x00007fd05f2c0d3c: mov %r9d,%eax
││││ │ │││ 0x00007fd05f2c0d3f: shl $0x5,%eax
1.16% ││││ │ │││ 0x00007fd05f2c0d42: mov %eax,%r8d
1.83% ││││ │ │││ 0x00007fd05f2c0d45: sub %r9d,%r8d ;*imul {reexecute=0 rethrow=0 return_oop=0}
││││ │ │││ ; - java.lang.StringUTF16::hashCode@17 (line 350)
││││ │ │││ ; - java.lang.String::hashCode@39 (line 1505)
││││ │ │││ ; - StringHashCode::stringHashCode@11 (line 14)
1.76% ││││ │ │││ 0x00007fd05f2c0d48: add $0x8,%r10d ;*iinc {reexecute=0 rethrow=0 return_oop=0}
││││ │ │││ ; - java.lang.StringUTF16::hashCode@25 (line 349)
││││ │ │││ ; - java.lang.String::hashCode@39 (line 1505)
││││ │ │││ ; - StringHashCode::stringHashCode@11 (line 14)
││││ │ │││ 0x00007fd05f2c0d4c: cmp %r13d,%r10d
││││ │ ╰││ 0x00007fd05f2c0d4f: jl 0x00007fd05f2c0cb0 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
││││ │ ││ ; - java.lang.StringUTF16::hashCode@11 (line 349)
││││ │ ││ ; - java.lang.String::hashCode@39 (line 1505)
││││ │ ││ ; - StringHashCode::stringHashCode@11 (line 14)
││││ │ ││ 0x00007fd05f2c0d55: mov 0x108(%r15),%r11 ; ImmutableOopMap{rdi=Oop xmm2=NarrowOop }
││││ │ ││ ;*goto {reexecute=1 rethrow=0 return_oop=0}
││││ │ ││ ; - java.lang.StringUTF16::hashCode@28 (line 349)
││││ │ ││ ; - java.lang.String::hashCode@39 (line 1505)
││││ │ ││ ; - StringHashCode::stringHashCode@11 (line 14)
0.68% ││││ │ ││ 0x00007fd05f2c0d5c: test %eax,(%r11) ;*goto {reexecute=0 rethrow=0 return_oop=0}
││││ │ ││ ; - java.lang.StringUTF16::hashCode@28 (line 349)
││││ │ ││ ; - java.lang.String::hashCode@39 (line 1505)
││││ │ ││ ; - StringHashCode::stringHashCode@11 (line 14)
││││ │ ││ ; {poll}
0.84% ││││ │ ││ 0x00007fd05f2c0d5f: cmp %ebx,%r10d
││││ │ ╰│ 0x00007fd05f2c0d62: jl 0x00007fd05f2c0c86
││││ │ │ 0x00007fd05f2c0d68: cmp %r14d,%r10d
││││ │ ╰ 0x00007fd05f2c0d6b: jl 0x00007fd05f2c0c5f
││││ │ 0x00007fd05f2c0d71: mov %r9d,%eax ;*synchronization entry
││││ │ ; - StringHashCode::stringHashCode@-1 (line 14)
0.38% │↘│↘ ↘ ↗ 0x00007fd05f2c0d74: add $0x30,%rsp
0.88% │ │ │ 0x00007fd05f2c0d78: pop %rbp
0.76% │ │ │ 0x00007fd05f2c0d79: mov 0x108(%r15),%r10
│ │ │ 0x00007fd05f2c0d80: test %eax,(%r10) ; {poll_return}
0.28% │ │ │ 0x00007fd05f2c0d83: retq
↘ │ │ 0x00007fd05f2c0d84: mov %ecx,%eax
│ ╰ 0x00007fd05f2c0d86: jmp 0x00007fd05f2c0d74
↘ 0x00007fd05f2c0d88: mov %r14d,%ebx
0x00007fd05f2c0d8b: dec %ebx
0x00007fd05f2c0d8d: cmp %r14d,%ebx
0x00007fd05f2c0d90: jae 0x00007fd05f2c0ee3
0x00007fd05f2c0d96: movzbl 0x10(%r12,%rdx,8),%r9d ;*iand {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.StringLatin1::hashCode@31 (line 196)
; - java.lang.String::hashCode@29 (line 1504)
; - StringHashCode::stringHashCode@11 (line 14)
Independent of performance improvements, the proposed changes may be tolerable from a code maintenance point of view, but I think VM intrinsics would be a better fit here. If the Java-level changes are kept, I think a short comment to explain the intent of the loop would be appropriate; e.g. something like |
The (length & ~(8 - 1)) computation should probably be moved outside of the loop.
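As an illustration of this review comment, the following is a sketch of an 8-way hand-unrolled Latin-1 hash loop with the `length & ~(8 - 1)` bound computed once before the loop. It is an assumption-laden reconstruction, not the PR's actual code: the class name, the `P1`..`P8` constants and the method shape are hypothetical.

```java
// Illustrative only: names and layout are assumptions, not the PR's code.
final class UnrolledHashSketch {
    // Powers of 31 in wrapping int arithmetic: Pk == 31^k (mod 2^32).
    private static final int P1 = 31;
    private static final int P2 = P1 * 31;
    private static final int P3 = P2 * 31;
    private static final int P4 = P3 * 31;
    private static final int P5 = P4 * 31;
    private static final int P6 = P5 * 31;
    private static final int P7 = P6 * 31;
    private static final int P8 = P7 * 31;

    static int hashCode(byte[] value) {
        final int length = value.length;
        // Hoisted out of the loop: largest multiple of 8 that is <= length.
        final int bound = length & ~(8 - 1);
        int h = 0;
        int i = 0;
        // Main loop: fold 8 elements per iteration. The eight element terms
        // are independent of each other and only combine with h once.
        for (; i < bound; i += 8) {
            h = P8 * h
              + P7 * (value[i]     & 0xff)
              + P6 * (value[i + 1] & 0xff)
              + P5 * (value[i + 2] & 0xff)
              + P4 * (value[i + 3] & 0xff)
              + P3 * (value[i + 4] & 0xff)
              + P2 * (value[i + 5] & 0xff)
              + P1 * (value[i + 6] & 0xff)
              +      (value[i + 7] & 0xff);
        }
        // Tail: the remaining 0..7 elements, one at a time.
        for (; i < length; i++) {
            h = 31 * h + (value[i] & 0xff);
        }
        return h;
    }
}
```

A JIT may well hoist the loop-invariant bound on its own; writing it explicitly mainly makes the intent visible to readers.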
I'm currently working on the vectorized intrinsic. It's taking more time due to end of quarter activities but I'm getting around to it :) |
Next is to generalize it for Arrays.hashCode and StringUTF16.hashCode and make it cheap on shorter strings
Some early results:
The results are very encouraging, as it is 7x faster for large strings. Next steps are to:
|
EOD update: I'm trying to generalise the approach to Arrays.hashCode for some of the types (int, short, char, byte, float). However, I'm running into the following assertion and I haven't figured it out just yet.
Any pointers would be greatly appreciated, I'll keep digging in the meantime. I've also explored using a jump table for the
|
Can't really see anything that I think is the direct cause of the error you're seeing, but there are a couple of places where Op_AryHashCode appears to be missing.
@cl4es that was indeed the issue leading to the crash. Thanks! |
  strcmp(_matrule->_rChild->_opType,"StrInflatedCopy" )==0 ||
  strcmp(_matrule->_rChild->_opType,"StrCompressedCopy" )==0 ||
  strcmp(_matrule->_rChild->_opType,"StrIndexOf")==0 ||
  strcmp(_matrule->_rChild->_opType,"StrIndexOfChar")==0 ||
  strcmp(_matrule->_rChild->_opType,"CountPositives")==0 ||
  strcmp(_matrule->_rChild->_opType,"EncodeISOArray")==0)) {
- // String.(compareTo/equals/indexOf) and Arrays.equals
+ // String.(compareTo/equals/indexOf/hashCode) and Arrays.equals
Suggested change:
- // String.(compareTo/equals/indexOf/hashCode) and Arrays.equals
+ // String.(compareTo/equals/indexOf/hashCode) and Arrays.(equals/hashCode)
I will say, for the record, although it looks like Richard Startin scooped me by half a year (for which, kudos!), that an explicitly vectorized algorithm was independently derived as a seed challenge for the Panama Vector API. I coded the explicitly vectorized Horner's rule loops seen in http://cr.openjdk.java.net/~jrose/vectors/vloop0.cpp in mid-2015, when we first thought there was an opportunity to do something with vector units and Java. (Thank you Intel for believing in this crazy idea!) I'm glad to see the current work moving forward. I agree that an intrinsic form, rather than a magically hand-crafted Java loop, is the right way to give the JVM its call to action. I wish we could generalize this to other instances of vectorized polynomial evaluators, rather than simply the wretchedly hardwired radix-31 one that so much of Java relies on. Maybe we will eventually...
…ed-stringlatin1-hashcode
Looks like you are making great progress. Have you thought about ways the intrinsic implementation might be simplified if some code is retained in Java and passed as constant arguments? e.g. table of constants, scalar loop, bounds checks etc, such that the intrinsic primarily focuses on the vectorized code. To some extent that's related to John's point on generalization, and through simplification there may be some generalization. For example if there was a general intrinsic that returned a long value (e.g. first 32 bits are the offset in the array to continue processing, the second 32 bits are the current hashcode value) then we could call that from the Java implementations that then proceed with the scalar loop up to the array length. The Java implementation intrinsic would return Separately it would be nice to consider computing the hash code from the contents of a memory segment, similar to how we added The |
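The packed-`long` return value described in this comment could look roughly like the sketch below. Everything here is hypothetical: `hashPrefix` is a plain-Java stand-in for the proposed intrinsic, and placing the offset in the upper 32 bits and the hash in the lower 32 bits is an assumption, since the comment does not pin down the layout.

```java
// Illustrative only: hashPrefix stands in for a possible intrinsic; its name,
// signature and bit layout are assumptions, not an existing JDK API.
final class PackedResultSketch {

    // Processes the largest prefix whose length is a multiple of 8 and returns
    // (offset << 32) | hash, where offset is the index at which to continue.
    static long hashPrefix(byte[] a, int initial) {
        int h = initial;
        int i = 0;
        final int bound = a.length & ~7;
        for (; i < bound; i++) {          // a real intrinsic would vectorize this part
            h = 31 * h + (a[i] & 0xff);
        }
        return ((long) i << 32) | (h & 0xffffffffL);
    }

    // Java-side caller: unpack the result, then finish the tail with a scalar loop.
    static int hashCode(byte[] a) {
        long packed = hashPrefix(a, 0);
        int i = (int) (packed >>> 32);    // offset to resume from
        int h = (int) packed;             // hash accumulated so far
        for (; i < a.length; i++) {
            h = 31 * h + (a[i] & 0xff);
        }
        return h;
    }
}
```

With this split, the intrinsic never has to handle a tail, at the cost of one unpacking step in the caller.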
@PaulSandoz yes, keeping the "short" string part in pure Java and switching to an intrinsified/vectorized version for "long" strings is the next avenue of exploration. I would also put the intrinsic in a runtime stub to avoid unnecessarily increasing the size of the calling method. The overhead of the call would be amortised because it would only be called for longer strings anyway. I haven't given much thought to how we could split up the different elements of the algorithm to generalise the approach just yet. I'll give it a try, see how far I can get with it, and keep you updated on my findings. |
@luhenry ok, we took a similar approach to the mismatch intrinsic, carefully analyzing the threshold by which the intrinsic would be called. My suggestion would be to follow that approach further and head towards an internal intrinsic perhaps with this signature:
Then on a further iteration try and pass the polynomial constant and table of powers (stable array) as arguments. |
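One way to read "pass the polynomial constant and table of powers as arguments" is sketched below. This is a hedged illustration in plain Java, not the proposed intrinsic: the class, the method names and the parameter order are all hypothetical. The block form `h = base^W * h + sum(pow[j] * a[i + j])` is the Horner recurrence expanded over `W` elements, which is the shape that lends itself to vectorization.

```java
// Illustrative only: names and signatures are assumptions, not a JDK API.
final class PowerTableHashSketch {

    // Returns {base^(width-1), ..., base^1, base^0} in wrapping int arithmetic.
    // width must be at least 1.
    static int[] powersDescending(int base, int width) {
        int[] pow = new int[width];
        int v = 1;
        for (int j = width - 1; j >= 0; j--) {
            pow[j] = v;
            v *= base;
        }
        return pow;
    }

    // Polynomial hash of a[] with the given base and starting value, processing
    // pow.length elements per block and finishing the tail element by element.
    static int hash(int[] a, int base, int[] pow, int initial) {
        final int width = pow.length;
        final int step = pow[0] * base;            // base^width
        final int bound = a.length - (a.length % width);
        int h = initial;
        int i = 0;
        for (; i < bound; i += width) {
            int acc = 0;
            for (int j = 0; j < width; j++) {      // independent multiply-adds
                acc += pow[j] * a[i + j];
            }
            h = step * h + acc;
        }
        for (; i < a.length; i++) {
            h = base * h + a[i];
        }
        return h;
    }
}
```

Called with `base = 31`, a table from `powersDescending(31, 8)` and `initial = 1`, this should agree with `Arrays.hashCode(int[])` for non-null arrays; other bases give the generalised evaluators mentioned earlier in the thread.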
@luhenry This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration! |
Still working on it, other work priorities have popped up. I'm taking the approach of outlining the longer string approach in a dedicated runtime stub. This makes the code easier and it doesn't have a performance impact given the stub is only called on longer strings (the cost of the call is thus amortised by the faster execution). |
@luhenry This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the |
@luhenry notified me that he won't be able to continue working on this for now. I've started looking at this and am scoping out what's needed to finish the work. To be continued in a new PR (soon!) |
Despite the hash value being cached for Strings, computing the hash still represents a significant CPU usage for applications handling lots of text.
Even though it would be generally better to do it through an enhancement to the autovectorizer, the complexity of doing it by hand is trivial and the gain is sizable (2x speedup) even without the Vector API. The algorithm has been proposed by Richard Startin and Paul Sandoz [1].
Speedups are as follows on an Intel(R) Xeon(R) E-2276G CPU @ 3.80GHz:
At Datadog, we handle a great amount of text (through logs management for example), and hashing String represents a large part of our CPU usage. It's very unlikely that we are the only one as String.hashCode is such a core feature of the JVM-based languages with its use in HashMap for example. Having even only a 2x speedup would allow us to save thousands of CPU cores per month and improve correspondingly the energy/carbon impact.
[1] https://static.rainfocus.com/oracle/oow18/sess/1525822677955001tLqU/PF/codeone18-vector-API-DEV5081_1540354883936001Q3Sv.pdf
Progress
Integration blocker
Issue
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/7700/head:pull/7700
$ git checkout pull/7700
Update a local copy of the PR:
$ git checkout pull/7700
$ git pull https://git.openjdk.org/jdk pull/7700/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 7700
View PR using the GUI difftool:
$ git pr show -t 7700
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/7700.diff