[SPARK-14620][SQL] Use/benchmark a better hash in VectorizedHashMap #12379
Test build #55781 has finished for PR 12379 at commit
Test build #55877 has finished for PR 12379 at commit
cc @nongli
Test build #55896 has finished for PR 12379 at commit
what does this plan look like?
    WholeStageCodegen
    :  +- TungstenAggregate(key=[k#3L,k#3L], functions=[(sum(id#0L),mode=Final,isDistinct=false)], output=[k#3L,k#3L,sum(id)#183L])
    :     +- INPUT
    +- Exchange hashpartitioning(k#3L, k#3L, 1), None
       +- WholeStageCodegen
          :  +- TungstenAggregate(key=[k#3L,k#3L], functions=[(sum(id#0L),mode=Partial,isDistinct=false)], output=[k#3L,k#3L,sum#185L])
          :     +- Project [id#0L,FLOOR((rand(-9053518532274118725) * 10000.0)) AS k#3L]
          :        +- Range 0, 1, 1, 20971520, [id#0L]
I feel you're spending a lot of time in rand
I see. Doesn't registerTempTable materialize the table before registering it? If not, is there a way to do that?
LGTM
Test build #55962 has finished for PR 12379 at commit
## What changes were proposed in this pull request?

This PR uses a better hashing algorithm while probing the AggregateHashMap:

```java
long h = 0;
h = (h ^ (0x9e3779b9)) + key_1 + (h << 6) + (h >>> 2);
h = (h ^ (0x9e3779b9)) + key_2 + (h << 6) + (h >>> 2);
h = (h ^ (0x9e3779b9)) + key_3 + (h << 6) + (h >>> 2);
...
h = (h ^ (0x9e3779b9)) + key_n + (h << 6) + (h >>> 2);
return h;
```

Depends on: apache#12345

## How was this patch tested?

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 on Mac OS X 10.11.4
    Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
    Aggregate w keys:            Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    codegen = F                        2417 / 2457          8.7         115.2       1.0X
    codegen = T hashmap = F            1554 / 1581         13.5          74.1       1.6X
    codegen = T hashmap = T             877 /  929         23.9          41.8       2.8X

Author: Sameer Agarwal

Closes apache#12379 from sameeragarwal/hash.
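As a minimal sketch of the mixing step quoted above, the unrolled lines can be written as a loop over the keys. The class name `KeyHash`, the varargs signature, and the `bucket` helper are assumptions made for illustration; they are not taken from the generated Spark code, which may map hashes to buckets differently.

```java
// Illustrative only: a loop form of the per-key mixing step from the PR description.
final class KeyHash {
    private KeyHash() {}

    // Folds each key into h with the quoted step.
    // Note: 0x9e3779b9 is an int literal in Java, so it is sign-extended
    // to a negative long before the XOR.
    static long hash(long... keys) {
        long h = 0;
        for (long key : keys) {
            h = (h ^ (0x9e3779b9)) + key + (h << 6) + (h >>> 2);
        }
        return h;
    }

    // Hypothetical helper: picks a bucket by masking the hash,
    // assuming numBuckets is a power of two.
    static int bucket(long h, int numBuckets) {
        return (int) (h & (numBuckets - 1));
    }
}
```

Because each step feeds the previous `h` back in through the shifts, the result depends on key order, so `hash(1, 2)` and `hash(2, 1)` differ. That is what a multi-column grouping key needs.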