-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-12480][SQL] add Hash expression that can calculate hash value for a group of expressions #10435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #48203 has finished for PR 10435 at commit
|
|
Looks like we can also mention https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFHash.java. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we add more contents at here to explain the algorithm and mentions that we are following Hive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, it is also good to mention the compatibility benefit of following Hive at here.
|
Regarding testing this, I am wondering if we can add it to function registry. So, all queries that use |
|
It is also important to use this hash function in Exchange. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which path handles string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last one case other => other.hashCode()
|
Test build #48241 has finished for PR 10435 at commit
|
|
Test build #48272 has finished for PR 10435 at commit
|
|
Test build #48274 has finished for PR 10435 at commit
|
|
Test build #48281 has finished for PR 10435 at commit
|
|
Test build #48291 has finished for PR 10435 at commit
|
|
Test build #48325 has finished for PR 10435 at commit
|
|
Test build #48337 has finished for PR 10435 at commit
|
1b56480 to
303b69b
Compare
|
Test build #48346 has finished for PR 10435 at commit
|
|
retest this please. |
|
Test build #48349 has finished for PR 10435 at commit
|
|
Test build #48350 has finished for PR 10435 at commit
|
|
Test build #48352 has finished for PR 10435 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we just turned this into Mumur3Hash instead?
This would just do UnsafeProjection.create()
project(input).hashCode()
Murmur3 will give us much nicer hashing properties. The current hash function can be bad in reasonable cases.
For example, if the long column is a timestamp in milis from a source that samples every second. Most of the low digits will be similar (e.g. values are 1000, 2002, 2999, etc. Very few that end in 500). The hash function does a very bad job of breaking this up and this will generate some very skewed partitions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point!
after decided to not follow hive, I agree Mumur3Hash is a better choice.
|
Closing, will open another PR to use |
|
On the contrary I think we should consider having the hash code expression for two reasons:
|
|
It makes sense to still have a Hash expression (called more specifically, Mumur3Hash) that does what this patch originally intended. I think this will be a useful primitive. The underlying implementation can just use UnsafeRow.hashCode for now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I didn't use hash for the name, as it will break a lot of hive compatibility tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many does it break?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
35 tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you give me a list? i think we should consider just blacklisting them ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48241/consoleFull
most of them is testing something else but coincidently include hash expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can have a flag to control this -- when in hive compatibility test, fall back to Hive's, and otherwise our own?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good to me, let me try it out.
|
Test build #48540 has finished for PR 10435 at commit
|
|
Test build #48542 has finished for PR 10435 at commit
|
|
Test build #48644 has finished for PR 10435 at commit
|
|
Test build #48656 has finished for PR 10435 at commit
|
|
I think we should open another PR to use this hash expression in |
|
LGTM |
|
Test build #48699 has finished for PR 10435 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should just be a single column vararg, rather than one followed by vararg?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the hash function should take at least one parameter, does @scala.annotation.varargs support this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use the following form:
(firstarg:Int)(more:Int*)
|
I've merged this. You can address the API comment in the next pull request. Thanks. |
address comments in apache#10435 This makes the API easier to use if user programmatically generate the call to hash, and they will get analysis exception if the arguments of hash is empty. Author: Wenchen Fan <[email protected]> Closes apache#10588 from cloud-fan/hash.
just write the arguments into unsafe row and use murmur3 to calculate hash code