[SPARK-23381][CORE] Murmur3 hash generates a different value from other implementations #20568
It would be good to add the JIRA number with a short description as a comment (e.g. SPARK-23381 ...)
Would it be better to compare against the Murmur3 hash value produced by the Scala library?
8856e39 to
905ae19
Compare
|
@kiszk Thank you for your review! I fixed it. |
|
@mrkm4ntr The change itself looks pretty reasonable. However I am very hesitant to merge this because this will probably break bucketing (it uses murmur3 to create the buckets); for example a bucketed table written by Spark 2.2 cannot be safely read by Spark after this change. Can you explain what problem you are trying to fix here? |
|
@hvanhovell The main motivation is to enable online prediction with parameters trained using FeatureHasher in MLlib. If the generated hash value differs from that of implementations in other languages, the coefficient indices do not match and predictions are incorrect. |
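To make the index mismatch concrete, here is a minimal sketch in Java. It assumes that a hashed feature is placed at the non-negative remainder of its 32-bit hash modulo the number of features. The class name, the method name, and both hash values are hypothetical and chosen only for illustration; they are not Spark's API or real Murmur3 outputs.

```java
// Hypothetical illustration: names and hash values are assumptions, not Spark's API.
class FeatureIndexSketch {
    // Map a 32-bit hash to a slot in [0, numFeatures), including negative hashes.
    static int nonNegativeMod(int hash, int numFeatures) {
        int raw = hash % numFeatures;
        return raw < 0 ? raw + numFeatures : raw;
    }

    public static void main(String[] args) {
        int numFeatures = 1 << 18;
        // Made-up values standing in for what two different Murmur3
        // implementations might return for the same term.
        int hashAtTraining = 0x1234ABCD;
        int hashAtServing = 0x0F00BA12;
        int trainedIndex = nonNegativeMod(hashAtTraining, numFeatures);
        int servedIndex = nonNegativeMod(hashAtServing, numFeatures);
        System.out.println("index at training time: " + trainedIndex);
        System.out.println("index at serving time:  " + servedIndex);
        System.out.println("same coefficient slot?  " + (trainedIndex == servedIndex));
    }
}
```

If the two indices differ, the serving side looks up a coefficient that was learned for some other feature, which is the failure mode described above.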
|
How about adding a new config to control whether to use the new Murmur3 hash function, with the default turned off? We would also have to document the change explicitly. WDYT @gatorsmile @hvanhovell @cloud-fan ? |
|
@mrkm4ntr I see your point. Adding a method to Murmur3 would work. The problem is that we are now going to release a cc @sameeragarwal @srowen for more visibility. |
|
@hvanhovell I sent an e-mail to the topic |
905ae19 to
336bce0
Compare
|
@hvanhovell I added a method and changed it so that we call it only from FeatureHasher. |
|
Jenkins, test this please |
|
Test build #87472 has finished for PR 20568 at commit
|
nit: Use this method for new components after Spark 2.3
Thanks, fixed it.
|
Retest this please |
…not a multiple of 4
336bce0 to
c20cd97
Compare
|
I cannot reproduce this test failure in my environment. |
|
@mrkm4ntr Do not worry about these failures. We know some tests are unstable, and the community is working on fixing them. For now, we have to re-trigger the tests. |
|
Retest this please |
|
Jenkins, retest this please. |
|
Test build #87501 has finished for PR 20568 at commit
|
|
Jenkins, retest this please. |
|
retest this please. |
 * See SPARK-23381.
 */
@Since("2.3.0")
def murmur3Hash(term: Any): Int = {
Maybe private[feature]?
I would also address this comment.
|
Jenkins, retest this please. |
|
Test build #87509 has finished for PR 20568 at commit
|
|
Jenkins, retest this please. |
|
@mrkm4ntr this is a legitimate failure. Can you fix the Python tests? |
|
@hvanhovell just to make sure, given the dependency on |
|
(updated) For ML, I actually don't think this has to be a blocker. It's not great, but it's not a regression. However, we should definitely fix this soon: for ML, it's really important that MurmurHash3 behave consistently across platforms. To fix this, we'll need to keep the old implementation of MurmurHash3 to preserve the behavior of ML Pipelines exported from previous versions of Spark. |
|
Submitted the PR #20630 to take this over. |
|
I think we can close this now. |
|
@mrkm4ntr Thank you for your contribution! The PR has been merged using your GitHub account. Could you close this? |
|
@gatorsmile Thanks! I will close it. |
What changes were proposed in this pull request?
Spark's Murmur3 hash generates a different value from the original and other implementations (such as the Scala standard library and Guava) when the length of a byte array is not a multiple of 4.
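The divergence can be sketched in Java as follows. `standard` follows the reference 32-bit MurmurHash3, which packs the 1 to 3 trailing bytes into one partial block. `perByteTail` is a hypothetical variant that mixes each trailing byte as if it were a full block; it is only an assumed stand-in for the nonstandard behavior described here, not a copy of Spark's code. The class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch; class and method names are assumptions, not Spark's API.
class Murmur3TailSketch {
    private static final int C1 = 0xcc9e2d51;
    private static final int C2 = 0x1b873593;

    static int mixK1(int k1) {
        k1 *= C1;
        k1 = Integer.rotateLeft(k1, 15);
        return k1 * C2;
    }

    static int mixH1(int h1, int k1) {
        h1 ^= k1;
        h1 = Integer.rotateLeft(h1, 13);
        return h1 * 5 + 0xe6546b64;
    }

    static int fmix(int h1, int length) {
        h1 ^= length;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    // Mix every complete 4-byte block, read as a little-endian int.
    static int hashBlocks(byte[] data, int seed) {
        int h1 = seed;
        int aligned = data.length & ~3;
        for (int i = 0; i < aligned; i += 4) {
            int k1 = (data[i] & 0xff)
                    | (data[i + 1] & 0xff) << 8
                    | (data[i + 2] & 0xff) << 16
                    | (data[i + 3] & 0xff) << 24;
            h1 = mixH1(h1, mixK1(k1));
        }
        return h1;
    }

    // Reference behavior: trailing bytes form one partial block that is
    // mixed with mixK1 and XORed in, without the mixH1 step.
    static int standard(byte[] data, int seed) {
        int h1 = hashBlocks(data, seed);
        int aligned = data.length & ~3;
        int k1 = 0;
        for (int i = data.length - 1; i >= aligned; i--) {
            k1 = (k1 << 8) | (data[i] & 0xff);
        }
        h1 ^= mixK1(k1); // no-op when there is no tail, since mixK1(0) == 0
        return fmix(h1, data.length);
    }

    // Assumed nonstandard variant: each trailing byte is treated as its own block.
    static int perByteTail(byte[] data, int seed) {
        int h1 = hashBlocks(data, seed);
        int aligned = data.length & ~3;
        for (int i = aligned; i < data.length; i++) {
            h1 = mixH1(h1, mixK1(data[i]));
        }
        return fmix(h1, data.length);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"test", "Hello, world!"}) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            System.out.printf("%-15s len=%2d standard=%08x perByteTail=%08x%n",
                    s, bytes.length, standard(bytes, 0), perByteTail(bytes, 0));
        }
    }
}
```

For inputs whose length is a multiple of 4 the two functions share the same code path and agree; they can only diverge when a tail of 1 to 3 bytes is present, which matches the condition stated above.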
How was this patch tested?
Added a unit test.