Skip to content

Conversation

@mrkm4ntr
Copy link
Contributor

What changes were proposed in this pull request?

Murmur3 hash generates a different value from the original and other implementations (like Scala standard library and Guava or so) when the length of a bytes array is not multiple of 4.

How was this patch tested?

Added a unit test.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add JIRA number with a short description as a comment (e.g. SPARK-23381 ...)

Copy link
Member

@kiszk kiszk Feb 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to compare with the result of murmur3 hash value by scala library?

@mrkm4ntr
Copy link
Contributor Author

@kiszk Thank you for your review! I fixed it.

@hvanhovell
Copy link
Contributor

hvanhovell commented Feb 12, 2018

@mrkm4ntr The change itself looks pretty reasonable. However I am very hesitant to merge this because this will probably break bucketing (it uses murmur3 to create the buckets); for example a bucketed table written by Spark 2.2 cannot be safely read by Spark after this change.

Can you explain what problem you are trying to fix here?

@mrkm4ntr
Copy link
Contributor Author

@hvanhovell The main motivation is making the online prediction of trained parameters using FeatureHasher in MLLib. If the generated hash value is different from the implementations in another language, indices of coefficients do not match and can not predict correctly.
But I agree backward compatibility is more important. Since FeatureHasher will be added from Spark 2.3.0, how about adding a new method of this content to Murmur 3 and using it only from FeatureHasher?

@jiangxb1987
Copy link
Contributor

How about add a new config to control whether to use the new Murmur3 hash function and have that default turned off? We also have to document the change explicitly. WDYT @gatorsmile @hvanhovell @cloud-fan ?

@hvanhovell
Copy link
Contributor

@mrkm4ntr I see your point. Adding a method to Murmur3 would work.

The problem is that we are now going to release a FeatureHasher in Spark 2.3 that uses the current Murmur3 implementation. If we change this to use the correct Murmur3 implementation after the release of Spark 2.3 we will break all models using feature hashing created using Spark 2.3. This might be a blocker. Can you send an e-mail to the dev list?

cc @sameeragarwal @srowen for more visibility.

@mrkm4ntr
Copy link
Contributor Author

@hvanhovell I sent an e-mail to the topic [VOTE] Spark 2.3.0 (RC3).

@mrkm4ntr
Copy link
Contributor Author

@hvanhovell I added a method and changed it so that we call it only from FeatureHasher.

@felixcheung
Copy link
Member

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Feb 15, 2018

Test build #87472 has finished for PR 20568 at commit 336bce0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Use this method for new components after Spark 2.3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed it.

@kiszk
Copy link
Member

kiszk commented Feb 15, 2018

Retest this please

@mrkm4ntr
Copy link
Contributor Author

I cannot reproduce this failure of the test in my environment.
It seems to me that this is not related to this change...

@kiszk
Copy link
Member

kiszk commented Feb 16, 2018

@mrkm4ntr Do not worry about these failures. Since we know there are some unstable tests, our community is trying to fix them. For a while, we have to kick test.

@kiszk
Copy link
Member

kiszk commented Feb 16, 2018

Retest this please

@ueshin
Copy link
Member

ueshin commented Feb 16, 2018

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Feb 16, 2018

Test build #87501 has finished for PR 20568 at commit c20cd97.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member

kiszk commented Feb 16, 2018

Jenkins, retest this please.

@viirya
Copy link
Member

viirya commented Feb 16, 2018

retest this please.

* See SPARK-23381.
*/
@Since("2.3.0")
def murmur3Hash(term: Any): Int = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe private[feature]?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also address this comment.

@felixcheung
Copy link
Member

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Feb 16, 2018

Test build #87509 has finished for PR 20568 at commit c20cd97.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member

kiszk commented Feb 16, 2018

Jenkins, retest this please.

@hvanhovell
Copy link
Contributor

@mrkm4ntr this is legitimate failure. Can you fix the python tests?

@sameeragarwal
Copy link
Member

@hvanhovell just to make sure, given the dependency on FeatureHasher, should this block RC4?

@jkbradley
Copy link
Member

jkbradley commented Feb 16, 2018

(updated)

For ML, I actually don't think this has to be a blocker. It's not great, but it's not a regression.

However, we should definitely fix this in the future and soon: For ML, it's really important that MurmurHash3 behave consistently across platforms.

To fix this, we'll need to maintain the old implementation of MurmushHash3 to maintain the behavior of ML Pipelines exported from previous versions of Spark.

@gatorsmile
Copy link
Member

To speedup the work here, I will take this over. All the contributions should be given to @mrkm4ntr

Thanks for your work! @mrkm4ntr

@gatorsmile
Copy link
Member

Submitted the PR #20630 to take this over.

@viirya
Copy link
Member

viirya commented Feb 17, 2018

I think we can close this now.

@gatorsmile
Copy link
Member

@mrkm4ntr Thank you for your contribution! The PR has been merged using your Github account. Could you close this?

@mrkm4ntr
Copy link
Contributor Author

@gatorsmile Thanks! I will close it.

@mrkm4ntr mrkm4ntr closed this Feb 17, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.