[SPARK-23381][CORE] Murmur3 hash generates a different value from other implementations #20568

mrkm4ntr · 2018-02-10T14:33:41Z

What changes were proposed in this pull request?

Murmur3 hash generates a different value from the original and other implementations (such as the Scala standard library and Guava) when the length of a byte array is not a multiple of 4.
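
The difference comes down to how the trailing 1 to 3 bytes are mixed in. The following is a minimal sketch of the two tail strategies, not Spark's actual source: the class and method names (`Murmur3TailSketch`, `hashLegacy`, `hashStandard`) are hypothetical, and it assumes little-endian block reads over a plain `byte[]` rather than Spark's unsafe memory access.

```java
/** Illustrative sketch only; names are hypothetical and this is not Spark's source. */
final class Murmur3TailSketch {
    private static final int C1 = 0xcc9e2d51;
    private static final int C2 = 0x1b873593;

    static int mixK1(int k1) {
        k1 *= C1;
        k1 = Integer.rotateLeft(k1, 15);
        k1 *= C2;
        return k1;
    }

    static int mixH1(int h1, int k1) {
        h1 ^= k1;
        h1 = Integer.rotateLeft(h1, 13);
        return h1 * 5 + 0xe6546b64;
    }

    static int fmix(int h1, int length) {
        h1 ^= length;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    // Hash the 4-byte-aligned prefix, reading each block as a little-endian int (assumption).
    static int hashAligned(byte[] data, int aligned, int seed) {
        int h1 = seed;
        for (int i = 0; i < aligned; i += 4) {
            int block = (data[i] & 0xFF)
                | ((data[i + 1] & 0xFF) << 8)
                | ((data[i + 2] & 0xFF) << 16)
                | ((data[i + 3] & 0xFF) << 24);
            h1 = mixH1(h1, mixK1(block));
        }
        return h1;
    }

    // Legacy-style tail: every leftover byte is sign-extended and mixed as if it were a full block.
    static int hashLegacy(byte[] data, int seed) {
        int aligned = data.length - data.length % 4;
        int h1 = hashAligned(data, aligned, seed);
        for (int i = aligned; i < data.length; i++) {
            h1 = mixH1(h1, mixK1(data[i]));
        }
        return fmix(h1, data.length);
    }

    // Reference-style tail: leftover bytes are packed into one partial block, mixed once, then XORed in.
    static int hashStandard(byte[] data, int seed) {
        int aligned = data.length - data.length % 4;
        int h1 = hashAligned(data, aligned, seed);
        int k1 = 0;
        for (int i = aligned, shift = 0; i < data.length; i++, shift += 8) {
            k1 ^= (data[i] & 0xFF) << shift;
        }
        h1 ^= mixK1(k1);
        return fmix(h1, data.length);
    }

    public static void main(String[] args) {
        byte[] alignedInput = {1, 2, 3, 4, 5, 6, 7, 8};
        byte[] unalignedInput = {1, 2, 3, 4, 5};
        System.out.println(hashLegacy(alignedInput, 42) == hashStandard(alignedInput, 42));
        System.out.println(hashLegacy(unalignedInput, 42) == hashStandard(unalignedInput, 42));
    }
}
```

When the input length is a multiple of 4, neither tail loop runs and the two functions return the same value, which is why the discrepancy only shows up for other lengths.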

How was this patch tested?

Added a unit test.

Please review http://spark.apache.org/contributing.html before opening a pull request.

kiszk · 2018-02-11T04:09:50Z

common/unsafe/src/test/java/org/apache/spark/unsafe/hash/Murmur3_x86_32Suite.java

It would be good to add the JIRA number with a short description as a comment (e.g. SPARK-23381 ...)

kiszk · 2018-02-11T04:10:27Z

common/unsafe/src/test/java/org/apache/spark/unsafe/hash/Murmur3_x86_32Suite.java

Would it be better to compare against the murmur3 hash value produced by the Scala library?

mrkm4ntr · 2018-02-11T09:51:53Z

@kiszk Thank you for your review! I fixed it.

hvanhovell · 2018-02-12T17:29:33Z

@mrkm4ntr The change itself looks pretty reasonable. However I am very hesitant to merge this because this will probably break bucketing (it uses murmur3 to create the buckets); for example a bucketed table written by Spark 2.2 cannot be safely read by Spark after this change.

Can you explain what problem you are trying to fix here?

mrkm4ntr · 2018-02-13T01:25:11Z

@hvanhovell The main motivation is to enable online prediction with parameters trained using FeatureHasher in MLlib. If the generated hash value differs from that of implementations in other languages, the indices of the coefficients do not match and the model cannot predict correctly.
But I agree that backward compatibility is more important. Since FeatureHasher will be added in Spark 2.3.0, how about adding a new method with this fix to Murmur3 and using it only from FeatureHasher?
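
To make the coefficient-index concern concrete, here is a small sketch under stated assumptions: it assumes the feature index is the non-negative modulo of the 32-bit hash by the number of features, as feature hashing commonly does. The class `FeatureIndexSketch`, its methods, and the hash values used in the usage note are hypothetical placeholders, not values produced by any real implementation.

```java
/** Illustrative sketch only; names and hash values are hypothetical. */
final class FeatureIndexSketch {
    // Map a 32-bit hash to a slot in [0, numFeatures), even when the hash is negative.
    static int indexOf(int hash, int numFeatures) {
        return Math.floorMod(hash, numFeatures);
    }

    // Look up the coefficient stored for a term, given that term's hash.
    static double lookup(double[] coefficients, int hash) {
        return coefficients[indexOf(hash, coefficients.length)];
    }
}
```

If training stores a weight at `indexOf(10, 8)` but the serving side computes a hash of `11` for the same term, `lookup` reads a different slot and the learned weight is never applied.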

jiangxb1987 · 2018-02-14T08:30:54Z

How about adding a new config to control whether to use the new Murmur3 hash function, turned off by default? We would also have to document the change explicitly. WDYT @gatorsmile @hvanhovell @cloud-fan ?

hvanhovell · 2018-02-14T12:54:50Z

@mrkm4ntr I see your point. Adding a method to Murmur3 would work.

The problem is that we are now going to release a FeatureHasher in Spark 2.3 that uses the current Murmur3 implementation. If we change this to use the correct Murmur3 implementation after the release of Spark 2.3 we will break all models using feature hashing created using Spark 2.3. This might be a blocker. Can you send an e-mail to the dev list?

cc @sameeragarwal @srowen for more visibility.

mrkm4ntr · 2018-02-14T14:08:41Z

@hvanhovell I sent an e-mail to the [VOTE] Spark 2.3.0 (RC3) thread.

mrkm4ntr · 2018-02-15T05:48:45Z

@hvanhovell I added a method and changed it so that we call it only from FeatureHasher.

felixcheung · 2018-02-15T06:33:09Z

Jenkins, test this please

SparkQA · 2018-02-15T08:05:02Z

Test build #87472 has finished for PR 20568 at commit 336bce0.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-02-15T09:15:00Z

common/sketch/src/main/java/org/apache/spark/util/sketch/Murmur3_x86_32.java

nit: Use this method for new components after Spark 2.3

mrkm4ntr · 2018-02-16

Thanks, fixed it.

kiszk · 2018-02-15T14:33:11Z

Retest this please

mrkm4ntr · 2018-02-16T01:47:22Z

I cannot reproduce this test failure in my environment.
It does not seem to be related to this change.

kiszk · 2018-02-16T02:03:26Z

@mrkm4ntr Do not worry about these failures. We know there are some unstable tests, and the community is trying to fix them. For now, we have to re-trigger the tests.

kiszk · 2018-02-16T02:03:38Z

Retest this please

ueshin · 2018-02-16T03:26:39Z

Jenkins, retest this please.

SparkQA · 2018-02-16T06:31:45Z

Test build #87501 has finished for PR 20568 at commit c20cd97.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-02-16T06:38:12Z

Jenkins, retest this please.

viirya · 2018-02-16T08:38:53Z

retest this please.

viirya · 2018-02-16T08:42:08Z

mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala

+   * See SPARK-23381.
+   */
+  @Since("2.3.0")
+  def murmur3Hash(term: Any): Int = {


Maybe private[feature]?

gatorsmile · 2018-02-16

I would also address this comment.

felixcheung · 2018-02-16T09:41:30Z

Jenkins, retest this please.

SparkQA · 2018-02-16T12:58:20Z

Test build #87509 has finished for PR 20568 at commit c20cd97.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-02-16T13:37:58Z

Jenkins, retest this please.

hvanhovell · 2018-02-16T17:48:10Z

@mrkm4ntr This is a legitimate failure. Can you fix the Python tests?

sameeragarwal · 2018-02-16T19:04:59Z

@hvanhovell just to make sure, given the dependency on FeatureHasher, should this block RC4?

jkbradley · 2018-02-16T20:24:21Z

(updated)

For ML, I actually don't think this has to be a blocker. It's not great, but it's not a regression.

However, we should definitely fix this in the future and soon: For ML, it's really important that MurmurHash3 behave consistently across platforms.

To fix this, we'll need to keep the old implementation of MurmurHash3 to maintain the behavior of ML Pipelines exported from previous versions of Spark.

gatorsmile · 2018-02-16T20:53:50Z

To speed up the work here, I will take this over. All credit for the contribution should go to @mrkm4ntr.

Thanks for your work! @mrkm4ntr

gatorsmile · 2018-02-16T21:25:08Z

Submitted the PR #20630 to take this over.

viirya · 2018-02-17T02:53:02Z

I think we can close this now.

gatorsmile · 2018-02-17T04:36:31Z

@mrkm4ntr Thank you for your contribution! The PR has been merged using your GitHub account. Could you close this?

mrkm4ntr · 2018-02-17T05:12:13Z

@gatorsmile Thanks! I will close it.

kiszk reviewed Feb 11, 2018

mrkm4ntr force-pushed the spark-23381 branch from 8856e39 to 905ae19 Compare February 11, 2018 09:47

mrkm4ntr force-pushed the spark-23381 branch from 905ae19 to 336bce0 Compare February 15, 2018 05:41

cloud-fan reviewed Feb 15, 2018

[SPARK-23381][CORE] Fix Murmur3 for byte arrays whose byte length is not a multiple of 4 (commit c20cd97)

mrkm4ntr force-pushed the spark-23381 branch from 336bce0 to c20cd97 Compare February 16, 2018 01:39

viirya reviewed Feb 16, 2018

mrkm4ntr closed this Feb 17, 2018
