[SPARK-12480][SQL] add Hash expression that can calculate hash value for a group of expressions #10435

cloud-fan · 2015-12-22T15:26:59Z

Just write the arguments into an UnsafeRow and use Murmur3 to calculate the hash code.
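The idea in the description can be sketched without Spark. This is only an illustration under stated assumptions: the fixed 8-bytes-per-field layout and the names RowHashSketch, toBytes and hashFields are hypothetical and do not reflect UnsafeRow's real format or the code in this patch. The only library call used is scala.util.hashing.MurmurHash3.bytesHash from the Scala standard library.

```scala
import java.nio.ByteBuffer
import scala.util.hashing.MurmurHash3

object RowHashSketch {
  // Hypothetical stand-in for "write the arguments into a binary row":
  // every field is stored as 8 bytes, one after another.
  def toBytes(fields: Seq[Long]): Array[Byte] = {
    val buf = ByteBuffer.allocate(8 * fields.length)
    fields.foreach(f => buf.putLong(f))
    buf.array()
  }

  // Hash the whole serialized group at once with Murmur3 and an explicit seed.
  def hashFields(fields: Seq[Long], seed: Int = 42): Int =
    MurmurHash3.bytesHash(toBytes(fields), seed)
}
```

Hashing the serialized bytes means one code path covers any number of arguments, at the cost of materializing the row first.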

cloud-fan · 2015-12-22T15:29:16Z

SparkQA · 2015-12-22T15:39:31Z

Test build #48203 has finished for PR 10435 at commit 53b0ec5.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hash(children: Seq[Expression]) extends Expression

yhuai · 2015-12-22T18:02:08Z

Looks like we can also mention https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFHash.java.

yhuai · 2015-12-22T18:03:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala

How about we add more content here to explain the algorithm and mention that we are following Hive?

Oh, it would also be good to mention the compatibility benefit of following Hive here.

yhuai · 2015-12-22T18:05:24Z

Regarding testing this, I am wondering if we can add it to the function registry. Then all queries that use hash will use this implementation, and we can see if any test fails.

yhuai · 2015-12-22T18:08:31Z

It is also important to use this hash function in Exchange.

nongli · 2015-12-22T21:40:48Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala

Which path handles string?

The last one: case other => other.hashCode()

SparkQA · 2015-12-23T13:57:48Z

Test build #48241 has finished for PR 10435 at commit 79a5738.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class JavaWordBlacklist
- class JavaDroppedWordsCounter
- case class Hash(children: Seq[Expression]) extends Expression

SparkQA · 2015-12-24T03:03:56Z

Test build #48272 has finished for PR 10435 at commit f12cbb6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-12-24T03:13:47Z

Test build #48274 has finished for PR 10435 at commit 207ca84.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hash(children: Seq[Expression]) extends Expression

SparkQA · 2015-12-24T04:54:08Z

Test build #48281 has finished for PR 10435 at commit e4d7b82.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hash(children: Seq[Expression]) extends Expression

SparkQA · 2015-12-24T07:54:32Z

Test build #48291 has finished for PR 10435 at commit fcb5af9.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hash(children: Seq[Expression]) extends Expression

SparkQA · 2015-12-25T17:05:42Z

Test build #48325 has finished for PR 10435 at commit 04a7301.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hash(children: Seq[Expression]) extends Expression

SparkQA · 2015-12-26T07:21:51Z

Test build #48337 has finished for PR 10435 at commit f408f1f.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hash(children: Seq[Expression]) extends Expression

SparkQA · 2015-12-26T16:11:45Z

Test build #48346 has finished for PR 10435 at commit 303b69b.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hash(children: Seq[Expression]) extends Expression

cloud-fan · 2015-12-27T09:00:23Z

retest this please.

SparkQA · 2015-12-27T09:30:16Z

Test build #48349 has finished for PR 10435 at commit c130097.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hash(children: Seq[Expression]) extends Expression

SparkQA · 2015-12-27T09:33:52Z

Test build #48350 has finished for PR 10435 at commit c130097.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hash(children: Seq[Expression]) extends Expression

SparkQA · 2015-12-27T11:48:06Z

Test build #48352 has finished for PR 10435 at commit a629e75.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hash(children: Seq[Expression]) extends Expression

nongli · 2015-12-30T06:09:14Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala

What if we just turned this into Murmur3Hash instead?

This would just do UnsafeProjection.create()
project(input).hashCode()

Murmur3 will give us much nicer hashing properties. The current hash function can be bad in reasonable cases.

For example, suppose the long column is a timestamp in milliseconds from a source that samples every second. Most of the low digits will be similar (e.g. values are 1000, 2002, 2999, etc., with very few ending in 500). The hash function does a very bad job of breaking this up, which will generate some very skewed partitions.

Good point!
Having decided not to follow Hive, I agree Murmur3Hash is a better choice.
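The skew concern above can be reproduced with a small standalone sketch. It rests on assumptions not stated in the thread: the "naive" hash is taken to be the Java-style long hash (the same formula as java.lang.Long.hashCode), which may not be exactly what the patch used at that point; 200 partitions and a jitter of -2..2 ms are arbitrary choices; and SkewSketch and its members are hypothetical names. Murmur3 comes from scala.util.hashing.MurmurHash3 rather than from Spark.

```scala
import scala.util.hashing.MurmurHash3

object SkewSketch {
  val numPartitions = 200

  // Timestamps in ms, one sample per second, with a deterministic jitter of -2..2 ms.
  val timestamps: Seq[Long] = (1 to 10000).map(i => i * 1000L + (i % 5 - 2))

  // Assumption: Java-style long hash, i.e. fold the high 32 bits into the low 32 bits.
  def naiveHash(v: Long): Int = (v ^ (v >>> 32)).toInt

  // Murmur3 over the two 32-bit halves of the long.
  def murmurHash(v: Long, seed: Int = 42): Int = {
    val h1 = MurmurHash3.mix(seed, v.toInt)
    val h2 = MurmurHash3.mixLast(h1, (v >>> 32).toInt)
    MurmurHash3.finalizeHash(h2, 8)
  }

  // Number of distinct partitions that receive at least one timestamp.
  def partitionsUsed(hash: Long => Int): Int =
    timestamps.map(v => Math.floorMod(hash(v), numPartitions)).distinct.size

  val naiveUsed: Int = partitionsUsed(naiveHash)
  val murmurUsed: Int = partitionsUsed(v => murmurHash(v))
}
```

With these inputs the naive hash reduces to the value itself, and since 1000 is a multiple of 200 only the jitter survives the modulo, so naiveUsed is 5 out of 200 partitions. The Murmur3 variant mixes all bits before the modulo and spreads the same values over far more partitions.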

cloud-fan · 2015-12-30T15:15:33Z

Closing, will open another PR to use UnsafeRow.hashCode for shuffle and fix tests.

rxin · 2015-12-30T18:15:14Z

On the contrary, I think we should consider having the hash code expression, for three reasons:

We still need a hash SQL function (currently delegated to Hive)
We get code gen using an expression
It is easier to control (being able to pass a seed or use it for bloom filters)
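The point about controlling the seed can be illustrated with a rough sketch. Everything here is hypothetical (SeededHashSketch, hashRow, bloomPositions) and is not the expression added by this PR; it only shows why an explicit seed parameter is handy, using scala.util.hashing.MurmurHash3.orderedHash from the Scala standard library.

```scala
import scala.util.hashing.MurmurHash3

object SeededHashSketch {
  // Hash a group of values, in order, starting from an explicit seed.
  def hashRow(values: Seq[Any], seed: Int): Int =
    MurmurHash3.orderedHash(values, seed)

  // A Bloom filter needs several hash functions over the same input;
  // reusing one function with different seeds is a simple way to get them.
  def bloomPositions(values: Seq[Any], numHashes: Int, numBits: Int): Seq[Int] =
    (0 until numHashes).map(seed => Math.floorMod(hashRow(values, seed), numBits))
}
```

For example, SeededHashSketch.bloomPositions(Seq(1, "a"), 3, 1024) yields three bit positions in the range 0 to 1023 for the same row.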

nongli · 2015-12-30T20:54:22Z

It makes sense to still have a Hash expression (called, more specifically, Murmur3Hash) that does what this patch originally intended. I think this will be a useful primitive. The underlying implementation can just use UnsafeRow.hashCode for now.

cloud-fan · 2015-12-31T04:11:09Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala

Here I didn't use hash for the name, as it would break a lot of Hive compatibility tests.

How many does it break?

can you give me a list? i think we should consider just blacklisting them ...

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48241/consoleFull

Most of them are testing something else but coincidentally include a hash expression.

maybe we can have a flag to control this -- when in hive compatibility test, fall back to Hive's, and otherwise our own?

sounds good to me, let me try it out.

SparkQA · 2015-12-31T05:54:36Z

Test build #48540 has finished for PR 10435 at commit 61783e7.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Murmur3Hash(children: Seq[Expression], seed: Int) extends Expression

SparkQA · 2015-12-31T06:24:19Z

Test build #48542 has finished for PR 10435 at commit b95e64e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-04T08:55:57Z

Test build #48644 has finished for PR 10435 at commit aa57583.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-04T13:16:26Z

Test build #48656 has finished for PR 10435 at commit 2c1e963.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-01-04T15:03:59Z

I think we should open another PR to use this hash expression in Exchange, as it will break a lot of tests and make this one harder to review.

nongli · 2016-01-04T20:45:00Z

LGTM

SparkQA · 2016-01-05T02:41:59Z

Test build #48699 has finished for PR 10435 at commit 9a978c4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-01-05T02:48:49Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

Should this just be a single column vararg, rather than one column followed by a vararg?

The hash function should take at least one parameter. Does @scala.annotation.varargs support this?

You can use the following form:
(firstarg:Int)(more:Int*)
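The three signature shapes discussed in this thread can be compared side by side. This is a sketch with hypothetical names (VarargShapes and its methods) that uses String in place of Column so it runs without Spark; the require call is only a stand-in for whatever check rejects an empty argument list.

```scala
object VarargShapes {
  // One argument followed by a vararg: "at least one" is enforced at compile time.
  def hashAtLeastOne(first: String, rest: String*): Seq[String] = first +: rest

  // Curried form suggested in review: (firstarg)(more*).
  def hashCurried(first: String)(rest: String*): Seq[String] = first +: rest

  // Single vararg: emptiness can only be rejected at run time.
  def hashVararg(cols: String*): Seq[String] = {
    require(cols.nonEmpty, "hash requires at least one argument")
    cols
  }
}
```

The single-vararg shape is the easiest to call when the argument list is built programmatically, for example VarargShapes.hashVararg(names: _*), while the other two shapes force the caller to split off the first element.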

rxin · 2016-01-05T02:49:59Z

I've merged this. You can address the API comment in the next pull request. Thanks.

address comments in apache#10435
This makes the API easier to use if users programmatically generate the call to hash, and they will get an analysis exception if the arguments of hash are empty.
Author: Wenchen Fan
Closes apache#10588 from cloud-fan/hash.

yhuai reviewed Dec 22, 2015

nongli reviewed Dec 22, 2015

cloud-fan force-pushed the hash-expr branch from f12cbb6 to 207ca84 Compare December 24, 2015 01:37

cloud-fan force-pushed the hash-expr branch from 207ca84 to e4d7b82 Compare December 24, 2015 03:22

cloud-fan force-pushed the hash-expr branch from e4d7b82 to fcb5af9 Compare December 24, 2015 06:29

cloud-fan force-pushed the hash-expr branch from fcb5af9 to 04a7301 Compare December 25, 2015 15:32

cloud-fan force-pushed the hash-expr branch from 04a7301 to f408f1f Compare December 26, 2015 05:53

cloud-fan force-pushed the hash-expr branch 2 times, most recently from 1b56480 to 303b69b Compare December 26, 2015 14:39

cloud-fan force-pushed the hash-expr branch from c130097 to a629e75 Compare December 27, 2015 10:14

cloud-fan force-pushed the hash-expr branch from a629e75 to 655800c Compare December 27, 2015 23:36

nongli reviewed Dec 30, 2015

cloud-fan closed this Dec 30, 2015

cloud-fan reopened this Dec 31, 2015

add hash expression

61783e7

cloud-fan force-pushed the hash-expr branch from 8703b1a to 61783e7 Compare December 31, 2015 04:08

cloud-fan reviewed Dec 31, 2015

update

b95e64e

address comments

2c1e963

cloud-fan force-pushed the hash-expr branch from aa57583 to 2c1e963 Compare January 4, 2016 11:29

Merge remote-tracking branch 'origin/master' into mumur3_hash

9a978c4

rxin reviewed Jan 5, 2016

asfgit closed this in b1a7712 Jan 5, 2016

cloud-fan deleted the hash-expr branch January 5, 2016 05:43

cloud-fan mentioned this pull request Jan 5, 2016

[SPARK-12480][follow-up] use a single column vararg for hash #10588

Closed
