Skip to content

Conversation

@cloud-fan
Copy link
Contributor

just write the arguments into unsafe row and use murmur3 to calculate hash code

@cloud-fan
Copy link
Contributor Author

cc @yhuai @rxin

@SparkQA
Copy link

SparkQA commented Dec 22, 2015

Test build #48203 has finished for PR 10435 at commit 53b0ec5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class Hash(children: Seq[Expression]) extends Expression\n

@yhuai
Copy link
Contributor

yhuai commented Dec 22, 2015

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we add more contents at here to explain the algorithm and mentions that we are following Hive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, it is also good to mention the compatibility benefit of following Hive at here.

@yhuai
Copy link
Contributor

yhuai commented Dec 22, 2015

Regarding testing this, I am wondering if we can add it to function registry. So, all queries that use hash will use this implementation and we can see if there is any failed test.

@yhuai
Copy link
Contributor

yhuai commented Dec 22, 2015

It is also important to use this hash function in Exchange.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which path handles string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last one case other => other.hashCode()

@SparkQA
Copy link

SparkQA commented Dec 23, 2015

Test build #48241 has finished for PR 10435 at commit 79a5738.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * class JavaWordBlacklist\n * class JavaDroppedWordsCounter\n * case class Hash(children: Seq[Expression]) extends Expression\n

@SparkQA
Copy link

SparkQA commented Dec 24, 2015

Test build #48272 has finished for PR 10435 at commit f12cbb6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 24, 2015

Test build #48274 has finished for PR 10435 at commit 207ca84.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class Hash(children: Seq[Expression]) extends Expression\n

@SparkQA
Copy link

SparkQA commented Dec 24, 2015

Test build #48281 has finished for PR 10435 at commit e4d7b82.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class Hash(children: Seq[Expression]) extends Expression\n

@SparkQA
Copy link

SparkQA commented Dec 24, 2015

Test build #48291 has finished for PR 10435 at commit fcb5af9.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class Hash(children: Seq[Expression]) extends Expression\n

@SparkQA
Copy link

SparkQA commented Dec 25, 2015

Test build #48325 has finished for PR 10435 at commit 04a7301.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class Hash(children: Seq[Expression]) extends Expression\n

@SparkQA
Copy link

SparkQA commented Dec 26, 2015

Test build #48337 has finished for PR 10435 at commit f408f1f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class Hash(children: Seq[Expression]) extends Expression\n

@cloud-fan cloud-fan force-pushed the hash-expr branch 2 times, most recently from 1b56480 to 303b69b Compare December 26, 2015 14:39
@SparkQA
Copy link

SparkQA commented Dec 26, 2015

Test build #48346 has finished for PR 10435 at commit 303b69b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class Hash(children: Seq[Expression]) extends Expression\n

@cloud-fan
Copy link
Contributor Author

retest this please.

@SparkQA
Copy link

SparkQA commented Dec 27, 2015

Test build #48349 has finished for PR 10435 at commit c130097.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class Hash(children: Seq[Expression]) extends Expression\n

@SparkQA
Copy link

SparkQA commented Dec 27, 2015

Test build #48350 has finished for PR 10435 at commit c130097.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class Hash(children: Seq[Expression]) extends Expression\n

@SparkQA
Copy link

SparkQA commented Dec 27, 2015

Test build #48352 has finished for PR 10435 at commit a629e75.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class Hash(children: Seq[Expression]) extends Expression\n

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we just turned this into Mumur3Hash instead?

This would just do UnsafeProjection.create()
project(input).hashCode()

Murmur3 will give us much nicer hashing properties. The current hash function can be bad in reasonable cases.

For example, if the long column is a timestamp in milis from a source that samples every second. Most of the low digits will be similar (e.g. values are 1000, 2002, 2999, etc. Very few that end in 500). The hash function does a very bad job of breaking this up and this will generate some very skewed partitions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point!
after decided to not follow hive, I agree Mumur3Hash is a better choice.

@cloud-fan
Copy link
Contributor Author

Closing, will open another PR to use UnsafeRow.hashCode for shuffle and fix tests.

@cloud-fan cloud-fan closed this Dec 30, 2015
@rxin
Copy link
Contributor

rxin commented Dec 30, 2015

On the contrary I think we should consider having the hash code expression for two reasons:

  1. We still need a hash SQL function (currently delegated to Hive)
  2. We get code gen using an expression
  3. It is easier to control (being able to pass a seed or use it for bloom filters)

@nongli
Copy link
Contributor

nongli commented Dec 30, 2015

It makes sense to still have a Hash expression (called more specifically, Mumur3Hash) that does what this patch originally intended. I think this will be a useful primitive. The underlying implementation can just use UnsafeRow.hashCode for now.

@cloud-fan cloud-fan reopened this Dec 31, 2015
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I didn't use hash for the name, as it will break a lot of hive compatibility tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many does it break?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

35 tests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give me a list? i think we should consider just blacklisting them ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48241/consoleFull

most of them is testing something else but coincidently include hash expression.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can have a flag to control this -- when in hive compatibility test, fall back to Hive's, and otherwise our own?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good to me, let me try it out.

@SparkQA
Copy link

SparkQA commented Dec 31, 2015

Test build #48540 has finished for PR 10435 at commit 61783e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Murmur3Hash(children: Seq[Expression], seed: Int) extends Expression

@SparkQA
Copy link

SparkQA commented Dec 31, 2015

Test build #48542 has finished for PR 10435 at commit b95e64e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 4, 2016

Test build #48644 has finished for PR 10435 at commit aa57583.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 4, 2016

Test build #48656 has finished for PR 10435 at commit 2c1e963.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

I think we should open another PR to use this hash expression in Exchange, as it will break a lof of tests and make it harder to review.

@nongli
Copy link
Contributor

nongli commented Jan 4, 2016

LGTM

@SparkQA
Copy link

SparkQA commented Jan 5, 2016

Test build #48699 has finished for PR 10435 at commit 9a978c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should just be a single column vararg, rather than one followed by vararg?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the hash function should take at least one parameter, does @scala.annotation.varargs support this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use the following form:
(firstarg:Int)(more:Int*)

@rxin
Copy link
Contributor

rxin commented Jan 5, 2016

I've merged this. You can address the API comment in the next pull request. Thanks.

@asfgit asfgit closed this in b1a7712 Jan 5, 2016
@cloud-fan cloud-fan deleted the hash-expr branch January 5, 2016 05:43
smarthi pushed a commit to smarthi/spark that referenced this pull request Jan 5, 2016
address comments in apache#10435

This makes the API easier to use if user programmatically generate the call to hash, and they will get analysis exception if the arguments of hash is empty.

Author: Wenchen Fan <[email protected]>

Closes apache#10588 from cloud-fan/hash.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants