
Conversation

Contributor

@yucai commented Apr 25, 2018

What changes were proposed in this pull request?

HashAggregate uses the same hash algorithm and seed as ShuffleExchange, which can lead to severe hash collisions when shuffle.partitions = 8192 * n.

Consider the example below:

SET spark.sql.shuffle.partitions=8192;
INSERT OVERWRITE TABLE target_xxx
SELECT
 item_id,
 auct_end_dt
FROM
 source_xxx
GROUP BY
 item_id,
 auct_end_dt;

In the shuffle stage, if the user sets spark.sql.shuffle.partitions = 8192, all tuples routed to the same partition satisfy the following relationship:

hash(tuple x) = hash(tuple y) + n * 8192

Then, in the next HashAggregate stage, all tuples from the same partition must be inserted into a 16K-slot BytesToBytesMap (unsafeRowAggBuffer).

Here, HashAggregate uses the same hash algorithm on the same expressions as the shuffle, with the same seed, and 16K = 8192 * 2, so all tuples in the same partition are hashed to only 2 distinct slots in the BytesToBytesMap. That is a severe hash collision, and as the BytesToBytesMap grows, the collision persists.
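The arithmetic above can be sketched in a short, hypothetical Python simulation. A murmur-style 32-bit finalizer stands in for Spark's Murmur3Hash, and all names are illustrative; the point is only that `h % 8192 == 0` forces `h % 16384` into `{0, 8192}` when both stages share the hash and seed:

```python
NUM_PARTITIONS = 8192              # spark.sql.shuffle.partitions
MAP_CAPACITY = 2 * NUM_PARTITIONS  # 16K-slot BytesToBytesMap

def mix32(key: int, seed: int = 42) -> int:
    # murmur-style 32-bit finalizer mix; a stand-in for Spark's Murmur3Hash
    x = (key ^ seed) & 0xFFFFFFFF
    x ^= x >> 16
    x = (x * 0x85EBCA6B) & 0xFFFFFFFF
    x ^= x >> 13
    x = (x * 0xC2B2AE35) & 0xFFFFFFFF
    x ^= x >> 16
    return x

# keys that the shuffle routes to partition 0
keys = [k for k in range(300_000) if mix32(k) % NUM_PARTITIONS == 0]

# if HashAggregate reuses the same hash and seed, only 2 map slots are
# reachable: h % 8192 == 0 implies h % 16384 is either 0 or 8192
same_seed = {mix32(k) % MAP_CAPACITY for k in keys}
print(sorted(same_seed))           # a subset of [0, 8192]

# a different seed decorrelates the two stages and spreads the keys out
other_seed = {mix32(k, seed=48) % MAP_CAPACITY for k in keys}
print(len(other_seed))
```

Under these assumptions the same-seed run occupies at most two slots regardless of how many keys the partition holds, while the different-seed run spreads the same keys across many slots.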

Before change:
[screenshot: hash_conflict]

After change:
[screenshot: no_hash_conflict]

How was this patch tested?

Unit tests and production cases.

Member

maropu commented Apr 25, 2018

oh... good catch. I think you'd better put the detailed info (written in the JIRA) in the description above?

Member


How about writing a comment on the reason why we set this seed value here?

Contributor Author

yucai commented Apr 25, 2018

@maropu Thanks for the comments, I have updated the description. Could you help take a look?

Member

maropu commented Apr 25, 2018

cc: @hvanhovell

Member

viirya commented Apr 25, 2018

Can you also show the screenshot after this change?


SparkQA commented Apr 25, 2018

Test build #89830 has finished for PR 21149 at commit 5e88468.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 25, 2018

Test build #89839 has finished for PR 21149 at commit 0818618.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

maropu commented May 8, 2018

ping @hvanhovell @gatorsmile

@hvanhovell
Contributor

LGTM - merging to master. Thanks!

@asfgit closed this in e17567c on May 8, 2018
Contributor Author

yucai commented May 8, 2018

@maropu @hvanhovell thanks very much!

// SPARK-24076: HashAggregate uses the same hash algorithm on the same expressions
// as ShuffleExchange, it may lead to bad hash conflict when shuffle.partitions=8192*n,
// pick a different seed to avoid this conflict
val hashExpr = Murmur3Hash(groupingExpressions, 48)
Contributor


can we just use UnsafeRow.hashCode here?

Contributor Author


@cloud-fan you mean unsafeRowKeys.hashCode(), right?
I think it is a good idea: the unsafe row has the [null bit set] etc., so the result should be different, and we wouldn't need the odd seed of 48 either. Do you want me to create a follow-up PR?

Contributor


yes please, thanks!


@cloud-fan would this be slower, since we are now moving to an interpreted version for hash-code generation? If not, why didn't we use unsafeRowKeys.hashCode() in the first place?

Contributor


it should be faster, as unsafeRowKeys.hashCode() is just one function call. I don't know why we didn't do it in the first place; the code is pretty old.

cloud-fan pushed a commit that referenced this pull request Feb 19, 2019
…n HashAggregate

## What changes were proposed in this pull request?

This is a followup PR for #21149.

The new approach uses unsafeRow.hashCode() as the hash value in HashAggregate.
The unsafe row includes the [null bit set] etc., so its hash differs from the shuffle hash, and we no longer need a special seed.

## How was this patch tested?

UTs.

Closes #23821 from yucai/unsafe_hash.

Authored-by: yucai <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
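The follow-up's reasoning, that hashing the row's full byte image (null bitset included) decorrelates from the column-wise shuffle hash even without a special seed, can be sketched as follows. This is a toy model, not Spark's actual UnsafeRow layout code: a standard MurmurHash3 x86 32-bit routine is applied once to a single int field (the shuffle-style hash) and once to a simplified row image of an 8-byte null-bitset word followed by an 8-byte field:

```python
def murmur3_32(data: bytes, seed: int = 42) -> int:
    """Standard MurmurHash3 x86 32-bit (the same family Spark uses)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n4 = len(data) & ~3
    for i in range(0, n4, 4):             # 4-byte blocks
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    k = 0
    for b in reversed(data[n4:]):         # tail bytes
        k = (k << 8) | b
    if len(data) > n4:
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    h ^= len(data)                        # finalizer
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

NUM_PARTITIONS, MAP_CAPACITY = 8192, 16384

def shuffle_hash(v: int) -> int:
    # column-wise hash of a single int field, as the shuffle would compute
    return murmur3_32(v.to_bytes(4, "little"))

def row_hash(v: int) -> int:
    # hash of a toy row byte image: 8-byte null-bitset word + 8-byte field
    return murmur3_32(b"\x00" * 8 + v.to_bytes(8, "little"))

# keys that all land in shuffle partition 0 under the column-wise hash
keys = [k for k in range(300_000) if shuffle_hash(k) % NUM_PARTITIONS == 0]

# reusing the column-wise hash pins them to 2 slots of the 16K map ...
assert {shuffle_hash(k) % MAP_CAPACITY for k in keys} <= {0, NUM_PARTITIONS}

# ... while hashing the row's byte image spreads them out
print(len({row_hash(k) % MAP_CAPACITY for k in keys}))
```

Because the hashed byte stream is different (extra null-bitset word, wider field encoding), the row-image hash is effectively independent of the shuffle hash, which is why no "weird 48" seed is needed.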