[SQL] Set outputPartitioning of BroadcastHashJoin correctly. #1735

yhuai · 2014-08-02T05:47:07Z

I think we will not generate the plan triggering this bug at this moment. But, let me explain it...

Right now, we are using left.outputPartitioning as the outputPartitioning of a BroadcastHashJoin. We may have a wrong physical plan for cases like...

SELECT l.key, count(*)
FROM (SELECT key, count(*) as cnt
      FROM src
      GROUP BY key) l // This is buildPlan
JOIN r // This is the streamedPlan
ON (l.cnt = r.value)
GROUP BY l.key

Let's say we have a BroadcastHashJoin on l and r. For this case, we will pick l's outputPartitioning for the outputPartitioningof the BroadcastHashJoin on l and r. Also, because the last GROUP BY is using l.key as the key, we will not introduce an Exchange for this aggregation. However, r's outputPartitioning may not match the required distribution of the last GROUP BY and we fail to group data correctly.

JIRA is being reindexed. I will create a JIRA ticket once it is back online.

SparkQA · 2014-08-02T05:49:30Z

QA tests have started for PR 1735. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17754/consoleFull

SparkQA · 2014-08-02T07:19:03Z

QA results for PR 1735:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17754/consoleFull

marmbrus · 2014-08-02T20:17:31Z

This is pretty minor and we can't even come up with a case where we plan incorrectly. I'm just going to merge this into master and 1.1 without a JIRA.

marmbrus · 2014-08-02T20:17:50Z

Thanks for fixing it!

I think we will not generate the plan triggering this bug at this moment. But, let me explain it... Right now, we are using `left.outputPartitioning` as the `outputPartitioning` of a `BroadcastHashJoin`. We may have a wrong physical plan for cases like... ```sql SELECT l.key, count(*) FROM (SELECT key, count(*) as cnt FROM src GROUP BY key) l // This is buildPlan JOIN r // This is the streamedPlan ON (l.cnt = r.value) GROUP BY l.key ``` Let's say we have a `BroadcastHashJoin` on `l` and `r`. For this case, we will pick `l`'s `outputPartitioning` for the `outputPartitioning`of the `BroadcastHashJoin` on `l` and `r`. Also, because the last `GROUP BY` is using `l.key` as the key, we will not introduce an `Exchange` for this aggregation. However, `r`'s outputPartitioning may not match the required distribution of the last `GROUP BY` and we fail to group data correctly. JIRA is being reindexed. I will create a JIRA ticket once it is back online. Author: Yin Huai <[email protected]> Closes #1735 from yhuai/BroadcastHashJoin and squashes the following commits: 96d9cb3 [Yin Huai] Set outputPartitioning correctly. (cherry picked from commit 67bd8e3) Signed-off-by: Michael Armbrust <[email protected]>

I think we will not generate the plan triggering this bug at this moment. But, let me explain it... Right now, we are using `left.outputPartitioning` as the `outputPartitioning` of a `BroadcastHashJoin`. We may have a wrong physical plan for cases like... ```sql SELECT l.key, count(*) FROM (SELECT key, count(*) as cnt FROM src GROUP BY key) l // This is buildPlan JOIN r // This is the streamedPlan ON (l.cnt = r.value) GROUP BY l.key ``` Let's say we have a `BroadcastHashJoin` on `l` and `r`. For this case, we will pick `l`'s `outputPartitioning` for the `outputPartitioning`of the `BroadcastHashJoin` on `l` and `r`. Also, because the last `GROUP BY` is using `l.key` as the key, we will not introduce an `Exchange` for this aggregation. However, `r`'s outputPartitioning may not match the required distribution of the last `GROUP BY` and we fail to group data correctly. JIRA is being reindexed. I will create a JIRA ticket once it is back online. Author: Yin Huai <[email protected]> Closes apache#1735 from yhuai/BroadcastHashJoin and squashes the following commits: 96d9cb3 [Yin Huai] Set outputPartitioning correctly.

Set outputPartitioning correctly.

96d9cb3

asfgit closed this in 67bd8e3 Aug 2, 2014

yhuai deleted the BroadcastHashJoin branch October 6, 2014 16:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SQL] Set outputPartitioning of BroadcastHashJoin correctly. #1735

[SQL] Set outputPartitioning of BroadcastHashJoin correctly. #1735

Uh oh!

yhuai commented Aug 2, 2014

Uh oh!

SparkQA commented Aug 2, 2014

Uh oh!

SparkQA commented Aug 2, 2014

Uh oh!

marmbrus commented Aug 2, 2014

Uh oh!

marmbrus commented Aug 2, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[SQL] Set outputPartitioning of BroadcastHashJoin correctly. #1735

[SQL] Set outputPartitioning of BroadcastHashJoin correctly. #1735

Uh oh!

Conversation

yhuai commented Aug 2, 2014

Uh oh!

SparkQA commented Aug 2, 2014

Uh oh!

SparkQA commented Aug 2, 2014

Uh oh!

marmbrus commented Aug 2, 2014

Uh oh!

marmbrus commented Aug 2, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants