Skip to content

Conversation

@yhuai
Copy link
Contributor

@yhuai yhuai commented Aug 2, 2014

I think we will not generate the plan triggering this bug at this moment. But, let me explain it...

Right now, we are using left.outputPartitioning as the outputPartitioning of a BroadcastHashJoin. We may have a wrong physical plan for cases like...

SELECT l.key, count(*)
FROM (SELECT key, count(*) as cnt
      FROM src
      GROUP BY key) l // This is buildPlan
JOIN r // This is the streamedPlan
ON (l.cnt = r.value)
GROUP BY l.key

Let's say we have a BroadcastHashJoin on l and r. For this case, we will pick l's outputPartitioning for the outputPartitioningof the BroadcastHashJoin on l and r. Also, because the last GROUP BY is using l.key as the key, we will not introduce an Exchange for this aggregation. However, r's outputPartitioning may not match the required distribution of the last GROUP BY and we fail to group data correctly.

JIRA is being reindexed. I will create a JIRA ticket once it is back online.

@SparkQA
Copy link

SparkQA commented Aug 2, 2014

QA tests have started for PR 1735. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17754/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 2, 2014

QA results for PR 1735:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17754/consoleFull

@marmbrus
Copy link
Contributor

marmbrus commented Aug 2, 2014

This is pretty minor and we can't even come up with a case where we plan incorrectly. I'm just going to merge this into master and 1.1 without a JIRA.

@marmbrus
Copy link
Contributor

marmbrus commented Aug 2, 2014

Thanks for fixing it!

asfgit pushed a commit that referenced this pull request Aug 2, 2014
I think we will not generate the plan triggering this bug at this moment. But, let me explain it...

Right now, we are using `left.outputPartitioning` as the `outputPartitioning` of a `BroadcastHashJoin`. We may have a wrong physical plan for cases like...
```sql
SELECT l.key, count(*)
FROM (SELECT key, count(*) as cnt
      FROM src
      GROUP BY key) l // This is buildPlan
JOIN r // This is the streamedPlan
ON (l.cnt = r.value)
GROUP BY l.key
```
Let's say we have a `BroadcastHashJoin` on `l` and `r`. For this case, we will pick `l`'s `outputPartitioning` for the `outputPartitioning`of the `BroadcastHashJoin` on `l` and `r`. Also, because the last `GROUP BY` is using `l.key` as the key, we will not introduce an `Exchange` for this aggregation. However, `r`'s outputPartitioning may not match the required distribution of the last `GROUP BY` and we fail to group data correctly.

JIRA is being reindexed. I will create a JIRA ticket once it is back online.

Author: Yin Huai <[email protected]>

Closes #1735 from yhuai/BroadcastHashJoin and squashes the following commits:

96d9cb3 [Yin Huai] Set outputPartitioning correctly.

(cherry picked from commit 67bd8e3)
Signed-off-by: Michael Armbrust <[email protected]>
@asfgit asfgit closed this in 67bd8e3 Aug 2, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
I think we will not generate the plan triggering this bug at this moment. But, let me explain it...

Right now, we are using `left.outputPartitioning` as the `outputPartitioning` of a `BroadcastHashJoin`. We may have a wrong physical plan for cases like...
```sql
SELECT l.key, count(*)
FROM (SELECT key, count(*) as cnt
      FROM src
      GROUP BY key) l // This is buildPlan
JOIN r // This is the streamedPlan
ON (l.cnt = r.value)
GROUP BY l.key
```
Let's say we have a `BroadcastHashJoin` on `l` and `r`. For this case, we will pick `l`'s `outputPartitioning` for the `outputPartitioning`of the `BroadcastHashJoin` on `l` and `r`. Also, because the last `GROUP BY` is using `l.key` as the key, we will not introduce an `Exchange` for this aggregation. However, `r`'s outputPartitioning may not match the required distribution of the last `GROUP BY` and we fail to group data correctly.

JIRA is being reindexed. I will create a JIRA ticket once it is back online.

Author: Yin Huai <[email protected]>

Closes apache#1735 from yhuai/BroadcastHashJoin and squashes the following commits:

96d9cb3 [Yin Huai] Set outputPartitioning correctly.
@yhuai yhuai deleted the BroadcastHashJoin branch October 6, 2014 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants