Skip to content

Conversation

@cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

This is a followup of #23388 .

#23388 has an issue: it doesn't handle subquery expressions and assumes they will be turned into joins. However, this is not true for non-correlated subquery expressions.

This PR fixes this issue. It now doesn't skip Subquery, and subquery expressions will be handled by OptimizeSubqueries, which runs the optimizer with the subquery.

Note that, correlated subquery expressions will be handled twice: once in OptimizeSubqueries, once later when it becomes join. This is OK as NormalizeFloatingNumbers is idempotent now.

Why are the changes needed?

fix a bug

Does this PR introduce any user-facing change?

yes, see the newly added test.

How was this patch tested?

new test

@cloud-fan
Copy link
Contributor Author

cc @hvanhovell @maropu @viirya

def apply(plan: LogicalPlan): LogicalPlan = plan match {
// A subquery will be rewritten into join later, and will go through this rule
// eventually. Here we skip subquery, as we only need to run this rule once.
case _: Subquery => plan
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding tests for the subquery case in NormalizeFloatingPointNumbersSuite, too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we can't.

This fix relies on the rule OptimizeSubqueries, which is an inner object of the class Optimizer as it needs to rerun the entire optimizer for subquery. So we can't use OptimizeSubqueries in NormalizeFloatingPointNumbersSuite.

@SparkQA
Copy link

SparkQA commented Jun 10, 2020

Test build #123771 has finished for PR 28785 at commit 4dc6413.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Note that, this rule must be executed at the end of optimizer, because the optimizer may create
* new joins(the subquery rewrite) and new join conditions(the join reorder).
*/
object NormalizeFloatingNumbers extends Rule[LogicalPlan] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it also mean This batch must be executed after the RewriteSubquery batch, which creates joins. is not definitely true now?

Copy link
Contributor Author

@cloud-fan cloud-fan Jun 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's still true, the correlated subquery becomes join, and may have new join keys.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Makes sense.

@cloud-fan
Copy link
Contributor Author

thanks for the review, merging to master/3.0!

@cloud-fan cloud-fan closed this in 6fb9c80 Jun 11, 2020
cloud-fan added a commit that referenced this pull request Jun 11, 2020
### What changes were proposed in this pull request?

This is a followup of #23388 .

#23388 has an issue: it doesn't handle subquery expressions and assumes they will be turned into joins. However, this is not true for non-correlated subquery expressions.

This PR fixes this issue. It now doesn't skip `Subquery`, and subquery expressions will be handled by `OptimizeSubqueries`, which runs the optimizer with the subquery.

Note that, correlated subquery expressions will be handled twice: once in `OptimizeSubqueries`, once later when it becomes join. This is OK as `NormalizeFloatingNumbers` is idempotent now.

### Why are the changes needed?

fix a bug

### Does this PR introduce _any_ user-facing change?

yes, see the newly added test.

### How was this patch tested?

new test

Closes #28785 from cloud-fan/normalize.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 6fb9c80)
Signed-off-by: Wenchen Fan <[email protected]>
holdenk pushed a commit to holdenk/spark that referenced this pull request Jun 25, 2020
### What changes were proposed in this pull request?

This is a followup of apache#23388 .

apache#23388 has an issue: it doesn't handle subquery expressions and assumes they will be turned into joins. However, this is not true for non-correlated subquery expressions.

This PR fixes this issue. It now doesn't skip `Subquery`, and subquery expressions will be handled by `OptimizeSubqueries`, which runs the optimizer with the subquery.

Note that, correlated subquery expressions will be handled twice: once in `OptimizeSubqueries`, once later when it becomes join. This is OK as `NormalizeFloatingNumbers` is idempotent now.

### Why are the changes needed?

fix a bug

### Does this PR introduce _any_ user-facing change?

yes, see the newly added test.

### How was this patch tested?

new test

Closes apache#28785 from cloud-fan/normalize.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 6fb9c80)
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants