Skip to content

Conversation

@rednaxelafx
Copy link
Contributor

@rednaxelafx rednaxelafx commented Jan 26, 2019

What changes were proposed in this pull request?

Add verification of plan integrity with regards to special expressions being hosted only in supported operators. Specifically:

  • AggregateExpression: should only be hosted in Aggregate, or indirectly in Window
  • WindowExpression: should only be hosted in Window
  • Generator: should only be hosted in Generate

This will help us catch errors in future optimizer rules that incorrectly hoist special expression out of their supported operator.

TODO: This PR actually caught a bug in the analyzer in the test case SPARK-23957 Remove redundant sort from subquery plan(scalar subquery) in SubquerySuite, where a max() aggregate function is hosted in a Sort operator in the analyzed plan, which is invalid. That test case is disabled in this PR.
SPARK-26741 has been opened to track the fix in the analyzer.

How was this patch tested?

Added new test case in OptimizerStructuralIntegrityCheckerSuite

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use find?

@cloud-fan
Copy link
Contributor

I like this check! It can save us a lot of time when debugging!

@cloud-fan
Copy link
Contributor

LGTM except a code style comment.

@SparkQA
Copy link

SparkQA commented Jan 26, 2019

Test build #101699 has finished for PR 23658 at commit cf3f288.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…wExpression and Generator should only be hosted in corresponding operators
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have another check now. We should update the doc here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Updated

@rednaxelafx
Copy link
Contributor Author

Thanks for your reviews, @cloud-fan and @viirya !

I've updated the PR addressing your comments.

This PR has actually caught a genuine bug in the analyzer in one of the test cases:
For this query in SubquerySuite,

SELECT *
FROM   t1
WHERE  c1 = (SELECT   max(t2.c1)
             FROM     t2
             GROUP BY t2.c1
             HAVING   count(*) >= 1
             ORDER BY max(t2.c1))

The analyzer resolves it into:

== Analyzed Logical Plan ==
c1: int, c2: int
Project [c1#51, c2#52]
+- Filter (c1#51 = scalar-subquery#109 [])
   :  +- Project [max(c1)#137]
   :     +- Sort [max(c1#28) ASC NULLS FIRST], true                // NOTE HERE!!
   :        +- Project [max(c1)#137, c1#28]
   :           +- Filter (count(1)#139L >= cast(1 as bigint))
   :              +- Aggregate [c1#28], [max(c1#28) AS max(c1)#137, count(1) AS count(1)#139L, c1#28]
   :                 +- SubqueryAlias `t2`
   :                    +- Project [_1#23 AS c1#28, _2#24 AS c2#29]
   :                       +- LocalRelation [_1#23, _2#24]
   +- SubqueryAlias `t1`
      +- Project [_1#46 AS c1#51, _2#47 AS c2#52]
         +- LocalRelation [_1#46, _2#47]

... where Sort [max(c1#28) ASC NULLS FIRST], true is an example of a Sort operator hosting an aggregate expression max, incorrectly.

It's somewhat tedious to fix because we need to tweak the order a bit. Working on it.

@SparkQA
Copy link

SparkQA commented Jan 26, 2019

Test build #101712 has finished for PR 23658 at commit c7ccc42.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val newAttr = UnresolvedAttribute("unresolvedAttr")
Project(projectList ++ Seq(newAttr), child)
case agg @ Aggregate(Nil, aggregateExpressions, child) =>
Project(aggregateExpressions, child)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment here to explain Project is unable to handle Aggregate expressions.

@gatorsmile
Copy link
Member

Please open a JIRA and make the bug fix as a TO-DO item. Thanks!

@SparkQA
Copy link

SparkQA commented Jan 26, 2019

Test build #101718 has finished for PR 23658 at commit 4c34e98.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 27, 2019

Test build #101721 has finished for PR 23658 at commit c1d38dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

LGTM

Thanks! Merged to master.

@asfgit asfgit closed this in 860336d Jan 27, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

Add verification of plan integrity with regards to special expressions being hosted only in supported operators. Specifically:

- `AggregateExpression`: should only be hosted in `Aggregate`, or indirectly in `Window`
- `WindowExpression`: should only be hosted in `Window`
- `Generator`: should only be hosted in `Generate`

This will help us catch errors in future optimizer rules that incorrectly hoist special expression out of their supported operator.

TODO: This PR actually caught a bug in the analyzer in the test case `SPARK-23957 Remove redundant sort from subquery plan(scalar subquery)` in `SubquerySuite`, where a `max()` aggregate function is hosted in a `Sort` operator in the analyzed plan, which is invalid. That test case is disabled in this PR.
SPARK-26741 has been opened to track the fix in the analyzer.

## How was this patch tested?

Added new test case in `OptimizerStructuralIntegrityCheckerSuite`

Closes apache#23658 from rednaxelafx/plan-integrity.

Authored-by: Kris Mok <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants