-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-14495][SQL][1.6] fix resolution failure of having clause with distinct aggregate function #12974
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@rxin @cloud-fan Thanks! |
|
@xwu0226 Because this is for Branch 1.6 only, please update the PR title to |
|
OK to test |
|
So #11579 accidentally fixed this bug in 2.0? |
|
ok to test |
|
@cloud-fan Yes. |
|
Test build #58061 has finished for PR 12974 at commit
|
| .filter(_.isDistinct) | ||
| .groupBy(_.aggregateFunction.children.toSet) | ||
|
|
||
| val shouldRewrite = if (conf.specializeSingleDistinctAggPlanning) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This flag is for the purpose of benchmarking the performance of single distinct aggregation by DistinctAggregationRewriter. The default value is false, which means DistinctAggregationRewriter will not be used for a single distinct case. I see 2.0 has removed this flag, so i guess the decision has been made. If it is still needed for 1.6, I can add it back, which will involves more change in Optimizer to take the CatalystConf. Please let me know. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even it's not that useful, we should not remove it in minor release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davies Thanks for your input! Let me modify Optimizer to add the conf parameter.
|
@davies @cloud-fan I modified the change to keep the SQLConf property. In order to pass |
|
Test build #58184 has finished for PR 12974 at commit
|
|
Test build #58183 has finished for PR 12974 at commit
|
|
retest this please |
1 similar comment
|
retest this please |
| } | ||
|
|
||
| test("single distinct column set") { | ||
| Seq(true, false).foreach { specializeSingleDistinctAgg => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This config is not removed, looks like we don't need to change this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. you are right. I will add it back.
| } | ||
| } | ||
|
|
||
| test("SPARK-14495: two distinct aggregation with having clause of one distinct aggregation") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these 4 tests look a little verbose to me. The key to trigger this bug is to put distinct aggregate function in having, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Thanks for pointing it out! The original thinking was that because I removed the configuration property, single distinct aggregate case will not get into rewrite in Optimizer. Therefore, I wanted to cover cases for both single distinct and multi-distinct in having clause. I think I can just keep the first one. Any suggestions? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM, please also update the test case name, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Thanks! I will update and push once the current test build is done.
|
Test build #58327 has finished for PR 12974 at commit
|
|
Test build #58341 has finished for PR 12974 at commit
|
…distinct aggregate function #### Symptom: In the latest **branch 1.6**, when a `DISTINCT` aggregation function is used in the `HAVING` clause, Analyzer throws `AnalysisException` with a message like following: ``` resolved attribute(s) gid#558,id#559 missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], [date#554,id#561,gid#560,if ((gid = 1)) id else null#562]; ``` #### Root cause: The problem is that the distinct aggregate in having condition are resolved by the rule `DistinctAggregationRewriter` twice, which messes up the resulted `EXPAND` operator. In a `ResolveAggregateFunctions` rule, when resolving ```Filter(havingCondition, _: Aggregate)```, the `havingCondition` is resolved as an `Aggregate` in a nested loop of analyzer rule execution (by invoking `RuleExecutor.execute`). At this nested level of analysis, the rule `DistinctAggregationRewriter` rewrites this distinct aggregate clause to an expanded two-layer aggregation, where the `aggregateExpresssions` of the final `Aggregate` contains the resolved `gid` and the aggregate expression attributes (In the above case, they are `gid#558, id#559`). After completion of the nested analyzer rule execution, the resulted `aggregateExpressions` in the `havingCondition` is pushed down into the underlying `Aggregate` operator. The `DistinctAggregationRewriter` rule is executed again. The `projections` field of `EXPAND` operator is populated with the `aggregateExpressions` of the `havingCondition` mentioned above. However, the attributes (In the above case, they are `gid#558, id#559`) in the projection list of `EXPAND` operator can not be found in the underlying relation. #### Solution: This PR retrofits part of [#11579](#11579) that moves the `DistinctAggregationRewriter` to the beginning of Optimizer, so that it guarantees that the rewrite only happens after all the aggregate functions are resolved first. Thus, it avoids resolution failure. #### How is the PR change tested New [test cases ](https://github.com/xwu0226/spark/blob/f73428f94746d6d074baf6702589545bdbd11cad/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala#L927-L988) are added to drive `DistinctAggregationRewriter` rewrites for multi-distinct aggregations , involving having clause. A following up PR will be submitted to add these test cases to master(2.0) branch. Author: xin Wu <[email protected]> Closes #12974 from xwu0226/SPARK-14495_review.
|
merging to 1.6, thanks! |
|
Can you close it? Looks like PR not merged to master won't be closed automatically. |
…distinct aggregate function #### Symptom: In the latest **branch 1.6**, when a `DISTINCT` aggregation function is used in the `HAVING` clause, Analyzer throws `AnalysisException` with a message like following: ``` resolved attribute(s) gid#558,id#559 missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], [date#554,id#561,gid#560,if ((gid = 1)) id else null#562]; ``` #### Root cause: The problem is that the distinct aggregate in having condition are resolved by the rule `DistinctAggregationRewriter` twice, which messes up the resulted `EXPAND` operator. In a `ResolveAggregateFunctions` rule, when resolving ```Filter(havingCondition, _: Aggregate)```, the `havingCondition` is resolved as an `Aggregate` in a nested loop of analyzer rule execution (by invoking `RuleExecutor.execute`). At this nested level of analysis, the rule `DistinctAggregationRewriter` rewrites this distinct aggregate clause to an expanded two-layer aggregation, where the `aggregateExpresssions` of the final `Aggregate` contains the resolved `gid` and the aggregate expression attributes (In the above case, they are `gid#558, id#559`). After completion of the nested analyzer rule execution, the resulted `aggregateExpressions` in the `havingCondition` is pushed down into the underlying `Aggregate` operator. The `DistinctAggregationRewriter` rule is executed again. The `projections` field of `EXPAND` operator is populated with the `aggregateExpressions` of the `havingCondition` mentioned above. However, the attributes (In the above case, they are `gid#558, id#559`) in the projection list of `EXPAND` operator can not be found in the underlying relation. #### Solution: This PR retrofits part of [apache#11579](apache#11579) that moves the `DistinctAggregationRewriter` to the beginning of Optimizer, so that it guarantees that the rewrite only happens after all the aggregate functions are resolved first. Thus, it avoids resolution failure. #### How is the PR change tested New [test cases ](https://github.com/xwu0226/spark/blob/f73428f94746d6d074baf6702589545bdbd11cad/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala#L927-L988) are added to drive `DistinctAggregationRewriter` rewrites for multi-distinct aggregations , involving having clause. A following up PR will be submitted to add these test cases to master(2.0) branch. Author: xin Wu <[email protected]> Closes apache#12974 from xwu0226/SPARK-14495_review. (cherry picked from commit d165486)
|
@cloud-fan Thank! I will close it. |
## What changes were proposed in this pull request? Add new test cases for including distinct aggregate in having clause in 2.0 branch. This is a followup PR for [#12974](#12974), which is for 1.6 branch. Author: xin Wu <[email protected]> Closes #12984 from xwu0226/SPARK-15206.
## What changes were proposed in this pull request? Add new test cases for including distinct aggregate in having clause in 2.0 branch. This is a followup PR for [#12974](#12974), which is for 1.6 branch. Author: xin Wu <[email protected]> Closes #12984 from xwu0226/SPARK-15206. (cherry picked from commit df9adb5) Signed-off-by: Wenchen Fan <[email protected]>
Symptom:
In the latest branch 1.6, when a
DISTINCTaggregation function is used in theHAVINGclause, Analyzer throwsAnalysisExceptionwith a message like following:Root cause:
The problem is that the distinct aggregate in having condition are resolved by the rule
DistinctAggregationRewritertwice, which messes up the resultedEXPANDoperator.In a
ResolveAggregateFunctionsrule, when resolvingFilter(havingCondition, _: Aggregate), thehavingConditionis resolved as anAggregatein a nested loop of analyzer rule execution (by invokingRuleExecutor.execute). At this nested level of analysis, the ruleDistinctAggregationRewriterrewrites this distinct aggregate clause to an expanded two-layer aggregation, where theaggregateExpresssionsof the finalAggregatecontains the resolvedgidand the aggregate expression attributes (In the above case, they aregid#558, id#559).After completion of the nested analyzer rule execution, the resulted
aggregateExpressionsin thehavingConditionis pushed down into the underlyingAggregateoperator. TheDistinctAggregationRewriterrule is executed again. Theprojectionsfield ofEXPANDoperator is populated with theaggregateExpressionsof thehavingConditionmentioned above. However, the attributes (In the above case, they aregid#558, id#559) in the projection list ofEXPANDoperator can not be found in the underlying relation.Solution:
This PR retrofits part of #11579 that moves the
DistinctAggregationRewriterto the beginning of Optimizer, so that it guarantees that the rewrite only happens after all the aggregate functions are resolved first. Thus, it avoids resolution failure.How is the PR change tested
New test cases are added to drive
DistinctAggregationRewriterrewrites for multi-distinct aggregations , involving having clause.A following up PR will be submitted to add these test cases to master(2.0) branch.