[SPARK-14495][SQL][1.6] fix resolution failure of having clause with distinct aggregate function #12974

xwu0226 · 2016-05-07T05:26:19Z

Symptom:

In the latest branch 1.6, when a DISTINCT aggregate function is used in the HAVING clause, the Analyzer throws an AnalysisException with a message like the following:

resolved attribute(s) gid#558,id#559 missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], [date#554,id#561,gid#560,if ((gid = 1)) id else null#562];

Root cause:

The problem is that the distinct aggregate in the HAVING condition is rewritten by the rule DistinctAggregationRewriter twice, which corrupts the resulting Expand operator.

In the ResolveAggregateFunctions rule, when resolving Filter(havingCondition, _: Aggregate), the havingCondition is resolved as an Aggregate in a nested run of the analyzer rules (by invoking RuleExecutor.execute). At this nested level of analysis, the rule DistinctAggregationRewriter rewrites the distinct aggregate into an expanded two-layer aggregation, where the aggregateExpressions of the final Aggregate contain the resolved gid and the aggregate expression attributes (in the above case, gid#558 and id#559).

After the nested analyzer run completes, the resulting aggregateExpressions of the havingCondition are pushed down into the underlying Aggregate operator, and the DistinctAggregationRewriter rule is executed again. The projections field of the Expand operator is then populated from the aggregateExpressions of the havingCondition mentioned above. However, the attributes in the projection list of the Expand operator (in the above case, gid#558 and id#559) cannot be found in the underlying relation.
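The double rewrite described above can be illustrated with a small toy model. This is only a sketch and not Catalyst code: `Attr`, `Expand`, `rewrite`, and `idsFrom` are hypothetical stand-ins, and the expression ids are chosen to mirror the error message. It shows that the first rewrite reads attributes the relation provides, while a second rewrite over the same relation reads attributes that only the first Expand would have produced.

```scala
object DoubleRewriteSketch {
  // An attribute is identified by its expression id, as in `id#555`.
  final case class Attr(name: String, id: Int)

  // Simplified Expand: `reads` are the attributes its projections reference,
  // `childOutput` is what the operator directly below it produces.
  final case class Expand(reads: Set[Attr], output: Set[Attr], childOutput: Set[Attr]) {
    def missing: Set[Attr] = reads -- childOutput
  }

  // Hands out fresh expression ids starting at `start`.
  def idsFrom(start: Int): () => Int = {
    var next = start - 1
    () => { next += 1; next }
  }

  // Stand-in for the distinct rewrite: place an Expand between the aggregate
  // and its child, and make the aggregate read the Expand's fresh output.
  def rewrite(
      aggReads: Set[Attr],
      childOutput: Set[Attr],
      freshId: () => Int): (Set[Attr], Expand) = {
    val gid = Attr("gid", freshId())
    val copies = aggReads.map(a => Attr(a.name, freshId()))
    val expandOutput = copies + gid
    (expandOutput, Expand(aggReads, expandOutput, childOutput))
  }

  def demo(): (Expand, Expand) = {
    val relation = Set(Attr("date", 554), Attr("id", 555))
    val freshId = idsFrom(558)
    // Nested analysis: the distinct aggregate over id#555 in HAVING is rewritten once.
    val (havingReads, first) = rewrite(Set(Attr("id", 555)), relation, freshId)
    // The rewritten expression is pushed into the Aggregate that still sits
    // directly on the relation, and the rewrite fires a second time.
    val (_, second) = rewrite(havingReads, relation, freshId)
    (first, second)
  }

  def main(args: Array[String]): Unit = {
    val (first, second) = demo()
    println(s"first rewrite, missing attributes: ${first.missing}")
    println(s"second rewrite, missing attributes: ${second.missing}")
  }
}
```

In this model the second Expand references gid#558 and id#559 while its child only outputs date#554 and id#555, which is the shape of the failure reported in the AnalysisException.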

Solution:

This PR backports the part of #11579 that moves DistinctAggregationRewriter to the beginning of the Optimizer, which guarantees that the rewrite happens only after all aggregate functions have been resolved. This avoids the resolution failure.
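Why the placement matters can be sketched with a toy rule executor. The names below (`Plan`, `Analyzer`, `runOnce`, `countingRewriter`) are hypothetical and greatly simplified: the real RuleExecutor iterates batches to a fixed point, whereas this sketch makes a single pass. It only counts how often a rewrite rule fires when it is an analyzer rule versus when it runs after analysis has finished.

```scala
import java.util.concurrent.atomic.AtomicInteger

object RuleOrderSketch {
  // The only plan property this sketch tracks.
  final case class Plan(unresolvedHaving: Boolean)
  type Rule = Plan => Plan

  // Single pass over the rules (an assumption; the real executor loops).
  def runOnce(rules: Seq[Rule], plan: Plan): Plan =
    rules.foldLeft(plan)((p, rule) => rule(p))

  final class Analyzer(extraRules: Seq[Rule]) {
    // Stand-in for ResolveAggregateFunctions: resolving HAVING runs the
    // whole analyzer again on a temporary sub-plan.
    private val resolveHaving: Rule = plan =>
      if (plan.unresolvedHaving) {
        execute(Plan(unresolvedHaving = false))
        plan.copy(unresolvedHaving = false)
      } else plan

    def execute(plan: Plan): Plan = runOnce(resolveHaving +: extraRules, plan)
  }

  // A rewrite rule that only counts how often it fires.
  def countingRewriter(counter: AtomicInteger): Rule =
    plan => { counter.incrementAndGet(); plan }

  // Rewriter registered as an analyzer rule.
  def rewritesInAnalyzer(): Int = {
    val counter = new AtomicInteger(0)
    new Analyzer(Seq(countingRewriter(counter))).execute(Plan(unresolvedHaving = true))
    counter.get()
  }

  // Rewriter run by the optimizer, after analysis has finished.
  def rewritesInOptimizer(): Int = {
    val counter = new AtomicInteger(0)
    val analyzed = new Analyzer(Seq.empty).execute(Plan(unresolvedHaving = true))
    runOnce(Seq(countingRewriter(counter)), analyzed)
    counter.get()
  }
}
```

In this model the rewriter fires once inside the nested run and once more in the outer run when it lives in the analyzer, but only once when it is deferred to the optimizer.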

How is the PR change tested

New test cases are added to drive DistinctAggregationRewriter rewrites for multi-distinct aggregations involving a HAVING clause.

A follow-up PR will be submitted to add these test cases to the master (2.0) branch.


gatorsmile · 2016-05-07T06:45:44Z

@rxin @cloud-fan Thanks!

gatorsmile · 2016-05-07T06:49:15Z

@xwu0226 Because this is for Branch 1.6 only, please update the PR title to

[SPARK-14495][SQL][1.6] Fix Resolution Failure of Having Clause with Distinct Aggregate Functions

cloud-fan · 2016-05-07T07:06:55Z

OK to test

cloud-fan · 2016-05-07T07:36:06Z

So #11579 accidentally fixed this bug in 2.0?

cloud-fan · 2016-05-07T07:36:16Z

ok to test

xwu0226 · 2016-05-07T08:06:32Z

@cloud-fan Yes.

SparkQA · 2016-05-07T09:07:13Z

Test build #58061 has finished for PR 12974 at commit fb550a1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-05-09T05:09:10Z

...lyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DistinctAggregationRewriter.scala

      .filter(_.isDistinct)
      .groupBy(_.aggregateFunction.children.toSet)

-    val shouldRewrite = if (conf.specializeSingleDistinctAggPlanning) {


is this flag still useful in 1.6? cc @davies @yhuai

This flag is for benchmarking the performance of single distinct aggregation with DistinctAggregationRewriter. The default value is false, which means DistinctAggregationRewriter will not be used for a single distinct case. I see 2.0 has removed this flag, so I guess the decision has been made. If it is still needed for 1.6, I can add it back, which will involve more changes in Optimizer to take the CatalystConf. Please let me know. Thanks!

Even if it's not that useful, we should not remove it in a minor release.

@davies Thanks for your input! Let me modify Optimizer to add the conf parameter.


xwu0226 · 2016-05-10T00:16:11Z

@davies @cloud-fan I modified the change to keep the SQLConf property. To pass conf: CatalystConf to Optimizer, I followed the way the 2.0 branch handles Optimizer. Please take a look. Thanks!
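The plumbing described here, an optimizer that receives the configuration and hands it to the rewrite rule, might look roughly like the sketch below. All types are hypothetical stand-ins (`ConfLike` for CatalystConf, `DistinctRewriteRule` for the rewriter, a toy `Optimizer`), and the sketch deliberately does not model what the rule does with the flag.

```scala
object ConfPlumbingSketch {
  // Hypothetical stand-in for CatalystConf, reduced to the one setting discussed here.
  trait ConfLike { def specializeSingleDistinctAggPlanning: Boolean }
  final case class SimpleConf(specializeSingleDistinctAggPlanning: Boolean) extends ConfLike

  // Stand-in rule: keeps the conf it was constructed with and exposes the flag it sees.
  final case class DistinctRewriteRule(conf: ConfLike) {
    def specialized: Boolean = conf.specializeSingleDistinctAggPlanning
  }

  // The optimizer takes the conf as a constructor argument and passes it to the
  // rule in its first batch, so no global state is needed to reach the setting.
  final class Optimizer(conf: ConfLike) {
    val firstBatch: Seq[DistinctRewriteRule] = Seq(DistinctRewriteRule(conf))
  }
}
```

Constructing the optimizer with a different conf value is then enough for the rule to observe the changed setting.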

SparkQA · 2016-05-10T00:18:49Z

Test build #58184 has finished for PR 12974 at commit e3deb13.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-05-10T01:35:47Z

Test build #58183 has finished for PR 12974 at commit 326eb4b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-05-10T02:17:16Z

retest this please

xwu0226 · 2016-05-10T15:17:00Z

retest this please

cloud-fan · 2016-05-11T03:02:06Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala

  }

  test("single distinct column set") {
-    Seq(true, false).foreach { specializeSingleDistinctAgg =>


This config is not removed, looks like we don't need to change this test?

Yes, you are right. I will add it back.

cloud-fan · 2016-05-11T06:14:34Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala

    }
  }
+
+  test("SPARK-14495: two distinct aggregation with having clause of one distinct aggregation") {


These 4 tests look a little verbose to me. The key to triggering this bug is to put a distinct aggregate function in the HAVING clause, right?

Yes. Thanks for pointing it out! The original thinking was that, because I removed the configuration property, the single distinct aggregate case would not go through the rewrite in Optimizer. Therefore, I wanted to cover both the single-distinct and multi-distinct cases in the HAVING clause. I think I can just keep the first one. Any suggestions? Thanks!

SGTM, please also update the test case name, thanks!

Ok. Thanks! I will update and push once the current test build is done.

SparkQA · 2016-05-11T06:51:21Z

Test build #58327 has finished for PR 12974 at commit e0eeb7d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-05-11T08:24:20Z

Test build #58341 has finished for PR 12974 at commit 3782cda.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…distinct aggregate function

#### Symptom:
In the latest **branch 1.6**, when a `DISTINCT` aggregation function is used in the `HAVING` clause, Analyzer throws `AnalysisException` with a message like following:
```
resolved attribute(s) gid#558,id#559 missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], [date#554,id#561,gid#560,if ((gid = 1)) id else null#562];
```

#### Root cause:
The problem is that the distinct aggregate in having condition are resolved by the rule `DistinctAggregationRewriter` twice, which messes up the resulted `EXPAND` operator.

In a `ResolveAggregateFunctions` rule, when resolving ```Filter(havingCondition, _: Aggregate)```, the `havingCondition` is resolved as an `Aggregate` in a nested loop of analyzer rule execution (by invoking `RuleExecutor.execute`). At this nested level of analysis, the rule `DistinctAggregationRewriter` rewrites this distinct aggregate clause to an expanded two-layer aggregation, where the `aggregateExpresssions` of the final `Aggregate` contains the resolved `gid` and the aggregate expression attributes (In the above case, they are `gid#558, id#559`).

After completion of the nested analyzer rule execution, the resulted `aggregateExpressions` in the `havingCondition` is pushed down into the underlying `Aggregate` operator. The `DistinctAggregationRewriter` rule is executed again. The `projections` field of `EXPAND` operator is populated with the `aggregateExpressions` of the `havingCondition` mentioned above. However, the attributes (In the above case, they are `gid#558, id#559`) in the projection list of `EXPAND` operator can not be found in the underlying relation.

#### Solution:
This PR retrofits part of [#11579](#11579) that moves the `DistinctAggregationRewriter` to the beginning of Optimizer, so that it guarantees that the rewrite only happens after all the aggregate functions are resolved first. Thus, it avoids resolution failure.

#### How is the PR change tested
New [test cases ](https://github.com/xwu0226/spark/blob/f73428f94746d6d074baf6702589545bdbd11cad/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala#L927-L988) are added to drive `DistinctAggregationRewriter` rewrites for multi-distinct aggregations , involving having clause.

A following up PR will be submitted to add these test cases to master(2.0) branch.

Author: xin Wu <[email protected]>

Closes #12974 from xwu0226/SPARK-14495_review.

cloud-fan · 2016-05-11T08:32:26Z

merging to 1.6, thanks!

cloud-fan · 2016-05-11T08:33:24Z

Can you close it? It looks like a PR that is not merged to master won't be closed automatically.

…distinct aggregate function

#### Symptom:
In the latest **branch 1.6**, when a `DISTINCT` aggregation function is used in the `HAVING` clause, Analyzer throws `AnalysisException` with a message like following:
```
resolved attribute(s) gid#558,id#559 missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], [date#554,id#561,gid#560,if ((gid = 1)) id else null#562];
```

#### Root cause:
The problem is that the distinct aggregate in having condition are resolved by the rule `DistinctAggregationRewriter` twice, which messes up the resulted `EXPAND` operator.

In a `ResolveAggregateFunctions` rule, when resolving ```Filter(havingCondition, _: Aggregate)```, the `havingCondition` is resolved as an `Aggregate` in a nested loop of analyzer rule execution (by invoking `RuleExecutor.execute`). At this nested level of analysis, the rule `DistinctAggregationRewriter` rewrites this distinct aggregate clause to an expanded two-layer aggregation, where the `aggregateExpresssions` of the final `Aggregate` contains the resolved `gid` and the aggregate expression attributes (In the above case, they are `gid#558, id#559`).

After completion of the nested analyzer rule execution, the resulted `aggregateExpressions` in the `havingCondition` is pushed down into the underlying `Aggregate` operator. The `DistinctAggregationRewriter` rule is executed again. The `projections` field of `EXPAND` operator is populated with the `aggregateExpressions` of the `havingCondition` mentioned above. However, the attributes (In the above case, they are `gid#558, id#559`) in the projection list of `EXPAND` operator can not be found in the underlying relation.

#### Solution:
This PR retrofits part of [apache#11579](apache#11579) that moves the `DistinctAggregationRewriter` to the beginning of Optimizer, so that it guarantees that the rewrite only happens after all the aggregate functions are resolved first. Thus, it avoids resolution failure.

#### How is the PR change tested
New [test cases ](https://github.com/xwu0226/spark/blob/f73428f94746d6d074baf6702589545bdbd11cad/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala#L927-L988) are added to drive `DistinctAggregationRewriter` rewrites for multi-distinct aggregations , involving having clause.

A following up PR will be submitted to add these test cases to master(2.0) branch.

Author: xin Wu <[email protected]>

Closes apache#12974 from xwu0226/SPARK-14495_review.

(cherry picked from commit d165486)

xwu0226 · 2016-05-11T15:21:40Z

@cloud-fan Thanks! I will close it.

## What changes were proposed in this pull request? Add new test cases for including distinct aggregate in having clause in 2.0 branch. This is a followup PR for [#12974](#12974), which is for 1.6 branch. Author: xin Wu <[email protected]> Closes #12984 from xwu0226/SPARK-15206.

## What changes were proposed in this pull request? Add new test cases for including distinct aggregate in having clause in 2.0 branch. This is a followup PR for [#12974](#12974), which is for 1.6 branch. Author: xin Wu <[email protected]> Closes #12984 from xwu0226/SPARK-15206. (cherry picked from commit df9adb5) Signed-off-by: Wenchen Fan <[email protected]>

xwu0226 added 3 commits May 5, 2016 11:13

move DistinctAggregateRewrite rule to optimizer

c51448d

modify testcases and remove property spark.sql.specializeSingleDistin…

f73428f

…ctAggPlanning

fix import order

fb550a1

xwu0226 changed the title ~~[SPARK-14495][SQL] fix resolution failure of having clause with distinct aggregate function~~ [SPARK-14495][SQL][1.6] fix resolution failure of having clause with distinct aggregate function May 7, 2016

xwu0226 mentioned this pull request May 8, 2016

[SPARK-15206][SQL] add testcases for distinct aggregate in having clause #12984

Closed

cloud-fan reviewed May 9, 2016

xwu0226 added 2 commits May 9, 2016 17:01

SPARK-14495: update upon review. keep SQLConf property specializeSing…

326eb4b

…leDistinctAggPlanning

SPARK-14495: reorder import

e3deb13

cloud-fan reviewed May 11, 2016

SPARK-14495: update testcase based on review

e0eeb7d

cloud-fan reviewed May 11, 2016

SPARK-14495: remove unnecessary testcases

3782cda

xwu0226 closed this May 11, 2016
