
@xwu0226
Contributor

@xwu0226 xwu0226 commented May 7, 2016

Symptom:

In the latest branch 1.6, when a DISTINCT aggregation function is used in the HAVING clause, the Analyzer throws an AnalysisException with a message like the following:

resolved attribute(s) gid#558,id#559 missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], [date#554,id#561,gid#560,if ((gid = 1)) id else null#562];

Root cause:

The problem is that a distinct aggregate in the HAVING condition is rewritten by the rule DistinctAggregationRewriter twice, which corrupts the resulting EXPAND operator.

In the ResolveAggregateFunctions rule, when resolving Filter(havingCondition, _: Aggregate), the havingCondition is resolved as an Aggregate in a nested loop of analyzer rule execution (by invoking RuleExecutor.execute). At this nested level of analysis, the rule DistinctAggregationRewriter rewrites the distinct aggregate clause into an expanded two-layer aggregation, where the aggregateExpressions of the final Aggregate contain the resolved gid and the aggregate expression attributes (in the above case, gid#558 and id#559).

After the nested analyzer rule execution completes, the resulting aggregateExpressions in the havingCondition are pushed down into the underlying Aggregate operator. The DistinctAggregationRewriter rule is then executed again, and the projections field of the EXPAND operator is populated with the aggregateExpressions of the havingCondition mentioned above. However, the attributes (in the above case, gid#558 and id#559) in the projection list of the EXPAND operator cannot be found in the underlying relation.
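To see what a single, correct application of the rewrite produces, here is a minimal plain-Python sketch of the Expand/gid technique that DistinctAggregationRewriter uses for multiple DISTINCT aggregates (this is not Spark code; the data, table shape, and column names are illustrative assumptions): each input row is duplicated once per distinct-aggregate group, a gid column tags the copy, columns outside that group are nulled out, and two aggregation layers then deduplicate and count.

```python
# Plain-Python sketch of the Expand/gid rewrite for a query shaped like:
#   SELECT date, COUNT(DISTINCT cat), COUNT(DISTINCT id) FROM t GROUP BY date
# (illustrative data and names, not Spark internals)
from collections import defaultdict

rows = [("2016-05-07", "a", 1),
        ("2016-05-07", "a", 2),
        ("2016-05-07", "b", 2),
        ("2016-05-08", "a", 1)]

# Expand: one projection per distinct-aggregate group, tagged by gid;
# columns that do not belong to the group are nulled out.
expanded = []
for date, cat, id_ in rows:
    expanded.append((date, cat, None, 1))  # gid=1: COUNT(DISTINCT cat) group
    expanded.append((date, None, id_, 2))  # gid=2: COUNT(DISTINCT id) group

# First aggregation layer: group by (date, cat, id, gid) to deduplicate.
first_layer = set(expanded)

# Second aggregation layer: count the surviving rows per gid for each date.
counts = defaultdict(lambda: [0, 0])
for date, cat, id_, gid in first_layer:
    counts[date][gid - 1] += 1

assert counts["2016-05-07"] == [2, 2]  # 2 distinct cats, 2 distinct ids
assert counts["2016-05-08"] == [1, 1]
```

The failure described above amounts to running this expansion a second time over a plan that already contains the gid and regrouped attributes, which the base relation cannot resolve.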

Solution:

This PR retrofits part of #11579, which moves DistinctAggregationRewriter to the beginning of the Optimizer, so that the rewrite is guaranteed to happen only after all aggregate functions have been resolved. This avoids the resolution failure.
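As a toy illustration of why the move helps (hypothetical names, not the actual rule code): the rewrite is not idempotent, so firing it inside a fixed-point rule loop can apply it twice, while running it once in a dedicated Optimizer batch applies it exactly once.

```python
# Toy model: a non-idempotent rewrite applied inside a fixed-point loop
# versus once in a run-once batch (hypothetical stand-ins, not Spark code).
def rewrite(plan):
    # Stand-in for DistinctAggregationRewriter: wraps the plan in an Expand.
    return ("Expand", plan)

plan = "Aggregate"

# Analyzer-style: nested resolution fires the rule, then the outer
# fixed-point loop fires it again on the already-rewritten plan.
nested = rewrite(plan)
double = rewrite(nested)
assert double == ("Expand", ("Expand", "Aggregate"))  # double-rewritten: broken

# Optimizer-style run-once batch: exactly one application.
once = rewrite(plan)
assert once == ("Expand", "Aggregate")
```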

How is the PR change tested

New test cases are added to drive DistinctAggregationRewriter rewrites for multi-distinct aggregations involving a HAVING clause.

A follow-up PR will be submitted to add these test cases to the master (2.0) branch.

@gatorsmile
Member

@rxin @cloud-fan Thanks!

@gatorsmile
Member

gatorsmile commented May 7, 2016

@xwu0226 Because this is for Branch 1.6 only, please update the PR title to

[SPARK-14495][SQL][1.6] Fix Resolution Failure of Having Clause with Distinct Aggregate Functions

@cloud-fan
Contributor

OK to test

@xwu0226 xwu0226 changed the title [SPARK-14495][SQL] fix resolution failure of having clause with distinct aggregate function [SPARK-14495][SQL][1.6] fix resolution failure of having clause with distinct aggregate function May 7, 2016
@cloud-fan
Contributor

So #11579 accidentally fixed this bug in 2.0?

@cloud-fan
Contributor

ok to test

@xwu0226
Contributor Author

xwu0226 commented May 7, 2016

@cloud-fan Yes.

@SparkQA

SparkQA commented May 7, 2016

Test build #58061 has finished for PR 12974 at commit fb550a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.filter(_.isDistinct)
.groupBy(_.aggregateFunction.children.toSet)

val shouldRewrite = if (conf.specializeSingleDistinctAggPlanning) {
Contributor

is this flag still useful in 1.6? cc @davies @yhuai

Contributor Author

This flag is for the purpose of benchmarking the performance of single distinct aggregation by DistinctAggregationRewriter. The default value is false, which means DistinctAggregationRewriter will not be used for a single-distinct case. I see 2.0 has removed this flag, so I guess the decision has been made. If it is still needed for 1.6, I can add it back, which will involve more changes in the Optimizer to take the CatalystConf. Please let me know. Thanks!

Contributor

Even if it's not that useful, we should not remove it in a minor release.

Contributor Author

@davies Thanks for your input! Let me modify Optimizer to add the conf parameter.

@xwu0226
Contributor Author

xwu0226 commented May 10, 2016

@davies @cloud-fan I modified the change to keep the SQLConf property. In order to pass conf: CatalystConf to the Optimizer, I did a similar thing to how the 2.0 branch handles the Optimizer. Please help take a look. Thanks!

@SparkQA

SparkQA commented May 10, 2016

Test build #58184 has finished for PR 12974 at commit e3deb13.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 10, 2016

Test build #58183 has finished for PR 12974 at commit 326eb4b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@xwu0226
Contributor Author

xwu0226 commented May 10, 2016

retest this please

}

test("single distinct column set") {
Seq(true, false).foreach { specializeSingleDistinctAgg =>
Contributor

This config is not removed; it looks like we don't need to change this test?

Contributor Author

Yes, you are right. I will add it back.

}
}

test("SPARK-14495: two distinct aggregation with having clause of one distinct aggregation") {
Contributor

These 4 tests look a little verbose to me. The key to triggering this bug is to put a distinct aggregate function in HAVING, right?

Contributor Author

Yes. Thanks for pointing it out! The original thinking was that because I removed the configuration property, the single-distinct aggregate case would not get into the rewrite in the Optimizer. Therefore, I wanted to cover cases for both single-distinct and multi-distinct in the HAVING clause. I think I can just keep the first one. Any suggestions? Thanks!

Contributor

SGTM, please also update the test case name, thanks!

Contributor Author

Ok. Thanks! I will update and push once the current test build is done.

@SparkQA

SparkQA commented May 11, 2016

Test build #58327 has finished for PR 12974 at commit e0eeb7d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 11, 2016

Test build #58341 has finished for PR 12974 at commit 3782cda.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 11, 2016
…distinct aggregate function

#### Symptom:
In the latest **branch 1.6**, when a `DISTINCT` aggregation function is used in the `HAVING` clause, the Analyzer throws an `AnalysisException` with a message like the following:
```
resolved attribute(s) gid#558,id#559 missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], [date#554,id#561,gid#560,if ((gid = 1)) id else null#562];
```
#### Root cause:
The problem is that a distinct aggregate in the `HAVING` condition is rewritten by the rule `DistinctAggregationRewriter` twice, which corrupts the resulting `EXPAND` operator.

In the `ResolveAggregateFunctions` rule, when resolving ```Filter(havingCondition, _: Aggregate)```, the `havingCondition` is resolved as an `Aggregate` in a nested loop of analyzer rule execution (by invoking `RuleExecutor.execute`). At this nested level of analysis, the rule `DistinctAggregationRewriter` rewrites the distinct aggregate clause into an expanded two-layer aggregation, where the `aggregateExpressions` of the final `Aggregate` contain the resolved `gid` and the aggregate expression attributes (in the above case, `gid#558` and `id#559`).

After the nested analyzer rule execution completes, the resulting `aggregateExpressions` in the `havingCondition` are pushed down into the underlying `Aggregate` operator. The `DistinctAggregationRewriter` rule is then executed again, and the `projections` field of the `EXPAND` operator is populated with the `aggregateExpressions` of the `havingCondition` mentioned above. However, the attributes (in the above case, `gid#558` and `id#559`) in the projection list of the `EXPAND` operator cannot be found in the underlying relation.

#### Solution:
This PR retrofits part of [#11579](#11579), which moves `DistinctAggregationRewriter` to the beginning of the Optimizer, so that the rewrite is guaranteed to happen only after all aggregate functions have been resolved. This avoids the resolution failure.

#### How is the PR change tested
New [test cases](https://github.com/xwu0226/spark/blob/f73428f94746d6d074baf6702589545bdbd11cad/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala#L927-L988) are added to drive `DistinctAggregationRewriter` rewrites for multi-distinct aggregations involving a `HAVING` clause.

A follow-up PR will be submitted to add these test cases to the master (2.0) branch.

Author: xin Wu <[email protected]>

Closes #12974 from xwu0226/SPARK-14495_review.
@cloud-fan
Contributor

merging to 1.6, thanks!

@cloud-fan
Contributor

Can you close it? It looks like PRs not merged to master won't be closed automatically.

zzcclp pushed a commit to zzcclp/spark that referenced this pull request May 11, 2016
…distinct aggregate function


(cherry picked from commit d165486)
@xwu0226
Contributor Author

xwu0226 commented May 11, 2016

@cloud-fan Thanks! I will close it.

@xwu0226 xwu0226 closed this May 11, 2016
asfgit pushed a commit that referenced this pull request May 22, 2016
## What changes were proposed in this pull request?
Add new test cases for including distinct aggregate in having clause in 2.0 branch.
This is a followup PR for [#12974](#12974), which is for 1.6 branch.

Author: xin Wu <[email protected]>

Closes #12984 from xwu0226/SPARK-15206.
asfgit pushed a commit that referenced this pull request May 22, 2016

(cherry picked from commit df9adb5)
Signed-off-by: Wenchen Fan <[email protected]>