[SPARK-11863][SQL][WIP] Unable to resolve order by if it contains mixture of aliases and real columns. #9844

dilipbiswal · 2015-11-19T21:36:08Z

Compute the evaluatedOrderings by replacing the alias names referenced by the Sort
expressions with the attributes in the aggregate expressions, after checking semantic equality.

Example: select c1 as a, c2 as b from tab group by c1, c2 order by a, c2

dilipbiswal · 2015-11-19T21:36:37Z

@cloud-fan Hi Wenchen, can you please look at this change and let me know your comments.

cloud-fan · 2015-11-20T01:36:42Z

ok to test

SparkQA · 2015-11-20T03:51:49Z

Test build #46383 has finished for PR 9844 at commit 954f919.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-11-20T10:32:27Z

Test build #46405 has finished for PR 9844 at commit ef4274a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2015-11-23T02:18:23Z

This is really a good catch, thanks @dilipbiswal

The problem is that a normal operator should be resolved based on its child, but the Sort operator can also be resolved based on its grandchild. So we have 3 rules that can resolve Sort: ResolveReferences, ResolveSortReferences (if grandchild is Project) and ResolveAggregateFunctions (if grandchild is Aggregate).

For your example, `select c1 as a, c2 as b from tab group by c1, c2 order by a, c2`, we need to resolve `a` and `c2` for Sort. First, `a` is resolved in ResolveReferences based on its child. When we reach ResolveAggregateFunctions, we try to resolve both `a` and `c2` based on its grandchild, but that fails because `a` is not a legal aggregate expression.

I think we can just fix the problem directly, i.e. pick up only the unresolved SortOrders and try to resolve them based on the grandchild in ResolveAggregateFunctions. @dilipbiswal what do you think?
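
To make the two resolution levels in this example concrete, here is a toy sketch in plain Scala. It is not Catalyst code: the object name, the name sets, and both helper methods are hypothetical, and it assumes the table `tab` has only the columns `c1` and `c2`.

```scala
// Toy model only: plain name sets stand in for Catalyst plan outputs.
object SortResolutionSketch {
  // Output of Sort's child, the Aggregate: select c1 as a, c2 as b
  val childOutput: Set[String] = Set("a", "b")
  // Output of Sort's grandchild, the table scan (assumed columns c1, c2)
  val grandchildOutput: Set[String] = Set("c1", "c2")

  // Report which plan level can supply a given ORDER BY name.
  def resolvedBy(name: String): String =
    if (childOutput.contains(name)) "child"
    else if (grandchildOutput.contains(name)) "grandchild"
    else "unresolved"

  // True when every name can be resolved against one level alone.
  def resolvableFromOneLevel(names: Seq[String]): Boolean =
    names.forall(childOutput.contains) || names.forall(grandchildOutput.contains)
}
```

In this model `order by a, c2` needs both levels, which is why trying to resolve every sort expression against the grandchild alone cannot succeed.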

dilipbiswal · 2015-11-23T02:41:07Z

@cloud-fan Thank you for the explanation, as always. Let me check that I understood your suggestion properly. Were you suggesting adding another case under ResolveSortReferences to handle a Sort operator over an Aggregate? Or were you thinking of calling resolveAndFindMissing from within ResolveAggregateFunctions, passing the grandchild? Since most of the work was happening in that function, I thought it would be easier to just refactor the evaluatedOrderings logic. However, you understand this code a lot better, so I will go with your suggestion. Please let me know.

cloud-fan · 2015-11-23T03:37:43Z

In ResolveAggregateFunctions, we pick up all SortOrders and put them in the aggregate list to resolve them. However, logically we should pick up only the unresolved SortOrders, as some of them may already have been resolved in ResolveReferences.

In your case, the `a` in `order by a, c2` can only be resolved in ResolveReferences, so we should skip `a` in ResolveAggregateFunctions.

Does it make sense to you?

dilipbiswal · 2015-11-23T07:31:02Z

Thanks a lot @cloud-fan. Actually, I do remember trying something similar: I filtered on resolved and tried to pick only the unresolved attributes. But after calling execute, I had difficulty stitching the two sets of attributes back together to restore the original sort order. We need to keep the original order of the attributes, right? So here is what I was trying:

1. Filter the sort order attributes to keep only the unresolved ones.
2. Copy them to the aggregate expressions and call execute.
3. Form a sort order containing both the attributes that were filtered out (already resolved) and the newly resolved ones. This is where I was having difficulty.
4. Call checkAnalysis, and the rest of the logic follows.

I am sure you will have some magic here :-) Can you share your thoughts? Thanks a lot.

cloud-fan · 2015-11-23T07:41:48Z

How about this:

    val unresolvedSortOrders = sortOrders.filterNot(_.resolved)
    val resolvedSortOrders = ... // the original logic that copies them to the aggregate expressions and calls execute
    val sortOrdersMap = unresolvedSortOrders.map(new TreeNodeRef(_)).zip(resolvedSortOrders).toMap
    val finalSortOrders = sortOrders.map(s => sortOrdersMap.getOrElse(new TreeNodeRef(s), s))

TreeNodeRef is a helper class that can help us search the SortOrder by object reference.
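
A minimal, self-contained sketch of this merge step, for illustration only. `Order` and `Ref` are hypothetical stand-ins for Catalyst's SortOrder and TreeNodeRef (they are not Spark classes), and the `resolve` argument is a placeholder for the real resolution logic.

```scala
// Hypothetical stand-in for a sort order entry; not Spark's SortOrder.
final case class Order(expr: String, resolved: Boolean)

// Hypothetical stand-in for TreeNodeRef: equality and hashing use object
// identity rather than the wrapped node's structural equality.
final class Ref(val node: AnyRef) {
  override def equals(other: Any): Boolean = other match {
    case that: Ref => that.node eq node
    case _         => false
  }
  override def hashCode(): Int = System.identityHashCode(node)
}

object MergeSortOrders {
  // Resolve only the unresolved entries, then put the results back at their
  // original positions. Already resolved entries pass through unchanged.
  def merge(orders: Seq[Order], resolve: Order => Order): Seq[Order] = {
    val unresolved = orders.filterNot(_.resolved)
    val newlyResolved = unresolved.map(resolve)
    val byRef = unresolved.map(new Ref(_)).zip(newlyResolved).toMap
    orders.map(o => byRef.getOrElse(new Ref(o), o))
  }
}
```

Because the lookup goes through the original sequence, the output keeps the original ordering, which addresses the stitching problem raised earlier in the thread.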

dilipbiswal · 2015-11-23T07:47:57Z

@cloud-fan Wow.. thank you very much. I will try it. Thanks again,

dilipbiswal · 2015-11-24T02:44:16Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@cloud-fan Thanks for your help. I have made changes based on your comments. There is one change I am a little concerned about and need your input on.
After computing finalSortOrders, I pass it to the pushdown computation logic, which checks semantic equality between the sort attributes and the aggregate expression attributes. For an already resolved attribute, the comparison fails, so the attribute is treated as pushdownable and things go wrong :-). So I am comparing the expression ID of the resolved alias against the aggregate expression's attribute reference. It does not seem clean though :-)

We can compute finalSortOrders at the very end, after evaluatedOrderings. It should be:

    val sortOrdersMap = unresolvedSortOrders.map(new TreeNodeRef(_)).zip(evaluatedOrderings).toMap
    val finalSortOrders = sortOrders.map(s => sortOrdersMap.getOrElse(new TreeNodeRef(s), s))
    if (sortOrder == finalSortOrders) { ......

@cloud-fan THANKS !! Somehow I kept thinking that we need the full list of sort attributes for the pushdown determination. After your comment it makes sense: any already resolved sort order attribute must have been resolved from the aggregate expressions, and hence it is OK not to consider it. Right?

One other question: after our change, a lot of cases now go through this code path, and here we add an extra projection above the sort. Should we add it only when we have added at least one pushdown attribute? Or should we fix the test cases instead?

I think our optimizer is smart enough to remove an unnecessary Project, so we don't need to worry about it here :)
You can change your test case if the plan doesn't match.

SparkQA · 2015-11-24T03:16:09Z

Test build #46580 has finished for PR 9844 at commit d319524.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2015-11-24T06:17:17Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

keep the indent please.

dilipbiswal · 2015-11-24T07:58:11Z

@cloud-fan can you please help trigger a retest ? Thanks.

cloud-fan · 2015-11-24T07:58:25Z

retest this please.

cloud-fan · 2015-11-24T08:02:27Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala

code style: we should put the . at the beginning of a line, not at the end. And also remove the space between groupBy('a, 'c) and ('a.as("a1"),...

@cloud-fan Will do. Wenchen, there are a few test failures that I am still looking at. I think our idea to NOT consider already resolved attributes in the pushdown decision is causing the issue.

Here are the tests:

SELECT count(*) FROM orderByData GROUP BY a ORDER BY count(*)
In this case we want the sort attribute representing count(*) to be replaced by the group by alias.

SELECT a FROM orderByData GROUP BY a ORDER BY a, count(*), sum(b)
In this case we want count(*) to be pushed down to the aggregate.

In both cases, we skip pushdown processing because the attribute is already resolved.
Given this, Wenchen, may I request that you look at the original fix? After learning more about the different conditions, it seems like that may be the safer fix. Let me know what you think.

SparkQA · 2015-11-24T08:11:47Z

Test build #46588 has finished for PR 9844 at commit cbf14ff.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dilipbiswal · 2015-11-25T06:02:41Z

@cloud-fan Thanks a lot.

…of aliases and real columns

this is based on #9844, with some bug fix and clean up.

The problems is that, normal operator should be resolved based on its child, but `Sort` operator can also be resolved based on its grandchild. So we have 3 rules that can resolve `Sort`: `ResolveReferences`, `ResolveSortReferences`(if grandchild is `Project`) and `ResolveAggregateFunctions`(if grandchild is `Aggregate`).

For example, `select c1 as a , c2 as b from tab group by c1, c2 order by a, c2`, we need to resolve `a` and `c2` for `Sort`. Firstly `a` will be resolved in `ResolveReferences` based on its child, and when we reach `ResolveAggregateFunctions`, we will try to resolve both `a` and `c2` based on its grandchild, but failed because `a` is not a legal aggregate expression.

whoever merge this PR, please give the credit to dilipbiswal

Author: Dilip Biswal
Author: Wenchen Fan

Closes #9961 from cloud-fan/sort.

(cherry picked from commit bc16a67)
Signed-off-by: Michael Armbrust

markhamstra · 2015-12-09T19:28:25Z

Should this be closed now that #9961 is merged?

srowen · 2015-12-09T19:35:43Z

Yes, using the magic words: do you mind closing this PR?

dilipbiswal · 2015-12-09T19:44:53Z

@srowen closed. Sorry to have missed it.

954f919 · [SPARK-11863] Unable to resolve order by attributes if it contains mixture of aliases and real columns.
ef4274a · Fix test failure
d319524 · Implement code review comments

dilipbiswal added 2 commits November 23, 2015 23:28

8c96098 · fix test failure
cbf14ff · minor style

cloud-fan mentioned this pull request Nov 25, 2015

[SPARK-11863][SQL] Unable to resolve order by if it contains mixture of aliases and real columns #9961

Closed

dilipbiswal closed this Dec 9, 2015
