-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-11863][SQL][WIP] Unable to resolve order by if it contains mixture of aliases and real columns. #9844
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…xture of aliases and real columns.
|
@cloud-fan Hi Wenchen, can you please look at this change and let me know your comments. |
|
ok to test |
|
Test build #46383 has finished for PR 9844 at commit
|
|
Test build #46405 has finished for PR 9844 at commit
|
|
This is really a good catch, thanks @dilipbiswal The problems is that, normal operator should be resolved based on its child, but For your example, I think we can just fix the problem directly, i.e. only pick up unresolved |
|
@cloud-fan Thank you for the explanation as always. Trying to see if i understood your suggestion properly. Were you suggesting to add another case under ResolveSortReferences to deal with sort operator with Aggregation ? Or you were thinking to call resolveAndFindMissing from within ResolveAggregateFunctions passing the grandchild ? Since most of the stuff were happening in that function , i thought its easier to just refactor the evaluatedorderings logic. However you understand this stuff a lot better. So i would go by your suggestion. Please let me know. |
|
In In your case, the Does it make sense to you? |
|
Thanks a lot @cloud-fan. Actually i do remember trying to do something similar. So i had tried to filter on resolved and was trying to only pick un-resolved attributes. But after i had done the execute i had difficulty to stich together the two resolved attributes to restore the original sortorder. We need to keep the orginal order of the attribute, right ? So here is what i was trying ..
I am sure you will have some magic here :-) Can you share your thoughts.. Thanks a lot. |
|
how about this:
|
|
@cloud-fan Wow.. thank you very much. I will try it. Thanks again, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Thanks for your help. I have made changes based on your comments. There is one change i am a little concerned about and need your input.
After computing the finalSortOrders, i am passing it to the pushdown computation logic where it tries to compare the semantic equality between sort attributes and aggregate expression attributes. For the already resolved attribute, the comparison is failing and it considers the attribute as pushdownable and things just go wrong :-). So i am comparing the expression id of the resolved alias against the aggregation expression's attribute reference. It does not seem clean though :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can put the finalSortOrders at the final place, after this evaluatedOrderings, it should be:
val sortOrdersMap = unresolvedSortOrders.map(new TreeNodeRef(_)).zip(evaluatedOrderings).toMap
val finalSortOrders = sortOrders.map(s => sortOrdersMap.getOrElse(TreeNodeRef(s), s))
if (sortOrder == finalSortOrders) {
......
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan THANKS !! Somehow i kept thinking that we need the full list of sort attributes for pushdown determination. After your comment , it makes sense as any resolved sort order attribute(s) must be resolved from the aggregate expressions and hence its ok to not considered. Right ?
One other question , after our a change a lot of cases are now going through this codepath and here we add an extra projection above the sort. Should we add this only when we have added at least one pushdown attribute ? Or should we fix the tescases instead ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think our optimizer is smart enough to remove unnecessary Project, so we don't need to worry about it here :)
you can change your test case if the plan doesn't match.
|
Test build #46580 has finished for PR 9844 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep the indent please.
|
@cloud-fan can you please help trigger a retest ? Thanks. |
|
retest this please. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code style: we should put the . at the beginning of a line, not at the end. And also remove the space between groupBy('a, 'c) and ('a.as("a1"),...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Will do. Wenchen, there are a few test failures. I am still looking at it. So i think our idea to NOT consider the already resolved attribute for pushdown decision is causing the issue.
Here are the tests
- SELECT count() FROM orderByData GROUP BY a ORDER BY count()
In this case we want the sort attribute representing the count(_) to be replace by the
group by alias. - SELECT a FROM orderByData GROUP BY a ORDER BY a, count(_), sum(b)
In this case we want the count(*) to be pushed down to aggregate
In both these case, we are skipping pushdown processing because its a resolved attribute.
Given this wenchen, may i request you to look at the original fix. After learning more about
different conditions, it seems like that may be a safer fix. Let me know what you think.
|
Test build #46588 has finished for PR 9844 at commit
|
|
@cloud-fan Thanks a lot. |
…of aliases and real columns this is based on #9844, with some bug fix and clean up. The problems is that, normal operator should be resolved based on its child, but `Sort` operator can also be resolved based on its grandchild. So we have 3 rules that can resolve `Sort`: `ResolveReferences`, `ResolveSortReferences`(if grandchild is `Project`) and `ResolveAggregateFunctions`(if grandchild is `Aggregate`). For example, `select c1 as a , c2 as b from tab group by c1, c2 order by a, c2`, we need to resolve `a` and `c2` for `Sort`. Firstly `a` will be resolved in `ResolveReferences` based on its child, and when we reach `ResolveAggregateFunctions`, we will try to resolve both `a` and `c2` based on its grandchild, but failed because `a` is not a legal aggregate expression. whoever merge this PR, please give the credit to dilipbiswal Author: Dilip Biswal <[email protected]> Author: Wenchen Fan <[email protected]> Closes #9961 from cloud-fan/sort. (cherry picked from commit bc16a67) Signed-off-by: Michael Armbrust <[email protected]>
…of aliases and real columns this is based on #9844, with some bug fix and clean up. The problems is that, normal operator should be resolved based on its child, but `Sort` operator can also be resolved based on its grandchild. So we have 3 rules that can resolve `Sort`: `ResolveReferences`, `ResolveSortReferences`(if grandchild is `Project`) and `ResolveAggregateFunctions`(if grandchild is `Aggregate`). For example, `select c1 as a , c2 as b from tab group by c1, c2 order by a, c2`, we need to resolve `a` and `c2` for `Sort`. Firstly `a` will be resolved in `ResolveReferences` based on its child, and when we reach `ResolveAggregateFunctions`, we will try to resolve both `a` and `c2` based on its grandchild, but failed because `a` is not a legal aggregate expression. whoever merge this PR, please give the credit to dilipbiswal Author: Dilip Biswal <[email protected]> Author: Wenchen Fan <[email protected]> Closes #9961 from cloud-fan/sort.
|
Should this be closed now that #9961 is merged? |
|
Yes, using the magic words: do you mind closing this PR? |
…of aliases and real columns this is based on apache#9844, with some bug fix and clean up. The problems is that, normal operator should be resolved based on its child, but `Sort` operator can also be resolved based on its grandchild. So we have 3 rules that can resolve `Sort`: `ResolveReferences`, `ResolveSortReferences`(if grandchild is `Project`) and `ResolveAggregateFunctions`(if grandchild is `Aggregate`). For example, `select c1 as a , c2 as b from tab group by c1, c2 order by a, c2`, we need to resolve `a` and `c2` for `Sort`. Firstly `a` will be resolved in `ResolveReferences` based on its child, and when we reach `ResolveAggregateFunctions`, we will try to resolve both `a` and `c2` based on its grandchild, but failed because `a` is not a legal aggregate expression. whoever merge this PR, please give the credit to dilipbiswal Author: Dilip Biswal <[email protected]> Author: Wenchen Fan <[email protected]> Closes apache#9961 from cloud-fan/sort.
|
@srowen closed. Sorry to have missed it. |
Compute the evaluatedOrderings by replacing the Alias names referenced by Sort
expression with the attributes in agregate expressions after checking the semantic equality.
Example : select c1 as a , c2 as b from tab group by c1, c2 order by a, c2