Skip to content

Conversation

@dilipbiswal
Copy link
Contributor

Compute the evaluatedOrderings by replacing the Alias names referenced by Sort
expression with the attributes in agregate expressions after checking the semantic equality.

Example : select c1 as a , c2 as b from tab group by c1, c2 order by a, c2

@dilipbiswal
Copy link
Contributor Author

@cloud-fan Hi Wenchen, can you please look at this change and let me know your comments.

@cloud-fan
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented Nov 20, 2015

Test build #46383 has finished for PR 9844 at commit 954f919.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 20, 2015

Test build #46405 has finished for PR 9844 at commit ef4274a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

This is really a good catch, thanks @dilipbiswal

The problems is that, normal operator should be resolved based on its child, but Sort operator can be resolved based on its grandchild. So we have 3 rules that can resolve Sort: ResolveReferences, ResolveSortReferences(if grandchild is Project) and ResolveAggregateFunctions(if grandchild is Aggregate).

For your example, select c1 as a , c2 as b from tab group by c1, c2 order by a, c2, we need to resolve a and c2 for Sort. Firstly a will be resolved in ResolveReferences based on its child, and when we reach ResolveAggregateFunctions, we will try to resolve both a and c2 based on its grandchild, but failed because a is not a legal aggregate expression.

I think we can just fix the problem directly, i.e. only pick up unresolved SortOrders and try to resolve it based on grandchild in ResolveAggregateFunctions. @dilipbiswal what do you think?

@dilipbiswal
Copy link
Contributor Author

@cloud-fan Thank you for the explanation as always. Trying to see if i understood your suggestion properly. Were you suggesting to add another case under ResolveSortReferences to deal with sort operator with Aggregation ? Or you were thinking to call resolveAndFindMissing from within ResolveAggregateFunctions passing the grandchild ? Since most of the stuff were happening in that function , i thought its easier to just refactor the evaluatedorderings logic. However you understand this stuff a lot better. So i would go by your suggestion. Please let me know.

@cloud-fan
Copy link
Contributor

In ResolveAggregateFunctions, we pick up all SortOrders and put them in aggregate list to resolve them. However, logically we should only pick unresolved SortOrders as some of them may already be resolved at ResolveReferences.

In your case, the a in order by a, c2 can and only can be resolved in ResolveReferences, so we should skip the a in ResolveAggregateFunctions.

Does it make sense to you?

@dilipbiswal
Copy link
Contributor Author

Thanks a lot @cloud-fan. Actually i do remember trying to do something similar. So i had tried to filter on resolved and was trying to only pick un-resolved attributes. But after i had done the execute i had difficulty to stich together the two resolved attributes to restore the original sortorder. We need to keep the orginal order of the attribute, right ? So here is what i was trying ..

  1. filter the sortorder attributes to only keep the unresolved ones.
  2. copy them to aggregate expression and call execute
  3. After this i wanted to form a sort order having both the already filtered attributes and the newly resolved ones. This is where i was having difficulty.
  4. We then call checkAnalysis .. and rest of the logic follows.

I am sure you will have some magic here :-) Can you share your thoughts.. Thanks a lot.

@cloud-fan
Copy link
Contributor

how about this:

val unresolvedSortOrders = sortOrders.filterNot(_.resolved)
val resolvedSortOrders =  ... // the original logic that copy to aggregate expression and call execute
val sortOrdersMap = unresolvedSortOrders.map(TreeNodeRef(_)).zip(resolvedSortOrders).toMap
val finalSortOrders = sortOrders.map(s => sortOrdersMap.getOrElse(TreeNodeRef(s), s))

TreeNodeRef is a helper class that can help us search the SortOrder by object reference.

@dilipbiswal
Copy link
Contributor Author

@cloud-fan Wow.. thank you very much. I will try it. Thanks again,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan Thanks for your help. I have made changes based on your comments. There is one change i am a little concerned about and need your input.
After computing the finalSortOrders, i am passing it to the pushdown computation logic where it tries to compare the semantic equality between sort attributes and aggregate expression attributes. For the already resolved attribute, the comparison is failing and it considers the attribute as pushdownable and things just go wrong :-). So i am comparing the expression id of the resolved alias against the aggregation expression's attribute reference. It does not seem clean though :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can put the finalSortOrders at the final place, after this evaluatedOrderings, it should be:

val sortOrdersMap = unresolvedSortOrders.map(new TreeNodeRef(_)).zip(evaluatedOrderings).toMap
val finalSortOrders = sortOrders.map(s => sortOrdersMap.getOrElse(TreeNodeRef(s), s))

if (sortOrder == finalSortOrders) {
......

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan THANKS !! Somehow i kept thinking that we need the full list of sort attributes for pushdown determination. After your comment , it makes sense as any resolved sort order attribute(s) must be resolved from the aggregate expressions and hence its ok to not considered. Right ?

One other question , after our a change a lot of cases are now going through this codepath and here we add an extra projection above the sort. Should we add this only when we have added at least one pushdown attribute ? Or should we fix the tescases instead ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think our optimizer is smart enough to remove unnecessary Project, so we don't need to worry about it here :)
you can change your test case if the plan doesn't match.

@SparkQA
Copy link

SparkQA commented Nov 24, 2015

Test build #46580 has finished for PR 9844 at commit d319524.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep the indent please.

@dilipbiswal
Copy link
Contributor Author

@cloud-fan can you please help trigger a retest ? Thanks.

@cloud-fan
Copy link
Contributor

retest this please.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code style: we should put the . at the beginning of a line, not at the end. And also remove the space between groupBy('a, 'c) and ('a.as("a1"),...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan Will do. Wenchen, there are a few test failures. I am still looking at it. So i think our idea to NOT consider the already resolved attribute for pushdown decision is causing the issue.

Here are the tests

  1. SELECT count() FROM orderByData GROUP BY a ORDER BY count()
    In this case we want the sort attribute representing the count(_) to be replace by the
    group by alias.
  2. SELECT a FROM orderByData GROUP BY a ORDER BY a, count(_), sum(b)
    In this case we want the count(*) to be pushed down to aggregate

In both these case, we are skipping pushdown processing because its a resolved attribute.
Given this wenchen, may i request you to look at the original fix. After learning more about
different conditions, it seems like that may be a safer fix. Let me know what you think.

@SparkQA
Copy link

SparkQA commented Nov 24, 2015

Test build #46588 has finished for PR 9844 at commit cbf14ff.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Copy link
Contributor Author

@cloud-fan Thanks a lot.

asfgit pushed a commit that referenced this pull request Nov 26, 2015
…of aliases and real columns

this is based on #9844, with some bug fix and clean up.

The problems is that, normal operator should be resolved based on its child, but `Sort` operator can also be resolved based on its grandchild. So we have 3 rules that can resolve `Sort`: `ResolveReferences`, `ResolveSortReferences`(if grandchild is `Project`) and `ResolveAggregateFunctions`(if grandchild is `Aggregate`).
For example, `select c1 as a , c2 as b from tab group by c1, c2 order by a, c2`, we need to resolve `a` and `c2` for `Sort`. Firstly `a` will be resolved in `ResolveReferences` based on its child, and when we reach `ResolveAggregateFunctions`, we will try to resolve both `a` and `c2` based on its grandchild, but failed because `a` is not a legal aggregate expression.

whoever merge this PR, please give the credit to dilipbiswal

Author: Dilip Biswal <[email protected]>
Author: Wenchen Fan <[email protected]>

Closes #9961 from cloud-fan/sort.

(cherry picked from commit bc16a67)
Signed-off-by: Michael Armbrust <[email protected]>
asfgit pushed a commit that referenced this pull request Nov 26, 2015
…of aliases and real columns

this is based on #9844, with some bug fix and clean up.

The problems is that, normal operator should be resolved based on its child, but `Sort` operator can also be resolved based on its grandchild. So we have 3 rules that can resolve `Sort`: `ResolveReferences`, `ResolveSortReferences`(if grandchild is `Project`) and `ResolveAggregateFunctions`(if grandchild is `Aggregate`).
For example, `select c1 as a , c2 as b from tab group by c1, c2 order by a, c2`, we need to resolve `a` and `c2` for `Sort`. Firstly `a` will be resolved in `ResolveReferences` based on its child, and when we reach `ResolveAggregateFunctions`, we will try to resolve both `a` and `c2` based on its grandchild, but failed because `a` is not a legal aggregate expression.

whoever merge this PR, please give the credit to dilipbiswal

Author: Dilip Biswal <[email protected]>
Author: Wenchen Fan <[email protected]>

Closes #9961 from cloud-fan/sort.
@markhamstra
Copy link
Contributor

Should this be closed now that #9961 is merged?

@srowen
Copy link
Member

srowen commented Dec 9, 2015

Yes, using the magic words: do you mind closing this PR?

markhamstra pushed a commit to markhamstra/spark that referenced this pull request Dec 9, 2015
…of aliases and real columns

this is based on apache#9844, with some bug fix and clean up.

The problems is that, normal operator should be resolved based on its child, but `Sort` operator can also be resolved based on its grandchild. So we have 3 rules that can resolve `Sort`: `ResolveReferences`, `ResolveSortReferences`(if grandchild is `Project`) and `ResolveAggregateFunctions`(if grandchild is `Aggregate`).
For example, `select c1 as a , c2 as b from tab group by c1, c2 order by a, c2`, we need to resolve `a` and `c2` for `Sort`. Firstly `a` will be resolved in `ResolveReferences` based on its child, and when we reach `ResolveAggregateFunctions`, we will try to resolve both `a` and `c2` based on its grandchild, but failed because `a` is not a legal aggregate expression.

whoever merge this PR, please give the credit to dilipbiswal

Author: Dilip Biswal <[email protected]>
Author: Wenchen Fan <[email protected]>

Closes apache#9961 from cloud-fan/sort.
@dilipbiswal dilipbiswal closed this Dec 9, 2015
@dilipbiswal
Copy link
Contributor Author

@srowen closed. Sorry to have missed it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants