-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-12789]Support order by index and group by index #10731
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'ts for intercept[UnresolvedException[SortOrder]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
|
Test build #49276 has finished for PR 10731 at commit
|
|
Test build #49283 has finished for PR 10731 at commit
|
|
Test build #49292 has finished for PR 10731 at commit
|
|
Test build #49311 has finished for PR 10731 at commit
|
|
@zhichao-li It will be good if you can take a look and see if other databases (other than hive) support this. I am not sure if it is really useful. |
|
oh, nvm. It is pretty common in other databases. |
|
retest this please. |
|
Test build #49359 has finished for PR 10731 at commit
|
|
Test build #49383 has finished for PR 10731 at commit
|
|
a quick question. If I do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we keep the && !s.resolved in the condition and have another case to handle the case of all literals?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would update that shortly. I was thinking it would be more efficient by combining those in one past.
|
order by 2 should be the second column, I think |
|
yes, It's a 1-based indexing for the projection list. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yhuai not sure if it's the style you prefer. mind giving suggestions?
|
Test build #49575 has finished for PR 10731 at commit
|
|
Test build #49579 has finished for PR 10731 at commit
|
|
Test build #49584 has finished for PR 10731 at commit
|
|
This is similar to: #10052 That PR also implements this idea for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, actually, I meant if we can just check if there is any integer literal. If so, we create a new Sort. Otherwise, we keep the old one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are nested attributes, integer literal could be used to reference nested fields, so we the integer literal should be on top level of order list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like it might be more reasonable from the semantic point of view to override the resolved method and move the logic to resolveSortOrders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for passing the style check
|
@hvanhovell didn't aware of #10052, would be happy if @dereksabryfb can pick up that. |
|
Test build #49663 has finished for PR 10731 at commit
|
|
Test build #51419 has finished for PR 10731 at commit
|
| object IntegerIndex { | ||
| def unapply(a: Any): Option[Int] = a match { | ||
| case Literal(a: Int, IntegerType) => Some(a) | ||
| case UnaryMinus(IntegerLiteral(v)) => Some(-v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it standard to support -(-1)? I see postgres support it, but somewhat strange to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is used for catching the illegal case:
sql("SELECT * FROM testData2 ORDER BY -1 DESC, b ASC").collect()I plan to keep it untouched in the PR. Thanks!
|
Also I'd say "by position", not "by index", since index usually refers to something else in databases. |
|
|
||
| // Replace the index with the related attribute for ORDER BY | ||
| // which is a 1-base position of the projection list. | ||
| case s @ Sort(orders, global, child) if child.resolved && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this rule is getting pretty long -- i wonder if there are ways to break it down
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will move it to the rule ResolveSortReferences
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unable to find a good place for group by ordinal resolution, after placing order by ordinal resolution in ResolveSortReferences. Two options are in my mind:
- Assuming we can merge [SPARK-13320] [SQL] Support Star in CreateStruct/CreateArray and Error Handling when DataFrame/DataSet Functions using Star #11208, which splits
ResolveReferencesto two rules:ResolveStarandResolveReferences. Then,ResolveReferencesis not very long, maybe we can keep resolution of ordinal here. - Create a separate rule
ResolveOrdinalfor both cases.
In the next PR, I will first pick the second option, if nobody is against it. : ) Thanks!
|
two other comments:
|
|
@gatorsmile do you think you can take over this and create two prs based on this? |
|
Sure, will do it. Thanks! |
|
Just did a quick search.
However, the mainstream RDBMS still support it.
None of these top 3 enterprise RDBMS are allowing negative positions. I think we should not support the negative integer in Order by. Thanks! |
|
@gatorsmile Thank you for the investigation. Yea let's not use negative integer. Regarding group by clause, do other systems support by specifying column positions? |
|
One question, if I have what are columns used in ORDER BY? I guess For GROUP BY, I feel it is not always obvious what are columns referred by the positions (I mean not as obvious as ORDER BY). What do you think? |
|
What you said about order by is right. The most tricky part is Regarding Group By, I do not know which behavior is right. Different from Order By, Group By is below Project/SELECT. Thus, personally, I do not know what is the expected behavior when resolving position number in group by. Just like, the alias defined in Project/Select cannot be used in Group By. |
|
Thanks. Then, let's add the support to ORDER BY. |
|
It is pretty obvious isn't it even for group by? It is just the project list, not the underlying table. |
|
GROUP BY position is supported by a few major analytical databases: Terradata & Netezza I am not sure if you should even allow the combination of a |
|
select * with group by is definitely not valid (with or without position) select * with order by should work, since * here is just an expansion. |
|
Just confirmed what @hvanhovell said, Netezza and Terradata support "group by position".
Also confirmed what @rxin said, in Group By, the position is based on the output columns (select expression). Thus, I think the integer in |
|
"Group By Ordinal" will throw an exception if the corresponding position of the select list is an |
#### What changes were proposed in this pull request? This PR is to support order by position in SQL, e.g. ```SQL select c1, c2, c3 from tbl order by 1 desc, 3 ``` should be equivalent to ```SQL select c1, c2, c3 from tbl order by c1 desc, c3 asc ``` This is controlled by config option `spark.sql.orderByOrdinal`. - When true, the ordinal numbers are treated as the position in the select list. - When false, the ordinal number in order/sort By clause are ignored. - Only convert integer literals (not foldable expressions). If found foldable expressions, ignore them - This also works with select *. **Question**: Do we still need sort by columns that contain zero reference? In this case, it will have no impact on the sorting results. IMO, we should not allow users do it. rxin cloud-fan marmbrus yhuai hvanhovell -- Update: In these cases, they are ignored in this case. **Note**: This PR is taken from apache#10731. When merging this PR, please give the credit to zhichao-li Also cc all the people who are involved in the previous discussion: adrian-wang chenghao-intel tejasapatil #### How was this patch tested? Added a few test cases for both positive and negative test cases. Author: gatorsmile <[email protected]> Closes apache#11815 from gatorsmile/orderByPosition.
#### What changes were proposed in this pull request? This PR is to support order by position in SQL, e.g. ```SQL select c1, c2, c3 from tbl order by 1 desc, 3 ``` should be equivalent to ```SQL select c1, c2, c3 from tbl order by c1 desc, c3 asc ``` This is controlled by config option `spark.sql.orderByOrdinal`. - When true, the ordinal numbers are treated as the position in the select list. - When false, the ordinal number in order/sort By clause are ignored. - Only convert integer literals (not foldable expressions). If found foldable expressions, ignore them - This also works with select *. **Question**: Do we still need sort by columns that contain zero reference? In this case, it will have no impact on the sorting results. IMO, we should not allow users do it. rxin cloud-fan marmbrus yhuai hvanhovell -- Update: In these cases, they are ignored in this case. **Note**: This PR is taken from apache#10731. When merging this PR, please give the credit to zhichao-li Also cc all the people who are involved in the previous discussion: adrian-wang chenghao-intel tejasapatil #### How was this patch tested? Added a few test cases for both positive and negative test cases. Author: gatorsmile <[email protected]> Closes apache#11815 from gatorsmile/orderByPosition.
#### What changes were proposed in this pull request? This PR is to support group by position in SQL. For example, when users input the following query ```SQL select c1 as a, c2, c3, sum(*) from tbl group by 1, 3, c4 ``` The ordinals are recognized as the positions in the select list. Thus, `Analyzer` converts it to ```SQL select c1, c2, c3, sum(*) from tbl group by c1, c3, c4 ``` This is controlled by the config option `spark.sql.groupByOrdinal`. - When true, the ordinal numbers in group by clauses are treated as the position in the select list. - When false, the ordinal numbers are ignored. - Only convert integer literals (not foldable expressions). If found foldable expressions, ignore them. - When the positions specified in the group by clauses correspond to the aggregate functions in select list, output an exception message. - star is not allowed to use in the select list when users specify ordinals in group by Note: This PR is taken from #10731. When merging this PR, please give the credit to zhichao-li Also cc all the people who are involved in the previous discussion: rxin cloud-fan marmbrus yhuai hvanhovell adrian-wang chenghao-intel tejasapatil #### How was this patch tested? Added a few test cases for both positive and negative test cases. Author: gatorsmile <[email protected]> Author: xiaoli <[email protected]> Author: Xiao Li <[email protected]> Closes #11846 from gatorsmile/groupByOrdinal.
Order byand the position of colums forgroup by.order by 0orgroup by 4, it would throw exception in this case since the index has been out of range.