[SPARK-12789]Support order by index and group by index #10731

zhichao-li · 2016-01-13T00:48:56Z

A number in ORDER BY is currently treated as a constant expression. It would be good to let users specify a column by index, which Hive has supported since 0.11.0.
The index is 1-based: for ORDER BY it is the position in the projection list, and for GROUP BY it is the position among the table's columns.
For example:
- table test (a, b, c)
- SELECT b, c FROM test ORDER BY 1 is the same as SELECT b, c FROM test ORDER BY b
- SELECT SUM(a) FROM test GROUP BY 2 is the same as SELECT SUM(a) FROM test GROUP BY b
- Ordering by 0 or grouping by 4 throws an exception in this case, since the index is out of range.
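The 1-based lookup and range check described above can be sketched in a few lines. This is only an illustration, not the code in the patch: `OrdinalSketch` and `resolve` are hypothetical names, columns are plain strings, and an out-of-range index is reported as a `Left` value rather than the exception the analyzer would throw.

```scala
object OrdinalSketch {
  // Hypothetical helper: map a 1-based ordinal onto an entry of `columns`.
  // Returns Left with a message when the ordinal is out of range.
  def resolve(columns: Seq[String], ordinal: Int): Either[String, String] =
    if (ordinal >= 1 && ordinal <= columns.length) Right(columns(ordinal - 1))
    else Left(s"position $ordinal is out of range (valid: 1 to ${columns.length})")
}
```

With the projection list `Seq("b", "c")`, `resolve(_, 1)` yields `b`, which matches the ORDER BY example. With the table columns `Seq("a", "b", "c")`, position 2 yields `b`, and positions 0 and 4 yield a `Left`.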

zhichao-li · 2016-01-13T00:49:41Z

cc @chenghao-intel @adrian-wang

adrian-wang · 2016-01-13T01:03:06Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

It's for intercept[UnresolvedException[SortOrder]]

SparkQA · 2016-01-13T01:04:46Z

Test build #49276 has finished for PR 10731 at commit 5a2270b.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-13T02:16:30Z

Test build #49283 has finished for PR 10731 at commit d5cb4e2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-13T06:54:16Z

Test build #49292 has finished for PR 10731 at commit b72547b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-13T10:34:59Z

Test build #49311 has finished for PR 10731 at commit 1cb6752.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2016-01-13T18:06:29Z

@zhichao-li It will be good if you can take a look and see if other databases (other than hive) support this. I am not sure if it is really useful.

yhuai · 2016-01-13T20:43:47Z

oh, nvm. It is pretty common in other databases.

zhichao-li · 2016-01-14T00:48:15Z

retest this please.

SparkQA · 2016-01-14T02:43:42Z

Test build #49359 has finished for PR 10731 at commit 1cb6752.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-14T07:15:03Z

Test build #49383 has finished for PR 10731 at commit fe99e00.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2016-01-18T04:37:54Z

A quick question: if I do ORDER BY a, 2, b, c, does 2 mean the second column?

yhuai · 2016-01-18T04:39:54Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

Should we keep the && !s.resolved in the condition and have another case to handle the case of all literals?

I will update that shortly. I was thinking it would be more efficient to combine those in one pass.

adrian-wang · 2016-01-18T04:40:26Z

order by 2 should be the second column, I think

zhichao-li · 2016-01-18T04:47:39Z

Yes, it's 1-based indexing into the projection list.

zhichao-li · 2016-01-18T07:06:56Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@yhuai not sure if it's the style you prefer. mind giving suggestions?

SparkQA · 2016-01-18T07:20:36Z

Test build #49575 has finished for PR 10731 at commit d9f548c.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-18T07:47:54Z

Test build #49579 has finished for PR 10731 at commit 2746e0f.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-18T09:53:48Z

Test build #49584 has finished for PR 10731 at commit acd00be.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2016-01-18T10:17:00Z

This is similar to: #10052

That PR also implements this idea for GROUP BY clauses.

yhuai · 2016-01-18T17:42:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

oh, actually, I meant if we can just check if there is any integer literal. If so, we create a new Sort. Otherwise, we keep the old one.

If there are nested attributes, an integer literal could be used to reference nested fields, so the integer literal should be at the top level of the order list.

Seems like it might be more reasonable from the semantic point of view to override the resolved method and move the logic to resolveSortOrders.

zhichao-li · 2016-01-19T04:50:57Z

sql/core/src/test/scala/org/apache/spark/sql/execution/joins/InnerJoinSuite.scala

This is for passing the style check

zhichao-li · 2016-01-19T04:55:45Z

@hvanhovell I wasn't aware of #10052; I would be happy if @dereksabryfb can pick that up.

SparkQA · 2016-01-19T06:10:41Z

Test build #49663 has finished for PR 10731 at commit 0daa766.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-17T09:31:50Z

Test build #51419 has finished for PR 10731 at commit 66c54b1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-03-17T04:24:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala

+object IntegerIndex {
+  def unapply(a: Any): Option[Int] = a match {
+    case Literal(a: Int, IntegerType) => Some(a)
+    case UnaryMinus(IntegerLiteral(v)) => Some(-v)


Is it standard to support -(-1)? I see Postgres supports it, but it seems somewhat strange to me.

This line is used for catching the illegal case:

sql("SELECT * FROM testData2 ORDER BY -1 DESC, b ASC").collect()

I plan to keep it untouched in the PR. Thanks!
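To make the role of the `UnaryMinus` case concrete, here is a minimal sketch of an extractor with the same shape as the quoted `IntegerIndex`. It runs over a toy expression tree rather than Catalyst's classes: `IntLit`, `Neg`, `Column` and `check` are hypothetical stand-ins, and the `case _ => None` fallback is an assumption because the quoted diff is cut off before that point.

```scala
object IntegerIndexSketch {
  // Toy expression tree standing in for Catalyst expressions (not Spark's classes).
  sealed trait Expr
  final case class IntLit(value: Int) extends Expr
  final case class Neg(child: Expr) extends Expr
  final case class Column(name: String) extends Expr

  // Matches a bare integer literal or a negated one, so that -1 is still
  // recognised as a position and can be rejected instead of silently ignored.
  object IntegerIndex {
    def unapply(e: Expr): Option[Int] = e match {
      case IntLit(v)      => Some(v)
      case Neg(IntLit(v)) => Some(-v)
      case _              => None // assumed fallback; not visible in the quoted diff
    }
  }

  // Classify one ORDER BY item against a select list of `width` columns.
  def check(e: Expr, width: Int): String = e match {
    case IntegerIndex(i) if i >= 1 && i <= width => s"position $i"
    case IntegerIndex(i)                         => s"invalid position $i"
    case _                                       => "not a position"
  }
}
```

In this sketch `check(Neg(IntLit(1)), 2)` falls into the "invalid position" branch, which is the situation the `ORDER BY -1 DESC` test above is meant to catch.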

rxin · 2016-03-17T04:37:53Z

Also I'd say "by position", not "by index", since index usually refers to something else in databases.

rxin · 2016-03-17T04:39:31Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala


+      // Replace the index with the related attribute for ORDER BY
+      // which is a 1-base position of the projection list.
+      case s @ Sort(orders, global, child) if child.resolved &&


this rule is getting pretty long -- i wonder if there are ways to break it down

I will move it to the rule ResolveSortReferences

I am unable to find a good place for group by ordinal resolution after placing order by ordinal resolution in ResolveSortReferences. I have two options in mind:

1. Assuming we can merge #11208 ([SPARK-13320] [SQL] Support Star in CreateStruct/CreateArray and Error Handling when DataFrame/DataSet Functions using Star), which splits ResolveReferences into two rules, ResolveStar and ResolveReferences, then ResolveReferences would no longer be very long, so maybe we can keep ordinal resolution there.

2. Create a separate rule ResolveOrdinal for both cases.

In the next PR, I will first pick the second option, if nobody is against it. : ) Thanks!

rxin · 2016-03-17T04:40:46Z

two other comments:

- it'd be better to separate order by position and group by position into two prs.
- we should have config options to allow turning this off.

rxin · 2016-03-17T04:41:31Z

@gatorsmile do you think you can take over this and create two prs based on this?

gatorsmile · 2016-03-17T05:26:42Z

Sure, will do it. Thanks!

gatorsmile · 2016-03-17T06:31:48Z

Just did a quick search.

SQL92 allowed the use of ordinal positions for sort_expressions, but this functionality has been deprecated and should not be used in SQL99 and SQL2003 queries.

However, the mainstream RDBMS still support it.

None of these top 3 enterprise RDBMSs allows negative positions. I think we should not support negative integers in ORDER BY.

Thanks!

yhuai · 2016-03-17T17:12:03Z

@gatorsmile Thank you for the investigation. Yea let's not use negative integer.

Regarding the group by clause, do other systems support specifying column positions?

gatorsmile · 2016-03-17T18:02:53Z

@yhuai "group by position" is not supported by Oracle, DB2 and SQL Server. I am unable to find it in any SQL standard.

Should we continue to support it? Also CC @rxin

yhuai · 2016-03-17T18:11:38Z

One question, if I have

table1: a: int, b: int, c: int

SELECT b, c FROM table1 ORDER BY 1, 2

which columns are used in ORDER BY? I guess b, c, right?

For GROUP BY, I feel it is not always obvious which columns the positions refer to (I mean not as obvious as for ORDER BY). What do you think?

gatorsmile · 2016-03-17T18:25:49Z

What you said about order by is right. The most tricky part is *. When we are doing select (*) in DB2, the position number is based on the table definition in catalog table.

Regarding Group By, I do not know which behavior is right. Different from Order By, Group By is below Project/SELECT. Thus, personally, I do not know what is the expected behavior when resolving position number in group by. Just like, the alias defined in Project/Select cannot be used in Group By.

yhuai · 2016-03-17T18:48:49Z

Thanks. Then, let's add the support to ORDER BY.

rxin · 2016-03-17T18:51:29Z

It is pretty obvious isn't it even for group by? It is just the project list, not the underlying table.

hvanhovell · 2016-03-17T18:53:49Z

GROUP BY position is supported by a few major analytical databases: Teradata & Netezza

I am not sure if you should even allow the combination of a SELECT * with a positional ORDER BY/GROUP BY clause

rxin · 2016-03-17T19:00:36Z

select * with group by is definitely not valid (with or without position)

select * with order by should work, since * here is just an expansion.

gatorsmile · 2016-03-17T21:06:36Z

Just confirmed what @hvanhovell said, Netezza and Teradata support "group by position".

Also confirmed what @rxin said, in Group By, the position is based on the output columns (select expression).

Thus, I think the integer in groupingExpressions should be resolved based on aggregateExpressions of Aggregate. Please let me know if my understanding is wrong. Thanks!

gatorsmile · 2016-03-19T22:51:38Z

"Group By Ordinal" will throw an exception if the corresponding position of the select list is an AggregateFunction. This is not allowed. I believe this PR misses this point. Please correct me if my understanding is wrong. Thanks!

#### What changes were proposed in this pull request?

This PR is to support order by position in SQL, e.g.
```SQL
select c1, c2, c3 from tbl order by 1 desc, 3
```
should be equivalent to
```SQL
select c1, c2, c3 from tbl order by c1 desc, c3 asc
```
This is controlled by config option `spark.sql.orderByOrdinal`.
- When true, the ordinal numbers are treated as the position in the select list.
- When false, the ordinal number in order/sort By clause are ignored.
- Only convert integer literals (not foldable expressions). If found foldable expressions, ignore them
- This also works with select *.

**Question**: Do we still need sort by columns that contain zero reference? In this case, it will have no impact on the sorting results. IMO, we should not allow users do it. rxin cloud-fan marmbrus yhuai hvanhovell
-- Update: In these cases, they are ignored in this case.

**Note**: This PR is taken from apache#10731. When merging this PR, please give the credit to zhichao-li

Also cc all the people who are involved in the previous discussion: adrian-wang chenghao-intel tejasapatil

#### How was this patch tested?

Added a few test cases for both positive and negative test cases.

Author: gatorsmile <[email protected]>

Closes apache#11815 from gatorsmile/orderByPosition.
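The rules in this commit message can be illustrated with a small, self-contained sketch. It is not the analyzer rule from the PR: `SortItem`, `resolve` and the `orderByOrdinal` parameter are hypothetical names, the select list is a list of plain column names, and "ignored" is interpreted here as dropping the literal from the sort keys, since sorting by a constant does not change the result.

```scala
object OrderByOrdinalSketch {
  // One ORDER BY item: an integer literal (Left) or a column name (Right),
  // together with its sort direction.
  final case class SortItem(key: Either[Int, String], ascending: Boolean = true)

  // Resolve ORDER BY items to (column, ascending) pairs.
  // `orderByOrdinal` plays the role of the config option named in the text.
  def resolve(select: Seq[String],
              items: Seq[SortItem],
              orderByOrdinal: Boolean): Seq[(String, Boolean)] =
    items.flatMap {
      case SortItem(Right(name), asc) => List((name, asc))
      case SortItem(Left(pos), asc) if orderByOrdinal =>
        // 1-based position in the select list; reject anything out of range.
        require(pos >= 1 && pos <= select.length, s"ORDER BY position $pos is out of range")
        List((select(pos - 1), asc))
      case SortItem(Left(_), _) => Nil // flag off: the literal is ignored
    }
}
```

For the query in the message, `resolve(Seq("c1", "c2", "c3"), Seq(SortItem(Left(1), ascending = false), SortItem(Left(3))), orderByOrdinal = true)` gives `c1` descending followed by `c3` ascending.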

#### What changes were proposed in this pull request?

This PR is to support group by position in SQL. For example, when users input the following query
```SQL
select c1 as a, c2, c3, sum(*) from tbl group by 1, 3, c4
```
The ordinals are recognized as the positions in the select list. Thus, `Analyzer` converts it to
```SQL
select c1, c2, c3, sum(*) from tbl group by c1, c3, c4
```

This is controlled by the config option `spark.sql.groupByOrdinal`.
- When true, the ordinal numbers in group by clauses are treated as the position in the select list.
- When false, the ordinal numbers are ignored.
- Only convert integer literals (not foldable expressions). If found foldable expressions, ignore them.
- When the positions specified in the group by clauses correspond to the aggregate functions in select list, output an exception message.
- star is not allowed to use in the select list when users specify ordinals in group by

Note: This PR is taken from #10731. When merging this PR, please give the credit to zhichao-li

Also cc all the people who are involved in the previous discussion: rxin cloud-fan marmbrus yhuai hvanhovell adrian-wang chenghao-intel tejasapatil

#### How was this patch tested?

Added a few test cases for both positive and negative test cases.

Author: gatorsmile <[email protected]>
Author: xiaoli <[email protected]>
Author: Xiao Li <[email protected]>

Closes #11846 from gatorsmile/groupByOrdinal.
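A rough sketch of the group-by rules listed in this message follows. It is an illustration under stated assumptions, not the rule that was merged: `SelectItem`, `Col`, `Agg`, `Star` and `resolve` are hypothetical names, aliases are left out, and errors are returned as `Left` values instead of analysis exceptions.

```scala
object GroupByOrdinalSketch {
  // Toy select-list items (hypothetical; not Catalyst expressions).
  sealed trait SelectItem
  final case class Col(name: String) extends SelectItem
  final case class Agg(fn: String, arg: String) extends SelectItem
  case object Star extends SelectItem

  // A GROUP BY item is an ordinal (Left) or an explicit column name (Right).
  def resolve(select: Seq[SelectItem],
              groupBy: Seq[Either[Int, String]]): Either[String, Seq[String]] =
    if (groupBy.exists(_.isLeft) && select.contains(Star))
      Left("star is not allowed in the select list when GROUP BY uses ordinals")
    else {
      val resolved: Seq[Either[String, String]] = groupBy.map {
        case Right(name) => Right(name)
        case Left(pos) if pos < 1 || pos > select.length =>
          Left(s"GROUP BY position $pos is out of range")
        case Left(pos) =>
          select(pos - 1) match {
            case Col(name)  => Right(name)
            case Agg(fn, _) => Left(s"GROUP BY position $pos refers to aggregate function $fn")
            case Star       => Left("star is not allowed with ordinals") // excluded above
          }
      }
      // Report the first error, otherwise return the resolved column names.
      resolved.collectFirst { case Left(err) => err } match {
        case Some(err) => Left(err)
        case None      => Right(resolved.collect { case Right(name) => name })
      }
    }
}
```

With the select list `Seq(Col("c1"), Col("c2"), Col("c3"), Agg("sum", "*"))`, grouping by `Seq(Left(1), Left(3), Right("c4"))` resolves to `c1, c3, c4` as in the example query, while `Left(4)` is rejected because position 4 is an aggregate function.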

adrian-wang reviewed Jan 13, 2016

zhichao-li force-pushed the orderby branch from 1cb6752 to fe99e00 Compare January 14, 2016 05:46

yhuai reviewed Jan 18, 2016

zhichao-li force-pushed the orderby branch from fe99e00 to d9f548c Compare January 18, 2016 07:04

zhichao-li reviewed Jan 18, 2016

yhuai reviewed Jan 18, 2016

zhichao-li force-pushed the orderby branch from acd00be to 0daa766 Compare January 19, 2016 04:49

zhichao-li reviewed Jan 19, 2016

zhichao-li force-pushed the orderby branch from 0daa766 to e61429f Compare January 21, 2016 02:34

address comments

66c54b1

zhichao-li force-pushed the orderby branch from 1a36a2a to 66c54b1 Compare February 17, 2016 07:42

rxin reviewed Mar 17, 2016

gatorsmile mentioned this pull request Mar 18, 2016

[SPARK-12789] [SQL] Support Order By Ordinal in SQL #11815

Closed

gatorsmile mentioned this pull request Mar 20, 2016

[SPARK-13957] [SQL] Support Group By Ordinal in SQL #11846

Closed

zhichao-li closed this Mar 22, 2016

zhichao-li deleted the orderby branch March 22, 2016 01:16

cloud-fan mentioned this pull request Aug 11, 2016

[SPARK-17016][SQL] Improve group-by/order-by ordinal error reporting #14594

Closed
