@zhichao-li
Contributor

  1. A number in ORDER BY is currently treated as a constant expression. It would be good to let users specify a column by index, which Hive has supported since 0.11.0.
  2. The index is 1-based: for ORDER BY it is the position in the projection list, and for GROUP BY it is the position of the columns.
  3. For example:
    • table test (a, b, c)
    • SELECT b, c FROM test ORDER BY 1 is the same as SELECT b, c FROM test ORDER BY b
    • SELECT SUM(a) FROM test GROUP BY 2 is the same as SELECT SUM(a) FROM test GROUP BY b
    • If we ORDER BY 0 or GROUP BY 4, an exception is thrown because the index is out of range.
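
As a rough illustration of the 1-based lookup and range check described in the list above, here is a minimal standalone sketch. It does not use Spark's classes; `OrdinalLookup` and `resolve` are hypothetical names, and the candidate list is passed in explicitly (the projection list for ORDER BY, the table columns for GROUP BY, going by the examples above).

```scala
object OrdinalLookup {
  // Resolve a 1-based ordinal against a list of candidate column names.
  // Returns Left with a message when the ordinal is out of range,
  // which stands in for the exception mentioned in the description.
  def resolve(candidates: Seq[String], ordinal: Int): Either[String, String] =
    if (ordinal >= 1 && ordinal <= candidates.size) Right(candidates(ordinal - 1))
    else Left(s"ordinal $ordinal is out of range (1 to ${candidates.size})")
}
```

With the examples above, `OrdinalLookup.resolve(Seq("b", "c"), 1)` picks `b`, while an ordinal of 0, or 4 against three columns, yields a `Left`.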

@zhichao-li
Contributor Author

cc @chenghao-intel @adrian-wang

Contributor

unused?

Contributor Author

It's for intercept[UnresolvedException[SortOrder]]

Contributor

ok

@SparkQA

SparkQA commented Jan 13, 2016

Test build #49276 has finished for PR 10731 at commit 5a2270b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2016

Test build #49283 has finished for PR 10731 at commit d5cb4e2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2016

Test build #49292 has finished for PR 10731 at commit b72547b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2016

Test build #49311 has finished for PR 10731 at commit 1cb6752.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Jan 13, 2016

@zhichao-li It will be good if you can take a look and see if other databases (other than hive) support this. I am not sure if it is really useful.

@yhuai
Contributor

yhuai commented Jan 13, 2016

oh, nvm. It is pretty common in other databases.

@zhichao-li
Contributor Author

retest this please.

@SparkQA

SparkQA commented Jan 14, 2016

Test build #49359 has finished for PR 10731 at commit 1cb6752.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 14, 2016

Test build #49383 has finished for PR 10731 at commit fe99e00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Jan 18, 2016

A quick question: if I do ORDER BY a, 2, b, c, does 2 mean the second column?

Contributor

Should we keep the && !s.resolved in the condition and have another case to handle the case of all literals?

Contributor Author

Will update that shortly. I was thinking it would be more efficient to combine those in one pass.

@adrian-wang
Contributor

order by 2 should be the second column, I think

@zhichao-li
Contributor Author

Yes, it's 1-based indexing into the projection list.

Contributor Author

@yhuai not sure if it's the style you prefer. mind giving suggestions?

@SparkQA

SparkQA commented Jan 18, 2016

Test build #49575 has finished for PR 10731 at commit d9f548c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 18, 2016

Test build #49579 has finished for PR 10731 at commit 2746e0f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 18, 2016

Test build #49584 has finished for PR 10731 at commit acd00be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

This is similar to: #10052

That PR also implements this idea for GROUP BY clauses.

Contributor

oh, actually, I meant if we can just check if there is any integer literal. If so, we create a new Sort. Otherwise, we keep the old one.

Contributor

If there are nested attributes, an integer literal could be used to reference nested fields, so the integer literal should be at the top level of the order list.

Contributor Author

Seems like it might be more reasonable from the semantic point of view to override the resolved method and move the logic to resolveSortOrders.

Contributor Author

This is for passing the style check

@zhichao-li
Contributor Author

@hvanhovell I wasn't aware of #10052; I would be happy if @dereksabryfb can pick that up.

@SparkQA

SparkQA commented Jan 19, 2016

Test build #49663 has finished for PR 10731 at commit 0daa766.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 17, 2016

Test build #51419 has finished for PR 10731 at commit 66c54b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

object IntegerIndex {
  def unapply(a: Any): Option[Int] = a match {
    case Literal(a: Int, IntegerType) => Some(a)
    case UnaryMinus(IntegerLiteral(v)) => Some(-v)
Contributor

Is it standard to support -(-1)? I see Postgres supports it, but it seems somewhat strange to me.

Member

This line is used for catching the illegal case:

      sql("SELECT * FROM testData2 ORDER BY -1 DESC, b ASC").collect()

I plan to keep it untouched in the PR. Thanks!
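
To make the point of the reply above concrete, here is a minimal standalone sketch of why an extractor might also match a negated literal: the negative value is recognized as an ordinal and can then be rejected with a clear error instead of being silently treated as a constant. The tiny expression types (`IntLit`, `Neg`, `Col`) and the `check` helper are hypothetical stand-ins, not Catalyst's classes.

```scala
object IntegerIndexSketch {
  // Hypothetical miniature expression tree, standing in for Catalyst expressions.
  sealed trait Expr
  final case class IntLit(value: Int) extends Expr
  final case class Neg(child: Expr) extends Expr
  final case class Col(name: String) extends Expr

  // Matches an integer literal or a directly negated integer literal.
  object IntegerIndex {
    def unapply(e: Expr): Option[Int] = e match {
      case IntLit(v)      => Some(v)
      case Neg(IntLit(v)) => Some(-v)
      case _              => None
    }
  }

  // Classify one ORDER BY item: a valid position, an invalid position,
  // or an ordinary expression that is left untouched (Right(None)).
  def check(e: Expr, selectListSize: Int): Either[String, Option[Int]] = e match {
    case IntegerIndex(i) if i >= 1 && i <= selectListSize => Right(Some(i))
    case IntegerIndex(i) => Left(s"invalid ORDER BY position: $i")
    case _               => Right(None)
  }
}
```

In this sketch, `check(Neg(IntLit(1)), 2)` returns a `Left`, which mirrors the illegal `ORDER BY -1` case in the test query above.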

@rxin
Contributor

rxin commented Mar 17, 2016

Also I'd say "by position", not "by index", since index usually refers to something else in databases.


// Replace the index with the related attribute for ORDER BY
// which is a 1-base position of the projection list.
case s @ Sort(orders, global, child) if child.resolved &&
Contributor

this rule is getting pretty long -- i wonder if there are ways to break it down

Member

I will move it to the rule ResolveSortReferences

Member

I am unable to find a good place for group by ordinal resolution, after placing order by ordinal resolution in ResolveSortReferences. Two options are in my mind:

In the next PR, I will first pick the second option, if nobody is against it. : ) Thanks!

@rxin
Contributor

rxin commented Mar 17, 2016

two other comments:

  1. it'd be better to separate order by position and group by position into two prs.
  2. we should have config options to allow turning this off.

@rxin
Contributor

rxin commented Mar 17, 2016

@gatorsmile do you think you can take over this and create two prs based on this?

@gatorsmile
Member

Sure, will do it. Thanks!

@gatorsmile
Member

Just did a quick search.

SQL92 allowed the use of ordinal positions for sort_expressions, but this functionality has been deprecated and should not be used in SQL99 and SQL2003 queries.

However, the mainstream RDBMS still support it.

None of these top 3 enterprise RDBMSs allows negative positions. I think we should not support negative integers in ORDER BY.

Thanks!

@yhuai
Contributor

yhuai commented Mar 17, 2016

@gatorsmile Thank you for the investigation. Yea let's not use negative integer.

Regarding the GROUP BY clause, do other systems support specifying column positions?

@gatorsmile
Member

@yhuai "group by position" is not supported by Oracle, DB2 and SQL Server. I am unable to find it in any SQL standard.

Should we continue to support it? Also CC @rxin

@yhuai
Contributor

yhuai commented Mar 17, 2016

One question, if I have

table1: a: int, b: int, c: int

SELECT b, c FROM table1 ORDER BY 1, 2

Which columns are used in ORDER BY? I guess b, c, right?

For GROUP BY, I feel it is not always obvious which columns the positions refer to (I mean not as obvious as for ORDER BY). What do you think?

@gatorsmile
Member

What you said about ORDER BY is right. The trickiest part is *. When we do SELECT * in DB2, the position number is based on the table definition in the catalog table.

Regarding GROUP BY, I do not know which behavior is right. Unlike ORDER BY, GROUP BY is below Project/SELECT. Thus, personally, I do not know what the expected behavior is when resolving a position number in GROUP BY. Similarly, an alias defined in Project/SELECT cannot be used in GROUP BY.

@yhuai
Contributor

yhuai commented Mar 17, 2016

Thanks. Then, let's add the support to ORDER BY.

@rxin
Contributor

rxin commented Mar 17, 2016

It is pretty obvious isn't it even for group by? It is just the project list, not the underlying table.

@hvanhovell
Contributor

GROUP BY position is supported by a few major analytical databases: Teradata & Netezza

I am not sure if you should even allow the combination of a SELECT * with a positional ORDER BY/GROUP BY clause

@rxin
Contributor

rxin commented Mar 17, 2016

select * with group by is definitely not valid (with or without position)

select * with order by should work, since * here is just an expansion.

@gatorsmile
Member

Just confirmed what @hvanhovell said, Netezza and Teradata support "group by position".

Also confirmed what @rxin said, in Group By, the position is based on the output columns (select expression).

Thus, I think the integer in groupingExpressions should be resolved based on aggregateExpressions of Aggregate. Please let me know if my understanding is wrong. Thanks!

@gatorsmile
Member

"Group By Ordinal" will throw an exception if the corresponding position of the select list is an AggregateFunction. This is not allowed. I believe this PR misses this point. Please correct me if my understanding is wrong. Thanks!

ghost pushed a commit to dbtsai/spark that referenced this pull request Mar 21, 2016
#### What changes were proposed in this pull request?
This PR is to support order by position in SQL, e.g.
```SQL
select c1, c2, c3 from tbl order by 1 desc, 3
```
should be equivalent to
```SQL
select c1, c2, c3 from tbl order by c1 desc, c3 asc
```

This is controlled by config option `spark.sql.orderByOrdinal`.
- When true, the ordinal numbers are treated as the position in the select list.
- When false, the ordinal numbers in the order/sort by clause are ignored.

- Only convert integer literals (not foldable expressions). Foldable expressions, if found, are ignored.
- This also works with select *.

**Question**: Do we still need to sort by columns that contain zero references? In this case, they have no impact on the sorting results. IMO, we should not allow users to do it. rxin cloud-fan marmbrus yhuai hvanhovell
-- Update: In these cases, they are ignored.

**Note**: This PR is taken from apache#10731. When merging this PR, please give the credit to zhichao-li

Also cc all the people who are involved in the previous discussion: adrian-wang chenghao-intel tejasapatil

#### How was this patch tested?
Added a few test cases for both positive and negative test cases.

Author: gatorsmile

Closes apache#11815 from gatorsmile/orderByPosition.
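
The ORDER BY behavior described in the commit message above can be sketched roughly as follows. This is a standalone illustration under stated assumptions, not the merged Analyzer rule: the expression types and `resolve` are hypothetical, the `orderByOrdinal` parameter stands in for the `spark.sql.orderByOrdinal` option, sort direction is omitted, and throwing on an out-of-range position is an assumption carried over from the original proposal.

```scala
object OrderByOrdinalSketch {
  // Hypothetical miniature expression tree, standing in for Catalyst expressions.
  sealed trait Expr
  final case class IntLit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class Col(name: String) extends Expr

  // Replace top-level integer literals in the ORDER BY list with the
  // select-list entry at that 1-based position. Foldable expressions such
  // as 1 + 1 are not literals, so they pass through unchanged.
  def resolve(selectList: Seq[Expr], orderBy: Seq[Expr], orderByOrdinal: Boolean): Seq[Expr] =
    if (!orderByOrdinal) orderBy // option off: ordinals are left as plain constants
    else orderBy.map {
      case IntLit(i) if i >= 1 && i <= selectList.size => selectList(i - 1)
      case IntLit(i) => // assumption: out-of-range positions are rejected
        throw new IllegalArgumentException(s"ORDER BY position $i is out of range")
      case other => other
    }
}
```

For the example query, resolving `order by 1, 3` against the select list `c1, c2, c3` gives `c1, c3` in this sketch.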
zhichao-li closed this Mar 22, 2016
zhichao-li deleted the orderby branch March 22, 2016 01:16
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
asfgit pushed a commit that referenced this pull request Mar 25, 2016
#### What changes were proposed in this pull request?
This PR is to support group by position in SQL. For example, when users input the following query
```SQL
select c1 as a, c2, c3, sum(*) from tbl group by 1, 3, c4
```
The ordinals are recognized as the positions in the select list. Thus, `Analyzer` converts it to
```SQL
select c1, c2, c3, sum(*) from tbl group by c1, c3, c4
```

This is controlled by the config option `spark.sql.groupByOrdinal`.
- When true, the ordinal numbers in group by clauses are treated as the position in the select list.
- When false, the ordinal numbers are ignored.
- Only convert integer literals (not foldable expressions). Foldable expressions, if found, are ignored.
- When a position specified in the group by clause corresponds to an aggregate function in the select list, output an exception message.
- Star is not allowed in the select list when users specify ordinals in group by.

Note: This PR is taken from #10731. When merging this PR, please give the credit to zhichao-li

Also cc all the people who are involved in the previous discussion:  rxin cloud-fan marmbrus yhuai hvanhovell adrian-wang chenghao-intel tejasapatil

#### How was this patch tested?

Added a few test cases for both positive and negative test cases.

Author: gatorsmile
Author: xiaoli
Author: Xiao Li

Closes #11846 from gatorsmile/groupByOrdinal.
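
The GROUP BY rules listed in the commit message above can be sketched in a similar standalone way. Again this is only an illustration under assumptions: the expression types and `resolve` are hypothetical, errors are returned as `Left` rather than thrown, the config switch is left out, and the out-of-range check is an assumption not stated in the message.

```scala
object GroupByOrdinalSketch {
  // Hypothetical miniature expression tree, standing in for Catalyst expressions.
  sealed trait Expr
  final case class IntLit(value: Int) extends Expr
  final case class Col(name: String) extends Expr
  final case class Alias(child: Expr, name: String) extends Expr
  final case class Agg(function: String, arg: Expr) extends Expr
  case object Star extends Expr

  private def stripAlias(e: Expr): Expr = e match {
    case Alias(child, _) => child
    case other           => other
  }

  // Resolve one GROUP BY item against the select list.
  private def resolveOne(selectList: Seq[Expr])(e: Expr): Either[String, Expr] = e match {
    case IntLit(i) if i < 1 || i > selectList.size =>
      Left(s"GROUP BY position $i is out of range") // assumption
    case IntLit(i) =>
      stripAlias(selectList(i - 1)) match {
        case _: Agg => Left(s"GROUP BY position $i refers to an aggregate function")
        case target => Right(target)
      }
    case other => Right(other)
  }

  def resolve(selectList: Seq[Expr], groupBy: Seq[Expr]): Either[String, Seq[Expr]] = {
    val usesOrdinal = groupBy.exists(_.isInstanceOf[IntLit])
    if (usesOrdinal && selectList.contains(Star))
      Left("star is not allowed in the select list when GROUP BY uses ordinals")
    else {
      val results = groupBy.map(resolveOne(selectList))
      results.collectFirst { case Left(msg) => msg } match {
        case Some(msg) => Left(msg)
        case None      => Right(results.collect { case Right(expr) => expr })
      }
    }
  }
}
```

Applied to the example query, `group by 1, 3, c4` resolves to `c1, c3, c4` in this sketch, while `group by 4` is rejected because position 4 holds the aggregate.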

8 participants