[SPARK-13957] [SQL] Support Group By Ordinal in SQL #11846

gatorsmile · 2016-03-20T04:19:21Z

What changes were proposed in this pull request?

This PR adds support for grouping by ordinal position in SQL. For example, when a user submits the following query:

select c1 as a, c2, c3, sum(*) from tbl group by 1, 3, c4

the ordinals are interpreted as positions in the select list, so the Analyzer converts the query to:

select c1 as a, c2, c3, sum(*) from tbl group by c1, c3, c4

This is controlled by the config option spark.sql.groupByOrdinal.

When true, ordinal numbers in the group by clause are treated as positions in the select list.
When false, ordinal numbers are ignored.
Only integer literals are converted. Foldable expressions are not converted; they are ignored.
When a position specified in the group by clause refers to an aggregate function in the select list, an exception is raised.
A star is not allowed in the select list when ordinals are used in the group by clause.
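The rules above can be sketched on a toy expression model. This is only an illustration under stated assumptions, not the code in this PR: the types (Col, IntLit, Alias, Agg, Star) and the resolve function are hypothetical stand-ins for Catalyst classes, and the out-of-range behavior is an assumption because the description does not specify it.

```scala
object GroupByOrdinalSketch {
  // Hypothetical toy expression types; these are NOT Spark's Catalyst classes.
  sealed trait Expr
  final case class Col(name: String) extends Expr
  final case class IntLit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class Alias(child: Expr, name: String) extends Expr
  final case class Agg(fn: String, arg: Expr) extends Expr
  case object Star extends Expr

  // `enabled` plays the role of the spark.sql.groupByOrdinal option.
  def resolve(selectList: Seq[Expr], groupBy: Seq[Expr], enabled: Boolean): Seq[Expr] =
    if (!enabled) groupBy // ordinals are left untouched
    else {
      // Only top-level integer literals count as ordinals.
      val hasOrdinal = groupBy.exists(_.isInstanceOf[IntLit])
      if (hasOrdinal && selectList.contains(Star))
        throw new IllegalArgumentException("star is not allowed with group by ordinals")
      groupBy.map {
        case IntLit(i) if i > 0 && i <= selectList.size =>
          selectList(i - 1) match {
            case _: Agg | Alias(_: Agg, _) =>
              throw new IllegalArgumentException(s"group by position $i is an aggregate function")
            case Alias(child, _) => child // group by the underlying expression
            case e               => e
          }
        case IntLit(i) =>
          // Assumption: the description does not say what happens out of range.
          throw new IllegalArgumentException(s"group by position $i is not in the select list")
        case other => other // foldable expressions such as 1 + 1 are not converted
      }
    }
}
```

With this model, the query in the description corresponds to resolving Seq(IntLit(1), IntLit(3), Col("c4")) against the four-item select list, which yields Seq(Col("c1"), Col("c3"), Col("c4")).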

Note: This PR is taken from #10731. When merging this PR, please give credit to @zhichao-li.

Also cc'ing everyone involved in the previous discussion: @rxin @cloud-fan @marmbrus @yhuai @hvanhovell @adrian-wang @chenghao-intel @tejasapatil

How was this patch tested?

Added test cases covering both positive and negative scenarios.

SparkQA · 2016-03-22T00:32:29Z

Test build #53723 has finished for PR 11846 at commit b61345b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-03-22T02:48:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+        val newGroups = groups.map {
+          case IntegerIndex(index) if index > 0 && index <= aggs.size =>
+            aggs(index - 1) match {
+              case Alias(c, _) if c.isInstanceOf[AggregateExpression] =>


How about sum(a) + 1? I think we need to use TreeNode.find to check whether there are any aggregate functions inside it.

We already have a method called containsAggregate somewhere; we should call it here.

Uh, yeah! Let me fix it and add a test case. Thanks!
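To illustrate the concern raised in this thread: a pattern that only matches an aggregate directly under an alias misses expressions such as sum(a) + 1, whereas a tree search finds it. The sketch below uses a hypothetical toy tree with its own find method, loosely modeled on the TreeNode.find mentioned above; none of these types are Spark's.

```scala
object ContainsAggregateSketch {
  // Hypothetical toy expression tree; NOT Spark's Catalyst classes.
  sealed trait Expr {
    def children: Seq[Expr]
    // Pre-order search that returns the first node satisfying the predicate.
    def find(p: Expr => Boolean): Option[Expr] =
      if (p(this)) Some(this)
      else children.foldLeft(Option.empty[Expr])((acc, c) => acc.orElse(c.find(p)))
  }
  final case class Col(name: String) extends Expr { def children: Seq[Expr] = Nil }
  final case class Lit(value: Int) extends Expr { def children: Seq[Expr] = Nil }
  final case class Add(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }
  final case class Sum(child: Expr) extends Expr { def children: Seq[Expr] = Seq(child) }
  final case class Alias(child: Expr, name: String) extends Expr {
    def children: Seq[Expr] = Seq(child)
  }

  // Mirrors the shape of the check in the diff: only sees an aggregate directly under the alias.
  def isTopLevelAggregate(e: Expr): Boolean = e match {
    case Alias(_: Sum, _) => true
    case _                => false
  }

  // Searches the whole tree, so nested aggregates are found as well.
  def containsAggregate(e: Expr): Boolean = e.find(_.isInstanceOf[Sum]).isDefined
}
```

For Alias(Add(Sum(Col("a")), Lit(1)), "x"), the top-level check returns false while the tree search returns true, which is the gap the reviewer points out.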

SparkQA · 2016-03-22T03:57:31Z

Test build #53737 has finished for PR 11846 at commit 18bab66.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-22T06:58:58Z

Test build #53746 has finished for PR 11846 at commit b19b73c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-03-22T07:24:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+        } else {
+          val expanded = a.aggregateExpressions.flatMap {
+            case s: Star => s.expand(a.child, resolver)
+            case u @ UnresolvedAlias(_: Star, _) => expandStarExpression(u.child, a.child) :: Nil


When will we hit this branch?

select * from tab group by col1, col2

But why doesn't Project have this case?

I think this is intentionally added by CatalystQl. I can double check if this is the root cause.

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/CatalystQl.scala#L224-L225

After reading the code, Project still has a problem in star expansion:

val structDf = testData2.select("a", "b").as("record")
structDf.select(hash($"record.*"))

Sorry, the previous PR does not cover all the cases. Let me submit a separate PR to handle all the star expansion.

Of course, if we want to limit the support of star expansion in group by, we can do it for sure.
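The two star cases discussed in this thread (a bare star in the select list, and a star nested inside another expression such as hash(record.*)) can be sketched with flatMap on a toy model. This is an assumption-laden illustration only: Col, Star, Func, expandNested and expandSelectList are hypothetical names, and the real logic lives in the Analyzer.

```scala
object StarExpansionSketch {
  // Hypothetical toy expression types; NOT Spark's Catalyst classes.
  sealed trait Expr
  final case class Col(name: String) extends Expr
  case object Star extends Expr
  final case class Func(name: String, args: Seq[Expr]) extends Expr

  // Expands stars that appear among a function's arguments, keeping the
  // function itself as a single select-list item.
  def expandNested(e: Expr, output: Seq[Expr]): Expr = e match {
    case Func(n, args) =>
      Func(n, args.flatMap {
        case Star  => output
        case other => Seq(expandNested(other, output))
      })
    case other => other
  }

  // A bare star becomes one item per child output column; anything else stays one item.
  def expandSelectList(exprs: Seq[Expr], output: Seq[Expr]): Seq[Expr] =
    exprs.flatMap {
      case Star    => output
      case f: Func => Seq(expandNested(f, output))
      case other   => Seq(other)
    }
}
```

Using flatMap lets one input item grow into several output items for a bare star, while a nested star only widens the argument list of its enclosing function.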

SparkQA · 2016-03-23T04:34:03Z

Test build #53876 has finished for PR 11846 at commit 74a16be.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-24T21:37:36Z

Test build #54059 has finished for PR 11846 at commit a06c4ce.

This patch fails from timeout after a configured wait of 250m.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-03-24T21:53:30Z

retest this please.

SparkQA · 2016-03-24T23:55:54Z

Test build #54098 has finished for PR 11846 at commit a06c4ce.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-03-25T01:02:02Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

      // which is a 1-base position of the projection list.
      case s @ Sort(orders, global, child)
-          if conf.orderByOrdinal && orders.exists(o => IntegerIndex.unapply(o.child).nonEmpty) =>
+          if conf.orderByOrdinal && child.resolved &&


We can add a case plan if !plan.childrenResolved => plan at the beginning.

Sure, let me do it. Thanks!

I will use p instead of plan, since IntelliJ warns that plan could shadow an existing name.
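The suggested guard can be sketched as follows. This is a minimal toy, assuming hypothetical Plan, Relation and Sort types with a childrenResolved helper; it only shows how an early case lets the rule skip plans whose children are not yet resolved, so later cases need no child.resolved condition of their own.

```scala
object ChildrenResolvedGuardSketch {
  // Hypothetical toy logical plan; NOT Spark's LogicalPlan.
  sealed trait Plan {
    def children: Seq[Plan]
    def resolved: Boolean
    def childrenResolved: Boolean = children.forall(_.resolved)
  }
  final case class Relation(name: String, resolved: Boolean) extends Plan {
    def children: Seq[Plan] = Nil
  }
  final case class Sort(ordinals: Seq[Int], child: Plan, resolved: Boolean = false) extends Plan {
    def children: Seq[Plan] = Seq(child)
  }

  def rule(plan: Plan): Plan = plan match {
    // Bail out first: nothing below can be resolved safely yet.
    // The binder is named p because reusing plan would shadow the parameter.
    case p if !p.childrenResolved => p
    // From here on, every case may assume its children are resolved.
    case s: Sort => s.copy(resolved = true)
    case other   => other
  }
}
```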

cloud-fan · 2016-03-25T01:05:52Z

LGTM except one minor comment, thanks for working on it!

gatorsmile · 2016-03-25T02:04:07Z

Thank you for your detailed review! :-)

gatorsmile · 2016-03-25T02:41:58Z

retest this please

SparkQA · 2016-03-25T04:27:59Z

Test build #54138 has finished for PR 11846 at commit 6d08009.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-03-25T04:58:14Z

Thanks, merging to master!

gatorsmile and others added 30 commits November 13, 2015 14:50

Merge remote-tracking branch 'upstream/master'

01e4cdf

Merge remote-tracking branch 'upstream/master'

6835704

Merge remote-tracking branch 'upstream/master'

9180687

SPARK-11633

b38a21e

Merge remote-tracking branch 'upstream/master' into joinMakeCopy

d2b84af

Merge remote-tracking branch 'upstream/master'

fda8025

Merge branch 'master' of https://github.com/gatorsmile/spark

ac0dccd

Merge remote-tracking branch 'upstream/master'

6e0018b

converge

0546772

converge

b37a64f

Merge remote-tracking branch 'upstream/master'

c2a872c

Merge remote-tracking branch 'upstream/master'

ab6dbd7

Merge remote-tracking branch 'upstream/master'

4276356

Merge remote-tracking branch 'upstream/master'

2dab708

Merge remote-tracking branch 'upstream/master'

0458770

Merge remote-tracking branch 'upstream/master'

1debdfa

Merge remote-tracking branch 'upstream/master'

763706d

Merge remote-tracking branch 'upstream/master'

4de6ec1

Merge remote-tracking branch 'upstream/master'

9422a4f

Merge remote-tracking branch 'upstream/master'

52bdf48

Merge remote-tracking branch 'upstream/master'

1e95df3

Merge remote-tracking branch 'upstream/master'

fab24cf

Merge remote-tracking branch 'upstream/master'

8b2e33b

Merge remote-tracking branch 'upstream/master'

2ee1876

Merge remote-tracking branch 'upstream/master'

b9f0090

Merge remote-tracking branch 'upstream/master'

ade6f7e

Merge remote-tracking branch 'upstream/master'

9fd63d2

Merge remote-tracking branch 'upstream/master'

5199d49

Merge remote-tracking branch 'upstream/master'

404214c

Merge remote-tracking branch 'upstream/master'

c001dd9

gatorsmile added 2 commits March 21, 2016 19:06

Merge remote-tracking branch 'upstream/master'

c08f561

Merge branch 'groupByOrdinalNew' into groupByOrdinalNewNew

18bab66

# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

cloud-fan reviewed Mar 22, 2016

gatorsmile added 2 commits March 21, 2016 21:08

temp fix.

dacf2d8

fixed an issue in star expansion for group by

b19b73c

cloud-fan reviewed Mar 22, 2016

gatorsmile added 2 commits March 22, 2016 15:36

Merge remote-tracking branch 'upstream/master'

474df88

address comments.

74a16be

gatorsmile added 2 commits March 24, 2016 09:43

Merge remote-tracking branch 'upstream/master'

3d9828d

Merge branch 'groupByOrdinalNewNew' into groupByOrdinalNewNewNew

a06c4ce

# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

cloud-fan reviewed Mar 25, 2016

address comments.

6d08009

asfgit closed this in 05f652d Mar 25, 2016

hvanhovell mentioned this pull request May 6, 2016

[SPARK-12063][SQL] Use number in group by clause to refer to columns #10052

Closed

6 participants