[SPARK-12063][SQL] Use number in group by clause to refer to columns #10052

dereksabryfb · 2015-12-01T00:57:35Z

If a number n appears in a group by clause, it is replaced with the nth column of the select clause, so that the group by clause refers to that column instead of the number.

e.g.
select a,b from c group by 1,2
becomes
select a,b from c group by a,b
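
The substitution described above can be sketched in a few lines. This is a toy model over plain Python lists, not the Analyzer rule from this PR: the function name `resolve_group_by_ordinals` is hypothetical, and raising on an out-of-range position is an assumption, since the description does not say what should happen in that case.

```python
def resolve_group_by_ordinals(select_list, group_by):
    """Replace 1-based integer positions in group_by with select-list entries."""
    resolved = []
    for item in group_by:
        # bool is a subclass of int in Python, so rule it out explicitly.
        if isinstance(item, int) and not isinstance(item, bool):
            if not 1 <= item <= len(select_list):
                # Assumed behavior: the PR description does not cover this case.
                raise ValueError(f"GROUP BY position {item} is out of range")
            resolved.append(select_list[item - 1])
        else:
            # Anything that is not an integer position is kept as written.
            resolved.append(item)
    return resolved


# select a, b from c group by 1, 2  ->  group by a, b
print(resolve_group_by_ordinals(["a", "b"], [1, 2]))  # prints ['a', 'b']
```

Positions are 1-based, matching the example above, so position n maps to index n - 1 of the select list.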

marmbrus · 2015-12-01T06:48:02Z

Thanks for working on this. This seems reasonable to support, but I have two suggestions:

I would probably implement this as a rule in the Analyzer so that it is not specific to the Hive parser.
Please add a unit test, probably in SQLQuerySuite

rxin · 2015-12-01T23:40:20Z

@dereksabryfb you should also add the email you used in your git commit to your github profile so it shows up on github.

dereksabryfb · 2015-12-03T19:55:17Z

Thanks for your feedback! The change is now implemented as a rule in the Analyzer and a unit test has been added. Let me know if there are any other changes I need to make.

marmbrus · 2015-12-07T23:34:42Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

I'm not sure we want to match on the pretty string as I think this would also trigger for things like "1", instead I'd consider matching on a Literal of NumericType.
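
A small sketch of why matching on the printed form can misfire. The `Literal` class and its `data_type` tag below are hypothetical stand-ins, not Catalyst's classes, and the sketch assumes (as the comment above suspects) that a string literal "1" prints the same way as the integer 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: object
    data_type: str  # hypothetical type tag, e.g. "int" or "string"

    def pretty(self) -> str:
        # The printed form drops the type, so 1 and "1" look identical.
        return str(self.value)


def is_ordinal_by_string(expr):
    # Fragile: decides from the printed form only.
    return isinstance(expr, Literal) and expr.pretty().isdigit()


def is_ordinal_by_type(expr):
    # Safer: decides from the literal's declared type.
    return isinstance(expr, Literal) and expr.data_type == "int"


int_one = Literal(1, "int")
str_one = Literal("1", "string")

print(is_ordinal_by_string(str_one))  # prints True (false positive)
print(is_ordinal_by_type(str_one))    # prints False
print(is_ordinal_by_type(int_one))    # prints True
```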

marmbrus · 2015-12-07T23:37:43Z

Are there other cases that we should handle here? Can you do the same thing in ORDER BY normally?

dereksabryfb · 2015-12-08T00:59:25Z

Thanks for the feedback! I'm making the changes you suggested. The 'ORDER BY' clause looks to be semi-handled by ResolveSortReferences in the Analyzer. Because standard SQL allows literals in the order by clause, this rule just sorts by the literal value '1'. In HiveQL, '1' is interpreted as a column, in the same way it is in the group by clause.

I can add the case for a Sort() with an IntegerType. I assume no one is relying on how 'sort by 1' currently functions.

marmbrus · 2015-12-08T18:27:31Z

We'll have to note the change in the release notes, but since it's a no-op to sort by a constant, I think we can safely change behavior here.
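
To illustrate the two readings of 'sort by 1', here is a single-machine sketch using Python's stable `sorted`. The function names are hypothetical, and this says nothing about how Spark orders rows internally; it only shows that a constant sort key carries no ordering information, while a positional reading sorts by the first column.

```python
def sort_by_constant(rows, constant):
    # Every row gets the same key, so Python's stable sort keeps the input order.
    return sorted(rows, key=lambda row: constant)


def sort_by_ordinal(rows, position):
    # Treat the number as a 1-based column position.
    return sorted(rows, key=lambda row: row[position - 1])


rows = [(3, "c"), (1, "a"), (2, "b")]

print(sort_by_constant(rows, 1))  # prints [(3, 'c'), (1, 'a'), (2, 'b')]
print(sort_by_ordinal(rows, 1))   # prints [(1, 'a'), (2, 'b'), (3, 'c')]
```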

dereksabryfb · 2015-12-10T17:45:21Z

Added a case for sort

marmbrus · 2015-12-10T19:01:54Z

ok to test

SparkQA · 2015-12-10T19:14:51Z

Test build #47534 has finished for PR 10052 at commit 8a5a4f6.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

dereksabryfb · 2015-12-10T19:33:58Z

Apologies, I haven't been able to run ./dev/run-tests, getting the following exception: http://pastebin.com/L0p0sjtJ

so I wasn't able to pick up the style issues, and I'm not sure if there's more that the build doesn't flag.

marmbrus · 2015-12-10T19:37:33Z

I'd try the following locally: build/sbt scalastyle test:scalastyle catalyst/test sql/test

Each of those commands can be run separately too, and you can use ~ to rerun whenever something changes to iterate more quickly: build/sbt ~scalastyle

SparkQA · 2015-12-10T20:00:29Z

Test build #47538 has finished for PR 10052 at commit bd453d5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dereksabryfb · 2015-12-11T00:05:14Z

Looks like it fails on the query "SELECT a, count(2) FROM testData2 GROUP BY a, 2" because 2 now refers to the column count(2), which is not valid syntax for a group by clause in Hive. How should I go about resolving this? The purpose of the test is to see that literals in the group by clause don't modify the results, but the purpose of this patch is to do the opposite of that. I could modify the offending case, but I feel the whole test may be irrelevant with this patch.

marmbrus · 2015-12-11T02:10:13Z

We should probably throw an AnalysisException for this if they use a column ordinal that refers to an aggregate expression. The fact that we make it all the way to the execution is pretty confusing to a user.

Regarding the test, we can probably remove it.
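
The suggested check can be sketched as follows: resolve each position, and fail during analysis if it points at an aggregate. Everything here is a toy model. `AnalysisError`, `Expr`, and `resolve_group_by` are hypothetical names, not Spark's AnalysisException or Catalyst expressions, and the out-of-range check is an added assumption.

```python
from dataclasses import dataclass


class AnalysisError(Exception):
    """Hypothetical stand-in for an analysis-time error."""


@dataclass(frozen=True)
class Expr:
    sql: str
    is_aggregate: bool = False


def resolve_group_by(select_list, group_by):
    resolved = []
    for item in group_by:
        if isinstance(item, int) and not isinstance(item, bool):
            if not 1 <= item <= len(select_list):
                # Assumed behavior, not discussed in the thread.
                raise AnalysisError(f"GROUP BY position {item} is out of range")
            target = select_list[item - 1]
            if target.is_aggregate:
                # Fail early instead of letting the query reach execution.
                raise AnalysisError(
                    f"GROUP BY position {item} refers to aggregate {target.sql}"
                )
            resolved.append(target)
        else:
            resolved.append(item)
    return resolved


# Models: SELECT a, count(2) FROM testData2 GROUP BY a, 2
select_list = [Expr("a"), Expr("count(2)", is_aggregate=True)]
try:
    resolve_group_by(select_list, [Expr("a"), 2])
except AnalysisError as err:
    print(err)  # prints GROUP BY position 2 refers to aggregate count(2)
```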


dereksabryfb · 2015-12-11T03:52:13Z

I removed the offending test.

I found that no AnalysisException was thrown even when there was an explicit aggregate in the group by clause (e.g. select a from b group by count(a)); it would fail in the same way, so I added the check to CheckAnalysis. If you think this is out of the scope of this pull request (since it isn't strictly to do with a number reference), I can create a new task and attach just that commit to it.

Thanks again for your feedback.

SparkQA · 2015-12-11T04:33:07Z

Test build #47566 has finished for PR 10052 at commit 09b3f77.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-12-11T04:37:30Z

Test build #47565 has finished for PR 10052 at commit e4edc31.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2015-12-18T16:57:33Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

case Literal(index: Int) => is easier. It also eliminates the need for group.toString.toInt

hvanhovell · 2016-01-18T09:37:00Z

@dereksabryfb are you still working on this?

srowen · 2016-05-06T17:32:32Z

Ping @dereksabryfb -- are you working on this? or else close it

hvanhovell · 2016-05-06T17:42:53Z

I think this one can be closed; it has been implemented in #11846.

JoshRosen · 2016-09-07T22:57:13Z

Yep, it looks like this PR has been subsumed by #11846. @dereksabryfb, could you please close this pull request? Thanks!

Use number to refer to columns in group by clause

74c42ae

dereksabryfb force-pushed the group_by_number branch from 86d7e93 to 74c42ae on December 1, 2015 01:04

Group by Column Number for Spark SQL

a179e6f

minor style change

b4cfcbf

marmbrus reviewed Dec 7, 2015

Add Sort() case

8a5a4f6

scala style

bd453d5

dereksabryfb added 2 commits December 10, 2015 19:22

Remove literal in group by clause test, as this functioanlity has changed

e4edc31

aggregate in group by clause causes failure in execution

09b3f77

hvanhovell reviewed Dec 18, 2015

hvanhovell mentioned this pull request Jan 18, 2016

[SPARK-12789]Support order by index and group by index #10731

Closed

HyukjinKwon mentioned this pull request Sep 12, 2016

[BUILD] Closing some stale PRs and ones suggested to be closed by committer(s) #15057

Closed

asfgit closed this in 46f5c20 Sep 13, 2016
