[SPARK-35028][SQL] ANSI mode: disallow group by aliases #32129

gengliangwang · 2021-04-12T12:19:44Z

What changes were proposed in this pull request?

Disallow group by aliases under ANSI mode.

Why are the changes needed?

As per the ANSI SQL standard section 7.12 <group by clause>:

Each grouping column reference shall unambiguously reference a column of the table resulting from the from clause. A column referenced in a group by clause is a grouping column.

By forbidding group by aliases, we can avoid ambiguous SQL queries like:

SELECT col + 1 as col FROM t GROUP BY col
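
To make the ambiguity concrete, here is a minimal sketch of the two possible readings of a `GROUP BY` name that is both a table column and a select-list alias. It makes several assumptions that are not part of this PR: it uses Python's built-in `sqlite3` module purely as a stand-in SQL engine (not Spark), it uses a hypothetical variant of the query with `abs(col)` instead of `col + 1` (because `col + 1` is one-to-one, so both readings would yield the same groups), and the table contents and the names `out_col`, `by_table_column`, and `by_alias` are made up for illustration. Each reading is written out explicitly so that the result does not depend on how any particular engine resolves the ambiguous name.

```python
import sqlite3


def run_readings():
    """Run both explicit readings of an ambiguous GROUP BY on toy data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (col INTEGER)")
    # Hypothetical data: -1 and 1 are distinct rows but share abs(col) = 1.
    conn.executemany("INSERT INTO t VALUES (?)", [(-1,), (1,), (2,)])

    # Reading 1: the GROUP BY name refers to the table column `col`.
    by_table_column = conn.execute(
        "SELECT abs(col) AS out_col, count(*) AS n "
        "FROM t GROUP BY col ORDER BY col"
    ).fetchall()

    # Reading 2: the GROUP BY name refers to the select-list alias,
    # which is equivalent to grouping by the aliased expression abs(col).
    by_alias = conn.execute(
        "SELECT abs(col) AS out_col, count(*) AS n "
        "FROM t GROUP BY abs(col) ORDER BY abs(col)"
    ).fetchall()

    conn.close()
    return by_table_column, by_alias


by_table_column, by_alias = run_readings()
print(by_table_column)  # [(1, 1), (1, 1), (2, 1)]
print(by_alias)         # [(1, 2), (2, 1)]
```

The two readings return a different number of rows for the same select list, which is the kind of ambiguity the standard's wording rules out.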

Does this PR introduce any user-facing change?

Yes, group by aliases are not allowed under ANSI mode.

How was this patch tested?

Unit tests

gengliangwang · 2021-04-12T12:20:33Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

Moving ANSI_ENABLED to the front so that other configurations can refer to it without compilation errors.

gengliangwang · 2021-04-12T12:27:11Z

There has been some discussion under the PR that supports group by alias: #17191
@cloud-fan also mentioned the issue behind that behavior in #27775

Group by aliases are convenient, but they can be ambiguous and are incompatible with the SQL standard.

SparkQA · 2021-04-12T13:12:31Z

Test build #137212 has finished for PR 32129 at commit 8c84858.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-12T13:18:45Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41792/

SparkQA · 2021-04-12T13:18:46Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41792/

cloud-fan · 2021-04-12T15:01:14Z

retest this please

SparkQA · 2021-04-12T16:44:21Z

Test build #137220 has finished for PR 32129 at commit 8c84858.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-12T16:45:46Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41800/

SparkQA · 2021-04-12T16:45:47Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41800/

SparkQA · 2021-04-12T19:00:53Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41806/

SparkQA · 2021-04-12T19:00:54Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41806/

SparkQA · 2021-04-12T22:24:31Z

Test build #137226 has finished for PR 32129 at commit 62cee4f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2021-04-12T23:00:32Z

docs/sql-ref-ansi-compliance.md

 The behavior of some SQL operators can be different under ANSI mode (`spark.sql.ansi.enabled=true`).
  - `array_col[index]`: This operator throws `ArrayIndexOutOfBoundsException` if using invalid indices.
  - `map_col[key]`: This operator throws `NoSuchElementException` if key does not exist in map.
+  - `GROUP BY`: aliases in a select list can not be used in GROUP BY clauses. Each column referenced in a GROUP BY clause shall unambiguously reference a column of the table resulting from the FROM clause.


nit: in a GROUP BY clause -> by a GROUP BY clause?

Both should work. The second sentence is from the ANSI SQL standard.

HyukjinKwon · 2021-04-13T02:26:44Z

Nice!

gengliangwang · 2021-04-13T02:42:26Z

Merging to master

### What changes were proposed in this pull request?

Revert [[SPARK-35028][SQL] ANSI mode: disallow group by aliases](#32129)

### Why are the changes needed?

It turns out that many users are using the group by alias feature. Spark has its precedence rule when alias names conflict with column names in Group by clause: always use the table column. This should be reasonable and acceptable.
Also, external DBMS such as PostgreSQL and MySQL allow grouping by alias, too.
As we are going to announce ANSI mode GA in Spark 3.2, I suggest allowing the group by alias in ANSI mode.

### Does this PR introduce _any_ user-facing change?

No, the feature is not released yet.

### How was this patch tested?

Unit tests

Closes #33758 from gengliangwang/revertGroupByAlias.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
(cherry picked from commit 8bfb4f1)
Signed-off-by: Gengliang Wang <[email protected]>
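
The precedence rule described in the revert message (when an alias name conflicts with a column name, always use the table column) can be sketched as a toy name resolver. This is only an illustration under stated assumptions, not Spark's analyzer: the function `resolve_group_by_name`, its parameters, the `allow_alias` flag, and the `column:`/`alias:` result strings are all hypothetical, and case sensitivity and qualified names are ignored. Setting `allow_alias=False` models the stricter behavior that #32129 proposed for ANSI mode.

```python
from typing import Dict, List


def resolve_group_by_name(
    name: str,
    table_columns: List[str],
    select_aliases: Dict[str, str],
    allow_alias: bool = True,
) -> str:
    """Toy resolution of a single GROUP BY identifier (hypothetical helper).

    Table columns always take precedence over select-list aliases.
    With allow_alias=False, an alias is never used as a fallback.
    """
    if name in table_columns:
        return f"column:{name}"
    if allow_alias and name in select_aliases:
        return f"alias:{select_aliases[name]}"
    raise ValueError(f"cannot resolve GROUP BY name: {name}")


# SELECT col + 1 AS col FROM t GROUP BY col: the table column wins.
conflict = resolve_group_by_name("col", ["col"], {"col": "col + 1"})

# SELECT col + 1 AS c2 FROM t GROUP BY c2: no such column, so the alias is used.
fallback = resolve_group_by_name("c2", ["col"], {"c2": "col + 1"})

print(conflict)  # column:col
print(fallback)  # alias:col + 1
```

With `allow_alias=False`, the second call raises `ValueError` instead of falling back to the alias, while the first call still resolves to the table column.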




gengliangwang requested review from cloud-fan and maropu April 12, 2021 12:20

github-actions bot added DOCS SQL labels Apr 12, 2021

cloud-fan approved these changes Apr 12, 2021


gengliangwang added 2 commits April 13, 2021 01:26

disallow group by alias

7c2dbe5

update sql.out

62cee4f

gengliangwang force-pushed the disallowGroupByAlias branch from 8c84858 to 62cee4f Compare April 12, 2021 17:31

maropu reviewed Apr 12, 2021


maropu approved these changes Apr 12, 2021


HyukjinKwon approved these changes Apr 13, 2021


gengliangwang closed this in 79e55b4 Apr 13, 2021

gengliangwang mentioned this pull request Aug 17, 2021

Revert "[SPARK-35028][SQL] ANSI mode: disallow group by aliases" #33758

Closed
