[SPARK-31022][SQL] group by alias should fail if there are name conflicts #27775

cloud-fan · 2020-03-03T16:46:02Z

What changes were proposed in this pull request?

Make GROUP BY alias resolution fail when an alias has the same name as an input column, such as `SELECT col + 1 as col FROM t GROUP BY col`.

Why are the changes needed?

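It's super confusing that `SELECT col + 1 as new_col FROM t GROUP BY new_col` and `SELECT col + 1 as col FROM t GROUP BY col` work differently: the first groups by the aliased expression `col + 1`, while the second groups by the original column `col`.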
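To make the difference concrete, here is a minimal toy model of the name lookup, not Spark's analyzer code. The function `resolve_group_by` and its parameters are hypothetical names used only for illustration. It assumes the 2.4 rule described in this PR (child output is tried first, aliases second) and models the proposed change as an optional failure when both match.

```python
from typing import Dict, List


def resolve_group_by(
    name: str,
    child_output: List[str],
    select_aliases: Dict[str, str],
    fail_on_conflict: bool = False,
) -> str:
    """Return the expression that a GROUP BY name resolves to (toy model).

    child_output:     column names of the input relation.
    select_aliases:   alias name -> aliased expression from the SELECT list.
    fail_on_conflict: hypothetical switch modelling the behavior proposed here.
    """
    in_child = name in child_output
    in_alias = name in select_aliases

    # Proposed behavior: a name that matches both is rejected.
    if in_child and in_alias and fail_on_conflict:
        raise ValueError(f"GROUP BY {name} is ambiguous: column or alias?")

    # 2.4 behavior as described in this PR: child output wins.
    if in_child:
        return name
    # Otherwise fall back to the SELECT-list alias.
    if in_alias:
        return select_aliases[name]
    raise KeyError(f"cannot resolve {name}")


# SELECT col + 1 as new_col FROM t GROUP BY new_col
print(resolve_group_by("new_col", ["col"], {"new_col": "col + 1"}))
# SELECT col + 1 as col FROM t GROUP BY col
print(resolve_group_by("col", ["col"], {"col": "col + 1"}))
```

In this sketch the first call resolves to the aliased expression and the second to the plain column, which is the asymmetry the PR wants to turn into an error.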

Does this PR introduce any user-facing change?

Yes, GROUP BY alias now fails if there are name conflicts.

How was this patch tested?

new tests

cloud-fan · 2020-03-03T16:46:24Z

cc @maropu @gatorsmile

SparkQA · 2020-03-03T18:23:12Z

Test build #119241 has finished for PR 27775 at commit 1317477.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-03-04T02:12:09Z

docs/sql-migration-guide.md


    - The decimal string representation can be different between Hive 1.2 and Hive 2.3 when using `TRANSFORM` operator in SQL for script transformation, which depends on hive's behavior. In Hive 1.2, the string representation omits trailing zeroes. But in Hive 2.3, it is always padded to 18 digits with trailing zeroes if necessary.

+  - Since Spark 3.0, group by alias fails if there are name conflicts like `SELECT col + 1 as col FROM t GROUP BY col`. In Spark version 2.4 and earlier, it works and the column will be resolved using child output. To restore the previous behaviour, set `spark.sql.legacy.allowAmbiguousGroupByAlias` to `true`.


nit: like -> such as

HyukjinKwon · 2020-03-04T02:12:18Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala


+  val LEGACY_ALLOW_AMBIGUOUS_GROUP_BY_ALIAS =
+    buildConf("spark.sql.legacy.allowAmbiguousGroupByAlias")
+      .doc(s"When ${GROUP_BY_ALIASES.key} is enabled and this conf is true, Spark will resolve " +


nit: conf -> configuration

HyukjinKwon · 2020-03-04T02:12:39Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+    buildConf("spark.sql.legacy.allowAmbiguousGroupByAlias")
+      .doc(s"When ${GROUP_BY_ALIASES.key} is enabled and this conf is true, Spark will resolve " +
+        "the GROUP BY column using child's output, even though there is an ambiguous alias in " +
+        "the SELECT clause. Id false, Spark fails the query.")


nit: Id -> If

dongjoon-hyun

Hi, @cloud-fan. Sorry, but I'm -1 because this is a regression from 2.x. Confusion is too subjective a reason to deprecate this behavior behind a new legacy config. Do you have any other reason?

PostgreSQL and MySQL work like Apache Spark 2.4.x.

cc @dbtsai

maropu · 2020-03-04T23:45:09Z

Yeah, it seems SQL Server, Oracle, and Presto accept this alias, so I'm worried that this change would confuse users.

cloud-fan · 2020-03-05T11:36:30Z

It turns out the TPC-DS queries also have name conflicts. I think users just have to accept the current behavior and be careful when writing GROUP BY columns.

group by alias should fail if there are name conflicts

1317477

HyukjinKwon reviewed Mar 4, 2020

dongjoon-hyun requested changes Mar 4, 2020

dongjoon-hyun added the SQL label Mar 4, 2020

cloud-fan closed this Mar 5, 2020

gengliangwang mentioned this pull request Apr 12, 2021

[SPARK-35028][SQL] ANSI mode: disallow group by aliases #32129

Closed
