Skip to content

Conversation

@cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

Make group by alias fail if there are name conflicts like SELECT col + 1 as col FROM t GROUP BY col.

Why are the changes needed?

It's super confusing that SELECT col + 1 as new_col FROM t GROUP BY new_col and SELECT col + 1 as col FROM t GROUP BY col works differently.

Does this PR introduce any user-facing change?

yes, group by alias now fails if there are name conflicts.

How was this patch tested?

new tests

@cloud-fan
Copy link
Contributor Author

cc @maropu @gatorsmile

@SparkQA
Copy link

SparkQA commented Mar 3, 2020

Test build #119241 has finished for PR 27775 at commit 1317477.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


- The decimal string representation can be different between Hive 1.2 and Hive 2.3 when using `TRANSFORM` operator in SQL for script transformation, which depends on hive's behavior. In Hive 1.2, the string representation omits trailing zeroes. But in Hive 2.3, it is always padded to 18 digits with trailing zeroes if necessary.

- Since Spark 3.0, group by alias fails if there are name conflicts like `SELECT col + 1 as col FROM t GROUP BY col`. In Spark version 2.4 and earlier, it works and the column will be resolved using child output. To restore the previous behaviour, set `spark.sql.legacy.allowAmbiguousGroupByAlias` to `true`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: like -> such as


val LEGACY_ALLOW_AMBIGUOUS_GROUP_BY_ALIAS =
buildConf("spark.sql.legacy.allowAmbiguousGroupByAlias")
.doc(s"When ${GROUP_BY_ALIASES.key} is enabled and this conf is true, Spark will resolve " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: conf -> configuration

buildConf("spark.sql.legacy.allowAmbiguousGroupByAlias")
.doc(s"When ${GROUP_BY_ALIASES.key} is enabled and this conf is true, Spark will resolve " +
"the GROUP BY column using child's output, even though there is an ambiguous alias in " +
"the SELECT clause. Id false, Spark fails the query.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit Id -> if

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @cloud-fan . Sorry, but I'm -1 because this is a regression from 2.x. Confusion is too subjective to deprecate this behind a new legacy config. Is there any other reason you have?

PostgreSQL and MySQL works like Apache Spark 2.4.x.

cc @dbtsai

@maropu
Copy link
Member

maropu commented Mar 4, 2020

Yea, it seems SQL server, oracle and presto accept this alias, so I worried that this change makes users a bit confused.

@cloud-fan
Copy link
Contributor Author

It turns out the TPCDS queries also have name conflicts. I think users can only accept it and be careful when writing GROUP BY columns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants