[SPARK-38185][SQL] Fix data incorrect if aggregate function is empty #35490
Conversation
cloud-fan
left a comment
good catch!
cc @sigmod as well
```diff
 private[sql] def groupOnly: Boolean = {
-  aggregateExpressions.map {
+  // aggregateExpressions can be empty through Dataset.agg
+  aggregateExpressions.nonEmpty && aggregateExpressions.map {
```
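The root cause is vacuous truth: checking a condition over an empty Scala sequence succeeds trivially, so an empty aggregate list was misclassified as "group only". A minimal, Spark-free sketch of the two checks (the `AggLike` type and method names here are illustrative stand-ins, not the real Catalyst API):

```scala
// Stand-in for an aggregate output expression; `isGroupRef` marks an
// expression that merely references a grouping column. Illustrative only.
case class AggLike(isGroupRef: Boolean)

// Old check: `forall` over an empty sequence is vacuously true,
// so an empty aggregate list passed as "group only".
def oldGroupOnly(aggExprs: Seq[AggLike]): Boolean =
  aggExprs.forall(_.isGroupRef)

// Fixed check: an empty aggregate list is a global aggregate, never group only.
def fixedGroupOnly(aggExprs: Seq[AggLike]): Boolean =
  aggExprs.nonEmpty && aggExprs.forall(_.isGroupRef)

println(oldGroupOnly(Seq.empty))   // true  -- the misclassification
println(fixedGroupOnly(Seq.empty)) // false
```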
Is it more explicit to check `groupingExpressions.nonEmpty`? Logically, an aggregate operator without output columns is still group-only if it has grouping columns.
makes sense
thanks, merging to master/3.2!
Add `aggregateExpressions.nonEmpty` check in the `groupOnly` function.
The group-only condition should check whether the aggregate expressions are empty.
In the DataFrame API, it is allowed to build an empty aggregation.
So the following query should return 1 rather than 0, because it is a global aggregate:
```scala
val emptyAgg = Map.empty[String, String]
spark.range(2).where("id > 2").agg(emptyAgg).limit(1).count
```
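The semantics behind the expected result: a global aggregate always emits exactly one row, even over empty input, while a group-by aggregate emits one row per group, hence zero rows over empty input. A plain-Scala sketch of just the row counts (names are illustrative; this models counts only, not Spark execution):

```scala
// A global aggregate emits exactly one row, even over no input rows.
def globalAggRowCount[A](rows: Seq[A]): Int = {
  val count = rows.size   // e.g. count(*), still computed over empty input
  Seq(count).size         // the single output row
}

// A group-by aggregate emits one row per group; empty input has no groups.
def groupByRowCount[A, K](rows: Seq[A])(key: A => K): Int =
  rows.groupBy(key).size

println(globalAggRowCount(Seq.empty[Long]))         // 1
println(groupByRowCount(Seq.empty[Long])(identity)) // 0
```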
yes, bug fix
Add test
Closes #35490 from ulysses-you/SPARK-38185.
Authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 25a4c5f)
Signed-off-by: Wenchen Fan <[email protected]>
dongjoon-hyun
left a comment
+1, LGTM. Thank you, @ulysses-you and all.
The code looks good but
thank you @dongjoon-hyun. The rule that rewrites a group-only aggregate with limit 1 into a project landed in branch-3.3 (see SPARK-36183), and it is the key trigger of this bug. But the group-only condition has existed since branch-3.2 (see SPARK-34808), so it would be good to also fix this in branch-3.2.