
Conversation

@nikolamand-db (Contributor) commented Jul 29, 2024

What changes were proposed in this pull request?

Fix the `RewriteDistinctAggregates` rule to properly handle aggregation on DISTINCT literals. Physical plan for `select count(distinct 1) from t`:

```
-- count(distinct 1)
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(distinct 1)], output=[count(DISTINCT 1)#2L])
   +- HashAggregate(keys=[], functions=[partial_count(distinct 1)], output=[count#6L])
      +- HashAggregate(keys=[], functions=[], output=[])
         +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=20]
            +- HashAggregate(keys=[], functions=[], output=[])
               +- FileScan parquet spark_catalog.default.t[] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/nikola.mandic/oss-spark/spark-warehouse/org.apache.spark.s..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
```

The problem occurs when the `HashAggregate(keys=[], functions=[], output=[])` node yields one row to the `partial_count` node, which then counts that row: a global aggregate with no grouping keys always emits exactly one row, even over empty input. This four-node structure is constructed by `AggUtils.planAggregateWithOneDistinct`.

To fix the problem, we add an `Expand` node which forces non-empty grouping expressions in the `HashAggregateExec` nodes. This in turn enables streaming zero rows to the parent `partial_count` node, yielding the correct final result.
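Conceptually, the rewrite makes the query behave like a grouped aggregate. The sketch below is illustrative only (it is not the literal plan the rule produces) and shows why introducing a grouping key fixes the empty-table case:

```scala
// Illustrative sketch, not the literal rewritten plan: once a grouping key
// exists, an empty input produces zero groups, so the final count is 0.
// COUNT(DISTINCT 1) is semantically equivalent to:
spark.sql("SELECT COUNT(c) FROM (SELECT DISTINCT 1 AS c FROM t) AS sub")
// empty t     => inner subquery emits 0 rows => COUNT(c) = 0
// non-empty t => inner subquery emits 1 row  => COUNT(c) = 1
```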

Why are the changes needed?

Aggregation with a DISTINCT literal gives wrong results. For example, when running on an empty table `t`:
`select count(distinct 1) from t` returns 1, while the correct result should be 0.
For reference:
`select count(1) from t` returns 0, which is the correct and expected result.

Does this PR introduce any user-facing change?

Yes, this fixes a critical bug in Spark.

How was this patch tested?

New e2e SQL tests for aggregates with DISTINCT literals.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions bot added the SQL label Jul 29, 2024
@cloud-fan (Contributor)

can we move the tests to SQL golden files?

@uros-db (Contributor) commented Jul 30, 2024

@cloud-fan we've updated the tests quite a bit to try and limit the impact of e2e SQL testing, but we believe it's best to keep them like this instead of using golden files: we use loops and test cases to verify expected results against a table with varying numbers of rows, while also verifying whether Expand was injected.
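For illustration, a sketch of that test shape (assuming Spark's `QueryTest`/`SQLTestUtils` helpers such as `withTable`, `sql`, and `checkAnswer`; the test name and row counts are illustrative, not the exact suite in this PR):

```scala
// Sketch of the test approach described above; assumes:
//   import org.apache.spark.sql.Row
//   import org.apache.spark.sql.catalyst.plans.logical.Expand
test("count(distinct <literal>) over tables of varying size") {
  Seq(0, 1, 5).foreach { numRows =>
    withTable("t") {
      sql("CREATE TABLE t (col STRING) USING PARQUET")
      (1 to numRows).foreach(i => sql(s"INSERT INTO t VALUES ('$i')"))
      val df = sql("SELECT COUNT(DISTINCT 1) FROM t")
      // count(distinct 1) must be 0 for an empty table and 1 otherwise
      checkAnswer(df, Row(if (numRows == 0) 0L else 1L))
      // the fix should inject an Expand node into the optimized logical plan
      assert(df.queryExecution.optimizedPlan.collect { case e: Expand => e }.nonEmpty)
    }
  }
}
```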

@uros-db (Contributor) left a comment


all checks look good, @cloud-fan please review

@cloud-fan (Contributor)

Please fill the PR description

@uros-db (Contributor) commented Jul 31, 2024

I think @nikolamand-db will have to do that because it's his PR; I don't have access to edit the PR description.

### What changes were proposed in this pull request?
Fix `RewriteDistinctAggregates` rule to deal properly with aggregation on DISTINCT literals.


### Why are the changes needed?
Aggregation with DISTINCT literal gives wrong results. For example:
`select count(distinct 1) from t` returns 1, while the correct result should be 0.
For reference:
`select count(1) from t` returns 0, which is the correct and expected result.


### Does this PR introduce _any_ user-facing change?
Yes, this fixes a critical bug in Spark.


### How was this patch tested?
New e2e SQL tests for aggregates with DISTINCT literals.


### Was this patch authored or co-authored using generative AI tooling?
No.

@uros-db (Contributor) commented Jul 31, 2024

Also, I think it's worth noting (at least in this comment) that the optimizer rule `RewriteDistinctAggregates` is in `nonExcludableRules`; this is important because the changes address a correctness issue, so the rule cannot be disabled via `spark.sql.optimizer.excludedRules`.
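For context, the wiring in Catalyst's `Optimizer` looks roughly like this (a sketch; surrounding entries are elided and the exact list varies by Spark version):

```scala
// Sketch: rules named in nonExcludableRules cannot be disabled via
// spark.sql.optimizer.excludedRules, so this correctness fix always runs.
def nonExcludableRules: Seq[String] = Seq(
  // ... other non-excludable rules ...
  RewriteDistinctAggregates.ruleName
)
```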

@nikolamand-db changed the title [SPARK-49000][SQL][WIP] Fix "select count(distinct 1) from t" where t is empty table by expanding RewriteDistinctAggregates [SPARK-49000][SQL] Fix "select count(distinct 1) from t" where t is empty table by expanding RewriteDistinctAggregates Jul 31, 2024
@nikolamand-db requested a review from cloud-fan July 31, 2024 08:28
@nikolamand-db (Contributor, Author)

@cloud-fan resolved all threads, should we merge now? Thanks.

@cloud-fan (Contributor)

thanks, merging to master/3.5!

@cloud-fan closed this in dfa2133 Jul 31, 2024
cloud-fan pushed a commit that referenced this pull request Jul 31, 2024
[SPARK-49000][SQL] Fix "select count(distinct 1) from t" where t is empty table by expanding RewriteDistinctAggregates


Closes #47525 from nikolamand-db/SPARK-49000-spark-expand-approach.

Lead-authored-by: Uros Bojanic <[email protected]>
Co-authored-by: Nikola Mandic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit dfa2133)
Signed-off-by: Wenchen Fan <[email protected]>
@dongjoon-hyun (Member) left a comment


Thank you, @nikolamand-db and @cloud-fan .

According to the JIRA, this is filed against 3.0.0 with the following JIRA report. If so, can we have this on branch-3.4? Could you confirm the affected version number once more?

It appears that this bug affects all (or most) released versions of Spark.
...
Aggregation with DISTINCT literal gives wrong results. For example, when running on empty table t:
select count(distinct 1) from t returns 1, while the correct result should be 0.

When I use spark-sql, Apache Spark 3.5.1 and 3.4.2 seem to work correctly, as in the following. Is there a handy way to check this PR's case?

```
spark-sql (default)> select count(distinct 1) from (select * from range(1) where 1 = 0);
0
Time taken: 0.055 seconds, Fetched 1 row(s)
```

@dongjoon-hyun (Member)

cc @yaooqinn and @viirya, too

@uros-db (Contributor) commented Jul 31, 2024

@dongjoon-hyun please try this:

  test("minimal test") {
    withTable("tbl") {
      sql("create table tbl (col string) using parquet")
      checkAnswer(sql("select count(distinct 1) from tbl"), Row(1)) // wrong
    }
  }

@dongjoon-hyun (Member) commented Jul 31, 2024

Thank you, @uros-db. I confirmed with Spark 3.4.3.

```
spark-sql (default)> SELECT version();
3.4.3 1eb558c3a6fbdd59e5a305bc3ab12ce748f6511f
Time taken: 0.054 seconds, Fetched 1 row(s)

spark-sql (default)> select count(*) from tbl;
0
Time taken: 0.109 seconds, Fetched 1 row(s)

spark-sql (default)> select count(distinct 1) from tbl;
1
Time taken: 0.336 seconds, Fetched 1 row(s)
```

Inline review comment (Member) on:

```scala
object RewriteDistinctAggregates extends Rule[LogicalPlan] {
  private def mustRewrite(
      aggregateExpressions: Seq[AggregateExpression],
```

s/aggregateExpressions/distinctAggs/

Inline review comment (Member) on:

```scala
  private def mustRewrite(
      aggregateExpressions: Seq[AggregateExpression],
      groupingExpressions: Seq[Expression]): Boolean = {
    // If there are any AggregateExpressions with filter, we need to rewrite the query.
```

s/any/any distinct/

Inline review comment (Member) on lines 214 to 216:

```scala
    // clause for this rule because aggregation strategy can handle a single distinct aggregate
    // group without filter clause.
    // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
```

It is better to update the comment.

Inline review comment from @viirya (Member), Jul 31, 2024, on:

```scala
      groupingExpressions: Seq[Expression]): Boolean = {
    // If there are any AggregateExpressions with filter, we need to rewrite the query.
    // Also, if there are no grouping expressions and all aggregate expressions are foldable,
    // we need to rewrite the query, e.g. SELECT COUNT(DISTINCT 1).
```

Compared to the comment in mayNeedtoRewrite, which explains why rewriting is necessary, this comment doesn't explain anything: it just states that the query needs to be rewritten. It simply describes what the code does, which is obvious.

To improve code readability, it would be better to explain why the rewriting is needed in this case.
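For illustration, a more explanatory comment along the lines of this feedback might read (a sketch, not necessarily the wording that was merged):

```scala
// If there are no grouping expressions and all aggregate expressions are
// foldable (e.g. SELECT COUNT(DISTINCT 1)), the single-distinct planning path
// builds a partial aggregate with no grouping keys, which emits one row even
// for empty input and therefore produces a wrong non-zero count. Rewriting
// with Expand introduces grouping keys, so empty input yields zero rows.
```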

@yaooqinn (Member) commented Aug 1, 2024

Removed this from branch-3.5 as it causes GA failures:

https://github.com/apache/spark/actions/runs/10182404165/job/28186327497

@yaooqinn yaooqinn reopened this Aug 1, 2024
s"""SELECT COUNT(DISTINCT 1, "col") FROM $t"""
),
AggregateTestCaseDefault(
s"""SELECT COUNT(DISTINCT collation("abc")) FROM $t"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case cannot be merged into branch-3.5, as collation is a new function added in Spark 4.0.
cc @nikolamand-db @cloud-fan
also cc @yaooqinn

Reply (Contributor):

Yes, collation doesn't exist in older versions, so this test will need to be excluded.

I can take care of that in a follow-up.

@cloud-fan (Contributor)

@nikolamand-db please send a follow-up PR to address post-hoc review comments, and then create backport PRs for 3.5 and 3.4

@uros-db (Contributor) commented Aug 1, 2024

@cloud-fan I'll be creating the follow-up: #47565

let's first merge this into master, and we can backport later

@yaooqinn (Member) commented Aug 1, 2024

Closing this first, as a new PR is more appropriate for branch-3.5 and/or branch-3.4.

@yaooqinn closed this Aug 1, 2024