[SPARK-42346][SQL] Rewrite distinct aggregates after subquery merge #39887

peter-toth · 2023-02-05T10:18:27Z

What changes were proposed in this pull request?

Unfortunately, #32298 introduced a regression from Spark 3.2 to 3.3: after that change, a merged subquery can contain multiple distinct-type aggregates. Those aggregates need to be rewritten by the RewriteDistinctAggregates rule to get correct results. This PR fixes that by running RewriteDistinctAggregates after subqueries are merged.

Why are the changes needed?

The following query:

SELECT
  (SELECT count(distinct c1) FROM t1),
  (SELECT count(distinct c2) FROM t1)

currently fails with:

java.lang.IllegalStateException: You hit a query analyzer bug. Please report your query to Spark user mailing list.
	at org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:538)

but works again after this PR.
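For readers unfamiliar with the rewrite, here is a rough plain-Python sketch of the expand-then-aggregate idea that, as I understand it, RewriteDistinctAggregates applies when one aggregate has several distinct groups. It is only an illustration, not Spark code: the function name `count_distinct_via_expand`, the sample rows in `t1`, and the use of `None` for SQL NULL are all assumptions made for this example.

```python
from collections import defaultdict


def count_distinct_via_expand(rows):
    """Compute count(DISTINCT c1) and count(DISTINCT c2) over (c1, c2) rows.

    Hypothetical helper for illustration only; None stands in for SQL NULL.
    """
    # Step 1 (expand): emit one copy of each row per distinct group,
    # tagged with a group id and with the other group's column nulled out.
    expanded = []
    for c1, c2 in rows:
        expanded.append((1, c1, None))  # gid 1 feeds count(DISTINCT c1)
        expanded.append((2, None, c2))  # gid 2 feeds count(DISTINCT c2)

    # Step 2 (first aggregate): group by (gid, c1, c2) to drop duplicates.
    deduplicated = set(expanded)

    # Step 3 (second aggregate): ordinary non-distinct counts per gid,
    # skipping NULL values just as count() does.
    counts = defaultdict(int)
    for gid, c1, c2 in deduplicated:
        value = c1 if gid == 1 else c2
        if value is not None:
            counts[gid] += 1
    return counts[1], counts[2]


# Made-up sample data standing in for table t1.
t1 = [(1, "a"), (1, "b"), (2, "b"), (3, None)]
print(count_distinct_via_expand(t1))  # prints (3, 2)
```

After the two subqueries above are merged into one aggregate over t1, that aggregate has two distinct groups (c1 and c2), which is the shape this kind of rewrite is meant to handle before planning.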

Does this PR introduce any user-facing change?

Yes, the above query works again.

How was this patch tested?

Added a new unit test.

peter-toth · 2023-02-05T10:25:35Z

cc @cloud-fan, @viirya, @wangyum

RobinL · 2023-02-05T10:44:36Z

Thanks so much @peter-toth! Amazing to have a fix so quickly

cloud-fan · 2023-02-06T12:54:52Z

late LGTM

### What changes were proposed in this pull request?

Unfortunately #32298 introduced a regression from Spark 3.2 to 3.3 as after that change a merged subquery can contain multiple distinct type aggregates. Those aggregates need to be rewritten by the `RewriteDistinctAggregates` rule to get the correct results. This PR fixed that.

### Why are the changes needed?

The following query:

```
SELECT
  (SELECT count(distinct c1) FROM t1),
  (SELECT count(distinct c2) FROM t1)
```

currently fails with:

```
java.lang.IllegalStateException: You hit a query analyzer bug. Please report your query to Spark user mailing list.
	at org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:538)
```

but works again after this PR.

### Does this PR introduce _any_ user-facing change?

Yes, the above query works again.

### How was this patch tested?

Added new UT.

Closes #39887 from peter-toth/SPARK-42346-rewrite-distinct-aggregates-after-subquery-merge.

Authored-by: Peter Toth <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
(cherry picked from commit 5940b98)
Signed-off-by: Yuming Wang <[email protected]>

wangyum · 2023-02-06T13:27:12Z

Merged to master, branch-3.4 and branch-3.3.

peter-toth · 2023-02-06T13:41:54Z

Thanks for the review!

run RewriteDistinctAggregates after MergeScalarSubqueries

e0b5eb6

github-actions bot added the SQL label Feb 5, 2023

fix test

acd4fda

RobinL mentioned this pull request Feb 5, 2023

Count distinct bug in Spark 3.3.x causing ''You hit a query analyzer bug." moj-analytical-services/splink#1021

wangyum reviewed Feb 5, 2023

sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala (resolved)

wangyum approved these changes Feb 5, 2023

viirya approved these changes Feb 5, 2023

wangyum closed this in 5940b98 Feb 6, 2023
