[SPARK-47241][SQL] Fix rule order issues for ExtractGenerator #45350

cloud-fan · 2024-03-01T06:56:15Z

What changes were proposed in this pull request?

The rule ExtractGenerator does not define any trigger condition when rewriting generator functions in Project, which makes its behavior unstable and heavily dependent on the execution order of analyzer rules.

Two bugs I've found so far:

1. By design, we want to forbid users from using more than one generator function in SELECT. However, we can't really enforce this if the two generator functions are not resolved at the same time: the rule sees only one generator function (the other is still unresolved) and rewrites it. The other one gets resolved later and is rewritten in a later pass.
2. When a generator function is put after SELECT *, it's possible that * is not expanded yet when we enter ExtractGenerator. The rule rewrites the generator function: it inserts a Generate operator below the Project and adds a new column to the projectList for the generator function output. When * is then expanded to the output of the child plan, which is now Generate, we end up with two identical columns for the generator function output.
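The second bug can be reproduced in a few lines with a toy model. This is only a sketch: `Star`, `Col`, `Gen`, `Project`, `extractGeneratorEagerly` and `expandStar` are hypothetical stand-ins, not Catalyst classes or analyzer rules, and the model tracks column names only.

```scala
// Toy model of the rule-order problem. All names are hypothetical,
// none of them are Spark's Catalyst classes.
object StarOrderDemo {
  sealed trait Expr
  case object Star extends Expr                    // an unexpanded `*`
  case class Col(name: String) extends Expr        // a resolved column
  case class Gen(outputName: String) extends Expr  // a generator such as explode(...)

  // A Project on top of a child plan whose output columns are `childOutput`.
  case class Project(projectList: Seq[Expr], childOutput: Seq[String])

  // Extraction without a trigger condition: rewrite as soon as a generator
  // is present, even if `*` has not been expanded yet.
  def extractGeneratorEagerly(p: Project): Project = {
    val generatorOutputs = p.projectList.collect { case Gen(n) => n }
    val rewritten = p.projectList.map {
      case Gen(n) => Col(n)
      case other  => other
    }
    // The new Generate child also outputs the generator column.
    Project(rewritten, p.childOutput ++ generatorOutputs)
  }

  // Star expansion: replace `*` with every output column of the child.
  def expandStar(p: Project): Project =
    p.copy(projectList = p.projectList.flatMap {
      case Star  => p.childOutput.map(Col(_))
      case other => Seq(other)
    })

  def main(args: Array[String]): Unit = {
    val plan = Project(Seq(Star, Gen("c1")), childOutput = Seq("id"))
    // Extraction first, star expansion second: c1 shows up twice.
    println(expandStar(extractGeneratorEagerly(plan)).projectList)
    // prints List(Col(id), Col(c1), Col(c1))
    // Star expansion first, extraction second: the expected result.
    println(extractGeneratorEagerly(expandStar(plan)).projectList)
    // prints List(Col(id), Col(c1))
  }
}
```

The same two rewrites give different results depending only on the order in which they run, which is why a trigger condition is needed.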

This PR fixes both by adding a trigger condition when rewriting generator functions in Project: every expression in the projectList must be either resolved or a generator function. This is the same trigger condition we use for Aggregate. To avoid breaking changes, this PR also allows multiple generator functions in Project, which works fine.
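The trigger condition can be sketched as a predicate over the project list. This is a simplified illustration rather than the code in Analyzer.scala: the expression types and the helpers `isResolved`, `isGenerator`, `canRewrite` and `shouldExtract` are hypothetical names for this example.

```scala
// Toy model of the trigger condition. All names are hypothetical,
// none of them are Spark's Catalyst classes.
object TriggerConditionDemo {
  sealed trait Expr
  case object Star extends Expr                    // an unexpanded `*`
  case class Unresolved(name: String) extends Expr // a function or column not resolved yet
  case class Col(name: String) extends Expr        // a resolved column
  case class Gen(outputName: String) extends Expr  // a generator function

  def isResolved(e: Expr): Boolean = e match {
    case Star | Unresolved(_) => false
    case _                    => true
  }

  def isGenerator(e: Expr): Boolean = e match {
    case Gen(_) => true
    case _      => false
  }

  // Every entry must be either resolved or a generator.
  def canRewrite(projectList: Seq[Expr]): Boolean =
    projectList.forall(e => isResolved(e) || isGenerator(e))

  // Rewrite only when there is a generator and the list is ready.
  def shouldExtract(projectList: Seq[Expr]): Boolean =
    projectList.exists(isGenerator) && canRewrite(projectList)

  def main(args: Array[String]): Unit = {
    // `*` not expanded yet: wait, which avoids the duplicated column.
    println(shouldExtract(Seq(Star, Gen("c1"))))                 // false
    // A second function is still unresolved: wait until it is known.
    println(shouldExtract(Seq(Gen("a"), Unresolved("explode")))) // false
    // Everything is resolved or a generator: rewrite now.
    println(shouldExtract(Seq(Col("id"), Gen("c1"))))            // true
    // Multiple generators are accepted.
    println(shouldExtract(Seq(Gen("a"), Gen("b"))))              // true
  }
}
```

Deferring the rewrite until the whole list is ready means the rule sees all generators at once, regardless of the order in which other analyzer rules fire.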

Why are the changes needed?

bug fix

Does this PR introduce any user-facing change?

Yes. Multiple generator functions are now allowed in Project, and the generator function output no longer produces duplicated columns.

How was this patch tested?

new test

Was this patch authored or co-authored using generative AI tooling?

No

cloud-fan · 2024-03-01T06:56:55Z

cc @viirya @gengliangwang

cloud-fan · 2024-03-06T15:24:07Z

cc @MaxGekk @yaooqinn @dongjoon-hyun

viirya · 2024-03-06T17:30:33Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

      case Aggregate(_, aggList, _) if aggList.count(hasGenerator) > 1 =>
        val generators = aggList.filter(hasGenerator).map(trimAlias)
-        throw QueryCompilationErrors.moreThanOneGeneratorError(generators, "aggregate")
+        throw QueryCompilationErrors.moreThanOneGeneratorError(generators)


This is still for aggregate, but I saw you removed the clause field from the error message?

I find "aggregate clause" confusing, since what end users write is a SELECT query with GROUP BY or aggregate functions.

Another reason is that we can't always tell whether it's an aggregate or not. If there is no GROUP BY, the plan is still a Project and we may fail before the analyzer rewrites it to Aggregate, in which case we report the SELECT clause anyway.

viirya · 2024-03-06T17:35:37Z

sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala

+  test("SPARK-47241: generator function after SELECT *") {
+    val df = sql(
+      s"""
+         |SELECT *, explode(array('a', 'b')) as c1
+         |FROM
+         |(
+         |  SELECT id FROM range(1) GROUP BY 1
+         |)
+         |""".stripMargin)
+    checkAnswer(df, Seq(Row(0, "a"), Row(0, "b")))
+  }


When I read the PR description, I began wondering what it is "after" SELECT *, because I think the Generate will be under Project:

== Physical Plan ==
*(1) Project [id#1L, c1#2, c1#2]
+- *(1) Generate explode([a,b]), [id#1L], false, [c1#2]
   +- *(1) Range (0, 1, step=1, splits=16)

Maybe it should be "before"?

If you are referring to the order in the project list, maybe "generator function after wildcard in SELECT"?

good suggestion!

viirya

Only two minor comments.

dongjoon-hyun

+1, LGTM.

cloud-fan · 2024-03-07T09:01:32Z

thanks for the review, merging to master/3.5!

rajatrj20 · 2024-05-02T07:05:16Z

@cloud-fan This change broke existing behaviour. When an aliased generator field A is referenced by another field B in the project list, the two end up waiting on each other: B waits for A to resolve because it needs the field identifier of A, and A waits for B to resolve because the new check requires all fields other than the generator field to be resolved.

This is how you can reproduce this:

CREATE TABLE nestedTable1 (array_int_col array<int>, array_str_col array<string>, map_str_col map<string,string>, map_int_col map<string,int>, struct_col struct<col_3:string,col_1:int>, col_3 string, col_1 int) USING iceberg

INSERT INTO nestedTable1 (array_int_col, array_str_col, map_str_col, map_int_col, struct_col, col_3, col_1) VALUES (array(1, 2, 3, 4), array('a', 'b', 'c'), map('k1', 'v1', 'k2', 'v2'), map('k', 1), struct('a', 1), 'a', 1)

SELECT col_1, EXPLODE(MAP_KEYS(map_str_col)) AS key, map_str_col[key] AS value FROM nestedTable1;

This results in the following final analyzed plan:

'Project [col_1#192, 'EXPLODE(map_keys(map_str_col#188)) AS key#329, map_str_col#188[lateralAliasReference(key)] AS value#330]                                  
 +- RelationV2[array_int_col#186, array_str_col#187, map_str_col#188, map_int_col#189, struct_col#190, col_3#191, col_1#192]  spark_catalog.default.nestedtable1

Without this change the analyzed plan looks like:

'Project [col_1#192, key#331, map_str_col#188['key] AS value#330]                                                                                                    
 +- Generate explode(map_keys(map_str_col#188)), false, [key#331]                                                                                                     
    +- RelationV2[array_int_col#186, array_str_col#187, map_str_col#188, map_int_col#189, struct_col#190, col_3#191, col_1#192]  spark_catalog.default.nestedtable1
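The circular wait described above can be sketched with a toy fixed-point loop. This is an illustration under simplifying assumptions, not the analyzer's actual logic: `Col`, `Gen`, `ValueOf`, `Project`, `resolveReferences`, `extractGenerator` and `analyze` are hypothetical names, and the real plans involve lateral alias handling that is not modelled here.

```scala
// Toy model of the circular wait. All names are hypothetical,
// none of them are Spark's Catalyst classes.
object LateralAliasDeadlockDemo {
  sealed trait Expr
  case class Col(name: String) extends Expr
  case class Gen(alias: String) extends Expr // e.g. EXPLODE(...) AS key
  // e.g. map_str_col[key] AS value, which refers to the generator alias
  case class ValueOf(keyName: String, resolved: Boolean = false) extends Expr

  case class Project(projectList: Seq[Expr], childOutput: Set[String])

  def isResolved(e: Expr): Boolean = e match {
    case ValueOf(_, r) => r
    case _             => true
  }

  def isGenerator(e: Expr): Boolean = e match {
    case Gen(_) => true
    case _      => false
  }

  // A reference resolves only once the child plan outputs the name it needs.
  def resolveReferences(p: Project): Project =
    p.copy(projectList = p.projectList.map {
      case v @ ValueOf(k, false) if p.childOutput.contains(k) => v.copy(resolved = true)
      case other => other
    })

  // With `guarded = true` the rewrite waits for the trigger condition.
  def extractGenerator(p: Project, guarded: Boolean): Project = {
    val ready = !guarded || p.projectList.forall(e => isResolved(e) || isGenerator(e))
    if (!ready) p
    else Project(
      p.projectList.map {
        case Gen(a) => Col(a)
        case other  => other
      },
      p.childOutput ++ p.projectList.collect { case Gen(a) => a })
  }

  // Apply both rules until the plan stops changing.
  @annotation.tailrec
  def analyze(p: Project, guarded: Boolean): Project = {
    val next = extractGenerator(resolveReferences(p), guarded)
    if (next == p) p else analyze(next, guarded)
  }

  def main(args: Array[String]): Unit = {
    val plan = Project(
      Seq(Col("col_1"), Gen("key"), ValueOf("key")),
      childOutput = Set("col_1", "map_str_col"))
    // Guarded: neither rule can make progress, so the plan is returned unchanged.
    println(analyze(plan, guarded = true) == plan) // true
    // Unguarded: the generator is extracted first, then the reference resolves.
    println(analyze(plan, guarded = false).projectList)
    // prints List(Col(col_1), Col(key), ValueOf(key,true))
  }
}
```

In this model the guarded run reaches a fixed point with the generator still in the project list, matching the shape of the stuck plan reported above.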

cloud-fan · 2024-05-02T11:11:49Z

SELECT col_1, EXPLODE(MAP_KEYS(map_str_col)) AS key, map_str_col[key] AS value FROM nestedTable1;

I think this can be supported with lateral column alias (LCA). cc @anchovYu

frosforever · 2025-01-31T16:57:49Z

Hello! I'm hitting the same issue @rajatrj20 encountered. Is the intention for the previous behavior to no longer be supported? Apologies if this is not the right forum for follow up.

cloud-fan · 2025-03-20T05:48:28Z

#50310 should fix it

github-actions bot added the SQL label Mar 1, 2024

cloud-fan force-pushed the generate branch from ab5115c to cd6263c on March 4, 2024 16:45

fix rule order issues for ExtractGenerator (b576e81)

cloud-fan force-pushed the generate branch from cd6263c to b576e81 on March 6, 2024 07:11

github-actions bot added the DOCS label Mar 6, 2024

viirya reviewed Mar 6, 2024

viirya approved these changes Mar 6, 2024

yaooqinn approved these changes Mar 7, 2024

Update sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunction… (942a3d3)

dongjoon-hyun approved these changes Mar 7, 2024

cloud-fan closed this in 51f4cfa Mar 7, 2024

cloud-fan mentioned this pull request Mar 20, 2025: [SPARK-47241][SQL][FOLLOWUP] Fix issue when laterally referencing a Generator #50310 (Closed)
