@cloud-fan
Contributor

What changes were proposed in this pull request?

The rule ExtractGenerator does not define any trigger condition when rewriting generator functions in Project, which makes the behavior quite unstable and heavily dependent on the execution order of analyzer rules.

Two bugs I've found so far:

  1. By design, we want to forbid users from using more than one generator function in SELECT. However, we can't really enforce it if two generator functions are not resolved at the same time: the rule thinks there is only one generator function (the other is still unresolved) and rewrites it. The other one gets resolved later and is rewritten later.
  2. When a generator function is put after SELECT *, it's possible that * is not expanded yet when we enter ExtractGenerator. The rule rewrites the generator function: it inserts a Generate operator below and adds a new column to the projectList for the generator function output. Then we expand * to the child plan output, which is now Generate, and we end up with two identical columns for the generator function output.

This PR fixes it by adding a trigger condition when rewriting generator functions in Project: every expression in the projectList must be either resolved or a generator function. This is the same trigger condition we use for Aggregate. To avoid breaking changes, this PR also allows multiple generator functions in Project, which works totally fine.
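
To illustrate bug 2 and the effect of the trigger condition, here is a minimal toy model in plain Scala. It does not use Catalyst: `Star`, `Col`, `Gen`, `Relation`, `Generate`, `Project`, and the two rule functions are hypothetical stand-ins invented for this sketch, and the rule order is fixed by hand to mimic the unlucky case where ExtractGenerator runs before `*` is expanded.

```scala
// Toy model only: hypothetical types, NOT Spark's Catalyst classes.
object TriggerConditionSketch {
  sealed trait Expr
  case object Star extends Expr                    // an unexpanded `*`
  final case class Col(name: String) extends Expr  // a resolved column
  final case class Gen(out: String) extends Expr   // a resolved generator producing column `out`

  sealed trait Plan { def output: Seq[String] }
  final case class Relation(output: Seq[String]) extends Plan
  final case class Generate(out: String, child: Plan) extends Plan {
    def output: Seq[String] = child.output :+ out
  }
  final case class Project(list: Seq[Expr], child: Plan) extends Plan {
    def output: Seq[String] = list.collect { case Col(n) => n }
  }

  // Stand-in for star expansion: replace `*` with the child's output columns.
  def expandStar(p: Project): Project =
    p.copy(list = p.list.flatMap {
      case Star => p.child.output.map(Col(_))
      case e    => Seq(e)
    })

  // Stand-in for ExtractGenerator: move the generator into a Generate node
  // below the Project. With `guarded = true`, the rewrite only fires once
  // nothing unresolved (here: no `*`) is left in the project list.
  def extractGenerator(p: Project, guarded: Boolean): Project = {
    val ready = !guarded || p.list.forall { case Star => false; case _ => true }
    p.list.collectFirst { case g: Gen => g } match {
      case Some(g) if ready =>
        val newList = p.list.map { e => if (e == g) Col(g.out) else e }
        Project(newList, Generate(g.out, p.child))
      case _ => p
    }
  }

  // Models `SELECT *, explode(...) AS c1 FROM t`, where t has one column `id`.
  def analyze(guarded: Boolean): Seq[String] = {
    val start = Project(Seq(Star, Gen("c1")), Relation(Seq("id")))
    val p1 = extractGenerator(start, guarded) // runs before `*` is expanded
    val p2 = expandStar(p1)
    val p3 = extractGenerator(p2, guarded)    // second pass of the rule
    p3.output
  }
}
```

In this model, `TriggerConditionSketch.analyze(guarded = false)` yields `id, c1, c1` (the duplicated column), while `analyze(guarded = true)` yields `id, c1`, because the rewrite is deferred until `*` has been expanded against the original child.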

Why are the changes needed?

bug fix

Does this PR introduce any user-facing change?

Yes, now multiple generator functions are allowed in Project. And there won't be duplicated columns for generator function output.

How was this patch tested?

new test

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Mar 1, 2024
@cloud-fan
Contributor Author

cc @viirya @gengliangwang

@cloud-fan
Contributor Author

cc @MaxGekk @yaooqinn @dongjoon-hyun

  case Aggregate(_, aggList, _) if aggList.count(hasGenerator) > 1 =>
    val generators = aggList.filter(hasGenerator).map(trimAlias)
-   throw QueryCompilationErrors.moreThanOneGeneratorError(generators, "aggregate")
+   throw QueryCompilationErrors.moreThanOneGeneratorError(generators)
Member

This is still for aggregate, but I see you removed the clause field from the error message?

Contributor Author

I find "aggregate clause" confusing, since what end users write is a SELECT query with GROUP BY or aggregate functions.

Member

Hmm, okay.

Contributor Author

@cloud-fan Mar 7, 2024

Another reason is that we can't always figure out whether it's an aggregate or not. If there is no GROUP BY, the plan is still a Project and we may fail before the analyzer rewrites it to Aggregate, in which case we report the SELECT clause anyway.

Comment on lines 570 to 580
test("SPARK-47241: generator function after SELECT *") {
  val df = sql(
    s"""
       |SELECT *, explode(array('a', 'b')) as c1
       |FROM
       |(
       |  SELECT id FROM range(1) GROUP BY 1
       |)
       |""".stripMargin)
  checkAnswer(df, Seq(Row(0, "a"), Row(0, "b")))
}
Member

When I read the PR description, I began wondering what is "after" SELECT *, because I think the Generate will be under the Project:

== Physical Plan ==
*(1) Project [id#1L, c1#2, c1#2]
+- *(1) Generate explode([a,b]), [id#1L], false, [c1#2]
   +- *(1) Range (0, 1, step=1, splits=16)

Maybe it should be "before"?

Member

If you are referring to the order in the project list, maybe "generator function after wildcard in SELECT"?

Contributor Author

good suggestion!

Member

@viirya left a comment

Only two minor comments.

Member

@dongjoon-hyun left a comment

+1, LGTM.

@cloud-fan
Contributor Author

thanks for the review, merging to master/3.5!

@cloud-fan cloud-fan closed this in 51f4cfa Mar 7, 2024
cloud-fan added a commit that referenced this pull request Mar 7, 2024
Closes #45350 from cloud-fan/generate.

Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 51f4cfa)
Signed-off-by: Wenchen Fan <[email protected]>
@rajatrj20

@cloud-fan This change broke existing behavior. When an aliased generator field A is referenced by another field B in the project list, the two end up waiting on each other: B cannot resolve until A is resolved, because it needs A's field identifier, and A is not rewritten until B is resolved, because the new check requires every field other than the generator field to be resolved first.

This is how you can reproduce this:

CREATE TABLE nestedTable1 (array_int_col array<int>, array_str_col array<string>, map_str_col map<string,string>, map_int_col map<string,int>, struct_col struct<col_3:string,col_1:int>, col_3 string, col_1 int) USING iceberg

INSERT INTO nestedTable1 (array_int_col, array_str_col, map_str_col, map_int_col, struct_col, col_3, col_1) VALUES (array(1, 2, 3, 4), array('a', 'b', 'c'), map('k1', 'v1', 'k2', 'v2'), map('k', 1), struct('a', 1), 'a', 1)

SELECT col_1, EXPLODE(MAP_KEYS(map_str_col)) AS key, map_str_col[key] AS value FROM nestedTable1;

This results in the following final analyzed plan:

'Project [col_1#192, 'EXPLODE(map_keys(map_str_col#188)) AS key#329, map_str_col#188[lateralAliasReference(key)] AS value#330]                                  
 +- RelationV2[array_int_col#186, array_str_col#187, map_str_col#188, map_int_col#189, struct_col#190, col_3#191, col_1#192]  spark_catalog.default.nestedtable1

Without this change the analyzed plan looks like:

'Project [col_1#192, key#331, map_str_col#188['key] AS value#330]                                                                                                    
 +- Generate explode(map_keys(map_str_col#188)), false, [key#331]                                                                                                     
    +- RelationV2[array_int_col#186, array_str_col#187, map_str_col#188, map_int_col#189, struct_col#190, col_3#191, col_1#192]  spark_catalog.default.nestedtable1
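
The circular wait can be sketched with a small toy model in plain Scala. This is only an illustration under simplifying assumptions, not Catalyst code: `Col`, `Ref`, `Gen`, `Project`, `resolveRefs`, and `extractGenerator` are hypothetical names, `map_str_col[key]` is reduced to a bare reference to `key`, references are resolved only against the child output, and the lateral column alias handling that shows up in the plan above is not modeled.

```scala
// Toy model only: hypothetical types, NOT Spark's Catalyst classes.
object GeneratorAliasDeadlockSketch {
  sealed trait Expr { def resolved: Boolean }
  final case class Col(name: String) extends Expr { val resolved = true }
  final case class Ref(name: String) extends Expr { val resolved = false } // not yet resolved
  final case class Gen(alias: String) extends Expr { val resolved = true } // aliased generator

  // A Project reduced to its expression list plus the columns its child exposes.
  final case class Project(list: Seq[Expr], childOutput: Seq[String])

  // Stand-in for reference resolution: a Ref resolves only if the child outputs it.
  def resolveRefs(p: Project): Project =
    p.copy(list = p.list.map {
      case Ref(n) if p.childOutput.contains(n) => Col(n)
      case e                                   => e
    })

  // Stand-in for ExtractGenerator. With `guarded = true`, it fires only when
  // every non-generator expression in the list is already resolved.
  def extractGenerator(p: Project, guarded: Boolean): Project = {
    val others = p.list.filterNot(_.isInstanceOf[Gen])
    val ready = !guarded || others.forall(_.resolved)
    p.list.collectFirst { case g: Gen => g } match {
      case Some(g) if ready =>
        // The generator output becomes a column of the (new) child.
        p.copy(
          list = p.list.map { e => if (e == g) Col(g.alias) else e },
          childOutput = p.childOutput :+ g.alias)
      case _ => p
    }
  }

  // Run both rules to a fixed point, as an analyzer batch would.
  // Models `SELECT col_1, EXPLODE(...) AS key, <expr using key>`.
  def analyze(guarded: Boolean): Project = {
    var plan = Project(
      Seq(Col("col_1"), Gen("key"), Ref("key")),
      Seq("col_1", "map_str_col"))
    var changed = true
    while (changed) {
      val next = extractGenerator(resolveRefs(plan), guarded)
      changed = next != plan
      plan = next
    }
    plan
  }
}
```

In this model, `analyze(guarded = false)` ends with every expression resolved, because the generator is extracted first and `key` then resolves against the new child output. `analyze(guarded = true)` stops immediately with both `Gen("key")` and `Ref("key")` still in the list, which mirrors the unresolved Project shown above.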

@cloud-fan
Contributor Author

SELECT col_1, EXPLODE(MAP_KEYS(map_str_col)) AS key, map_str_col[key] AS value FROM nestedTable1;

I think this can be supported with LCA (lateral column alias). cc @anchovYu

@frosforever

Hello! I'm hitting the same issue @rajatrj20 encountered. Is the intention for the previous behavior to no longer be supported? Apologies if this is not the right forum for follow up.

@cloud-fan
Contributor Author

#50310 should fix it

turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…#374)

Closes apache#45350 from cloud-fan/generate.

Lead-authored-by: Wenchen Fan <[email protected]>


(cherry picked from commit 51f4cfa)

Signed-off-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>