-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-47241][SQL] Fix rule order issues for ExtractGenerator #45350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| case Aggregate(_, aggList, _) if aggList.count(hasGenerator) > 1 => | ||
| val generators = aggList.filter(hasGenerator).map(trimAlias) | ||
| throw QueryCompilationErrors.moreThanOneGeneratorError(generators, "aggregate") | ||
| throw QueryCompilationErrors.moreThanOneGeneratorError(generators) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still for aggregate, but I saw you remove the clause field in the error message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find aggregate clause confusing, as what end users write is a SELECT query with GROUP BY or aggregate functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason is we can't always figure out if it's aggregate or not. If there is no GROUP BY, the plan is still Project and we may fail before analyzer rewrite it to Aggregate, then we report SELECT clause anyway.
| test("SPARK-47241: generator function after SELECT *") { | ||
| val df = sql( | ||
| s""" | ||
| |SELECT *, explode(array('a', 'b')) as c1 | ||
| |FROM | ||
| |( | ||
| | SELECT id FROM range(1) GROUP BY 1 | ||
| |) | ||
| |""".stripMargin) | ||
| checkAnswer(df, Seq(Row(0, "a"), Row(0, "b"))) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I read the PR description, I began wondering what it is "after" SELECT *, because I think the Generate will be under Project:
== Physical Plan ==
*(1) Project [id#1L, c1#2, c1#2]
+- *(1) Generate explode([a,b]), [id#1L], false, [c1#2]
+- *(1) Range (0, 1, step=1, splits=16)
It maybe "before"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you refer the order in project list, maybe generator function after wildcard in SELECT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good suggestion!
viirya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only two minor comments.
sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala
Outdated
Show resolved
Hide resolved
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
|
thanks for the review, merging to master/3.5! |
### What changes were proposed in this pull request? The rule `ExtractGenerator` does not define any trigger condition when rewriting generator functions in `Project`, which makes the behavior quite unstable and heavily depends on the execution order of analyzer rules. Two bugs I've found so far: 1. By design, we want to forbid users from using more than one generator function in SELECT. However, we can't really enforce it if two generator functions are not resolved at the same time: the rule thinks there is only one generate function (the other is still unresolved), then rewrite it. The other one gets resolved later and gets rewritten later. 2. When a generator function is put after `SELECT *`, it's possible that `*` is not expanded yet when we enter `ExtractGenerator`. The rule rewrites the generator function: insert a `Generate` operator below, and add a new column to the projectList for the generator function output. Then we expand `*` to the child plan output which is `Generate`, we end up with two identical columns for the generate function output. This PR fixes it by adding a trigger condition when rewriting generator functions in `Project`: the projectList should be resolved or a generator function. This is the same trigger condition we used for `Aggregate`. To avoid breaking changes, this PR also allows multiple generator functions in `Project`, which works totally fine. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? Yes, now multiple generator functions are allowed in `Project`. And there won't be duplicated columns for generator function output. ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? No Closes #45350 from cloud-fan/generate. Lead-authored-by: Wenchen Fan <[email protected]> Co-authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 51f4cfa) Signed-off-by: Wenchen Fan <[email protected]>
|
@cloud-fan This change broke an existing behaviour. When a aliased generator field A is referenced in some another field B in project list, it will create a situation where the B will wait for A to resolve as it needs to know the field identifier of A and A will wait for B to resolve since we have check in place which needs all fields other than generator field to be resolved. This is how you can reproduce this: This results in following final analyzed plan: Without this change the analyzed plan looks like: |
I think this can be supported with LCA. cc @anchovYu |
|
Hello! I'm hitting the same issue @rajatrj20 encountered. Is the intention for the previous behavior to no longer be supported? Apologies if this is not the right forum for follow up. |
|
#50310 should fix it |
…#374) ### What changes were proposed in this pull request? The rule `ExtractGenerator` does not define any trigger condition when rewriting generator functions in `Project`, which makes the behavior quite unstable and heavily depends on the execution order of analyzer rules. Two bugs I've found so far: 1. By design, we want to forbid users from using more than one generator function in SELECT. However, we can't really enforce it if two generator functions are not resolved at the same time: the rule thinks there is only one generate function (the other is still unresolved), then rewrite it. The other one gets resolved later and gets rewritten later. 2. When a generator function is put after `SELECT *`, it's possible that `*` is not expanded yet when we enter `ExtractGenerator`. The rule rewrites the generator function: insert a `Generate` operator below, and add a new column to the projectList for the generator function output. Then we expand `*` to the child plan output which is `Generate`, we end up with two identical columns for the generate function output. This PR fixes it by adding a trigger condition when rewriting generator functions in `Project`: the projectList should be resolved or a generator function. This is the same trigger condition we used for `Aggregate`. To avoid breaking changes, this PR also allows multiple generator functions in `Project`, which works totally fine. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? Yes, now multiple generator functions are allowed in `Project`. And there won't be duplicated columns for generator function output. ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#45350 from cloud-fan/generate. Lead-authored-by: Wenchen Fan <[email protected]> (cherry picked from commit 51f4cfa) Signed-off-by: Wenchen Fan <[email protected]> Co-authored-by: Wenchen Fan <[email protected]> Co-authored-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
The rule
ExtractGeneratordoes not define any trigger condition when rewriting generator functions inProject, which makes the behavior quite unstable and heavily depends on the execution order of analyzer rules.Two bugs I've found so far:
SELECT *, it's possible that*is not expanded yet when we enterExtractGenerator. The rule rewrites the generator function: insert aGenerateoperator below, and add a new column to the projectList for the generator function output. Then we expand*to the child plan output which isGenerate, we end up with two identical columns for the generate function output.This PR fixes it by adding a trigger condition when rewriting generator functions in
Project: the projectList should be resolved or a generator function. This is the same trigger condition we used forAggregate. To avoid breaking changes, this PR also allows multiple generator functions inProject, which works totally fine.Why are the changes needed?
bug fix
Does this PR introduce any user-facing change?
Yes, now multiple generator functions are allowed in
Project. And there won't be duplicated columns for generator function output.How was this patch tested?
new test
Was this patch authored or co-authored using generative AI tooling?
No