-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-41338][SQL] Resolve outer references and normal columns in the same analyzer batch #38851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is applying these rules once enough? e.g. is it possible that the application of ResolveMissingReferences makes more attributes resolvable by ResolveReferences and ResolveMissingReferences can be applied another time, so forth?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as long as an UnresolvedAttribute can't be resolved by these rules, we can resolve it to outer references.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also these rules won't generate any new UnresolvedAttribute correct?
|
|
||
| // We must run these 3 rules first, as they also resolve `UnresolvedAttribute` and have | ||
| // higher priority than outer reference resolution. | ||
| val prepared = ResolveAggregateFunctions(ResolveMissingReferences(ResolveReferences(plan))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't ResolveMissingReferences and ResolveReferences already run before ResolveOuterReferences? Will we just run this rule separately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key is to run ResolveOuterReferences right after these 3 rules, so that it's safe to resolve UnresolvedAttribute to outer references. Otherwise, other rules may change the plan shape and make these 3 rules applicable again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess one disadvantage of running these three rules inside ResolveOuterReferences is that they are not visible in the plan change log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also it's a bit strange when there is no outer reference to resolve but the 3 rules take effect, the plan change logger of this ResolveOuterReferences actually shows the changes from these 3 rules.
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOuterReferences ===
'Project [id2#28L] 'Project [id2#28L]
+- 'SubqueryAlias __auto_generated_subquery_name +- 'SubqueryAlias __auto_generated_subquery_name
! +- 'Project [id#27, ('id + cast(1 as bigint)) AS id2#28] +- 'Project [id#27, (id#27 + cast(1 as bigint)) AS id2#28]
..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is not a perfect solution, but AFAIK this is the only reliable way to guarantee rule execution order. The best solution in my opinion is to centralize all column resolution code in one rule, but that's a much larger change.
|
thanks for review, merging to master! |
|
|
||
| // We must run these 3 rules first, as they also resolve `UnresolvedAttribute` and have | ||
| // higher priority than outer reference resolution. | ||
| val prepared = ResolveAggregateFunctions(ResolveMissingReferences(ResolveReferences(plan))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess one disadvantage of running these three rules inside ResolveOuterReferences is that they are not visible in the plan change log.
| // `UnresolvedAttribute` but we should never resolve it to outer references. It's a bit | ||
| // hacky that `Generate` uses `UnresolvedAttribute` to store the generator column names, | ||
| // we should clean it up later. | ||
| case g: Generate if g.childrenResolved && !g.resolved => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a unit test for this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
existing tests failed without this change.
…n Project ### What changes were proposed in this pull request? This PR implements a new feature: Implicit lateral column alias on `Project` case, controlled by `spark.sql.lateralColumnAlias.enableImplicitResolution` temporarily (default false now, but will turn on this conf once the feature is completely merged). #### Lateral column alias View https://issues.apache.org/jira/browse/SPARK-27561 for more details on lateral column alias. There are two main cases to support: LCA in Project, and LCA in Aggregate. ```sql -- LCA in Project. The base_salary references an attribute defined by a previous alias SELECT salary AS base_salary, base_salary + bonus AS total_salary FROM employee -- LCA in Aggregate. The avg_salary references an attribute defined by a previous alias SELECT dept, average(salary) AS avg_salary, avg_salary + average(bonus) FROM employee GROUP BY dept ``` This **implicit** lateral column alias (no explicit keyword, e.g. `lateral.base_salary`) should be supported. #### High level design This PR defines a new Resolution rule, `ResolveLateralColumnAlias` to resolve the implicit lateral column alias, covering the `Project` case. It introduces a new leaf node NamedExpression, `LateralColumnAliasReference`, as a placeholder used to hold a referenced that has been temporarily resolved as the reference to a lateral column alias. The whole process is generally divided into two phases: 1) recognize **resolved** lateral alias, wrap the attributes referencing them with `LateralColumnAliasReference`. 2) when the whole operator is resolved, unwrap `LateralColumnAliasReference`. For Project, it further resolves the attributes and push down the referenced lateral aliases to the new Project. For example: ``` // Before Project [age AS a, 'a + 1] +- Child // After phase 1 Project [age AS a, lateralalias(a) + 1] +- Child // After phase 2 Project [a, a + 1] +- Project [child output, age AS a] +- Child ``` #### Resolution order Given this new rule, the name resolution order will be (higher -> lower): ``` local table column > local metadata attribute > local lateral column alias > all others (outer reference of subquery, parameters of SQL UDF, ..) ``` There is a recent refactor that moves the creation of `OuterReference` in the Resolution batch: #38851. Because lateral column alias has higher resolution priority than outer reference, it will try to resolve an `OuterReference` using lateral column alias, similar as an `UnresolvedAttribute`. If success, it strips `OuterReference` and also wraps it with `LateralColumnAliasReference`. ### Why are the changes needed? The lateral column alias is a popular feature wanted for a long time. It is supported by lots of other database vendors (Redshift, snowflake, etc) and provides a better user experience. ### Does this PR introduce _any_ user-facing change? Yes, as shown in the above example, it will be able to resolve lateral column alias. I will write the migration guide or release note when most PRs of this feature are merged. ### How was this patch tested? Existing tests and newly added tests. Closes #38776 from anchovYu/SPARK-27561-refactor. Authored-by: Xinyi Yu <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
… same analyzer batch ### What changes were proposed in this pull request? Today, the way we resolve outer references is very inefficient. It invokes the entire analyzer to resolve the subquery plan, then transforms the plan to resolve `UnresolvedAttribute` to outer references. If the plan is still unresolved, repeat the process until the plan is resolved or the plan doesn't change any more. Ideally, we should only invoke the analyzer once to resolve subquery plans. This PR adds a new rule to resolve outer references, and put it in the main analyzer batch. Then we can safely invoke the analyzer only once. ### Why are the changes needed? Simplify the subquery resolution code and make it more efficient ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes apache#38851 from cloud-fan/outer. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…n Project ### What changes were proposed in this pull request? This PR implements a new feature: Implicit lateral column alias on `Project` case, controlled by `spark.sql.lateralColumnAlias.enableImplicitResolution` temporarily (default false now, but will turn on this conf once the feature is completely merged). #### Lateral column alias View https://issues.apache.org/jira/browse/SPARK-27561 for more details on lateral column alias. There are two main cases to support: LCA in Project, and LCA in Aggregate. ```sql -- LCA in Project. The base_salary references an attribute defined by a previous alias SELECT salary AS base_salary, base_salary + bonus AS total_salary FROM employee -- LCA in Aggregate. The avg_salary references an attribute defined by a previous alias SELECT dept, average(salary) AS avg_salary, avg_salary + average(bonus) FROM employee GROUP BY dept ``` This **implicit** lateral column alias (no explicit keyword, e.g. `lateral.base_salary`) should be supported. #### High level design This PR defines a new Resolution rule, `ResolveLateralColumnAlias` to resolve the implicit lateral column alias, covering the `Project` case. It introduces a new leaf node NamedExpression, `LateralColumnAliasReference`, as a placeholder used to hold a referenced that has been temporarily resolved as the reference to a lateral column alias. The whole process is generally divided into two phases: 1) recognize **resolved** lateral alias, wrap the attributes referencing them with `LateralColumnAliasReference`. 2) when the whole operator is resolved, unwrap `LateralColumnAliasReference`. For Project, it further resolves the attributes and push down the referenced lateral aliases to the new Project. For example: ``` // Before Project [age AS a, 'a + 1] +- Child // After phase 1 Project [age AS a, lateralalias(a) + 1] +- Child // After phase 2 Project [a, a + 1] +- Project [child output, age AS a] +- Child ``` #### Resolution order Given this new rule, the name resolution order will be (higher -> lower): ``` local table column > local metadata attribute > local lateral column alias > all others (outer reference of subquery, parameters of SQL UDF, ..) ``` There is a recent refactor that moves the creation of `OuterReference` in the Resolution batch: apache#38851. Because lateral column alias has higher resolution priority than outer reference, it will try to resolve an `OuterReference` using lateral column alias, similar as an `UnresolvedAttribute`. If success, it strips `OuterReference` and also wraps it with `LateralColumnAliasReference`. ### Why are the changes needed? The lateral column alias is a popular feature wanted for a long time. It is supported by lots of other database vendors (Redshift, snowflake, etc) and provides a better user experience. ### Does this PR introduce _any_ user-facing change? Yes, as shown in the above example, it will be able to resolve lateral column alias. I will write the migration guide or release note when most PRs of this feature are merged. ### How was this patch tested? Existing tests and newly added tests. Closes apache#38776 from anchovYu/SPARK-27561-refactor. Authored-by: Xinyi Yu <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
Today, the way we resolve outer references is very inefficient. It invokes the entire analyzer to resolve the subquery plan, then transforms the plan to resolve
UnresolvedAttributeto outer references. If the plan is still unresolved, repeat the process until the plan is resolved or the plan doesn't change any more. Ideally, we should only invoke the analyzer once to resolve subquery plans.This PR adds a new rule to resolve outer references, and put it in the main analyzer batch. Then we can safely invoke the analyzer only once.
Why are the changes needed?
Simplify the subquery resolution code and make it more efficient
Does this PR introduce any user-facing change?
no
How was this patch tested?
existing tests