[SPARK-41338][SQL] Resolve outer references and normal columns in the same analyzer batch #38851

cloud-fan · 2022-11-30T15:47:42Z

What changes were proposed in this pull request?

Today, the way we resolve outer references is very inefficient. It invokes the entire analyzer to resolve the subquery plan, then transforms the plan to resolve UnresolvedAttribute to outer references. If the plan is still unresolved, repeat the process until the plan is resolved or the plan doesn't change any more. Ideally, we should only invoke the analyzer once to resolve subquery plans.

This PR adds a new rule to resolve outer references, and put it in the main analyzer batch. Then we can safely invoke the analyzer only once.

Why are the changes needed?

Simplify the subquery resolution code and make it more efficient

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

cloud-fan · 2022-11-30T15:47:56Z

cc @viirya @allisonwang-db

anchovYu · 2022-11-30T17:43:59Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

Is applying these rules once enough? e.g. is it possible that the application of ResolveMissingReferences makes more attributes resolvable by ResolveReferences and ResolveMissingReferences can be applied another time, so forth?

as long as an UnresolvedAttribute can't be resolved by these rules, we can resolve it to outer references.

Also these rules won't generate any new UnresolvedAttribute correct?

viirya · 2022-12-01T03:36:58Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+
+      // We must run these 3 rules first, as they also resolve `UnresolvedAttribute` and have
+      // higher priority than outer reference resolution.
+      val prepared = ResolveAggregateFunctions(ResolveMissingReferences(ResolveReferences(plan)))


Aren't ResolveMissingReferences and ResolveReferences already run before ResolveOuterReferences? Will we just run this rule separately?

The key is to run ResolveOuterReferences right after these 3 rules, so that it's safe to resolve UnresolvedAttribute to outer references. Otherwise, other rules may change the plan shape and make these 3 rules applicable again.

I guess one disadvantage of running these three rules inside ResolveOuterReferences is that they are not visible in the plan change log.

Also it's a bit strange when there is no outer reference to resolve but the 3 rules take effect, the plan change logger of this ResolveOuterReferences actually shows the changes from these 3 rules.

=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOuterReferences === 'Project [id2#28L] 'Project [id2#28L] +- 'SubqueryAlias __auto_generated_subquery_name +- 'SubqueryAlias __auto_generated_subquery_name ! +- 'Project [id#27, ('id + cast(1 as bigint)) AS id2#28] +- 'Project [id#27, (id#27 + cast(1 as bigint)) AS id2#28] ..

Yes, this is not a perfect solution, but AFAIK this is the only reliable way to guarantee rule execution order. The best solution in my opinion is to centralize all column resolution code in one rule, but that's a much larger change.

cloud-fan · 2022-12-01T06:20:00Z

thanks for review, merging to master!

allisonwang-db · 2022-12-01T17:36:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+
+      // We must run these 3 rules first, as they also resolve `UnresolvedAttribute` and have
+      // higher priority than outer reference resolution.
+      val prepared = ResolveAggregateFunctions(ResolveMissingReferences(ResolveReferences(plan)))


I guess one disadvantage of running these three rules inside ResolveOuterReferences is that they are not visible in the plan change log.

allisonwang-db · 2022-12-01T17:43:45Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+        // `UnresolvedAttribute` but we should never resolve it to outer references. It's a bit
+        // hacky that `Generate` uses `UnresolvedAttribute` to store the generator column names,
+        // we should clean it up later.
+        case g: Generate if g.childrenResolved && !g.resolved =>


Do we have a unit test for this case?

existing tests failed without this change.

…n Project ### What changes were proposed in this pull request? This PR implements a new feature: Implicit lateral column alias on `Project` case, controlled by `spark.sql.lateralColumnAlias.enableImplicitResolution` temporarily (default false now, but will turn on this conf once the feature is completely merged). #### Lateral column alias View https://issues.apache.org/jira/browse/SPARK-27561 for more details on lateral column alias. There are two main cases to support: LCA in Project, and LCA in Aggregate. ```sql -- LCA in Project. The base_salary references an attribute defined by a previous alias SELECT salary AS base_salary, base_salary + bonus AS total_salary FROM employee -- LCA in Aggregate. The avg_salary references an attribute defined by a previous alias SELECT dept, average(salary) AS avg_salary, avg_salary + average(bonus) FROM employee GROUP BY dept ``` This **implicit** lateral column alias (no explicit keyword, e.g. `lateral.base_salary`) should be supported. #### High level design This PR defines a new Resolution rule, `ResolveLateralColumnAlias` to resolve the implicit lateral column alias, covering the `Project` case. It introduces a new leaf node NamedExpression, `LateralColumnAliasReference`, as a placeholder used to hold a referenced that has been temporarily resolved as the reference to a lateral column alias. The whole process is generally divided into two phases: 1) recognize **resolved** lateral alias, wrap the attributes referencing them with `LateralColumnAliasReference`. 2) when the whole operator is resolved, unwrap `LateralColumnAliasReference`. For Project, it further resolves the attributes and push down the referenced lateral aliases to the new Project. For example: ``` // Before Project [age AS a, 'a + 1] +- Child // After phase 1 Project [age AS a, lateralalias(a) + 1] +- Child // After phase 2 Project [a, a + 1] +- Project [child output, age AS a] +- Child ``` #### Resolution order Given this new rule, the name resolution order will be (higher -> lower): ``` local table column > local metadata attribute > local lateral column alias > all others (outer reference of subquery, parameters of SQL UDF, ..) ``` There is a recent refactor that moves the creation of `OuterReference` in the Resolution batch: #38851. Because lateral column alias has higher resolution priority than outer reference, it will try to resolve an `OuterReference` using lateral column alias, similar as an `UnresolvedAttribute`. If success, it strips `OuterReference` and also wraps it with `LateralColumnAliasReference`. ### Why are the changes needed? The lateral column alias is a popular feature wanted for a long time. It is supported by lots of other database vendors (Redshift, snowflake, etc) and provides a better user experience. ### Does this PR introduce _any_ user-facing change? Yes, as shown in the above example, it will be able to resolve lateral column alias. I will write the migration guide or release note when most PRs of this feature are merged. ### How was this patch tested? Existing tests and newly added tests. Closes #38776 from anchovYu/SPARK-27561-refactor. Authored-by: Xinyi Yu <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>

… same analyzer batch ### What changes were proposed in this pull request? Today, the way we resolve outer references is very inefficient. It invokes the entire analyzer to resolve the subquery plan, then transforms the plan to resolve `UnresolvedAttribute` to outer references. If the plan is still unresolved, repeat the process until the plan is resolved or the plan doesn't change any more. Ideally, we should only invoke the analyzer once to resolve subquery plans. This PR adds a new rule to resolve outer references, and put it in the main analyzer batch. Then we can safely invoke the analyzer only once. ### Why are the changes needed? Simplify the subquery resolution code and make it more efficient ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes apache#38851 from cloud-fan/outer. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>

…n Project ### What changes were proposed in this pull request? This PR implements a new feature: Implicit lateral column alias on `Project` case, controlled by `spark.sql.lateralColumnAlias.enableImplicitResolution` temporarily (default false now, but will turn on this conf once the feature is completely merged). #### Lateral column alias View https://issues.apache.org/jira/browse/SPARK-27561 for more details on lateral column alias. There are two main cases to support: LCA in Project, and LCA in Aggregate. ```sql -- LCA in Project. The base_salary references an attribute defined by a previous alias SELECT salary AS base_salary, base_salary + bonus AS total_salary FROM employee -- LCA in Aggregate. The avg_salary references an attribute defined by a previous alias SELECT dept, average(salary) AS avg_salary, avg_salary + average(bonus) FROM employee GROUP BY dept ``` This **implicit** lateral column alias (no explicit keyword, e.g. `lateral.base_salary`) should be supported. #### High level design This PR defines a new Resolution rule, `ResolveLateralColumnAlias` to resolve the implicit lateral column alias, covering the `Project` case. It introduces a new leaf node NamedExpression, `LateralColumnAliasReference`, as a placeholder used to hold a referenced that has been temporarily resolved as the reference to a lateral column alias. The whole process is generally divided into two phases: 1) recognize **resolved** lateral alias, wrap the attributes referencing them with `LateralColumnAliasReference`. 2) when the whole operator is resolved, unwrap `LateralColumnAliasReference`. For Project, it further resolves the attributes and push down the referenced lateral aliases to the new Project. For example: ``` // Before Project [age AS a, 'a + 1] +- Child // After phase 1 Project [age AS a, lateralalias(a) + 1] +- Child // After phase 2 Project [a, a + 1] +- Project [child output, age AS a] +- Child ``` #### Resolution order Given this new rule, the name resolution order will be (higher -> lower): ``` local table column > local metadata attribute > local lateral column alias > all others (outer reference of subquery, parameters of SQL UDF, ..) ``` There is a recent refactor that moves the creation of `OuterReference` in the Resolution batch: apache#38851. Because lateral column alias has higher resolution priority than outer reference, it will try to resolve an `OuterReference` using lateral column alias, similar as an `UnresolvedAttribute`. If success, it strips `OuterReference` and also wraps it with `LateralColumnAliasReference`. ### Why are the changes needed? The lateral column alias is a popular feature wanted for a long time. It is supported by lots of other database vendors (Redshift, snowflake, etc) and provides a better user experience. ### Does this PR introduce _any_ user-facing change? Yes, as shown in the above example, it will be able to resolve lateral column alias. I will write the migration guide or release note when most PRs of this feature are merged. ### How was this patch tested? Existing tests and newly added tests. Closes apache#38776 from anchovYu/SPARK-27561-refactor. Authored-by: Xinyi Yu <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>

github-actions bot added the SQL label Nov 30, 2022

anchovYu reviewed Nov 30, 2022

View reviewed changes

resolve outer references and normal columns in the same analyzer batch

9c0fc55

cloud-fan force-pushed the outer branch from 35fc282 to 9c0fc55 Compare December 1, 2022 03:04

viirya reviewed Dec 1, 2022

View reviewed changes

viirya approved these changes Dec 1, 2022

View reviewed changes

cloud-fan closed this in 8d60736 Dec 1, 2022

allisonwang-db reviewed Dec 1, 2022

View reviewed changes

anchovYu mentioned this pull request Dec 2, 2022

[SPARK-41630][SQL] Support implicit lateral column alias resolution on Project #38776

Closed

[SPARK-41338][SQL] Resolve outer references and normal columns in the same analyzer batch #38851

[SPARK-41338][SQL] Resolve outer references and normal columns in the same analyzer batch #38851

Uh oh!

Conversation

cloud-fan commented Nov 30, 2022

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

cloud-fan commented Nov 30, 2022

Uh oh!

anchovYu Nov 30, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

viirya Dec 1, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cloud-fan commented Dec 1, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

anchovYu Nov 30, 2022 •

edited

Loading

viirya Dec 1, 2022 •

edited

Loading