
Commit 7e9b88b

anchovYu authored and cloud-fan committed
[SPARK-27561][SQL] Support implicit lateral column alias resolution on Project
### What changes were proposed in this pull request?

This PR implements a new feature: implicit lateral column alias for the `Project` case, temporarily controlled by `spark.sql.lateralColumnAlias.enableImplicitResolution` (default false for now; the conf will be turned on once the feature is completely merged).

#### Lateral column alias

See https://issues.apache.org/jira/browse/SPARK-27561 for more details on lateral column alias. There are two main cases to support: LCA in Project, and LCA in Aggregate.

```sql
-- LCA in Project. The base_salary references an attribute defined by a previous alias
SELECT salary AS base_salary, base_salary + bonus AS total_salary
FROM employee

-- LCA in Aggregate. The avg_salary references an attribute defined by a previous alias
SELECT dept, avg(salary) AS avg_salary, avg_salary + avg(bonus)
FROM employee
GROUP BY dept
```

This **implicit** lateral column alias (no explicit keyword, e.g. `lateral.base_salary`) should be supported.

#### High level design

This PR defines a new Resolution rule, `ResolveLateralColumnAlias`, to resolve the implicit lateral column alias, covering the `Project` case. It introduces a new leaf node NamedExpression, `LateralColumnAliasReference`, as a placeholder that holds a reference that has been temporarily resolved as a reference to a lateral column alias.

The whole process is generally divided into two phases:

1) Recognize **resolved** lateral aliases and wrap the attributes referencing them with `LateralColumnAliasReference`.
2) When the whole operator is resolved, unwrap `LateralColumnAliasReference`. For Project, it further resolves the attributes and pushes down the referenced lateral aliases to the new Project.

For example:

```
// Before
Project [age AS a, 'a + 1]
+- Child

// After phase 1
Project [age AS a, lateralalias(a) + 1]
+- Child

// After phase 2
Project [a, a + 1]
+- Project [child output, age AS a]
   +- Child
```

#### Resolution order

Given this new rule, the name resolution order will be (higher -> lower):

```
local table column > local metadata attribute > local lateral column alias > all others (outer reference of subquery, parameters of SQL UDF, ..)
```

A recent refactor moved the creation of `OuterReference` into the Resolution batch: #38851. Because lateral column alias has higher resolution priority than outer reference, the rule tries to resolve an `OuterReference` using lateral column alias, in the same way as an `UnresolvedAttribute`. If this succeeds, it strips the `OuterReference` and wraps the result with `LateralColumnAliasReference`.

### Why are the changes needed?

Lateral column alias is a popular feature that has been requested for a long time. It is supported by many other database vendors (Redshift, Snowflake, etc.) and provides a better user experience.

### Does this PR introduce _any_ user-facing change?

Yes. As shown in the example above, queries will be able to resolve lateral column aliases. I will write the migration guide or release note when most PRs of this feature are merged.

### How was this patch tested?

Existing tests and newly added tests.

Closes #38776 from anchovYu/SPARK-27561-refactor.

Authored-by: Xinyi Yu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
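The resolution order described in the commit message can be illustrated with a small standalone model. This is only a sketch, not Spark's resolver: `Scope`, `resolve`, and the result case classes are hypothetical names invented here, and names are compared case-insensitively by lower-casing, which is an assumption of the sketch.

```scala
// Toy model of the name-resolution priority described above.
// Nothing here is a Spark API; all names are hypothetical.
object ResolutionOrderSketch {
  sealed trait Resolved
  final case class TableColumn(name: String) extends Resolved
  final case class MetadataAttribute(name: String) extends Resolved
  final case class LateralAlias(name: String) extends Resolved
  final case class OuterRef(name: String) extends Resolved

  // Each set holds the (lower-cased) names visible at that priority level.
  final case class Scope(
      tableColumns: Set[String],
      metadataAttributes: Set[String],
      lateralAliases: Set[String],
      outerColumns: Set[String])

  // Try each level from highest to lowest priority; the first hit wins.
  def resolve(name: String, scope: Scope): Option[Resolved] = {
    val key = name.toLowerCase
    if (scope.tableColumns.contains(key)) Some(TableColumn(key))
    else if (scope.metadataAttributes.contains(key)) Some(MetadataAttribute(key))
    else if (scope.lateralAliases.contains(key)) Some(LateralAlias(key))
    else if (scope.outerColumns.contains(key)) Some(OuterRef(key))
    else None
  }
}
```

In this model a name that is both a table column and an earlier alias resolves to the table column, while a name that is both an earlier alias and an outer column resolves to the alias.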
1 parent a2ceff2 commit 7e9b88b

13 files changed: +686 -7 lines changed


core/src/main/resources/error/error-classes.json

Lines changed: 6 additions & 0 deletions
@@ -5,6 +5,12 @@
     ],
     "sqlState" : "42000"
   },
+  "AMBIGUOUS_LATERAL_COLUMN_ALIAS" : {
+    "message" : [
+      "Lateral column alias <name> is ambiguous and has <n> matches."
+    ],
+    "sqlState" : "42000"
+  },
   "AMBIGUOUS_REFERENCE" : {
     "message" : [
       "Reference <name> is ambiguous, could be: <referenceNames>."

sql/catalyst/src/main/scala-2.12/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala

Lines changed: 2 additions & 1 deletion
@@ -49,7 +49,8 @@ class AttributeMap[A](val baseMap: Map[ExprId, (Attribute, A)])
 
   override def contains(k: Attribute): Boolean = get(k).isDefined
 
-  override def + [B1 >: A](kv: (Attribute, B1)): Map[Attribute, B1] = baseMap.values.toMap + kv
+  override def + [B1 >: A](kv: (Attribute, B1)): AttributeMap[B1] =
+    AttributeMap(baseMap.values.toMap + kv)
 
   override def iterator: Iterator[(Attribute, A)] = baseMap.valuesIterator
 

sql/catalyst/src/main/scala-2.13/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala

Lines changed: 3 additions & 0 deletions
@@ -49,6 +49,9 @@ class AttributeMap[A](val baseMap: Map[ExprId, (Attribute, A)])
 
   override def contains(k: Attribute): Boolean = get(k).isDefined
 
+  override def + [B1 >: A](kv: (Attribute, B1)): AttributeMap[B1] =
+    AttributeMap(baseMap.values.toMap + kv)
+
   override def updated[B1 >: A](key: Attribute, value: B1): Map[Attribute, B1] =
     baseMap.values.toMap + (key -> value)
 
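One plausible motivation for making `+` return `AttributeMap[B1]` is the `aliasMap += ...` pattern used on a `var` of type `AttributeMap` in the new rule; the commit itself does not state the reason. The standalone sketch below shows the general Scala behavior with a hypothetical wrapper class `Wrapped` (not Spark's `AttributeMap`): when `+` returns the wrapper type, `m += kv` on a `var` of that type still type-checks.

```scala
// Minimal sketch of a map wrapper whose `+` preserves the wrapper type.
// `Wrapped` is a hypothetical stand-in, not Spark's AttributeMap.
object TypedPlusSketch {
  final class Wrapped[A](val base: Map[String, A]) {
    // Returning Wrapped[B] (rather than a plain Map) keeps `m += kv`
    // well-typed when `m` is declared as `var m: Wrapped[A]`.
    def +[B >: A](kv: (String, B)): Wrapped[B] = new Wrapped[B](base + kv)
    def get(k: String): Option[A] = base.get(k)
    def size: Int = base.size
  }

  // Accumulate entries one by one using `+=`, which desugars to `m = m + kv`.
  def build(entries: Seq[(String, Int)]): Wrapped[Int] = {
    var m = new Wrapped[Int](Map.empty[String, Int])
    entries.foreach { kv => m += kv }
    m
  }
}
```

If `+` returned a plain `Map[String, B]` instead, the assignment hidden inside `m += kv` would not compile, because a `Map` is not a `Wrapped`.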

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

Lines changed: 116 additions & 3 deletions
@@ -41,7 +41,7 @@ import org.apache.spark.sql.catalyst.streaming.StreamingRelationV2
 import org.apache.spark.sql.catalyst.trees.{AlwaysProcess, CurrentOrigin}
 import org.apache.spark.sql.catalyst.trees.CurrentOrigin.withOrigin
 import org.apache.spark.sql.catalyst.trees.TreePattern._
-import org.apache.spark.sql.catalyst.util.{toPrettySQL, CharVarcharUtils, StringUtils}
+import org.apache.spark.sql.catalyst.util.{toPrettySQL, CaseInsensitiveMap, CharVarcharUtils, StringUtils}
 import org.apache.spark.sql.catalyst.util.ResolveDefaultColumns._
 import org.apache.spark.sql.connector.catalog.{View => _, _}
 import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
@@ -288,6 +288,8 @@ class Analyzer(override val catalogManager: CatalogManager)
       AddMetadataColumns ::
       DeduplicateRelations ::
       ResolveReferences ::
+      WrapLateralColumnAliasReference ::
+      ResolveLateralColumnAliasReference ::
       ResolveExpressionsWithNamePlaceholders ::
       ResolveDeserializer ::
       ResolveNewInstance ::
@@ -1672,7 +1674,7 @@ class Analyzer(override val catalogManager: CatalogManager)
           // Only Project and Aggregate can host star expressions.
           case u @ (_: Project | _: Aggregate) =>
             Try(s.expand(u.children.head, resolver)) match {
-              case Success(expanded) => expanded.map(wrapOuterReference)
+              case Success(expanded) => expanded.map(wrapOuterReference(_))
               case Failure(_) => throw e
             }
           // Do not use the outer plan to resolve the star expression
@@ -1761,6 +1763,117 @@ class Analyzer(override val catalogManager: CatalogManager)
     }
   }
 
+  /**
+   * The first phase to resolve lateral column alias. See comments in
+   * [[ResolveLateralColumnAliasReference]] for more detailed explanation.
+   */
+  object WrapLateralColumnAliasReference extends Rule[LogicalPlan] {
+    import ResolveLateralColumnAliasReference.AliasEntry
+
+    private def insertIntoAliasMap(
+        a: Alias,
+        idx: Int,
+        aliasMap: CaseInsensitiveMap[Seq[AliasEntry]]): CaseInsensitiveMap[Seq[AliasEntry]] = {
+      val prevAliases = aliasMap.getOrElse(a.name, Seq.empty[AliasEntry])
+      aliasMap + (a.name -> (prevAliases :+ AliasEntry(a, idx)))
+    }
+
+    /**
+     * Use the given lateral alias to resolve the unresolved attribute with the name parts.
+     *
+     * Construct a dummy plan with the given lateral alias as project list, use the output of the
+     * plan to resolve.
+     * @return The resolved [[LateralColumnAliasReference]] if succeeds. None if fails to resolve.
+     */
+    private def resolveByLateralAlias(
+        nameParts: Seq[String], lateralAlias: Alias): Option[LateralColumnAliasReference] = {
+      val resolvedAttr = resolveExpressionByPlanOutput(
+        expr = UnresolvedAttribute(nameParts),
+        plan = LocalRelation(Seq(lateralAlias.toAttribute)),
+        throws = false
+      ).asInstanceOf[NamedExpression]
+      if (resolvedAttr.resolved) {
+        Some(LateralColumnAliasReference(resolvedAttr, nameParts, lateralAlias.toAttribute))
+      } else {
+        None
+      }
+    }
+
+    /**
+     * Recognize all the attributes in the given expression that reference lateral column aliases
+     * by looking up the alias map. Resolve these attributes and replace by wrapping with
+     * [[LateralColumnAliasReference]].
+     *
+     * @param currentPlan Because lateral alias has lower resolution priority than table columns,
+     *                    the current plan is needed to first try resolving the attribute by its
+     *                    children
+     */
+    private def wrapLCARef(
+        e: NamedExpression,
+        currentPlan: LogicalPlan,
+        aliasMap: CaseInsensitiveMap[Seq[AliasEntry]]): NamedExpression = {
+      e.transformWithPruning(_.containsAnyPattern(UNRESOLVED_ATTRIBUTE, OUTER_REFERENCE)) {
+        case u: UnresolvedAttribute if aliasMap.contains(u.nameParts.head) &&
+          resolveExpressionByPlanChildren(u, currentPlan).isInstanceOf[UnresolvedAttribute] =>
+          val aliases = aliasMap.get(u.nameParts.head).get
+          aliases.size match {
+            case n if n > 1 =>
+              throw QueryCompilationErrors.ambiguousLateralColumnAlias(u.name, n)
+            case n if n == 1 && aliases.head.alias.resolved =>
+              // Only resolved alias can be the lateral column alias
+              // The lateral alias can be a struct and have nested field, need to construct
+              // a dummy plan to resolve the expression
+              resolveByLateralAlias(u.nameParts, aliases.head.alias).getOrElse(u)
+            case _ => u
+          }
+        case o: OuterReference
+          if aliasMap.contains(
+            o.getTagValue(ResolveLateralColumnAliasReference.NAME_PARTS_FROM_UNRESOLVED_ATTR)
+              .map(_.head)
+              .getOrElse(o.name)) =>
+          // handle OuterReference exactly same as UnresolvedAttribute
+          val nameParts = o
+            .getTagValue(ResolveLateralColumnAliasReference.NAME_PARTS_FROM_UNRESOLVED_ATTR)
+            .getOrElse(Seq(o.name))
+          val aliases = aliasMap.get(nameParts.head).get
+          aliases.size match {
+            case n if n > 1 =>
+              throw QueryCompilationErrors.ambiguousLateralColumnAlias(nameParts, n)
+            case n if n == 1 && aliases.head.alias.resolved =>
+              resolveByLateralAlias(nameParts, aliases.head.alias).getOrElse(o)
+            case _ => o
+          }
+      }.asInstanceOf[NamedExpression]
+    }
+
+    override def apply(plan: LogicalPlan): LogicalPlan = {
+      if (!conf.getConf(SQLConf.LATERAL_COLUMN_ALIAS_IMPLICIT_ENABLED)) {
+        plan
+      } else {
+        plan.resolveOperatorsUpWithPruning(
+          _.containsAnyPattern(UNRESOLVED_ATTRIBUTE, OUTER_REFERENCE), ruleId) {
+          case p @ Project(projectList, _) if p.childrenResolved
+            && !ResolveReferences.containsStar(projectList)
+            && projectList.exists(_.containsAnyPattern(UNRESOLVED_ATTRIBUTE, OUTER_REFERENCE)) =>
+            var aliasMap = CaseInsensitiveMap(Map[String, Seq[AliasEntry]]())
+            val newProjectList = projectList.zipWithIndex.map {
+              case (a: Alias, idx) =>
+                val lcaWrapped = wrapLCARef(a, p, aliasMap).asInstanceOf[Alias]
+                // Insert the LCA-resolved alias instead of the unresolved one into map. If it is
+                // resolved, it can be referenced as LCA by later expressions (chaining).
+                // Unresolved Alias is also added to the map to perform ambiguous name check, but
+                // only resolved alias can be LCA.
+                aliasMap = insertIntoAliasMap(lcaWrapped, idx, aliasMap)
+                lcaWrapped
+              case (e, _) =>
+                wrapLCARef(e, p, aliasMap)
+            }
+            p.copy(projectList = newProjectList)
+        }
+      }
+    }
+  }
+
   private def containsDeserializer(exprs: Seq[Expression]): Boolean = {
     exprs.exists(_.exists(_.isInstanceOf[UnresolvedDeserializer]))
   }
@@ -2143,7 +2256,7 @@ class Analyzer(override val catalogManager: CatalogManager)
         case u @ UnresolvedAttribute(nameParts) => withPosition(u) {
           try {
             AnalysisContext.get.outerPlan.get.resolveChildren(nameParts, resolver) match {
-              case Some(resolved) => wrapOuterReference(resolved)
+              case Some(resolved) => wrapOuterReference(resolved, Some(nameParts))
               case None => u
             }
           } catch {
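The alias-map bookkeeping in `WrapLateralColumnAliasReference` (collect every earlier alias per name, then reject a name with more than one match) can be sketched independently of Catalyst. The model below is a simplification under stated assumptions: `Entry`, `insert`, and `lookup` are hypothetical names, case-insensitivity is imitated by lower-casing keys, and the error is returned as a `Left` carrying the `AMBIGUOUS_LATERAL_COLUMN_ALIAS` message text rather than thrown.

```scala
// Standalone sketch of the phase-1 alias map and its ambiguity check.
// All names here are hypothetical; this is not the Catalyst implementation.
object AliasMapSketch {
  final case class Entry(name: String, index: Int)

  // Append an alias to the list stored under its lower-cased name.
  def insert(
      aliases: Map[String, Seq[Entry]],
      name: String,
      idx: Int): Map[String, Seq[Entry]] = {
    val key = name.toLowerCase
    aliases.updated(key, aliases.getOrElse(key, Seq.empty[Entry]) :+ Entry(name, idx))
  }

  // Left(message): more than one earlier alias has this name.
  // Right(None): no earlier alias has this name.
  // Right(Some(entry)): exactly one earlier alias matches.
  def lookup(
      aliases: Map[String, Seq[Entry]],
      name: String): Either[String, Option[Entry]] =
    aliases.getOrElse(name.toLowerCase, Seq.empty[Entry]) match {
      case Seq() => Right(None)
      case Seq(single) => Right(Some(single))
      case many =>
        Left(s"Lateral column alias $name is ambiguous and has ${many.size} matches.")
    }
}
```

Because the map is extended only after each select-list item is processed, a lookup made while processing item `i` sees only aliases defined at positions before `i`, which is what makes the reference "lateral".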

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

Lines changed: 24 additions & 1 deletion
@@ -27,7 +27,7 @@ import org.apache.spark.sql.catalyst.optimizer.{BooleanSimplification, Decorrela
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.trees.TreeNodeTag
-import org.apache.spark.sql.catalyst.trees.TreePattern.UNRESOLVED_WINDOW_EXPRESSION
+import org.apache.spark.sql.catalyst.trees.TreePattern.{LATERAL_COLUMN_ALIAS_REFERENCE, UNRESOLVED_WINDOW_EXPRESSION}
 import org.apache.spark.sql.catalyst.util.{CharVarcharUtils, StringUtils, TypeUtils}
 import org.apache.spark.sql.connector.catalog.{LookupCatalog, SupportsPartitionManagement}
 import org.apache.spark.sql.errors.{QueryCompilationErrors, QueryErrorsBase}
@@ -638,6 +638,16 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
               case UnresolvedWindowExpression(_, windowSpec) =>
                 throw QueryCompilationErrors.windowSpecificationNotDefinedError(windowSpec.name)
             })
+            // This should not happen, resolved Project or Aggregate should restore or resolve
+            // all lateral column alias references. Add check for extra safe.
+            projectList.foreach(_.transformDownWithPruning(
+              _.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) {
+              case lcaRef: LateralColumnAliasReference if p.resolved =>
+                throw SparkException.internalError("Resolved Project should not contain " +
+                  s"any LateralColumnAliasReference.\nDebugging information: plan: $p",
+                  context = lcaRef.origin.getQueryContext,
+                  summary = lcaRef.origin.context.summary)
+            })
 
           case j: Join if !j.duplicateResolved =>
             val conflictingAttributes = j.left.outputSet.intersect(j.right.outputSet)
@@ -714,6 +724,19 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
                 "operator" -> other.nodeName,
                 "invalidExprSqls" -> invalidExprSqls.mkString(", ")))
 
+          // This should not happen, resolved Project or Aggregate should restore or resolve
+          // all lateral column alias references. Add check for extra safe.
+          case agg @ Aggregate(_, aggList, _)
+            if aggList.exists(_.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) && agg.resolved =>
+            aggList.foreach(_.transformDownWithPruning(
+              _.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) {
+              case lcaRef: LateralColumnAliasReference =>
+                throw SparkException.internalError("Resolved Aggregate should not contain " +
+                  s"any LateralColumnAliasReference.\nDebugging information: plan: $agg",
+                  context = lcaRef.origin.getQueryContext,
+                  summary = lcaRef.origin.context.summary)
+            })
+
           case _ => // Analysis successful!
         }
     }
Lines changed: 135 additions & 0 deletions
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeMap, LateralColumnAliasReference, NamedExpression}
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreeNodeTag
+import org.apache.spark.sql.catalyst.trees.TreePattern.LATERAL_COLUMN_ALIAS_REFERENCE
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * This rule is the second phase to resolve lateral column alias.
+ *
+ * Resolve lateral column alias, which references the alias defined previously in the SELECT list.
+ * Plan-wise, it handles two types of operators: Project and Aggregate.
+ * - in Project, pushing down the referenced lateral alias into a newly created Project, resolve
+ *   the attributes referencing these aliases
+ * - in Aggregate TODO.
+ *
+ * The whole process is generally divided into two phases:
+ * 1) recognize resolved lateral alias, wrap the attributes referencing them with
+ *    [[LateralColumnAliasReference]]
+ * 2) when the whole operator is resolved, unwrap [[LateralColumnAliasReference]].
+ *    For Project, it further resolves the attributes and push down the referenced lateral aliases.
+ *    For Aggregate, TODO
+ *
+ * Example for Project:
+ * Before rewrite:
+ * Project [age AS a, 'a + 1]
+ * +- Child
+ *
+ * After phase 1:
+ * Project [age AS a, lateralalias(a) + 1]
+ * +- Child
+ *
+ * After phase 2:
+ * Project [a, a + 1]
+ * +- Project [child output, age AS a]
+ *    +- Child
+ *
+ * Example for Aggregate TODO
+ *
+ *
+ * The name resolution priority:
+ * local table column > local lateral column alias > outer reference
+ *
+ * Because lateral column alias has higher resolution priority than outer reference, it will try
+ * to resolve an [[OuterReference]] using lateral column alias in phase 1, similar as an
+ * [[UnresolvedAttribute]]. If success, it strips [[OuterReference]] and also wraps it with
+ * [[LateralColumnAliasReference]].
+ */
+object ResolveLateralColumnAliasReference extends Rule[LogicalPlan] {
+  case class AliasEntry(alias: Alias, index: Int)
+
+  /**
+   * A tag to store the nameParts from the original unresolved attribute.
+   * It is set for [[OuterReference]], used in the current rule to convert [[OuterReference]] back
+   * to [[LateralColumnAliasReference]].
+   */
+  val NAME_PARTS_FROM_UNRESOLVED_ATTR = TreeNodeTag[Seq[String]]("name_parts_from_unresolved_attr")
+
+  override def apply(plan: LogicalPlan): LogicalPlan = {
+    if (!conf.getConf(SQLConf.LATERAL_COLUMN_ALIAS_IMPLICIT_ENABLED)) {
+      plan
+    } else {
+      // phase 2: unwrap
+      plan.resolveOperatorsUpWithPruning(
+        _.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE), ruleId) {
+        case p @ Project(projectList, child) if p.resolved
+          && projectList.exists(_.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) =>
+          var aliasMap = AttributeMap.empty[AliasEntry]
+          val referencedAliases = collection.mutable.Set.empty[AliasEntry]
+          def unwrapLCAReference(e: NamedExpression): NamedExpression = {
+            e.transformWithPruning(_.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) {
+              case lcaRef: LateralColumnAliasReference if aliasMap.contains(lcaRef.a) =>
+                val aliasEntry = aliasMap.get(lcaRef.a).get
+                // If there is no chaining of lateral column alias reference, push down the alias
+                // and unwrap the LateralColumnAliasReference to the NamedExpression inside
+                // If there is chaining, don't resolve and save to future rounds
+                if (!aliasEntry.alias.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) {
+                  referencedAliases += aliasEntry
+                  lcaRef.ne
+                } else {
+                  lcaRef
+                }
+              case lcaRef: LateralColumnAliasReference if !aliasMap.contains(lcaRef.a) =>
+                // It shouldn't happen, but restore to unresolved attribute to be safe.
+                UnresolvedAttribute(lcaRef.nameParts)
+            }.asInstanceOf[NamedExpression]
+          }
+          val newProjectList = projectList.zipWithIndex.map {
+            case (a: Alias, idx) =>
+              val lcaResolved = unwrapLCAReference(a)
+              // Insert the original alias instead of rewritten one to detect chained LCA
+              aliasMap += (a.toAttribute -> AliasEntry(a, idx))
+              lcaResolved
+            case (e, _) =>
+              unwrapLCAReference(e)
+          }
+
+          if (referencedAliases.isEmpty) {
+            p
+          } else {
+            val outerProjectList = collection.mutable.Seq(newProjectList: _*)
+            val innerProjectList =
+              collection.mutable.ArrayBuffer(child.output.map(_.asInstanceOf[NamedExpression]): _*)
+            referencedAliases.foreach { case AliasEntry(alias: Alias, idx) =>
+              outerProjectList.update(idx, alias.toAttribute)
+              innerProjectList += alias
+            }
+            p.copy(
+              projectList = outerProjectList.toSeq,
+              child = Project(innerProjectList.toSeq, child)
+            )
+          }
+      }
+    }
+  }
+}
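The phase-2 rewrite (unwrap references to non-chained aliases, then push those aliases into a new inner Project) can be modeled on a toy expression tree. This is a hedged sketch, not the Catalyst rule: `Expr`, `Named`, `LcaRef`, and `splitOnce` are hypothetical names, every select-list item is assumed to be named, and a reference to an unknown alias is simply left in place, whereas the rule above restores it to an unresolved attribute.

```scala
// Toy model of one round of the phase-2 rewrite. All names are hypothetical.
object ProjectSplitSketch {
  sealed trait Expr
  final case class Col(name: String) extends Expr      // plain column reference
  final case class LcaRef(alias: String) extends Expr  // wrapped lateral alias reference
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class Named(name: String, expr: Expr)     // models `expr AS name`

  private def containsLca(e: Expr): Boolean = e match {
    case LcaRef(_) => true
    case Add(l, r) => containsLca(l) || containsLca(r)
    case _ => false
  }

  /** Returns Some((outerList, innerList)) if any alias was pushed down, else None. */
  def splitOnce(
      projectList: Seq[Named],
      childOutput: Seq[String]): Option[(Seq[Named], Seq[Named])] = {
    val seen = scala.collection.mutable.Map.empty[String, (Named, Int)]
    val referenced = scala.collection.mutable.LinkedHashSet.empty[(Named, Int)]

    def unwrap(e: Expr): Expr = e match {
      case ref @ LcaRef(a) =>
        seen.get(a) match {
          case Some(entry) if !containsLca(entry._1.expr) =>
            referenced += entry
            Col(a)
          case _ => ref // chained (or unknown) alias: leave it for a later round
        }
      case Add(l, r) => Add(unwrap(l), unwrap(r))
      case other => other
    }

    val rewritten = projectList.zipWithIndex.map { case (n, idx) =>
      val out = n.copy(expr = unwrap(n.expr))
      // Record the original alias, not the rewritten one, so chaining is detected.
      seen.update(n.name, (n, idx))
      out
    }

    if (referenced.isEmpty) None
    else {
      // Outer list: a pushed-down alias becomes a plain reference to its name.
      val outer = referenced.foldLeft(rewritten) { case (acc, (n, idx)) =>
        acc.updated(idx, Named(n.name, Col(n.name)))
      }
      // Inner list: the child's output plus the pushed-down alias definitions.
      val inner = childOutput.map(c => Named(c, Col(c))) ++ referenced.map(_._1)
      Some((outer, inner))
    }
  }
}
```

For the select list `age AS a, a + 1 AS b` over a child that outputs `age`, one round yields the outer list `[a, a + 1 AS b]` and the inner list `[age, age AS a]`, mirroring the "After phase 2" plan in the Scaladoc. With a third item `b + 1 AS c`, the reference to `b` stays wrapped in the first round because `b` itself still depends on a lateral alias.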
