[SPARK-20229][SQL] add semanticHash to QueryPlan #17541
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -359,9 +359,59 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanT | |
| override protected def innerChildren: Seq[QueryPlan[_]] = subqueries | ||
|
|
||
| /** | ||
| * Canonicalized copy of this query plan. | ||
| * Returns a plan where a best effort attempt has been made to transform `this` in a way | ||
| * that preserves the result but removes cosmetic variations (case sensitivity, ordering for | ||
| * commutative operations, expression id, etc.) | ||
| * | ||
| * Plans where `this.canonicalized == other.canonicalized` will always evaluate to the same | ||
| * result. | ||
| * | ||
| * Some nodes should override this to provide proper canonicalization logic. | ||
| */ | ||
| lazy val canonicalized: PlanType = { | ||
| val canonicalizedChildren = children.map(_.canonicalized) | ||
| var id = -1 | ||
| preCanonicalized.mapExpressions { | ||
|
Member
Do we need to consider non-deterministic expressions?
Contributor
Author
see
||
| case a: Alias => | ||
| id += 1 | ||
| // As the root of the expression, Alias will always take an arbitrary exprId, we need to | ||
| // normalize that for equality testing, by assigning expr id from 0 incrementally. The | ||
| // alias name doesn't matter and should be erased. | ||
| Alias(normalizeExprId(a.child), "")(ExprId(id), a.qualifier, isGenerated = a.isGenerated) | ||
|
|
||
| case ar: AttributeReference if allAttributes.indexOf(ar.exprId) == -1 => | ||
| // Top level `AttributeReference` may also be used for output like `Alias`, we should | ||
| // normalize the exprId too. | ||
| id += 1 | ||
| ar.withExprId(ExprId(id)) | ||
|
|
||
| case other => normalizeExprId(other) | ||
| }.withNewChildren(canonicalizedChildren) | ||
| } | ||
|
|
||
| /** | ||
| * Do some simple transformation on this plan before canonicalizing. Implementations can override | ||
| * this method to provide customized canonicalize logic without rewriting the whole logic. | ||
| */ | ||
| protected lazy val canonicalized: PlanType = this | ||
| protected def preCanonicalized: PlanType = this | ||
|
|
||
| /** | ||
| * Normalize the exprIds in the given expression, by updating the exprId in `AttributeReference` | ||
| * with its referenced ordinal from input attributes. It's similar to `BindReferences` but we | ||
| * do not use `BindReferences` here as the plan may take the expression as a parameter with type | ||
| * `Attribute`, and replacing it with `BoundReference` would cause an error. | ||
| */ | ||
| protected def normalizeExprId[T <: Expression](e: T, input: AttributeSeq = allAttributes): T = { | ||
| e.transformUp { | ||
| case ar: AttributeReference => | ||
| val ordinal = input.indexOf(ar.exprId) | ||
| if (ordinal == -1) { | ||
| ar | ||
|
Member
No need to normalize exprIds in this case?
Contributor
Author
No, this case is actually unexpected: the attribute should either reference an input attribute or represent new output at the top level. It is kept unchanged so that the equality check will fail later.
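To illustrate the exchange above, here is a minimal sketch of ordinal-based exprId normalization. It assumes hypothetical toy classes (`Expr`, `Attr`, `Add`) rather than Catalyst's real `Expression` hierarchy, and it only mirrors the idea in the diff: a matched attribute takes its position in the input as its id, while an unmatched one is left alone so that a later comparison fails.

```scala
object NormalizeExprIdSketch {
  // Hypothetical stand-ins for Catalyst expressions; not Spark's real API.
  sealed trait Expr
  final case class Attr(name: String, exprId: Long) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Replace each attribute's exprId with its ordinal among the input attributes.
  // An attribute that is not found in the input is returned unchanged.
  def normalizeExprId(e: Expr, input: Seq[Attr]): Expr = e match {
    case a: Attr =>
      val ordinal = input.indexWhere(_.exprId == a.exprId)
      if (ordinal == -1) a else a.copy(exprId = ordinal.toLong)
    case Add(l, r) =>
      Add(normalizeExprId(l, input), normalizeExprId(r, input))
  }

  // The same computation, x + y, over inputs that carry different exprIds.
  val inputA: Seq[Attr] = Seq(Attr("x", 10L), Attr("y", 11L))
  val inputB: Seq[Attr] = Seq(Attr("x", 57L), Attr("y", 58L))
  val exprA: Expr = Add(Attr("x", 10L), Attr("y", 11L))
  val exprB: Expr = Add(Attr("x", 57L), Attr("y", 58L))

  // A dangling reference: exprId 99 does not appear in inputB.
  val dangling: Expr = Add(Attr("x", 57L), Attr("y", 99L))
}
```

In this sketch `exprA` and `exprB` normalize to the same tree, whereas `dangling` keeps its original id 99 and therefore compares unequal, which is the failure mode the author describes as intended.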
||
| } else { | ||
| ar.withExprId(ExprId(ordinal)) | ||
| } | ||
| }.canonicalized.asInstanceOf[T] | ||
| } | ||
|
|
||
| /** | ||
| * Returns true when the given query plan will return the same results as this query plan. | ||
|
|
@@ -372,49 +422,19 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanT | |
| * enhancements like caching. However, it is not acceptable to return true if the results could | ||
| * possibly be different. | ||
| * | ||
| * By default this function performs a modified version of equality that is tolerant of cosmetic | ||
| * differences like attribute naming and/or expression id differences. Operators that | ||
| * can do better should override this function. | ||
| * This function performs a modified version of equality that is tolerant of cosmetic | ||
| * differences like attribute naming and/or expression id differences. | ||
| */ | ||
| def sameResult(plan: PlanType): Boolean = { | ||
| val left = this.canonicalized | ||
| val right = plan.canonicalized | ||
| left.getClass == right.getClass && | ||
| left.children.size == right.children.size && | ||
| left.cleanArgs == right.cleanArgs && | ||
| (left.children, right.children).zipped.forall(_ sameResult _) | ||
| } | ||
| final def sameResult(other: PlanType): Boolean = this.canonicalized == other.canonicalized | ||
|
||
|
|
||
| /** | ||
| * Returns a `hashCode` for the calculation performed by this plan. Unlike the standard | ||
| * `hashCode`, an attempt has been made to eliminate cosmetic differences. | ||
| */ | ||
| final def semanticHash(): Int = canonicalized.hashCode() | ||
|
|
||
| /** | ||
| * All the attributes that are used for this plan. | ||
| */ | ||
| lazy val allAttributes: AttributeSeq = children.flatMap(_.output) | ||
|
|
||
| protected def cleanExpression(e: Expression): Expression = e match { | ||
| case a: Alias => | ||
| // As the root of the expression, Alias will always take an arbitrary exprId, we need | ||
| // to erase that for equality testing. | ||
| val cleanedExprId = | ||
| Alias(a.child, a.name)(ExprId(-1), a.qualifier, isGenerated = a.isGenerated) | ||
| BindReferences.bindReference(cleanedExprId, allAttributes, allowFailures = true) | ||
| case other => | ||
| BindReferences.bindReference(other, allAttributes, allowFailures = true) | ||
| } | ||
|
|
||
| /** Args that have been cleaned such that differences in expression id should not affect equality */ | ||
| protected lazy val cleanArgs: Seq[Any] = { | ||
| def cleanArg(arg: Any): Any = arg match { | ||
| // Children are checked using sameResult above. | ||
| case tn: TreeNode[_] if containsChild(tn) => null | ||
| case e: Expression => cleanExpression(e).canonicalized | ||
| case other => other | ||
| } | ||
|
|
||
| mapProductIterator { | ||
| case s: Option[_] => s.map(cleanArg) | ||
| case s: Seq[_] => s.map(cleanArg) | ||
| case m: Map[_, _] => m.mapValues(cleanArg) | ||
| case other => cleanArg(other) | ||
| }.toSeq | ||
| } | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -26,10 +26,7 @@ import org.apache.spark.sql.catalyst.InternalRow | |
| trait BroadcastMode { | ||
| def transform(rows: Array[InternalRow]): Any | ||
|
|
||
| /** | ||
| * Returns true iff this [[BroadcastMode]] generates the same result as `other`. | ||
| */ | ||
| def compatibleWith(other: BroadcastMode): Boolean | ||
|
||
| def canonicalized: BroadcastMode | ||
| } | ||
|
|
||
| /** | ||
|
|
@@ -39,7 +36,5 @@ case object IdentityBroadcastMode extends BroadcastMode { | |
| // TODO: pack the UnsafeRows into single bytes array. | ||
| override def transform(rows: Array[InternalRow]): Array[InternalRow] = rows | ||
|
|
||
| override def compatibleWith(other: BroadcastMode): Boolean = { | ||
| this eq other | ||
| } | ||
| override def canonicalized: BroadcastMode = this | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -43,17 +43,8 @@ case class LogicalRelation( | |
| com.google.common.base.Objects.hashCode(relation, output) | ||
| } | ||
|
|
||
| override def sameResult(otherPlan: LogicalPlan): Boolean = { | ||
| otherPlan.canonicalized match { | ||
| case LogicalRelation(otherRelation, _, _) => relation == otherRelation | ||
| case _ => false | ||
| } | ||
| } | ||
|
|
||
| // When comparing two LogicalRelations from within LogicalPlan.sameResult, we only need | ||
| // LogicalRelation.cleanArgs to return Seq(relation), since expectedOutputAttribute's | ||
| // expId can be different but the relation is still the same. | ||
| override lazy val cleanArgs: Seq[Any] = Seq(relation) | ||
| // Only care about relation when canonicalizing. | ||
| override def preCanonicalized: LogicalPlan = copy(catalogTable = None) | ||
|
Member
The builders of external data sources need to implement
Contributor
Author
Yes, it's the same behavior as before.
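The `preCanonicalized` override in this hunk can be pictured with a small sketch. The `Plan` trait and `Relation` case class below are hypothetical simplifications, not Spark's `QueryPlan` or `LogicalRelation`; the point is only that a node strips a cosmetic field in the hook while the shared comparison logic stays in one place.

```scala
object PreCanonicalizedSketch {
  // Hypothetical simplified plan node; the real QueryPlan is far richer.
  trait Plan {
    // Hook: subclasses strip cosmetic fields here.
    protected def preCanonicalized: Plan = this

    // Shared logic, computed once per node.
    lazy val canonicalized: Plan = preCanonicalized

    // Comparison is defined purely in terms of the canonical form.
    final def sameResult(other: Plan): Boolean =
      canonicalized == other.canonicalized
  }

  // Toy relation: `catalogTable` is treated as metadata that does not affect results.
  final case class Relation(
      source: String,
      output: Seq[String],
      catalogTable: Option[String]) extends Plan {
    // Only `source` and `output` survive canonicalization.
    override protected def preCanonicalized: Plan = copy(catalogTable = None)
  }

  val withCatalog: Relation = Relation("events", Seq("id", "name"), Some("db.events"))
  val withoutCatalog: Relation = Relation("events", Seq("id", "name"), None)
  val narrower: Relation = Relation("events", Seq("id"), None)
}
```

Here `withCatalog` and `withoutCatalog` differ under plain `==` but agree under `sameResult`, while `narrower` still differs because its `output` is part of the canonical form in this toy model.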
||
|
|
||
| @transient override def computeStats(conf: SQLConf): Statistics = { | ||
| catalogTable.flatMap(_.stats.map(_.toPlanStats(output))).getOrElse( | ||
|
|
||
Actually, we should compare more. For example, if the table schema is altered, the new table relation should not be considered the same as the old table relation, even after canonicalization. Also, it's tricky to remove the output of a plan during canonicalization, as the parent plan may rely on that output.
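For readers following the diff, the relationship between `canonicalized`, `sameResult`, and `semanticHash` can be sketched as follows. The `Project` class and its choice of what counts as cosmetic (column-name case and the alias) are assumptions made for illustration; Spark's actual canonicalization rules are those shown in the diff above.

```scala
object SemanticHashSketch {
  // Hypothetical toy plan: a projection of named columns; `alias` is cosmetic.
  final case class Project(columns: Seq[String], alias: String) {
    // Canonical form: lower-case the column names and erase the alias.
    lazy val canonicalized: Project = Project(columns.map(_.toLowerCase), "")

    // Both methods are defined only in terms of the canonical form,
    // so plans with the same result always share a hash.
    def sameResult(other: Project): Boolean = canonicalized == other.canonicalized
    def semanticHash(): Int = canonicalized.hashCode()
  }

  val p1: Project = Project(Seq("ID", "Name"), "a")
  val p2: Project = Project(Seq("id", "name"), "b")

  // A cache keyed by the canonical plan can be hit by a cosmetically different plan.
  val cache: Map[Project, String] = Map(p1.canonicalized -> "cached result")
  val hit: Option[String] = cache.get(p2.canonicalized)
}
```

Because both methods delegate to the same canonical value, `sameResult` returning true implies equal `semanticHash` values, which is what makes the canonical plan usable as a hash-map key.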