-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-20229][SQL] add semanticHash to QueryPlan #17541
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @rxin @gatorsmile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll revert it once #17537 is merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it was wrong previously, sameResult should be commutative
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
|
Test build #75550 has finished for PR 17541 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we getting rid of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BroadcastMode is a field of BroadcastExchangeExec. Since we need to canonicalize a QueryPlan, the BroadcastMode also need to be canonicalized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only care about table identifier. How about setting them to Nil
dataCols = Nil,
partitionCols = Nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CatalogRelation has 3 asserts at the beginning, so we can't simply use Nil
|
retest this please |
|
Test build #75563 has finished for PR 17541 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To use ==, we need to overwrite many equals and hashCode function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile Given that both sides are canonicalized, the default case class equals and hash code methods should work, right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dilipbiswal Yes or no. If it is case class, scala compiler will help you define a default equals. If it is class, you need to define it by yourselves. For example,
class NestedObj (i: Int)
val m = new NestedObj(3)
val n = new NestedObj(3)
assert(m != n)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TreeNode requires its implementations to be Product, I think all of the LogicalPlans and SparkPlans are case class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also how Expression.semanticEquals works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we already assume it in the existing solution. I just realized it. : )
We also need to ensure all the arguments of the case class are primitive types or from the class with a defined equals.
class NestedObj (i: Int)
val m = new NestedObj(3)
val n = new NestedObj(3)
assert(m != n)
case class Obj (i: NestedObj)
val p = Obj(m)
val q = Obj(n)
assert(p != q)|
Test build #75567 has finished for PR 17541 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan this is very nice idea :-) I was running into the same issue when you asked me to try normalizing the attributes in my caching pr.
|
Test build #75577 has finished for PR 17541 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll bring it back after #17552 is merged
|
Test build #75580 has finished for PR 17541 at commit
|
|
Test build #75599 has finished for PR 17541 at commit
|
|
Test build #75620 has finished for PR 17541 at commit
|
|
cc @gatorsmile any more comments? |
| Objects.hashCode(tableMeta.identifier, output) | ||
| } | ||
|
|
||
| /** Only compare table identifier. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we should compare more, e.g. if the table schema is altered, the new table relation should not be considered as same with the old table relation, even after canonicalization. Also, it's tricky to remove the output of a plan during canonicalization as the parenting plan may rely on the output.
|
|
||
| /** | ||
| * Do some simple transformation on this plan before canonicalizing. Implementations can override | ||
| * this method to provide customer canonicalize logic without rewriting the whole logic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
customer -> customized
| case ar: AttributeReference => | ||
| val ordinal = input.indexOf(ar.exprId) | ||
| if (ordinal == -1) { | ||
| ar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to normalize exprIds in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, actually this is unexpected, the attribute should either reference to input attributes, or represent new output at top level. Keep it unchanged so that the equality check will fail later.
| Map( | ||
| "Format" -> relation.fileFormat.toString, | ||
| "ReadSchema" -> outputSchema.catalogString, | ||
| "requiredSchema" -> requiredSchema.catalogString, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also for display in SparkPlanInfo? Keep the original name ReadSchema?
| // expId can be different but the relation is still the same. | ||
| override lazy val cleanArgs: Seq[Any] = Seq(relation) | ||
| // Only care about relation when canonicalizing. | ||
| override def preCanonicalized: LogicalPlan = copy(catalogTable = None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The builders of external data sources need to implement equals and hashCode if they want to utilize our cache management.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it's the same behavior as before
| partitionPruningPred.map(normalizeExprId(_, input)))(sparkSession) | ||
| } | ||
|
|
||
| override def otherCopyArgs: Seq[AnyRef] = Seq(sparkSession) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds a bug fix.
|
LGTM |
|
Test build #75635 has finished for PR 17541 at commit
|
|
retest this please |
|
Test build #75637 has finished for PR 17541 at commit
|
|
retest this please |
|
Test build #75639 has finished for PR 17541 at commit
|
| lazy val canonicalized: PlanType = { | ||
| val canonicalizedChildren = children.map(_.canonicalized) | ||
| var id = -1 | ||
| preCanonicalized.mapExpressions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to consider non-deterministic expressions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see Expression.semanticEquals, non-deterministic expressions will never equal to other expressions.
|
retest this please |
|
Test build #75640 has finished for PR 17541 at commit
|
|
thanks for the review, merging to master! |
What changes were proposed in this pull request?
Like
Expression,QueryPlanshould also have asemanticHashmethod, then we can put plans to a hash map and look it up fast. This PR refactorsQueryPlanto followExpressionand put all the normalization logic inQueryPlan.canonicalized, so that it's very natural to implementsemanticHash.follow-up: improve
CacheManagerto leverage thissemanticHashand speed up plan lookup, instead of iterating all cached plans.How was this patch tested?
existing tests. Note that we don't need to test the
semanticHashmethod, once the existing tests provesameResultis correct, we are good.