-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-27674][SQL] the hint should not be dropped after cache lookup #24580
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #105314 has finished for PR 24580 at commit
|
| val cachedPlan = cached.cachedRepresentation.withOutput(currentFragment.output) | ||
| // The returned hint list is in top-down order. We should reverse it so that the top hint | ||
| // is still in the top node. | ||
| hints.reverse.foldLeft[LogicalPlan](cachedPlan) { case (p, hint) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose hints can be lost further down in the tree by matching "canonicalized". Do we need to take care of that as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for hints that don't take effect in the original query, we can drop it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we have to drop these un-accessible hints. The cache lookup returns a leaf node InMemoryRelation, and we should only add back the accessible hints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to drop them, right? Hints are transparent in canonicalization. But I agree the inner hints don't matter, coz they will be replaced with a leaf node anyway.
I'm wondering though, can we change the lookupCachedData instead? like:
def lookupCachedData(plan: LogicalPlan): Option[CachedData] = plan match {
case ResolvedHint(child, hints) => lookupCachedData(child).map(p => ResolvedHint(p, hints))
case _ => cachedData.find(cd => plan.sameResult(cd.plan))
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the plan is Filter(ResolvedHint(...))? This PR is trying to fix the problem that when the hint node is not the root node.
BTW lookupCachedData needs to return a CachedData, so we can't add hint node there.
| .map(_.cachedRepresentation.withOutput(currentFragment.output)) | ||
| .getOrElse(currentFragment) | ||
| lookupCachedData(currentFragment).map { cached => | ||
| // After cache lookup, we should still keep the hints from the input plan. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the original cached plan has a hint, should we keep/respect them? We need to define a clear behavior in our cache manager.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't matter, because
- as a cache key, the lookup relies on
semanticEquals, so having the hint node in the plan has no effect. - the cache lookup returns
InMemoryRelation, which has no hint.
I think the behavior is pretty clear: for any query, the hint behavior should be the same no matter some sub-plans are cached or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, we ignore the hints that are specified in the original cached plans. If users want to use hints, they should specify them in the queries.
| val cachedPlan = cached.cachedRepresentation.withOutput(currentFragment.output) | ||
| // The returned hint list is in top-down order. We should reverse it so that the top hint | ||
| // is still in the top node. | ||
| hints.reverse.foldLeft[LogicalPlan](cachedPlan) { case (p, hint) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a test case for covering the logic of reverse?
| checkHintExists() | ||
|
|
||
| // Clean-up | ||
| df.unpersist() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use try finally?
finally {
df.unpersist()
}
| .getOrElse(currentFragment) | ||
| lookupCachedData(currentFragment).map { cached => | ||
| // After cache lookup, we should still keep the hints from the input plan. | ||
| val hints = EliminateResolvedHint.extractHintsFromPlan(currentFragment)._2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extractHintsFromPlan(currentFragment)._2 was originally a private function. Asking the caller to call reverse is weird. We can add a new function in EliminateResolvedHint or even add a new object for Hint processing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's natural to return the hints in a top-down fashion. And the caller side is free to process the returned hints, including reverse it.
|
Test build #105374 has finished for PR 24580 at commit
|
|
retest this please |
|
Test build #105379 has finished for PR 24580 at commit
|
| // The returned hint list is in top-down order, we should create the hint nodes from | ||
| // right to left. | ||
| hints.foldRight[LogicalPlan](cachedPlan) { case (hint, p) => | ||
| ResolvedHint(p, hint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the same (semantically) as original cached plan?
We can take one example in added test: broadcast(spark.range(1000)).filter($"id" > 100). Originally, the plan broadcasted is spark.range(1000). After using cached data, seems cached spark.range(1000).filter($"id" > 100) is broadcasted by the hint, actually. It is slightly difference, but maybe in significant effect it might cause?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The semantic of a hint node is special. By design only join node has hints, so Hint(Filter(Relation)) is the same as Filter(Hint(Relation)), as they both indicate that the left/right sub-tree of a join node has a hint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see. Makes sense and it's fine.
|
Test build #105383 has finished for PR 24580 at commit
|
|
LGTM Thanks! Merged to master. |
What changes were proposed in this pull request?
This is a followup of #20365 .
#20365 fixed this problem when the hint node is a root node. This PR fixes this problem for all the cases.
How was this patch tested?
a new test