@cloud-fan
Contributor

What changes were proposed in this pull request?

This is a follow-up to #20365.

#20365 fixed the problem of hints being lost after a cache lookup, but only when the hint node is the root node. This PR fixes the problem for all cases, including when the hint node is not the root node.

How was this patch tested?

A new test.

@cloud-fan
Contributor Author

cc @gatorsmile @maryannxue

@SparkQA

SparkQA commented May 10, 2019

Test build #105314 has finished for PR 24580 at commit 8934bf4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val cachedPlan = cached.cachedRepresentation.withOutput(currentFragment.output)
// The returned hint list is in top-down order. We should reverse it so that the top hint
// is still in the top node.
hints.reverse.foldLeft[LogicalPlan](cachedPlan) { case (p, hint) =>
Contributor

I suppose hints can be lost further down in the tree by matching "canonicalized". Do we need to take care of that as well?

Contributor Author

For hints that don't take effect in the original query, we can drop them.

Contributor Author

Actually, we have to drop these inaccessible hints. The cache lookup returns a leaf node, InMemoryRelation, and we should only add back the accessible hints.

Contributor

We don't need to drop them, right? Hints are transparent in canonicalization. But I agree the inner hints don't matter, because they will be replaced with a leaf node anyway.

I'm wondering, though: can we change lookupCachedData instead? Something like:

def lookupCachedData(plan: LogicalPlan): Option[CachedData] = plan match {
  case ResolvedHint(child, hints) => lookupCachedData(child).map(p => ResolvedHint(p, hints))
  case _ => cachedData.find(cd => plan.sameResult(cd.plan))
}

Contributor Author

What if the plan is Filter(ResolvedHint(...))? This PR fixes the case where the hint node is not the root node.

BTW, lookupCachedData needs to return a CachedData, so we can't add a hint node there.
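
To make the Filter(ResolvedHint(...)) case concrete, here is a minimal sketch on a toy plan model. Relation, Cached, Filter, and Hint are hypothetical stand-ins, not Catalyst's real classes, and rootOnly and general are illustrative names rather than functions in this PR. The sketch only shows that matching hints at the root loses a hint that sits under a Filter, while collecting hints first and re-wrapping the cached leaf keeps it.

```scala
object HintLookupSketch {
  // Toy stand-ins for plan nodes; none of these are Spark's real classes.
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Cached(name: String) extends Plan            // plays the role of InMemoryRelation
  case class Filter(cond: String, child: Plan) extends Plan
  case class Hint(name: String, child: Plan) extends Plan // plays the role of ResolvedHint

  // Hints are transparent for matching: strip them before comparing plans.
  def stripHints(p: Plan): Plan = p match {
    case Hint(_, child)      => stripHints(child)
    case Filter(cond, child) => Filter(cond, stripHints(child))
    case other               => other
  }

  // Hints reachable from the root through unary nodes, in top-down order.
  def collectHints(p: Plan): List[String] = p match {
    case Hint(name, child) => name :: collectHints(child)
    case Filter(_, child)  => collectHints(child)
    case _                 => Nil
  }

  // Root-only handling: a hint survives only if it sits at the root of the plan.
  def rootOnly(p: Plan, cachedKey: Plan): Plan = p match {
    case Hint(name, child) => Hint(name, rootOnly(child, cachedKey))
    case _ if stripHints(p) == stripHints(cachedKey) => Cached("c")
    case other => other
  }

  // General handling: collect the hints first, then re-wrap the cached leaf with them.
  def general(p: Plan, cachedKey: Plan): Plan =
    if (stripHints(p) == stripHints(cachedKey))
      collectHints(p).foldRight[Plan](Cached("c"))((hint, acc) => Hint(hint, acc))
    else p
}
```

With the plan Filter("id > 100", Hint("broadcast", Relation("range"))) and the same plan without the hint as the cache key, rootOnly returns the bare cached leaf, while general returns the cached leaf wrapped in the broadcast hint.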

.map(_.cachedRepresentation.withOutput(currentFragment.output))
.getOrElse(currentFragment)
lookupCachedData(currentFragment).map { cached =>
// After cache lookup, we should still keep the hints from the input plan.
Member

@gatorsmile gatorsmile May 13, 2019

If the original cached plan has hints, should we keep/respect them? We need to define a clear behavior in our cache manager.

Contributor Author

It doesn't matter, because

  1. as a cache key, the lookup relies on semanticEquals, so having the hint node in the plan has no effect.
  2. the cache lookup returns InMemoryRelation, which has no hint.

I think the behavior is pretty clear: for any query, the hint behavior should be the same regardless of whether some sub-plans are cached.
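
The two points above can be sketched with a toy cache. Everything here is an assumption for illustration: Plan, Hint, InMemory, withoutHints, and lookup are hypothetical names, and comparing hint-stripped plans merely imitates hint-insensitive matching rather than reproducing Spark's actual comparison.

```scala
object CacheKeySketch {
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Hint(name: String, child: Plan) extends Plan
  case class InMemory(id: Int) extends Plan // hypothetical leaf standing in for InMemoryRelation

  // Comparing plans after removing hint nodes imitates hint-insensitive matching.
  def withoutHints(p: Plan): Plan = p match {
    case Hint(_, child) => withoutHints(child)
    case other          => other
  }

  // Each entry pairs the plan that was cached (hint included) with its materialized leaf.
  val cache: List[(Plan, InMemory)] =
    List(Hint("broadcast", Relation("t")) -> InMemory(1))

  // The result is always the hint-free leaf, whatever hints the key or the query carried.
  def lookup(p: Plan): Option[InMemory] =
    cache.collectFirst { case (key, data) if withoutHints(key) == withoutHints(p) => data }
}
```

In this model a query finds the cached entry with or without its own hints, and the hint stored in the cache key never reaches the result, so any hint in the final plan has to come from the query itself.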

Member

Basically, we ignore the hints that are specified in the original cached plans. If users want to use hints, they should specify them in the queries.

val cachedPlan = cached.cachedRepresentation.withOutput(currentFragment.output)
// The returned hint list is in top-down order. We should reverse it so that the top hint
// is still in the top node.
hints.reverse.foldLeft[LogicalPlan](cachedPlan) { case (p, hint) =>
Member

Do we have a test case covering the reverse logic?

checkHintExists()

// Clean-up
df.unpersist()
Member

@gatorsmile gatorsmile May 13, 2019

Use try finally?

finally {
  df.unpersist()
}
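
A minimal sketch of why try/finally matters here, using a hypothetical FakeCached class in place of a real cached DataFrame (runCheck is also an illustrative helper, not part of the test suite): the cleanup runs even when the check throws, so a failing assertion does not leave data cached for later tests.

```scala
object CleanupSketch {
  // Hypothetical stand-in for a cached DataFrame; it only tracks whether it is persisted.
  final class FakeCached {
    var persisted: Boolean = true
    def unpersist(): Unit = persisted = false
  }

  // Runs the check and always unpersists afterwards, even if the check throws.
  def runCheck(df: FakeCached)(check: => Unit): Unit =
    try check
    finally df.unpersist()
}
```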

.getOrElse(currentFragment)
lookupCachedData(currentFragment).map { cached =>
// After cache lookup, we should still keep the hints from the input plan.
val hints = EliminateResolvedHint.extractHintsFromPlan(currentFragment)._2
Member

extractHintsFromPlan was originally a private function. Asking the caller to call reverse on its result is weird. We could add a new function to EliminateResolvedHint, or even a new object for hint processing.

Contributor Author

It's natural to return the hints in top-down order, and the caller is free to process the returned hints however it needs, including reversing them.
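
For readers following the ordering question, this small sketch shows what the reversal achieves on a top-down hint list. Plan, CachedLeaf, and Hint are toy types (assumptions, not Spark classes). Folding from the right produces the same tree as reversing and folding from the left, whereas folding from the left without reversing flips the nesting.

```scala
object HintOrderSketch {
  sealed trait Plan
  case object CachedLeaf extends Plan
  case class Hint(name: String, child: Plan) extends Plan

  // Hints as they would be returned for a plan shaped Hint(a, Hint(b, ...)): top-down.
  val hints: List[String] = List("a", "b")

  // Reverse first, then wrap from the innermost hint outwards.
  val viaReverse: Plan =
    hints.reverse.foldLeft[Plan](CachedLeaf) { case (p, h) => Hint(h, p) }

  // Same result without an explicit reverse.
  val viaFoldRight: Plan =
    hints.foldRight[Plan](CachedLeaf) { case (h, p) => Hint(h, p) }

  // Forgetting the reverse puts the bottom hint on top.
  val withoutReverse: Plan =
    hints.foldLeft[Plan](CachedLeaf) { case (p, h) => Hint(h, p) }
}
```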

@SparkQA

SparkQA commented May 14, 2019

Test build #105374 has finished for PR 24580 at commit f47807e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented May 14, 2019

Test build #105379 has finished for PR 24580 at commit f47807e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// The returned hint list is in top-down order, we should create the hint nodes from
// right to left.
hints.foldRight[LogicalPlan](cachedPlan) { case (hint, p) =>
ResolvedHint(p, hint)
Member

Is this semantically the same as the original cached plan?

Take one example from the added test: broadcast(spark.range(1000)).filter($"id" > 100). Originally, the broadcasted plan is spark.range(1000). After using the cached data, it seems that the cached spark.range(1000).filter($"id" > 100) is what the hint actually broadcasts. It is a slight difference, but could it have any effect?

Contributor Author

The semantics of a hint node are special. By design, only join nodes have hints, so Hint(Filter(Relation)) is the same as Filter(Hint(Relation)): both indicate that the left/right sub-tree of a join node has a hint.
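
A toy model of that equivalence, under the assumption that a join side's hint is found by looking through Filter nodes. Relation, Filter, Hint, Join, sideHint, and joinHints are hypothetical names for illustration, not the real Catalyst implementation.

```scala
object JoinHintSketch {
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Filter(cond: String, child: Plan) extends Plan
  case class Hint(name: String, child: Plan) extends Plan
  case class Join(left: Plan, right: Plan) extends Plan

  // First hint found on a join side, looking through Filter nodes but not into nested joins.
  def sideHint(p: Plan): Option[String] = p match {
    case Hint(name, _)    => Some(name)
    case Filter(_, child) => sideHint(child)
    case _                => None
  }

  // The (left, right) hints a join ends up with in this toy model.
  def joinHints(j: Join): (Option[String], Option[String]) =
    (sideHint(j.left), sideHint(j.right))
}
```

In this model, a join whose right side is Hint("broadcast", Filter(...)) and one whose right side is Filter(..., Hint("broadcast", ...)) both end up with the same right-side hint, which is why moving the hint above the cached filter does not change what the join sees.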

Member

Ok, I see. Makes sense and it's fine.

@SparkQA

SparkQA commented May 14, 2019

Test build #105383 has finished for PR 24580 at commit 48e55fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM

Thanks! Merged to master.


5 participants