[SPARK-27674][SQL] the hint should not be dropped after cache lookup #24580

cloud-fan · 2019-05-10T13:39:29Z

What changes were proposed in this pull request?

This is a follow-up to #20365.

#20365 fixed the problem of hints being dropped after cache lookup when the hint node is the root node. This PR fixes the problem for all cases.

How was this patch tested?

a new test

cloud-fan · 2019-05-10T13:42:38Z

cc @gatorsmile @maryannxue

SparkQA · 2019-05-10T16:37:41Z

Test build #105314 has finished for PR 24580 at commit 8934bf4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maryannxue · 2019-05-11T03:47:17Z

sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala

+          val cachedPlan = cached.cachedRepresentation.withOutput(currentFragment.output)
+          // The returned hint list is in top-down order. We should reverse it so that the top hint
+          // is still in the top node.
+          hints.reverse.foldLeft[LogicalPlan](cachedPlan) { case (p, hint) =>


I suppose hints can be lost further down in the tree by matching "canonicalized". Do we need to take care of that as well?

For hints that don't take effect in the original query, we can drop them.

Actually, we have to drop these inaccessible hints. The cache lookup returns a leaf node, InMemoryRelation, and we should only add back the accessible hints.

We don't need to drop them, right? Hints are transparent in canonicalization. But I agree the inner hints don't matter, because they will be replaced with a leaf node anyway.

I'm wondering, though: can we change lookupCachedData instead? Like:

    def lookupCachedData(plan: LogicalPlan): Option[CachedData] = plan match {
      case ResolvedHint(child, hints) => lookupCachedData(child).map(p => ResolvedHint(p, hints))
      case _ => cachedData.find(cd => plan.sameResult(cd.plan))
    }

What if the plan is Filter(ResolvedHint(...))? This PR is trying to fix the problem when the hint node is not the root node.

BTW, lookupCachedData needs to return a CachedData, so we can't add a hint node there.
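
To make the Filter(ResolvedHint(...)) objection concrete, here is a minimal sketch on a toy plan model. None of these classes are Spark's: Plan, Relation, Filter, Hint, Cached, stripHints and rootOnlyLookup are hypothetical stand-ins, and a Set of hint-free plans plays the role of the cache. It only illustrates that peeling hints at the root preserves a root hint but loses one nested under a Filter.

```scala
// Toy model for illustration only; these are NOT Spark's LogicalPlan classes.
object RootOnlyLookupSketch {
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Filter(cond: String, child: Plan) extends Plan
  case class Hint(name: String, child: Plan) extends Plan
  // Hypothetical stand-in for the leaf node returned by a cache hit.
  case class Cached(key: Plan) extends Plan

  // Remove every hint node, imitating how hints are ignored when plans are compared.
  def stripHints(p: Plan): Plan = p match {
    case Hint(_, c)      => stripHints(c)
    case Filter(cond, c) => Filter(cond, stripHints(c))
    case other           => other
  }

  // Root-only variant: hints are re-attached only when they sit at the top of the plan.
  def rootOnlyLookup(plan: Plan, cache: Set[Plan]): Option[Plan] = plan match {
    case Hint(n, c) => rootOnlyLookup(c, cache).map(Hint(n, _))
    case _ =>
      val key = stripHints(plan)
      if (cache.contains(key)) Some(Cached(key)) else None
  }
}
```

With a cache holding Filter("id > 100", Relation("range")), looking up Hint("broadcast", Filter(...)) in this toy model keeps the hint on top of the Cached leaf, while looking up Filter("id > 100", Hint("broadcast", Relation("range"))) returns the bare Cached leaf and the hint is gone.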

gatorsmile · 2019-05-13T05:14:48Z

sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala

-          .map(_.cachedRepresentation.withOutput(currentFragment.output))
-          .getOrElse(currentFragment)
+        lookupCachedData(currentFragment).map { cached =>
+          // After cache lookup, we should still keep the hints from the input plan.


If the original cached plan has a hint, should we keep/respect it? We need to define a clear behavior in our cache manager.

It doesn't matter, because:

1. As a cache key, the lookup relies on semanticEquals, so having the hint node in the plan has no effect.
2. The cache lookup returns InMemoryRelation, which has no hint.

I think the behavior is pretty clear: for any query, the hint behavior should be the same regardless of whether some sub-plans are cached.

Basically, we ignore the hints that are specified in the original cached plans. If users want to use hints, they should specify them in the queries.

gatorsmile · 2019-05-13T16:53:58Z

sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala

+          val cachedPlan = cached.cachedRepresentation.withOutput(currentFragment.output)
+          // The returned hint list is in top-down order. We should reverse it so that the top hint
+          // is still in the top node.
+          hints.reverse.foldLeft[LogicalPlan](cachedPlan) { case (p, hint) =>


Do we have a test case covering the reverse logic?

gatorsmile · 2019-05-13T16:54:39Z

sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala

+      checkHintExists()
+
+      // Clean-up
+      df.unpersist()


Use try finally?

finally { df.unpersist() }
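
The try/finally suggestion can be sketched without Spark. In the snippet below, FakeCached and withCached are hypothetical names invented for illustration; FakeCached only tracks a flag and stands in for a cached DataFrame. The point is simply that the clean-up runs even when the body throws, for example on a failed assertion.

```scala
// Illustrative only: FakeCached is a hypothetical stand-in, not a Spark class.
object CleanupSketch {
  final class FakeCached {
    var persisted: Boolean = false
    def persist(): Unit = persisted = true
    def unpersist(): Unit = persisted = false
  }

  // Runs `body` while the resource is persisted and always unpersists afterwards,
  // whether `body` returns normally or throws.
  def withCached[T](c: FakeCached)(body: => T): T = {
    c.persist()
    try body finally c.unpersist()
  }
}
```

A test written this way leaves no cached state behind for later tests, even if a check inside the body fails.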

gatorsmile · 2019-05-13T17:08:18Z

sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala

-          .getOrElse(currentFragment)
+        lookupCachedData(currentFragment).map { cached =>
+          // After cache lookup, we should still keep the hints from the input plan.
+          val hints = EliminateResolvedHint.extractHintsFromPlan(currentFragment)._2


extractHintsFromPlan(currentFragment)._2 was originally a private function. Asking the caller to call reverse is weird. We can add a new function in EliminateResolvedHint or even add a new object for Hint processing.

It's natural to return the hints in a top-down fashion. The caller side is then free to process the returned hints, including reversing them.
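
The ordering question can be checked on a toy model. In the sketch below, Plan, Leaf and Hint are hypothetical stand-ins (not Spark classes) and hints are plain strings listed top-down. It shows that re-wrapping a leaf with reverse.foldLeft and with foldRight builds the same tree, with the first hint in the list ending up outermost.

```scala
// Toy model for illustration only; hints are plain strings in top-down order.
object HintOrderSketch {
  sealed trait Plan
  case class Leaf(name: String) extends Plan
  case class Hint(name: String, child: Plan) extends Plan

  // Reverse first, then wrap from the innermost hint outwards.
  def viaReverseFoldLeft(hints: Seq[String], leaf: Plan): Plan =
    hints.reverse.foldLeft(leaf) { case (p, h) => Hint(h, p) }

  // foldRight visits the last (innermost) hint first, so no explicit reverse is needed.
  def viaFoldRight(hints: Seq[String], leaf: Plan): Plan =
    hints.foldRight(leaf) { case (h, p) => Hint(h, p) }
}
```

For Seq("outer", "inner") and Leaf("cache"), both functions yield Hint("outer", Hint("inner", Leaf("cache"))), which is the kind of assertion a test covering the ordering could make.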

SparkQA · 2019-05-14T07:05:03Z

Test build #105374 has finished for PR 24580 at commit f47807e.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-05-14T08:57:44Z

retest this please

SparkQA · 2019-05-14T11:10:02Z

Test build #105379 has finished for PR 24580 at commit f47807e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2019-05-14T15:51:40Z

sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala

+          // The returned hint list is in top-down order, we should create the hint nodes from
+          // right to left.
+          hints.foldRight[LogicalPlan](cachedPlan) { case (hint, p) =>
+            ResolvedHint(p, hint)


Is this the same (semantically) as the original cached plan?

Take one example from the added test: broadcast(spark.range(1000)).filter($"id" > 100). Originally, the broadcasted plan is spark.range(1000). After using the cached data, it seems that the cached spark.range(1000).filter($"id" > 100) is what the hint actually broadcasts. The difference is slight, but could it have any significant effect?

The semantics of a hint node are special. By design, only join nodes have hints, so Hint(Filter(Relation)) is the same as Filter(Hint(Relation)), as they both indicate that the left/right sub-tree of a join node has a hint.

Ok, I see. Makes sense and it's fine.
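
The equivalence of Hint(Filter(Relation)) and Filter(Hint(Relation)) can be illustrated with a toy extraction function. Everything below is a hypothetical model, not Spark's EliminateResolvedHint: extractHints walks down a chain of unary nodes, collects hint names top-down, and returns the hint-free plan, which is roughly the view a join would have of one of its sub-trees under the assumption stated in the thread.

```scala
// Toy model for illustration only; NOT Spark's EliminateResolvedHint.
object HintExtractionSketch {
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Filter(cond: String, child: Plan) extends Plan
  case class Hint(name: String, child: Plan) extends Plan

  // Returns the plan with hints removed, plus the hints found, in top-down order.
  def extractHints(plan: Plan): (Plan, Seq[String]) = plan match {
    case Hint(n, c) =>
      val (stripped, hints) = extractHints(c)
      (stripped, n +: hints)
    case Filter(cond, c) =>
      val (stripped, hints) = extractHints(c)
      (Filter(cond, stripped), hints)
    case leaf => (leaf, Seq.empty)
  }
}
```

In this model, a hint above the Filter and a hint below it both extract to the same pair: the hint-free Filter(Relation) plan and the single hint "broadcast".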

SparkQA · 2019-05-14T16:40:48Z

Test build #105383 has finished for PR 24580 at commit 48e55fa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2019-05-15T22:46:58Z

LGTM

Thanks! Merged to master.

the hint should not be dropped after cache lookup

8934bf4

cloud-fan force-pushed the bug branch from 25bf3d0 to 8934bf4 Compare May 10, 2019 13:42

maryannxue reviewed May 11, 2019

gatorsmile reviewed May 13, 2019

address comments

48e55fa

cloud-fan force-pushed the bug branch from f47807e to 48e55fa Compare May 14, 2019 13:37

viirya reviewed May 14, 2019

gatorsmile closed this in 3e30a98 May 15, 2019
