[SPARK-19736][SQL] refreshByPath should clear all cached plans with the specified path #17064
Conversation
Test build #73464 has finished for PR 17064 at commit

Thanks, LGTM.

@kiszk Thanks.

cc @cloud-fan
}
sparkSession.sharedState.cacheManager.cacheQuery(Dataset.ofRows(sparkSession, data.plan))
case _ => // Do Nothing
cachedData.filter {
Why doesn't the previous one work?
This kind of collection can't be modified during iteration. Some elements are not iterated over if we delete/add elements.
But we are still modifying it during iteration, after the filter. Can you be more specific about what the problem is?
Can we use a Java collection so that we can remove elements while iterating?
After filter, we iterate over a different collection than cachedData, so it is not a problem to add/delete elements in cachedData.
The problem can be shown clearly with an example code snippet:
val t = new scala.collection.mutable.ArrayBuffer[Int]
t += 1
t += 2
t.foreach {
  case i if i > 0 =>
    println(s"i = $i")
    val index = t.indexWhere(_ == i)
    if (index >= 0) {
      t.remove(index)
    }
    println(s"t: $t")
    t += (i + 2)
    println(s"t: $t")
}
Output:
i = 1 // The first iteration, we get the first element "1"
t: ArrayBuffer(2) // "1" has been removed from the array
t: ArrayBuffer(2, 3) // New element "3" has been inserted
i = 3 // In next iteration, element "2" is wrongly skipped
t: ArrayBuffer(2) // "3" has been removed from the array
t: ArrayBuffer(2, 5)
The element "2" is never iterated over.
@cloud-fan I noticed you opened #17097, so should I close this?

No, you shouldn't. That's a refactor PR and it accidentally fixed the same bug.
test("refreshByPath should refresh all cached plans with the specified path") {
  def f(path: String, spark: SparkSession, dataCount: Int): DataFrame = {
    spark.catalog.refreshByPath(path)
We can put `spark.range(dataCount).write.mode("overwrite").parquet(path)` at the beginning of this method and name it `testRefreshByPath` instead of `f`.
    val df1 = df.filter("id > 11")
    df1.cache
    assert(df1.count == dataCount - 12)
    df1
I don't get it: so we call refreshByPath before caching the query? Shouldn't we test the opposite order?
The function is called twice, so it is actually meant to refresh the cache on the first call. Since I will change the test to what you suggested in #17064 (comment), we can get rid of this confusion.
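For illustration, a rough sketch of the shape the earlier suggestion implies (the helper was later dropped in favor of the explicit test below); `df` and the exact placement of `refreshByPath` are assumptions, since the full method body is not shown in this excerpt.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical restructured helper; names and counts are illustrative only.
def testRefreshByPath(path: String, spark: SparkSession, dataCount: Int): DataFrame = {
  spark.range(dataCount).write.mode("overwrite").parquet(path)  // moved to the top, as suggested
  spark.catalog.refreshByPath(path)
  val df = spark.read.parquet(path)          // assumed definition of `df`
  val df1 = df.filter("id > 11")
  df1.cache()
  assert(df1.count == dataCount - 12)
  df1
}
```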
  }

  withTempDir { dir =>
    val path = dir.getPath()
We usually call `dir.getCanonicalPath`.
    assert(f(path, spark, 100).count == 88)

    spark.range(1000).write.mode("overwrite").parquet(path)
    assert(f(path, spark, 1000).count == 988)
We can make this test more explicit:
spark.range(10).write.mode("overwrite").parquet(path)
spark.read.parquet(path).cache()
spark.read.parquet(path).filter($"id" > 4).cache()
assert(spark.read.parquet(path).filter($"id" > 4).count() == 5)
spark.range(20).write.mode("overwrite").parquet(path)
spark.catalog.refreshByPath(path)
assert(spark.read.parquet(path).filter($"id" > 4).count() == 15)
Ok. Looks simpler and more explicit.
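Putting the suggestions together, the final test could plausibly look like the sketch below (an assumption assembled from this thread, not the merged code verbatim); it assumes a suite that provides `spark`, `withTempDir`, and the `$`-string column implicits (e.g. via `import spark.implicits._`).

```scala
test("refreshByPath should refresh all cached plans with the specified path") {
  withTempDir { dir =>
    val path = dir.getCanonicalPath
    spark.range(10).write.mode("overwrite").parquet(path)
    // Cache two different plans that read from the same path.
    spark.read.parquet(path).cache()
    spark.read.parquet(path).filter($"id" > 4).cache()
    assert(spark.read.parquet(path).filter($"id" > 4).count() == 5)

    // Overwrite the data and refresh by path: both cached plans must be refreshed.
    spark.range(20).write.mode("overwrite").parquet(path)
    spark.catalog.refreshByPath(path)
    assert(spark.read.parquet(path).filter($"id" > 4).count() == 15)
  }
}
```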
@cloud-fan Thanks. I will address your comments soon.

Test build #73641 has finished for PR 17064 at commit

Thanks, merging to master!

@cloud-fan Thank you!
…L] Backport Three Cache-related PRs to Spark 2.1

### What changes were proposed in this pull request?

Backport a few cache-related PRs:

---

[[SPARK-19093][SQL] Cached tables are not used in SubqueryExpression](#16493)

Consider the plans inside subquery expressions while looking up the cache manager to make use of cached data. Currently CacheManager.useCachedData does not consider the subquery expressions in the plan.

---

[[SPARK-19736][SQL] refreshByPath should clear all cached plans with the specified path](#17064)

Catalog.refreshByPath can refresh the cache entry and the associated metadata for all dataframes (if any) that contain the given data source path. However, CacheManager.invalidateCachedPath doesn't clear all cached plans with the specified path. It causes some strange behaviors reported in SPARK-15678.

---

[[SPARK-19765][SPARK-18549][SQL] UNCACHE TABLE should un-cache all cached plans that refer to this table](#17097)

When un-caching a table, we should not only remove the cache entry for this table, but also un-cache any other cached plans that refer to this table. The following commands trigger the table uncache: `DropTableCommand`, `TruncateTableCommand`, `AlterTableRenameCommand`, `UncacheTableCommand`, `RefreshTable` and `InsertIntoHiveTable`.

This PR also includes some refactors:
- use java.util.LinkedList to store the cache entries, so that it's safer to remove elements while iterating
- rename invalidateCache to recacheByPlan, which is more obvious about what it does

### How was this patch tested?

N/A

Author: Xiao Li <[email protected]>

Closes #17319 from gatorsmile/backport-17097.
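The backport description above mentions switching the cache entries to java.util.LinkedList so that removal during iteration is safe. As a minimal sketch (not the actual CacheManager code), a Java iterator's remove() can delete the current element mid-traversal, which mutating a Scala ArrayBuffer inside a foreach does not support.

```scala
import java.util.LinkedList

val entries = new LinkedList[String]()
entries.add("cached plan A")
entries.add("cached plan B")
entries.add("cached plan C")

val it = entries.iterator()
while (it.hasNext) {
  val e = it.next()
  if (e.endsWith("B")) {
    it.remove()  // structural removal through the iterator itself is safe
  }
}
// entries now contains ["cached plan A", "cached plan C"]
```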
What changes were proposed in this pull request?
Catalog.refreshByPath can refresh the cache entry and the associated metadata for all dataframes (if any) that contain the given data source path. However, CacheManager.invalidateCachedPath doesn't clear all cached plans with the specified path. It causes some strange behaviors reported in SPARK-15678.

How was this patch tested?
Jenkins tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
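For reference, a minimal sketch of the user-facing behavior described above, assuming `spark` is an active SparkSession and `path` is a parquet directory that cached DataFrames were read from:

```scala
// `spark` is an active SparkSession; `path` points at a parquet directory.
val cached = spark.read.parquet(path).filter("id > 4")
cached.cache()

// After the underlying files change, refreshByPath should invalidate (and
// re-cache) every cached plan that reads from `path`, not just one entry.
spark.catalog.refreshByPath(path)
```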