[SPARK-9689][SQL] Fix bug of not invalidating the cache for InsertIntoHadoopFsRelation #8023
Changes from all commits
@@ -183,6 +183,16 @@ private[sql] case class InMemoryRelation(
       batchStats).asInstanceOf[this.type]
   }

+  private[sql] def withChild(newChild: SparkPlan): this.type = {
+    new InMemoryRelation(
+      output.map(_.newInstance()),
+      useCompression,
+      batchSize,
+      storageLevel,
+      newChild,
+      tableName)().asInstanceOf[this.type]
+  }
Contributor
This method is equivalent to …
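For context, a minimal sketch (not from the PR) of how `withChild` is meant to be used: swap in a freshly planned physical child while keeping the cache settings, so the column buffers are rebuilt on the next materialization. Note that these classes are `private[sql]`, so this only illustrates the call shape; `relation` and `freshPlan` are hypothetical names:

```scala
import org.apache.spark.sql.columnar.InMemoryRelation
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical helper: rebuild a cached relation around a freshly planned child.
// The result keeps useCompression, batchSize, storageLevel and tableName, but
// recomputes its column buffers from freshPlan when next materialized.
def rebuildCache(relation: InMemoryRelation, freshPlan: SparkPlan): InMemoryRelation =
  relation.withChild(freshPlan)
```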
   def cachedColumnBuffers: RDD[CachedBatch] = _cachedColumnBuffers

   override protected def otherCopyArgs: Seq[AnyRef] =
@@ -27,7 +27,16 @@ import org.apache.spark.storage.StorageLevel
 import org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK

 /** Holds a cached logical plan and its data */
-private[sql] case class CachedData(plan: LogicalPlan, cachedRepresentation: InMemoryRelation)
+private[sql] class CachedData(
+    val plan: LogicalPlan,
+    var cachedRepresentation: InMemoryRelation) {
+  private[sql] def recache(sqlContext: SQLContext): Unit = {
+    cachedRepresentation.uncache(true) // release the cache
Contributor Author
Instead of re-running the existing RDD, we re-create the RDD for the recache and then run it: once the RDD has been created, we have no chance to change its input files any more, so recaching the old RDD would not actually work.
+    // re-generate the spark plan and cache
+    cachedRepresentation =
+      cachedRepresentation.withChild(sqlContext.executePlan(plan).executedPlan)
+  }
+}
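To illustrate the point about materialized RDDs, a runnable sketch (plain Spark Core, local mode; `/tmp/data` is a hypothetical input directory):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RefreshDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("refresh-demo").setMaster("local[*]"))
    // An RDD resolves its input splits when its partitions are first computed,
    // so re-running the same RDD instance does not pick up files added later.
    val rdd = sc.textFile("/tmp/data")
    println(rdd.count())                 // scans the files present at planning time
    // ... suppose another job now writes new files into /tmp/data ...
    println(rdd.count())                 // still the old file set: partitions are fixed
    val fresh = sc.textFile("/tmp/data") // re-creating the RDD re-lists the directory
    println(fresh.count())               // sees the new files
    sc.stop()
  }
}
```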
 /**
  * Provides support in a SQLContext for caching query results and automatically using these cached
@@ -97,13 +106,13 @@ private[sql] class CacheManager(sqlContext: SQLContext) extends Logging {
       logWarning("Asked to cache already cached data.")
     } else {
       cachedData +=
-        CachedData(
+        new CachedData(
           planToCache,
           InMemoryRelation(
             sqlContext.conf.useCompression,
             sqlContext.conf.columnBatchSize,
             storageLevel,
-            sqlContext.executePlan(query.logicalPlan).executedPlan,
+            sqlContext.executePlan(planToCache).executedPlan,
Contributor
I don't think we should change this line. For example, an existing Parquet dataset may be overwritten, and the new dataset may have a completely different schema. When this happens, the original code can catch the error by performing the analysis phase.

Contributor Author
This is just an optimization; we don't want to re-analyze the logical plan, since that is done right before this call. When you check the full code of the function …
             tableName))
     }
   }
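For reference, a simplified sketch of the surrounding method (based on the Spark 1.5-era `CacheManager`; details may differ): `planToCache` is already the analyzed plan, which is why planning it directly avoids re-running analysis on `query.logicalPlan`.

```scala
// Sketch of the enclosing cacheQuery method (simplified).
private[sql] def cacheQuery(
    query: DataFrame,
    tableName: Option[String] = None,
    storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = writeLock {
  val planToCache = query.queryExecution.analyzed // analysis already done here
  if (lookupCachedData(planToCache).nonEmpty) {
    logWarning("Asked to cache already cached data.")
  } else {
    // ... build the new CachedData entry exactly as in the diff above ...
  }
}
```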
@@ -156,10 +165,27 @@ private[sql] class CacheManager(sqlContext: SQLContext) extends Logging {
    * function will over invalidate.
    */
   private[sql] def invalidateCache(plan: LogicalPlan): Unit = writeLock {
-    cachedData.foreach {
-      case data if data.plan.collect { case p if p.sameResult(plan) => p }.nonEmpty =>
-        data.cachedRepresentation.recache()
-      case _ =>
-    }
+    var i = 0
+    var locatedIdx = -1
+    // find the index of the cached data, according to the specified logical plan
+    while (i < cachedData.length && locatedIdx < 0) {
+      cachedData(i) match {
+        case data if data.plan.collect { case p if p.sameResult(plan) => p }.nonEmpty =>
+          locatedIdx = i
+        case _ =>
+      }
+      i += 1
+    }
+
+    if (locatedIdx >= 0) {
+      // if the cached data exists, remove it from the cached data list, as we need to
+      // re-generate the spark plan, and we don't want this entry to be used during the
+      // re-generation
Contributor
There's no need to remove it first, since the whole method is wrapped in `writeLock`.

Contributor Author
In …
+      val entry = cachedData.remove(locatedIdx) // TODO do we have to use ArrayBuffer?
+      // rebuild the cache
+      entry.recache(sqlContext)
+      // add it back to the cached data list
+      cachedData += entry
Contributor
A problem of this change is that, for example:

```
df0 = sqlContext.range(10)
df1 = df0.filter(df0.id > 5).cache()
df2 = df0.filter(df0.id > 1).cache()
df1.count()
df2.count()
```

In the above case, the query plans of both cached DataFrames contain the plan being invalidated, but the loop above stops at the first match, so only one entry is recached. We could simply do:

```scala
cachedData.foreach { data =>
  if (data.plan.find(_.sameResult(plan)).isDefined) {
    data.recache(sqlContext)
  }
}
```

Contributor Author
Yes, that's a good catch. I will see how to fix the chained logical cached plan.

Contributor Author
And probably we can not simply use the …
     }
   }
 }
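For reference, here is what the reviewer's suggestion would look like as the full method body (a sketch, assuming the same `CachedData.recache` signature and `writeLock` wrapper as in the diff above):

```scala
// Recache every cached entry whose plan contains a node producing the same
// result as `plan`, instead of stopping at the first match; this covers the
// case where several cached plans share the same underlying relation.
private[sql] def invalidateCache(plan: LogicalPlan): Unit = writeLock {
  cachedData.foreach { data =>
    if (data.plan.find(_.sameResult(plan)).isDefined) {
      data.recache(sqlContext)
    }
  }
}
```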
@@ -160,6 +160,9 @@ private[sql] case class InsertIntoHadoopFsRelation(
       logInfo("Skipping insertion into a relation that already exists.")
     }

+    // Invalidate the cache.
Contributor Author
@yhuai We need to refresh the … And even if the user refreshes the file status explicitly, I don't think we have a correct API for that, do we?
+    sqlContext.cacheManager.invalidateCache(LogicalRelation(relation))
+
Contributor
This line should be moved right after the …

Contributor Author
Yes, true, I will update it.
     Seq.empty[Row]
   }
 }
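A sketch of the user-visible scenario this hunk addresses (hypothetical path `/tmp/t`; assumes an existing `SQLContext` named `sqlContext`). Before the fix, the cached plan kept scanning the files that existed when it was first materialized:

```scala
val df = sqlContext.read.parquet("/tmp/t")
df.cache()
df.count()                                              // materializes the cache
sqlContext.range(10).write.mode("overwrite").parquet("/tmp/t")
df.count()                                              // should reflect the new data;
                                                        // without invalidateCache it did not
```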
@@ -565,6 +565,7 @@ abstract class HadoopFsRelation private[sql](maybePartitionSpec: Option[Partitio
       filters: Array[Filter],
       inputPaths: Array[String],
       broadcastedConf: Broadcast[SerializableConfiguration]): RDD[Row] = {
+    refresh()
Contributor Author
@liancheng It seems refreshing the file status is unavoidable; let's do that right before getting the input files.

Contributor
Yeah, I agree. Basically it's impossible to …
In the old JSON relation implementation, the refreshing logic is done by …

Contributor Author
I agree we'd better provide our own …

Contributor Author
And we also need to refresh the partition directories before pruning the partitions; we probably need to think further about how to fix that as well, in the following PR(s).

Contributor
Yeah, I'll probably work on this later this week; it can be relatively tricky to handle...
     val inputStatuses = inputPaths.flatMap { input =>
       val path = new Path(input)
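A toy model of the refresh-before-scan pattern added above, using only the plain Hadoop `FileSystem` API (class and member names are illustrative, not Spark's):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// File statuses are cached, and refresh() re-lists the directory so a
// subsequent scan sees files added since the last listing.
class FileStatusCache(dir: Path, conf: Configuration) {
  private var cached: Array[FileStatus] = Array.empty

  def refresh(): Unit = {
    val fs: FileSystem = dir.getFileSystem(conf)
    cached = fs.listStatus(dir) // pick up newly written files
  }

  def inputFiles: Array[FileStatus] = {
    refresh() // mirror the refresh() call added at the top of buildScan
    cached
  }
}
```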
Contributor Author
@yhuai @liancheng After double-checking the source code: the spark plan inside `InMemoryRelation` is a `PhysicalRDD`, which holds a data source scanning RDD instance as its property. That's what I mean by "we will not pick up the latest files under the path when the `recache` method is called": the RDD is already materialized and never changes. This PR re-creates the SparkPlan from the logical plan, and `DataSourceStrategy` will rebuild the RDD based on the latest files.

See:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L99
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L312

I've actually tried some other approaches for the fix:
- Changing `PhysicalRDD` to take an RDD builder instead of the RDD itself as its property; however, this failed because it would widely impact the existing code.
- Refreshing inside `HadoopFsRelation`, as `inputFiles: Array[FileStatus]` is widely used for `buildScan`; in particular, the partition pruning is done in `DataSourceStrategy`, not in `HadoopFsRelation`.
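A minimal illustration of the difference described above (the case class names `HoldsRdd` and `HoldsRddBuilder` are hypothetical, not Spark API):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Holding a concrete RDD (as PhysicalRDD does) freezes the input: the RDD's
// partitions were computed from the files present at creation time.
case class HoldsRdd(rdd: RDD[Row])

// Holding a builder defers RDD creation, so every materialization could
// re-resolve the input files. This is the "RDDBuilder" idea mentioned above,
// reportedly abandoned because of its wide impact on existing code.
case class HoldsRddBuilder(buildRdd: () => RDD[Row])
```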