-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-19748][SQL]refresh function has a wrong order to do cache invalidate and regenerate the inmemory var for InMemoryFileIndex with FileStatusCache #17079
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…alidate and regenerate the inmemory var for InMemoryFileIndex with FileStatusCache
|
Test build #73502 has finished for PR 17079 at commit
|
| val fileStatusCache = FileStatusCache.getOrCreate(spark) | ||
| val dirPath = new Path(dir.getAbsolutePath) | ||
| val catalog = new InMemoryFileIndex(spark, Seq(dirPath), Map.empty, | ||
| None, fileStatusCache) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
val catalog =
new XXX(...) {
def xxx
}
|
|
||
| assert(catalog.leafFilePaths.size == 1) | ||
| assert(catalog.leafFilePaths.head.toString.stripSuffix("/") == | ||
| s"file:${file.getAbsolutePath.stripSuffix("/")}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks hacky, can you turn them into Path and compare?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, let me modify~ thanks~
|
good catch! Can you show a real example that fails because of this bug? I'm wondering why the existing unit tests didn't expose this bug... |
|
there is no related test case for InMemoryFileIndex with FileStatusCache. |
|
Test build #73546 has finished for PR 17079 at commit
|
| } | ||
|
|
||
| assert(catalog.leafDirPaths.isEmpty) | ||
| assert(catalog.leafFilePaths.isEmpty) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move these two asserts after stringToFile
| new InMemoryFileIndex(spark, Seq(dirPath), Map.empty, None, fileStatusCache) { | ||
| def leafFilePaths: Seq[Path] = leafFiles.keys.toSeq | ||
| def leafDirPaths: Seq[Path] = leafDirToChildrenFiles.keys.toSeq | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Indents issues for the above three lines.
|
LGTM except two minor comments. |
|
Test build #73560 has finished for PR 17079 at commit
|
|
thanks, merging to master/2.1! |
…alidate and regenerate the inmemory var for InMemoryFileIndex with FileStatusCache
## What changes were proposed in this pull request?
If we refresh a InMemoryFileIndex with a FileStatusCache, it will first use the FileStatusCache to re-generate the cachedLeafFiles etc, then call FileStatusCache.invalidateAll.
While the order to do these two actions is wrong, this lead to the refresh action does not take effect.
```
override def refresh(): Unit = {
refresh0()
fileStatusCache.invalidateAll()
}
private def refresh0(): Unit = {
val files = listLeafFiles(rootPaths)
cachedLeafFiles =
new mutable.LinkedHashMap[Path, FileStatus]() ++= files.map(f => f.getPath -> f)
cachedLeafDirToChildrenFiles = files.toArray.groupBy(_.getPath.getParent)
cachedPartitionSpec = null
}
```
## How was this patch tested?
unit test added
Author: windpiger <[email protected]>
Closes #17079 from windpiger/fixInMemoryFileIndexRefresh.
(cherry picked from commit a350bc1)
Signed-off-by: Wenchen Fan <[email protected]>
…ed to listFiles twice ## What changes were proposed in this pull request? Currently when we resolveRelation for a `FileFormat DataSource` without providing user schema, it will execute `listFiles` twice in `InMemoryFileIndex` during `resolveRelation`. This PR add a `FileStatusCache` for DataSource, this can avoid listFiles twice. But there is a bug in `InMemoryFileIndex` see: [SPARK-19748](#17079) [SPARK-19761](#17093), so this pr should be after SPARK-19748/ SPARK-19761. ## How was this patch tested? unit test added Author: windpiger <[email protected]> Closes #17081 from windpiger/resolveDataSourceScanFilesTwice.
What changes were proposed in this pull request?
If we refresh a InMemoryFileIndex with a FileStatusCache, it will first use the FileStatusCache to re-generate the cachedLeafFiles etc, then call FileStatusCache.invalidateAll.
While the order to do these two actions is wrong, this lead to the refresh action does not take effect.
How was this patch tested?
unit test added