
Conversation

@windpiger
Contributor

What changes were proposed in this pull request?

If we refresh an InMemoryFileIndex that is backed by a FileStatusCache, it first uses the (stale) FileStatusCache to regenerate cachedLeafFiles etc., and only then calls FileStatusCache.invalidateAll.

Because these two actions run in the wrong order, the refresh does not take effect.

```
  override def refresh(): Unit = {
    refresh0()
    fileStatusCache.invalidateAll()
  }

  private def refresh0(): Unit = {
    val files = listLeafFiles(rootPaths)
    cachedLeafFiles =
      new mutable.LinkedHashMap[Path, FileStatus]() ++= files.map(f => f.getPath -> f)
    cachedLeafDirToChildrenFiles = files.toArray.groupBy(_.getPath.getParent)
    cachedPartitionSpec = null
  }
```
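The fix is simply to invalidate the cache before re-listing. The following is a minimal, self-contained sketch (with hypothetical names, not Spark's actual classes) of why the order matters: if the re-listing happens first, it is served from the stale cache, and the subsequent invalidation cannot undo that.

```scala
import scala.collection.mutable

// Toy model of a cache-backed file index (hypothetical names, not Spark's API).
// `disk` stands in for the real file system; `cache` for the FileStatusCache.
class CachedFileIndex(disk: mutable.Set[String]) {
  private val cache = mutable.Map.empty[String, Set[String]]

  // Returns the cached listing if present, otherwise lists `disk` and caches it.
  private def listLeafFiles(root: String): Set[String] =
    cache.getOrElseUpdate(root, disk.toSet)

  var cachedLeafFiles: Set[String] = listLeafFiles("/")

  // Buggy order (pre-patch): re-list first -- which hits the stale cache --
  // then invalidate, so the stale listing is what gets kept.
  def refreshBuggy(): Unit = {
    cachedLeafFiles = listLeafFiles("/")
    cache.clear()
  }

  // Fixed order: invalidate first, then re-list against the real state.
  def refreshFixed(): Unit = {
    cache.clear()
    cachedLeafFiles = listLeafFiles("/")
  }
}
```

With a file added to `disk` after construction, `refreshBuggy()` still reports the old listing, while `refreshFixed()` picks up the new file.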

How was this patch tested?

unit test added

…alidate and regenerate the inmemory var for InMemoryFileIndex with FileStatusCache
@windpiger windpiger changed the title [SPARK-19748][SQL]refresh function has an wrong order to do cache invalidate and regenerate the inmemory var for InMemoryFileIndex with FileStatusCache [SPARK-19748][SQL]refresh function has a wrong order to do cache invalidate and regenerate the inmemory var for InMemoryFileIndex with FileStatusCache Feb 27, 2017
@SparkQA

SparkQA commented Feb 27, 2017

Test build #73502 has finished for PR 17079 at commit fd3bb21.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger
Contributor Author

cc @cloud-fan @gatorsmile

```
val fileStatusCache = FileStatusCache.getOrCreate(spark)
val dirPath = new Path(dir.getAbsolutePath)
val catalog = new InMemoryFileIndex(spark, Seq(dirPath), Map.empty,
  None, fileStatusCache) {
```
Contributor

nit:

```
val catalog =
  new XXX(...) {
    def xxx
  }
```


```
assert(catalog.leafFilePaths.size == 1)
assert(catalog.leafFilePaths.head.toString.stripSuffix("/") ==
  s"file:${file.getAbsolutePath.stripSuffix("/")}")
```
Contributor

this looks hacky, can you turn them into Path and compare?

Contributor Author

ok, let me modify~ thanks~
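The point of the suggestion above is that parsed `Path` objects normalize details (such as a trailing slash) that raw string comparison has to work around with `stripSuffix`. A tiny illustrative sketch, using `java.nio.file.Paths` as a stand-in for Hadoop's `Path` (the demo names are hypothetical):

```scala
import java.nio.file.Paths

object PathCompareDemo {
  // Raw strings differ on a trailing slash; parsed paths normalize it away.
  def sameAsStrings(a: String, b: String): Boolean = a == b
  def sameAsPaths(a: String, b: String): Boolean = Paths.get(a) == Paths.get(b)
}
```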

@cloud-fan
Contributor

good catch! Can you show a real example that fails because of this bug? I'm wondering why the existing unit tests didn't expose this bug...

@windpiger
Contributor Author

There is no existing test case for InMemoryFileIndex with a FileStatusCache. I found this bug while working on this PR, when I added a fileStatusCache to DataSource.

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73546 has finished for PR 17079 at commit 1ec20a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
}

assert(catalog.leafDirPaths.isEmpty)
assert(catalog.leafFilePaths.isEmpty)
```
Member

Move these two asserts after stringToFile

```
new InMemoryFileIndex(spark, Seq(dirPath), Map.empty, None, fileStatusCache) {
def leafFilePaths: Seq[Path] = leafFiles.keys.toSeq
def leafDirPaths: Seq[Path] = leafDirToChildrenFiles.keys.toSeq
}
```
Member

Nit: indentation is off in the three lines above.

@gatorsmile
Member

LGTM except two minor comments.

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73560 has finished for PR 17079 at commit 94879a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/2.1!

asfgit pushed a commit that referenced this pull request Feb 28, 2017
…alidate and regenerate the inmemory var for InMemoryFileIndex with FileStatusCache

## What changes were proposed in this pull request?

If we refresh an InMemoryFileIndex that is backed by a FileStatusCache, it first uses the (stale) FileStatusCache to regenerate cachedLeafFiles etc., and only then calls FileStatusCache.invalidateAll.

Because these two actions run in the wrong order, the refresh does not take effect.

```
  override def refresh(): Unit = {
    refresh0()
    fileStatusCache.invalidateAll()
  }

  private def refresh0(): Unit = {
    val files = listLeafFiles(rootPaths)
    cachedLeafFiles =
      new mutable.LinkedHashMap[Path, FileStatus]() ++= files.map(f => f.getPath -> f)
    cachedLeafDirToChildrenFiles = files.toArray.groupBy(_.getPath.getParent)
    cachedPartitionSpec = null
  }
```
## How was this patch tested?
unit test added

Author: windpiger <[email protected]>

Closes #17079 from windpiger/fixInMemoryFileIndexRefresh.

(cherry picked from commit a350bc1)
Signed-off-by: Wenchen Fan <[email protected]>
@asfgit asfgit closed this in a350bc1 Feb 28, 2017
asfgit pushed a commit that referenced this pull request Mar 3, 2017
…ed to listFiles twice

## What changes were proposed in this pull request?

Currently, when we resolveRelation for a `FileFormat DataSource` without providing a user schema, `listFiles` is executed twice in `InMemoryFileIndex` during `resolveRelation`.

This PR adds a `FileStatusCache` to DataSource, which avoids listing the files twice.

But there is a bug in `InMemoryFileIndex`, see:
 [SPARK-19748](#17079)
 [SPARK-19761](#17093),
so this PR should land after SPARK-19748 and SPARK-19761.
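The caching idea can be sketched as simple memoization of the listing call, so that two resolution passes over the same root trigger only one real listing. This is a hypothetical sketch (`ListingCache`, `doList` and `realListCalls` are illustrative names), not the actual Spark implementation:

```scala
import scala.collection.mutable

// Memoize directory listings so repeated resolution passes over the same
// root trigger exactly one real listing call.
class ListingCache(doList: String => Seq[String]) {
  private val cache = mutable.Map.empty[String, Seq[String]]
  var realListCalls = 0 // counts how often doList actually runs

  def listFiles(root: String): Seq[String] =
    cache.getOrElseUpdate(root, { realListCalls += 1; doList(root) })
}
```

Calling `listFiles` twice on the same root returns the same result both times while invoking the underlying lister only once.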

## How was this patch tested?
unit test added

Author: windpiger <[email protected]>

Closes #17081 from windpiger/resolveDataSourceScanFilesTwice.
