[SPARK-27801][SQL] Improve performance of InMemoryFileIndex.listLeafFiles for HDFS directories with many files #24672
@@ -23,6 +23,7 @@ import scala.collection.mutable
 
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs._
+import org.apache.hadoop.hdfs.DistributedFileSystem
 import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
 
 import org.apache.spark.SparkContext
@@ -274,7 +275,21 @@ object InMemoryFileIndex extends Logging {
     // [SPARK-17599] Prevent InMemoryFileIndex from failing if path doesn't exist
     // Note that statuses only include FileStatus for the files and dirs directly under path,
     // and does not include anything else recursively.
-    val statuses = try fs.listStatus(path) catch {
+    val statuses: Array[FileStatus] = try {
+      fs match {
+        // DistributedFileSystem overrides listLocatedStatus to make 1 single call to namenode
+        // to retrieve the file status with the file block location. The reason to still fallback
+        // to listStatus is because the default implementation would potentially throw a
+        // FileNotFoundException which is better handled by doing the lookups manually below.
+        case _: DistributedFileSystem =>
Member
@melin Are you hitting a perf regression? Do we need an internal SQLConf? cc @JoshRosen

Contributor
Unless I'm misunderstanding something, I don't think that @melin's posted benchmark is an apples-to-apples comparison: the direct comparison is only useful in cases where we are never going to make use of the location information, and to my knowledge we currently do not have a flag to bypass locality fetching.

Member
I'm +1 for @JoshRosen's comment.

Contributor (Author)
Before coming up with this change I was actually wondering why there were no conf options to ignore locality and why it was required to do those extra lookups. In situations where the Spark job is running on a small set of servers out of a big HDFS cluster, locality is of little use since the hit rate is very low; the time to get block locations was getting out of control for us, and the locality wait per task was slowing down the jobs. I think this change to get the blocks from the namenode is good enough for us now, but I can definitely see the benefit of a conf option to disable locality entirely, which would turn off fetching of getFileBlockLocations and also always use listStatus instead of my optimization to use listLocatedStatus.

Member
Please file a JIRA for further discussion on a new configuration option.
+          val remoteIter = fs.listLocatedStatus(path)
+          new Iterator[LocatedFileStatus]() {
+            def next(): LocatedFileStatus = remoteIter.next
+            def hasNext(): Boolean = remoteIter.hasNext
+          }.toArray
+        case _ => fs.listStatus(path)
+      }
+    } catch {
       case _: FileNotFoundException =>
         logWarning(s"The directory $path was not found. Was it deleted very recently?")
         Array.empty[FileStatus]
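For context on what the new `DistributedFileSystem` branch saves: before this change, the listing produced plain `FileStatus` entries and block locations were fetched per file, so an HDFS directory with N files cost roughly N + 1 namenode round trips. The snippet below is a minimal, simplified sketch of that pre-patch pattern; the helper name `listWithPerFileLocations` is illustrative, not Spark code.

```scala
import org.apache.hadoop.fs.{BlockLocation, FileStatus, FileSystem, Path}

object ListingCostSketch {
  // Simplified sketch of the pre-patch cost: one RPC for the listing, then one
  // additional namenode RPC per file to fetch its block locations.
  def listWithPerFileLocations(
      fs: FileSystem,
      path: Path): Seq[(FileStatus, Array[BlockLocation])] = {
    fs.listStatus(path).toSeq.map { status =>
      val locations =
        if (status.isFile) fs.getFileBlockLocations(status, 0, status.getLen)
        else Array.empty[BlockLocation]
      (status, locations)
    }
  }
  // With the patch, a single listLocatedStatus call returns LocatedFileStatus
  // entries that already carry this information for HDFS paths.
}
```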
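The review thread above also discusses a flag to skip locality entirely. As a rough sketch of how such a switch could wrap the new branch, assuming a hypothetical `ignoreLocality` flag (no such configuration is introduced by this PR):

```scala
import scala.collection.mutable.ArrayBuffer

import org.apache.hadoop.fs.{FileStatus, FileSystem, LocatedFileStatus, Path}
import org.apache.hadoop.hdfs.DistributedFileSystem

object LocalityToggleSketch {
  // Sketch only: `ignoreLocality` stands in for the kind of SQLConf option
  // discussed in the review comments; it is not defined by this PR.
  def listLeafStatuses(
      fs: FileSystem,
      path: Path,
      ignoreLocality: Boolean): Array[FileStatus] = {
    fs match {
      case _: DistributedFileSystem if !ignoreLocality =>
        // One namenode call that already carries the block locations.
        val remoteIter = fs.listLocatedStatus(path)
        val buf = ArrayBuffer.empty[LocatedFileStatus]
        while (remoteIter.hasNext) buf += remoteIter.next()
        buf.toArray[FileStatus]
      case _ =>
        // Plain listing; block locations are never fetched.
        fs.listStatus(path)
    }
  }
}
```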