[SPARK-27801][SQL] Improve performance of InMemoryFileIndex.listLeafFiles for HDFS directories with many files #24672
Conversation
…ith followup listFileBlockLocations
This is an awesome find 🎉. Fellow readers / reviewers, definitely check out the JIRA description, too, since it lists some impressive perf numbers: https://issues.apache.org/jira/browse/SPARK-27801
…inst case discussed during apache#24672 review)
Is this behavior affected by Hadoop / HDFS version? For example, is the listLocatedStatus override present in all of the Hadoop versions we support? It's unfortunate that the bug fixed by #24668 is adding so many edge cases here 😦
@JoshRosen I believe all supported versions of Hadoop use this version of listLocatedStatus. Based on the Spark documentation page, which says only 2.6.5+ is supported, I think we're good there.
Thank you @rrusso2007 @JoshRosen. I did a simple benchmark in our production environment (Hadoop version 2.7.1):
Retest this please.
Test build #105781 has finished for PR 24672 at commit
+1, LGTM. Merged to master.
I also tested this in the cluster and saw a lot of improvement.
Thank you for your first contribution, @rrusso2007. Welcome!
And thank you for the review, @JoshRosen, @srowen, @wangyum.
Please file a JIRA with the context if you want to report something, @melin.
```scala
// to retrieve the file status with the file block location. The reason to still fallback
// to listStatus is because the default implementation would potentially throw a
// FileNotFoundException which is better handled by doing the lookups manually below.
case _: DistributedFileSystem =>
```
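For context, here is a rough sketch of the kind of dispatch the quoted lines sit in. This is a paraphrase under assumptions, not the merged Spark code; the function name listWithLocations and the exact fallback shape are illustrative.

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, LocatedFileStatus, Path}
import org.apache.hadoop.hdfs.DistributedFileSystem

// Illustrative sketch only (not the merged code): HDFS gets the single-RPC
// listLocatedStatus path; every other FileSystem keeps the plain listStatus
// path, since the default listLocatedStatus can throw FileNotFoundException,
// which is better handled by doing the block-location lookups manually later.
def listWithLocations(fs: FileSystem, path: Path): Seq[FileStatus] = fs match {
  case _: DistributedFileSystem =>
    val iter = fs.listLocatedStatus(path) // one namenode call, locations included
    val out = scala.collection.mutable.ArrayBuffer.empty[LocatedFileStatus]
    while (iter.hasNext) out += iter.next()
    out.toSeq
  case _ =>
    fs.listStatus(path).toSeq // locations fetched separately downstream
}
```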
@melin Are you hitting a perf regression? Do we need an internal SQLConf?
cc @JoshRosen
Unless I'm misunderstanding something, I don't think that @melin's posted benchmark is an apples-to-apples comparison: listLocatedStatus is going to be slower than listStatus because it's doing more work in order to get location information. In order for this to be a fair performance comparison you'd want to compare listStatus + getBlockLocations to listLocatedStatus (because we'd end up having to call getBlockLocations further down if we didn't compute the locations as part of listLocatedStatus).
The direct comparison is only useful in cases where we're never going to make use of the location information, and to my knowledge we currently do not have a flag to bypass locality fetching in InMemoryFileIndex (so the location-free performance isn't relevant).
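To make the fairness point concrete, here is a minimal sketch of the two access patterns being compared. The path is a placeholder and the timing harness is omitted:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{BlockLocation, FileStatus, LocatedFileStatus, Path}

// Placeholder path; assumes an HDFS default filesystem in the Configuration.
val path = new Path("/tmp/many-files")
val fs = path.getFileSystem(new Configuration())

// Pattern A (what Spark did before): one listStatus call, then one
// getFileBlockLocations RPC per file.
def listThenLocate(): Seq[(FileStatus, Array[BlockLocation])] =
  fs.listStatus(path).toSeq.map { status =>
    (status, fs.getFileBlockLocations(status, 0, status.getLen))
  }

// Pattern B (this PR, on HDFS): one listLocatedStatus call; block locations
// arrive together with the listing.
def locatedListing(): Seq[LocatedFileStatus] = {
  val iter = fs.listLocatedStatus(path)
  val out = scala.collection.mutable.ArrayBuffer.empty[LocatedFileStatus]
  while (iter.hasNext) out += iter.next()
  out.toSeq
}
```

Timing listStatus alone against listLocatedStatus ignores that Pattern A still has to pay for the per-file location RPCs afterwards.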
I'm +1 on @JoshRosen's comment.
And, yep: of course, if there is a corner case, we had better have a conf for this. For now, it looks like we need more description of what he is hitting. A SPARK JIRA would be good for that.
Before coming up with this change I was actually wondering why there was no conf option to ignore locality and why it was required to do those extra lookups. In situations where the Spark job is running on a small set of servers out of a big HDFS cluster, locality is a bit useless since the hit rate is super low; the time to get block locations was getting out of control for us, and the locality wait per task was slowing down the jobs.
I think this change to get the blocks from the namenode is good enough for us now, but I can definitely see the benefit of having a conf option to disable locality entirely, which would turn off fetching of getFileBlockLocations and also always use listStatus instead of my optimization to use listLocatedStatus.
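A minimal sketch of what such an opt-out could look like. The ignoreLocality flag is hypothetical; no such conf existed at the time of this discussion:

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hypothetical sketch: a locality opt-out would skip both the
// listLocatedStatus optimization and the per-file getFileBlockLocations
// calls, returning plain statuses with no block locations at all.
def listLeafStatuses(fs: FileSystem, path: Path, ignoreLocality: Boolean): Seq[FileStatus] =
  if (ignoreLocality) {
    fs.listStatus(path).toSeq // one RPC, no location info fetched
  } else {
    val iter = fs.listLocatedStatus(path) // locations included in the listing
    val out = scala.collection.mutable.ArrayBuffer.empty[FileStatus]
    while (iter.hasNext) out += iter.next()
    out.toSeq
  }
```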
> having a conf option to disable locality entirely which would turn off fetching of getFileBlockLocations

Please file a JIRA for further discussion of the new conf if it's not filed yet. (I didn't search, but it might already exist in some other context.)
Closes apache#24672 from rrusso2007/master.
Authored-by: rrusso2007 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit ebd1431)
What changes were proposed in this pull request?
InMemoryFileIndex.listLeafFiles should use listLocatedStatus for DistributedFileSystem. DistributedFileSystem overrides the listLocatedStatus method in order to do it with a single namenode call, thus saving thousands of calls to getBlockLocations.
Currently in InMemoryFileIndex, all directory listings are done using FileSystem.listStatus followed by individual calls to FileSystem.getFileBlockLocations. This is painfully slow for folders that have large numbers of files, because the process happens serially and parallelism is only applied at the folder level, not the file level.
FileSystem also provides another API, listLocatedStatus, which returns LocatedFileStatus objects that already have the block locations. In the base FileSystem class this just delegates to listStatus and getFileBlockLocations, similarly to the way Spark does it. However, when HDFS specifically is the backing file system, DistributedFileSystem overrides this method and makes one single call to the namenode to retrieve the directory listing together with the block locations. This avoids potentially thousands of calls to the namenode, and is also more consistent: files will either exist with locations or not exist, instead of having the FileNotFoundException case.
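Roughly what the base-class delegation amounts to; this is a simplified paraphrase of the Hadoop behavior described above, not the actual Hadoop source:

```scala
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path}

// Simplified paraphrase: the base FileSystem.listLocatedStatus lists first,
// then does one block-location lookup per file. DistributedFileSystem
// replaces all of this with a single namenode call.
def defaultListLocatedStatus(fs: FileSystem, path: Path): Seq[LocatedFileStatus] =
  fs.listStatus(path).toSeq.map { status =>
    val locations = fs.getFileBlockLocations(status, 0, status.getLen)
    new LocatedFileStatus(status, locations)
  }
```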
For our example directory with 6500 files, the load time of spark.read.parquet was reduced 96x, from 76 seconds to 0.8 seconds. The savings only grow with the number of files in the directory.
Instead of always using this method, which in the default FileSystem implementation could lead to a FileNotFoundException that is tough to decipher, the pull request only uses it when the FileSystem is a DistributedFileSystem; otherwise the old logic still applies.
How was this patch tested?
Test suite ran.