Commit e0d41e8

[SPARK-37530][CORE] Spark reads many paths very slowly through newAPIHadoopFile
### What changes were proposed in this pull request?

Same as #18441, we parallelize `FileInputFormat.listStatus` for `newAPIHadoopFile`.

### Why are the changes needed?

![image](https://user-images.githubusercontent.com/8326978/144562490-d8005bf2-2052-4b50-9a5d-8b253ee598cc.png)

Spark can be slow when accessing external storage on the driver side; this improves performance by parallelizing the file listing.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Passing GA.

Closes #34792 from yaooqinn/SPARK-37530.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
1 parent ae9aeba commit e0d41e8

File tree: 1 file changed (+4, −0)


core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala

Lines changed: 4 additions & 0 deletions

```diff
@@ -123,6 +123,10 @@ class NewHadoopRDD[K, V](

   override def getPartitions: Array[Partition] = {
     val inputFormat = inputFormatClass.getConstructor().newInstance()
+    // setMinPartitions below will call FileInputFormat.listStatus(), which can be quite slow when
+    // traversing a large number of directories and files. Parallelize it.
+    _conf.setIfUnset(FileInputFormat.LIST_STATUS_NUM_THREADS,
+      Runtime.getRuntime.availableProcessors().toString)
     inputFormat match {
       case configurable: Configurable =>
         configurable.setConf(_conf)
```
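The key point of the change is `setIfUnset`: the commit only supplies a default thread count for `FileInputFormat.listStatus` (one thread per available core) and never overrides a value the user has already configured. The sketch below illustrates that semantics; it is a minimal stand-in, not the real code path — a plain mutable `Map` replaces the Hadoop `Configuration` so the snippet runs without Spark or Hadoop on the classpath, and `ListStatusThreadsSketch` is a hypothetical name.

```scala
// Minimal sketch of the "set if unset" logic added by this commit.
// A mutable Map stands in for the Hadoop Configuration object.
object ListStatusThreadsSketch {
  // The Hadoop property behind FileInputFormat.LIST_STATUS_NUM_THREADS.
  val ListStatusNumThreads = "mapreduce.input.fileinputformat.list-status.num-threads"

  // Mirror of Configuration.setIfUnset: only write the key if absent.
  def setIfUnset(conf: scala.collection.mutable.Map[String, String],
                 key: String, value: String): Unit =
    if (!conf.contains(key)) conf(key) = value

  def main(args: Array[String]): Unit = {
    // No user setting: the default (number of available cores) is applied.
    val conf = scala.collection.mutable.Map.empty[String, String]
    setIfUnset(conf, ListStatusNumThreads,
      Runtime.getRuntime.availableProcessors().toString)
    println(conf(ListStatusNumThreads))

    // A user-supplied value is never overridden.
    val userConf = scala.collection.mutable.Map(ListStatusNumThreads -> "8")
    setIfUnset(userConf, ListStatusNumThreads,
      Runtime.getRuntime.availableProcessors().toString)
    println(userConf(ListStatusNumThreads)) // prints "8"
  }
}
```

On Spark versions without this fix, the same effect can be had by setting the `mapreduce.input.fileinputformat.list-status.num-threads` property in the Hadoop configuration before calling `newAPIHadoopFile`.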
