Commit e0d41e8

[SPARK-37530][CORE] Spark reads many paths very slowly through newAPIHadoopFile
### What changes were proposed in this pull request?

Same as #18441, we parallelize `FileInputFormat.listStatus` for `newAPIHadoopFile`.

### Why are the changes needed?

![image](https://user-images.githubusercontent.com/8326978/144562490-d8005bf2-2052-4b50-9a5d-8b253ee598cc.png)

Spark can be slow when accessing external storage on the driver side; this improves performance by parallelizing the file listing.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Passing GA.

Closes #34792 from yaooqinn/SPARK-37530.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
1 parent ae9aeba commit e0d41e8

File tree: 1 file changed (+4, −0)


core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala

Lines changed: 4 additions & 0 deletions

```diff
@@ -123,6 +123,10 @@ class NewHadoopRDD[K, V](

   override def getPartitions: Array[Partition] = {
     val inputFormat = inputFormatClass.getConstructor().newInstance()
+    // setMinPartitions below will call FileInputFormat.listStatus(), which can be quite slow when
+    // traversing a large number of directories and files. Parallelize it.
+    _conf.setIfUnset(FileInputFormat.LIST_STATUS_NUM_THREADS,
+      Runtime.getRuntime.availableProcessors().toString)
     inputFormat match {
       case configurable: Configurable =>
         configurable.setConf(_conf)
```
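The key point of the change is `setIfUnset`: the commit only supplies a default thread count for `FileInputFormat.listStatus` (one thread per available core) and never overrides a value the user has already configured. The sketch below illustrates that semantics; it is a minimal stand-in, not the real code path — a plain mutable `Map` replaces the Hadoop `Configuration` so the snippet runs without Spark or Hadoop on the classpath, and `ListStatusThreadsSketch` is a hypothetical name.

```scala
// Minimal sketch of the "set if unset" logic added by this commit.
// A mutable Map stands in for the Hadoop Configuration object.
object ListStatusThreadsSketch {
  // The Hadoop property behind FileInputFormat.LIST_STATUS_NUM_THREADS.
  val ListStatusNumThreads = "mapreduce.input.fileinputformat.list-status.num-threads"

  // Mirror of Configuration.setIfUnset: only write the key if absent.
  def setIfUnset(conf: scala.collection.mutable.Map[String, String],
                 key: String, value: String): Unit =
    if (!conf.contains(key)) conf(key) = value

  def main(args: Array[String]): Unit = {
    // No user setting: the default (number of available cores) is applied.
    val conf = scala.collection.mutable.Map.empty[String, String]
    setIfUnset(conf, ListStatusNumThreads,
      Runtime.getRuntime.availableProcessors().toString)
    println(conf(ListStatusNumThreads))

    // A user-supplied value is never overridden.
    val userConf = scala.collection.mutable.Map(ListStatusNumThreads -> "8")
    setIfUnset(userConf, ListStatusNumThreads,
      Runtime.getRuntime.availableProcessors().toString)
    println(userConf(ListStatusNumThreads)) // prints "8"
  }
}
```

On Spark versions without this fix, the same effect can be had by setting the `mapreduce.input.fileinputformat.list-status.num-threads` property in the Hadoop configuration before calling `newAPIHadoopFile`.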
