[SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc
### What changes were proposed in this pull request?
This adds a tuning guide section on increasing the parallelism of directory listing.
### Why are the changes needed?
Sometimes, when a job's input has a large number of directories, listing them can become a bottleneck. There are a few parameters for tuning this. This adds some information to the Spark tuning guide so that the knowledge is better shared.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes #29498 from sunchao/SPARK-32674.
Authored-by: Chao Sun <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
docs/tuning.md
+11 lines changed: 11 additions & 0 deletions
@@ -264,6 +264,17 @@ parent RDD's number of partitions. You can pass the level of parallelism as a se
 or set the config property `spark.default.parallelism` to change the default.
 In general, we recommend 2-3 tasks per CPU core in your cluster.
 
+## Parallel Listing on Input Paths
+
+Sometimes you may also need to increase directory listing parallelism when a job's input has a large number of directories,
+otherwise the process could take a very long time, especially when run against an object store like S3.
+If your job works on an RDD with Hadoop input formats (e.g., via `SparkContext.sequenceFile`), the parallelism is
+controlled via [`spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads`](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) (the current default is 1).
+
+For Spark SQL with file-based data sources, you can tune `spark.sql.sources.parallelPartitionDiscovery.threshold` and
+`spark.sql.sources.parallelPartitionDiscovery.parallelism` to improve listing parallelism. Please
+refer to the [Spark SQL performance tuning guide](sql-performance-tuning.html) for more details.
+
 ## Memory Usage of Reduce Tasks
 
 Sometimes, you will get an OutOfMemoryError not because your RDDs don't fit in memory, but because the
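
For illustration only, a minimal Scala sketch of how these settings might be applied together when building a session; the thread count, threshold, parallelism values, and S3 paths below are placeholders chosen for the example, not recommendations from this PR:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch; all values and paths are illustrative.
val spark = SparkSession.builder()
  .appName("ParallelListingExample")
  // For RDDs read through Hadoop input formats: number of threads used to
  // list the input paths (Hadoop's default is 1).
  .config("spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads", "8")
  // For Spark SQL file-based sources: switch to a distributed listing job
  // once the number of input paths exceeds this threshold...
  .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
  // ...and use this much parallelism for that listing job.
  .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "100")
  .getOrCreate()

// RDD path: listing of the (hypothetical) input directory uses the thread count above.
val events = spark.sparkContext.sequenceFile[String, String]("s3a://my-bucket/events/")

// SQL path: partition discovery for the (hypothetical) dataset uses the two SQL settings above.
val logs = spark.read.parquet("s3a://my-bucket/logs/")
```

The same properties can also be passed on the command line with `--conf` on `spark-submit` instead of being set in code.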