
Commit bf221de

sunchao authored and HyukjinKwon committed
[SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc
### What changes were proposed in this pull request?

This adds some tuning guidance for increasing the parallelism of directory listing.

### Why are the changes needed?

Sometimes when the job input has a large number of directories, the listing can become a bottleneck. There are a few parameters to tune this. This adds some info to the Spark tuning guide so the knowledge is better shared.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #29498 from sunchao/SPARK-32674.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
1 parent: c75a827

2 files changed: 33 additions, 0 deletions


docs/sql-performance-tuning.md

Lines changed: 22 additions & 0 deletions
@@ -114,6 +114,28 @@ that these options will be deprecated in future release as more optimizations ar
     </td>
     <td>1.1.0</td>
   </tr>
+  <tr>
+    <td><code>spark.sql.sources.parallelPartitionDiscovery.threshold</code></td>
+    <td>32</td>
+    <td>
+      Configures the threshold to enable parallel listing for job input paths. If the number of
+      input paths is larger than this threshold, Spark will list the files using a distributed job.
+      Otherwise, it will fall back to sequential listing. This configuration is only effective when
+      using file-based data sources such as Parquet, ORC and JSON.
+    </td>
+    <td>1.5.0</td>
+  </tr>
+  <tr>
+    <td><code>spark.sql.sources.parallelPartitionDiscovery.parallelism</code></td>
+    <td>10000</td>
+    <td>
+      Configures the maximum listing parallelism for job input paths. In case the number of input
+      paths is larger than this value, it will be throttled down to use this value. Same as above,
+      this configuration is only effective when using file-based data sources such as Parquet, ORC
+      and JSON.
+    </td>
+    <td>2.1.1</td>
+  </tr>
 </table>

 ## Join Strategy Hints for SQL Queries
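For context (not part of the patch), here is a minimal sketch of how these two options might be set at the session level. The app name, the example values (16 and 100), and the input path are illustrative assumptions, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: lower the threshold so parallel listing kicks in for smaller inputs,
// and cap the listing parallelism. Values here are illustrative, not recommendations.
val spark = SparkSession.builder()
  .appName("parallel-listing-example")
  .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "16")
  .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "100")
  .getOrCreate()

// The threshold can also be changed on an existing session:
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "16")

// Any file-based source (Parquet, ORC, JSON) picks these settings up during file listing.
val df = spark.read.parquet("s3a://bucket/path/with/many/partitions") // hypothetical path
```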

docs/tuning.md

Lines changed: 11 additions & 0 deletions
@@ -264,6 +264,17 @@ parent RDD's number of partitions. You can pass the level of parallelism as a se
 or set the config property `spark.default.parallelism` to change the default.
 In general, we recommend 2-3 tasks per CPU core in your cluster.

+## Parallel Listing on Input Paths
+
+Sometimes you may also need to increase directory listing parallelism when the job input has a large number of directories,
+otherwise the process could take a very long time, especially when run against an object store like S3.
+If your job works on an RDD with Hadoop input formats (e.g., via `SparkContext.sequenceFile`), the parallelism is
+controlled via [`spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads`](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) (the current default is 1).
+
+For Spark SQL with file-based data sources, you can tune `spark.sql.sources.parallelPartitionDiscovery.threshold` and
+`spark.sql.sources.parallelPartitionDiscovery.parallelism` to improve listing parallelism. Please
+refer to the [Spark SQL performance tuning guide](sql-performance-tuning.html) for more details.
+
 ## Memory Usage of Reduce Tasks

 Sometimes, you will get an OutOfMemoryError not because your RDDs don't fit in memory, but because the
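As an illustration (not from the patch), a sketch of how the Hadoop listing-threads setting could be applied for an RDD job, either via a `spark.hadoop.*` property on the SparkConf or directly on the Hadoop configuration. The app name, the thread count of 20, and the input path are made up for the example:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: raise the FileInputFormat listing threads from the default of 1.
// The value 20 and the input path below are illustrative assumptions.
val conf = new SparkConf()
  .setAppName("parallel-listing-rdd-example")
  // Properties prefixed with "spark.hadoop." are copied into the Hadoop Configuration
  // used when computing input splits.
  .set("spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads", "20")

val sc = new SparkContext(conf)

// Alternatively, set it on an existing context's Hadoop configuration:
sc.hadoopConfiguration.setInt("mapreduce.input.fileinputformat.list-status.num-threads", 20)

// Listing the many input directories below can now use multiple threads.
val rdd = sc.sequenceFile[String, String]("s3a://bucket/logs/*/*/*") // hypothetical layout
```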
