[SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc #29498
Changes from all commits
@@ -264,6 +264,17 @@ parent RDD's number of partitions. You can pass the level of parallelism as a se
or set the config property `spark.default.parallelism` to change the default.
In general, we recommend 2-3 tasks per CPU core in your cluster.
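A minimal sketch of both ways to set the level of parallelism mentioned above; the application name, input path, and cluster size are illustrative assumptions, not part of this patch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical 16-core cluster: 2-3 tasks per core suggests roughly 32-48 partitions.
val conf = new SparkConf()
  .setAppName("ParallelismExample")
  .set("spark.default.parallelism", "48") // default used by shuffles that omit numPartitions
val sc = new SparkContext(conf)

val counts = sc.textFile("hdfs:///data/events") // illustrative input path
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 48) // or pass the level of parallelism as an explicit second argument
```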
## Parallel Listing on Input Paths
Sometimes you may also need to increase directory listing parallelism when your job input has a large number of
directories; otherwise the process can take a very long time, especially when running against an object store like S3.
**Member:** What about remote HDFS? This looks like a general issue for remote storage access, especially in disaggregated clusters where remote storage (HDFS/S3) is used. Can we generalize more, like the following?
**Member (Author):** It depends on how "remote" the storage is. For HDFS, depending on the use case, the compute and storage can still be deployed within the same region or even zone, so the network/metadata cost is much cheaper than that of S3. Therefore, I think we can stick with the S3 case as it is more characteristic. Let me know if you think otherwise.
If your job works on RDDs with Hadoop input formats (e.g., via `SparkContext.sequenceFile`), the parallelism is
controlled via [`spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads`](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) (the current default is 1).
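A minimal sketch of raising the listing thread count for an RDD job, assuming it is configured programmatically; the thread count, input path, and key/value types are illustrative (the same setting can also be passed via `--conf` on `spark-submit`):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The "spark.hadoop." prefix forwards the setting into the Hadoop Configuration,
// so FileInputFormat lists input directories with 20 threads instead of 1.
val conf = new SparkConf()
  .setAppName("ParallelListingRDD")
  .set("spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads", "20")
val sc = new SparkContext(conf)

// Hypothetical SequenceFile dataset spread across many date-partitioned directories on S3.
val rdd = sc.sequenceFile[String, String]("s3a://bucket/logs/*/*")
```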
For Spark SQL with file-based data sources, you can tune `spark.sql.sources.parallelPartitionDiscovery.threshold` and
`spark.sql.sources.parallelPartitionDiscovery.parallelism` to improve listing parallelism. Please
refer to the [Spark SQL performance tuning guide](sql-performance-tuning.html) for more details.
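A minimal sketch of tuning these two options for a Spark SQL job; the config values, application name, and table path are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ParallelListingSQL")
  // Above this many input paths, listing runs as a distributed Spark job
  // rather than sequentially on the driver (illustrative value).
  .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
  // Upper bound on the number of tasks used by that listing job.
  .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "100")
  .getOrCreate()

val df = spark.read.parquet("s3a://bucket/warehouse/events") // illustrative path
```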
## Memory Usage of Reduce Tasks

Sometimes, you will get an OutOfMemoryError not because your RDDs don't fit in memory, but because the