[SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc #29498
Conversation
Is the JIRA number incorrect?

OOPS. I don't know where I got that number from ... fixed, thanks! BTW, how do I link this PR with the JIRA? Will it happen automatically?

nvm, it is linked :)

Yeah, it will be linked automatically once you put it into the PR title. That's why I noticed this, because I'm working on the original JIRA. :)
docs/tuning.md
Outdated
> Sometimes you may also need to increase directory listing parallelism when job input has large number of directories,
> otherwise the process could take a very long time, especially when against object store like S3.
> If your job works on RDD with Hadoop input formats (e.g., via `SparkContext#sequenceFile`), the parallelism is
> controlled via `spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads` (default is 1). For other
This seems to have a limitation that multiple threads cannot be used with a non-thread-safe path filter?

> The number of threads to use to list and fetch block locations for the specified input paths. Note: multiple threads should not be used if a custom non thread-safe path filter is used.

Should we also mention it here?
I think this is pretty rare, and also that most users probably won't be exposed to this (they mostly interact with RDDs and file formats, I think). Plus this is a Hadoop configuration (recognizable from the name), so they can also find the other doc online.
Maybe we could add a hyperlink on `spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads` to the Hadoop configuration, https://hadoop.apache.org/docs/r3.2.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml? Then we can guide them safely even if they haven't looked at that doc. It should be the Hadoop 3.2.0 link on the master branch.
SGTM
Sounds good. I'll add a link to the master branch.
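For reference, the hyperlink discussed above could look like this in `docs/tuning.md` (a sketch using the Hadoop 3.2.0 URL suggested earlier; the exact wording that was merged may differ):

```markdown
the parallelism is controlled via
[`spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads`](https://hadoop.apache.org/docs/r3.2.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml)
(default is 1).
```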
Jenkins, add to whitelist.
docs/tuning.md
Outdated
> otherwise the process could take a very long time, especially when against object store like S3.
> If your job works on RDD with Hadoop input formats (e.g., via `SparkContext#sequenceFile`), the parallelism is
> controlled via `spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads` (default is 1). For other
> cases such as Spark SQL, you can tune `spark.sql.sources.parallelPartitionDiscovery.threshold` to improve the listing
I think the last statement should be described on the SQL side: https://github.com/apache/spark/blob/master/docs/sql-performance-tuning.md
Will do, thanks!
+1 for @maropu's suggestion (adding it there too while keeping it here).
Thanks. How about we only briefly mention the parameters for the Spark SQL side here, and direct users to the SQL guide for more detailed guidance?
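Putting the two knobs from this thread together, a configuration sketch might look like the following. The values shown are illustrative, not recommendations; per the discussion above, the RDD-side default is 1, and the SQL threshold's semantics and default depend on your Spark version:

```
# spark-defaults.conf (illustrative values only)

# Hadoop-side listing threads for RDD input formats (default: 1).
# Note: avoid >1 if a custom non-thread-safe path filter is used.
spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads  20

# Spark SQL side: threshold on the number of input paths beyond which
# listing is parallelized (check your Spark version's docs for details).
spark.sql.sources.parallelPartitionDiscovery.threshold                32
```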
Test build #127711 has finished for PR 29498 at commit
The PR title was broken and affected the PR description. As it is minor, I modified them together.

Ah, thanks @viirya.

Thank you so much, @sunchao!
> In general, we recommend 2-3 tasks per CPU core in your cluster.

> Sometimes you may also need to increase directory listing parallelism when job input has large number of directories,
> otherwise the process could take a very long time, especially when against object store like S3.
What about remote HDFS? This looks like a general issue for remote storage access. Especially in disaggregated clusters where remote storage (HDFS/S3) is used, can we generalize it more, like the following?
- especially when against object store like S3
- especially when against remote HDFS or S3 or in the disaggregated clusters
It depends on how "remote" the storage is. For HDFS, depending on the use case, the compute and storage can still be deployed within the same region or even zone, and therefore the network/metadata cost is much cheaper than with S3.
Therefore, I think we can stick with the S3 case as it is more characteristic. Let me know if you think otherwise.
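To see why listing parallelism matters against high-latency stores like S3, here is a minimal, Spark-independent Python sketch: each directory listing stands in for one `listStatus` round trip, and the thread pool plays the role of `list-status.num-threads`. This is only an analogy to illustrate the idea, not Spark's actual implementation.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def list_children(path):
    # Stand-in for a single listStatus call; against S3 each such call
    # is a network round trip, which is where the latency adds up.
    return [os.path.join(path, name) for name in os.listdir(path)]

def list_all(dirs, num_threads=1):
    # num_threads=1 mimics the sequential default; raising it mirrors
    # what increasing list-status.num-threads does on the Hadoop side.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(list_children, dirs))
    return [p for children in results for p in children]

# Demo on local temp directories (real savings only appear when each
# listing call has network latency, as with an object store).
root = tempfile.mkdtemp()
dirs = []
for i in range(8):
    d = os.path.join(root, f"part={i}")
    os.mkdir(d)
    open(os.path.join(d, "data.txt"), "w").close()
    dirs.append(d)

files = list_all(dirs, num_threads=4)
print(len(files))  # prints 8
```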
dongjoon-hyun left a comment:
Thank you, @sunchao . This addition looks very helpful to me. I believe we need to have this in master/3.0/2.4.
+1, LGTM (except the above existing comments and mine).
Test build #127727 has finished for PR 29498 at commit

Test build #127729 has finished for PR 29498 at commit

Merged to master, branch-3.0 and branch-2.4.
…uning doc

### What changes were proposed in this pull request?
This adds some tuning guide for increasing parallelism of directory listing.

### Why are the changes needed?
Sometimes when job input has large number of directories, the listing can become a bottleneck. There are a few parameters to tune this. This adds some info to Spark tuning guide to make the knowledge better shared.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
N/A

Closes #29498 from sunchao/SPARK-32674.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit bf221de)
Signed-off-by: HyukjinKwon <[email protected]>
Late LGTM. Thanks, @sunchao and all the reviewers!

Thanks everyone for the reviews!!
What changes were proposed in this pull request?
This adds some tuning guidance for increasing the parallelism of directory listing.
Why are the changes needed?
Sometimes, when a job's input has a large number of directories, the listing can become a bottleneck. There are a few parameters to tune this. This adds some info to the Spark tuning guide so the knowledge is better shared.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
N/A