Conversation

@sunchao
Member

@sunchao sunchao commented Aug 20, 2020

What changes were proposed in this pull request?

This adds some tuning guidance for increasing the parallelism of directory listing.

Why are the changes needed?

Sometimes, when the job input has a large number of directories, the listing can become a bottleneck. There are a few parameters to tune for this. This adds some information to the Spark tuning guide so the knowledge is better shared.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

N/A

@viirya
Member

viirya commented Aug 20, 2020

Is the JIRA number incorrect?

@sunchao sunchao changed the title [SPARK-32646][DOC] Add suggestion for parallel directory listing in t… [SPARK-32674][DOC] Add suggestion for parallel directory listing in t… Aug 20, 2020
@sunchao
Member Author

sunchao commented Aug 20, 2020

Is the JIRA number incorrect?

Oops. I don't know where I got that number from ... fixed, thanks!

BTW, how do I link this PR with the JIRA? Will it happen automatically?

@sunchao
Member Author

sunchao commented Aug 20, 2020

nvm it is linked :)

@viirya
Member

viirya commented Aug 20, 2020

Yeah, it will be linked automatically once you put it into the PR title. That's why I noticed this, because I'm working on the original JIRA. :)

docs/tuning.md Outdated
Sometimes you may also need to increase directory listing parallelism when job input has large number of directories,
otherwise the process could take a very long time, especially when against object store like S3.
If your job works on RDD with Hadoop input formats (e.g., via `SparkContext#sequenceFile`), the parallelism is
controlled via `spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads` (default is 1). For other
Member
This seems to have a limitation: multiple threads cannot be used with a non-thread-safe path filter?

https://hadoop.apache.org/docs/r2.7.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

The number of threads to use to list and fetch block locations for the specified input paths. Note: multiple threads should not be used if a custom non thread-safe path filter is used.

Should we also mention it together?

Member Author
I think this is pretty rare, and most users probably won't be exposed to it (they mostly interact with RDDs and file formats, I think). Plus, this is a Hadoop configuration (recognizable from the name), so they can also find the Hadoop doc online.

Member
Maybe we should add a hyperlink on `spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads` to the Hadoop configuration, https://hadoop.apache.org/docs/r3.2.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml? Then we can guide readers safely in case they haven't taken a look at it. It should be the Hadoop 3.2.0 link on the master branch.

Member
SGTM

Member Author
Sounds good. I'll add a link to the master branch.
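For context, the Hadoop-side knob discussed in this thread is passed through Spark's `spark.hadoop.*` prefix on the command line. A minimal sketch, where the thread count and application file name are illustrative assumptions, not recommendations from this PR:

```shell
# Raise the Hadoop file-listing thread count for RDD jobs that use
# Hadoop input formats. The value 20 is illustrative only.
# Note (from mapred-default.xml, quoted above): multiple threads should
# not be used together with a custom non-thread-safe path filter.
spark-submit \
  --conf spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads=20 \
  my_app.py
```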

@dbtsai
Member

dbtsai commented Aug 21, 2020

Jenkins, add to whitelist.

@viirya viirya changed the title [SPARK-32674][DOC] Add suggestion for parallel directory listing in t… [SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc Aug 21, 2020
docs/tuning.md Outdated
otherwise the process could take a very long time, especially when against object store like S3.
If your job works on RDD with Hadoop input formats (e.g., via `SparkContext#sequenceFile`), the parallelism is
controlled via `spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads` (default is 1). For other
cases such as Spark SQL, you can tune `spark.sql.sources.parallelPartitionDiscovery.threshold` to improve the listing
Member
I think the last statement should be described on the SQL side: https://github.com/apache/spark/blob/master/docs/sql-performance-tuning.md

Member Author
Will do, thanks.

Member
+1 for @maropu 's suggestion (adding there too while keeping here).

Member Author
Thanks. How about we only mention the parameters for the Spark SQL side here and direct users to the SQL guide for more detailed guidance?
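For context, the SQL-side parameter discussed in this thread can be set the same way. A sketch with purely illustrative values; the `parallelPartitionDiscovery.parallelism` setting is an additional, related knob not mentioned in the quoted diff, and the comment reflects my understanding of the threshold's semantics rather than wording from this PR:

```shell
# Tune SQL-side file listing: once the number of detected paths exceeds
# the threshold, Spark lists files with a distributed job instead of on
# the driver. Both values below are illustrative only.
spark-submit \
  --conf spark.sql.sources.parallelPartitionDiscovery.threshold=32 \
  --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=100 \
  my_sql_app.py
```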

@SparkQA

SparkQA commented Aug 21, 2020

Test build #127711 has finished for PR 29498 at commit b4efdb7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Aug 21, 2020

The PR title was broken and affected the PR description. As it is minor, I modified them together.

@sunchao
Member Author

sunchao commented Aug 21, 2020

The PR title was broken and affected the PR description. As it is minor, I modified them together.

Ah thanks @viirya .

@dongjoon-hyun
Member

Thank you so much, @sunchao !

In general, we recommend 2-3 tasks per CPU core in your cluster.

Sometimes you may also need to increase directory listing parallelism when job input has large number of directories,
otherwise the process could take a very long time, especially when against object store like S3.
Member
What about remote HDFS? This looks like a general issue for remote storage access. Especially in disaggregated clusters where remote storage (HDFS/S3) is used, can we generalize it more, like the following?

- especially when against object store like S3
- especially when against remote HDFS or S3 or in the disaggregated clusters

Member Author

@sunchao sunchao Aug 21, 2020
It depends on how "remote" the storage is. For HDFS, depending on the use case, the compute and storage can still be deployed within the same region or even the same zone, so the network/metadata cost is much cheaper than with S3.

Therefore, I think we can stick with the S3 case, as it is more characteristic. Let me know if you think otherwise.

Member

@dongjoon-hyun dongjoon-hyun left a comment
Thank you, @sunchao . This addition looks very helpful to me. I believe we need to have this in master/3.0/2.4.

+1, LGTM (except the above existing comments and mine).

@SparkQA

SparkQA commented Aug 21, 2020

Test build #127727 has finished for PR 29498 at commit 2dfe36f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 21, 2020

Test build #127729 has finished for PR 29498 at commit c462271.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master, branch-3.0 and branch-2.4.

HyukjinKwon pushed a commit that referenced this pull request Aug 21, 2020
…uning doc

### What changes were proposed in this pull request?

This adds some tuning guide for increasing parallelism of directory listing.

### Why are the changes needed?

Sometimes when job input has large number of directories, the listing can become a bottleneck. There are a few parameters to tune this. This adds some info to Spark tuning guide to make the knowledge better shared.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #29498 from sunchao/SPARK-32674.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit bf221de)
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Aug 21, 2020
@maropu
Member

maropu commented Aug 21, 2020

late LGTM. Thanks, @sunchao and all the reviewers!

@sunchao
Member Author

sunchao commented Aug 21, 2020

Thanks everyone for the reviews!!

@dongjoon-hyun
Member

Thank you, @sunchao and all.
BTW, welcome, @sunchao !
