Conversation

@sunchao
Member

@sunchao sunchao commented Aug 20, 2020

What changes were proposed in this pull request?

This adds some tuning guidance for increasing the parallelism of directory listing.

Why are the changes needed?

Sometimes, when the job input has a large number of directories, the listing can become a bottleneck. There are a few parameters to tune for this. This adds some information to the Spark tuning guide so the knowledge is better shared.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

N/A

@viirya
Member

viirya commented Aug 20, 2020

Is the JIRA number incorrect?

@sunchao sunchao changed the title [SPARK-32646][DOC] Add suggestion for parallel directory listing in t… [SPARK-32674][DOC] Add suggestion for parallel directory listing in t… Aug 20, 2020
@sunchao
Member Author

sunchao commented Aug 20, 2020

Is the JIRA number incorrect?

Oops. I don't know where I got that number from ... fixed, thanks!

BTW, how do I link this PR with the JIRA? Will it happen automatically?

@sunchao
Member Author

sunchao commented Aug 20, 2020

nvm it is linked :)

@viirya
Member

viirya commented Aug 20, 2020

Yeah, it will be linked automatically once you put it into the PR title. That's why I noticed this, because I'm working on the original JIRA. :)

docs/tuning.md Outdated
Sometimes you may also need to increase directory listing parallelism when job input has large number of directories,
otherwise the process could take a very long time, especially when against object store like S3.
If your job works on RDD with Hadoop input formats (e.g., via `SparkContext#sequenceFile`), the parallelism is
controlled via `spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads` (default is 1). For other
Member
This seems to have a limitation: multiple threads cannot be used with a non-thread-safe path filter?

https://hadoop.apache.org/docs/r2.7.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

The number of threads to use to list and fetch block locations for the specified input paths. Note: multiple threads should not be used if a custom non thread-safe path filter is used.

Should we also mention it together?

Member Author
I think this is pretty rare, and most users probably won't be exposed to it (they mostly interact with RDDs and file formats, I think). Plus, this is a Hadoop configuration (recognizable from the name), so they can also find the Hadoop doc online.

Member
Maybe we should add a hyperlink on `spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads` to the Hadoop configuration, https://hadoop.apache.org/docs/r3.2.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml? Then we can guide readers safely in case they haven't taken a look at it. It should be the Hadoop 3.2.0 link on the master branch.

Member
SGTM

Member Author
Sounds good. I'll add a link to the master branch.
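For context, the Hadoop-side knob discussed in this thread is passed through Spark's `spark.hadoop.*` prefix on the command line. A minimal sketch, where the thread count and application file name are illustrative assumptions, not recommendations from this PR:

```shell
# Raise the Hadoop file-listing thread count for RDD jobs that use
# Hadoop input formats. The value 20 is illustrative only.
# Note (from mapred-default.xml, quoted above): multiple threads should
# not be used together with a custom non-thread-safe path filter.
spark-submit \
  --conf spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads=20 \
  my_app.py
```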

@dbtsai
Member

dbtsai commented Aug 21, 2020

Jenkins, add to whitelist.

@viirya viirya changed the title [SPARK-32674][DOC] Add suggestion for parallel directory listing in t… [SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc Aug 21, 2020
docs/tuning.md Outdated
otherwise the process could take a very long time, especially when against object store like S3.
If your job works on RDD with Hadoop input formats (e.g., via `SparkContext#sequenceFile`), the parallelism is
controlled via `spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads` (default is 1). For other
cases such as Spark SQL, you can tune `spark.sql.sources.parallelPartitionDiscovery.threshold` to improve the listing
Member
I think the last statement should be described on the SQL side: https://github.com/apache/spark/blob/master/docs/sql-performance-tuning.md

Member Author
Will do, thanks.

Member
+1 for @maropu 's suggestion (adding there too while keeping here).

Member Author
Thanks. How about we only mention the parameters for the Spark SQL side here and direct users to the SQL guide for more detailed guidance?
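For context, the SQL-side parameter discussed in this thread can be set the same way. A sketch with purely illustrative values; the `parallelPartitionDiscovery.parallelism` setting is an additional, related knob not mentioned in the quoted diff, and the comment reflects my understanding of the threshold's semantics rather than wording from this PR:

```shell
# Tune SQL-side file listing: once the number of detected paths exceeds
# the threshold, Spark lists files with a distributed job instead of on
# the driver. Both values below are illustrative only.
spark-submit \
  --conf spark.sql.sources.parallelPartitionDiscovery.threshold=32 \
  --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=100 \
  my_sql_app.py
```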

@SparkQA

SparkQA commented Aug 21, 2020

Test build #127711 has finished for PR 29498 at commit b4efdb7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Aug 21, 2020

The PR title was broken and affected the PR description. As it is minor, I modified them together.

@sunchao
Member Author

sunchao commented Aug 21, 2020

The PR title was broken and affected the PR description. As it is minor, I modified them together.

Ah thanks @viirya .

@dongjoon-hyun
Member

Thank you so much, @sunchao !

In general, we recommend 2-3 tasks per CPU core in your cluster.

Sometimes you may also need to increase directory listing parallelism when job input has large number of directories,
otherwise the process could take a very long time, especially when against object store like S3.
Member
What about remote HDFS? This looks like a general issue for remote storage access. Especially in disaggregated clusters where remote storage (HDFS/S3) is used, can we generalize it more, like the following?

- especially when against object store like S3
- especially when against remote HDFS or S3 or in the disaggregated clusters

Member Author

@sunchao sunchao Aug 21, 2020
It depends on how "remote" the storage is. For HDFS, depending on the use case, the compute and storage can still be deployed within the same region or even the same zone, so the network/metadata cost is much cheaper than with S3.

Therefore, I think we can stick with the S3 case, as it is more characteristic. Let me know if you think otherwise.

Member

@dongjoon-hyun dongjoon-hyun left a comment
Thank you, @sunchao . This addition looks very helpful to me. I believe we need to have this in master/3.0/2.4.

+1, LGTM (except the above existing comments and mine).

@SparkQA

SparkQA commented Aug 21, 2020

Test build #127727 has finished for PR 29498 at commit 2dfe36f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 21, 2020

Test build #127729 has finished for PR 29498 at commit c462271.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master, branch-3.0 and branch-2.4.

HyukjinKwon pushed a commit that referenced this pull request Aug 21, 2020
…uning doc

### What changes were proposed in this pull request?

This adds some tuning guide for increasing parallelism of directory listing.

### Why are the changes needed?

Sometimes when job input has large number of directories, the listing can become a bottleneck. There are a few parameters to tune this. This adds some info to Spark tuning guide to make the knowledge better shared.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #29498 from sunchao/SPARK-32674.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit bf221de)
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Aug 21, 2020
@maropu
Member

maropu commented Aug 21, 2020

late LGTM. Thanks, @sunchao and all the reviewers!

@sunchao
Member Author

sunchao commented Aug 21, 2020

Thanks everyone for the reviews!!

@dongjoon-hyun
Member

Thank you, @sunchao and all.
BTW, welcome, @sunchao !
