[SPARK-18379][SQL] Make the parallelism of parallelPartitionDiscovery configurable. #15829

uncleGen · 2016-11-09T11:13:16Z

What changes were proposed in this pull request?

The maximum parallelism in PartitioningAwareFileIndex#listLeafFilesInParallel() is hard-coded to 10000. We may need to make this number configurable. In this PR, I also reduce the default to 100.
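To illustrate the cap described above, here is a minimal sketch that assumes the number of listing tasks is the smaller of the path count and the configured maximum. The object `ListingParallelism` and the method `numListingTasks` are hypothetical names used only for illustration; they are not part of Spark's API.

```scala
object ListingParallelism {
  /**
   * Hypothetical helper (not Spark's API): number of listing tasks used for
   * `numPaths` paths when parallelism is capped at `maxParallelism`.
   * Assumption: the cap is applied as min(number of paths, configured maximum).
   */
  def numListingTasks(numPaths: Int, maxParallelism: Int): Int = {
    require(numPaths >= 0, "numPaths must be non-negative")
    require(maxParallelism > 0, "maxParallelism must be positive")
    math.min(numPaths, maxParallelism)
  }
}
```

Under this assumption, 250000 paths yield 10000 tasks with a cap of 10000 and 100 tasks with a cap of 100, while 50 paths yield 50 tasks under either cap.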

How was this patch tested?

Existing unit tests.


SparkQA · 2016-11-09T13:20:04Z

Test build #68404 has finished for PR 15829 at commit 112f9d1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2016-11-09T13:27:35Z

sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+      .doc("The number of parallelism to list a collection of path recursively, Set the " +
+        "number to prevent file listing from generating too many tasks.")
+      .intConf
+      .createWithDefault(100)


Do you mind if I ask the reason to reduce this from 10000 to 100?

IMHO, 10000 is too large for file discovery when the cluster is not very large, and in most scenarios it does nothing to improve performance.

Should this be internal()? Although I tend to agree with your logic, I'd also love to see whether the person who made it 10000 has thoughts. I can't figure out where it came from via git blame, though, after several code refactorings.

cc @liancheng @yhuai

This should be an internal config.

In general, I think we should have more, smaller tasks rather than a smaller number of tasks that each need to handle lots of paths.

If you have a small cluster and a large number of paths, with this change we only have 100 tasks. Every task will take longer to finish, and other jobs in the cluster may have to wait longer to get scheduled.
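The trade-off above can be made concrete with a rough calculation. This is a sketch under two assumptions: paths are split evenly across tasks, and the task count is the smaller of the path count and the cap. `ListingLoad` and `maxPathsPerTask` are hypothetical names, not Spark code, and the real distribution of paths over tasks may differ.

```scala
object ListingLoad {
  /**
   * Hypothetical helper (not Spark's API): upper bound on the number of paths
   * one task handles, assuming an even split across
   * min(numPaths, maxParallelism) tasks.
   */
  def maxPathsPerTask(numPaths: Int, maxParallelism: Int): Int = {
    require(numPaths >= 0, "numPaths must be non-negative")
    require(maxParallelism > 0, "maxParallelism must be positive")
    val tasks = math.min(numPaths, maxParallelism)
    // Ceiling division; no tasks are needed when there are no paths.
    if (tasks == 0) 0 else (numPaths + tasks - 1) / tasks
  }
}
```

For 1000000 paths, a cap of 100 gives up to 10000 paths per task, whereas a cap of 10000 gives up to 100 paths per task, which is the "more, smaller tasks" shape argued for in the comment.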

btw, I think 100 is too small.

Yeah, this should be an internal config, and I will keep the default of 10000.


uncleGen · 2016-11-14T02:28:13Z

cc @yhuai @srowen

SparkQA · 2016-11-14T03:56:13Z

Test build #68591 has finished for PR 15829 at commit f6cf77f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

Looks OK with respect to the comments here

yhuai · 2016-11-15T18:31:08Z

lgtm. Merging to master.

srowen · 2016-12-30T10:50:12Z

@yhuai per request on the JIRA, are you OK with me backporting to 2.1? Seems OK, as it doesn't modify the default behavior.

srowen · 2017-01-02T15:26:19Z

Merged to 2.1 as well

… configurable.

## What changes were proposed in this pull request?

The largest parallelism in PartitioningAwareFileIndex #listLeafFilesInParallel() is 10000 in hard code. We may need to make this number configurable. And in PR, I reduce it to 100.

## How was this patch tested?

Existing ut.

Author: genmao.ygm <[email protected]>
Author: dylon <[email protected]>

Closes #15829 from uncleGen/SPARK-18379.

(cherry picked from commit 745ab8b)
Signed-off-by: Sean Owen <[email protected]>

yhuai · 2017-01-03T03:04:25Z

Sure. Thanks!


SPARK-18379: Make the parallelism of parallelPartitionDiscovery configurable. (commit 112f9d1)

HyukjinKwon reviewed Nov 9, 2016


update default value and view mode of parallelPartitionDiscoveryParallelism (commit f6cf77f)

srowen approved these changes Nov 14, 2016


asfgit closed this in 745ab8b Nov 15, 2016


HyukjinKwon Nov 9, 2016

uncleGen Nov 9, 2016

srowen Nov 9, 2016

rxin Nov 9, 2016

yhuai Nov 9, 2016

yhuai Nov 9, 2016

uncleGen Nov 10, 2016
