uncleGen (Contributor) commented Nov 9, 2016

What changes were proposed in this pull request?

The maximum parallelism in PartitioningAwareFileIndex#listLeafFilesInParallel() is hard-coded to 10000. We may need to make this number configurable. In this PR, I also reduce the default to 100.
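
To make the proposal concrete, here is a minimal sketch of how such a cap usually behaves, assuming the number of listing tasks is the smaller of the path count and the limit. The names `ListingParallelism` and `numTasks` are hypothetical and are not taken from the Spark source; only the values 10000 and 100 come from this PR.

```scala
// Hypothetical sketch: names are illustrative, not Spark's actual code.
object ListingParallelism {
  // The hard-coded cap that this PR proposes to make configurable.
  val HardCodedCap = 10000

  // Number of tasks used to list `numPaths` paths under a given cap.
  def numTasks(numPaths: Int, maxParallelism: Int = HardCodedCap): Int =
    math.min(numPaths, maxParallelism)
}

object ListingParallelismDemo {
  def main(args: Array[String]): Unit = {
    println(ListingParallelism.numTasks(50))          // fewer paths than the cap
    println(ListingParallelism.numTasks(250000))      // limited by the 10000 cap
    println(ListingParallelism.numTasks(250000, 100)) // limited by the proposed cap of 100
  }
}
```

Under this assumption the cap only matters once the number of paths exceeds it; smaller inputs get one task per path either way.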

How was this patch tested?

Existing unit tests.

SparkQA commented Nov 9, 2016

Test build #68404 has finished for PR 15829 at commit 112f9d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.doc("The number of parallelism to list a collection of path recursively, Set the " +
"number to prevent file listing from generating too many tasks.")
.intConf
.createWithDefault(100)
Member

Do you mind if I ask the reason to reduce this from 10000 to 100?

Contributor Author

IMHO, 10000 is too large for file discovery when the cluster is not very large, and in most scenarios it does nothing to improve performance.

Member

Should this be internal()? Although I tend to agree with your logic, I'd also love to hear whether the person who set it to 10000 has thoughts. I can't figure out where it came from via git blame, though, after several code refactorings.

Contributor

cc @liancheng @yhuai

This should be an internal config.

Contributor

In general, I think it is better to have many small tasks than a small number of tasks that each need to handle lots of paths.

If you have a small cluster and a large number of paths, this change leaves us with only 100 tasks. Every task will take longer to finish, and other jobs in the cluster may have to wait longer to get scheduled.
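
A rough back-of-the-envelope version of this argument, assuming paths are spread evenly across tasks. The helper `pathsPerTask` and the workload of 1,000,000 paths are hypothetical and chosen only for illustration.

```scala
// Hypothetical sketch of the task-granularity arithmetic; not Spark code.
object TaskGranularity {
  // Paths each task must list when `numPaths` are split evenly
  // across min(numPaths, cap) tasks, rounded up.
  def pathsPerTask(numPaths: Int, cap: Int): Int = {
    require(numPaths > 0 && cap > 0, "numPaths and cap must be positive")
    val tasks = math.min(numPaths, cap)
    (numPaths + tasks - 1) / tasks
  }
}

object TaskGranularityDemo {
  def main(args: Array[String]): Unit = {
    val numPaths = 1000000 // hypothetical workload
    println(s"cap=100:   ${TaskGranularity.pathsPerTask(numPaths, 100)} paths per task")
    println(s"cap=10000: ${TaskGranularity.pathsPerTask(numPaths, 10000)} paths per task")
  }
}
```

With a cap of 100, each task lists 10,000 paths; with a cap of 10000, each task lists 100. Shorter tasks release their slots sooner, which is the scheduling concern raised above.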

Contributor

btw, I think 100 is too small.

Contributor Author

Yeah, this should be an internal config, and I will keep the default setting of 10000.
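
The outcome agreed on here, a configurable value whose default stays at 10000, can be sketched as a lookup with a fallback. This is a simplified stand-in and not Spark's config API; the `IntConf` class and the key name `example.listing.parallelism` are hypothetical.

```scala
// Hypothetical stand-in for a typed config entry; not Spark's SQLConf API.
final case class IntConf(key: String, default: Int) {
  // Read the value from user settings, falling back to the default
  // when the key is absent. A non-numeric value throws NumberFormatException.
  def readFrom(settings: Map[String, String]): Int =
    settings.get(key).map(_.trim.toInt).getOrElse(default)
}

object IntConfDemo {
  // The key name is made up for illustration.
  val ListingParallelism: IntConf = IntConf("example.listing.parallelism", 10000)

  def main(args: Array[String]): Unit = {
    // No user setting: the default is used.
    println(ListingParallelism.readFrom(Map.empty))
    // Explicit user setting: the override is used.
    println(ListingParallelism.readFrom(Map("example.listing.parallelism" -> "100")))
  }
}
```

Because the default is unchanged in this sketch, anyone who never sets the key sees the same value as before the setting became configurable.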

uncleGen (Contributor Author) commented

cc @yhuai @srowen

SparkQA commented Nov 14, 2016

Test build #68591 has finished for PR 15829 at commit f6cf77f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen (Member) left a comment

Looks OK with respect to the comments here

yhuai (Contributor) commented Nov 15, 2016

lgtm. Merging to master.

asfgit closed this in 745ab8b on Nov 15, 2016

srowen (Member) commented Dec 30, 2016

@yhuai, per the request on the JIRA, are you OK with me backporting this to 2.1? It seems OK, as it doesn't modify the default behavior.

srowen (Member) commented Jan 2, 2017

Merged to 2.1 as well

asfgit pushed a commit that referenced this pull request Jan 2, 2017
… configurable.

Author: genmao.ygm <[email protected]>
Author: dylon <[email protected]>

Closes #15829 from uncleGen/SPARK-18379.

(cherry picked from commit 745ab8b)
Signed-off-by: Sean Owen <[email protected]>

yhuai (Contributor) commented Jan 3, 2017

Sure. Thanks!

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
… configurable.
