-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-18379][SQL] Make the parallelism of parallelPartitionDiscovery configurable. #15829
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #68404 has finished for PR 15829 at commit
|
| .doc("The number of parallelism to list a collection of path recursively, Set the " + | ||
| "number to prevent file listing from generating too many tasks.") | ||
| .intConf | ||
| .createWithDefault(100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind if I ask the reason to reduce this from 10000 to 100?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, 10000 is too large for the file discovery in consideration of scene where cluster is not very large, and it makes no contributions to improve performance in most scenes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be internal()? although I tend to agree with your logic, I'd also love to see if the person who made it 10000 has thoughts. I can't figure out where it came from via git blame though after several code refactorings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @liancheng @yhuai
This should be an internal config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I think we should have more smaller tasks than having a smaller number of tasks that need to handle lots of paths.
If you have a small cluster and we have a large number of paths, with this change, we only have 100 tasks. Every task will take longer time to finish and other jobs in the cluster may need to wait for a longer time to get scheduled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw, I think 100 is too small.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this should be an internal config, and i will reserve the default 10000 setting.
|
Test build #68591 has finished for PR 15829 at commit
|
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks OK with respect to the comments here
|
lgtm. Merging to master. |
|
@yhuai per request on the JIRA, are you OK with me backporting to 2.1? seems OK as it doesn't modify the default behavior. |
|
Merged to 2.1 as well |
… configurable. ## What changes were proposed in this pull request? The largest parallelism in PartitioningAwareFileIndex #listLeafFilesInParallel() is 10000 in hard code. We may need to make this number configurable. And in PR, I reduce it to 100. ## How was this patch tested? Existing ut. Author: genmao.ygm <[email protected]> Author: dylon <[email protected]> Closes #15829 from uncleGen/SPARK-18379. (cherry picked from commit 745ab8b) Signed-off-by: Sean Owen <[email protected]>
|
Sure. Thanks! |
… configurable. ## What changes were proposed in this pull request? The largest parallelism in PartitioningAwareFileIndex #listLeafFilesInParallel() is 10000 in hard code. We may need to make this number configurable. And in PR, I reduce it to 100. ## How was this patch tested? Existing ut. Author: genmao.ygm <[email protected]> Author: dylon <[email protected]> Closes apache#15829 from uncleGen/SPARK-18379.
What changes were proposed in this pull request?
The largest parallelism in PartitioningAwareFileIndex #listLeafFilesInParallel() is 10000 in hard code. We may need to make this number configurable. And in PR, I reduce it to 100.
How was this patch tested?
Existing ut.