-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-2288] Hide ShuffleBlockManager behind ShuffleManager #1241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16183/ |
|
Merged build triggered. |
|
Merged build started. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add @return explaining what the boolean return value means
|
Merged build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16190/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I see your concern, I thought about this too. actually in this PR, it mostly is used internally by HashShuffleManager's writter. While, we can take this as a way to give a chance to expose the internal storage objects for short cut usage. Such as current netty based shuffle sender. Without this interface, it's hard to implement without introduce maybe many more extra interface. to keep it simple. I offer the chance to expose the ShuffleBlockManager.
And then, this "location" conception might not be meaningful, but and BlockObjectId might be a good fit for all the possible shuffleManager, afterall, you are handling some objects whether it is a File , or a Stream, or whatever way you save your data to, So, a ShuffleBlockManager it self might still be needed to access this object in certain shortcut cases for simplifier API, and you can name the method getDataObjectHandle or whatever fits.. I do also have a PR for this idea, say generalize the object and pass around an ObjectID for different storage type at #1209
So does this make any sense to you ;) Still, I agree if we could find better way to solve the netty block sender problem, This could be hide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rxin, How about we also hide current BlockFetcherIterator kind of thing behind shuffleManager. since a specific shuffleManager not necessary using current fetcher approaching to get shuffle data. Each shuffleManager should instance his own shuffle logic, while some could reuse the same logic, say FileBased one could reuse current implementation. By this way, we can solve the above problem and have better chance to not expose shuffleBlockManager, say a read/write interface for shuffle reader/writter is enough.
|
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16199/ |
|
Merged build triggered. |
|
Merged build started. |
|
Merged build triggered. |
|
Merged build started. |
|
@rxin Moved getBlockLocation method from shuffleBlockManager to HashShuffleBlockMananger to make the interface more general. Does current interface looks reasonable for you? Also still a few shuffle related code could be moved further from block manager to some specific shuffle manager related classes' implementation ( e.g. blockManager.getMultiple). But since they are not tightly related to this shuffleBlockManager generalization works and I am not quite sure whether the other shufflemanager implementation will reuse them or not, so just leave it as it is, and could be done in future PR I guess. |
|
Merged build finished. |
1 similar comment
|
Merged build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16253/ |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16256/ |
|
Thanks - we are all super busy with Spark Summit this week so probably will get to this later in the week... feel free to send a reminder if I don't revisit this towards the end of the week. |
|
ping @rxin ;) |
|
Merged build triggered. |
|
Merged build started. |
|
Sorry will take a look tomorrow! |
|
Merged build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16396/ |
|
Thanks for doing this. To help with the review, can you write a short design doc discussing the interfaces between different components, similar to the one attached here https://issues.apache.org/jira/browse/SPARK-3019 ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can u add a todo here that getValues should bypass getBytes to use stream based APIs? Otherwise this uses a lot of memory during external sort merge.
|
Jenkins, test this please |
|
Jenkins, test this please. |
|
Jenkins, ok to test. |
|
QA tests have started for PR 1241 at commit
|
|
QA tests have finished for PR 1241 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the reduce id is always 0, why even bother defining it?
|
Merging this now. I will take care of some minor things myself. Thanks! |
By Hiding the shuffleblockmanager behind Shufflemanager, we decouple the shuffle data's block mapping management work from Diskblockmananger. This give a more clear interface and more easy for other shuffle manager to implement their own block management logic. the jira ticket have more details. Author: Raymond Liu <[email protected]> Closes apache#1241 from colorant/shuffle and squashes the following commits: 0e01ae3 [Raymond Liu] Move ShuffleBlockmanager behind shuffleManager
By Hiding the shuffleblockmanager behind Shufflemanager, we decouple the shuffle data's block mapping management work from Diskblockmananger. This give a more clear interface and more easy for other shuffle manager to implement their own block management logic. the jira ticket have more details.