@colorant
Contributor

By hiding the ShuffleBlockManager behind the ShuffleManager, we decouple the management of shuffle data's block mapping from DiskBlockManager. This gives a clearer interface and makes it easier for other shuffle managers to implement their own block management logic. The JIRA ticket has more details.
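
A minimal sketch of the decoupling described above, under loud assumptions: every trait, class, and method name below is hypothetical and chosen only for illustration, not taken from the patch. The idea it shows is that the shuffle manager owns the object that maps a shuffle block id to its stored data, so callers never ask a disk-level manager about shuffle layout.

```scala
// Hypothetical, simplified names; these are NOT the classes in the patch.
final case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int)

// Resolves a shuffle block to its bytes; each shuffle manager supplies its own.
trait ShuffleBlockResolver {
  def getBytes(id: ShuffleBlockId): Option[Array[Byte]]
}

// The only shuffle entry point the rest of the system needs to see.
trait ShuffleManagerSketch {
  def blockResolver: ShuffleBlockResolver
}

// Toy in-memory implementation standing in for a file-based one.
final class InMemoryShuffleManager extends ShuffleManagerSketch {
  private val store =
    scala.collection.mutable.Map.empty[ShuffleBlockId, Array[Byte]]

  // Records the data for one shuffle block.
  def put(id: ShuffleBlockId, data: Array[Byte]): Unit = store.update(id, data)

  // The block mapping lives behind the manager, not in a disk-level component.
  val blockResolver: ShuffleBlockResolver = new ShuffleBlockResolver {
    def getBytes(id: ShuffleBlockId): Option[Array[Byte]] = store.get(id)
  }
}
```

A different shuffle manager could back `blockResolver` with files, consolidated file segments, or a remote store without changing any caller.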

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16183/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

Contributor

Add an @return tag explaining what the boolean return value means.
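
For illustration only, a Scaladoc comment of the kind being requested might look like the sketch below. The class and method are hypothetical stand-ins, since the method under review is not shown in this conversation.

```scala
// Hypothetical class used only to illustrate the requested doc comment.
final class ShuffleRegistry {
  private val registered = scala.collection.mutable.Set.empty[Int]

  /** Records that state exists for the given shuffle. */
  def register(shuffleId: Int): Unit = registered += shuffleId

  /**
   * Removes all state kept for the given shuffle.
   *
   * @param shuffleId id of the shuffle to remove
   * @return true if the shuffle was registered and has been removed,
   *         false if no state existed for it
   */
  def removeShuffle(shuffleId: Int): Boolean = registered.remove(shuffleId)
}
```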

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16190/

Contributor Author

Yeah, I see your concern; I thought about this too. In this PR it is mostly used internally by HashShuffleManager's writer. Still, we can treat it as a way to expose the internal storage objects for shortcut usage, such as the current Netty-based shuffle sender. Without this interface, that is hard to implement without introducing possibly many more interfaces. To keep it simple, I offer the option to expose the ShuffleBlockManager.

Also, this "location" concept might not be meaningful, but a BlockObjectId might be a good fit for all possible ShuffleManagers. After all, you are handling some object, whether it is a File, a Stream, or whatever you save your data to. So a ShuffleBlockManager itself might still be needed to access this object in certain shortcut cases for a simpler API, and you can name the method getDataObjectHandle or whatever fits. I also have a PR for this idea, that is, generalizing the object and passing around an ObjectID for different storage types, at #1209.

So does this make any sense to you? ;) Still, I agree that if we could find a better way to solve the Netty block sender problem, this could be hidden.

Contributor Author

@rxin, how about we also hide the current BlockFetcherIterator machinery behind ShuffleManager? A specific ShuffleManager does not necessarily use the current fetcher approach to get shuffle data. Each ShuffleManager should implement its own shuffle logic, though some could reuse the same logic; for example, a file-based one could reuse the current implementation. This way we can solve the above problem and have a better chance of not exposing ShuffleBlockManager; a read/write interface for the shuffle reader/writer would be enough.
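
A rough sketch of what such a read/write-only interface could look like. All names here are assumptions invented for this example, not the interfaces in the patch. The toy manager keeps map output in memory, which shows that fetch logic can differ per manager without callers seeing any block-level detail.

```scala
// Hypothetical interfaces for illustration; names are not from the patch.
trait ShuffleWriterSketch[K, V] {
  def write(records: Iterator[(K, V)]): Unit
}

trait ShuffleReaderSketch[K, V] {
  def read(): Iterator[(K, V)]
}

// Each shuffle manager decides how records are stored and fetched.
trait PluggableShuffleManager {
  def getWriter[K, V](shuffleId: Int, mapId: Int): ShuffleWriterSketch[K, V]
  def getReader[K, V](shuffleId: Int): ShuffleReaderSketch[K, V]
}

// Toy manager that buffers map output in memory instead of fetching blocks.
final class BufferingShuffleManager extends PluggableShuffleManager {
  private val outputs =
    scala.collection.mutable.Map.empty[Int, Vector[(Any, Any)]]

  def getWriter[K, V](shuffleId: Int, mapId: Int): ShuffleWriterSketch[K, V] =
    new ShuffleWriterSketch[K, V] {
      // Appends this map task's records to the shuffle's buffered output.
      def write(records: Iterator[(K, V)]): Unit = {
        val existing = outputs.getOrElse(shuffleId, Vector.empty[(Any, Any)])
        outputs.update(shuffleId, existing ++ records)
      }
    }

  def getReader[K, V](shuffleId: Int): ShuffleReaderSketch[K, V] =
    new ShuffleReaderSketch[K, V] {
      // Returns everything written for the shuffle, in write order.
      def read(): Iterator[(K, V)] =
        outputs.getOrElse(shuffleId, Vector.empty[(Any, Any)])
          .iterator.map(_.asInstanceOf[(K, V)])
    }
}
```

A file-based manager could implement the same two methods on top of the existing fetcher, so the reuse mentioned above stays possible.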

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16199/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@colorant
Contributor Author

@rxin I moved the getBlockLocation method from ShuffleBlockManager to HashShuffleBlockManager to make the interface more general. Does the current interface look reasonable to you?

A few pieces of shuffle-related code could still be moved from the block manager into specific shuffle manager implementations (e.g. blockManager.getMultiple). But they are not tightly related to this ShuffleBlockManager generalization work, and I am not sure whether other ShuffleManager implementations will reuse them, so I left them as they are. That could be done in a future PR, I guess.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16253/

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16256/

@rxin
Contributor

rxin commented Jun 30, 2014

Thanks - we are all super busy with Spark Summit this week so probably will get to this later in the week... feel free to send a reminder if I don't revisit this towards the end of the week.

@colorant
Contributor Author

colorant commented Jul 7, 2014

ping @rxin ;)

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@rxin
Contributor

rxin commented Jul 8, 2014

Sorry will take a look tomorrow!

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16396/

@rxin
Contributor

rxin commented Aug 27, 2014

Thanks for doing this. To help with the review, can you write a short design doc discussing the interfaces between different components, similar to the one attached here https://issues.apache.org/jira/browse/SPARK-3019 ?

Contributor

Can you add a TODO here noting that getValues should bypass getBytes and use stream-based APIs? Otherwise this uses a lot of memory during the external sort merge.
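
To make the memory concern concrete, here is a minimal sketch of decoding values lazily from a stream rather than from a fully materialized byte array. The object and method names are hypothetical, and the fixed-width Int encoding is an assumption made only to keep the example self-contained; it is not the serializer used by the code under review.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream, EOFException, InputStream}

// Hypothetical helpers; they are NOT the methods under review.
object StreamingValues {
  // Encodes ints as consecutive 4-byte values (helper for trying the sketch).
  def encode(values: Seq[Int]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    values.foreach(v => out.writeInt(v))
    out.flush()
    bytes.toByteArray
  }

  // Decodes one value at a time, so the whole block never has to sit in memory.
  def valuesFromStream(in: InputStream): Iterator[Int] = {
    val data = new DataInputStream(in)
    new Iterator[Int] {
      private var nextValue: Option[Int] = fetch()

      // Reads the next value, closing the stream once it is exhausted.
      private def fetch(): Option[Int] =
        try Some(data.readInt())
        catch { case _: EOFException => data.close(); None }

      def hasNext: Boolean = nextValue.isDefined

      def next(): Int = {
        val v = nextValue.getOrElse(throw new NoSuchElementException("end of stream"))
        nextValue = fetch()
        v
      }
    }
  }
}
```

With a file-backed input stream in place of the in-memory one, a merge over many such iterators would hold roughly one pending value per stream plus any stream buffers, instead of one full byte array per block.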

@colorant
Contributor Author

Jenkins, test this please

@colorant
Contributor Author

Jenkins, test this please.

@rxin
Contributor

rxin commented Aug 29, 2014

Jenkins, ok to test.

@SparkQA

SparkQA commented Aug 29, 2014

QA tests have started for PR 1241 at commit 0e01ae3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 29, 2014

QA tests have finished for PR 1241 at commit 0e01ae3.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

If the reduce id is always 0, why even bother defining it?

@rxin
Copy link
Contributor

rxin commented Aug 30, 2014

Merging this now. I will take care of some minor things myself. Thanks!

@asfgit asfgit closed this in acea928 Aug 30, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
By hiding the ShuffleBlockManager behind the ShuffleManager, we decouple the management of shuffle data's block mapping from DiskBlockManager. This gives a clearer interface and makes it easier for other shuffle managers to implement their own block management logic. The JIRA ticket has more details.

Author: Raymond Liu <[email protected]>

Closes apache#1241 from colorant/shuffle and squashes the following commits:

0e01ae3 [Raymond Liu] Move ShuffleBlockmanager behind shuffleManager