Conversation

@shubhamchopra
Contributor

What changes were proposed in this pull request?

This PR makes block replication strategies pluggable. It provides two traits that can be implemented: one that maps a host to its topology and is used in the master, and a second that prioritizes a list of peers for block replication and runs in the executors.

This patch contains default implementations of these traits that make sure current Spark behavior is unchanged.
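As a rough sketch of the shape of these two traits (illustrative only; the names and exact signatures below are assumptions for exposition, not necessarily the API as merged):

import org.apache.spark.SparkConf
import org.apache.spark.storage.{BlockId, BlockManagerId}

// Master side: maps a host name to an optional topology string (e.g. a rack).
// Taking SparkConf in the constructor lets a custom mapper read any properties
// it needs, such as a script or class name.
abstract class TopologyMapper(conf: SparkConf) {
  def getTopologyForHost(hostname: String): Option[String]
}

// Executor side: orders the candidate peers for replicating a given block.
trait BlockReplicationPrioritization {
  def prioritize(
      localId: BlockManagerId,
      peers: Seq[BlockManagerId],
      blockId: BlockId): Seq[BlockManagerId]
}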

How was this patch tested?

This patch should not change Spark behavior in any way, and was tested with unit tests for storage.

def apply(execId: String,
host: String,
port: Int,
topologyInfo: Option[String] = None): BlockManagerId =
Member

Maybe we need to correct the indentation as below:

def apply(execId: String,
  host: String,
  port: Int,
  topologyInfo: Option[String] = None): BlockManagerId =

@shubhamchopra
Contributor Author

Fixed style issues pointed out by @HyukjinKwon

@ericl
Contributor

ericl commented Jul 20, 2016

A couple of high-level questions:

  • Rather than send an RPC to the master asking for a worker's topology info, is it possible for this to be provided at initialization time or determined based on the environment?
  • Is it possible to narrow the interface of the prioritizer to just choose a single next peer? If it is desired to cache the prioritization order, this can be done internally within the prioritizer. For example, the interface could be something like this. Then the default prioritizer does not need to do a random shuffle of the entire peer list to choose its target.
trait BlockReplicationStrategy {

  trait ReplicationTargetSelector {
    def getNextPeer(
      candidatePeers: Set[BlockManagerId],
      successfulReplications: Set[BlockManagerId],
      failedReplications: Set[BlockManagerId]): Option[BlockManagerId]
  }

  def getTargetSelector(
    localId: BlockManagerId,
    blockId: BlockId,
    level: StorageLevel): ReplicationTargetSelector
}
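
As a rough illustration of the lazy-selection point (not code from this PR; it assumes the interface proposed above), a default strategy could pick one random remaining candidate per call instead of shuffling the full peer list:

import scala.util.Random
import org.apache.spark.storage.{BlockId, BlockManagerId, StorageLevel}

class RandomReplicationStrategy extends BlockReplicationStrategy {
  override def getTargetSelector(
      localId: BlockManagerId,
      blockId: BlockId,
      level: StorageLevel): ReplicationTargetSelector = {
    new ReplicationTargetSelector {
      override def getNextPeer(
          candidatePeers: Set[BlockManagerId],
          successfulReplications: Set[BlockManagerId],
          failedReplications: Set[BlockManagerId]): Option[BlockManagerId] = {
        // Only consider peers we have not already replicated to or failed on.
        val remaining =
          (candidatePeers -- successfulReplications -- failedReplications).toSeq
        if (remaining.isEmpty) None
        else Some(remaining(Random.nextInt(remaining.size)))
      }
    }
  }
}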

Also, the patch would be more minimal if only the getRandomPeer() call was changed.

@shubhamchopra
Contributor Author

The topology info is only queried when the executor starts up and is assumed to stay the same throughout the life of the executor. Depending on the cluster manager being used, I am assuming the exact way this information is provided may differ. Resolving this at the master makes this implementation simpler, as only the master needs to be able to access the service/script/class being used to resolve the topology. The communication overhead is minimal, as the executors have to communicate with the master when they start up anyway.

The getRandomPeer() method was doing quite a bit more than just getting a random peer. It was being used to manage and mutate state, which was being mutated in other places as well. I tried to keep the block placement strategy and the usage of its output separate, to make it simpler to provide a new block placement strategy. I also thought it would be best to decouple any internal replication state management from the block replication strategy, while still keeping the structure of the state the same.

The costlier operation here is the RPC fetch of all the peers from the master. The prioritization algorithm is only called once if there are no failures. If there are failures, the list of peers is requested from the master again before the prioritizer is run. The bigger hit, again, would be the RPC communication between the executor and the master. Random.shuffle in the default prioritizer uses a Fisher-Yates shuffle, so it is linear in time.
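
For concreteness, a default prioritizer along these lines could be as small as the sketch below (assuming the BlockReplicationPrioritization trait sketched in the description; illustrative only):

import scala.util.Random
import org.apache.spark.storage.{BlockId, BlockManagerId}

// No topology awareness: just a uniformly random ordering of the peers.
// scala.util.Random.shuffle is a Fisher-Yates style shuffle, linear in the list size.
class RandomBlockReplicationPrioritization extends BlockReplicationPrioritization {
  override def prioritize(
      localId: BlockManagerId,
      peers: Seq[BlockManagerId],
      blockId: BlockId): Seq[BlockManagerId] = {
    Random.shuffle(peers)
  }
}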

@ericl
Contributor

ericl commented Jul 27, 2016

> The topology info is only queried when the executor starts up and is assumed to stay the same throughout the life of the executor. Depending on the cluster manager being used, I am assuming the exact way this information is provided may differ. Resolving this at the master makes this implementation simpler, as only the master needs to be able to access the service/script/class being used to resolve the topology. The communication overhead is minimal, as the executors have to communicate with the master when they start up anyway.

I see, that makes sense, though it is a little weird to ask the master for information that you then immediately use to register with it.

> The getRandomPeer() method was doing quite a bit more than just getting a random peer. It was being used to manage and mutate state, which was being mutated in other places as well. I tried to keep the block placement strategy and the usage of its output separate, to make it simpler to provide a new block placement strategy. I also thought it would be best to decouple any internal replication state management from the block replication strategy, while still keeping the structure of the state the same.

Still, I think it would be a smaller change to just move some of that logic out of getRandomPeer(), and retain the rest. Then you just need to implement getNextPeer(), and BlockManager doesn't need to worry about tracking the prioritized order internally.

@shubhamchopra
Contributor Author

shubhamchopra commented Jul 27, 2016

The state being managed inside getRandomPeer() is also modified in a couple of other places, so it won't be a very clean change to move some of it out of getRandomPeer. Even if that is done, I agree that your approach would only require calling getNextPeer. It would, however, mean adding more state to ensure the expected behavior in cases where block replication fails on a peer.

I am flexible about the implementation choices, so I can make these modifications if needed. Just to clarify the motivation for this interface: I have another PR, SPARK-15354, that shows a couple of prioritizers I intend to add (including a simple one that mirrors HDFS's block replication strategy). Note that in case of failures, the list of peers is requested from the master afresh and is prioritized again. With this interface, the ReplicationTargetSelector would have to be generated afresh, and an iteration of the optimization would run every time getNextPeer is called. Let me know what you think.

@ericl
Contributor

ericl commented Jul 28, 2016

You wouldn't have to create a new selector after a failure. That case can be detected by checking whether the number of failed replications has increased, e.g. if (failedReplications.size > prevNumFails) { reprioritize... }. Basically, that state would be tracked in the selector instead of in BlockManager.

I think the main benefit here is that the interface would be more flexible as a developer-facing API.
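
For illustration (assuming the ReplicationTargetSelector interface sketched earlier, not code from this PR), a stateful selector that re-prioritizes only when a new failure appears could look roughly like this:

import scala.util.Random
import org.apache.spark.storage.{BlockId, BlockManagerId, StorageLevel}

class CachingRandomStrategy extends BlockReplicationStrategy {
  override def getTargetSelector(
      localId: BlockManagerId,
      blockId: BlockId,
      level: StorageLevel): ReplicationTargetSelector = {
    new ReplicationTargetSelector {
      // State lives in the selector, not in BlockManager.
      private var prevNumFails = -1
      private var order: List[BlockManagerId] = Nil

      override def getNextPeer(
          candidatePeers: Set[BlockManagerId],
          successfulReplications: Set[BlockManagerId],
          failedReplications: Set[BlockManagerId]): Option[BlockManagerId] = {
        // Re-prioritize only when a new failure has been observed.
        if (failedReplications.size > prevNumFails) {
          prevNumFails = failedReplications.size
          order = Random.shuffle(candidatePeers.toList)
        }
        // Drop peers that have already succeeded or failed, then return the next one.
        order = order.filterNot { p =>
          successfulReplications.contains(p) || failedReplications.contains(p)
        }
        order.headOption
      }
    }
  }
}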


  blockManagerId = BlockManagerId(
-   executorId, blockTransferService.hostName, blockTransferService.port)
+   executorId, blockTransferService.hostName, blockTransferService.port, topologyInfo)
Contributor

@ericl ericl Jul 28, 2016

Would it work if topologyInfo was sent back from the master when registerBlockManager is called? It doesn't seem that anything uses blockManagerId until registration finishes. That way we wouldn't need this two-step registration.
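
Roughly, the idea is something like this sketch (hypothetical helper, not the PR's code; it assumes a TopologyMapper abstraction like the one sketched in the description):

import org.apache.spark.storage.BlockManagerId

// Master-side sketch: on registration, resolve the host's topology via the
// TopologyMapper and hand the enriched id back to the executor, so no second
// round-trip is needed.
object RegistrationSketch {
  def register(id: BlockManagerId, topologyMapper: TopologyMapper): BlockManagerId = {
    val updatedId = BlockManagerId(
      id.executorId, id.host, id.port, topologyMapper.getTopologyForHost(id.host))
    // ... the master would record updatedId in its internal maps here ...
    updatedId
  }
}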

Contributor Author

I implemented this in the commits below.

@shubhamchopra shubhamchopra force-pushed the RackAwareBlockReplication branch from 3ee664e to 7b8685c on August 4, 2016 17:30
Contributor

nit: seems a little easier to read if it was !a && !b

Contributor Author

@shubhamchopra shubhamchopra Aug 5, 2016

Converted to !a && !b

Contributor

?

Contributor

Still seems the same

Contributor Author

My bad, the latest commit fixed this. The while loop also had a similar condition, which I had fixed earlier.

1. Adding rack attribute to hashcode and equals to block manager id.
2. Removing boolean check for rack awareness. Asking master for rack info, and master uses topology mapper.
3. Adding a topology mapper trait and a default implementation that block manager master endpoint uses to discern topology information.
…o get a fully fleshed out id, with topology information, if available.
…while loop, as suggested by @ericl

2. Adding SparkConf constructor arguments to TopologyMapper, so any required properties like classname or file/script names can be passed to a custom topology mapper.
…plicationPolicy api to take the number of peers needed and adding a sampling algo linear in time and space along with test cases.
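
For context on the linear-time sampling mentioned in that commit, a rough sketch (illustrative, not necessarily the merged algorithm) is a partial Fisher-Yates shuffle that stops after the requested number of peers:

import scala.util.Random
import scala.collection.mutable.ArrayBuffer

object SamplingSketch {
  // Pick numPeers elements uniformly at random, in time and space linear in peers.size,
  // by running only the first numPeers swaps of a Fisher-Yates shuffle.
  def randomSample[T](peers: Seq[T], numPeers: Int, rng: Random = new Random): Seq[T] = {
    val buf = ArrayBuffer(peers: _*)
    val n = math.min(numPeers, buf.length)
    for (i <- 0 until n) {
      val j = i + rng.nextInt(buf.length - i)
      val tmp = buf(i); buf(i) = buf(j); buf(j) = tmp
    }
    buf.take(n).toSeq
  }
}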
@shubhamchopra shubhamchopra force-pushed the RackAwareBlockReplication branch from 907154c to 632d043 on September 20, 2016 17:15
@shubhamchopra
Contributor Author

Rebased to master to resolve merge conflicts

@rxin
Contributor

rxin commented Sep 29, 2016

LGTM - sorry that this has taken a while. I will merge once tests pass.

Also cc @zsxwing for his attention.

@SparkQA

SparkQA commented Sep 29, 2016

Test build #3291 has finished for PR 13152 at commit 632d043.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Oct 1, 2016

Merging in master.

@asfgit asfgit closed this in a26afd5 Oct 1, 2016