
Conversation

@mateiz (Member) commented Aug 14, 2013

This is a work in progress to simplify and improve a few things in the TaskSet scheduler and the way we handle locality. Here's a summary of changes (see also the commit messages):

  • Eliminated most hostPort-based data structures and just used executor IDs instead, since the hostPort vs host distinction was confusing (and not checkable with static typing, leading to ugly debug code), and hostPorts are not provided by Mesos. The hostPort code was there from before we could easily pass executor IDs through.
  • Removed the non-local fallback logic in ClusterScheduler that tried to launch less-local tasks on a node once the local ones were all assigned. That fallback didn't work well because many cluster schedulers send offers for just one node at a time (even the standalone and YARN ones do, since nodes join the cluster one by one), so lots of non-local tasks would be assigned even though a node with locality for them could receive tasks just a short time later.
  • Added multi-level delay scheduling in ClusterTaskSetManager so you can set a different wait for each locality level (see the sketch at the end of this description). This covers part of the use case for the fallback above, for users who wanted to skip node-local waits and go straight for rack-local.
  • Updated the way ClusterTaskSetManager handles racks: instead of enqueueing a task to a separate queue for all the hosts in the rack, which would create lots of large queues, have one queue per rack name.
  • Added periodic revival of offers in StandaloneSchedulerBackend (this was an option in the YARN patch but seems good to have by default).
  • Renamed map output tracker "generations" to "epochs".
  • Simplified and cleaned up a bunch of the code.

The only thing missing is new unit tests for this stuff, which I'm working on, but the code appears to run fine so I wanted to put it up for others to try.
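
To make the multi-level delay scheduling idea concrete, here is a minimal, self-contained sketch of the bookkeeping involved. It is illustrative only and does not mirror the actual ClusterTaskSetManager code; the level names match Spark's locality levels, and the 3000 ms fallback matches the 3-second default wait discussed later in this thread.

    object LocalityLevel extends Enumeration {
      val PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY = Value
    }

    class DelayScheduler(waitPerLevel: Map[LocalityLevel.Value, Long]) {
      private val levels = LocalityLevel.values.toIndexedSeq
      private var currentIndex = 0
      private var lastLaunchTime = System.currentTimeMillis

      // Most local level we are currently allowed to schedule at: step down one
      // level each time the wait configured for the current level has elapsed.
      def allowedLevel(now: Long): LocalityLevel.Value = {
        while (currentIndex < levels.size - 1 &&
               now - lastLaunchTime >= waitPerLevel.getOrElse(levels(currentIndex), 3000L)) {
          lastLaunchTime += waitPerLevel.getOrElse(levels(currentIndex), 3000L)
          currentIndex += 1
        }
        levels(currentIndex)
      }

      // When a task launches, jump back to the level it actually launched at so
      // that later tasks get a fresh chance at better locality.
      def recordLaunch(now: Long, launchedAt: LocalityLevel.Value): Unit = {
        lastLaunchTime = now
        currentIndex = levels.indexOf(launchedAt)
      }
    }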

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/581/

Contributor

Matei - what are the semantics of these different waits? Say that you want a task to bump from PROCESS_LOCAL to NODE_LOCAL after 10 seconds. Would you set the wait to 10 seconds for PROCESS_LOCAL or NODE_LOCAL? It would be good to make that clear in the docs.

It might make sense to have the default wait be different for different levels so that the "turn key" performance is pretty good. For instance, it's probably fine to make the PROCESS_LOCAL -> NODE_LOCAL switch after only a few seconds. But going from PROCESS_LOCAL to RACK_LOCAL should be a lot longer.

Member Author

Good idea; I'll add a doc page on this. Basically there is a separate jump between each pair of levels: for example process -> (wait 3 sec) -> node -> (wait 3 more sec) -> rack. You can configure each wait separately.

My instinct was that it's not obvious which jumps will cost more, so I'd rather have a simple default setting. For example, going from process to node could be really bad for data that's expensive to serialize, while going from node to rack might be very cheap with a fast network.
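
For example, a minimal way to set the three waits, assuming the system-property style of configuration Spark used at the time (the property names are the ones added by this PR; the 3000 ms values are only illustrative):

    // Set before creating the SparkContext.
    System.setProperty("spark.locality.wait.process", "3000") // wait before process-local -> node-local
    System.setProperty("spark.locality.wait.node", "3000")    // wait before node-local -> rack-local
    System.setProperty("spark.locality.wait.rack", "3000")    // wait before rack-local -> any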

Contributor

I see; so I guess my main question was whether this wait attaches to the level we are jumping to or the level we are jumping from. Since you can't set it for ANY, I guess it's the level you are jumping from.

@mridulm (Contributor) commented Aug 14, 2013

Hi Matei,

I have added a few comments; unfortunately, I am unable to go through this at the level of detail I would like to, so apologies for that.

In general, though, a few things:

a) A few sync blocks were removed; we might need to reintroduce those (see the sketch after this comment). This is not just for thread safety, which might not be a big issue, but also to ensure visibility of updates, since the same thread need not always be used by Akka, etc.

b) I noticed the change from hostPort to host and executorId; would it not be better to remove host also?
The earlier code was more of a hack, since I was trying to reconcile the pre-executorId codebase with significant drift in the YARN branch, which had the host -> hostPort change.

c) I added a few comments in the code about scheduling in general and why we hacked in the reviveOffers thread; coupled with my earlier mail on this subject, hopefully I have not messed up explaining my rationale for those changes!

Thanks,
Mridul
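
A minimal illustration of the visibility point in (a), not code from this patch (the class and field names here are made up): a plain var mutated from one Akka dispatcher thread is not guaranteed to be visible to another thread that reads it later unless both accesses synchronize (or the field is @volatile), even when there is no real race.

    class OfferTracker {
      private var lastReviveTime = 0L  // hypothetical field, for illustration only

      // Synchronizing both the write and the read provides the happens-before
      // edge that makes the update visible across threads.
      def recordRevive(now: Long): Unit = synchronized { lastReviveTime = now }
      def lastRevive: Long = synchronized { lastReviveTime }
    }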

@mateiz (Member Author) commented Aug 14, 2013

Thanks for the feedback, Mridul, this is super helpful! I'll address some of the issues you brought up.

Contributor

minor typo: readding

Contributor

oh, nevermind! re-adding, not readding :)

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/610/

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/635/

mateiz added 6 commits August 18, 2013 19:51
- Replace use of hostPort vs host in Task.preferredLocations with a
  TaskLocation class that contains either an executorId and a host or
  just a host. This is part of a bigger effort to eliminate hostPort
  based data structures and just use executorID, since the hostPort vs
  host stuff is confusing (and not checkable with static typing, leading
  to ugly debug code), and hostPorts are not provided by Mesos.

- Replaced most hostPort-based data structures and fields as above.

- Simplified ClusterTaskSetManager to deal with preferred locations in a
  more concise way and generally be more concise.

- Updated the way ClusterTaskSetManager handles racks: instead of
  enqueueing a task to a separate queue for all the hosts in the rack,
  which would create lots of large queues, have one queue per rack name.

- Removed non-local fallback stuff in ClusterScheduler that tried to
  launch less-local tasks on a node once the local ones were all
  assigned. This change didn't work because many cluster schedulers send
  offers for just one node at a time (even the standalone and YARN ones
  do so as nodes join the cluster one by one). Thus, lots of non-local
  tasks would be assigned even though a node with locality for them
  would be able to receive tasks just a short time later.

- Renamed MapOutputTracker "generations" to "epochs".
- Added periodic revival of offers in StandaloneSchedulerBackend

- Replaced task scheduling aggression with multi-level delay scheduling
  in ClusterTaskSetManager

- Fixed ZippedRDD preferred locations because they can't currently be
  process-local

- Fixed some uses of hostPort
- When a resourceOffers() call has multiple offers, force the TaskSets
  to consider them in increasing order of locality levels so that they
  get a chance to launch stuff locally across all offers

- Simplify ClusterScheduler.prioritizeContainers

- Add docs on the new configuration options
resetting locality level after a non-local launch
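
As a rough sketch of the TaskLocation idea from the first commit message above (the class in the actual patch may be shaped differently): a preferred location always names a host, and optionally a specific executor on that host. The hostnames and executor ID below are example values only.

    case class TaskLocation(host: String, executorId: Option[String] = None)

    object TaskLocationExample {
      // A location tied to a particular executor (e.g. cached data) vs. one tied only to a host:
      val cachedPartition = TaskLocation("host1.example.com", Some("executor-3"))
      val hdfsBlock = TaskLocation("host1.example.com")
    }
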
@mateiz (Member Author) commented Aug 19, 2013

Jenkins, please retest this

@mateiz (Member Author) commented Aug 19, 2013

Alright, I think this is pretty much done; I also added some unit tests and found a bug with them (ClusterTaskSetManager was setting currentLocalityIndex to allowedLocality instead of taskLocality).

One final improvement might be to add support for rack locality in non-YARN modes, but we can do that as a separate PR.
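
For readers following along, a tiny sketch of the bug described above, using only the names from this comment (the real ClusterTaskSetManager code is more involved): after a launch, the delay-scheduling index should track the locality the task actually achieved, not the less-local level the scheduler was merely allowed to use.

    class LocalityIndexTracker {
      var currentLocalityIndex = 0

      def onTaskLaunched(taskLocalityIndex: Int, allowedLocalityIndex: Int): Unit = {
        // The fix: the buggy version assigned allowedLocalityIndex here.
        currentLocalityIndex = taskLocalityIndex
      }
    }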

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/646/

@AmplabJenkins

Thank you for submitting this pull request.

All automated tests for this request have passed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/647/

@mridulm (Contributor) commented Aug 19, 2013

I ran some tests with this PR (as pulled on Saturday). The results were mixed.

The setup:
a) baseline - spark.tasks.revive_periodic.interval=200, spark.tasks.schedule.aggression=RACK
b) pr applied, enabled - spark.locality.wait.process=0 spark.locality.wait.node=0 spark.locality.wait.rack=3000 (to mirror (a) as closely as possible).
c) pr applied, but not enabled - so run with default config.

The relative performance numbers below are over 4 runs of the experiments, using both the min and the average of the time taken via the 'time' command (so end-to-end time).

For lower numbers of nodes (<= 10), (b) performed better: from 2% to 10% across the experiments. For (c), the results ranged from break-even to about 4% worse.

But as the number of nodes increases, the relative performance worsens, and somewhere between 20 and 25 nodes, (b) becomes worse than (a).
At 100 nodes, (b) is about 18% worse and (c) is about 41% worse.

I did not try with larger numbers, though (partly due to diminishing returns for the dataset, and partly because the graphs were diverging anyway, so intuitively it should worsen).

This is much better than earlier, though (without 2a4ed10).
Also note that the relative difference between (b) and (c) is much lower than earlier (with and without aggression), which can be taken as a positive change: default out-of-the-box performance is better.

I am planning to rerun the experiment set once again with the latest PR, just to be sure.

@mateiz (Member Author) commented Aug 19, 2013

I actually did fix a small bug in it yesterday, where it wasn't properly remembering the locality of each launched task.

Instead of c), can you also try setting spark.locality.wait.process=3000, spark.locality.wait.node=0, and spark.locality.wait.rack=3000? This is a new configuration possible with this PR that lets you launch tasks on HDFS data in a rack-local manner while still getting good locality for cached data. (Is your test actually using any caching, by the way?)
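
For reference, that suggestion in system-property form (illustrative only; the property names are the ones introduced by this PR): keep a wait at the process level so cached data is still read in-process, skip the node-local wait, and keep a wait at the rack level so HDFS reads stay at least rack-local.

    System.setProperty("spark.locality.wait.process", "3000")
    System.setProperty("spark.locality.wait.node", "0")
    System.setProperty("spark.locality.wait.rack", "3000")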

@mateiz (Member Author) commented Aug 19, 2013

Also, if you want to match a), you need to set spark.scheduler.revive.interval=200 in the new code too.

@mridulm (Contributor) commented Aug 19, 2013

I was trying to replicate an exactly similar test environment for both the baseline and the PR change (hence spark.locality.wait.process=0).

I did not notice that the revive thread timeout was configurable; I will change the test to use spark.scheduler.revive.interval=200, thanks!

@mridulm (Contributor) commented Aug 19, 2013

Looks great now, Matei!
With the latest commits, the preliminary numbers with the PR enabled are within the margin of error of the baseline.

@mateiz (Member Author) commented Aug 20, 2013

Thanks for testing it, Mridul. It also does fine on Patrick's regression tests (https://github.com/amplab/spark-perf), so I'm going to merge it in.

mateiz added a commit that referenced this pull request Aug 20, 2013
Scheduler fixes and improvements
@mateiz mateiz merged commit 8cae72e into mesos:master Aug 20, 2013