[SPARK-13001] [CORE] [MESOS] Prevent getting offers when reached max cores #10924
Conversation
ok to test

Test build #50288 has finished for PR 10924 at commit

Test build #50334 has finished for PR 10924 at commit
I think the two cases where offers are rejected for a longer period should be consolidated in a simple helper function that logs the reason and declines the offer. I'd also reorganize the code to be less nested:
if (!meetsConstraints) {
  declineFor("unmet constraints", rejectOfferDurationUnmetConstraints)
} else if (totalCoresAcquired >= maxCores) {
  declineFor("reached max cores", rejectOfferDurationMaxCores)
} else {
  // ... happy path
}
I agree, and proposed this before; I'm not sure we want to do this change in this patch. I'm thinking perhaps we can get all the small changes we want in a reasonable way and then refactor all three schedulers (or remove one). What do you think?
My first implementation was actually removing all the code duplication in the decline offer path, but it seemed overkill:
def declineOffer(reason: Option[String] = None, refuseSeconds: Option[Long] = None) {
  logDebug("Declining offer" +
    reason.fold("") { r => s" ($r)" } +
    s": $id with attributes: $offerAttributes mem: $mem cpu: $cpus" +
    refuseSeconds.fold("") { r => s" for $r seconds" })
  refuseSeconds match {
    case Some(seconds) =>
      val filter = Filters.newBuilder().setRefuseSeconds(seconds).build()
      d.declineOffer(offer.getId, filter)
    case _ => d.declineOffer(offer.getId)
  }
}

Also, this cannot be reused easily in the fine-grained mode, since it relies on attributes computed locally in the loop. I opted for simplicity, thinking that the resourceOffers method would be refactored at some point, making it easier to factor out the common code. I'm happy to use the implementation above for refuseOffer if you think it's better. It can be simplified quite a bit if it's only for the 2 cases where offers are rejected for a longer period, but then we still have similar code for the default rejection.
To me it seems the only thing it depends on is the offerId, so it could go in MesosSchedulerUtils.
But if that's overkill, let's do it only for this one, and get rid of the nested if structure. It also means there's no need to use Option for the reason and duration.
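For illustration only, here is a minimal, self-contained sketch of what such a simplified, non-Option helper might look like if it lived in MesosSchedulerUtils; the object and method names are hypothetical, and this is not the code that was eventually merged:

import org.apache.mesos.Protos.{Filters, OfferID}
import org.apache.mesos.SchedulerDriver

object MesosOfferDeclineSketch {
  // Hypothetical helper: logs a reason and declines the offer for a given
  // number of seconds. Only the driver and the offer id are required.
  def declineFor(d: SchedulerDriver, offerId: OfferID, reason: String, refuseSeconds: Long): Unit = {
    val filters = Filters.newBuilder().setRefuseSeconds(refuseSeconds.toDouble).build()
    // A real scheduler backend would use logDebug; println keeps the sketch standalone.
    println(s"Declining offer ${offerId.getValue} ($reason) for $refuseSeconds seconds")
    d.declineOffer(offerId, filters)
  }
}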
It also depends on id, offerAttributes, mem, and cpus for logging. They're all derived from the offer, and some are easier than others to get, but we shouldn't compute them twice just for logging.
Okay, I'll change it to handle only the cases where offers are rejected for a longer period.
...and sorry for the delay in reviewing this.
I'm not sure we want to use only one rejection delay setting in these 2 cases. Arguably we could reject offers for a much longer period of time in one case than in the other. And for the fine-grained mode, there's no reason not to add the same logic. I'll do the change and test it. Unfortunately, the example function above can't easily be reused there.
You are right about having two different settings, makes sense. Let's go with that for the moment.
@mgummelt works on Spark full time at Mesosphere too. :-)
@sebastienrainville sorry for my confusion. Fine-grained mode does not respect …
@dragos sorry for the delay. I had also forgotten that …
Cool, looking forward to pushing this over the finish line!

ping @sebastienrainville
…ode when reached max cores
@dragos I finally did the change. Sorry for the delay.
Test build #57661 has finished for PR 10924 at commit
What starvation behavior were you seeing? With the DRF allocator, Mesos should offer rejected resources to other frameworks before re-offering to the Spark job.
}
declineUnmatchedOffers(d, unmatchedOffers)
unmatchedOffers.foreach { offer =>
Please keep this in a separate function declineUnmatchedOffers
please add a test
@mgummelt the problem we were seeing is when running many Spark apps (~20), most of them small (long-lived streaming apps): the bigger apps get allocated only a small number of cores even though the cluster still has a lot of available cores. In that scenario the big apps are not actually receiving offers from Mesos anymore, because the small apps have a much smaller "max share" and therefore get the offers first. With a low number of apps it's okay because, with the default short refuse duration, the offers come back quickly enough. The solution implemented in this PR is to refuse the offers for a long period of time when we know that we don't need offers anymore because the app already acquired all the cores it needs.
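To make the DRF starvation argument concrete, here is a toy, self-contained calculation; the cluster size and per-app core counts below are made-up numbers for illustration, not measurements from the cluster described above:

object DrfShareSketch {
  def main(args: Array[String]): Unit = {
    val clusterCores = 1000.0
    val smallAppCores = 4.0   // one of the ~20 small long-lived streaming apps
    val bigAppCores = 200.0   // a big app that still wants more cores

    // DRF offers resources to the framework with the lowest dominant share first,
    // so the many small apps keep receiving the offers ahead of the big app.
    val smallShare = smallAppCores / clusterCores
    val bigShare = bigAppCores / clusterCores
    println(f"small app share = $smallShare%.3f, big app share = $bigShare%.3f")
  }
}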
Sounds good. Also seems like something we should document, right?
…me when reached spark.cores.max
All the comments should be addressed now
Test build #57699 has finished for PR 10924 at commit
// more and more problematic, to the point where Mesos stops sending offers to the apps
// ranked the lowest by DRF, i.e. the big apps. We mitigate this problem by declining
// the offers for a long period of time when we know that we don't need offers anymore
// because the app already acquired all the cores it needs.
This is a bit verbose. I think something like "Reject an offer for a configurable amount of time to avoid starving other frameworks" is sufficient.
Also, thanks for the code docs, but I was thinking we should add this config var to the user docs as well.
@mgummelt I fixed the documentation and test description. The variable names are quite long though, so they push the content of the right columns quite a bit: https://github.com/sebastienrainville/spark/blob/master/docs/running-on-mesos.md At least the names are descriptive. WDYT?
docs/running-on-mesos.md
<td><code>spark.mesos.rejectOfferDurationForReachedMaxCores</code></td>
<td><code>120s</code></td>
<td>
  Set the amount of time for which offers are rejected when the app already acquired <code>spark.cores.max</code> cores.
Can you add "This is used to prevent starvation of other frameworks."
Done. I added that comment to spark.mesos.rejectOfferDurationForUnmetConstraints as well since it's the same idea.
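For completeness, a hedged example of how an application could override these two settings via SparkConf; the master URL and the values are placeholders, and note that (as discussed further down) the settings were ultimately left out of the user docs:

import org.apache.spark.SparkConf

object MesosDeclineConfExample {
  // Placeholder values; the doc text above lists 120s as the default for
  // spark.mesos.rejectOfferDurationForReachedMaxCores.
  val conf: SparkConf = new SparkConf()
    .setAppName("example")
    .setMaster("mesos://zk://host:2181/mesos")
    .set("spark.cores.max", "64")
    .set("spark.mesos.rejectOfferDurationForReachedMaxCores", "300s")
    .set("spark.mesos.rejectOfferDurationForUnmetConstraints", "300s")
}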
LGTM. cc @andrewor14
Test build #57774 has finished for PR 10924 at commit
docs/running-on-mesos.md
</td>
</tr>
<tr>
<td><code>spark.mesos.rejectOfferDurationForReachedMaxCores</code></td>
I would actually not document these configs. Doing so would require us to maintain backward compatibility. I can't think of any strong use case where someone would want to change these values so I don't think it's worth the maintenance burden.
Looks good. Just minor comments.

This should be ready to go now

LGTM

Test build #57786 has finished for PR 10924 at commit

Merging into master 2.0, thanks for bringing this back to life.
Similar to #8639. This change rejects offers for 120s when `spark.cores.max` is reached in coarse-grained mode, to mitigate offer starvation. This prevents Mesos from sending us offers again and again, starving other frameworks. This is especially problematic when running many small frameworks on the same Mesos cluster, e.g. many small Spark streaming jobs, and causes the bigger Spark jobs to stop receiving offers. By rejecting the offers for a long period of time, they become available to those other frameworks.

Author: Sebastien Rainville <[email protected]>

Closes #10924 from sebastienrainville/master.

(cherry picked from commit eb019af)
Signed-off-by: Andrew Or <[email protected]>
Test build #57794 has finished for PR 10924 at commit