
Conversation

@sebastienrainville
Contributor

Similar to #8639

This change rejects offers for 120s once spark.cores.max is reached in coarse-grained mode, to mitigate offer starvation. It prevents Mesos from sending us offers again and again while starving other frameworks. This is especially problematic when running many small frameworks on the same Mesos cluster, e.g. many small Spark streaming jobs, and causes the bigger Spark jobs to stop receiving offers. When offers are rejected for a long period of time, they become available to those other frameworks.
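
For context, the decline-with-filter mechanism this builds on is the Mesos scheduler driver's declineOffer call with a Filters message. A minimal sketch, where the name declineForLongPeriod and the parameter values are illustrative and not the exact code in the patch:

import org.apache.mesos.Protos.{Filters, OfferID}
import org.apache.mesos.SchedulerDriver

// Decline an offer and ask Mesos not to re-offer these resources to this
// framework for `refuseSeconds`, so they can go to other frameworks instead.
def declineForLongPeriod(driver: SchedulerDriver, offerId: OfferID, refuseSeconds: Double): Unit = {
  val filters = Filters.newBuilder().setRefuseSeconds(refuseSeconds).build()
  driver.declineOffer(offerId, filters)
}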

@andrewor14
Contributor

ok to test

@SparkQA

SparkQA commented Jan 28, 2016

Test build #50288 has finished for PR 10924 at commit 181a6ef.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 29, 2016

Test build #50334 has finished for PR 10924 at commit bf8d870.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

@tnachen @dragos

Contributor

I think the two cases where offers are rejected for a longer period should be consolidated in a simple helper function that logs the reason and declines the offer. I'd also reorganize the code to be less nested:

if (!meetsConstraints) {
  declineFor("unmet constraints", rejectOfferDurationUnmetConstraints)
} else if (totalCoresAcquired >= maxCores) {
  declineFor("reached max cores", rejectOfferDurationMaxCores)
} else {
  // happy case: launch tasks on the matched offer
}
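
A minimal sketch of such a helper, with hypothetical names and assuming that the driver d, the current offer, Spark's logDebug, and the Mesos Filters class are in scope as in the existing loop:

// Hypothetical consolidation of the two long-rejection paths: log the reason,
// then decline with a refuse-seconds filter so Mesos withholds the offer.
def declineFor(reason: String, refuseSeconds: Long): Unit = {
  logDebug(s"Declining offer ${offer.getId.getValue} ($reason) for $refuseSeconds seconds")
  val filters = Filters.newBuilder().setRefuseSeconds(refuseSeconds).build()
  d.declineOffer(offer.getId, filters)
}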

Contributor

I agree and proposed this before; I'm not sure we want to do this change in this patch. I'm thinking perhaps we can get all the small changes we want in a reasonable way and then refactor all three schedulers (or remove one). What do you think?

Contributor Author

My first implementation actually removed all the code duplication in the decline-offer path, but it seemed like overkill:

        def declineOffer(reason: Option[String] = None, refuseSeconds: Option[Long] = None): Unit = {
          logDebug("Declining offer" +
            reason.fold("") { r => s" ($r)" } +
            s": $id with attributes: $offerAttributes mem: $mem cpu: $cpus" +
            refuseSeconds.fold("") { r => s" for $r seconds" })

          refuseSeconds match {
            case Some(seconds) =>
              val filter = Filters.newBuilder().setRefuseSeconds(seconds).build()
              d.declineOffer(offer.getId, filter)
            case _ => d.declineOffer(offer.getId)
          }
        }

Also this cannot be reused easily in the fine-grained mode since it relies on attributes computed locally in the loop. I opted for simplicity thinking that the resourceOffers method would be refactored at some point, making it easier to factor out the common code. I'm happy to use the implementation above for refuseOffer if you think that it's better. It can be simplified quite a bit if it's only for the 2 cases where offers are rejected for a longer period but then we still have similar code for the default rejection.

Contributor

To me it seems the only thing it depends on is the offerId, so it could go in MesosSchedulerUtils.

But if that's overkill, let's do it only for this one, and get rid of the nested if structure. It also means there's no need to use Option for the reason and duration.

Contributor Author

It also depends on id, offerAttributes, mem and cpus for logging. They're all derived from offer and some are easier than others to get but we shouldn't compute them twice just for logging.

Okay I'll change it to handle only the case where offers are rejected for a longer period.

@dragos
Contributor

dragos commented Feb 16, 2016

  • could we have only one rejection delay setting?
  • why not add the same logic in fine-grained mode as well?

..and sorry for the delay in reviewing this.

@sebastienrainville
Contributor Author

I'm not sure we want to use only one rejection delay setting for these two cases. Arguably we could reject offers for a much longer period for unmet constraints, since AFAIK constraints don't change dynamically and therefore hold for the lifetime of a framework. It's a bit different for the max-cores case: if we lose an executor we want the scheduler to launch a new one, ideally without having to wait too long. I put the same default delay of 120s for both since it seems to be a reasonable value.
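
For illustration, a sketch of how the two settings could then be tuned independently, using the config names discussed later in this thread (the 600s value is an arbitrary example, not a recommendation):

import org.apache.spark.SparkConf

// Constraints are effectively static for the lifetime of the framework, so the
// unmet-constraints rejection can be much longer than the max-cores one, which
// should stay short enough to recover quickly when an executor is lost.
val conf = new SparkConf()
  .set("spark.mesos.rejectOfferDurationForUnmetConstraints", "600s")
  .set("spark.mesos.rejectOfferDurationForReachedMaxCores", "120s")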

And for the fine-grained mode, there's no reason not to add the same logic. I'll make the change and test it. Unfortunately, the example function declineOffer cannot be reused there because it relies on local variables declared inside the loop. It really feels like this code needs some refactoring.

@dragos
Contributor

dragos commented Feb 23, 2016

You are right about having two different settings; that makes sense. Let's go with that for the moment.

@keithchambers

@andrewor14

@mgummelt works on Spark full time at Mesosphere too. :-)

@dragos
Contributor

dragos commented Mar 22, 2016

@sebastienrainville sorry for my confusion. Fine-grained mode does not respect spark.cores.max, so my comment does not apply. Can you just do the small refactoring and then this can go in? It's long overdue.

@sebastienrainville
Contributor Author

@dragos sorry for the delay. I had also forgotten that spark.cores.max wasn't respected in fine-grained mode. Quite a few things have changed in this class since the last time I looked at it. I will rebase on master and make the appropriate changes.

@dragos
Contributor

dragos commented Mar 30, 2016

Cool, looking forward to pushing this over the finish line!

@dragos
Contributor

dragos commented Apr 27, 2016

ping @sebastienrainville

@sebastienrainville
Contributor Author

@dragos I finally did the change. Sorry for the delay

@SparkQA

SparkQA commented May 3, 2016

Test build #57661 has finished for PR 10924 at commit 9b314e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgummelt

mgummelt commented May 3, 2016

What starvation behavior were you seeing? With the DRF allocator, Mesos should offer rejected resources to other frameworks before re-offering to the Spark job.

}

declineUnmatchedOffers(d, unmatchedOffers)
unmatchedOffers.foreach { offer =>

Please keep this in a separate function declineUnmatchedOffers

@mgummelt

mgummelt commented May 3, 2016

please add a test

@sebastienrainville
Contributor Author

@mgummelt the problem we were seeing is that when running many Spark apps (~20), most of them small (long-lived small streaming apps), the bigger apps get allocated only a small number of cores even though the cluster still has a lot of available cores. In that scenario the big apps are not actually receiving offers from Mesos anymore, because the small apps have a much smaller "max share" and therefore get the offers first. With a low number of apps it's okay, because the default refuse_seconds value of 5 seconds gives Mesos enough time to cycle through every app and send offers to each of them. But as the number of apps increases it becomes more and more problematic, to the point where Mesos stops sending offers to the apps ranked lowest by DRF, i.e. the big apps.

The solution implemented in this PR is to refuse offers for a long period of time when we know that we don't need them anymore because the app has already acquired spark.cores.max cores. The only case where we would need to acquire more cores is if we lost an executor, so a value of 120s for refuse_seconds seems like a good tradeoff.

@mgummelt

mgummelt commented May 3, 2016

Sounds good. Also seems like something we should document, right?

@sebastienrainville
Contributor Author

All the comments should be addressed now

@SparkQA

SparkQA commented May 4, 2016

Test build #57699 has finished for PR 10924 at commit ad2f014.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dragos
Contributor

dragos commented May 4, 2016

On May 4, 2016, at 02:22, Sebastien Rainville [email protected] wrote:

@dragos I finally did the change. Sorry for the delay

Excellent, thanks! I won't be able to review but it looks like Michael took over.

// more and more problematic, to the point where Mesos stops sending offers to the apps
// ranked the lowest by DRF, i.e. the big apps. We mitigate this problem by declining
// the offers for a long period of time when we know that we don't need offers anymore
// because the app already acquired all the cores it needs.

This is a bit verbose. I think something like "Reject an offer for a configurable amount of time to avoid starving other frameworks" is sufficient.

Also, thanks for the code docs, but I was thinking we should add this config var to the user docs as well.

@sebastienrainville
Contributor Author

sebastienrainville commented May 4, 2016

@mgummelt I fixed the documentation and test description. The variable names are quite long, though, so they push the right-hand columns of the table over quite a bit: https://github.com/sebastienrainville/spark/blob/master/docs/running-on-mesos.md

At least the names are descriptive, and spark.mesos.rejectOfferDurationForUnmetConstraints was already included (but undocumented) in previous releases, so I don't think we want to change the names.

WDYT?

<td><code>spark.mesos.rejectOfferDurationForReachedMaxCores</code></td>
<td><code>120s</code></td>
<td>
Set the amount of time for which offers are rejected when the app already acquired <code>spark.cores.max</code> cores.

Can you add "This is used to prevent starvation of other frameworks."

Contributor Author

Done. I added that comment to spark.mesos.rejectOfferDurationForUnmetConstraints as well since it's the same idea.

@mgummelt

mgummelt commented May 4, 2016

LGTM. cc @andrewor14

@SparkQA

SparkQA commented May 4, 2016

Test build #57774 has finished for PR 10924 at commit 0ccd71c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

</td>
</tr>
<tr>
<td><code>spark.mesos.rejectOfferDurationForReachedMaxCores</code></td>
Contributor

I would actually not document these configs. Doing so would require us to maintain backward compatibility. I can't think of any strong use case where someone would want to change these values so I don't think it's worth the maintenance burden.

@andrewor14
Contributor

Looks good. Just minor comments.

@sebastienrainville
Contributor Author

This should be ready to go now

@andrewor14
Contributor

LGTM

@SparkQA

SparkQA commented May 4, 2016

Test build #57786 has finished for PR 10924 at commit 112f136.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

Merging into master and 2.0, thanks for bringing this back to life.

asfgit closed this in eb019af on May 4, 2016
asfgit pushed a commit that referenced this pull request on May 4, 2016
Similar to #8639

This change rejects offers for 120s once `spark.cores.max` is reached in coarse-grained mode, to mitigate offer starvation. It prevents Mesos from sending us offers again and again while starving other frameworks. This is especially problematic when running many small frameworks on the same Mesos cluster, e.g. many small Spark streaming jobs, and causes the bigger Spark jobs to stop receiving offers. When offers are rejected for a long period of time, they become available to those other frameworks.

Author: Sebastien Rainville <[email protected]>

Closes #10924 from sebastienrainville/master.

(cherry picked from commit eb019af)
Signed-off-by: Andrew Or <[email protected]>
@SparkQA

SparkQA commented May 4, 2016

Test build #57794 has finished for PR 10924 at commit 5b55ae0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
