
Conversation

@timout timout commented Apr 12, 2017

What changes were proposed in this pull request?

MesosCoarseGrainedSchedulerBackend ignored the spark.blacklist.enabled configuration property and used a hard-coded MAX_SLAVE_FAILURES = 2. The purpose of this fix is to remove that hard-coded behaviour; BlacklistTracker is responsible for blacklist functionality.
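
For illustration, a minimal sketch of the behaviour being removed, assuming a simplified failure map in place of the backend's real slave bookkeeping (names below are illustrative, not the actual code):

// Sketch only: before this patch the Mesos backend kept its own per-slave failure count
// and, once it reached the hard-coded MAX_SLAVE_FAILURES = 2, refused all further offers
// from that slave, regardless of spark.blacklist.enabled.
object HardcodedSlaveBlacklistSketch {
  private val MAX_SLAVE_FAILURES = 2
  private val taskFailures = scala.collection.mutable.Map.empty[String, Int]

  def recordFailure(slaveId: String): Unit =
    taskFailures(slaveId) = taskFailures.getOrElse(slaveId, 0) + 1

  // Pre-patch check; after the patch this condition is gone and node health is left to
  // the scheduler-level BlacklistTracker.
  def offerAcceptable(slaveId: String): Boolean =
    taskFailures.getOrElse(slaveId, 0) < MAX_SLAVE_FAILURES
}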

@mgummelt
@jsoltren

How was this patch tested?

Unit tests, manual testing.
This patch is a clean-up; that functionality is tested by the BlacklistTracker tests.

Author: [email protected]

squito commented Dec 21, 2017

Sorry I am only just looking at this now --

I am not so sure this is doing what you think. I think the notion of a "task" in MesosCoarseGrainedSchedulerBackend might be something different; it's really an "executor" in Spark's terminology. Perhaps that code should have some additional comments explaining that. Tasks are still handled in Spark's TaskScheduler / TaskSetManager etc.

@mgummelt can you confirm my understanding?

timout commented Dec 23, 2017

It does exactly what it is supposed to do. And you are absolutely right, it relates to executors.
I am sorry if that was not clear from my previous explanations.
Let us say we have a Spark Streaming app, a very long running app: the driver, started by Marathon from a Docker image, schedules (in the Mesos sense) executors that also run from Docker images (net=HOST), so every executor is started from a Docker image on some Mesos agent.
If some recoverable error happens, for instance:
ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Remote RPC client disassociated... (I do not know about others, but it is relatively frequent in my environment.)
As a result the executor is dead, and after 2 such failures the Mesos agent node is added to the MesosCoarseGrainedSchedulerBackend blacklist and the driver will never schedule (in the Mesos sense) an executor on it again. So the app will starve... and note that it will not die.
That is exactly what happened with my streaming apps before this patch.

This patch may be incompatible with master already, but I can fix it if needed.

squito commented Dec 27, 2017

ok I think I understand. This sounds like the equivalent of some of the existing blacklisting behavior which currently only exists on yarn -- when a request is made to yarn, the spark context tells yarn which nodes it has blacklisted:

https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128
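
Roughly, the yarn-side mechanism boils down to including the scheduler's current node blacklist in each executor request; a minimal sketch of that shape, using illustrative stand-in types rather than the actual YarnSchedulerBackend code:

// Simplified sketch only: when Spark asks the cluster manager for more executors, it passes
// along the nodes the task scheduler currently considers blacklisted, so the manager avoids
// even offering resources on those nodes. The types here are illustrative stand-ins.
case class ExecutorRequest(requestedTotal: Int, nodeBlacklist: Set[String])

class ExecutorRequestBuilder(currentBlacklist: () => Set[String]) {
  def build(requestedTotal: Int): ExecutorRequest =
    ExecutorRequest(requestedTotal, currentBlacklist())  // e.g. scheduler.nodeBlacklist()
}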

however, it still seems like there is a missing piece -- you have to tell mesos which nodes you don't want executors on, right?

I also don't understand why you'd get starvation in your app with this -- shouldn't mesos be requesting executors on other nodes?

anyway, I'm agreeing that something seems wrong with the mesos scheduling when there is a bad node, but I'm not certain this is the right fix, and I just don't know enough about the communication between mesos and spark to say exactly what should be done instead, sorry.
@mgummelt can you comment?

might actually be better to have this discussion on jira, since we're talking about general design, not specifics of this change

andreimaximov commented Dec 27, 2017

@squito not sure if this is still the case, but as of 4 months ago starvation could happen if enough failures occurred on each node that the entire cluster ended up blacklisted. Unlikely, but possible for a long-running app on a sufficiently small cluster.

squito commented Dec 27, 2017

@andreimaximov that is still sort of the case for all cluster managers. You shouldn't get starvation, you should see the app actively fail (SPARK-15865 was the main change, though there was some small follow-on stuff after that). What else can you do if it seems there is something wrong with every node in your cluster?

But if you're really seeing your app just hang in mesos in that situation -- yeah, it seems like something needs to be fixed in the spark-mesos interaction. Unfortunately I won't have a clear picture of what needs to change without spending more time understanding what is there now...

@hantuzun

Even though we only run normal Spark jobs, this PR is going to fix a case for us as well.

@squito squito left a comment

sorry for the long delay @timout and thanks for the ping on this @hantuzun. So I finally took a deeper look, and I understand what is going on here -- I think this is the right change, I just have one requested change for how to still take advantage of the node blacklist.

There are still some parts of mesos that I don't totally understand, which I'd like clarification on, @mgummelt:

  1. In yarn, spark has to actively make a request for more resources. In that request, spark tells yarn's resource manager its blacklist, so that the resource manager avoids even offering resources that the app has blacklisted.

     IIUC, in mesos the process is more passive: the mesos cluster manager will repeatedly call resourceOffers with whatever it's got, and the app decides whether it wants it or not. resourceOffers will get called continuously even if the app has already hit its max number of executors and has no interest in more resources.

  2. The app never updates the mesos cluster manager at all about its preferences -- blacklist, locality preferences, resource sizes, etc. I just want to make sure we shouldn't be changing something else to actively tell mesos the app's node blacklist.

and maybe less important for this PR, but just something which confused me:

  3. I'm surprised that buildMesosTasks doesn't call declineOffer for offers it doesn't use. If the offer has fewer resources than this spark app can use, shouldn't we tell mesos that other apps can have it? Then the same thing would go for offers we don't use because of blacklisting.

cpus + totalCoresAcquired <= maxCores &&
mem <= offerMem &&
numExecutors() < executorLimit &&
slaves.get(slaveId).map(_.taskFailures).getOrElse(0) < MAX_SLAVE_FAILURES &&

rather than just deleting this, we should replace it with a check against scheduler.nodeBlacklist(), like the YarnScheduler is doing here:
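
A minimal sketch of the kind of check being suggested (illustrative only, not the final patch; offerHostname and nodeBlacklist stand in for the hostname taken from the Mesos offer and the result of scheduler.nodeBlacklist()):

// Illustrative only: instead of the backend's own per-slave failure counter, consult the
// node blacklist maintained by the task scheduler / BlacklistTracker when deciding whether
// an offer's host is usable.
def hostUsable(offerHostname: String, nodeBlacklist: Set[String]): Boolean =
  !nodeBlacklist.contains(offerHostname)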

@skonto skonto Feb 20, 2018

@squito regarding 3: that part of the code aggressively launches tasks on matched offers.
The remainder of the offer is implicitly declined: https://mail-archives.apache.org/mod_mbox/mesos-user/201507.mbox/%[email protected]%3E
In addition, if an offer cannot be used to launch a task, it is declined later on here:
https://github.com/apache/spark/blob/83c008762af444eef73d835eb6f506ecf5aebc17/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L440-#L444
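
For context, explicitly handing an unused offer back is a single call into the Mesos driver; a minimal sketch, assuming the standard Mesos SchedulerDriver API (the helper name and loop are illustrative):

import org.apache.mesos.{Protos, SchedulerDriver}

// Illustrative sketch: decline offers this framework is not going to use so that Mesos can
// offer those resources to other frameworks.
def declineUnused(driver: SchedulerDriver, unusedOffers: Seq[Protos.Offer]): Unit =
  unusedOffers.foreach(offer => driver.declineOffer(offer.getId))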

thanks @skonto, also I realized there were more explicit calls to declineOffer than I thought initially, after a closer read of the code. btw @IgorBerman has opened an updated version of this PR here: #20640 -- would appreciate a review over there

@IgorBerman

+1 here, we are running Spark core jobs but with a long-running driver on Mesos. Sometimes executors fail, which is normal (one of the reasons is a temporary port conflict). Over time, fewer and fewer executors are valid for the driver, so it creates a situation where the Mesos cluster has free resources but no one uses them.

squito commented Feb 20, 2018

for anyone watching this: @IgorBerman submitted an updated version of this here: #20640, which I plan to merge unless there are any objections.

@AmplabJenkins

Can one of the admins verify this patch?

@swevrywhere

Sorry, I'm new to GH. Does "closed" mean that it's been merged or that it was abandoned without checking in? If it wasn't checked in, what's the holdup? Was this addressed elsewhere?
