[SPARK-19755][Mesos] Blacklist is always active for MesosCoarseGrainedSchedulerBackend. #17619
Conversation
Sorry, I am only just looking at this now -- I am not so sure this is doing what you think. I think the notion of "task" in MesosCoarseGrainedSchedulerBackend might be something different; it's really an "executor" in Spark's terminology. Perhaps that code should have some additional comments explaining that. Tasks are still handled in Spark's TaskScheduler / TaskSetManager etc. @mgummelt can you confirm my understanding?
That does exactly what it is supposed to do. And you are absolutely right, it relates to executors. That patch may be incompatible with master already, but I can fix it if needed.
OK, I think I understand. This sounds like the equivalent of some of the existing blacklisting behavior which currently only exists on YARN: when a request is made to YARN, the Spark context tells YARN which nodes it has blacklisted. However, it still seems like there is a missing piece -- you have to tell Mesos which nodes you don't want executors on, right? I also don't understand why you'd get starvation in your app with this -- shouldn't Mesos be requesting executors on other nodes? Anyway, I agree that something seems wrong with the Mesos scheduling when there is a bad node, but I'm not certain this is the right fix, and I just don't know enough about the communication between Mesos and Spark to say exactly what should be done instead, sorry. It might actually be better to have this discussion on JIRA, since we're talking about general design, not specifics of this change.
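To illustrate the YARN-side mechanism mentioned above, here is a minimal sketch (not Spark's actual allocator code; the helper name pushBlacklist is made up) of how an application master can pass its node blacklist to the ResourceManager through the AMRMClient API:

```scala
import org.apache.hadoop.yarn.client.api.AMRMClient
import scala.collection.JavaConverters._

// Hypothetical helper: push blacklist changes to the ResourceManager so it
// stops offering containers on nodes the application has blacklisted.
def pushBlacklist(amClient: AMRMClient[AMRMClient.ContainerRequest],
                  blacklistAdditions: Seq[String],
                  blacklistRemovals: Seq[String]): Unit = {
  amClient.updateBlacklist(blacklistAdditions.asJava, blacklistRemovals.asJava)
}
```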
@squito not sure if this is still the case, but as of 4 months ago starvation could happen if enough failures occurred on each node that the entire cluster ended up blacklisted. Unlikely, but possible for a long-running app on a sufficiently small cluster.
@andreimaximov that is still sort of the case for all cluster managers. You shouldn't get starvation; you should see the app actively fail (SPARK-15865 was the main change, though there was some small follow-on work after that). What else can you do if it seems there is something wrong with every node in your cluster? But if you're really seeing your app just hang on Mesos in that situation -- yeah, it seems like something needs to be fixed in the Spark-Mesos interaction. Unfortunately, I won't have a clear picture of what needs to change without spending more time understanding what is there now...
Even though we only run normal Spark jobs, this PR is going to fix a case for us as well.
Sorry for the long delay @timout, and thanks for the ping on this @hantuzun. So I finally took a deeper look, and I understand what is going on here -- I think this is the right change; I just have one requested change for how to still take advantage of the node blacklist.
There are still some parts of Mesos that I don't totally understand, which I'd like clarification on @mgummelt:
1. In YARN, Spark has to actively make a request for more resources. In that request, Spark tells YARN's resource manager its blacklist, so that the resource manager avoids even offering resources that the app has blacklisted. IIUC, in Mesos the process is more passive: the Mesos cluster manager will repeatedly call resourceOffers with whatever it's got, and the app decides whether it wants it or not. resourceOffers will get called continuously even if the app has already hit its max number of executors and has no interest in more resources.
2. The app never updates the Mesos cluster manager at all about its preferences -- blacklist, locality preferences, resource sizes, etc. I just want to make sure we shouldn't be changing something else to actively tell Mesos the app's node blacklist.
And maybe less important for this PR, but just something which confused me:
3. I'm surprised that buildMesosTasks doesn't call declineOffer for offers it doesn't use. If the offer has fewer resources than this Spark app can use, shouldn't we tell Mesos that other apps can have it? Then the same thing would go for offers we don't use because of blacklisting.
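As a rough illustration of the offer flow discussed in points 1 and 3 above, here is a small sketch using the plain Mesos Java API (handleOffers, wantOffer, and tasksFor are made-up names, not Spark's actual code) of how a framework can either use an offer or hand it back with declineOffer:

```scala
import org.apache.mesos.{Protos, SchedulerDriver}
import scala.collection.JavaConverters._

// Hypothetical helper: Mesos keeps pushing offers via resourceOffers(); the
// framework decides per offer whether to launch on it or decline it so that
// other frameworks can use the resources sooner.
def handleOffers(driver: SchedulerDriver,
                 offers: java.util.List[Protos.Offer],
                 wantOffer: Protos.Offer => Boolean,
                 tasksFor: Protos.Offer => java.util.List[Protos.TaskInfo]): Unit = {
  for (offer <- offers.asScala) {
    if (wantOffer(offer)) {
      driver.launchTasks(java.util.Collections.singletonList(offer.getId), tasksFor(offer))
    } else {
      // A Filters message with refuse_seconds could also be passed here to
      // avoid being re-offered the same resources immediately.
      driver.declineOffer(offer.getId)
    }
  }
}
```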
cpus + totalCoresAcquired <= maxCores &&
mem <= offerMem &&
numExecutors() < executorLimit &&
slaves.get(slaveId).map(_.taskFailures).getOrElse(0) < MAX_SLAVE_FAILURES &&
Rather than just deleting this, we should replace it with a check against scheduler.nodeBlacklist(), like the YarnScheduler is doing here:
Line 128 in 44e20c4
val nodeBlacklist: Set[String] = scheduler.nodeBlacklist()
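A minimal sketch of what that suggested check could look like (names such as offerHostname are hypothetical placeholders for values already available in the backend, not the actual fields):

```scala
// Hypothetical stand-in for the launch condition: instead of the hard-coded
// MAX_SLAVE_FAILURES counter, consult the BlacklistTracker-backed node
// blacklist exposed by the task scheduler.
def canLaunchOn(offerHostname: String,
                nodeBlacklist: Set[String],   // scheduler.nodeBlacklist()
                cpus: Int, totalCoresAcquired: Int, maxCores: Int,
                mem: Long, offerMem: Long,
                numExecutors: Int, executorLimit: Int): Boolean = {
  cpus + totalCoresAcquired <= maxCores &&
  mem <= offerMem &&
  numExecutors < executorLimit &&
  !nodeBlacklist.contains(offerHostname)   // replaces the per-slave failure count check
}
```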
@squito regarding 3: that part of the code aggressively launches tasks on matched offers.
The remainder of the offer is implicitly declined: https://mail-archives.apache.org/mod_mbox/mesos-user/201507.mbox/%[email protected]%3E
In addition, if an offer cannot be used to launch a task, it is declined later on here:
https://github.com/apache/spark/blob/83c008762af444eef73d835eb6f506ecf5aebc17/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L440-#L444
Thanks @skonto. Also, I realized there were more explicit calls to declineOffer than I initially thought, after a closer read of the code. BTW, @IgorBerman has opened an updated version of this PR here: #20640 -- would appreciate a review over there.
+1 here, we are running Spark core jobs but with a long-running driver on Mesos. Sometimes executors fail, which is normal (one of the reasons is a temporary port conflict). Over time, fewer and fewer executors are valid for the driver, so it creates a situation where the Mesos cluster has free resources but no one uses them.
For anyone watching this: @IgorBerman submitted an updated version of this here: #20640, which I plan to merge unless there are any objections.
Can one of the admins verify this patch?
Sorry, I'm new to GH. Does "closed" mean that it's been merged or that it was abandoned without checking in? If it wasn't checked in, what's the holdup? Was this addressed elsewhere?
What changes were proposed in this pull request?
MesosCoarseGrainedSchedulerBackend ignored the spark.blacklist.enabled configuration property and used a hardcoded MAX_SLAVE_FAILURES = 2. The purpose of this fix is to remove that hardcoded behaviour; BlacklistTracker is responsible for blacklist functionality.
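With the hardcoded limit removed, node blacklisting on Mesos is governed by the standard blacklist configuration. A sketch of how an application would opt in (the master URL below is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("blacklist-on-mesos")
  .setMaster("mesos://zk://zk-host:2181/mesos")   // placeholder master URL
  .set("spark.blacklist.enabled", "true")         // let BlacklistTracker decide which nodes to avoid
val sc = new SparkContext(conf)
```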
@mgummelt
@jsoltren
How was this patch tested?
Unit tests and manual testing.
This patch is a clean-up; the affected functionality is covered by the BlacklistTracker tests.
Author: [email protected]