
Conversation

@skyluc

@skyluc skyluc commented Feb 3, 2016

Fix for SPARK-13002 about the initial number of executors when running with dynamic allocation on Mesos.
Instead of fixing it just for the Mesos case, I made the change in ExecutorAllocationManager. It already drives the number of executors running on Mesos, only not the initial value.

The None and Some(0) values are internal details of the computation of resources to reserve in the Mesos backend scheduler. executorLimitOption has to be initialized correctly, otherwise the Mesos backend scheduler will either create too many executors at launch, or create no executors at all and be unable to recover from that state.
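
To make the distinction concrete, here is a minimal sketch of the intended initialization, assuming hypothetical names (`initialExecutorLimit`, `dynamicAllocationEnabled`, `initialExecutors` are illustrative, not Spark's actual fields): `None` means "no limit yet", `Some(0)` means "launch nothing", and with dynamic allocation the limit should start at the configured initial number of executors.

```scala
// Hypothetical sketch of the executor-limit initialization discussed above.
// None    => no limit has been set (static allocation)
// Some(n) => launch at most n executors
def initialExecutorLimit(dynamicAllocationEnabled: Boolean,
                         initialExecutors: Int): Option[Int] =
  if (dynamicAllocationEnabled) Some(initialExecutors) // start at the configured value
  else None                                            // static allocation: no limit

// The two regimes side by side:
val withDynamicAllocation = initialExecutorLimit(dynamicAllocationEnabled = true, initialExecutors = 3)
val withStaticAllocation  = initialExecutorLimit(dynamicAllocationEnabled = false, initialExecutors = 3)
```

Starting from `Some(initialExecutors)` instead of `None` or `Some(0)` avoids both failure modes: over-launching at startup and launching nothing with no way to recover.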

Removed the 'special case' description in the doc. It was not entirely accurate, and is no longer needed.

This doesn't fix the same problem as seen with Spark standalone. There is no straightforward way to send the initial value in standalone mode.

Somebody familiar with this part of the YARN support should review this change.

@dragos
Contributor

dragos commented Feb 3, 2016

LGTM.

@SparkQA

SparkQA commented Feb 3, 2016

Test build #50657 has finished for PR 11047 at commit 1c75940.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

@skyluc just so I understand: the issue is not that dynamic allocation doesn't work, but rather that spark.dynamicAllocation.initialExecutors doesn't take effect?

@andrewor14
Contributor

@vanzin isn't there already another place where we do this initial syncing? Does YARN have the same issue?


I know it was this way before, but can you please s/coarse grain/coarse-grained/.

@andrewor14
Contributor

@skyluc This change LGTM by the way. I'm just hesitant on backporting it into 1.6 since (1) it's a small issue, and (2) it changes core behavior and so affects other cluster modes as well. In general we try to be conservative about what goes into a maintenance release unless it's a critical issue.

By the way I submitted the standalone mode equivalent of this patch at #11054. The solution is similar; the main difference is that in standalone mode the Master keeps track of the executor limit for each application, whereas in Mesos each driver keeps track of its own limit.
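
The difference in bookkeeping described above can be sketched as follows (class and method names here are illustrative, not Spark's actual code): in standalone mode one central Master holds a limit per application, while on Mesos each driver's scheduler backend holds only its own limit.

```scala
// Hypothetical sketch of the two bookkeeping styles, not Spark's actual classes.

// Standalone: the Master tracks an executor limit per application.
class MasterState {
  private var executorLimits = Map.empty[String, Int] // appId -> limit
  def setLimit(appId: String, limit: Int): Unit =
    executorLimits += (appId -> limit)
  def limitFor(appId: String): Option[Int] = executorLimits.get(appId)
}

// Mesos: each driver's backend tracks only its own limit.
class DriverBackendState {
  private var executorLimit: Option[Int] = None
  def setLimit(limit: Int): Unit = executorLimit = Some(limit)
  def limit: Option[Int] = executorLimit
}
```

In both designs the fix is the same idea: seed the limit from the configured initial executor count instead of leaving it unset.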

@andrewor14
Contributor

Once you address @mgummelt's comments I'll go ahead and merge this.

@vanzin
Contributor

vanzin commented Feb 3, 2016

For YARN, see YarnSparkHadoopUtil.getInitialTargetExecutorNumber. I don't think this change will cause any issues with the YARN backend (it should just see a request to set the target number of executors to the same number it already is, so it will just ignore it).
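
For readers unfamiliar with the YARN side, the idea behind computing an initial target can be sketched like this (a hedged sketch modeled on the intent of YarnSparkHadoopUtil.getInitialTargetExecutorNumber; the function name, config keys' fallbacks, and defaults here are illustrative, not Spark's exact code):

```scala
// Sketch: derive the initial executor target from dynamic-allocation settings.
// When dynamic allocation is off, fall back to a fixed instance count.
def initialTargetExecutors(conf: Map[String, String],
                           numExecutorsDefault: Int = 2): Int = {
  val dynamic = conf.getOrElse("spark.dynamicAllocation.enabled", "false").toBoolean
  if (dynamic) {
    val min = conf.getOrElse("spark.dynamicAllocation.minExecutors", "0").toInt
    val max = conf.getOrElse("spark.dynamicAllocation.maxExecutors",
      Int.MaxValue.toString).toInt
    // initialExecutors falls back to minExecutors when unset
    val initial = conf.getOrElse("spark.dynamicAllocation.initialExecutors",
      min.toString).toInt
    require(min <= initial && initial <= max,
      s"initial executor count $initial must be between $min and $max")
    initial
  } else {
    conf.getOrElse("spark.executor.instances", numExecutorsDefault.toString).toInt
  }
}
```

Under this scheme, a redundant "set target to N" request from the allocation manager is harmless: the backend already holds that target and simply ignores it.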

@skyluc
Author

skyluc commented Feb 4, 2016

@andrewor14 yes, dynamic allocation works fine, but spark.dynamicAllocation.initialExecutors is not used at start-up.

@SparkQA

SparkQA commented Feb 4, 2016

Test build #50747 has finished for PR 11047 at commit 8dda6bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


One thing to note, though. Marathon won't be able to launch sbin/start-mesos-shuffle-service.sh because the script immediately goes to the background and Marathon thinks it exited. It will keep re-launching it to the end of days.

What you need is to launch it via spark-class; for instance, I'm using bin/spark-class org.apache.spark.deploy.mesos.MesosExternalShuffleService. See this discussion on mesos-user.

@andrewor14
Contributor

> but spark.dynamicAllocation.initialExecutors is not used at start-up.

Sorry, what do you mean? Isn't that what this patch is fixing?

asfgit pushed a commit that referenced this pull request Feb 4, 2016
Currently the Master would always set an application's initial executor limit to infinity. If the user specified `spark.dynamicAllocation.initialExecutors`, the config would not take effect. This is similar to #11047 but for standalone mode.

Author: Andrew Or <[email protected]>

Closes #11054 from andrewor14/standalone-da-initial.
@andrewor14
Contributor

(you might need to resolve a small conflict from my standalone patch...)

@SparkQA

SparkQA commented Feb 5, 2016

Test build #50820 has finished for PR 11047 at commit 003e865.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 5, 2016

Test build #50821 has finished for PR 11047 at commit f5ab629.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

Merged into master. If there are more comments on the docs we can address them separately.

@asfgit asfgit closed this in 0bb5b73 Feb 5, 2016