Add support for dynamic executors #4
Conversation
@foxish Here is a basic dynamic executor capability, using the per-executor shuffle server we talked about as a starting point.
if (conf.getBoolean("spark.dynamicAllocation.enabled", false)) {
  submitArgs ++= Vector(
    "--conf spark.dynamicAllocation.enabled=true",
We should actually try to propagate all Spark configurations from the user.
Definitely, that is on my to-do list
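For illustration only, a minimal sketch of what forwarding every user-set property might look like, assuming a conf: SparkConf and the submitArgs buffer from the snippet above; the helper name is made up and this is not code from the PR:

import org.apache.spark.SparkConf
import scala.collection.mutable

// Sketch only: forward every user-set Spark property to the submission
// arguments instead of whitelisting individual keys. Assumed to live in the
// same place that builds submitArgs above.
def forwardUserConf(conf: SparkConf, submitArgs: mutable.Buffer[String]): Unit = {
  submitArgs ++= conf.getAll.map { case (key, value) => s"--conf $key=$value" }
}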
  logInfo(s"Adding $delta new executor Pods")
  createExecutorPods(delta)
} else if (delta < 0) {
  logInfo(s"Deleting ${-delta} new executor Pods")
I don't think Spark should request a total number of executors that is less than what's already running; we should just assert this.
In the dynamic-executor case, Spark will scale back what it's requesting. For example, as a job starts to complete and there are fewer partitions left to compute, it will start requesting fewer executors than it is currently running, and eventually drops to zero.
In this case, how do you plan to pick the executor pods to delete? I can think of a few ways to do this, but ultimately the external shuffle service would be needed here to prevent losing shuffle blocks (for safe deletion of executors). If this is in place, then perhaps we don't need to be particularly mindful of which executor pods get removed, so long as the resources are returned to the cluster.
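Purely as a sketch of the behavior being discussed, here is one shape the delta logic could take, favoring idle executors on scale-down. The pod bookkeeping and the create/delete helpers are hypothetical stand-ins, not this PR's implementation:

import scala.collection.mutable

// Sketch only: scale the pod count toward the total requested by dynamic
// allocation, preferring to delete idle executors so in-flight tasks are not
// disrupted. An external shuffle service is still what makes deletion safe
// for shuffle blocks.
class ExecutorPodScaler(createPods: Int => Unit, deletePod: String => Unit) {
  val runningPods = mutable.LinkedHashSet.empty[String] // all executor pod names
  val idlePods    = mutable.LinkedHashSet.empty[String] // subset with no active tasks

  def scaleTo(requestedTotal: Int): Unit = {
    val delta = requestedTotal - runningPods.size
    if (delta > 0) {
      createPods(delta)
    } else if (delta < 0) {
      // Idle pods first, then whatever else is needed to reach the target.
      val victims = (idlePods.toSeq ++ (runningPods -- idlePods).toSeq).take(-delta)
      victims.foreach(deletePod)
    }
  }
}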
if (conf.getBoolean("spark.dynamicAllocation.enabled", false)) {
  submitArgs ++= Vector(
    "dynamic-executors")
What's this conf for?
That is a special argument that signals to the shim script that it needs to spin up the shuffle service before it launches the executor backend. It's a bit of a hack, but putting it right at the beginning made it easy to check for and remove. Shell handling of argument lists isn't very sophisticated :)
My original plan was to just detect the --conf spark.dynamicAllocation.enabled flag, but the ExecutorBackend doesn't recognize --conf args (it appears to expect its conf sent from the driver), so I needed something easy to strip off the argument list. Alternatively, the pod could be configured with some additional environment variable, but that isn't any simpler, afaict.
How about just setting an environment variable? An argument makes it hard to tell what this is really for; with an env var we can at least use a descriptive name like "SPARK_LAUNCH_SHUFFLE_SERVICE=1".
And before we go too far, we should perhaps ask @rxin or someone familiar with shuffle whether they see a long-term problem with this.
Where is the code that just launches the shuffle service? I don't see it in the PR.
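If the env-var route were taken, one possible sketch of attaching it to the executor container, assuming the fabric8 Kubernetes client is used to build the pod spec; the container name is illustrative and nothing here is from the PR:

import io.fabric8.kubernetes.api.model.{ContainerBuilder, EnvVarBuilder}

// Sketch only: expose the proposed flag to the shim via the container env,
// so it can decide whether to start the shuffle service before the executor.
val shuffleFlag = new EnvVarBuilder()
  .withName("SPARK_LAUNCH_SHUFFLE_SERVICE")
  .withValue("1")
  .build()

val executorContainer = new ContainerBuilder()
  .withName("spark-executor")                 // illustrative container name
  .withImage("manyangled/kube-spark:dynamic")
  .addToEnv(shuffleFlag)
  .build()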
clientJarUri,
s"--class=${args.userClass}",
s"--master=$kubernetesHost",
s"--executor-memory=${driverDescription.mem}",
Also, realize we need to set cores too; however, we should just forward all user-specified Spark conf to the executors and override where needed.
Note to self: another to-do is to make sure the executor pods are configured with core/mem requirements that align with the corresponding Spark configurations for core/mem.
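A rough sketch of that alignment, again assuming fabric8 is used for pod construction; the helper name and the crude g/m to Gi/Mi conversion are simplifications, not code from this PR:

import io.fabric8.kubernetes.api.model.{Quantity, ResourceRequirementsBuilder}
import org.apache.spark.SparkConf

// Sketch only: derive the executor pod's resource requests from the
// corresponding Spark settings (spark.executor.cores / spark.executor.memory).
def executorResources(conf: SparkConf) = {
  val cores  = conf.get("spark.executor.cores", "1")
  val memory = conf.get("spark.executor.memory", "1g")
    .toLowerCase.replace("g", "Gi").replace("m", "Mi") // rough unit mapping
  new ResourceRequirementsBuilder()
    .addToRequests("cpu", new Quantity(cores))
    .addToRequests("memory", new Quantity(memory))
    .build()
}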
Thanks, merging this for now in order to iterate.

* Add support for dynamic executors
* fill in some sane logic for doKillExecutors
* doRequestTotalExecutors signals graceful executor shutdown, and favors idle executors
@erikerlandson, I just tried the old example with this change, passing in
@foxish I've seen that once or twice. It needs two things:
I started those today
@foxish, I'll put those on a new PR.
Should we wait for the executors from the previous scaling round to come up before we scale further?
@foxish, that isn't really under the control of the scheduler back-end, but the configuration
I don't think we really need to, since dynamic allocation gives you the total number it wants; if we detect that a good number are still launching, just running a few more should be fine. About failure detection and handling: on the Mesos/Spark side we have bad-host detection for hosts that continuously have problems launching.
For reference, these are the dynamic executor parameters, which I will explicitly pass to the executors via
What's happening currently is that it takes rather longer than 1 second for new executor pods to spin up and register with the driver. So it just keeps scaling up, because it is still backlogged. Then the new pods start to thrash with the actual computation, and it goes into a positive feedback loop. Setting the backlog timeout to some appropriate value will fix it, as would setting some maximum number of executors.
I'm just concerned that even with a larger timeout, it may not be sufficient, and we would always have to specify a reasonable upper bound (spark.dynamicAllocation.maxExecutors). We could have scheduling delays or just slow networks taking a long time to pull the Docker image, which would land us in the same feedback loop, with tasks continuing to stay pending and the allocation trying to scale up aggressively.
@foxish as long as the timeout is reasonably large, it will be able to keep up.
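For example, a submit-side configuration along these lines would lengthen the backlog timeouts and cap the executor count; the keys are standard Spark dynamic-allocation settings, and the values are only placeholders, not recommendations from this PR:

import org.apache.spark.SparkConf

// Sketch only: bound the scale-up feedback loop by raising the backlog
// timeouts and capping the number of executors.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.schedulerBacklogTimeout", "30s")
  .set("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "30s")
  .set("spark.dynamicAllocation.maxExecutors", "10")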
@foxish This is the code that checks for the environment variable:

val targetNumExecutors =
  sys.env
    .get("SPARK_EXECUTOR_INSTANCES")
    .map(_.toInt)
    .getOrElse(numExecutors)
conf.get(EXECUTOR_INSTANCES).getOrElse(targetNumExecutors)
@erikerlandson Yes, you're right. We don't want to get that value from env-vars, so it can be replaced by the read from
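A sketch of that replacement, assuming it lives inside Spark's scheduler backend where EXECUTOR_INSTANCES (the entry for spark.executor.instances) is visible; numExecutors stands in for the existing default:

import org.apache.spark.SparkConf
import org.apache.spark.internal.config.EXECUTOR_INSTANCES

// Sketch only: drop the SPARK_EXECUTOR_INSTANCES env-var lookup and read the
// executor count from the Spark conf, falling back to the existing default.
def targetExecutors(conf: SparkConf, numExecutors: Int): Int =
  conf.get(EXECUTOR_INSTANCES).getOrElse(numExecutors)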
This PR adds support for dynamic executors. It works by running the shuffle server in the same container as the executor.
To use it, you will want to run with my latest image: manyangled/kube-spark:dynamic

You must configure dynamic executors, for example by submitting with
--conf spark.dynamicAllocation.enabled=true