[WIP][SPARK-31198][CORE] Use graceful decommissioning as part of dynamic scaling #28818
Conversation
Test build #123956 has finished for PR 28818 at commit
agrawaldevesh left a comment:
Hi @holdenk. Using proper decommissioning for dynamic allocation would be great and would help unify these code paths. Thank you again for working on this.
My two comments below are mere nits, just to make sure I am following along. It reads fine as is.
Review thread on core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala (resolved)
Review thread on core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala (outdated, resolved)
Test build #123995 has finished for PR 28818 at commit
Force-pushed from ef3f523 to 7691d2d
Test build #124139 has finished for PR 28818 at commit
 * committed.
 */
event.blockUpdatedInfo.blockId match {
  case ShuffleDataBlockId(shuffleId, _, _) => exec.addShuffle(shuffleId)
Since we are touching ExecutorMonitor, when do we get the counterpart operation, exec.removeShuffle? In this PR, it seems that executorsKilled is used. Is that enough?
cc @dbtsai
Yeah, since we're only doing migrations during decommissioning, whatever shuffle files remain on the host will be cleaned up when it dies. I can't think of why we would need a delete operation here as well, but if it would be useful for your follow-on work I can add it?
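For readers following along, here is a minimal sketch of the kind of per-executor shuffle bookkeeping being discussed; the class and field names are illustrative, not the actual ExecutorMonitor internals.

import scala.collection.mutable

// Illustrative sketch only: track which shuffles have data on an executor,
// in the spirit of exec.addShuffle being called on a ShuffleDataBlockId update.
class ExecutorShuffleTracker {
  private val shuffleIds = mutable.Set[Int]()

  // Called when a shuffle block is reported as stored on this executor.
  def addShuffle(shuffleId: Int): Unit = synchronized { shuffleIds += shuffleId }

  // A hypothetical counterpart; per the discussion above, the PR instead relies
  // on cleanup happening when the decommissioned host finally goes away.
  def removeShuffle(shuffleId: Int): Unit = synchronized { shuffleIds -= shuffleId }

  def hasActiveShuffle: Boolean = synchronized { shuffleIds.nonEmpty }
}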
Test build #124154 has finished for PR 28818 at commit
Force-pushed from 9bb0293 to 81f9dbb
Test build #124179 has finished for PR 28818 at commit
Test build #124181 has finished for PR 28818 at commit
cc @tgravescs too
To be clear, this is a WIP PR and not yet ready for review. I created it to give additional context after talking with @agrawaldevesh about our respective goals. I'll try to get a "ready for review" PR out next week.
Just as a follow-up: since @HyukjinKwon requested an SPIP for this work, I won't be moving this PR out of WIP this week as originally planned.
agrawaldevesh left a comment:
One question inline, please.
I didn't quite follow why this PR is rebased on top of https://issues.apache.org/jira/browse/SPARK-31197 and https://issues.apache.org/jira/browse/SPARK-20629. They seem orthogonal to me. Can you please update the PR description to reflect the relationship between this PR and these Jira tickets? Thanks!
 */
def decommissionExecutor(executorId: String): Boolean = {
  val decommissionedExecutors = decommissionExecutors(Seq(executorId),
    adjustTargetNumExecutors = true)
Why is adjustTargetNumExecutors defaulting to true here? This would mean that all schedulers would try to replenish the executor when asked to DecommissionExecutor(...) -- for example by the Master or when an executor gets a SIGPWR.
I think it shouldn't be the default -- it should at least be configurable. It only makes sense to have adjustTargetNumExecutors=true when called from org.apache.spark.streaming.scheduler.ExecutorAllocationManager#killExecutor (i.e. when it is truly called from the dynamic allocation codepath and we have decided that we want to replenish the executor).
If you look above, there is a configurable call. This matches how killExecutor is implemented down on line 124.
Can you please point me to where the configurable call is? I don't see a config check in the code paths that call this method.
It's fine for killExecutor to unconditionally adjust the target number of executors because it is only called in the dynamic allocation codepath, but decommissionExecutor would be called from many other codepaths as well (for example when the driver gets a DecommissionExecutor message) -- and thus I think it shouldn't just assume that it should replenish the executor.
Look at line 95 of this file. I think we should match the semantics of killExecutor as much as possible. If there's a place where we don't want it, we can use decommissionExecutors.
Hmm, should we rename decommissionExecutor (singular) to decommissionAndKillExecutor to reflect its purpose better? It would be too easy to confuse it with decommissionExecutors (on line 95 of this file, which allows not replenishing the target number of executors).
Do you want to make the change to the callers of decommissionExecutor in this PR and switch them to decommissionExecutors(Seq(executorId), false) instead? The ones I am most concerned about are:
- The handling of the DecommissionExecutor message (both sync and async variants) in CoarseGrainedSchedulerBackend
- StandaloneSchedulerBackend.executorDecommissioned
In both of the above cases, I think we may not always want replenishing. For example, in the standalone case, when the Worker gets a SIGPWR, do we want to replenish the executors on the remaining workers (i.e. oversubscribe the remaining workers)? Similarly, when an executor gets a SIGPWR, do we want to put that load on the remaining executors? I think the answer to both should be NO unless we are doing dynamic allocation.
Personally I am fine with any choice of naming here as long as the semantics are not silently changed under the covers, as is the case presently.
It's a new function, what are we changing?
ExecutorAllocationClient is a base class of CoarseGrainedSchedulerBackend. We moved decommissionExecutor from the latter class to the former and as such it is not a new function. Since CoarseGrainedSchedulerBackend no longer overrides decommissionExecutor, ExecutorAllocationClient.decommissionExecutor will be called when CoarseGrainedSchedulerBackend gets a DecommissionExecutor message -- and the semantics of that codepath have been changed to unconditionally impose adjustTargetNumExecutors=true.
Cool, I'll update the previous calls to decommissionExecutor.
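To make the API shape under discussion concrete, here is a simplified sketch with assumed signatures, not the actual ExecutorAllocationClient code.

// Simplified sketch of the plural/singular split discussed above.
trait AllocationClientSketch {
  // Plural form: the caller chooses whether the scheduler's target executor
  // count should be adjusted as part of the decommission.
  def decommissionExecutors(
      executorIds: Seq[String],
      adjustTargetNumExecutors: Boolean): Seq[String]

  // Singular convenience wrapper. Hard-coding adjustTargetNumExecutors = true
  // here is exactly what the review thread questions, since callers outside
  // dynamic allocation (e.g. a SIGPWR-triggered decommission) may not want
  // the target count touched.
  def decommissionExecutor(executorId: String): Boolean = {
    decommissionExecutors(Seq(executorId), adjustTargetNumExecutors = true).nonEmpty
  }
}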
Force-pushed from 2674055 to 9565c40
@agrawaldevesh So this PR depends on the behaviour of the VM eventually exiting (https://issues.apache.org/jira/browse/SPARK-31197), since it's replacing the usage of killExecutor during dynamic allocation.
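Roughly, the executor-side behaviour this PR leans on from SPARK-31197 could be pictured as below; the polling loop and the two predicates are assumptions for illustration, not the actual implementation.

import java.util.concurrent.TimeUnit

// Sketch: once decommissioning has been signalled and outstanding block/shuffle
// migrations are done, the executor exits on its own rather than waiting for a kill.
object DecommissionExitSketch {
  def exitWhenDecommissionFinished(
      isDecommissioned: () => Boolean,
      migrationsComplete: () => Boolean): Unit = {
    val poller = new Thread(() => {
      while (!(isDecommissioned() && migrationsComplete())) {
        TimeUnit.SECONDS.sleep(1) // poll until it is safe to go away
      }
      System.exit(0) // nothing left here that would force a recompute
    })
    poller.setDaemon(true)
    poller.start()
  }
}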
/**
 * Mark a given executor as decommissioned and stop making resource offers for it.
 */
private def decommissionExecutor(executorId: String): Boolean = {
@holdenk ... to answer your question: This block of code was moved to the base class ExecutorAllocationClient. So the code in ExecutorAllocationClient is not "New". Furthermore, the semantics of this code were changed as it was moved to now unconditionally replenish the executors.
Makes sense.
Test build #126273 has finished for PR 28818 at commit
Test build #126281 has finished for PR 28818 at commit
Force-pushed from 3db60f2 to daf96dd
Test build #126292 has finished for PR 28818 at commit
Commit messages (truncated):
- Because the mock always says there is an RDD we may replicate more than once, and now that there are independent threads
- …we don't scale down too low, update the streaming ExecutorAllocationManager to also delegate to decommission Fix up executor add for resource profile
- …eanup the locks we use in decommissioning and clarify some more bits.
- …ster manager are re-launched
- …ion manager suite.
- …o that we can match the pattern for killExecutor/killExecutors
- This reverts commit daf96dd.
Force-pushed from daf96dd to f921ddd
Test build #126435 has finished for PR 28818 at commit
@holdenk can this PR be abandoned/closed now since this is finally in?
This is WIP since it is on top of SPARK-31197 (which itself is WIP on top of SPARK-20629) and should probably have more testing. We use SPARK-31197's exiting of the executor once decommissioning is finished, which allows us to replace the usage of killExecutor with decommissionExecutor when enabled during dynamic allocation.
What changes were proposed in this pull request?
If graceful decommissioning is enabled, Spark's dynamic scaling uses this instead of directly killing executors.
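As a hedged sketch of that decision point (placeholder names and a simplified Client trait, not the PR's actual code):

object ScaleDownSketch {
  trait Client {
    def decommissionExecutors(ids: Seq[String], adjustTargetNumExecutors: Boolean): Seq[String]
    def killExecutors(ids: Seq[String]): Seq[String]
  }

  // If graceful decommissioning is on, ask for a decommission and let the
  // executor drain and exit on its own; otherwise fall back to a direct kill.
  def removeExecutors(client: Client, ids: Seq[String], decommissionEnabled: Boolean): Seq[String] = {
    if (decommissionEnabled) {
      client.decommissionExecutors(ids, adjustTargetNumExecutors = true)
    } else {
      client.killExecutors(ids)
    }
  }
}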
Why are the changes needed?
When scaling down Spark, we should avoid triggering recomputes as much as possible.
Does this PR introduce any user-facing change?
Hopefully users' jobs run faster. It also enables experimental shuffle-service-free decommissioning when graceful decommissioning is enabled.
How was this patch tested?
For now I've extended the ExecutorAllocationManagerSuite to cover this.
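As an illustration of the kind of coverage described, here is a hypothetical test shape with a stub client; it is not the actual suite code.

import scala.collection.mutable

object DecommissionScaleDownTestSketch {
  // Stub that records which path was taken during scale-down.
  class RecordingClient {
    val decommissioned = mutable.Buffer[String]()
    val killed = mutable.Buffer[String]()
    def decommissionExecutors(ids: Seq[String], adjustTargetNumExecutors: Boolean): Seq[String] = {
      decommissioned ++= ids
      ids
    }
    def killExecutors(ids: Seq[String]): Seq[String] = {
      killed ++= ids
      ids
    }
  }

  // In a ScalaTest suite this would live inside a test(...) block; kept as a
  // plain method here to stay self-contained.
  def scaleDownUsesDecommission(): Unit = {
    val client = new RecordingClient
    val idle = Seq("1", "2")
    val decommissionEnabled = true // assumed flag for illustration, not a real Spark config
    if (decommissionEnabled) {
      client.decommissionExecutors(idle, adjustTargetNumExecutors = true)
    } else {
      client.killExecutors(idle)
    }
    assert(client.decommissioned.toSeq == idle)
    assert(client.killed.isEmpty)
  }
}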