[SPARK-34509][K8S] Make dynamic allocation upscaling more progressive on K8S #31790
Conversation
Test build #135910 has finished for PR 31790 at commit

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #136942 has finished for PR 31790 at commit

Kubernetes integration test starting

Kubernetes integration test status failure
Test build #137874 has finished for PR 31790 at commit
With this change pending PODs are not counted as outstanding PODs, so their number can grow quite high in the k8s cluster. Still, I would keep the allocation batch size to limit the maximum number of POD requests made at once. I am thinking about introducing a new limit for the maximum number of pending PODs (in case k8s struggles to handle a high number of pending PODs). This new limit must be significantly higher than the POD allocation batch size (we could even derive it from the batch size using a constant multiplier like * 10, or make the factor configurable). WDYT?
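A minimal sketch of the derivation floated above; `pendingPodsFactor` is a hypothetical knob, nothing in this PR defines it:

```scala
// Hypothetical derivation of a pending-POD cap from the allocation batch size.
// Neither `pendingPodsFactor` nor `maxPendingPods` below is code from this PR.
val allocationBatchSize = 5                                  // spark.kubernetes.allocation.batch.size default
val pendingPodsFactor = 10                                   // imagined constant multiplier; could be made configurable
val maxPendingPods = allocationBatchSize * pendingPodsFactor // = 50 with the defaults above
```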
Kubernetes integration test starting

Kubernetes integration test status failure
I decided to add the limit for Pending PODs, stay tuned!
Test build #139379 has finished for PR 31790 at commit

Test build #139380 has finished for PR 31790 at commit

Kubernetes integration test starting

Kubernetes integration test starting

Kubernetes integration test status success

Kubernetes integration test status success

Test build #139383 has finished for PR 31790 at commit

Kubernetes integration test starting

Kubernetes integration test status success
WDYT about this change (having a separate limit for pending pods and making the batch allocation limit a bit more relaxed so that allocation is a bit more progressive)? PS: next Tuesday I will go on holiday without my laptop (so this is not urgent at all).
If you can update this to the latest master, that would be rad. I think this is very much needed.
holdenk left a comment
I like this concept. One question around the default: it seems a little high.
    .version("3.2.0")
    .intConf
    .checkValue(value => value > 0, "Maximum number of pending pods should be a positive integer")
    .createWithDefault(150)
This default seems high, can you explain why 150?
Thanks for the review!
My main intention was to come up with a limit which protects us from overloading the k8s scheduler but still allows progressive upscaling. I think when a POD spends a long time in the pending state we should make the allocation as early as possible, while being careful to avoid overloading the k8s scheduler, as that would be counterproductive for the allocations. And as we still use the batch size during upscaling, there is a limited number of active new POD requests from a single Spark application (this also helps avoid the overloading).
The second reason was that I hoped this is a good default for those envs where the batch size is already increased (I have seen examples where the batch size was set to 50).
But I just ran a few tests and, although 150 pending PODs did not cause any problem during resource allocation, my test was running in an EKS cluster where only one Spark app was submitted (my test app) and even the cluster size was small.
So nevertheless we can go for a different solution:
Another strategy to choose this limit would be to use a default which conforms to the default batch size (which is really small = 5). So what about setting the default to 15 here? In this case we can mention this new config in the migration guide.
@holdenk WDYT?
I'd like to propose disabling this feature in Apache Spark 3.2.0 to remove the side effect completely. For example, we can use Int.MaxValue as the default to disable this feature.
WDYT, @attilapiros and @holdenk?
And, it would be great if we keep the existing test cases with the default configuration (disabled), and add new test coverage for this new conf.
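For illustration, a sketch of the disable-by-default idea above, reusing the config name and messages from this PR's diff (this is not the final merged code):

```scala
import org.apache.spark.internal.config.ConfigBuilder

// Sketch only: same builder chain as in the PR's diff, but with Int.MaxValue as the
// default so the pending-POD cap is effectively disabled unless the user sets it.
val KUBERNETES_MAX_PENDING_PODS =
  ConfigBuilder("spark.kubernetes.allocation.max.pendingPods")
    .doc("Maximum number of pending pods allowed during executor allocation for this application.")
    .version("3.2.0")
    .intConf
    .checkValue(value => value > 0, "Maximum number of pending pods should be a positive integer")
    .createWithDefault(Int.MaxValue)
```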
So this PR has 3 small features which relate to each other:
- modify how the batch size limit is taken into account: earlier the next batch was not started while even one POD was still stuck in the newly created state; this causes some of the test changes
- change what outstanding PODs are: earlier both pending PODs and newly created PODs were counted as outstanding PODs, which stopped the allocation even when the delay came from the scheduler; this causes the rest of the unit test differences (see the sketch below)
- introduce a limit for pending PODs

Let me separate them into different PRs (at least two); this will make the review easier.
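A small illustration of the second point, with made-up counts and variable names (not the PR's exact code):

```scala
// Made-up example counts to show the change in what "outstanding" means.
val numNewlyCreatedPods = 3 // requested by the driver, not yet seen by the k8s scheduler
val numPendingPods = 40     // known to the scheduler, waiting to be scheduled

// Before: pending PODs also counted as outstanding, so upscaling stalled on scheduler delay.
val outstandingBefore = numNewlyCreatedPods + numPendingPods // 43 blocks the next batch
// After: only newly created PODs count as outstanding; pending PODs get their own separate cap.
val outstandingAfter = numNewlyCreatedPods                   // 3, so the next batch can still start
```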
Sounds good, ping me on your split PRs and I'll be happy to take a look.
jenkins retest this please

Oh, I missed your ping here. Sorry for being late. I'll review right now.
val KUBERNETES_MAX_PENDING_PODS =
  ConfigBuilder("spark.kubernetes.allocation.max.pendingPods")
    .doc("Maximum number of pending pods allowed during executor alloction for this application.")
alloction -> allocation?
    .createWithDefault(5)

val KUBERNETES_MAX_PENDING_PODS =
  ConfigBuilder("spark.kubernetes.allocation.max.pendingPods")
This introduces a new namespace, max, with only one child, pendingPods. Do you have a plan to add more? Otherwise, we need to reduce the depth like maxPendingPods.
}

var totalPendingCount = 0
var sumPendingPods = 0
Can we reuse the old variable totalPendingCount? It looks similar enough to sumPendingPods in the next context.
dongjoon-hyun left a comment
Hi, @attilapiros. The feature looks reasonable, but we should introduce a new feature more safely. I left a few comments.
We can add this feature with Int.MaxValue as the default. And, it would be great if we keep the existing test cases with the default configuration (disabled), and add new test coverage for this new conf.
Test build #140389 has finished for PR 31790 at commit

Kubernetes integration test starting

Kubernetes integration test status success
Just following up @attilapiros, let me know when it's ready for review again.
Thanks @holdenk! I plan to open the new PR which introduces the pending pod limit this week.
I'm closing this as #33492 went in and the POD limit was the common code between them.
What changes were proposed in this pull request?
Making upscaling more progressive to eagerly go up to the configured allocation batch size even if there is an outstanding POD request.
In addition, the pending PODs are removed from the outstanding POD request category (`numOutstandingPods` is renamed to `numNewlyCreatedUnknownPods` to reflect this change, where unknown means it is not yet known by the scheduler), and the batch size now only limits the number of newly created POD requests. `KUBERNETES_ALLOCATION_BATCH_DELAY` is kept as `processBatchIntervalMillis`. This way the driver CPU still won't be overwhelmed when new PODs are requested, as we still stop at a limit.
For pending PODs a separate limit is introduced, called `spark.kubernetes.allocation.max.pendingPods`.
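As a rough illustration of the new behaviour, a sketch with hypothetical names (not the actual `ExecutorPodsAllocator` code):

```scala
// Sketch: how many new executor POD requests one allocation round could make under the
// scheme described above. The helper and all parameter names are hypothetical.
def podsToRequest(
    targetExecutors: Int,
    runningExecutors: Int,
    numPendingPods: Int,
    numNewlyCreatedUnknownPods: Int,
    batchSize: Int,      // spark.kubernetes.allocation.batch.size
    maxPendingPods: Int  // spark.kubernetes.allocation.max.pendingPods
  ): Int = {
  // Executors still missing compared to the dynamic allocation target.
  val missing = targetExecutors - runningExecutors - numPendingPods - numNewlyCreatedUnknownPods
  // The batch size only limits newly created (not yet known by the scheduler) POD requests ...
  val batchRoom = batchSize - numNewlyCreatedUnknownPods
  // ... while pending PODs are capped separately, so a slow scheduler no longer blocks
  // upscaling until the dedicated pending limit is reached.
  val pendingRoom = maxPendingPods - numPendingPods - numNewlyCreatedUnknownPods
  math.max(0, List(missing, batchRoom, pendingRoom).min)
}
```

For example, with the defaults from the diff above (batch size 5, pending cap 150), a round with 2 newly created and 10 pending PODs would request at most min(5 - 2, 150 - 12) = 3 more PODs, assuming the target still calls for them.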
Why are the changes needed?
Before this PR the executor PODs allocator stopped requesting executor PODs when even one POD request from the current batch was outstanding (either a newly created POD request or a pending POD), so even one slowly scheduled POD could stop the allocator from allocating more PODs.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing unit test.