[SPARK-36052][K8S] Introducing a limit for pending PODs #33492
    var notRunningPodCountForRpId =
      currentPendingExecutorsForRpId.size + schedulerKnownPendingExecsForRpId.size +
      newlyCreatedExecutorsForRpId.size + schedulerKnownNewlyCreatedExecsForRpId.size
    val podCountForRpId = currentRunningCount + notRunningPodCountForRpId
This is again a rename to avoid the "known" prefix, as these are not scheduler-known PODs but the PODs for this resource profile.
        currentTime - createTime > executorIdleTimeout
      }.keys.take(excess).toList
-   val knownPendingToDelete = currentPendingExecutorsForRpId
+   val pendingToDelete = currentPendingExecutorsForRpId
Last rename, for the same reason as earlier: these are PODs unknown to the scheduler, so they are safe to remove here (no task can be scheduled on them).
Test build #141548 has finished for PR 33492 at commit
      .asScala
      .toSeq
      .sortBy(_._1)
      .flatMap { case (rpId, targetNum) =>
Instead of foreach, flatMap is used here, as we need to do the processing in two steps: first count all the not-running PODs across all the resource profiles, and only then decide how to split the remaining pending POD slots between the resource profiles.
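To illustrate, a minimal self-contained sketch of that two-step shape (the names and numbers are illustrative, not the PR's actual code): the flatMap pass drops profiles that are already at their target while the not-running PODs of all profiles get counted, and only the second pass hands out the remaining pending-POD slots.

```scala
// Hypothetical, simplified sketch of the two-pass structure (not the PR's exact code).
object TwoPassAllocationSketch {
  def main(args: Array[String]): Unit = {
    val maxPendingPods = 5               // assumed global limit on pending PODs
    val targets = Map(0 -> 10, 1 -> 4)   // rpId -> target executor count
    val running = Map(0 -> 6, 1 -> 1)    // rpId -> running PODs
    val notRunning = Map(0 -> 2, 1 -> 1) // rpId -> pending + newly created PODs

    // Pass 1 (the flatMap): keep only the profiles that still need more PODs,
    // after the not-running PODs of *all* profiles have been counted.
    val totalNotRunning = notRunning.values.sum
    val wanting = targets.toSeq.sortBy(_._1).flatMap { case (rpId, targetNum) =>
      val podCountForRpId = running(rpId) + notRunning(rpId)
      if (podCountForRpId < targetNum) Some((rpId, podCountForRpId, targetNum)) else None
    }

    // Pass 2: only now can the remaining pending-POD slots be shared out.
    var remainingSlots = math.max(maxPendingPods - totalNotRunning, 0)
    wanting.foreach { case (rpId, podCountForRpId, targetNum) =>
      val toRequest = math.min(targetNum - podCountForRpId, remainingSlots)
      remainingSlots -= toRequest
      println(s"rpId=$rpId: would request $toRequest new executor POD(s)")
    }
  }
}
```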
Test build #141549 has finished for PR 33492 at commit
Kubernetes integration test starting
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test status success
The failure is a TPC-DS query run, which must be unrelated.
Thank you for pinging me, @attilapiros.
holdenk left a comment:
Thanks for working on this. I'm a little confused about the code; if you could clarify in the places where I asked questions, I'd really appreciate it.
          Some(rpId, podCountForRpId, targetNum)
        } else {
          // for this resource profile we do not request more PODs
          None
I removed my previous comment because I'm also not sure what the best way is to inform the users about this situation. Do you think we have a good way to inform the users when we hit this limitation, @attilapiros?
We could change this to logInfo (lines 323 to 324 in adc512d):

    logDebug(s"Still waiting for ${newlyCreatedExecutorsForRpId.size} executors for " +
      s"ResourceProfile Id $rpId before requesting more.")
But for a higher batch allocation size this message could be annoying, as every POD status change would generate such a log line until it reaches 0.
        pvcsInUse: Seq[String]): Unit = {
      val numExecutorsToAllocate = math.min(expected - running, podAllocationSize)
      logInfo(s"Going to request $numExecutorsToAllocate executors from Kubernetes for " +
        s"ResourceProfile Id: $resourceProfileId, target: $expected running: $running.")
This message would be better placed here, inside requestNewExecutors.
My reason for moving it was to avoid passing more variables to that method.
So, starting from this (lines 343 to 346 in adc512d):

    logInfo(s"Going to request $numExecutorsToAllocate executors from Kubernetes for " +
      s"ResourceProfile Id: $rpId, target: $targetNum, known: $podCountForRpId, " +
      s"sharedSlotFromPendingPods: $sharedSlotFromPendingPods.")
    requestNewExecutors(numExecutorsToAllocate, applicationId, rpId, k8sKnownPVCNames)
we would need to pass:
- targetNum
- podCountForRpId
- sharedSlotFromPendingPods

and they are only needed for the log line and to calculate numExecutorsToAllocate.
With the current solution numExecutorsToAllocate is enough, and when we later extend the logic to consider more limits for allocation, numExecutorsToAllocate will still be enough.
WDYT?
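To make the trade-off concrete, a hedged sketch with simplified, hypothetical signatures (the real method's signature differs; this is not the PR's code): with the log at the call site, the final count is essentially all that crosses the method boundary.

```scala
// Illustrative sketch only; names mirror the discussion above but are simplified.
object LogPlacementSketch {
  // With the log at the call site, the method only needs the final count:
  def requestNewExecutors(numExecutorsToAllocate: Int, appId: String, rpId: Int): Unit = {
    // ... build and submit numExecutorsToAllocate POD requests here ...
  }

  def main(args: Array[String]): Unit = {
    val applicationId = "app-1234" // placeholder values for the sketch
    val rpId = 0
    val (targetNum, podCountForRpId) = (10, 6)
    val (podAllocationSize, sharedSlotFromPendingPods) = (5, 3)

    // targetNum, podCountForRpId and sharedSlotFromPendingPods are only needed
    // here, for the log line and to compute numExecutorsToAllocate; passing them
    // into requestNewExecutors would serve nothing but the message.
    val numExecutorsToAllocate =
      Seq(targetNum - podCountForRpId, podAllocationSize, sharedSlotFromPendingPods).min
    println(s"Going to request $numExecutorsToAllocate executors from Kubernetes for " +
      s"ResourceProfile Id: $rpId, target: $targetNum, known: $podCountForRpId, " +
      s"sharedSlotFromPendingPods: $sharedSlotFromPendingPods.")
    requestNewExecutors(numExecutorsToAllocate, applicationId, rpId)
  }
}
```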
@dongjoon-hyun what is your opinion?
dongjoon-hyun left a comment:
In general, it looks like a good improvement. BTW, I'm curious if you are going to support maxPendingPods per resource profile later.
I am happy you like this. Actually, I haven't thought about supporting maxPendingPods per resource profile, but if you think that would be valuable for our users I can do it easily in a new PR.
Test build #141832 has finished for PR 33492 at commit
Kubernetes integration test starting
Kubernetes integration test status success
@holdenk, @dongjoon-hyun may I ask for one more pass?
holdenk left a comment:
LGTM. I don't have strong feelings on the logging message location, but it would be good to give dongjoon some time in case he has strong feelings about it.
dongjoon-hyun left a comment:
+1, LGTM.
Sorry for missing this, @attilapiros and @holdenk .
I thought this was already merged.
@dongjoon-hyun, @holdenk Thanks to all of you!
Hi, @attilapiros and @holdenk.
cc @gengliangwang, too.
Introducing a limit for pending PODs (newly created/requested executors included).

This limit is global for all the resource profiles. So first we have to count all the newly created and pending PODs (decreased by the ones which requested to be deleted), then we can share the remaining pending POD slots among the resource profiles.

Without this PR dynamic allocation could request too many PODs and the K8S scheduler could be overloaded and scheduling of PODs will be affected by the load.

No.

With new unit tests.

Closes #33492 from attilapiros/SPARK-36052.

Authored-by: attilapiros <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 1dced49)
Signed-off-by: Dongjoon Hyun <[email protected]>
@attilapiros Thanks for the work.
Thank you, @gengliangwang. According to your suggestion, I added
Introducing a limit for pending PODs (newly created/requested executors included).

This limit is global for all the resource profiles. So first we have to count all the newly created and pending PODs (decreased by the ones which requested to be deleted), then we can share the remaining pending POD slots among the resource profiles.

Without this PR dynamic allocation could request too many PODs and the K8S scheduler could be overloaded and scheduling of PODs will be affected by the load.

No.

With new unit tests.

Closes apache#33492 from attilapiros/SPARK-36052.

Authored-by: attilapiros <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 1dced49)
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit eb09be9)
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 25105db)
Signed-off-by: Dongjoon Hyun <[email protected]>
…Allocator#onNewSnapshots`

What changes were proposed in this pull request?
This PR just removes the unused local `val outstanding` from `ExecutorPodsAllocator#onNewSnapshots`; the `outstanding > 0` check was replaced by `newlyCreatedExecutorsForRpId.nonEmpty` after SPARK-36052 / #33492.

Why are the changes needed?
Removes an unused local val.

Does this PR introduce any user-facing change?
No.

How was this patch tested?
Pass GitHub Actions.

Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44703 from LuciferYang/minor-val-outstanding.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
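As a side note on that cleanup, a trivial self-contained snippet (not the PR's code; the map content is illustrative) showing the equivalence the commit relies on: for any Scala collection, `c.size > 0` and `c.nonEmpty` agree, with `nonEmpty` being the idiomatic form.

```scala
object NonEmptySketch {
  def main(args: Array[String]): Unit = {
    val newlyCreatedExecutorsForRpId = Map(3L -> 0, 4L -> 0) // illustrative content
    // The replaced check and its idiomatic equivalent always agree:
    assert((newlyCreatedExecutorsForRpId.size > 0) == newlyCreatedExecutorsForRpId.nonEmpty)
    println(newlyCreatedExecutorsForRpId.nonEmpty) // true
  }
}
```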
What changes were proposed in this pull request?
Introducing a limit for pending PODs (newly created/requested executors included) per resource profile. There exists a config for a global limit for all resource profiles, but here we add a limit per resource profile. #33492 does a lot of the plumbing for us already, counting newly created and pending pods, and we can just pass through the pending pods per resource profile, and limit the number of requests we were going to make for pods for that resource profile to min(previousRequest, maxPodsPerRP).

Why are the changes needed?
For multiple resource profile use cases you can set limits that apply at the resource profile level, instead of globally.

Does this PR introduce any user-facing change?
No.

How was this patch tested?
Unit tests added.

Was this patch authored or co-authored using generative AI tooling?
No.

Closes #51913 from ForVic/vsunderl/max_pending_pods_per_rpid.

Authored-by: ForVic <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
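A tiny illustrative sketch of that capping step; `maxPodsPerRP` here is just a stand-in name for the per-profile limit, not necessarily the real config key.

```scala
object PerRpCapSketch {
  // Cap what the global slot-sharing decided for one resource profile.
  def capForResourceProfile(previousRequest: Int, maxPodsPerRP: Int): Int =
    math.min(previousRequest, maxPodsPerRP)

  def main(args: Array[String]): Unit = {
    // Global sharing wanted 8 PODs for this profile, but its own limit is 5.
    println(capForResourceProfile(8, 5)) // prints 5
  }
}
```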
What changes were proposed in this pull request?
Introducing a limit for pending PODs (newly created/requested executors included).
This limit is global for all the resource profiles. So first we have to count all the newly created and pending PODs (decreased by the ones which are requested to be deleted), then we can share the remaining pending POD slots among the resource profiles.
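For users of the merged change, a hedged usage sketch: the limit appears to be exposed as the `spark.kubernetes.allocation.maxPendingPods` config (as documented for Spark 3.3.0+; verify the exact name and default in the merged Config.scala for your version).

```scala
// Hedged usage sketch; the API server address is a placeholder.
import org.apache.spark.sql.SparkSession

object MaxPendingPodsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("k8s://https://<k8s-apiserver>:6443") // placeholder API server address
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.kubernetes.allocation.maxPendingPods", "50") // cap pending executor PODs
      .getOrCreate()
    // ... run the application; the allocator keeps at most 50 PODs pending ...
    spark.stop()
  }
}
```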
Why are the changes needed?
Without this PR, dynamic allocation could request too many PODs, the K8S scheduler could be overloaded, and the scheduling of PODs would be affected by the load.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
With new unit tests.