
Conversation

attilapiros (Contributor) commented Mar 9, 2021

What changes were proposed in this pull request?

This change makes upscaling more progressive by eagerly going up to the configured allocation batch size even if there are outstanding POD requests.

In addition, pending PODs are removed from the outstanding POD request category (numOutstandingPods is renamed to numNewlyCreatedUnknownPods to reflect this change, where unknown means the POD is not yet known by the scheduler), so the batch size only limits the number of newly created POD requests. KUBERNETES_ALLOCATION_BATCH_DELAY is kept as processBatchIntervalMillis.
This way the driver CPU still won't be overwhelmed when new PODs are requested, as we still stop at a limit.
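To illustrate the idea (a rough sketch only, not the actual patch; it assumes access to the allocator's SparkConf, and names like targetNum, runningPods, pendingPods and requestNewExecutors are hypothetical), the relaxed gating would look roughly like this:

// Sketch: only newly created PODs still unknown to the scheduler count
// against the batch, so pending PODs no longer block the next round.
val batchSize = conf.get(KUBERNETES_ALLOCATION_BATCH_SIZE)
val podsToRequest = math.min(
  batchSize - numNewlyCreatedUnknownPods,
  targetNum - (runningPods + pendingPods + numNewlyCreatedUnknownPods))
if (podsToRequest > 0) {
  requestNewExecutors(podsToRequest)  // hypothetical helper
}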

For pending PODs a separate limit is introduced called spark.kubernetes.allocation.max.pendingPods.
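For example (a hypothetical usage sketch, not part of this PR), a user could set the new limit next to the existing batch size config:

import org.apache.spark.SparkConf

// Hypothetical values: allow 5 new POD requests per batch and
// up to 50 PODs waiting in the pending state.
val conf = new SparkConf()
  .set("spark.kubernetes.allocation.batch.size", "5")
  .set("spark.kubernetes.allocation.max.pendingPods", "50")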

Why are the changes needed?

Before this PR the executor PODs allocator stops requesting executor PODs when even one POD request from the current batch is outstanding (either a newly created POD request or a pending POD), so even one slow POD request could stop the PODs allocator from allocating more PODs.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit test.


SparkQA commented Mar 9, 2021

Test build #135910 has finished for PR 31790 at commit 77a5c68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40493/


SparkQA commented Mar 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40493/


SparkQA commented Apr 6, 2021

Test build #136942 has finished for PR 31790 at commit 2326247.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41518/


SparkQA commented Apr 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41518/

attilapiros changed the title from "[WIP][SPARK-34509][K8S] Make dynamic allocation upscaling more progressive on K8S" to "[SPARK-34509][K8S] Make dynamic allocation upscaling more progressive on K8S" on Apr 24, 2021

SparkQA commented Apr 24, 2021

Test build #137874 has finished for PR 31790 at commit 91bd029.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros (Contributor Author)

cc @dongjoon-hyun

With this change pending PODs are not counted as outstanding PODs, so their number can be quite high in the k8s cluster. But I would still keep the allocation batch size to limit the max number of POD requests made at once.

I am thinking about introducing a new limit for the max number of pending PODs (in case k8s tends to struggle to handle a high number of pending PODs). This new limit must be significantly higher than the POD allocation batch size (we could even derive it from the batch size using a constant multiplier like * 10, or make the factor configurable); see the sketch below.
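A minimal sketch of that derivation (assuming access to the Spark conf; pendingPodsFactor is a hypothetical name, not from this PR):

// Hypothetical: cap pending PODs at a configurable multiple of the batch size.
val batchSize = conf.get(KUBERNETES_ALLOCATION_BATCH_SIZE)  // default is 5
val pendingPodsFactor = 10                                  // could be made configurable
val maxPendingPods = batchSize * pendingPodsFactor          // 50 with the defaults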

WDYT?


SparkQA commented Apr 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42404/


SparkQA commented Apr 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42404/

@attilapiros (Contributor Author)

I decided to add the limit for Pending PODs, stay tuned!


SparkQA commented Jun 6, 2021

Test build #139379 has finished for PR 31790 at commit 5f5ecdd.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 6, 2021

Test build #139380 has finished for PR 31790 at commit 7c39a1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43900/


SparkQA commented Jun 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43901/


SparkQA commented Jun 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43900/


SparkQA commented Jun 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43901/


SparkQA commented Jun 6, 2021

Test build #139383 has finished for PR 31790 at commit 2f7ff59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43904/


SparkQA commented Jun 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43904/

@attilapiros (Contributor Author)

cc @dongjoon-hyun @holdenk

WDYT about this change (having a separate limit for pending pods and making the batch allocation limit a bit more relaxed to make allocation a bit more progressive)?

PS: next Tuesday I will go on holiday without my laptop (so this is not urgent at all).
I will be back on June 21st so then we can discuss this a bit more.

@holdenk (Contributor) commented Jun 25, 2021

If you can update this to the latest master that would be rad. I think this is very much needed.

@holdenk (Contributor) left a comment

I like this concept; one question around the default, which seems a little high.

.version("3.2.0")
.intConf
.checkValue(value => value > 0, "Maximum number of pending pods should be a positive integer")
.createWithDefault(150)

Contributor:

This default seems high, can you explain why 150?

Contributor Author:

Thanks for the review!

My main intention was to come up with a limit which protects us from overloading the k8s scheduler but still allows progressive upscaling. I think when a POD spends a long time in the pending state we should make the allocation as early as possible, while being careful to avoid overloading the k8s scheduler, as that would be counterproductive for the allocations. And as we still use the batch size during upscaling, there is a limited number of active new POD requests from a single Spark application (this also helps to avoid the overloading).

The second reason was that I hoped this is a good default for those environments where the batch size is already increased (I have seen examples where the batch size was set to 50).

But I just ran a few tests, and although I saw that 150 pending PODs did not cause any problem during resource allocation, my test was running in an EKS cluster where only one Spark app was submitted (my test app) and even the cluster size was small.

Nevertheless we can go for a different solution: another strategy for choosing this limit would be to use a default which conforms to the default batch size (which is really small, 5). So what about setting the default to 15 here? In this case we can mention this new config in the migration guide.

@holdenk WDYT?

@dongjoon-hyun (Member) commented Jun 27, 2021

I'd like to propose disabling this feature in Apache Spark 3.2.0 to remove the side-effect completely. For example, we can use Int.MaxValue as the default to disable this feature.
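A sketch of what that disabled-by-default config could look like, following the builder pattern quoted elsewhere in this review (exact wording and placement are assumptions):

val KUBERNETES_MAX_PENDING_PODS =
  ConfigBuilder("spark.kubernetes.allocation.max.pendingPods")
    .doc("Maximum number of pending pods allowed during executor allocation for this application.")
    .version("3.2.0")
    .intConf
    .checkValue(value => value > 0, "Maximum number of pending pods should be a positive integer")
    // Int.MaxValue effectively disables the limit by default.
    .createWithDefault(Int.MaxValue)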
WDYT, @attilapiros and @holdenk ?

Contributor Author:

@dongjoon-hyun

And, it would be great if we keep the existing test cases with the default configuration (disabled), and add new test coverage for this new conf.

So this PR has 3 small features which relate to each other:

  • modify how the batch size limit is taken into account: earlier the next batch was not started when even one POD was stuck as a newly created POD; this causes some of the test changes
  • change what outstanding PODs are: earlier pending PODs and newly created PODs were counted as outstanding PODs, which was stopping the allocation if it was coming from the scheduler.
    This causes the rest of the unit test differences.
  • introduce a limit for pending pods

Let me separate them into different PRs (at least two); this will make the review easier.

Contributor:

Sounds good, ping me on your split PRs and I'll be happy to take a look.

@attilapiros (Contributor Author)

jenkins retest this please

@dongjoon-hyun (Member)

Oh, I missed your ping here. Sorry for being late. I'll review right now.


val KUBERNETES_MAX_PENDING_PODS =
  ConfigBuilder("spark.kubernetes.allocation.max.pendingPods")
    .doc("Maximum number of pending pods allowed during executor alloction for this application.")

Member:

alloction -> allocation?

    .createWithDefault(5)

val KUBERNETES_MAX_PENDING_PODS =
  ConfigBuilder("spark.kubernetes.allocation.max.pendingPods")

Member:

This introduces a new namespace, max, with only one child, pendingPods. Do you have a plan to add more? Otherwise, we need to reduce the depth like maxPendingPods.

}

var totalPendingCount = 0
var sumPendingPods = 0

Member:

Can we reuse the old variable totalPendingCount? It seems similar enough to sumPendingPods in the new context.

@dongjoon-hyun (Member) left a comment

Hi, @attilapiros. The feature looks reasonable, but we should introduce a new feature more safely. I left a few comments.

We can add this feature with Int.MaxValue as the default. And it would be great if we keep the existing test cases with the default configuration (disabled), and add new test coverage for this new conf.


SparkQA commented Jun 29, 2021

Test build #140389 has finished for PR 31790 at commit 61cdaab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44913/


SparkQA commented Jun 30, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44913/

@holdenk (Contributor) commented Jul 20, 2021

Just following up @attilapiros, let me know when it's ready for review again.

@attilapiros (Contributor Author)

Thanks @holdenk! I plan to open the new PR which introduces the pending pod limit this week.

@attilapiros (Contributor Author)

I am closing this as #33492 went in and the POD limit was common code.
I will think a bit more about reinterpreting the batch size and relaxing the conditions to make upscaling more progressive.
