[SPARK-34509][K8S] Make dynamic allocation upscaling more progressive on K8S #31790
Conversation
Test build #135910 has finished for PR 31790 at commit

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #136942 has finished for PR 31790 at commit

Kubernetes integration test starting

Kubernetes integration test status failure
Test build #137874 has finished for PR 31790 at commit
With this change pending PODs are not counted as outstanding PODs, so their number can grow quite high in the k8s cluster. Still, I would keep the allocation batch size to limit the maximum number of POD requests made at once. I am thinking about introducing a new limit for the maximum number of pending PODs (in case k8s struggles to handle a high number of pending PODs). This new limit must be significantly higher than the POD allocation batch size (we could even derive it from the batch size using a constant multiplier like * 10, or make the factor configurable). WDYT?
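A minimal sketch of the derivation floated above; `pendingPodsFactor` is a hypothetical knob, nothing in this PR defines it:

```scala
// Hypothetical derivation of a pending-POD cap from the allocation batch size.
// Neither `pendingPodsFactor` nor `maxPendingPods` below is code from this PR.
val allocationBatchSize = 5                                  // spark.kubernetes.allocation.batch.size default
val pendingPodsFactor = 10                                   // imagined constant multiplier; could be made configurable
val maxPendingPods = allocationBatchSize * pendingPodsFactor // = 50 with the defaults above
```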
Kubernetes integration test starting

Kubernetes integration test status failure
I decided to add the limit for Pending PODs, stay tuned!
Test build #139379 has finished for PR 31790 at commit

Test build #139380 has finished for PR 31790 at commit

Kubernetes integration test starting

Kubernetes integration test starting

Kubernetes integration test status success

Kubernetes integration test status success

Test build #139383 has finished for PR 31790 at commit

Kubernetes integration test starting

Kubernetes integration test status success
WDYT about this change (having a separate limit for pending pods and making the batch allocation limit a bit more relaxed so that allocation is a bit more progressive)? PS: next Tuesday I will go on holiday without my laptop (so this is not urgent at all).
If you can update this to the latest master, that would be rad. I think this is very much needed.
holdenk left a comment
I like this concept. One question around the default: it seems a little high.
    .version("3.2.0")
    .intConf
    .checkValue(value => value > 0, "Maximum number of pending pods should be a positive integer")
    .createWithDefault(150)
This default seems high, can you explain why 150?
Thanks for the review!
My main intention was to come up with a limit which protects us from overloading the k8s scheduler but still allows progressive upscaling. I think when a POD spends a long time in the pending state we should make the allocation as early as possible, while being careful to avoid overloading the k8s scheduler, as that would be counterproductive for the allocations. And as we still use the batch size during upscaling, there is a limited number of active new POD requests from a single Spark application (this also helps avoid the overloading).
The second reason was that I hoped this is a good default for those envs where the batch size is already increased (I have seen examples where the batch size was set to 50).
But I just ran a few tests and, although 150 pending PODs did not cause any problem during resource allocation, my test was running in an EKS cluster where only one Spark app was submitted (my test app) and even the cluster size was small.
So nevertheless we can go for a different solution:
Another strategy to choose this limit would be to use a default which conforms to the default batch size (which is really small = 5). So what about setting the default to 15 here? In this case we can mention this new config in the migration guide.
@holdenk WDYT?
I'd like to propose disabling this feature in Apache Spark 3.2.0 to remove the side effect completely. For example, we can use Int.MaxValue as the default to disable this feature.
WDYT, @attilapiros and @holdenk?
And, it would be great if we keep the existing test cases with the default configuration (disabled), and add new test coverage for this new conf.
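For illustration, a sketch of the disable-by-default idea above, reusing the config name and messages from this PR's diff (this is not the final merged code):

```scala
import org.apache.spark.internal.config.ConfigBuilder

// Sketch only: same builder chain as in the PR's diff, but with Int.MaxValue as the
// default so the pending-POD cap is effectively disabled unless the user sets it.
val KUBERNETES_MAX_PENDING_PODS =
  ConfigBuilder("spark.kubernetes.allocation.max.pendingPods")
    .doc("Maximum number of pending pods allowed during executor allocation for this application.")
    .version("3.2.0")
    .intConf
    .checkValue(value => value > 0, "Maximum number of pending pods should be a positive integer")
    .createWithDefault(Int.MaxValue)
```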
So this PR has 3 small features which relate to each other:
- modify how the batch size limit is taken into account: earlier the next batch was not started while even one POD was still stuck in the newly created state; this causes some of the test changes
- change what outstanding PODs are: earlier both pending PODs and newly created PODs were counted as outstanding PODs, which stopped the allocation even when the delay came from the scheduler; this causes the rest of the unit test differences (see the sketch below)
- introduce a limit for pending PODs

Let me separate them into different PRs (at least two); this will make the review easier.
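A small illustration of the second point, with made-up counts and variable names (not the PR's exact code):

```scala
// Made-up example counts to show the change in what "outstanding" means.
val numNewlyCreatedPods = 3 // requested by the driver, not yet seen by the k8s scheduler
val numPendingPods = 40     // known to the scheduler, waiting to be scheduled

// Before: pending PODs also counted as outstanding, so upscaling stalled on scheduler delay.
val outstandingBefore = numNewlyCreatedPods + numPendingPods // 43 blocks the next batch
// After: only newly created PODs count as outstanding; pending PODs get their own separate cap.
val outstandingAfter = numNewlyCreatedPods                   // 3, so the next batch can still start
```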
Sounds good, ping me on your split PRs and I'll be happy to take a look.
jenkins retest this please

Oh, I missed your ping here. Sorry for being late. I'll review right now.
val KUBERNETES_MAX_PENDING_PODS =
  ConfigBuilder("spark.kubernetes.allocation.max.pendingPods")
    .doc("Maximum number of pending pods allowed during executor alloction for this application.")
alloction -> allocation?
    .createWithDefault(5)

val KUBERNETES_MAX_PENDING_PODS =
  ConfigBuilder("spark.kubernetes.allocation.max.pendingPods")
This introduces a new namespace, max, with only one child, pendingPods. Do you have a plan to add more? Otherwise, we need to reduce the depth like maxPendingPods.
}

var totalPendingCount = 0
var sumPendingPods = 0
Can we reuse the old variable totalPendingCount? It looks similar enough to sumPendingPods in the next context.
dongjoon-hyun left a comment
Hi, @attilapiros. The feature looks reasonable, but we should introduce a new feature more safely. I left a few comments.
We can add this feature with Int.MaxValue as the default. And, it would be great if we keep the existing test cases with the default configuration (disabled), and add new test coverage for this new conf.
Test build #140389 has finished for PR 31790 at commit

Kubernetes integration test starting

Kubernetes integration test status success
Just following up @attilapiros, let me know when it's ready for review again.
Thanks @holdenk! I plan to open the new PR which introduces the pending pod limit this week.
I'm closing this as #33492 went in and the POD limit was the common code between them.
What changes were proposed in this pull request?
Making upscaling more progressive to eagerly go up to the configured allocation batch size even if there is an outstanding POD request.
In addition, the pending PODs are removed from the outstanding POD request category (`numOutstandingPods` is renamed to `numNewlyCreatedUnknownPods` to reflect this change, where unknown means it is not yet known by the scheduler), and the batch size now only limits the number of newly created POD requests. `KUBERNETES_ALLOCATION_BATCH_DELAY` is kept as `processBatchIntervalMillis`. This way the driver CPU still won't be overwhelmed when new PODs are requested, as we still stop at a limit.
For pending PODs a separate limit is introduced, called `spark.kubernetes.allocation.max.pendingPods`.
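As a rough illustration of the new behaviour, a sketch with hypothetical names (not the actual `ExecutorPodsAllocator` code):

```scala
// Sketch: how many new executor POD requests one allocation round could make under the
// scheme described above. The helper and all parameter names are hypothetical.
def podsToRequest(
    targetExecutors: Int,
    runningExecutors: Int,
    numPendingPods: Int,
    numNewlyCreatedUnknownPods: Int,
    batchSize: Int,      // spark.kubernetes.allocation.batch.size
    maxPendingPods: Int  // spark.kubernetes.allocation.max.pendingPods
  ): Int = {
  // Executors still missing compared to the dynamic allocation target.
  val missing = targetExecutors - runningExecutors - numPendingPods - numNewlyCreatedUnknownPods
  // The batch size only limits newly created (not yet known by the scheduler) POD requests ...
  val batchRoom = batchSize - numNewlyCreatedUnknownPods
  // ... while pending PODs are capped separately, so a slow scheduler no longer blocks
  // upscaling until the dedicated pending limit is reached.
  val pendingRoom = maxPendingPods - numPendingPods - numNewlyCreatedUnknownPods
  math.max(0, List(missing, batchRoom, pendingRoom).min)
}
```

For example, with the defaults from the diff above (batch size 5, pending cap 150), a round with 2 newly created and 10 pending PODs would request at most min(5 - 2, 150 - 12) = 3 more PODs, assuming the target still calls for them.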
Why are the changes needed?
Before this PR the executor PODs allocator stopped requesting executor PODs when even one POD request from the current batch was outstanding (either a newly created POD request or a pending POD), so even one slowly scheduled POD could stop the allocator from allocating more PODs.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing unit test.