-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-29263][SCHEDULER] Update availableSlots in resourceOffers() before checking available slots for barrier taskSet
#25946
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| // of locality levels so that it gets a chance to launch local tasks on all of them. | ||
| // NOTE: the preferredLocality order: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY | ||
| for (taskSet <- sortedTaskSets) { | ||
| val availableSlots = availableCpus.map(c => c / CPUS_PER_TASK).sum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: in https://github.com/apache/spark/pull/25946/files#diff-d4000438827afe3a185ae75b24987a61R455 the else could become an else if (availableSlots > 0) to avoid iterating over the remaining task sets when all available slots have been taken already, but that's a micro-optimization orthogonal to the fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to avoid reiterating over the availableCpus array, we could also have one var that gets passed and updated inside resourceOfferSingleTaskSet, but that would likely be an overoptimization of this O(#num_offers) sum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
availableSlots in resourceOffers() before checking available slots for barrier taskSet
|
Test build #111445 has finished for PR 25946 at commit
|
availableSlots in resourceOffers() before checking available slots for barrier taskSetavailableSlots in resourceOffers() before checking available slots for barrier taskSet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @juliuszsompolski .
Since this is a bug fix, could you add a UT to prevent a future regression on this, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I did the same fix in #23375 previously and also have a JIRA ticket(SPARK-26431) for it.
Not sure why I got a negative comment in #23375. Maybe, due to my poor explanation.
Anyway, let's close #23375 and resolve SPARK-26431 too after this gets merged.
|
I added |
|
BTW, as @Ngone51 mentioned, this looks similar to #23375 . How do you think about that PR, @juliuszsompolski ? |
|
@dongjoon-hyun I added UT now. |
|
Test build #111478 has finished for PR 25946 at commit
|
|
LGTM |
…` before checking available slots for barrier taskSet ### What changes were proposed in this pull request? availableSlots are computed before the for loop looping over all TaskSets in resourceOffers. But the number of slots changes in every iteration, as in every iteration these slots are taken. The number of available slots checked by a barrier task set has therefore to be recomputed in every iteration from availableCpus. ### Why are the changes needed? Bugfix. This could make resourceOffer attempt to start a barrier task set, even though it has not enough slots available. That would then be caught by the `require` in line 519, which will throw an exception, which will get caught and ignored by Dispatcher's MessageLoop, so nothing terrible would happen, but the exception would prevent resourceOffers from considering further TaskSets. Note that launching the barrier TaskSet can still fail if other requirements are not satisfied, and still can be rolled-back by throwing exception in this `require`. Handling it more gracefully remains a TODO in SPARK-24818, but this fix at least should resolve the situation when it's unable to launch because of insufficient slots. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added UT Closes #23375 Closes #25946 from juliuszsompolski/SPARK-29263. Authored-by: Juliusz Sompolski <[email protected]> Signed-off-by: Xingbo Jiang <[email protected]> (cherry picked from commit 420abb4) Signed-off-by: Xingbo Jiang <[email protected]>
|
Thanks, merged to master/2.4 ! |
…hedulerImplSuite` ### What changes were proposed in this pull request? #25946 Fixed a bug and modified the `TaskSchedulerImplSuite`, when backported to 2.4 it breaks the build. This PR is to fix the broken test build. ### How was this patch tested? Passed locally. Closes #25952 from jiangxb1987/SPARK-29263. Authored-by: Xingbo Jiang <[email protected]> Signed-off-by: Marcelo Vanzin <[email protected]>
What changes were proposed in this pull request?
availableSlots are computed before the for loop looping over all TaskSets in resourceOffers. But the number of slots changes in every iteration, as in every iteration these slots are taken. The number of available slots checked by a barrier task set has therefore to be recomputed in every iteration from availableCpus.
Why are the changes needed?
Bugfix.
This could make resourceOffer attempt to start a barrier task set, even though it has not enough slots available. That would then be caught by the
requirein line 519, which will throw an exception, which will get caught and ignored by Dispatcher's MessageLoop, so nothing terrible would happen, but the exception would prevent resourceOffers from considering further TaskSets.Note that launching the barrier TaskSet can still fail if other requirements are not satisfied, and still can be rolled-back by throwing exception in this
require. Handling it more gracefully remains a TODO in SPARK-24818, but this fix at least should resolve the situation when it's unable to launch because of insufficient slots.Does this PR introduce any user-facing change?
No
How was this patch tested?
Added UT
Closes #23375