Skip to content

Conversation

@juliuszsompolski
Copy link
Contributor

@juliuszsompolski juliuszsompolski commented Sep 26, 2019

What changes were proposed in this pull request?

availableSlots are computed before the for loop looping over all TaskSets in resourceOffers. But the number of slots changes in every iteration, as in every iteration these slots are taken. The number of available slots checked by a barrier task set has therefore to be recomputed in every iteration from availableCpus.

Why are the changes needed?

Bugfix.
This could make resourceOffer attempt to start a barrier task set, even though it has not enough slots available. That would then be caught by the require in line 519, which will throw an exception, which will get caught and ignored by Dispatcher's MessageLoop, so nothing terrible would happen, but the exception would prevent resourceOffers from considering further TaskSets.
Note that launching the barrier TaskSet can still fail if other requirements are not satisfied, and still can be rolled-back by throwing exception in this require. Handling it more gracefully remains a TODO in SPARK-24818, but this fix at least should resolve the situation when it's unable to launch because of insufficient slots.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UT

Closes #23375

// of locality levels so that it gets a chance to launch local tasks on all of them.
// NOTE: the preferredLocality order: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
for (taskSet <- sortedTaskSets) {
val availableSlots = availableCpus.map(c => c / CPUS_PER_TASK).sum
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: in https://github.com/apache/spark/pull/25946/files#diff-d4000438827afe3a185ae75b24987a61R455 the else could become an else if (availableSlots > 0) to avoid iterating over the remaining task sets when all available slots have been taken already, but that's a micro-optimization orthogonal to the fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to avoid reiterating over the availableCpus array, we could also have one var that gets passed and updated inside resourceOfferSingleTaskSet, but that would likely be an overoptimization of this O(#num_offers) sum.

Copy link
Contributor

@jiangxb1987 jiangxb1987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jiangxb1987 jiangxb1987 changed the title [SPARK-29263] availableSlots in resourceOffers can change before being checked by barrier TaskSet [SPARK-29263] Update availableSlots in resourceOffers() before checking available slots for barrier taskSet Sep 26, 2019
@SparkQA
Copy link

SparkQA commented Sep 26, 2019

Test build #111445 has finished for PR 25946 at commit 467731c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-29263] Update availableSlots in resourceOffers() before checking available slots for barrier taskSet [SPARK-29263][SCHEDULER] Update availableSlots in resourceOffers() before checking available slots for barrier taskSet Sep 26, 2019
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @juliuszsompolski .
Since this is a bug fix, could you add a UT to prevent a future regression on this, please?

Copy link
Member

@Ngone51 Ngone51 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I did the same fix in #23375 previously and also have a JIRA ticket(SPARK-26431) for it.

Not sure why I got a negative comment in #23375. Maybe, due to my poor explanation.

Anyway, let's close #23375 and resolve SPARK-26431 too after this gets merged.

@dongjoon-hyun
Copy link
Member

I added Closes #23375 in the PR description to close it.

@dongjoon-hyun
Copy link
Member

BTW, as @Ngone51 mentioned, this looks similar to #23375 . How do you think about that PR, @juliuszsompolski ?

@juliuszsompolski
Copy link
Contributor Author

@dongjoon-hyun I added UT now.
The fix in this PR is exactly the same as in #23375, just that I wrote it inline, and @Ngone51 pulled it out into a helper def.

@SparkQA
Copy link

SparkQA commented Sep 27, 2019

Test build #111478 has finished for PR 25946 at commit f5b4f48.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor

LGTM

jiangxb1987 pushed a commit that referenced this pull request Sep 27, 2019
…` before checking available slots for barrier taskSet

### What changes were proposed in this pull request?

availableSlots are computed before the for loop looping over all TaskSets in resourceOffers. But the number of slots changes in every iteration, as in every iteration these slots are taken. The number of available slots checked by a barrier task set has therefore to be recomputed in every iteration from availableCpus.

### Why are the changes needed?

Bugfix.
This could make resourceOffer attempt to start a barrier task set, even though it has not enough slots available. That would then be caught by the `require` in line 519, which will throw an exception, which will get caught and ignored by Dispatcher's MessageLoop, so nothing terrible would happen, but the exception would prevent resourceOffers from considering further TaskSets.
Note that launching the barrier TaskSet can still fail if other requirements are not satisfied, and still can be rolled-back by throwing exception in this `require`. Handling it more gracefully remains a TODO in SPARK-24818, but this fix at least should resolve the situation when it's unable to launch because of insufficient slots.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added UT

Closes #23375

Closes #25946 from juliuszsompolski/SPARK-29263.

Authored-by: Juliusz Sompolski <[email protected]>
Signed-off-by: Xingbo Jiang <[email protected]>
(cherry picked from commit 420abb4)
Signed-off-by: Xingbo Jiang <[email protected]>
@jiangxb1987
Copy link
Contributor

Thanks, merged to master/2.4 !

@gatorsmile
Copy link
Member

vanzin pushed a commit that referenced this pull request Sep 27, 2019
…hedulerImplSuite`

### What changes were proposed in this pull request?

#25946 Fixed a bug and modified the `TaskSchedulerImplSuite`, when backported to 2.4 it breaks the build. This PR is to fix the broken test build.

### How was this patch tested?

Passed locally.

Closes #25952 from jiangxb1987/SPARK-29263.

Authored-by: Xingbo Jiang <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants