[SPARK-24677][Core]Avoid NoSuchElementException from MedianHeap #21656
Conversation
Can you add a test to check that no exception is thrown in that condition with this patch?

@maropu

@maropu @cloud-fan @squito

ok to test
```scala
if (speculationEnabled) {
  taskAttempts(index).headOption.map { info =>
    info.markFinished(TaskState.FINISHED, clock.getTimeMillis())
    successfulTaskDurations.insert(info.duration)
```
what's the normal code path to update task durations?
TaskSetManager#handleSuccessfulTask updates the successful task durations and writes them to successfulTaskDurations.
When there are multiple tasksets for this stage, markPartitionCompletedInAllTaskSets only accumulates the value of tasksSuccessful.
In that case, when checkSpeculatableTasks is called, tasksSuccessful satisfies the condition, but successfulTaskDurations is empty.
```scala
def handleSuccessfulTask(tid: Long, result: DirectTaskResult[_]): Unit = {
  val info = taskInfos(tid)
  val index = info.index
  info.markFinished(TaskState.FINISHED, clock.getTimeMillis())
  if (speculationEnabled) {
    successfulTaskDurations.insert(info.duration)
  }
  // ...
  // There may be multiple tasksets for this stage -- we let all of them know that the partition
  // was completed. This may result in some of the tasksets getting completed.
  sched.markPartitionCompletedInAllTaskSets(stageId, tasks(index).partitionId)
  // ...
}

override def checkSpeculatableTasks(minTimeToSpeculation: Int): Boolean = {
  // ...
  if (tasksSuccessful >= minFinishedForSpeculation && tasksSuccessful > 0) {
    val time = clock.getTimeMillis()
    val medianDuration = successfulTaskDurations.median
    // ...
  }
  // ...
}
```
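To make the failure concrete, here is a minimal sketch of that sequence, assuming Spark's org.apache.spark.util.collection.MedianHeap is available (it is private[spark], so the package name below is hypothetical and only serves to keep the sketch compilable inside the Spark namespace):

```scala
package org.apache.spark.repro // hypothetical package, only so private[spark] MedianHeap is visible

import org.apache.spark.util.collection.MedianHeap

object EmptyMedianHeapRepro {
  def main(args: Array[String]): Unit = {
    // checkSpeculatableTasks sees tasksSuccessful > 0 (bumped by
    // markPartitionCompleted), but nothing was ever inserted into the heap.
    val successfulTaskDurations = new MedianHeap()
    successfulTaskDurations.median // throws java.util.NoSuchElementException: MedianHeap is empty.
  }
}
```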
cc @jiangxb1987

Test build #92631 has finished for PR 21656 at commit
Thanks for finding this and suggesting a fix @cxzl25. But I'm not sure it makes sense to use this duration: it's not how long the task actually took to complete. I think it might make more sense to just ignore this task for speculation. I will think about it some more.
```scala
private[scheduler] def markPartitionCompleted(partitionId: Int): Unit = {
  partitionToIndex.get(partitionId).foreach { index =>
    if (!successful(index)) {
      if (speculationEnabled) {
```
IIUC in this case no task in this taskSet actually finished successfully; it's another task attempt from another taskSet for the same stage that succeeded. Instead of changing this code path, I'd suggest we have another flag to show whether any task succeeded in the current taskSet, and if no task has succeeded, skip L987.
WDYT @squito ?
yeah that is sort of what I was suggesting -- but I was thinking rather than just a flag, maybe we separate out tasksSuccessful into tasksCompletedSuccessfully (from this taskset) and tasksNoLongerNecessary (from any taskset), perhaps with better names. If you just had a flag, you would avoid the exception from the empty heap, but you still might decide to enable speculation prematurely, as you really haven't finished enough tasks for SPECULATION_QUANTILE:
```scala
if (tasksSuccessful >= minFinishedForSpeculation && tasksSuccessful > 0) {
```
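A self-contained toy sketch of that two-counter idea; all names here (SpeculationCounters, recordLocalSuccess, recordCompletedElsewhere) are hypothetical and not part of the actual patch:

```scala
import scala.collection.mutable.ArrayBuffer

// Toy bookkeeping illustrating the suggestion above: gate speculation on tasks
// that completed in *this* taskset (whose durations were recorded), while still
// tracking partitions finished by any attempt of the stage.
class SpeculationCounters(numTasks: Int, speculationQuantile: Double) {
  private val successfulTaskDurations = ArrayBuffer.empty[Long]
  private var tasksCompletedSuccessfully = 0 // finished by this taskset, duration recorded
  private var tasksNoLongerNecessary = 0     // finished by any attempt of the stage

  def recordLocalSuccess(durationMs: Long): Unit = {
    tasksCompletedSuccessfully += 1
    tasksNoLongerNecessary += 1
    successfulTaskDurations += durationMs
  }

  def recordCompletedElsewhere(): Unit = {
    tasksNoLongerNecessary += 1
  }

  def allPartitionsDone: Boolean = tasksNoLongerNecessary == numTasks

  // Only consider speculation once enough of *our own* tasks have finished, so
  // the duration collection is guaranteed to be non-empty when we read from it.
  def readyToSpeculate: Boolean = {
    val minFinishedForSpeculation = math.max(1, (speculationQuantile * numTasks).floor.toInt)
    tasksCompletedSuccessfully >= minFinishedForSpeculation
  }
}
```

Gating on the local counter would avoid both problems at once: the empty-heap exception and speculating early when most "successes" actually came from another attempt.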
I assume this is really that it isn't updating successfulTaskDurations? MedianHeap is a collection; can you please update the description and title to be more explicit.
In this case one of the older stage attempts (that is a zombie) marked the task as successful, but then the newest stage attempt checked to see if it needed to speculate. Is that correct? Ideally I think for speculation we want to look at the task time for all stage attempts, but that is probably a bigger change than this. If we aren't doing that, then I think ignoring it for speculation is ok. Otherwise, how hard is it to send the actual task info into here so it could use the real time the successful task took?
yeah I agree, on both points. One thing which is a little tricky is that you probably want to make sure you're only counting times from different partitions -- you might get times for the same partition from multiple attempts, and that shouldn't count. (or maybe we don't really care that much, as it's just a heuristic anyway ...)
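If durations from all stage attempts were ever pooled for speculation, the per-partition de-duplication could look roughly like this sketch (names are hypothetical, not from the patch):

```scala
import scala.collection.mutable

// Record at most one duration per partition, so multiple attempts of the same
// partition don't skew the median used for speculation.
class PerPartitionDurations {
  private val seenPartitions = mutable.Set.empty[Int]
  private val durations = mutable.ArrayBuffer.empty[Long]

  def record(partitionId: Int, durationMs: Long): Unit = {
    if (seenPartitions.add(partitionId)) {
      durations += durationMs
    }
  }

  def median: Option[Long] =
    if (durations.isEmpty) None else Some(durations.sorted.apply(durations.size / 2))
}
```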
```scala
private[scheduler] def markPartitionCompleted(partitionId: Int): Unit = {
  partitionToIndex.get(partitionId).foreach { index =>
    if (!successful(index)) {
      if (speculationEnabled) {
```
speculationEnabled && !isZombie
Ok, what did we decide on the time then? I would say for now either ignore it or send down the real time. @cxzl25 How hard is it to send the actual task info into here so it could use the real time the successful task took? At a glance it doesn't look too hard to add the additional information into the function calls to pass it into markPartitionCompleted.
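A hedged sketch of that direction -- threading the TaskInfo of the successful attempt into markPartitionCompleted so its duration can be recorded -- as it might look inside TaskSetManager; the exact signature in the merged diff may differ:

```scala
// Sketch only: assumes TaskSetManager's fields (partitionToIndex, successful,
// speculationEnabled, isZombie, successfulTaskDurations, tasksSuccessful,
// numTasks, maybeFinishTaskSet) are in scope.
private[scheduler] def markPartitionCompleted(
    partitionId: Int,
    taskInfo: Option[TaskInfo]): Unit = {
  partitionToIndex.get(partitionId).foreach { index =>
    if (!successful(index)) {
      if (speculationEnabled && !isZombie) {
        // Only record a duration when the caller has a real finished attempt.
        taskInfo.foreach { info => successfulTaskDurations.insert(info.duration) }
      }
      tasksSuccessful += 1
      successful(index) = true
      if (tasksSuccessful == numTasks) {
        isZombie = true
      }
      maybeFinishTaskSet()
    }
  }
}
```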
@tgravescs

The changes LGTM
Test build #92817 has finished for PR 21656 at commit

Test build #92820 has finished for PR 21656 at commit
@squito are you ok with this approach?

+1

lgtm
merged, thanks @cxzl25
What changes were proposed in this pull request?
When speculation is enabled,
TaskSetManager#markPartitionCompleted should write the successful task duration to MedianHeap,
not just increase tasksSuccessful.
Otherwise, when TaskSetManager#checkSpeculatableTasks runs, tasksSuccessful is non-zero but MedianHeap is empty,
so successfulTaskDurations.median throws java.util.NoSuchElementException: MedianHeap is empty,
which finally leads to stopping the SparkContext.
How was this patch tested?
TaskSetManagerSuite.scala
unit test: [SPARK-24677] MedianHeap should not be empty when speculation is enabled