Make the block manager decommissioning test less flaky
An interesting failure happens when migrateDuring = true (and persist or
shuffle is true):
- We schedule the job with tasks on executors 0, 1, 2.
- We wait 300 ms and decommission executor 0.
- If the task is not yet done on executor 0, it will now fail because
the block manager won't be able to save the block. This condition is
easy to trigger on a loaded machine, such as those where the GitHub checks run.
- The task will retry on a different executor (1 or 2), and its shuffle
blocks will land there.
- No actual block migration happens here because the decommissioned
executor technically failed before it could even produce a block.
This change makes two fixes to remove the above race condition:
- When migrateDuring = true, wait for a task to complete and write the
block, and then decommission that executor.
- When migrateDuring = false, it is still possible (because of delay
scheduling) for two tasks to run serially on the same executor while
another executor stays idle. In that case, we must make sure to
decommission an executor that actually had a task run on it.
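
The idea behind both fixes can be sketched outside Spark. The snippet below is
only an illustration under stated assumptions: it simulates executors with
threads, and every name in it (run_task, pick_executor_to_decommission,
task_end_events) is hypothetical rather than part of Spark or of the actual
test. Instead of sleeping for a fixed 300 ms and decommissioning executor 0,
it waits for a task-end event and picks the executor that reported it.

```python
import queue
import threading
import time


def run_task(executor_id, duration_s, task_end_events):
    """Simulate a task: do some work, then report which executor finished it."""
    time.sleep(duration_s)
    task_end_events.put(executor_id)


def pick_executor_to_decommission(task_end_events, timeout_s=5.0):
    """Block until some task has finished and return the executor that ran it.

    Raises queue.Empty if no task finishes within timeout_s, so a hung job
    fails the test instead of decommissioning an executor that did no work.
    """
    return task_end_events.get(timeout=timeout_s)


# Hypothetical per-executor task durations; on a loaded machine these vary,
# which is exactly why a fixed sleep is unreliable.
durations = {"0": 0.3, "1": 0.05, "2": 0.2}
task_end_events = queue.Queue()

threads = [
    threading.Thread(target=run_task, args=(executor_id, duration, task_end_events))
    for executor_id, duration in durations.items()
]
for t in threads:
    t.start()

# Choose the executor only after it has completed a task and produced a block.
victim = pick_executor_to_decommission(task_end_events)

for t in threads:
    t.join()

print("decommissioning executor", victim)
```

Because the executor is chosen from a completed-task event, it is guaranteed
to have run a task, which also covers the migrateDuring = false case where
delay scheduling may leave one executor idle.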