Skip to content

Conversation

@holdenk
Copy link

@holdenk holdenk commented Jul 27, 2020

From our conversation, this changes the block manager to always allow local blocks to be put avoiding the race condition. Generally speaking this shouldn't matter in production but it should help avoid the test race condition.

dagrawal3409 and others added 4 commits July 27, 2020 11:17
An interesting failure happens when migrateDuring = true (and persist or
shuffle is true):

- We schedule the job with tasks on executors 0, 1, 2.
- We wait 300 ms and decommission executor 0.
- If the task is not yet done on executor 0, it will now fail because
  the block manager won't be able to save the block. This condition is
  easy to trigger on a loaded machine where the github checks run.
- The task with retry on a different executor (1 or 2) and its shuffle
  blocks will land there.
- No actual block migration happens here because the decommissioned
  executor technically failed before it could even produce a block.

So this change makes two fixes to remove the above race condition.
- When migrateDuring = true, wait for a task to complete and write the
  block, and then decommission that executor.
- When migrateDuring = false, it is still possible (because of delay
  scheduling) for two tasks to be run on the same executor serially and
  one executor to go idle. In which case, we must make sure to
  decommission an executor that actually had a task run on it.
Now that we wait for an actual task to succeed, we don't need to wait
for events prior to that: broadcast of job-info finished and task
started. The waiting for the task end/success subsumes that. Simplifying
the test even further.
…cal blocks from the BlockManager (useful in testing).
@holdenk holdenk changed the base branch from master to block-manager-decom-flaky July 27, 2020 23:39
@agrawaldevesh agrawaldevesh force-pushed the block-manager-decom-flaky branch 3 times, most recently from d709002 to 1b469fb Compare July 29, 2020 01:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants