Conversation

@JoshRosen
Contributor

What changes were proposed in this pull request?

If a BlockManager `put()` call failed after the BlockManagerMaster was notified of a block's availability, then incomplete cleanup logic in a `finally` block would never send a second block status message to inform the master of the block's unavailability. This, in turn, led to fetch failures and, before #15037 was fixed, could cause complete job failures.

This patch addresses this issue via multiple small changes:

  • The `finally` block now calls `removeBlockInternal` when cleaning up from a failed `put()`; in addition to removing the `BlockInfo` entry (which was all that the old cleanup logic did), this code (redundantly) tries to remove the block from the memory and disk stores (as an added layer of defense against bugs lower down in the stack) and optionally notifies the master of block removal (which now happens during exception-triggered cleanup).
  • When a BlockManager receives a request for a block that it does not have, it will now notify the master to update its block locations. This ensures that bad metadata pointing to non-existent blocks will eventually be fixed. Note that I could have implemented this logic in the block manager client (rather than in the remote server), but that would introduce the problem of distinguishing between transient and permanent failures; on the server, however, we know definitively that the block isn't present.
  • Catch `NonFatal` instead of `Exception` to avoid swallowing `InterruptedException`s thrown from synchronous block replication calls.
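The `NonFatal` point in the last bullet can be illustrated with a small sketch (a hypothetical helper, not the actual Spark replication code): a `case NonFatal(e)` handler catches ordinary exceptions but deliberately lets `InterruptedException` propagate, so a thread interrupt is not swallowed.

```scala
import scala.util.control.NonFatal

// Hypothetical helper, not Spark code: classify which throwables a
// `case NonFatal(e)` handler would catch. NonFatal excludes
// InterruptedException (and other fatal errors), so interrupts raised
// during synchronous replication propagate instead of being swallowed.
def caughtByNonFatal(t: Throwable): Boolean = t match {
  case NonFatal(_) => true   // ordinary exception: handled, cleanup runs
  case _           => false  // InterruptedException etc.: rethrown to caller
}
```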

This patch depends upon the refactorings in #15036, so that other patch will also have to be backported when backporting this fix.

For more background on this issue, including example logs from a real production failure, see SPARK-17484.

How was this patch tested?

Two new regression tests in `BlockManagerSuite`.

@JoshRosen
Contributor Author

/cc @ericl @srinathshankar for review

```scala
// notified the master about the availability of this block, so we need to send an update
// to remove this block location.
removeBlockInternal(
  blockId, tellMaster = tellMaster && putBlockInfo.tellMaster && exceptionWasThrown)
```
Contributor

Aren't these two `tellMaster` values equivalent, since we set it in the block from the function arg above?

Contributor Author

Yep, `tellMaster` will equal `putBlockInfo.tellMaster` in this branch. Let me update this to clarify.

@SparkQA

SparkQA commented Sep 13, 2016

Test build #65330 has finished for PR 15085 at commit 6609c2a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
  }
} else {
  blockInfoManager.removeBlock(blockId)
  if (exceptionWasThrown) {
```
Contributor

could also combine this with the above `else`

btw, is it necessary to unlock the block in this path?

Contributor Author

I didn't combine it so that the `logWarning` wouldn't need to be duplicated, but that's not a great rationale.

`removeBlockInternal` (which is used in both the `if` and `else` cases now) will handle releasing the lock (this happens in the `blockInfoManager.removeBlock` call).
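As a rough sketch of that invariant (a toy model with made-up names, not Spark's actual `BlockInfoManager`), the idea is that removing a block's metadata entry is also what releases the write lock the current task holds on it:

```scala
// Toy model (hypothetical, not Spark's BlockInfoManager): the metadata set
// doubles as the lock table, so dropping a block's entry releases its lock.
class ToyBlockInfoManager {
  private val locked = scala.collection.mutable.Set.empty[String]
  def lockForWriting(blockId: String): Unit = locked += blockId
  // Removing the block's entry also releases the caller's write lock on it.
  def removeBlock(blockId: String): Unit = locked -= blockId
  def holdsLock(blockId: String): Boolean = locked.contains(blockId)
}
```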

```scala
  exceptionWasThrown = false
  res
} finally {
  if (blockWasSuccessfullyStored) {
```
Contributor

Wouldn't `blockWasSuccessfullyStored` be false if `exceptionWasThrown` were true? In that case, couldn't we write this as

```
try {
  ...
} catch (Exception e) {
  removeBlock
  addUpdatedBlock
} finally {
  // Whatever was there before?
}
```

Contributor Author

One concern with using a `catch` here is handling of `InterruptedException`: if we use `case NonFatal(e)` that won't match `InterruptedException` and we'll miss out on cleanup following that. If we catch `Throwable`, on the other hand, then I think that we'll end up clearing the `isInterrupted` bit for `InterruptedException`s and it'll be awkward to match and re-set it when rethrowing. Therefore I'd like to keep the exception-handling case in the `finally` block with a simple check to see if we entered that block via an error case.
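A minimal sketch of that idiom (illustrative names only, not the actual BlockManager code): a flag flipped after the last statement of the `try` body lets the `finally` block detect the error path for any throwable, including `InterruptedException`, without ever catching it and clearing the interrupt status.

```scala
// Sketch of the finally-based cleanup idiom (hypothetical helpers, not the
// actual BlockManager code). Cleanup runs for *any* throwable because we
// never catch it; the flag just tells `finally` which path we took.
def putWithCleanup[T](doPut: () => T)(cleanupFailedPut: () => Unit): T = {
  var exceptionWasThrown = true
  try {
    val res = doPut()
    exceptionWasThrown = false // only reached if doPut() completed normally
    res
  } finally {
    if (exceptionWasThrown) {
      cleanupFailedPut() // error path: undo partial state, notify master, etc.
    }
  }
}
```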

Note that I've seen this same exception-handling idiom used in Java code, where code that catches and re-throws `Throwable` won't compile in older Java versions because of checked-exception handling (I think that newer versions are a bit more permissive about rethrowing exceptions from a `catch` block).

Contributor Author

That said, I think we could simplify this by moving the non-error-case code into the try block. Let me do that now.

Contributor

Ok, this is fine then. Could you leave a comment mentioning the `InterruptedException` problem? Otherwise, this LGTM.

@ericl
Contributor

ericl commented Sep 13, 2016

Ok, lgtm then


@SparkQA

SparkQA commented Sep 14, 2016

Test build #65339 has finished for PR 15085 at commit 8ab3108.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 14, 2016

Test build #65337 has finished for PR 15085 at commit f69a5ea.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 14, 2016

Test build #65342 has finished for PR 15085 at commit f60c4be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 15, 2016

Test build #65410 has finished for PR 15085 at commit 47f9636.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

I'm going to merge this to master and branch-2.0. Thanks for reviews @ericl and @srinathshankar!

asfgit pushed a commit that referenced this pull request Sep 15, 2016
…er put() exceptions


Author: Josh Rosen <[email protected]>

Closes #15085 from JoshRosen/SPARK-17484.

(cherry picked from commit 1202075)
Signed-off-by: Josh Rosen <[email protected]>
@asfgit asfgit closed this in 1202075 Sep 15, 2016
@JoshRosen JoshRosen deleted the SPARK-17484 branch September 15, 2016 18:58
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
…er put() exceptions


Author: Josh Rosen <[email protected]>

Closes apache#15085 from JoshRosen/SPARK-17484.
