Conversation

@steveloughran
Contributor

@steveloughran steveloughran commented Mar 20, 2017

What changes were proposed in this pull request?

have the `FileFormatWriter.ExecuteWriteTask.releaseResources()` implementations set `currentWriter = null` in a finally clause. This guarantees that if the first attempt to close `currentWriter` throws an exception, the second `releaseResources()` call made during the task cancel process will not trigger a second attempt to close the stream.
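The pattern is small; here is a minimal Java sketch of it (the actual Spark code is Scala, and `WriterTask` is a hypothetical stand-in for `ExecuteWriteTask`, not the real class):

```java
import java.io.Closeable;
import java.io.IOException;

class WriterTask {
    private Closeable currentWriter;

    WriterTask(Closeable writer) {
        this.currentWriter = writer;
    }

    // Safe to call repeatedly: even if close() throws, currentWriter is
    // nulled in the finally block, so a second call is a no-op rather
    // than a second close() attempt on the same stream.
    void releaseResources() throws IOException {
        if (currentWriter != null) {
            try {
                currentWriter.close();
            } finally {
                currentWriter = null;
            }
        }
    }
}
```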

How was this patch tested?

Tricky. I've been fixing the underlying cause I saw in HADOOP-14204, but SPARK-10109 shows I'm not the first to have seen this. I can't replicate it locally any more, as my code is no longer broken.

Code review, however, should be straightforward.

…clauses

Change-Id: I1e07f5b90ba1a2b05978b1d65876d746d07d1f3c
@steveloughran steveloughran changed the title [SPARK-20038] [core]: move the currentWriter=null assignments into finally {} … [SPARK-20038] [core]: FileFormatWriter.ExecuteWriteTask.releaseResources() implementations to be re-entrant Mar 20, 2017
@steveloughran steveloughran changed the title [SPARK-20038] [core]: FileFormatWriter.ExecuteWriteTask.releaseResources() implementations to be re-entrant [SPARK-20038] [SQL]: FileFormatWriter.ExecuteWriteTask.releaseResources() implementations to be re-entrant Mar 20, 2017
@steveloughran
Contributor Author

Note that as the exception handler tries to close resources before calling committer.abortTask(taskAttemptContext), without this patch a failing releaseResources() means that abortTask() isn't invoked, though hopefully abortJob will handle tasks too

@SparkQA

SparkQA commented Mar 20, 2017

Test build #74903 has finished for PR 17364 at commit 725740b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

HyukjinKwon commented Mar 21, 2017

@steveloughran, maybe this is strictly separate from the problem specified in the JIRA, but do you know if we should do the same thing to SparkHadoopMapReduceWriter? I remember I had to fix a similar resource-related problem in FileFormatWriter before.

@steveloughran
Contributor Author

steveloughran commented Mar 21, 2017

I haven't reviewed that bit of code: make it a separate JIRA and assign it to me. This one I came across in the Hadoop 2.8.0 RC3 testing; the underlying fix there is going in, but the Spark code should still be made more resilient to failure here. It's always those failure modes that get you, and when working with S3, close() can be where the PUT is initiated, so P(fail) > 0.

Part of HADOOP-13786 is MAPREDUCE-6823: making it straightforward to define a different implementation of the FileOutputFormat committer. That should make it easier to do some fault injection in the commit process, especially all those bits that violate the state machine entirely. I'll see what I can do about breaking things :)

@steveloughran
Contributor Author

Created SPARK-20045. I think there's room to improve resilience in the abort code, primarily to ensure that the underlying failure cause doesn't get lost. The codepath there is fairly complex and I'm not going to point at a snippet and say "here". A faulting mock committer would probably be the actual first step: show the problems, then fix them.

@squito
Contributor

squito commented Mar 21, 2017

Note that as the exception handler tries to close resources before calling committer.abortTask(taskAttemptContext), without this patch a failing releaseResources() means that abortTask() isn't invoked, though hopefully abortJob will handle tasks too

it doesn't look that way to me -- the abortTask is in a finally, so it should get called no matter what.

You could add a very targeted regression test: create the ExecuteWriteTask with mock OutputFormats, have those mocks throw an exception on close, then make sure two calls to releaseResources() work correctly. Your change is obviously correct, but a test would help avoid future regressions.
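A self-contained Java sketch of the shape such a test could take (the real Spark classes are Scala; `ThrowingWriter` and the inlined task are illustrative stand-ins, not actual Spark mocks):

```java
import java.io.Closeable;
import java.io.IOException;

class ReleaseTwiceCheck {
    // Mock writer that counts close() calls and always fails.
    static class ThrowingWriter implements Closeable {
        int closeCalls = 0;
        @Override public void close() throws IOException {
            closeCalls++;
            throw new IOException("simulated close failure");
        }
    }

    final ThrowingWriter writer = new ThrowingWriter();
    private Closeable currentWriter = writer;

    // Mirrors the patched pattern: null out the writer even when close()
    // throws, so a repeat call cannot close the stream a second time.
    void releaseResources() throws IOException {
        if (currentWriter != null) {
            try {
                currentWriter.close();
            } finally {
                currentWriter = null;
            }
        }
    }
}
```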

in any case, lgtm

@steveloughran
Contributor Author

steveloughran commented Mar 21, 2017

Looking some more: yes, as tryWithSafeFinallyAndFailureCallbacks wraps task commit, it guarantees that the original cause doesn't get lost. The abortJob code isn't so well guarded, and it looks like a failure there may hide a previous one (like a commitJob failure).

(Oh, and looking at FileOutputCommitter.abortJob(): it may throw an FNFE if called when its job output path doesn't exist, so callers probably do need to guard around it. It is not an idempotent call.)
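One way callers could guard a non-idempotent abort is to catch and report its failure rather than let it mask the original exception. A hypothetical Java sketch (`AbortGuard` and `bestEffortAbort` are illustrative names, not a Spark or Hadoop API):

```java
import java.io.IOException;

// Hypothetical helper: run a non-idempotent abort action (such as a
// FileOutputCommitter.abortJob() call) without letting its own failure
// mask the original exception that triggered the abort.
final class AbortGuard {
    interface AbortAction {
        void run() throws IOException;
    }

    // Returns the exception thrown by the abort action (if any) so the
    // caller can log it as suppressed, rather than rethrowing it over
    // the original failure cause.
    static IOException bestEffortAbort(AbortAction abort) {
        try {
            abort.run();
            return null;
        } catch (IOException e) {
            return e;
        }
    }
}
```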

@steveloughran
Contributor Author

I don't have time or plans to do the test here, as it's a fairly complex piece of test setup for a change which, as a review should show, does nothing other than guarantee the outcome currentWriter == null on all sequential execution paths.

@steveloughran
Contributor Author

@squito Is this ready to go in? As I warned, I'm not going to add tests for this, not on its own.

@squito
Contributor

squito commented Apr 13, 2017

merged to master

sorry I forgot to take a look at this for a while @steveloughran, thanks for the reminder

@asfgit asfgit closed this in 7536e28 Apr 13, 2017
@steveloughran
Contributor Author

thanks.

peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
…() implementations to be re-entrant

## What changes were proposed in this pull request?

have the `FileFormatWriter.ExecuteWriteTask.releaseResources()` implementations set `currentWriter = null` in a finally clause. This guarantees that if the first attempt to close `currentWriter` throws an exception, the second `releaseResources()` call made during the task cancel process will not trigger a second attempt to close the stream.

## How was this patch tested?

Tricky. I've been fixing the underlying cause I saw in [HADOOP-14204](https://issues.apache.org/jira/browse/HADOOP-14204), but SPARK-10109 shows I'm not the first to have seen this. I can't replicate it locally any more, as my code is no longer broken.

Code review, however, should be straightforward.

Author: Steve Loughran <[email protected]>

Closes apache#17364 from steveloughran/stevel/SPARK-20038-close.