Conversation

@steveloughran
Contributor

@steveloughran steveloughran commented Mar 20, 2017

What changes were proposed in this pull request?

have the `FileFormatWriter.ExecuteWriteTask.releaseResources()` implementations set `currentWriter = null` in a finally clause. This guarantees that if the first attempt to close `currentWriter` throws an exception, the second `releaseResources()` call made during the task cancel process will not trigger a second attempt to close the stream.
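The pattern is small; here is a minimal Java sketch of it (the actual Spark code is Scala, and `WriterTask` is a hypothetical stand-in for `ExecuteWriteTask`, not the real class):

```java
import java.io.Closeable;
import java.io.IOException;

class WriterTask {
    private Closeable currentWriter;

    WriterTask(Closeable writer) {
        this.currentWriter = writer;
    }

    // Safe to call repeatedly: even if close() throws, currentWriter is
    // nulled in the finally block, so a second call is a no-op rather
    // than a second close() attempt on the same stream.
    void releaseResources() throws IOException {
        if (currentWriter != null) {
            try {
                currentWriter.close();
            } finally {
                currentWriter = null;
            }
        }
    }
}
```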

How was this patch tested?

Tricky. I've been fixing the underlying cause I saw in HADOOP-14204, but SPARK-10109 shows I'm not the first to have seen this. I can't replicate it locally any more, as my code is no longer broken.

Code review, however, should be straightforward.

…clauses

Change-Id: I1e07f5b90ba1a2b05978b1d65876d746d07d1f3c
@steveloughran steveloughran changed the title [SPARK-20038] [core]: move the currentWriter=null assignments into finally {} … [SPARK-20038] [core]: FileFormatWriter.ExecuteWriteTask.releaseResources() implementations to be re-entrant Mar 20, 2017
@steveloughran steveloughran changed the title [SPARK-20038] [core]: FileFormatWriter.ExecuteWriteTask.releaseResources() implementations to be re-entrant [SPARK-20038] [SQL]: FileFormatWriter.ExecuteWriteTask.releaseResources() implementations to be re-entrant Mar 20, 2017
@steveloughran
Contributor Author

Note that as the exception handler tries to close resources before calling committer.abortTask(taskAttemptContext), without this patch a failing releaseResources() means that abortTask() isn't invoked, though hopefully abortJob will handle tasks too

@SparkQA

SparkQA commented Mar 20, 2017

Test build #74903 has finished for PR 17364 at commit 725740b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

HyukjinKwon commented Mar 21, 2017

@steveloughran, maybe this is strictly separate from the problem specified in the JIRA, but do you know if we should do the same thing to SparkHadoopMapReduceWriter? I remember I had to fix a similar resource-related problem in FileFormatWriter before.

@steveloughran
Contributor Author

steveloughran commented Mar 21, 2017

I haven't reviewed that bit of code: make it a separate JIRA and assign it to me. This one I came across in the Hadoop 2.8.0 RC3 testing; the underlying fix there is going in, but the Spark code should still be made more resilient to failure here. It's always those failure modes that get you, and when working with S3, close() can be where the PUT is initiated, so P(fail) > 0.

Part of HADOOP-13786 is MAPREDUCE-6823: making it straightforward to define a different implementation of the FileOutputFormat committer. That should make it easier to do some fault injection in the commit process, especially all those bits that violate the state machine entirely. I'll see what I can do about breaking things :)

@steveloughran
Contributor Author

Created SPARK-20045. I think there's room to improve resilience in the abort code, primarily to ensure that the underlying failure cause doesn't get lost. The codepath there is fairly complex and I'm not going to point at a snippet and say "here". A faulting mock committer would probably be the actual first step: show the problems, then fix them.

@squito
Contributor

squito commented Mar 21, 2017

Note that as the exception handler tries to close resources before calling committer.abortTask(taskAttemptContext), without this patch a failing releaseResources() means that abortTask() isn't invoked, though hopefully abortJob will handle tasks too

it doesn't look that way to me -- the abortTask is in a finally, so it should get called no matter what.

You could add a very targeted regression test: create the ExecuteWriteTask with mock OutputFormats, have those mocks throw an exception on close, then make sure two calls to releaseResources() work correctly. Your change is obviously correct, but a test would help avoid future regressions.
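A self-contained Java sketch of the shape such a test could take (the real Spark classes are Scala; `ThrowingWriter` and the inlined task are illustrative stand-ins, not actual Spark mocks):

```java
import java.io.Closeable;
import java.io.IOException;

class ReleaseTwiceCheck {
    // Mock writer that counts close() calls and always fails.
    static class ThrowingWriter implements Closeable {
        int closeCalls = 0;
        @Override public void close() throws IOException {
            closeCalls++;
            throw new IOException("simulated close failure");
        }
    }

    final ThrowingWriter writer = new ThrowingWriter();
    private Closeable currentWriter = writer;

    // Mirrors the patched pattern: null out the writer even when close()
    // throws, so a repeat call cannot close the stream a second time.
    void releaseResources() throws IOException {
        if (currentWriter != null) {
            try {
                currentWriter.close();
            } finally {
                currentWriter = null;
            }
        }
    }
}
```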

in any case, lgtm

@steveloughran
Contributor Author

steveloughran commented Mar 21, 2017

Looking some more: yes, as tryWithSafeFinallyAndFailureCallbacks wraps task commit, it guarantees that the original cause doesn't get lost. The abortJob code isn't so well guarded, and it looks like a failure there may hide a previous one (like a commitJob failure).

(Oh, and looking at FileOutputCommitter.abortJob(): it may throw an FNFE if called when its job output path doesn't exist, so callers probably do need to guard around it. It is not an idempotent call.)
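One way callers could guard a non-idempotent abort is to catch and report its failure rather than let it mask the original exception. A hypothetical Java sketch (`AbortGuard` and `bestEffortAbort` are illustrative names, not a Spark or Hadoop API):

```java
import java.io.IOException;

// Hypothetical helper: run a non-idempotent abort action (such as a
// FileOutputCommitter.abortJob() call) without letting its own failure
// mask the original exception that triggered the abort.
final class AbortGuard {
    interface AbortAction {
        void run() throws IOException;
    }

    // Returns the exception thrown by the abort action (if any) so the
    // caller can log it as suppressed, rather than rethrowing it over
    // the original failure cause.
    static IOException bestEffortAbort(AbortAction abort) {
        try {
            abort.run();
            return null;
        } catch (IOException e) {
            return e;
        }
    }
}
```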

@steveloughran
Contributor Author

I don't have time or plans to do the test here, as it's a fairly complex piece of test setup for a change which, as a review should show, does nothing other than guarantee the outcome currentWriter == null on all sequential execution paths.

@steveloughran
Contributor Author

@squito Is this ready to go in? As I warned, I'm not going to add tests for this, not on its own.

@squito
Contributor

squito commented Apr 13, 2017

merged to master

sorry I forgot to take a look at this for a while @steveloughran, thanks for the reminder

@asfgit asfgit closed this in 7536e28 Apr 13, 2017
@steveloughran
Contributor Author

thanks.

peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
…() implementations to be re-entrant

## What changes were proposed in this pull request?

have the `FileFormatWriter.ExecuteWriteTask.releaseResources()` implementations set `currentWriter = null` in a finally clause. This guarantees that if the first attempt to close `currentWriter` throws an exception, the second `releaseResources()` call made during the task cancel process will not trigger a second attempt to close the stream.

## How was this patch tested?

Tricky. I've been fixing the underlying cause I saw in [HADOOP-14204](https://issues.apache.org/jira/browse/HADOOP-14204), but SPARK-10109 shows I'm not the first to have seen this. I can't replicate it locally any more, as my code is no longer broken.

Code review, however, should be straightforward.

Author: Steve Loughran <[email protected]>

Closes apache#17364 from steveloughran/stevel/SPARK-20038-close.