[SPARK-20038] [SQL]: FileFormatWriter.ExecuteWriteTask.releaseResources() implementations to be re-entrant #17364
Conversation
…clauses Change-Id: I1e07f5b90ba1a2b05978b1d65876d746d07d1f3c
Note that as the exception handler tries to close resources before calling
Test build #74903 has finished for PR 17364 at commit
@steveloughran, maybe this is strictly separate from the problem specified in the JIRA, but do you know if we should do the same thing to
I haven't reviewed that bit of code: make it a separate JIRA and assign to me. This one I came across in the Hadoop 2.8.0 RC3 testing; the underlying fix there is going in, but the Spark code should still be made more resilient to failure here. It's always those failure modes which get you -- and working with S3, that close() can be where the PUT is initiated, so P(fail) > 0. Part of HADOOP-13786 is MAPREDUCE-6823: making it straightforward to define a different implementation of the FileOutputFormat committer. That should make it easier to do some fault injection in the commit processes, especially all those bits that violate the state machine entirely. I'll see what I can do about breaking things :)
Created SPARK-20045. I think there's room to improve resilience in the abort code, primarily to ensure that the underlying failure cause doesn't get lost. The codepath there is fairly complex and I'm not going to point at a snippet and say "here". Some faulting mock committer would probably be the actual first step: show the problems, then fix.
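To illustrate the "underlying failure cause doesn't get lost" concern above: a minimal, hypothetical Java sketch (not Spark's actual abort code; `runAndAbortOnFailure` and both lambdas are invented for illustration) of the standard pattern for keeping the root cause when abort-time cleanup also throws, by attaching the cleanup failure as a suppressed exception.

```java
// Illustrative sketch only, not Spark code: when cleanup after a failure
// also throws, attach the cleanup error as suppressed so the root cause
// is what propagates, rather than being masked by the abort failure.
class AbortSketch {
    static void runAndAbortOnFailure(Runnable work, Runnable abort) {
        try {
            work.run();
        } catch (RuntimeException rootCause) {
            try {
                abort.run();
            } catch (RuntimeException abortFailure) {
                // Don't let the abort-time failure hide the original one.
                rootCause.addSuppressed(abortFailure);
            }
            throw rootCause;
        }
    }
}
```

A faulting mock committer in a test would then assert on both the thrown exception's message and its `getSuppressed()` array.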
it doesn't look that way to me -- the

You could add a very targeted regression test -- create the

in any case, lgtm
looking some more, yes, as

(oh, and looking at
I don't have time or plans to do the test here, as it's a fairly complex piece of test setup for what a review should show isn't doing anything other than guarantee the outcome of
@squito Is this ready to go in?

Like I warned, I'm not going to add tests for this, not on its own
merged to master

sorry I forgot to take a look at this for a while @steveloughran, thanks for the reminder
thanks. |
…() implementations to be re-entrant

## What changes were proposed in this pull request?

Have the `FileFormatWriter.ExecuteWriteTask.releaseResources()` implementations set `currentWriter = null` in a finally clause. This guarantees that if the first call to `currentWriter()` throws an exception, the second `releaseResources()` call made during the task cancel process will not trigger a second attempt to close the stream.

## How was this patch tested?

Tricky. I've been fixing the underlying cause when I saw the problem [HADOOP-14204](https://issues.apache.org/jira/browse/HADOOP-14204), but SPARK-10109 shows I'm not the first to have seen this. I can't replicate it locally any more, my code no longer being broken.

Code review, however, should be straightforward.

Author: Steve Loughran <[email protected]>

Closes apache#17364 from steveloughran/stevel/SPARK-20038-close.
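The re-entrant pattern the commit describes can be sketched in a few lines. This is a hypothetical Java illustration, not Spark's Scala code: `WriteTaskSketch` and `FlakyWriter` are invented names, standing in for an `ExecuteWriteTask` whose writer fails on `close()`.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical stand-in for a writer whose close() fails (e.g. an S3
// output stream whose PUT is initiated at close time); not Spark's class.
class FlakyWriter implements Closeable {
    int closeCalls = 0;

    @Override
    public void close() throws IOException {
        closeCalls++;
        throw new IOException("simulated failure during close()");
    }
}

// Minimal sketch of the re-entrant release pattern: null the reference in
// a finally clause, so a second releaseResources() call is a no-op even
// when the first close() attempt threw.
class WriteTaskSketch {
    FlakyWriter currentWriter = new FlakyWriter();

    void releaseResources() throws IOException {
        if (currentWriter != null) {
            try {
                currentWriter.close();
            } finally {
                // Cleared even when close() throws, so the re-entrant call
                // made during task cancellation won't close a second time.
                currentWriter = null;
            }
        }
    }
}
```

Without the finally clause, the abort path's second `releaseResources()` call would invoke `close()` again on an already-failed stream, which is exactly the double-close this PR guards against.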