-
Notifications
You must be signed in to change notification settings - Fork 9.1k
HADOOP-16085: use object version or etags to protect against inconsistent read after replace/overwrite #794
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
commit ae876ab2df46c68ddd923edf8dd1d314191fcc94 Merge: 2e0254e 6a42745 Author: Ben Roling <[email protected]> Date: Thu May 2 10:14:10 2019 -0500 Merge branch 'trunk' into HADOOP-16085-squashed-2 commit 2e0254e Author: Ben Roling <[email protected]> Date: Thu Apr 18 12:13:40 2019 -0500 Remove unused import commit d1275e4 Merge: 450ba66 df76cdc Author: Ben Roling <[email protected]> Date: Thu Apr 18 12:10:01 2019 -0500 Merge branch 'trunk' into HADOOP-16085-squashed commit 450ba66 Author: Ben Roling <[email protected]> Date: Thu Apr 18 11:45:41 2019 -0500 Improvements to TestObjectChangeDetectionAttributes, AbstractS3AMockTest commit 408af6c Author: Ben Roling <[email protected]> Date: Thu Apr 18 10:29:05 2019 -0500 Use HttpStatus code constant instead of magic number commit 5f0532b Author: Ben Roling <[email protected]> Date: Thu Apr 18 10:02:50 2019 -0500 Update core-default.xml commit 3488b20 Author: Ben Roling <[email protected]> Date: Wed Apr 17 16:14:43 2019 -0500 Fix runaround of creating FileStatus and then calling fromFileStatus() commit 90d5c9c Author: Ben Roling <[email protected]> Date: Wed Apr 17 15:45:51 2019 -0500 Fix minor nits commit 3ff59e4 Author: Ben Roling <[email protected]> Date: Wed Apr 17 15:07:02 2019 -0500 Mutate S3AFileStatus instead of creating new instance commit 13fab97 Author: Ben Roling <[email protected]> Date: Wed Apr 17 14:30:55 2019 -0500 Rename S3LocatedFileStatus to S3ALocatedFileStatus commit bee4e52 Author: Ben Roling <[email protected]> Date: Wed Apr 17 14:25:17 2019 -0500 Stop pretending to support group and permission attributes on S3AFileStatus commit 807e13b Author: Ben Roling <[email protected]> Date: Wed Apr 17 14:20:14 2019 -0500 Add serialVersionUID to S3LocatedFileStatus commit 9974cec Author: Ben Roling <[email protected]> Date: Mon Apr 8 13:34:38 2019 -0500 Fix missed group or owner tweak commit 708c001 Author: Ben Roling <[email protected]> Date: Mon Apr 8 12:58:14 2019 -0500 Fix S3AFileStatus group handling ITestS3AConfiguration.testUsernameFromUGI was failing, expecting the user to be copied into the group. Strict copying of user into group causes TestLocalMetadataStore.testPutNew() to fail since it expects the group to be preserved from the original FileStatus. This change copies user into group when group is null/empty. With this change, all existing tests pass. commit 5239a9f Author: Ben Roling <[email protected]> Date: Thu Apr 4 16:38:31 2019 -0500 Skip tests that require versionId when bucket doesn't have versioning enabled commit 4c6331e Author: Ben Roling <[email protected]> Date: Mon Apr 1 13:58:24 2019 -0500 Fix broken TestObjectChangeDetectionAttributes commit 8a19c42 Author: Ben Roling <[email protected]> Date: Mon Apr 1 10:08:33 2019 -0500 Squashed commit of the following: commit 9f4ad88 Author: Ben Roling <[email protected]> Date: Mon Apr 1 09:29:35 2019 -0500 Add test for 412 response commit dc0a3fb Author: Ben Roling <[email protected]> Date: Thu Mar 28 16:53:46 2019 -0500 Update tests that started failing due to HADOOP-15999 commit 5e1f3e3 Author: Ben Roling <[email protected]> Date: Thu Mar 28 15:49:26 2019 -0500 Speed up ITestS3ARemoteFileChanged commit 1b6be40 Author: Ben Roling <[email protected]> Date: Thu Mar 28 14:23:53 2019 -0500 Skip invalid test when object versioning enabled commit 8597d2e Merge: 2d235f8 b5db238 Author: Ben Roling <[email protected]> Date: Thu Mar 28 11:54:50 2019 -0500 Merge remote-tracking branch 'apache/trunk' into HADOOP-16085 commit 2d235f8 Author: Ben Roling <[email protected]> Date: Thu Mar 28 11:51:26 2019 -0500 Fix typo commit dc83cef Author: Ben Roling <[email protected]> Date: Thu Mar 28 10:28:09 2019 -0500 Generalize TestObjectETag to cover versionId and test overwrite commit 0d71f32 Author: Ben Roling <[email protected]> Date: Thu Mar 28 08:45:42 2019 -0500 Fix trailing whitespace commit 324be6d Author: Ben Roling <[email protected]> Date: Wed Mar 27 22:00:57 2019 -0500 S3GuardTool updates to correct ETag or versionId metadata commit 2a2bba7 Author: Ben Roling <[email protected]> Date: Wed Mar 27 21:27:27 2019 -0500 Clarify log message commit 6e62a3a Author: Ben Roling <[email protected]> Date: Wed Mar 27 21:17:48 2019 -0500 Documentation updates per PR feedback commit 1ff8bef Author: Ben Roling <[email protected]> Date: Wed Mar 27 16:05:59 2019 -0500 check version.required on CopyResult commit e296275 Author: Ben Roling <[email protected]> Date: Wed Mar 27 16:04:50 2019 -0500 Minor javadoc improvements from PR review commit 3e9ea19 Author: Ben Roling <[email protected]> Date: Wed Mar 27 13:15:58 2019 -0500 Skip tests that aren't applicable with change.detection.source=versionId commit ddbf68b Author: Ben Roling <[email protected]> Date: Wed Mar 27 11:56:38 2019 -0500 Add tests of case where no version metadata is present commit 21d37dd Author: Ben Roling <[email protected]> Date: Wed Mar 27 09:25:46 2019 -0500 Fix compiler deprecation warning commit b8e1569 Author: Ben Roling <[email protected]> Date: Wed Mar 27 09:19:46 2019 -0500 Fix license issue commit 33bb5f9 Author: Ben Roling <[email protected]> Date: Wed Mar 27 09:19:32 2019 -0500 Fix findbugs issue commit 5b7fadb Author: Ben Roling <[email protected]> Date: Wed Mar 27 09:00:39 2019 -0500 Fix checkstyle issues commit 6110a11 Author: Ben Roling <[email protected]> Date: Wed Mar 27 08:28:37 2019 -0500 Remove trailing whitespace commit d82069b Author: Ben Roling <[email protected]> Date: Tue Mar 26 16:05:01 2019 -0500 Improve S3Guard doc commit ca2f0e9 Author: Ben Roling <[email protected]> Date: Tue Mar 26 14:29:03 2019 -0500 Fix ITestS3ARemoteFileChanged commit 1e4fa85 Author: Ben Roling <[email protected]> Date: Tue Mar 26 11:37:48 2019 -0500 Increase local metastore cache timeout commit 34b0c80 Author: Ben Roling <[email protected]> Date: Tue Mar 26 11:35:34 2019 -0500 Fix isEmptyDir inconsistency commit bbf8365 Author: Ben Roling <[email protected]> Date: Mon Mar 25 16:55:24 2019 -0500 TestPathMetadataDynamoDBTranslation tests null etag, versonId commit 2ae7d16 Author: Ben Roling <[email protected]> Date: Mon Mar 25 16:54:49 2019 -0500 Add constants in TestDirListingMetadata commit 068a55d Author: Ben Roling <[email protected]> Date: Mon Mar 25 15:43:45 2019 -0500 Add copy exception handling commit 0eca6f3 Author: Ben Roling <[email protected]> Date: Mon Mar 25 12:43:51 2019 -0500 Don't process response from copy commit ad9e152 Author: Ben Roling <[email protected]> Date: Mon Feb 25 16:41:54 2019 -0600 HADOOP-16085-003.patch Rebase of previous work after merge of HADOOP-15625.
|
🎊 +1 overall
This message was automatically generated. |
Includes retries for regular reads, select(), and rename()
|
I've pushed a commit that adds retries as discussed in #675 (comment) The retries happen in S3AInputStream if the version doesn't match on initial open. There are no retries if the version doesn't match on re-open (during seek() backwards). Retries also happen for rename() and select(). Testing was added in ITestS3ARemoteFileChanged. I used Mockito.spy() on the s3 client to stub in inconsistent responses until a threshold of retries is met. I've run the full test suite (against a bucket with versioning enabled in us-west-2): The two errors were in ITestDirectoryCommitMRJob and ITestS3GuardConcurrentOps, which succeeded when run individually: #675 (comment) suggests possibly different retry settings for these scenarios. I haven't done that, at least yet. Perhaps that can be carved off as another issue. Similarly, I haven't implemented the HADOOP-13293 proposal. I'm open to those things but would like to get the rest of this settled (merged) first if possible. |
|
thanks, I'm checking this out and going to test/review it locally. with the goal of getting it in this week. I might do some changes to the PR locally and push them up as a branch for you to cherry-pick in, as that is potentially easier than me just adding a large set of bits of homework for you to do. Would that be OK? it should save time all round |
|
Great, thanks Steve!
Sure, that sounds good to me. |
+add stevel review (primarily of tests) Change-Id: I75a3b70917eefc0a0ec3190ca1de527e2081551e
|
Right, I've done my edits and will put it up as a PR alongside that: if you cherry pick my patch in here, then I'll close/delete that one and this will have everything in. That patch is me just going through my review comments and doing them. I am seeing failures with |
|
See #803 |
|
Thanks Steve! I'll have a look over it and see what's up with |
|
I looked over your changes and they all made sense to me. Thanks for cleaning up my mistakes :) . I've fast-forwarded this PR branch to include your commit. The change to add the annotation that labeled the parameters on the parameterized tests is especially nice. Embarrassingly I hadn't learned about that one yet. After pulling your changes in I re-ran |
|
The latest commit fixes the test failures I was seeing against a bucket with versioning disabled. In the (etag,client) case for There were a few other failures in test methods that require versioning since I hadn't copied the code that executes the assumption to make sure versioning is there. I ran the full ITestS3ARemoteFileChanged once each against a bucket with versioning enabled and a bucket with versioning disabled and all tests either succeeded or were skipped as expected. I will run the full test suite again against a bucket with versioning disabled just in case there are somehow other failures to uncover there since previously I only ran the suite against a bucket with versioning enabled. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
I completed the test run (us-west-2, bucket with versioning disabled): The 3 errors and 1 failure were spread across ITestS3AContractGetFileStatusV1List, ITestDirectoryCommitMRJob, and ITestS3GuardToolDynamoDB, which succeeded when ran individually. |
I only learned about it last week. Note, for any code which creates FS paths via calls to getMethodName, you need to have strings which can form valid paths (no spaces etc) |
|
One thing which needs changing: the error message. Currently it says "on position -X" but now it should describe what operation failed |
Ah, yes, the "reading" and position mention is kind of misleading. It does mention the operation, but that is easy to overlook: org.apache.hadoop.fs.s3a.RemoteFileChangedException: copy ... |
|
new exception will include the operation and only include a position if >= 0: |
Sounds good to me. Are you crafting a commit for that? If not I can do it here in a bit. |
-move CopyOutcome -add new (failing) test for directory renaming where one of the files underneath is eventually consistent. -pull out commonanity from test methods (spyOnFilesystem(), expectReadFailure()) -intercept() calls provide details on failure, primarily just by returning the result of the operation expected to fail. -ChangeDetectionPolicy-raised exceptions on copy failures don't include position any more -cache/restore metastore so even on test failures mockito is involved on teardown Change-Id: I1fbdc38e9083881e4d72f13f29845e34539ac4ed
|
here: #807 -move CopyOutcome into fs.s3a.impl package Tests: yes, some fail, testRenameEventuallyConsistent* depending on the options. I think its really down to "how many failures are being generated"...but I don't want to blindly change them until they work, not without knowing why |
|
Perfect. I had a quick review of your changes and as before they look good. The additional logging in the tests should be especially helpful. I'll give the tests a run to presumably reproduce the failures and then help explain and fix them. I have to step away from it for a little bit here but I should be able to come back to it a little later today. |
|
That'd be great. If you turn @debug logging on for the tests you can get a better view of what's happening in terms of when an operation is being called. Now some other issues
But:
I'm worried that the addition of retry resilience may be putting a double wrapper around existing failure conditions and their retries.
BTW, note that the CopyOutcome class only stores SdkBaseExceptions raised in the copy, as does the method to await the copy to complete. I've not put that through all the way through the other bits of the patch, but think it makes sense as other failures shouldn't need to be considered by the change tracker. |
|
Good catches. This retry stuff is a bit tricky. I obviously overlooked the fact that getObjectMetadata() was retrying. That also needs to be considered in S3AFileSystem.select() where I added a layer of retries to wait for consistency, but it is wrapping getObjectMetadata() as well. A getObjectMetadata() that takes the ChangeTracker as you suggest seems like it could apply in both copyFile() and select(). It seems like a good approach to try. I assume maybe you are already taking a shot at that? If not, I can give it a whirl. |
Yep, agree. |
AWS response stubbing was incorrect for inconsistency during copyObject(). AWS throws AmazonClientException caused by NPE if constraints aren't met per aws/aws-sdk-java#1644. Documentation and tests updated to reflect this.
|
Hey @steveloughran - I think I've covered everything previously discussed now if you can have a look again? One thing I did notice with my last update is that the exception message from copyFile() does not qualify the paths. That's previous behavior, but I'm guess was a mistake? Should I make an update here to qualify those? You can see an example in what I added to s3guard.md. I'll be running the full test suite again and posting the results. |
Realized I forgot that. 57ad29b covers it. |
|
🎊 +1 overall
This message was automatically generated. |
|
Test run complete against a bucket in us-west-2 with object versioning enabled: ITestDirectoryCommitMRJob completed successfully when run again individually: I'll do another run with a bucket with object versioning disabled. |
|
🎊 +1 overall
This message was automatically generated. |
That's done now. Like when I ran with versioning enabled, the only test with an error was ITestDirectoryCommitMRJob>AbstractITCommitMRJob.testMRJob, which succeeded on an individual run: |
|
I'm travelling right now & can't review this until wed/thur -it's got priority as I want this in. The failure of the MR commit is interesting as I'm seeing it too -I need to understand why that test is being flaky. Big Q: is it only with this patch which makes it surface? If you could do a test run with your settings against trunk, that'd help isolate us. BTW PR #818 does the SDK update. Again, having a play with that would be good in general. w.r.t qualifying paths in errors, there's no consistency. I like to get the full FS URI as it helps debug things. |
|
Thanks Steve, hope you have a great trip. I've seen the ITestDirectoryCommitMRJob failure before. I noted it on my PR for HADOOP-16221. It looks to me like you logged the issue as HADOOP-16207 back on March 26. The only change I worked on that got in before that was for HADOOP-15625. I suppose it is possible that it is somehow related to that. I can try a run from the commit just prior to that. I'll also have a closer look at that test to see if I can deduce what might be going wrong. |
|
Ben: my etag-server version of the NPE test is failing on the client. Probably there's some config quirk surfacing where the options aren't being cleared. Changing the text of the exception to be clear it's on the client |
…rk consistently * clear auth mode flag from FS config, so there is a metastore check before rename(file), ensuring that the values passed to the mock are consistent everywhere * change error in ChangeDetectionPolicy to make clear when an error is coming on the client for a getObjectMetadata call even when the copy policy is "server" * and fix up ITestS3ARemoteFileChanged to use the constant string defined in ChangeDetectionPolicy precisely to stop tests being brittle against changes in the text. * new tests in ITestS3ARemoteFileChanged to not lose stack traces when examining causes/inner causes of exceptions, just to throw new assertions with nested causes. * deleted a couple of unused imports * moved from two spaces after a "." to one. Sorry. Change-Id: I30e28fdc9431668058ce1eae8517cc25f31fa051
|
Created #824 with my changes. @bgaborg and I worked out that the reason your tests were working but our test runs failing were because you'd explicitly set authoritative = false in your config files, whereas we were going with the default, and for hadoop-aws test runs, auth is automatically set to true (we should look at that elsewhere). Also made some other changes about error reporting and how assertions on exceptions in test failures are examined. FWIW, here's my (best practises on test design)[https://github.com/steveloughran/formality/blob/master/styleguide/styleguide.md], a two keys ones being:
With these test we have consistent test runs. Let's see what Yetus says. We've been discussing how we ought to actually switch between auth and nonauth mode in some of the parameterized tests so we can actually explore both options; the value of getObjectMetadata() calls expected in the tests would be: once we've got that test coverage down and we are all happy with our test suites working locally, we should be done |
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Listing.java
Outdated
Show resolved
Hide resolved
Hmmm, strange. I haven't actually set anything about auth mode in my config. I'll have to look at this again. Unfortunately I have some other things going on here that are probably going to require most of my attention today. I will get back to this as quickly as I can. |
|
OK, I'll do the following
as soon as you can after that, if you can do a test run and are happy then we will declare victory here, get it in, and I can then worry about merge conflicts with HADOOP-15183 in that PR alone. |
|
Also:
|
* added auth mode as a param on tests; switch half the existing tests over to it (so improve coverage without adding new runs) * javadocs for the tests to help maintenance and debugging in future * updated test docs to cover versioned buckets * s3guard bucket info prints change mode/policy (and signing policy, while I am there) * and updated examples of bucket-info to be consistent with the current output. Change-Id: Ib2577da02f5d91df427a0359e6ccca2add4894da
…rivial >80 char line width Change-Id: I4782ae8d7e4da74251f30b81589f67954869127d
|
I've looked at all your changes and they look good to me. I'll pull them into this PR shortly and repeat my test runs. You've definitely made some good improvements. The doc added to ITestRemoteFileChanged is especially nice.
Where is that happening? In my quick search I am not seeing it. It looks to me as though you would need to pass -Dauth to switch it on (something I have not done in any of my runs). If you've been running the tests with auth mode on then that does explain why tests were passing for me but not for you though and your fix looks like a good one.
No need to apologize - I'm not offended :) I'm struggling slightly to unlearn the two spaces thing. I'm not sure why, but I've only become aware of it recently and the behavior is so ingrained in me. I'm young enough not to have ever typed anything real on a typewriter yet for whatever reason in school I learned the two spaces. |
|
I pushed a fast-forward to the latest commit from #824 into this PR. Also, I completed my first full test suite run against a bucket with versioning disabled: The failures in ITestS3AMiscOperations seem to be related to the fact that I recently turned on default encryption on the bucket (to quiet some warnings I was getting from some best practice auditing my org does against buckets). I turned that back off and re-ran that test individually and it succeeded. I'll log a new JIRA to track that. The failure from ITestDirectoryCommitMRJob is one that has been discussed before. Re-running that test individually yielded success. Now I'll run the suite one more time against a bucket with versioning enabled. |
Logged HADOOP-16319. |
Done with that now too: As before, ITestDirectoryCommitMRJob succeeds individually. I'm happy with everything here now. Thanks so much for all of your help! Let me know if there's anything more you'd like to see. |
|
🎊 +1 overall
This message was automatically generated. |
|
OK, I'm happy with this, and am doing my preflight tests
|
|
happy; rebase and retest. Tests in the final run failing were related to SSL (HADOOP-16050) and again, an intermittent v1 listing one. We've seen that v1 list failure before so it's not related, at the same time I'm somewhat worried about hints of inconsistencies as well as the "why is this only the v1 test". Not letting it stop me here though +1 |
|
The changes from this were merged into trunk as a36274d. As such, I think it makes sense to close this out. |
Author: Aditya Toomula <[email protected]> Reviewers: Srinivasulu Punuru <[email protected]> Closes apache#794 from atoomula/remote
This is a new PR to replace #675 . The only difference vs what is on #675 is a merge of the latest code from trunk, resolution of minor merge conflicts when doing so, and squash down to a single commit.
#675 is being closed as-is to preserve the discussion there.
I have performed a full test run on this (against us-west-2 with a bucket with versioning enabled) which passed without any failing tests: