HADOOP-15183 S3Guard store becomes inconsistent after partial failure of rename #654
💔 -1 overall
Force-pushed 3f9bd3d to 5cd2465
💔 -1 overall
typo, patch -> path
Force-pushed 5cd2465 to c2da32a
💔 -1 overall
Note: the layout of packages and aspects of the architecture are primarily based on https://github.com/steveloughran/engineering-proposals/blob/master/refactoring-s3a.md
💔 -1 overall
Force-pushed 215e4f8 to bf5527c
💔 -1 overall
Force-pushed bf5527c to 23955fc
Not applying to trunk... don't understand this. Rebased.
💔 -1 overall
Having weirdness here, as Yetus can't apply the patch. Local attempt (over SFO airport wifi).
OK, some bit of the patch history is causing confusion |
Force-pushed 23955fc to 77548b4
Squashed the entire patch into one commit.
💔 -1 overall
Force-pushed 77548b4 to 68af07f
💔 -1 overall
Given the requirement mentioned below, shouldn't the default fs.s3a.connection.maximum be greater than the default fs.s3a.threads.max?
Moved to 72. Now, 64 threads seems like a lot, but most of these are blocking operations rather than IO-heavy actions.
unfinished comment
fixed
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/test/ExtraAssertions.java
💔 -1 overall
…ing check for multidelete into the correct test case

Change-Id: Ie8ca305b79a1e3bfb963cf10c917ce59c2b0c39f
* Dir to delete contains a mix of r/o and r/w files.
* Asserts that for MPU, only the r/o files are guaranteed to be undeleted.
* In -Dscale, thousands of files (created in parallel threads for max perf).

Findings are interesting. For a bulk delete, even with a PAYG DDB table, it's the metastore delete which takes O(files), and a slow O(files) at that.

Change-Id: I388c24605cd4ac9e07bb43dd16fbfdda1e1cf085
This includes some changes in executor construction and the expansion of the defaults from the (too small) default values to something more realistic. Note: big production deployments against a single store still benefit from a very large thread pool and connection pool; the core-default comments cover that.

These changes bring the time to delete 1000+ files down from minutes to tens of seconds.

TODO:
1. Unwrap exceptions raised in the DDB operation and wrapped into IOEs.
2. Page the DDB deletions to some size (100?) to avoid going OOM on a wide/deep tree simply through the number of files to delete. That'll be slightly less efficient.

Change-Id: I5e998be6cd4e86b486c744f7f8864ac7945244f4
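TODO item 2 is essentially just splitting the key list into fixed-size sublists before each round trip. A minimal sketch of that paging, assuming a hypothetical `PageDeleter` callback and the page size of 100 floated above — this is illustrative, not the actual S3A code:

```java
// Illustrative sketch only: page a large list of keys into fixed-size
// batches so each delete call stays bounded in memory.
// PagedDeleter, PageDeleter and PAGE_SIZE are hypothetical names.
import java.io.IOException;
import java.util.List;

public final class PagedDeleter {
  private static final int PAGE_SIZE = 100;   // assumed page size from the TODO

  public static void deleteInPages(List<String> keys, PageDeleter deleter)
      throws IOException {
    for (int start = 0; start < keys.size(); start += PAGE_SIZE) {
      int end = Math.min(start + PAGE_SIZE, keys.size());
      // subList is a view; the callee should copy it if it keeps a reference.
      deleter.deletePage(keys.subList(start, end));
    }
  }

  /** Callback invoked once per page of keys. */
  @FunctionalInterface
  public interface PageDeleter {
    void deletePage(List<String> page) throws IOException;
  }
}
```

Paging trades a few extra round trips for a hard bound on the number of keys held per request, which is what avoids the OOM on very wide/deep trees.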
* Disable sleep if the interval is set to 0; there's no need to throttle it any more.
* Track the # of parent directories across batches, to avoid a prune of 1000 files triggering 40 attempts at changing the parent state.
* ITestPartialRenamesDeletes attempts a prune of the path in teardown.

This is all to address the fact that once you have a scale test creating and deleting many thousands of files in unique paths in every test run, you accrue many tombstone markers fast, which prune then goes after.

Change-Id: I02f7dffa94785a01cd2a2452cf2ad67d4534f32b
We move to batched execution of (parallel) delete calls, so that a failure fails fast and there's no need to build up structures of millions of completable futures.

Change-Id: I5e58310fcf7f545ce8904b0d2a80a02cb2681cf4
…failures-during-copy

Change-Id: Ida9e7424bd635fd6cc246ce65ab8d96c6fb8a958
…ry allocator) by providing a specific allocator callback. This is the purer way anyway: users of the store context should get the functionality offered by the store, rather than access to the implementation details.

Note: I could have created template FunctionWithIOE and BiFunctionWithIOE interfaces in fs.impl. Semi-tempted, but trying to ramp back ambitions.

Change-Id: I59fb2bf66218be6318c83fc1c919e004118533ab
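For reference, the FunctionWithIOE / BiFunctionWithIOE idea mentioned above would amount to little more than `java.util.function` lookalikes whose apply() declares IOException. A sketch — these interfaces were deliberately not added in this patch, and the names simply follow the commit message:

```java
// Sketch only: what FunctionWithIOE / BiFunctionWithIOE could look like.
// Not part of this patch; the names come from the commit message above.
import java.io.IOException;

@FunctionalInterface
interface FunctionWithIOE<T, R> {
  /** Like java.util.function.Function, but allowed to raise an IOException. */
  R apply(T t) throws IOException;
}

@FunctionalInterface
interface BiFunctionWithIOE<T, U, R> {
  /** Like java.util.function.BiFunction, but allowed to raise an IOException. */
  R apply(T t, U u) throws IOException;
}
```

A directory-allocator callback on the store context could then be typed as something like FunctionWithIOE&lt;String, File&gt; without exposing the allocator implementation itself.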
…g there in parallel auth test runs. Change the failure to include a listing, so as to work out what is happening. I don't think this is related to the rest of the work, so have made this patch isolated.

Change-Id: I0ca0a1e315ee1d963457249a2d147fbbd2b98901
…er about copy and update logic first, and want to merge in the versioning stuff from Cerner first.

Change-Id: If167d7b1d89d0a0873708ee9e471a7b68489f040
Pulls out the code to manage state during an S3A rename into a RenameOperation class, which each metastore gets to return on a call to initiateRenameOperations(). The current implementation of this is "just" the design originally in S3AFileSystem.innerRename(): build up a list of destination metadata entries to create.

Change-Id: Ibb88f6af0a7f88cc528700942059aede86a72dea

TODO: actually do either incremental updates of the store, or write the current state on a failure.
This schedules copies through the executor up to ten at a time (random choice), doing a join() to wait for completion after a batch of ten, before issuing a bulk delete, and after an exception is thrown. That's the trouble with the parallel runs, see: if one fails, then you still have to wait for the active operation to reach some state before escalating.

I've moved the updating of the operation state into the submitted closures. This still doesn't handle making the store consistent, but it reflects the structure in which I want the rename to work; now I'm going to add a new operation implementation which does incremental work. `ITestPartialRenamesDeletes.testRenameFilesetFailsLeavingSourceUnchanged` still fails.

Added the empty test cases for the conditions I now want to create, moving beyond delete() failures into failures of the copy process (write, read) to cover them too, once the current test works.

Copy code changes:
+ some DurationInfo uses @ debug to log how long things take
+ propagates storage class from source
+ getObjectMetadata now retries using the s3guard invoker if non-null, standard invoker otherwise. This is to handle failures and the case that the object to rename isn't there yet.

Thoughts:
* Maybe: issue individual DELETE calls in that operation if single delete is enabled. This will eliminate the big paused delete phases. But do I need to care about this? It's only needed for working with non-AWS S3 impls, and for them I'd expect copy operations to be O(1), so while DELETE is slow, copy will be so fast that it'll be unimportant except in test cases with thousands of small files.
* Maybe: when we know files are 0 bytes long, rather than copy, just do a 0-byte PUT (but this loses the headers &c).
* We may want to increase the time to wait in the S3Guard invoker to stabilise, making a separate option, and add 412 as retryable.
* The page size for a multidelete is fixed at the AWS limit of 1000. This means we aren't currently testing the paged operation. Proposed: an fs.s3a.internal. property to make this smaller for testing only (not in Constants, not in docs, except in testing.md).
* The existing huge file test does a file rename. It should be changed to do a dir rename.

Change-Id: Icfec827b001ec4546491b459d7ccc6f0f1a45706
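A minimal sketch of that batching pattern, to make the control flow explicit; the batch size of ten, the plain Runnable task type, and the helper names are illustrative rather than the actual S3AFileSystem code:

```java
// Minimal sketch of the "submit up to ten, join, then bulk delete" pattern
// described above. BatchedCopier, BATCH_SIZE and the Runnable tasks are
// illustrative assumptions, not the real S3A rename implementation.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

final class BatchedCopier {
  private static final int BATCH_SIZE = 10;

  static void copyThenDelete(List<Runnable> copyTasks,
                             Runnable bulkDelete,
                             Executor executor) {
    List<CompletableFuture<Void>> active = new ArrayList<>(BATCH_SIZE);
    for (Runnable copy : copyTasks) {
      active.add(CompletableFuture.runAsync(copy, executor));
      if (active.size() == BATCH_SIZE) {
        // Wait for the whole batch; a failure surfaces here rather than
        // after millions of futures have been queued up.
        CompletableFuture.allOf(active.toArray(new CompletableFuture[0])).join();
        active.clear();
        bulkDelete.run();   // issue the bulk delete for the completed batch
      }
    }
    if (!active.isEmpty()) {
      CompletableFuture.allOf(active.toArray(new CompletableFuture[0])).join();
      bulkDelete.run();
    }
  }
}
```

The join() after each batch is what gives the fail-fast behaviour: an exception in any copy surfaces after at most BATCH_SIZE outstanding operations, rather than after the whole fileset has been scheduled.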
* Add explicit notion of a StoreOperation, similar to the Gang-of-Four Command pattern; RenameOperation => RenameTracker and a subclass of this.
* DelayedUpdateRenameTracker will update the metastore with dest files on a failure, while leaving the source alone.
* Now creating a directory tree for each rename test, so that we can verify that it works for >1 entry deep. Size of the tree depends on the -Dscale flag. The delete cleanup tries to work out from the metastore if a parent dir is empty or not; needs some isolation for its own test.

New tests all happy. This still only updates the store on the completion of a successful rename, so the window of inconsistency (and failure) is O(files * data); moving to do it after every file is copied will (a) reduce peak load and (b) shrink the window significantly.

Change-Id: I33c82856107f5c9c685c72eedd5309d6489a56dd
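A hedged sketch of the tracker callback shape this implies; the method names and signatures below are illustrative only, not the actual s3guard classes:

```java
// Sketch of the RenameTracker callback shape implied above. The class and
// method names here are hypothetical illustrations of the pattern, not the
// actual org.apache.hadoop.fs.s3a.s3guard API.
import java.io.IOException;
import org.apache.hadoop.fs.Path;

abstract class RenameTrackerSketch {
  /** Invoked after a single file has been copied to its destination. */
  abstract void fileCopied(Path source, Path dest, long length) throws IOException;

  /** Invoked once all copies have succeeded and the sources are deleted. */
  abstract void completeRename() throws IOException;

  /**
   * Invoked when the rename fails partway; a DelayedUpdate-style tracker
   * would persist the destination entries here while leaving sources alone.
   */
  abstract void renameFailed(Exception cause) throws IOException;
}
```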
This adds a rename tracker which adds entries incrementally and deletes source files after a delete. As such, the failed() and completed() operations become no-ops.

+ Not yet tested; thinking of some unit tests to add alongside the live ones.
+ Also explicitly sorting the order of addition and deletion of pathmetadata elements, such that in a patch delete, higher-level entries are added first, while during deletion, leaves are deleted first. This is to maintain the invariant "there are never any entries in the tables which do not have parents". Lots of tests on the sorting... if someone could tell me why Comparators.reverse returns something which fails my tests, I would love to know.

Outstanding TODO items:
* Switch DDB and local metastores to the new tracker.
* Isolated unit test to simulate some of the operations and verify that all is good.

Change-Id: Iad76eefe4770699da6d867fa34ab59cebc9cf28c

Testing: only ITestPartialRenamesDeletes against S3 Ireland.
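The ordering invariant is simple to sketch: sort by path depth, shallowest first when adding entries, deepest first when deleting. The helper names below are hypothetical; this is just an illustration of the invariant, not the patch's sorting code:

```java
// Illustrative sketch of the ordering invariant described above: parents
// are added before children, children are deleted before their parents.
// PathOrdering and its members are hypothetical names.
import java.util.Comparator;
import java.util.List;
import org.apache.hadoop.fs.Path;

final class PathOrdering {
  /** Shallowest paths first: use when adding entries, so a parent always
   *  exists before any of its children are written. */
  static final Comparator<Path> TOPMOST_FIRST =
      Comparator.comparingInt(Path::depth);

  /** Deepest paths first: use when deleting entries, so leaves go before
   *  their parents and no orphaned child entries are ever visible. */
  static final Comparator<Path> LEAVES_FIRST = TOPMOST_FIRST.reversed();

  static void sortForAdd(List<Path> paths) {
    paths.sort(TOPMOST_FIRST);
  }

  static void sortForDelete(List<Path> paths) {
    paths.sort(LEAVES_FIRST);
  }
}
```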
* Improvements to diagnostics, e.g. rename tracker string value.
* Metastore interface marks that move() args may be null; fix up all implementations to handle this (DynamoDB already did, but not local).
* Added another rename test to ITestS3AContractRename which verifies file contents at the end and that the counters are as expected. This was done to debug a test failure, but is low-cost and useful enough to retain.
* Remove from the S3Guard class the returning of new elements... that class now matches trunk again.
* Stub test class TestProgressiveRenameTracker, for which I want to add some unit tests if I can think of some.

Tested: yes, local and dynamo, auth/not-auth.

TODO:
* Some unit tests for the progressive rename tracker (we can, so let's try, because that way Yetus will run them).
* Reviews by others; respond to their feedback.

Change-Id: I9cfe5313f93399b70ac9aedaf0edafd10aa1db3e
This adds the notion of an operation state which can be preserved across move() calls and so allow the metastore to update its view of the world without having to talk to any remote service. For DDB it maintains that hash table of ancestors, so a set of move operations spread across parallel threads will still share the same map of ancestors to avoid creating duplicates of.

Now, the bad news, as observed during test runs with debug logging: way too many metadata entries are being created. It seems to me that every putItem() call creates an ancestor list which then puts up all the ancestor markers, without checks for them existing. That is: the deeper you write, the more (expensive) write IOPs you create. If I haven't misunderstood it (and I hope I have!) then that map of ancestors isn't that useful, as entries get created anyway. All that's happening is that the write amplification isn't quite so big.

Added instrumentation of low-level record read/write requests for DynamoDB.

I'm still thinking about how best to deal with this, especially across a series of write operations likely to write to the same directory tree (that's rename and commit). Probably something like:
* track which parent paths we know to exist
* and which have been created

This would be part of the current ancestor state and, inevitably, have to be used during batch commit jobs too, so we'll know to avoid so much of that work. For write and commit, it'd be best if finishedWrite() did the walk up the tree and stopped on the first successful probe for a parent, as it's isolated, and a DDB GET costs less than a PUT. That'd avoid the big write-context-spanning-operations change I'm considering but don't want to add in this patch, which is big enough anyway.

Change-Id: Idabb6100e90629ddb3e04dbc3e66573abaedba13
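A sketch of the "track which parent paths we know to exist" idea from the list above, assuming a hypothetical per-operation state object; this is not the real BulkOperationState API, just an illustration of how redundant ancestor PUTs could be skipped:

```java
// Sketch of the ancestor-tracking idea discussed above: a per-operation
// state remembering which parent paths are already known to exist, so a
// batch of writes doesn't re-put the same directory markers. The class and
// method names are hypothetical.
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.fs.Path;

final class AncestorStateSketch {
  private final Set<Path> knownAncestors = new HashSet<>();

  /**
   * Walk from the entry's parent up towards the root, returning only those
   * ancestors not already known; the caller writes markers for these and
   * nothing else.
   */
  synchronized Set<Path> ancestorsToCreate(Path entry) {
    Set<Path> toCreate = new HashSet<>();
    for (Path parent = entry.getParent();
         parent != null && !parent.isRoot();
         parent = parent.getParent()) {
      if (!knownAncestors.add(parent)) {
        break;   // this parent (and so everything above it) is already recorded
      }
      toCreate.add(parent);
    }
    return toCreate;
  }
}
```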
Convert the bulk operation for move into an optional BulkOperationState for put operations in the metastore. This is then used all the way through to commit operations, so as to address HADOOP-15604. I've not yet completed all the wiring in the metastore: this is the first step, having an optional bulk update.

Change-Id: I0275a344715eab002024f1644db8858059c71bb8

Also: revert changes in ContractTestUtils as they created a link conflict, and because that library gets used in contract tests by external stores, I don't want to needlessly break things.
Change-Id: Ib25d5bb7735132745dab150208f5fafad600bdb6
Force-pushed 05554e2 to 301a7d4
💔 -1 overall
… update parent directories once.
+ Ancestor checking does this.
+ Instrumentation update in DDB => optional.
+ ITestPartialRenamesDeletes prunes on teardown properly. This stops the store being full of tombstone markers and, during development of the code, many invalid entries.
+ DDB prune will only sleep at the end of each batch, so if there is only work done at cleanup, no need to sleep at all.

Change-Id: Ibdf63af1f5563489d15166f06b8c8e7c9ebfaec0
💔 -1 overall
Change-Id: Iee6c82070a3c050051ff80dcc03f877d0115c7e0
Change-Id: I4cc830515ce9cce300727d5d73c5c74160951be1
🎊 +1 overall
Change-Id: I518f427a287e72ac5d617d08ec71e24fbd683a14
🎊 +1 overall
Closing this PR and kicking off a new one.
This patch carefully pulls the changes out into a new class, s3a.impl.MultiObjectDeleteSupport, which isolates this from everything else and is designed for testing in isolation.
Tests are still WiP; I've pulled the partial rename failure ones out of ITestAssumeRole into isolation here, because the fact that they use assumed roles is just a detail used to set up the failures.
It currently only does the delete failure, but I'll probably include the copy failures in here as well, as they are so related.
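The point of isolating this in MultiObjectDeleteSupport is that the failure-splitting logic becomes a pure function that can be unit tested without S3. A minimal sketch of that kind of function, using plain string keys and hypothetical names rather than the real class's API:

```java
// Sketch of the kind of pure function a MultiObjectDeleteSupport class can
// isolate for unit testing: given the keys in the delete request and the
// keys reported as failed, split them into (deleted, undeleted).
// DeleteOutcomeSketch and splitOutcome are hypothetical names.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DeleteOutcomeSketch {
  /** Returns [successfully deleted keys, undeleted keys], in request order. */
  static List<List<String>> splitOutcome(List<String> requestedKeys,
                                         List<String> failedKeys) {
    Set<String> failed = new HashSet<>(failedKeys);
    List<String> deleted = new ArrayList<>();
    List<String> undeleted = new ArrayList<>();
    for (String key : requestedKeys) {
      (failed.contains(key) ? undeleted : deleted).add(key);
    }
    return Arrays.asList(deleted, undeleted);
  }
}
```

Because the split is computed from plain lists, the partial-failure cases (r/o files undeleted, r/w files gone) can be asserted in fast unit tests, with the assumed-role integration tests only needed to generate the failures for real.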