HADOOP-17271. S3A to support IOStatistics #2553
Conversation
Test failures in tests which are seeing one too many LIST calls. I think the wiring up of list duration and metrics is causing this; I had something similar with object metadata and HEAD requests, hence the different key names for the duration and the simple counter. If the duration has the same name, then it updates the counter as well.
[ERROR] testCostOfListStatusOnEmptyDir[raw-delete-markers] Time elapsed: 6.77 s <<< FAILURE!
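The key-collision problem described above can be sketched with a toy model. This is a hypothetical mini statistics store, not the actual org.apache.hadoop.fs.statistics API: it simply assumes that tracking a duration under key K also increments counter K, which is why the duration and the simple counter need distinct names.

```java
// Toy model of the duration/counter key collision; names and behaviour
// are illustrative, not the real Hadoop IOStatistics implementation.
import java.util.Map;
import java.util.TreeMap;

public class DurationKeyCollision {
    private final Map<String, Long> counters = new TreeMap<>();

    /** Plain counter increment, as done by the request-issuing code. */
    public void incrementCounter(String key) {
        counters.merge(key, 1L, Long::sum);
    }

    /** In this model, tracking a duration under key K also bumps counter K. */
    public void trackDuration(String key, Runnable operation) {
        incrementCounter(key);
        operation.run();
    }

    public long counter(String key) {
        return counters.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        DurationKeyCollision stats = new DurationKeyCollision();
        // The LIST call increments its own request counter, but it runs
        // inside a duration tracker registered under the SAME key, so the
        // counter ends up one higher than the number of LIST calls made.
        stats.trackDuration("object_list_request",
            () -> stats.incrementCounter("object_list_request"));
        System.out.println(stats.counter("object_list_request")); // prints 2
    }
}
```

Giving the duration its own key (say a hypothetical `object_list_request_duration`) avoids the double count, which is the fix the comment describes.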
The latest PR revision fixes the test failures and does more.
For an example of the sorted statistics log, see https://gist.github.com/steveloughran/50ad1cf3d22d6683d4f2f038453be803. That's across all the tests in ITestS3AOperationCost, with remote access to S3 London from Bristol. There's a ~3 ms RTT just pinging my own base station from the laptop, so subtract that from all the timings.
💔 -1 overall
This message was automatically generated.
Mock test failure. As usual, it's as likely a mock-related change as a functional code change; but as the committer has been instrumented with stats collection, I'm going to suspect the production code first.
Narrator: it was the mock. It just took a long time to debug the fact that durationTracking of deleteObject was triggering an NPE which, in commit cleanup, was being swallowed. Joy.
💔 -1 overall
This message was automatically generated.
Thanks Steve. I see you fixed the issues I pointed out in the other PR and answered my questions.
I had two failed tests against Ireland; the same errors happened in two consecutive test runs:
[ERROR] Tests run: 22, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 195.891 s <<< FAILURE! - in org.apache.hadoop.fs.s3a.commit.magic.ITestMagicCommitProtocol
[ERROR] testParallelJobsToAdjacentPaths(org.apache.hadoop.fs.s3a.commit.magic.ITestMagicCommitProtocol) Time elapsed: 10.725 s <<< FAILURE!
java.lang.AssertionError: job UUIDs. Actual: job_200707120526_0006
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failEquals(Assert.java:185)
at org.junit.Assert.assertNotEquals(Assert.java:161)
at org.apache.hadoop.fs.s3a.commit.AbstractITCommitProtocol.testParallelJobsToAdjacentPaths(AbstractITCommitProtocol.java:1446)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)
[ERROR] Tests run: 24, Failures: 1, Errors: 0, Skipped: 16, Time elapsed: 56.8 s <<< FAILURE! - in org.apache.hadoop.fs.s3a.performance.ITestS3ADeleteCost
[ERROR] testDeleteSingleFileInDir[raw-delete-markers](org.apache.hadoop.fs.s3a.performance.ITestS3ADeleteCost) Time elapsed: 2.503 s <<< FAILURE!
java.lang.AssertionError: operation returning after fs.delete(simpleFile) action_executor_acquired starting=0 current=0 diff=0, action_http_get_request starting=0 current=0 diff=0, action_http_head_request starting=4 current=5 diff=1, committer_bytes_committed starting=0 current=0 diff=0, committer_bytes_uploaded starting=0 current=0 diff=0, committer_commit_job starting=0 current=0 diff=0, committer_commits.failures starting=0 current=0 diff=0, committer_commits_aborted starting=0 current=0 diff=0, committer_commits_completed starting=0 current=0 diff=0, committer_commits_created starting=0 current=0 diff=0, committer_commits_reverted starting=0 current=0 diff=0, committer_jobs_completed starting=0 current=0 diff=0, committer_jobs_failed starting=0 current=0 diff=0, committer_magic_files_created starting=0 current=0 diff=0, committer_materialize_file starting=0 current=0 diff=0, committer_stage_file_upload starting=0 current=0 diff=0, committer_tasks_completed starting=0 current=0 diff=0, committer_tasks_failed starting=0 current=0 diff=0, delegation_token_issued starting=0 current=0 diff=0, directories_created starting=2 current=3 diff=1, directories_deleted starting=0 current=0 diff=0, fake_directories_created starting=0 current=0 diff=0, fake_directories_deleted starting=6 current=8 diff=2, files_copied starting=0 current=0 diff=0, files_copied_bytes starting=0 current=0 diff=0, files_created starting=1 current=1 diff=0, files_delete_rejected starting=0 current=0 diff=0, files_deleted starting=0 current=1 diff=1, ignored_errors starting=0 current=0 diff=0, multipart_instantiated starting=0 current=0 diff=0, multipart_upload_abort_under_path_invoked starting=0 current=0 diff=0, multipart_upload_aborted starting=0 current=0 diff=0, multipart_upload_completed starting=0 current=0 diff=0, multipart_upload_part_put starting=0 current=0 diff=0, multipart_upload_part_put_bytes starting=0 current=0 diff=0, multipart_upload_started starting=0 current=0 diff=0, 
object_bulk_delete_request starting=3 current=4 diff=1, object_continue_list_request starting=0 current=0 diff=0, object_copy_requests starting=0 current=0 diff=0, object_delete_objects starting=6 current=9 diff=3, object_delete_request starting=0 current=1 diff=1, object_list_request starting=5 current=6 diff=1, object_metadata_request starting=4 current=5 diff=1, object_multipart_aborted starting=0 current=0 diff=0, object_multipart_initiated starting=0 current=0 diff=0, object_put_bytes starting=0 current=0 diff=0, object_put_request starting=3 current=4 diff=1, object_put_request_completed starting=3 current=4 diff=1, object_select_requests starting=0 current=0 diff=0, op_copy_from_local_file starting=0 current=0 diff=0, op_create starting=1 current=1 diff=0, op_create_non_recursive starting=0 current=0 diff=0, op_delete starting=0 current=1 diff=1, op_exists starting=0 current=0 diff=0, op_get_delegation_token starting=0 current=0 diff=0, op_get_file_checksum starting=0 current=0 diff=0, op_get_file_status starting=2 current=2 diff=0, op_glob_status starting=0 current=0 diff=0, op_is_directory starting=0 current=0 diff=0, op_is_file starting=0 current=0 diff=0, op_list_files starting=0 current=0 diff=0, op_list_located_status starting=0 current=0 diff=0, op_list_status starting=0 current=0 diff=0, op_mkdirs starting=2 current=2 diff=0, op_open starting=0 current=0 diff=0, op_rename starting=0 current=0 diff=0, s3guard_metadatastore_authoritative_directories_updated starting=0 current=0 diff=0, s3guard_metadatastore_initialization starting=0 current=0 diff=0, s3guard_metadatastore_put_path_request starting=0 current=0 diff=0, s3guard_metadatastore_record_deletes starting=0 current=0 diff=0, s3guard_metadatastore_record_reads starting=0 current=0 diff=0, s3guard_metadatastore_record_writes starting=0 current=0 diff=0, s3guard_metadatastore_retry starting=0 current=0 diff=0, s3guard_metadatastore_throttled starting=0 current=0 diff=0, store_io_request starting=0 
current=0 diff=0, store_io_retry starting=0 current=0 diff=0, store_io_throttled starting=0 current=0 diff=0, stream_aborted starting=0 current=0 diff=0, stream_read_bytes starting=0 current=0 diff=0, stream_read_bytes_backwards_on_seek starting=0 current=0 diff=0, stream_read_bytes_discarded_in_abort starting=0 current=0 diff=0, stream_read_bytes_discarded_in_close starting=0 current=0 diff=0, stream_read_close_operations starting=0 current=0 diff=0, stream_read_closed starting=0 current=0 diff=0, stream_read_exceptions starting=0 current=0 diff=0, stream_read_fully_operations starting=0 current=0 diff=0, stream_read_opened starting=0 current=0 diff=0, stream_read_operations starting=0 current=0 diff=0, stream_read_operations_incomplete starting=0 current=0 diff=0, stream_read_seek_backward_operations starting=0 current=0 diff=0, stream_read_seek_bytes_discarded starting=0 current=0 diff=0, stream_read_seek_bytes_skipped starting=0 current=0 diff=0, stream_read_seek_forward_operations starting=0 current=0 diff=0, stream_read_seek_operations starting=0 current=0 diff=0, stream_read_seek_policy_changed starting=0 current=0 diff=0, stream_read_total_bytes starting=0 current=0 diff=0, stream_read_version_mismatches starting=0 current=0 diff=0, stream_write_block_uploads starting=0 current=0 diff=0, stream_write_block_uploads_aborted starting=0 current=0 diff=0, stream_write_block_uploads_committed starting=0 current=0 diff=0, stream_write_bytes starting=0 current=0 diff=0, stream_write_exceptions starting=0 current=0 diff=0, stream_write_exceptions_completing_upload starting=0 current=0 diff=0, stream_write_queue_duration starting=0 current=0 diff=0, stream_write_total_data starting=0 current=0 diff=0, stream_write_total_time starting=0 current=0 diff=0: object_delete_objects expected:<2> but was:<3>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:834)
at org.junit.Assert.assertEquals(Assert.java:645)
at org.apache.hadoop.fs.s3a.S3ATestUtils$MetricDiff.assertDiffEquals(S3ATestUtils.java:1022)
at org.apache.hadoop.fs.s3a.performance.OperationCostValidator$ExpectSingleStatistic.verify(OperationCostValidator.java:401)
at org.apache.hadoop.fs.s3a.performance.OperationCostValidator.exec(OperationCostValidator.java:174)
at org.apache.hadoop.fs.s3a.performance.AbstractS3ACostTest.verifyMetrics(AbstractS3ACostTest.java:360)
at org.apache.hadoop.fs.s3a.performance.ITestS3ADeleteCost.testDeleteSingleFileInDir(ITestS3ADeleteCost.java:110)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)
+1 pending the merge conflict fix; this should be committed ASAP because this patch is large and will cause conflicts with other PRs.
(tested with
Hey @bgaborg! Thanks for this. Will get it in. I need to understand those failures; one may be independent. ITestS3ADeleteCost is asserting over metrics, so I'm assuming it's related, and I bet it's a check only performed on unguarded runs. The other one: odd.
Contains HADOOP-16830. Add IOStatistics API Change-Id: Ic5c4b014f90f1ed61c37d3ce9b17290736475047
This is to allow hadoop-aws to compile without needing to move to the relocated interfaces; it will also allow anything external which used those methods (unlikely) to keep working. Also: throw synchronized at methods to get findbugs to STFU. Tempting just to turn it off on the basis that it is overkill, but it could maybe be correct. Making all of MeanStatistic synchronized is irritating, as it will hurt performance on what should be a lightweight class, but it is needed to ensure samples and sum are consistently set. Change-Id: I4c3e2726e1e97d705c55dbb43a507ea4d0e81e28
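The consistency concern in that commit message can be sketched as follows. Class and method names here are illustrative rather than the exact Hadoop MeanStatistic API: the point is that samples and sum must change together, or a reader can observe a sum that includes a sample the count does not.

```java
// Sketch: a mean statistic where the sample count and the running sum
// must be updated and read as a pair. Synchronizing every method is
// heavyweight for such a lightweight class, but it guarantees consistency.
public class SyncMean {
    private long samples;
    private double sum;

    public synchronized void addSample(double value) {
        samples++;        // both fields change under the same lock...
        sum += value;     // ...so no reader sees a half-updated pair
    }

    public synchronized long getSamples() {
        return samples;
    }

    public synchronized double mean() {
        return samples == 0 ? 0.0 : sum / samples;
    }

    public static void main(String[] args) throws InterruptedException {
        SyncMean stat = new SyncMean();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) {
                    stat.addSample(2.0);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        // With synchronization, 40,000 samples of 2.0 always mean exactly 2.0.
        System.out.println(stat.getSamples() + " samples, mean = " + stat.mean());
    }
}
```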
Change-Id: I5f64704a82a196fd8ff66cbde10d4970722e1fd7
Change-Id: I3e9bb83bc32eddc5c7d84c1df7be77cea3e60f5d
- remove superfluous javadocs
- names of inner classes are, where possible, the same as the old ones
- storeContext reverts to instrumentation as the name of the statistics context field; reduces the diff slightly
Change-Id: I80bc65376f0d87bc9c38f265b38b1ee834c17d96
- failures are registered and reported differently
- helper in statistics.impl to wrap functions and callables with handling for this
- move BufferedIOStatistics streams from impl to fs.statistics, so it's declared OK for others to use
- counter increment ignores negative values and returns the existing value; a successful increment returns the latest value, as documented
- tests for the counter and gauge behaviour
Change-Id: Ic851240ebe94033bf93e0a11abefb7adda534191
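The counter contract described in that commit (negative increments ignored with the existing value returned; successful increments returning the latest value) might look roughly like this. A hedged sketch, not the actual IOStatisticsStore implementation:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the documented counter contract: a negative delta is ignored
// and the existing value returned; a successful increment returns the
// updated value.
public class IgnoringNegativeCounter {
    private final AtomicLong value = new AtomicLong();

    public long increment(long delta) {
        if (delta < 0) {
            return value.get();        // ignore negative deltas
        }
        return value.addAndGet(delta); // return the latest value
    }

    public long get() {
        return value.get();
    }
}
```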
Change-Id: I5d7237d844de94a1620356a1a45bb08a7515d768
When hasNext()/next() report exhaustion, close() is called, so as to clean up any open state/handles etc. This avoids having to tell developers downstream to look for an iterator being Closeable and call it if so.
+ minor improvements in the iterator test suite
+ fix typo found by Mehakmeet
Change-Id: Ibd103941278b8c0bc6c93a09ea469be4c60a58b1
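The close-on-exhaustion behaviour can be sketched like this. It is a simplified stand-in for Hadoop's RemoteIterator wrappers, with illustrative names:

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;

// Sketch: once hasNext() reports exhaustion, the underlying resource is
// closed automatically, so callers never need an instanceof-Closeable check.
public class SelfClosingIterator<T> implements Iterator<T>, Closeable {
    private final Iterator<T> source;
    private final Closeable resource;
    private boolean closed;

    public SelfClosingIterator(Iterator<T> source, Closeable resource) {
        this.source = source;
        this.resource = resource;
    }

    @Override
    public boolean hasNext() {
        if (closed) {
            return false;
        }
        boolean more = source.hasNext();
        if (!more) {
            close();  // release handles as soon as the scan finishes
        }
        return more;
    }

    @Override
    public T next() {
        return source.next();
    }

    @Override
    public void close() {
        if (!closed) {
            closed = true;
            try {
                resource.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}
```

A caller who abandons the iteration early can still call close() explicitly; the closed flag makes that idempotent.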
+ test changes: add verification that IOStatistics propagate all the way through the tombstone-reconciling chain. Change-Id: I57260cf0d1f512dff01f8a3aa54427ec9d321877
+ add class which can be used to store duration results, primarily for testing. Made public for others to play with too Change-Id: I0a7f274f5a2a180e1800102be177b308050725c0
* move the methods in fs.impl.FutureIOSupport which turn out to be needed when working with the async code to the public utils.functional package
* duration tracking invocation in IOStatisticsBinding tuned based on the ABFS coding experience
Change-Id: Ide9467fa67a8a5d3d8daec94a7a39ff87f106a47
* checkstyles * operation cost test setup failing if Statistic wasn't mapped to a counter Change-Id: I4445c32e0dc3c84064ad2a4ee6a2a9fe955c567c
Fix checkstyle in RemoteIOException unwrapping of exception looks for Error and rethrows Change-Id: I673b1e487cae2ec3204ce6ba1103cdae14cfe48c
+ IOStatisticsAssertions include checks for null IOStatistics instance passed in Not sure that we need the overkill of a wrapper for that type casting, but as we couldn't get away without it in HADOOP-17281, I've gone with it. Change-Id: I3ad4457d63860ac55c6dc7a6d22100dd86e12bf1
Rebase to trunk, fix up listStatusIterable, using the new type casting wrapper in RemoteIterators Change-Id: I158d09863d0dac6866a7c2f1e9130da8645dd82b
* the trackDuration(DurationTrackerFactory, ...) methods support a null factory and switch to a stub tracker
* there's an overloaded logIOStatisticsAtDebug() method to use IOStatisticsLogging's own logger
Change-Id: Ie786dc7aa8920a3fc8b22b55916f4b811810dab3
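The null-factory behaviour can be sketched as below. Interface and method names are illustrative approximations of the Hadoop duration-tracking idea, not the exact API; the point is that a missing factory is replaced by a no-op stub so callers never need to null-check.

```java
import java.util.concurrent.Callable;

// Sketch of null-tolerant duration tracking: with no factory supplied,
// a no-op stub tracker absorbs the tracking calls.
public class DurationTracking {

    public interface DurationTracker extends AutoCloseable {
        void failed();
        @Override void close();
    }

    public interface DurationTrackerFactory {
        DurationTracker trackDuration(String key);
    }

    /** Shared no-op tracker used whenever the factory is null. */
    private static final DurationTracker STUB = new DurationTracker() {
        @Override public void failed() { }
        @Override public void close() { }
    };

    public static <T> T trackDuration(DurationTrackerFactory factory,
                                      String key,
                                      Callable<T> operation) throws Exception {
        DurationTracker tracker =
            factory == null ? STUB : factory.trackDuration(key);
        try {
            return operation.call();
        } catch (Exception e) {
            tracker.failed();   // record the failure against the statistic
            throw e;
        } finally {
            tracker.close();    // stop the clock either way
        }
    }

    public static void main(String[] args) throws Exception {
        // Null factory: the stub silently absorbs the tracking calls.
        System.out.println(trackDuration(null, "op_delete", () -> "done"));
    }
}
```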
* PairedDurationTrackerFactory allows listing code to update both FS-level and instance-level stats
* test assertions renamed verifyStatistic* / assertStatistic* to make clear these are statistic assertions; also helps code completion
* new stream_read_unbuffered stream key
Change-Id: I25272fffeb3e66b5cec90ae6c6af74808c139b26
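The paired-tracker idea can be sketched as a tracker that fans out to two underlying trackers. Everything here is illustrative, not the real PairedDurationTrackerFactory:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: one timed operation updates two sets of statistics at once,
// e.g. filesystem-level and stream/instance-level.
public class PairedTracking {

    /** Minimal tracker contract: close() ends the tracked duration. */
    public interface Tracker {
        void close();
    }

    /** A tracker that just counts completed durations, for demonstration. */
    public static class CountingTracker implements Tracker {
        final AtomicLong completed = new AtomicLong();
        @Override
        public void close() {
            completed.incrementAndGet();
        }
    }

    /** Fan out: closing the paired tracker closes both underlying trackers. */
    public static Tracker pair(Tracker fsTracker, Tracker instanceTracker) {
        return () -> {
            fsTracker.close();
            instanceTracker.close();
        };
    }

    public static void main(String[] args) {
        CountingTracker fs = new CountingTracker();
        CountingTracker instance = new CountingTracker();
        Tracker paired = pair(fs, instance);
        // ... timed operation runs here ...
        paired.close();
        System.out.println(fs.completed.get() + " " + instance.completed.get());
    }
}
```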
* S3AInstrumentation to have an IOStatisticsStore for aggregating values
* as counters &c on the FS are incremented, so is that instance IOStatistics
* when streams are merged back in close(), they are updated too
* S3AFileSystem.toString() to print the stats
* all test suites to collect & log stats in @AfterClass
* new test run at the end of all suites to print the aggregate stats of that test process
Printing out the aggregate stats is interesting, even though test runs are unrepresentative of real work. Makes me think about a command-line tool to aggregate a list of stats files to produce a final one.
Change-Id: I24ea0ddd3b64c8233f1bdc1d021b98503df6e914
Change-Id: I911c74d77809038d0849ae90835fa3b88ff08b55
The code to catch/wrap/unwrap IOEs in async logic now uses Java's own UncheckedIOException. Broadly more usable. Change-Id: I61a6194f952448da01ab48badae0927a243a53b2
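The wrap/unwrap pattern that commit adopts is the standard java.io.UncheckedIOException idiom. A self-contained sketch under assumed names (readResource, readAsync, awaitIO are illustrative, not Hadoop methods):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Sketch: wrap IOExceptions as UncheckedIOException inside async lambdas
// (which cannot throw checked exceptions), then unwrap back to the
// original IOException at the synchronous boundary.
public class AsyncIO {

    public static String readResource(String name) throws IOException {
        if (name == null) {
            throw new IOException("no such resource");
        }
        return "contents of " + name;
    }

    public static CompletableFuture<String> readAsync(String name) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                return readResource(name);
            } catch (IOException e) {
                throw new UncheckedIOException(e);  // lambdas can't throw IOE
            }
        });
    }

    /** Rethrow the original IOException at the blocking call site. */
    public static String awaitIO(CompletableFuture<String> future)
            throws IOException {
        try {
            return future.join();
        } catch (CompletionException e) {
            Throwable cause = e.getCause();
            if (cause instanceof UncheckedIOException) {
                throw ((UncheckedIOException) cause).getCause();
            }
            throw e;
        }
    }
}
```

Callers see plain IOException semantics at the boundary while the async pipeline stays free of checked-exception plumbing.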
changes in hadoop-common Change-Id: Id32d456004eb3ea9a6e0bab154fa41ca2f4e2623
Change-Id: I0a5d24689bdfaea9d4976abc03f1680bea7fda74
* move S3A statistics from o.a.h.fs.s3a.impl.statistics into s3a.statistics (interfaces) and s3a.statistics.impl for the implementations
* the org.apache.hadoop.fs.s3a.Statistic enum adds a type for each entry; this allows test/instrumentation setup to immediately determine which are counters vs other types, and so set things up properly. This will make a difference as more statistics are added
Change-Id: I97b7d5058aaed8d58307d797b941c89c7e1259af
* tuning of standard statistic names
* javadoc review
* can generate an o.a.h.fs.Statistics instance from an IOStatistics instance; this can be used to unify statistics collection/reporting
* duration tracker to take a long as argument
* another duration-tracker lambda expression for any InvocationRaisingIOE void -> void
Change-Id: I66323d7e94f45d5e2e2128df18c7feef8cbbf188
* S3AFileSystem.getStatistics() now serves up a dynamic binding to S3AInstrumentation.instanceIOStatistics
* ...so no need to have separate statistics updating
* committer to track duration of file upload and file commit
* some review of block upload counting/gauges, but key changes left for a follow-up JIRA
* javadocs everywhere!
We're into the danger zone here as I'm adding more features; to get this in it needs to be merged, and then I can add new patches on top.
Change-Id: Ia70a23f63c54fe07eb04ab931aa985fc352f6e7e
* delete duration now tracked
* ...and split into single/bulk delete calls
* delete/rename cost tests updated to match
* ...and to work correctly if bulk deletes are disabled
* moved most of the Statistic names into the hadoop-common constants files, for reuse
* new method IOStatisticsLogging.ioStatisticsToSortedString() sorts the metrics before printing. It adds the cost of creating and inserting into TreeMaps, so should be used with care; used for test reporting
Change-Id: I2c7378e504aa364a39ea7feb50c4c6b538bb1d8f
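The sorted-reporting trick is essentially "copy into a TreeMap before printing". A minimal sketch of the idea (illustrative, not the real IOStatisticsLogging code), showing where the extra TreeMap cost comes from:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: render counters in sorted key order by copying into a TreeMap.
// The copy costs O(n log n) time and extra allocation, so this belongs on
// test/log reporting paths, not hot paths.
public class StatsReport {
    public static String toSortedString(Map<String, Long> counters) {
        StringBuilder sb = new StringBuilder("counters=(");
        String sep = "";
        for (Map.Entry<String, Long> e : new TreeMap<>(counters).entrySet()) {
            sb.append(sep).append(e.getKey()).append('=').append(e.getValue());
            sep = ", ";
        }
        return sb.append(')').toString();
    }
}
```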
The test failure was just a mock brittleness issue; it just took a long time with the debugger to work out why. An NPE was being raised in S3AFileSystem.getDurationFactory() as the instrumentation was null; this was happening in commit abort and so was swallowed. Fix:
* getDurationFactory() returns null if instrumentation is null; the trackDuration code is all robust to this
* the mock overrides the new method
(This is why I hate mocks, BTW. They can simulate failures, but are so, so brittle.)
Also:
* tests on the trackDuration calls now make sure they are all invoked
* some copy-and-paste of the trackDuration logic, which somewhat reduces the depth of lambda expressions
* serialization of IOStatisticsSnapshot: a method to list required classes and a test to verify this holds, to aid secure use
* IOStatisticsLogging.ioStatisticsToSortedString is now ioStatisticsToPrettyString and filters out all empty/unset values (e.g. counters == 0, min/max unset, means with samples == 0). Makes for leaner logs.
Findbugs: once S3AInstrumentation declares itself an IOStatisticsSource, all nested classes need to call this.getIOStatistics to stop the findbugs warning.
Change-Id: I7d1da267c442b1e22dd025be97457cd4a6ae5306
Change-Id: I5f6afb156c26783bc4bd488b8be55830c80e1746
Rebased and pushed up. Gabor, I've run this repeatedly and have not been able to recreate either test failure! I'll merge this in as two patches (one for hadoop-common, one for hadoop-aws) and then we will see about the follow-ups. The MagicCommitProtocol one is going to be unrelated; it implies that your test run didn't get unique job IDs. I'm wondering if that's due to some state from a previous test run, which would explain why I haven't seen it. Will need to investigate more.
💔 -1 overall
This message was automatically generated.
This is the rebase of #2324 and #2323, with the following changes.
We're into the danger zone here as I'm adding more features; to get this
in it needs to be merged and then I can add as new patches on top.
Specifically I'm looking at: aws SDK (major) and better tracking of upload
progress (minor).
Hadoop common
HADOOP-16830. API/implementation changes for statistics.
* can generate an o.a.h.fs.Statistics instance from an IOStatistics instance; this can be used to unify statistics collection/reporting
Hadoop-aws
HADOOP-17271. S3A statistics enhancement/tuning.
* S3A statistics move into s3a.statistics (interfaces) and s3a.statistics.impl for the implementations
* the Statistic enum adds a type for each entry; this allows test/instrumentation setup to immediately determine which are counters vs other types, so set things up properly. This will make a difference as more statistics are added
* S3AFileSystem.getStatistics() serves up a dynamic binding to S3AInstrumentation.instanceIOStatistics
* some review of block upload counting/gauges, but key changes are left for a follow-up JIRA
Testing:
S3 London with
I'll do a scale run next.