HADOOP-17338. Intermittent S3AInputStream failures: Premature end of Content-Length delimited message body etc #2455
Conversation
looks good.
- some minor nits in the PR, mostly about logging.
- afraid you have to follow the testing_s3a.md rules and tell us which S3 region you ran all the integration tests against. Run with -Dscale, as that does more of the input stream seeking; see the sketch below.
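For reference, the scale run described in the testing docs looks roughly like this (a sketch only; it assumes you have an auth-keys.xml configured as testing_s3a.md describes, and exact flags may vary):

```bash
# run from the hadoop-aws module; -Dscale enables the larger scale tests,
# which exercise more of the input stream seek/abort paths
cd hadoop-tools/hadoop-aws
mvn verify -Dscale
```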
    @Retries.OnceRaw
    private void closeStream(String reason, long length, boolean forceAbort) {
      if (isObjectStreamOpen()) {
      if (!isObjectStreamOpen())
nit: wrap in { }. Also add a comment "stream is already closed"
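i.e. a minimal sketch of the suggested shape (names taken from the diff excerpt above; the rest of the method body is elided here):

```java
  @Retries.OnceRaw
  private void closeStream(String reason, long length, boolean forceAbort) {
    if (!isObjectStreamOpen()) {
      // stream is already closed
      return;
    }
    // ... existing close/abort logic ...
  }
```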
      // remaining object payload is read from S3 while closing the stream.
      LOG.debug("Aborting stream");
      wrappedStream.abort();
      LOG.warn("Aborting stream {}", uri);
revert to debug. We abort() a lot as we seek round files if read policy != random, and don't want application logs overloaded with messages
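i.e. roughly this (a sketch only, keeping the {} expansion but dropping the level back down):

```java
      // aborts are frequent when seeking around a file with a non-random
      // read policy, so log at debug to avoid flooding application logs
      LOG.debug("Aborting stream {}", uri);
      wrappedStream.abort();
```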
     * set
     */
    private volatile boolean closed;
    private S3Object object;
can you add a javadoc comment explaining why this is needed "we have to keep a reference to this to stop garbage collection breaking the input stream"
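e.g. something along these lines (the first sentence is illustrative; the rest is the wording requested above):

```java
  /**
   * The S3 object the wrapped stream reads from.
   * We have to keep a reference to this to stop garbage collection
   * breaking the input stream.
   */
  private S3Object object;
```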
| LOG.info("Got exception while trying to read from stream {}" + | ||
| " trying to recover: " + ioe, uri); | ||
| LOG.info("Got exception while trying to read from stream " + uri | ||
| + ", client: " + client + " object: " + object |
why the change? The previous one used SLF4J expansion, and deliberately only printed the IOE's message, not the stack.
If you want the stack, add a separate log at debug with the full details, and use SLF4J expansion of {} elements in the log string
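For illustration, that split would look something like this (message text is approximate, not the exact wording of the patch):

```java
    // concise message at info: SLF4J {} expansion, exception message only
    LOG.info("Got exception while trying to read from stream {}, trying to recover: {}",
        uri, ioe.toString());
    // full details, including the stack trace, only at debug
    LOG.debug("Failure of read on {}", uri, ioe);
```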
Thanks a lot for the review Steve! Will address the other comments asap.
The reason I made the change here is that the symptom is intermittent, and the stack is helpful when we need to diagnose issues. I hope to catch it when it occurs. My worry is that turning on debug will introduce a lot of other debug messages to the logs. But if exceptions happen here very often, we can move the stack printing to debug. Is that the case here though? Thanks.
leave at debug. We collect statistics on it, so if you see it a lot in the stats you can turn on debug logging for this class only.
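If it comes to that, enabling debug for just this class is a one-line log4j.properties change (assuming the stream class is org.apache.hadoop.fs.s3a.S3AInputStream):

```properties
# debug logging for the S3A input stream only
log4j.logger.org.apache.hadoop.fs.s3a.S3AInputStream=DEBUG
```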
yes, still want things at debug
Oh, noticed another JIRA I have related to improvements here:
Change #2 makes sense if there's an app reading many files in different threads: we want to know what is causing problems. Could you do that in this patch?
build failures are checkstyles. Be interesting to think of how to test this.
@yzhangal you active on this? I'd like to wrap it up this month, and all of it is done other than the log and checkstyle tuning.
Hi @steveloughran, sorry for the delay. I just did a new PR #2497, which hopefully addresses all of your comments here. I tried to run the tests by following https://hadoop.apache.org/docs/r3.1.4/hadoop-aws/tools/hadoop-aws/testing.html#The_submitter_of_any_patch_is_required_to_run_all_the_integration_tests_and_declare_which_S3_region.2Fimplementation_they_used., but I do see errors like timeouts, OOM etc. I wonder if the Hadoop Jenkins builds can be set up to run the s3a tests automatically like other tests. Thanks and happy holiday season.
I would like more test runs, e.g. the daily Jenkins runs, to at least have credentials, but I've yet to come up with a good design for secure execution. It'd need something like …
See: not easy. Put your error stack traces into the PR. A single test failure isn't enough to block a patch if we can identify a cause and say "this is independent". Given you are seeing things I'm not, that's something we need to understand.
Many thanks Steve. I see the challenge of supporting the tests upstream. I reran the hadoop-tools/hadoop-aws tests with the command "mvn verify -Dscale -DtestsThreadCount=8" and only got the following (some of the earlier failures I saw did not show): [ERROR] Errors: … Wonder if the failed tests work for you. Thanks.
Closing this as superseded by #2497.