
SDK repeatedly complaining "Not all bytes were read from the S3ObjectInputStream" #1211

@steveloughran

Description


With a recent upgrade to the 1.11.134 SDK, tests which seek around a large CSV file are triggering a large set of repeated warnings about closing the stream early.

2017-06-27 15:47:05,121 [ScalaTest-main-running-S3ACSVReadSuite] WARN  internal.S3AbortableInputStream (S3AbortableInputStream.java:close(163)) - Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
2017-06-27 15:47:06,730 [ScalaTest-main-running-S3ACSVReadSuite] WARN  internal.S3AbortableInputStream (S3AbortableInputStream.java:close(163)) - Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.

full log

I've seen other people complaining about this, with the issue being closed as "user gets to fix their code", or similar. However, here's why I believe the system is overreacting: it is making invalid assumptions about the amount of data remaining and the relative cost of reading it versus aborting and reconnecting, and it also fails to note that it has already warned about this.

Hadoop supports reading files of tens to hundreds of GB, with client code assuming it has a POSIX-style input stream where seek() is inefficient, but less inefficient than reading and discarding the intervening data.

The first S3A release did call close() rather than abort(), leading to HADOOP-11570: seeking was pathologically bad on a long input stream. What we do now, since HADOOP-13047, is provide a tunable threshold for forward seeks, below which we read and discard the bytes rather than abort and reopen. The default is 64K; for long-haul links a value of 512K works better. But above 512K, even over a long-haul connection, it is better to set up a new HTTPS connection than to try to reuse an existing HTTP/1.1 connection, which is what we do.
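To make the trade-off concrete, here is a simplified sketch of that forward-seek policy. It is an illustration only, not the real org.apache.hadoop.fs.s3a.S3AInputStream: the class, field and method names (ForwardSeekSketch, forwardSeekLimit, reopenAt) are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;

/**
 * Simplified sketch of the forward-seek policy described above.
 * NOT the real org.apache.hadoop.fs.s3a.S3AInputStream; names are
 * hypothetical and chosen for illustration only.
 */
class ForwardSeekSketch {
  private final long forwardSeekLimit;  // e.g. 64K default, 512K for long-haul links
  private InputStream wrappedStream;    // the S3ObjectInputStream in the real code
  private long pos;

  ForwardSeekSketch(InputStream in, long forwardSeekLimit) {
    this.wrappedStream = in;
    this.forwardSeekLimit = forwardSeekLimit;
  }

  void seek(long targetPos) throws IOException {
    long diff = targetPos - pos;
    if (diff > 0 && diff <= forwardSeekLimit) {
      // Short forward seek: cheaper to read and discard the intervening
      // bytes, keeping the HTTP/1.1 connection alive for reuse.
      long skipped = 0;
      while (skipped < diff) {
        long n = wrappedStream.skip(diff - skipped);
        if (n <= 0) {
          break;
        }
        skipped += n;
      }
      pos += skipped;
    } else {
      // Long forward seek (or a backward seek): draining the remainder would
      // cost more than a new HTTPS connection, so close/abort the stream and
      // reopen at targetPos. This is the path that triggers the SDK warning.
      wrappedStream.close();
      wrappedStream = reopenAt(targetPos);
      pos = targetPos;
    }
  }

  private InputStream reopenAt(long targetPos) throws IOException {
    // In the real code this would issue a ranged GET starting at targetPos.
    throw new UnsupportedOperationException("illustration only");
  }
}
```

Both branches are deliberate: the drain path keeps the connection reusable for short hops, and the close-and-reopen path is chosen precisely because draining a large remainder would be slower.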

Only now, every time it happens, a message appears in the log saying "This is likely an error". It's not: it's exactly what we want to do, based on our benchmarking of IO performance. We do have a faster IO mechanism for when users explicitly want random access, but as that is pathological on non-seeking file reads, it's not on by default.
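For workloads that really are random-access, that faster mechanism can be switched on explicitly through the filesystem configuration. A minimal example follows; it assumes the fs.s3a.experimental.input.fadvise and fs.s3a.readahead.range property names, so check the documentation of the Hadoop release in use.

```java
import org.apache.hadoop.conf.Configuration;

// Minimal example of opting in to the random-access IO path explicitly.
// Property names are assumptions to verify against your Hadoop release;
// in a cluster these would normally live in core-site.xml.
public class RandomIoExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // "random": issue bounded ranged GETs instead of reading towards EOF.
    conf.set("fs.s3a.experimental.input.fadvise", "random");
    // Forward-seek drain threshold discussed above (bytes): 64K is the
    // default; larger values such as 512K can work better on long-haul links.
    conf.setLong("fs.s3a.readahead.range", 512 * 1024);
  }
}
```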

I'm covering this in HADOOP-14596; I think we'll end up configuring log4j so that, even in production clusters, warning messages from S3AbortableInputStream are not logged. That is a bit dangerous, as it also hides any genuinely useful warnings from that class.
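For reference, the kind of log4j override under consideration would look something like this in a cluster's log4j.properties. It assumes the SDK class lives in the com.amazonaws.services.s3.internal package; verify against the SDK version actually on the classpath.

```
# Suppress the repeated "Not all bytes were read" warning from the SDK.
# Assumes the class is com.amazonaws.services.s3.internal.S3AbortableInputStream;
# raising the level to ERROR also hides genuine warnings from this class,
# which is why this is "a bit dangerous".
log4j.logger.com.amazonaws.services.s3.internal.S3AbortableInputStream=ERROR
```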

Here are some ways in which the logging could be improved without having to be so drastic.

  1. Look at the amount of remaining data before warning of suboptimal performance. If it's more than some threshold (64K? 128K?), don't complain; instead, consider that the caller has made the optimal choice.
  2. Remember that the warning has already been issued once, and don't bother repeating it on every single seek() call. (A sketch of both ideas follows below.)
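
A rough sketch of how those two checks might look in the SDK's close() path is below. The class and field names are hypothetical and the real S3AbortableInputStream is structured differently; this just shows the shape of the logic being proposed.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

/**
 * Rough sketch of the two suggestions above, as they might look in the SDK's
 * close() path. Hypothetical names only; the real
 * com.amazonaws.services.s3.internal.S3AbortableInputStream differs.
 */
class AbortWarningSketch {
  private static final Log LOG = LogFactory.getLog(AbortWarningSketch.class);

  /** Suggestion 2: only warn once per process, not on every seek()-driven close(). */
  private static final AtomicBoolean WARNED = new AtomicBoolean(false);

  /** Suggestion 1: below this many unread bytes, draining is assumed cheaper than aborting. */
  private static final long DRAIN_THRESHOLD = 64 * 1024;

  void onClose(long bytesRemaining) {
    if (bytesRemaining <= 0) {
      return;   // fully read: nothing to warn about
    }
    if (bytesRemaining > DRAIN_THRESHOLD) {
      return;   // aborting is the optimal choice here, so stay quiet
    }
    if (WARNED.compareAndSet(false, true)) {
      // log the advice once, not on every close()
      LOG.warn("Not all bytes were read from the S3ObjectInputStream ("
          + bytesRemaining + " remaining); aborting the HTTP connection. "
          + "Request only the bytes you need via a ranged GET or drain the "
          + "stream after use.");
    }
  }
}
```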

Thanks.
