
Conversation

@bhalchandrap

Description of PR

I work for Pinterest. I developed a technique for vastly improving read throughput when reading from the S3 file system. It not only helps the sequential read case (like reading a SequenceFile) but also significantly improves read throughput of a random access case (like reading Parquet). This technique has been very useful in significantly improving efficiency of the data processing jobs at Pinterest.
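The technique splits a file into fixed-size blocks and asynchronously fetches the blocks ahead of the current read position, so sequential readers rarely wait on S3. A minimal sketch of the underlying block arithmetic (class and method names here are illustrative, not the PR's actual code):

```java
// A sketch of the block arithmetic behind prefetching, assuming a fixed
// block size; names are illustrative, not the PR's actual classes.
public final class PrefetchMath {

  private PrefetchMath() {
  }

  /** Index of the block containing the byte at {@code offset}. */
  public static long blockIndex(long offset, long blockSize) {
    return offset / blockSize;
  }

  /** Offset of the first byte of block {@code index}. */
  public static long blockStart(long index, long blockSize) {
    return index * blockSize;
  }

  /**
   * Indices of the blocks to queue for asynchronous prefetch after a read
   * touches block {@code current}, clipped to the end of the file.
   */
  public static long[] prefetchWindow(long current, int count, long numBlocks) {
    int n = (int) Math.min(count, Math.max(0, numBlocks - current - 1));
    long[] next = new long[n];
    for (int i = 0; i < n; i++) {
      next[i] = current + 1 + i;
    }
    return next;
  }
}
```

A sequential reader that has just consumed block i would then find the next few blocks already in memory by the time it needs them.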

I would like to contribute that feature to Apache Hadoop. More details on this technique are available in this blog I wrote recently:
https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0

How was this patch tested?

Tested against S3 in us-east-1 region.

There are a small number of test failures (most within s3guard). I can attach test output if it helps.

@apache apache deleted a comment from hadoop-yetus Dec 3, 2021
@steveloughran steveloughran self-assigned this Dec 3, 2021
@steveloughran steveloughran left a comment


OK, I've done the first review.

  1. this review was done using dictation tools and single-handed typing; if a sentence doesn't make sense, that is why.
  2. I have looked more at integration than at the actual functionality.

Now that the code is being moved into the Hadoop codebase, it has to move to the existing classes in production and test code. This is to keep maintenance down in future.

It also needs to maintain the same statistics, auditing, etc. as the current stream, including CanUnbuffer, StreamCapabilities, IOStatisticsSource, change tracking. Sorry.

key changes will be

  • use an Invoker around all calls to S3.
  • use a callback to indirectly interact with the AWS S3 client, similar to S3AInputStream.InputStreamCallbacks. This is to support testing, maintenance and to allow us to annotate every request with audit headers.
  • collect detailed statistics on what is happening and provide these through the IOStatisticsSource API in the stream, updating the file system in close()
  • integrate with the openFile() API. That's not just a "would be nice" feature, it is a blocker: LineRecordReader uses that API to open files. If it is not wired up, you would not actually get prefetch on jobs which use that input source.

#2584 adds some standard options to openFile(); you can also add custom ones to make the stream parameters tunable on an invocation-by-invocation basis. That PR includes read strategies such as sequential and whole-file; these could be used to select specific cache options (none) as well as block sizes for the next reads.

I also think everything in org.apache.hadoop.fs.common should be moved into the hadoop-common module, in the package org.apache.hadoop.fs.store.
But I also worry about the imported Twitter classes: does this add a new dependency?

checkArgument(Files.isRegularFile(path), "Path %s (%s) must point to a file.", argName, path);
}

public static void checkArgument(boolean expression, String format, Object... args) {
Contributor:

We have our own version of guava preconditions in org.apache.hadoop.util; please use.

Author:

why use guava (like) functionality when I am already using Apache Validate?
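For reference, a minimal standalone version of the helper being discussed might look like the following. This is a sketch, not the PR's actual code; it uses String.format for the message, while Hadoop's own shaded version lives in org.apache.hadoop.util.Preconditions:

```java
// Sketch of a Guava-style checkArgument helper, assuming String.format
// message expansion; illustrative only, not the PR's actual code.
public final class Validate {

  private Validate() {
  }

  /** Throw IllegalArgumentException with a formatted message if the check fails. */
  public static void checkArgument(boolean expression, String format, Object... args) {
    if (!expression) {
      throw new IllegalArgumentException(String.format(format, args));
    }
  }
}
```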


protected CopyFromLocalOperation.CopyFromLocalOperationCallbacks
createCopyFromLocalCallbacks() throws IOException {
createCopyFromLocalCallbacks() throws IOException {
Contributor:

omit these

Author:

sorry, I did not understand the suggestion. can you please clarify?

if (this.prefetchEnabled) {
return this.createPrefetchingInputStream(f, bufferSize);
} else {
return open(f, Optional.empty(), Optional.empty());
Contributor:

This must go in the open/3 method below, so that openFile() will also use it.
You will also get the benefit of the FileStatus retrieved during the existence checks, with length and etag; openFile() allows callers to pass it in so we can skip the HEAD request.

/**
* Provides an {@link InputStream} that allows reading from an S3 file.
*/
public abstract class S3InputStream extends InputStream {
Contributor:

everything which goes wrong here MUST raise an IOException.
The Invoker coming in on the S3AReadOpContext (which should be passed down in the constructor) will do this.
When you use it, declare the retry policy through the @Retries annotation; see S3AInputStream for the details.

String key,
long contentLength,
AmazonS3 client,
FileSystem.Statistics stats) {
Contributor:

Expect to be constructed with a reference to an org.apache.hadoop.fs.s3a.statistics.S3AInputStreamStatistics instance; a stub one can be supplied for testing.
The production one will update the file system statistics as appropriate.

@apache apache deleted a comment from hadoop-yetus Dec 17, 2021
@mehakmeet mehakmeet left a comment


Looks really good. Did an initial review, still going over it.

this.throwIfClosed();
this.throwIfInvalidSeek(pos);

if (!this.getFilePosition().setAbsolute(pos)) {
Contributor:

Can you add some comments in this section to explain what's happening here?

Author:

done.

}
}

// @VisibleForTesting
Contributor:

remove this?

Author:

done


@yzhangal yzhangal left a comment


Hi Kumar, thanks for the good work here. I'm submitting a partial review and will be doing more.

* This implementation provides improved read throughput by asynchronously prefetching
* blocks of configurable size from the underlying S3 file.
*/
public class S3EInputStream
Contributor:

Thanks for the good work here Kumar.

Suggest changing S3EInputStream to, say, S3AInputStreamWithPrefetch. It also logs S3A statistics and is implemented in S3AFileSystem, so it is better to have the S3A signature in its name rather than S3E. It would also be good to stick to the categorization described in https://web.archive.org/web/20170718025436/https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/

Contributor:

It would be nice to evaluate and describe this change's impact on the number of S3 accesses, so users can be aware of it. See https://issues.apache.org/jira/browse/HADOOP-14965?focusedCommentId=17489161&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17489161 for reference.

Author:

Agreed on the first point; I will rename it to a suitable name that avoids the E.

As for the second, the impact is negligible. Here is a back-of-the-envelope calculation:
for a 1 TB dataset and the default 8 MB block size, we will make 125,000 accesses. That translates to about a $0.05 increase in job cost (assuming AWS list prices: $0.0004 per 1,000 requests). The small increase is more than offset by the orders-of-magnitude reduction in runtime.
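The back-of-the-envelope figures can be checked mechanically (decimal units, matching the numbers quoted; the price is the AWS list price assumed in the comment):

```java
// Verifies the request-count and cost arithmetic above: a 1 TB dataset
// read in 8 MB blocks, priced at $0.0004 per 1,000 GET requests.
public final class CostEstimate {

  private CostEstimate() {
  }

  /** Number of block-sized GET requests needed to read the whole dataset. */
  public static long requests(long datasetBytes, long blockBytes) {
    return datasetBytes / blockBytes;
  }

  /** Total request cost in dollars at a given price per 1,000 requests. */
  public static double costUsd(long requests, double pricePer1000) {
    return requests / 1000.0 * pricePer1000;
  }
}
```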

Contributor:

I agree that saving time is generally the best thing to prioritise; if you are running in EC2, your CPU rental is a bigger cost. It does increase the risk of overloading the IO capacity of a shard in a bucket, but we have that with the current random IO implementation anyway.

private int prefetchBlockSize;

// Size of prefetch queue (in number of blocks).
private int prefetchBlockCount;
Contributor:

Suggest changing prefetchBlockCount to prefetchThreadCount.

Author:

I will keep the name as-is, because the number of prefetch threads is independent of the number of prefetch blocks.

// Call read and then unbuffer
FSDataInputStream stream = fs.open(path);
assertEquals(0, stream.read(new byte[8])); // mocks read 0 bytes
assertEquals(-1, stream.read(new byte[8])); // mocks read 0 bytes
Contributor:

Please add comment to explain why return value is -1 instead of 0 here

Author:

That is the correct behavior; see lines 62 and 63: a read of the underlying object stream returns -1, which should be reflected in the input stream's read.
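The behaviour being asserted is just the standard java.io.InputStream contract: read(byte[]) returns the number of bytes read, and -1 (never 0 for a non-empty buffer) once the stream is exhausted. A small self-contained illustration:

```java
import java.io.ByteArrayInputStream;

// Demonstrates the InputStream end-of-stream contract: the first read
// drains the data, the second returns -1.
public final class EofDemo {

  private EofDemo() {
  }

  /** Returns the results of two successive reads into a buffer of bufSize. */
  public static int[] readTwice(byte[] data, int bufSize) {
    ByteArrayInputStream in = new ByteArrayInputStream(data);
    byte[] buf = new byte[bufSize];
    int first = in.read(buf, 0, bufSize);   // bytes actually read
    int second = in.read(buf, 0, bufSize);  // -1: end of stream
    return new int[] {first, second};
  }
}
```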

Author:

done

}
}

private String getIntList(Iterable<Integer> nums) {
Contributor:

Suggest adding javadoc to describe what the method does.

private String getIntList(Iterable<Integer> nums) {
List<String> numList = new ArrayList<>();
List<Integer> numbers = new ArrayList<Integer>();
for (Integer n : nums) {
Contributor:

nums.forEach(numbers::add) can be used to replace the loop.

Author:

Yes, it could, though I will keep it as-is for consistency with the rest of the code (which does not use streams).

Collections.sort(numbers);

int index = 0;
while (index < numbers.size()) {
Contributor:

The logic in this loop is not straightforward; suggest adding a comment on what it does.

Author:

should be clear based on the function header I added.
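For readers of the thread: the loop being discussed summarises a sorted list of block numbers as compact ranges. A hypothetical standalone equivalent (the output format shown, e.g. [1~3, 5], is illustrative; the PR's actual formatting may differ):

```java
import java.util.List;

// Collapses a sorted list of integers into a compact range string,
// e.g. [1, 2, 3, 5] becomes "[1~3, 5]". Illustrative sketch only;
// the PR's helper may format ranges differently.
public final class RangeFormatter {

  private RangeFormatter() {
  }

  public static String toRangeString(List<Integer> sorted) {
    StringBuilder sb = new StringBuilder("[");
    int i = 0;
    while (i < sorted.size()) {
      int start = sorted.get(i);
      int end = start;
      // extend the current run while the next value is consecutive
      while (i + 1 < sorted.size() && sorted.get(i + 1) == end + 1) {
        end = sorted.get(++i);
      }
      if (sb.length() > 1) {
        sb.append(", ");
      }
      sb.append(start == end ? String.valueOf(start) : start + "~" + end);
      i++;
    }
    return sb.append("]").toString();
  }
}
```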

/**
* Holds information about blocks of data in a file.
*/
public class BlockData {
Contributor:

Suggest adding a more detailed description of the values. For example NOT_READY: what is the meaning of ready? What is the meaning of queued? What is the difference between ready and cached?

Author:

added more details
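To make the lifecycle question concrete, the states discussed might be sketched as a simple enum. The names follow the review comments above; the PR's actual enum may differ:

```java
// Hypothetical sketch of a prefetch block's lifecycle, based on the state
// names mentioned in the review; not the PR's actual code.
public enum BlockState {
  NOT_READY,  // block exists in the file but nothing has been requested yet
  QUEUED,     // an asynchronous prefetch task has been submitted for it
  READY,      // the block's bytes are in memory and reads can be served
  CACHED      // the block has been persisted to the local disk cache
}
```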

public class BlockOperations {
private static final Logger LOG = LoggerFactory.getLogger(BlockOperations.class);

public enum Kind {
Contributor:

Please add a clear description of what each of these operations does. Though some seem intuitive, others are not.

And it would be good to describe the lifecycle of a block data somewhere.

Thanks

Author:

This class is for debugging/logging purposes only. It is not a part of the public interface. I added explanation in the class header.

@steveloughran

I'm not deliberately ignoring you, just so overloaded with my own work that I've not been able to review anything.

we've put mukund's vectored IO patch into a feature branch with the idea being we will still do the normal review process before every patch, but we can get that short chain of patches lined up before the big merge into trunk. we can also play rebase and interactive rebase too.

would that work for you too? so patch 1 is "take what there is", patch 2+ being the changes we want in before shipping

the risk is always it gets forgotten, so we still need to push hard to get it into a state where it can be used as an option, the goal being "no side effects if not used", including nothing extra on the classpath.

meanwhile, have a look at this #2584

its a big patch, but a key feature is you can declare your read policy, with whole-file being an option as well as random, sequential and vectored.

distcp and the CLI tools all declare their read plans this way

i'd like this in before both the vectored io and your stream, so you can use it to help decide whether to cache etc, and to support custom options as well as parse the newly defined standard ones

@bhalchandrap

bhalchandrap commented Feb 18, 2022 via email

@apache apache deleted a comment from hadoop-yetus Mar 25, 2022
@steveloughran

I've created a feature branch for this and a PR which takes this change (with test file renames & some other small changes) as an initial PR to apply to it: #4109

can everyone who has outstanding issues/comments look at that patch and create some subtasks on the JIRA itself...we can fix the issues (test failures etc) in individual subtasks.

note: this feature branch can be rebased as we go along, though I'd like to get it in asap. I do want to get #2584 in first, which needs reviews from other people. That change should allow for the new input stream to decide policy from the openFile() parameters

@steveloughran

patch is merged in to the feature branch; closing this to avoid confusion.

thank you for a wonderful contribution!
