-
Couldn't load subscription status.
- Fork 9.1k
HADOOP-17475. ABFS: Implementing ListStatusRemoteIterator #2548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
we should talk about this in 2021. For now
I think it makes sense to have an over all "optimise abfs incremental listings" JIRA and create issues underneath, as a lot is unified. |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
|
||
| private final Path path; | ||
| private final AzureBlobFileSystemStore abfsStore; | ||
| private final Queue<ListIterator<FileStatus>> iteratorsQueue = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ArrayBlockingQueue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| public ListStatusRemoteIterator(final Path path, | ||
| final AzureBlobFileSystemStore abfsStore) throws IOException { | ||
| this.path = path; | ||
| this.abfsStore = abfsStore; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ioExcetion and currentIterator to null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
...oop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/ListStatusRemoteIterator.java
Outdated
Show resolved
Hide resolved
...oop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/ListStatusRemoteIterator.java
Outdated
Show resolved
Hide resolved
...oop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/ListStatusRemoteIterator.java
Outdated
Show resolved
Hide resolved
...tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/AzureBlobFileSystemStore.java
Outdated
Show resolved
Hide resolved
|
I need you to use These iterators propagate the IOStatisticsSource interface, so when the innermost iterator collects cost/count of list calls, the stats will be visible to and collectable by callers. |
...oop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/ListStatusRemoteIterator.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/AzureBlobFileSystem.java
Show resolved
Hide resolved
...oop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/ListStatusRemoteIterator.java
Show resolved
Hide resolved
.../src/test/java/org/apache/hadoop/fs/azurebfs/ITestAzureBlobFileSystemListStatusIterator.java
Outdated
Show resolved
Hide resolved
.../src/test/java/org/apache/hadoop/fs/azurebfs/ITestAzureBlobFileSystemListStatusIterator.java
Outdated
Show resolved
Hide resolved
.../src/test/java/org/apache/hadoop/fs/azurebfs/ITestAzureBlobFileSystemListStatusIterator.java
Outdated
Show resolved
Hide resolved
.../src/test/java/org/apache/hadoop/fs/azurebfs/ITestAzureBlobFileSystemListStatusIterator.java
Outdated
Show resolved
Hide resolved
.../src/test/java/org/apache/hadoop/fs/azurebfs/ITestAzureBlobFileSystemListStatusIterator.java
Outdated
Show resolved
Hide resolved
| continuation = getIsNamespaceEnabled() | ||
| ? generateContinuationTokenForXns(startFrom) | ||
| : generateContinuationTokenForNonXns(relativePath, startFrom); | ||
| if (continuation == null || continuation.length() < 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
continuation.isEmpty()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| updateCurrentIterator(); | ||
| } | ||
| } catch (InterruptedException e) { | ||
| Thread.currentThread().interrupt(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are here because the thread already got interrupted ? why call interrupt ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.../src/test/java/org/apache/hadoop/fs/azurebfs/ITestAzureBlobFileSystemListStatusIterator.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check the comments
I would recommend something like "ABFS listing to support asynchronous prefetch and optimise for incremental listing of large directories and deep/wide directory trees" That is: if that hurts performance of listing empty directories, or calling the listX calls against files, that is acceptable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fear you are the same mistake we did in the S3A codebase: giving all the helper classes a reference back to the ABFS Store class, so making them too intermingled. It is unsustainable. See: https://github.com/steveloughran/engineering-proposals/blob/trunk/refactoring-s3a.md
Proposed
- Use a specific listing callback which the store can directly/indirectly implement: org.apache.hadoop.fs.s3a.impl.ListingOperationCallbacks.
- This will also help you write unit tests against a stub implementation without having to use reflection and manipulating access modifiers in a way that is over-complex and brittle.
- Use IOStatistics API to track duration of calls
- and serve this up. It is really interesting to know how long lists take
- add a close() operator to cancel queue & wait for completion.
the other incremental listX calls (listFileStatus, listLocatedStatus) are all similar and should be done in the same uber-JIRA. listLocatedStatus is used in LocatedFileStatusFetcher, which is used during MR and spark file scanning. listFileStatus should be used more for deep tree scans.
I'd also prefer if you use a shared thread pool of that ABFS store instance
- stops many, many list iterators overloading things
- may offer faster startup
- custom thread names help identify origin of less-helpful stack traces
- could have a gauge on the instance to measure pool load
- filesystem.close() can interrupt the pool to shut things down.
Have a look @ org.apache.hadoop.fs.s3a.Listing in trunk to see what's been done there. There's a lot you don't need (S3Guard reconciliation), and it currently only prefetches the next page of results. But it does include the stats collection, a restricted callback to the FS, use of org.apache.hadoop.util.functional.RemoteIterators for functional-programming style use of RemoteIterator classes. Take a look at it's {{org.apache.hadoop.util.functional.RemoteIterators#foreach}} method too.
I know, I'm giving extra homework. But I'll only keep asking for these, so better to start now.
...oop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/ListStatusRemoteIterator.java
Outdated
Show resolved
Hide resolved
...oop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/ListStatusRemoteIterator.java
Outdated
Show resolved
Hide resolved
| } | ||
|
|
||
| private void fetchBatchesAsync() { | ||
| CompletableFuture.runAsync(() -> asyncOp()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should only be scheduled if there isn't one already in progress
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
...oop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/ListStatusRemoteIterator.java
Outdated
Show resolved
Hide resolved
.../src/test/java/org/apache/hadoop/fs/azurebfs/ITestAzureBlobFileSystemListStatusIterator.java
Outdated
Show resolved
Hide resolved
.../src/test/java/org/apache/hadoop/fs/azurebfs/ITestAzureBlobFileSystemListStatusIterator.java
Outdated
Show resolved
Hide resolved
.../src/test/java/org/apache/hadoop/fs/azurebfs/ITestAzureBlobFileSystemListStatusIterator.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/AzureBlobFileSystem.java
Outdated
Show resolved
Hide resolved
I will raise the JIRA as suggested. For now keeping the PR as a draft. |
|
💔 -1 overall
This message was automatically generated. |
...azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsListStatusRemoteIterator.java
Outdated
Show resolved
Hide resolved
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
Hi @steveloughran |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very close to getting in. other than some minor checkstyle/javadoc complaints, that findbugs needs to be made to stop complaining. Usually it is indicating a real risk of a problem, so I'd like to see what can be done about it -rather than just edit the XML file to turn off the check
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/AzureBlobFileSystem.java
Show resolved
Hide resolved
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/AzureBlobFileSystem.java
Show resolved
Hide resolved
| continuation = listingSupport | ||
| .listStatus(fileStatus.getPath(), null, fileStatuses, FETCH_ALL_FALSE, | ||
| continuation); | ||
| if(!fileStatuses.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: space after if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| * "/folder/hfile" and "/folder/ifile". | ||
| * @return the entries in the path start from "startFrom" in lexical order. | ||
| */ | ||
| @InterfaceStability.Unstable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add that at the actual interface, along with the @Private . Not that we'd expect anyone to use it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
...oop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/ITestAbfsListStatusRemoteIterator.java
Outdated
Show resolved
Hide resolved
| for (Future<Void> task : tasks) { | ||
| task.get(); | ||
| } | ||
| es.shutdownNow(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be in a finally block
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
...azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsListStatusRemoteIterator.java
Show resolved
Hide resolved
There is one one findbugs warning remaining which is of medium priority. The same can be ignored since at line 147 the continuation token returned by the ListingOperation ouside the synchroniced lock since the same involve an http call. |
|
💔 -1 overall
This message was automatically generated. |
well, it needs to be dealt with either by fixing or findbugs.xml. Define "medium priority" here |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
Medium priority was a terminology used within findbugs report. |
|
that's fine: key is "findbugs must be happy, sometimes it overreacts -but you need to justify its complaints. Sometimes it's a losing battler with the sync stuff; I had that with the MeanStatistic stuff in IOStatistics and just made everything sync even when it was overkill |
|
+1, Merging |
|
merged to trunk; cp'd to branch-3.3 and will push up if a recompile works. I am not testing that ---can you do that and if there are problems we can do a followup? |
The ABFS connector now implements listStatusIterator() with asynchronous prefetching of the next page(s) of results. For listing large directories this can provide tangible speedups. If for any reason this needs to be disabled, set fs.azure.enable.abfslistiterator to false. Contributed by Bilahari T H. Change-Id: Ic9a52b80df1d0ffed4c81beae92c136e2a12698c
Sure |
…tor (apache#2548) The ABFS connector now implements listStatusIterator() with asynchronous prefetching of the next page(s) of results. For listing large directories this can provide tangible speedups. If for any reason this needs to be disabled, set fs.azure.enable.abfslistiterator to false. Contributed by Bilahari T H. Change-Id: Ic9a52b80df1d0ffed4c81beae92c136e2a12698c
This is a draft PR not ready fo review.