Conversation

@snvijaya
Contributor

Currently, errors in read-ahead are silently ignored, which hides real issues and slows down the overall read request.

Any new read request in turn triggers a number of read-aheads, and all of them will silently fail as well.

This PR reports back the error from the read-ahead issued by the active read call. It also makes subsequent reads retry only the respective read position, based on the failure seen for the previous read-ahead at the same position.
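As a rough, hypothetical sketch of the behaviour described above (class and member names here are illustrative, not the actual ABFS code): the background read-ahead records the failure it hit, and a foreground read covering the same range rethrows it instead of silently falling back.

```java
import java.io.IOException;

// Illustrative sketch only; not the ABFS implementation.
class ReadAheadBufferSketch {
  long offset;            // file offset this read-ahead covers
  int length;             // number of bytes requested
  IOException failure;    // recorded by the background read-ahead on error
  byte[] data;            // filled on success

  /**
   * Foreground read at {@code position}: if a read-ahead covering the position
   * failed, surface that error so the caller retries only this position,
   * instead of the failure being swallowed.
   */
  int readFromBuffer(long position, byte[] dest) throws IOException {
    if (position >= offset && position < offset + length) {
      if (failure != null) {
        throw failure;    // previously this was silently ignored
      }
      int toCopy = (int) Math.min(dest.length, offset + length - position);
      System.arraycopy(data, (int) (position - offset), dest, 0, toCopy);
      return toCopy;
    }
    return -1;            // no buffered data for this position; read remotely
  }
}
```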

Contributor

@steveloughran steveloughran left a comment

Needs a Hadoop JIRA and a link back. PRs without a matching JIRA do not exist and SHALL NOT be committed.

@snvijaya snvijaya changed the title Report read-ahead error back HADOOP-16852: Report read-ahead error back Mar 19, 2020
@snvijaya
Contributor Author

Test results:
HNS enabled account:
[INFO] Tests run: 58, Failures: 0, Errors: 0, Skipped: 0
[WARNING] Tests run: 412, Failures: 0, Errors: 0, Skipped: 66
[WARNING] Tests run: 206, Failures: 0, Errors: 0, Skipped: 140

HNS not enabled account:
[INFO] Tests run: 58, Failures: 0, Errors: 0, Skipped: 0
[WARNING] Tests run: 412, Failures: 0, Errors: 0, Skipped: 240
[WARNING] Tests run: 206, Failures: 0, Errors: 0, Skipped: 140

@snvijaya snvijaya requested review from goiri and steveloughran March 19, 2020 10:59
@snvijaya
Contributor Author

Made a fix so that the read-ahead thread will never read from remote a length greater than its buffer size.
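As a minimal sketch of that constraint (names are illustrative, not the actual ABFS fields): the length requested from the remote store is capped at the capacity of the read-ahead thread's own buffer.

```java
// Sketch only: a read-ahead never requests more bytes than its buffer can hold.
final class ReadLengthSketch {
  static int remoteReadLength(int requestedLength, byte[] readAheadBuffer) {
    return Math.min(requestedLength, readAheadBuffer.length);
  }
}
```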
HNS enabled account:
[INFO] Tests run: 58, Failures: 0, Errors: 0, Skipped: 0
[WARNING] Tests run: 412, Failures: 0, Errors: 0, Skipped: 66
[WARNING] Tests run: 206, Failures: 0, Errors: 0, Skipped: 140

HNS not enabled account:
[INFO] Tests run: 58, Failures: 0, Errors: 0, Skipped: 0
[WARNING] Tests run: 412, Failures: 0, Errors: 0, Skipped: 240
[WARNING] Tests run: 206, Failures: 0, Errors: 0, Skipped: 140

@snvijaya
Contributor Author

@DadanielZ - Thanks for the review. I have left the comment on the buffer status versus the timestamp check unresolved. As mentioned in my comments, the intention is to throw the exception from the read-ahead buffer for any read that falls within the buffer's offset and length range. Please let me know if you have any concerns.

Contributor

@DadanielZ DadanielZ left a comment

LGTM, +1.

@snvijaya
Contributor Author

snvijaya commented Apr 1, 2020

Needs a Hadoop JIRA and a link back. PRs without a matching JIRA do not exist and SHALL NOT be committed.

Have made the necessary updates.

@snvijaya
Contributor Author

snvijaya commented Apr 1, 2020

@steveloughran - Can you please help review this PR?

@steveloughran
Contributor

@DadanielZ is happy with the core patch, so I am too. Just the checkstyle to fix.

Contributor

@steveloughran steveloughran left a comment

Looked at it; the tests look good. There is a little bit of logic in the production code that I'm querying.

Is there any way to avoid a 30s delay on every test run for the timeouts? That will slow down the tests, and every change like this makes the tests slower and slower, costing us engineers time and our employers money, and reducing the likelihood that people run the tests at all.

Side issue: do those read buffer manager threads ever get released? And what happens in large JVM processes where you have many ABFS FS instances, e.g. Hive LLAP or Spark? Does this become a bottleneck, since the buffer size and count are hard-coded irrespective of the number of FS instances?

What I'm wondering here is whether the buffer manager should actually be something which belongs to a specific FS instance, uses its thread pool, and is released when the FS instance is destroyed.
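Roughly what that alternative could look like, as a purely hypothetical sketch of the question above (none of these names exist in ABFS): the manager owns its own executor and is released with the filesystem instance, instead of a single static pool with hard-coded buffer size and count.

```java
import java.io.Closeable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical per-filesystem read buffer manager; not the current ABFS design.
class PerInstanceReadBufferManager implements Closeable {
  private final ExecutorService readAheadPool;

  PerInstanceReadBufferManager(int threads) {
    this.readAheadPool = Executors.newFixedThreadPool(threads);
  }

  void queueReadAhead(Runnable readAheadTask) {
    readAheadPool.submit(readAheadTask);
  }

  @Override
  public void close() {
    // Released when the owning FileSystem instance is closed, so large JVMs
    // (e.g. Hive LLAP, Spark) with many instances do not pile up on one
    // global, fixed-size pool.
    readAheadPool.shutdownNow();
  }
}
```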

Contributor

@bilaharith bilaharith left a comment

nit

@steveloughran
Contributor

Things aren't building because the change made to the ABFS constructor is breaking it. Sorry; that refactoring was done to try to reduce change conflicts in future.

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-site-plugin:3.6:site (default-site) on project hadoop-azure: failed to get report for org.apache.maven.plugins:maven-dependency-plugin: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:testCompile (default-testCompile) on project hadoop-azure: Compilation failure
[ERROR] /home/jenkins/jenkins-slave/workspace/hadoop-multibranch_PR-1898/src/hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/services/TestAbfsInputStream.java:[72,34] error: constructor AbfsInputStream in class AbfsInputStream cannot be applied to given types;

@snvijaya
Contributor Author

Things aren't building because the change made to the ABFS constructor is breaking it. Sorry; that refactoring was done to try to reduce change conflicts in future.

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-site-plugin:3.6:site (default-site) on project hadoop-azure: failed to get report for org.apache.maven.plugins:maven-dependency-plugin: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:testCompile (default-testCompile) on project hadoop-azure: Compilation failure
[ERROR] /home/jenkins/jenkins-slave/workspace/hadoop-multibranch_PR-1898/src/hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/services/TestAbfsInputStream.java:[72,34] error: constructor AbfsInputStream in class AbfsInputStream cannot be applied to given types;

Have merged and made the test updates that were needed after the recent SAS updates.

@snvijaya
Contributor Author

Looked at it; the tests look good. There is a little bit of logic in the production code that I'm querying.

Is there any way to avoid a 30s delay on every test run for the timeouts? That will slow down the tests, and every change like this makes the tests slower and slower, costing us engineers time and our employers money, and reducing the likelihood that people run the tests at all.

Side issue: do those read buffer manager threads ever get released? And what happens in large JVM processes where you have many ABFS FS instances, e.g. Hive LLAP or Spark? Does this become a bottleneck, since the buffer size and count are hard-coded irrespective of the number of FS instances?

What I'm wondering here is whether the buffer manager should actually be something which belongs to a specific FS instance, uses its thread pool, and is released when the FS instance is destroyed.

The timeout sleep duration in the tests has been reduced to 3 seconds. For the other issues on buffer management in ReadBufferManager, I will investigate separately and create JIRAs for the improvement points.

@snvijaya
Contributor Author

Tests rerun:

HNS

[INFO] Tests run: 69, Failures: 0, Errors: 0, Skipped: 0
[WARNING] Tests run: 432, Failures: 0, Errors: 0, Skipped: 74
[WARNING] Tests run: 206, Failures: 0, Errors: 0, Skipped: 140

non-HNS
[INFO] Tests run: 69, Failures: 0, Errors: 0, Skipped: 0
[WARNING] Tests run: 432, Failures: 0, Errors: 0, Skipped: 248
[WARNING] Tests run: 206, Failures: 0, Errors: 0, Skipped: 140

@snvijaya
Contributor Author

@steveloughran - Could you please help complete the review and commit?

@steveloughran
Contributor

Are there plans to backport?
If you can cherry-pick onto branch-3.3 and do the test run, let me know and I will do the merge.

Comment on lines +261 to +265
// As failed ReadBuffers (bufferIndx = -1) are saved in completedReadList,
// avoid adding it to freeList.
if (buf.getBufferindex() != -1) {
freeList.push(buf.getBufferindex());
}
Contributor

Hi @snvijaya,
I am unable to understand the significance of this change. I couldn't find anywhere in the code where bufferIndex is set to -1 in case of a read failure, apart from the default value in the class. But when the buffers are initialised, they are always set to a value from 0 to 15.
I am trying to understand this for #3285, so please review that as well. Thanks.

Contributor Author

It's set to -1 when a read fails. You will find the diff for this in ReadBuffer.java line 110.
There is an issue with this commit though, for which a hotfix was made, in case it's relevant to your change: https://issues.apache.org/jira/browse/HADOOP-17301
Will check on your PR by EOW.
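For context, a self-contained sketch of the bookkeeping being discussed (simplified classes; only the eviction check mirrors the diff above, everything else is an assumption): on a failed read the buffer stays in completedReadList with its data slot detached (index -1), so eviction must not push -1 back onto the free list.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Stack;

// Simplified sketch; not the actual ReadBufferManager/ReadBuffer classes.
class ReadBufferManagerSketch {
  static class Buffer {
    int bufferindex = -1;        // -1 also marks "no data slot", e.g. after a failed read
    IOException errException;    // kept so the foreground read can rethrow it
  }

  private final Stack<Integer> freeList = new Stack<>();
  private final List<Buffer> completedReadList = new ArrayList<>();

  /** On read failure: keep the buffer in completedReadList, but detach its data slot. */
  void markReadFailed(Buffer buf, IOException e) {
    buf.errException = e;
    buf.bufferindex = -1;        // as done in ReadBuffer.java per the discussion above
    completedReadList.add(buf);
  }

  /** On eviction: only real buffer slots go back to the free list (mirrors the diff above). */
  void evict(Buffer buf) {
    if (buf.bufferindex != -1) {
      freeList.push(buf.bufferindex);
    }
    completedReadList.remove(buf);
  }
}
```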

Contributor

Thanks @snvijaya

@apache apache deleted a comment from hadoop-yetus Aug 19, 2021