HBASE-27798: Client side should back off based on wait interval in RpcThrottlingException #5275
|
💔 -1 overall
This message was automatically generated. |
…cThrottlingException (apache#5226) Signed-off-by: Bryan Beaudreault <[email protected]>
|
🎊 +1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
I don't think the test failures had anything to do with my changes.
    if (error instanceof RpcThrottlingException) {
      RpcThrottlingException rpcThrottlingException = (RpcThrottlingException) error;
      expectedSleepNs = TimeUnit.MILLISECONDS.toNanos(rpcThrottlingException.getWaitInterval());
      if (expectedSleepNs > remainingTimeNs) {
Better to add a log here noting that we will give up retrying, since the remaining time is not enough for the next retry because the server is throttling us.
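A minimal sketch of what the suggested log could look like. The class and method names are hypothetical, `java.util.logging` is used only to keep the example self-contained (HBase has its own logging setup), and the wait interval is assumed to be in milliseconds as in the hunk above.

```java
import java.util.OptionalLong;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

// Hypothetical stand-in for the check under review; not the HBase source.
final class ThrottleGiveUpSketch {
  private static final Logger LOG = Logger.getLogger(ThrottleGiveUpSketch.class.getName());

  /**
   * @param waitIntervalMs wait requested by the server, assumed to be in milliseconds
   * @param remainingTimeNs time left before the operation times out, in nanoseconds
   * @return the sleep in nanoseconds, or empty if the wait does not fit in the remaining time
   */
  static OptionalLong sleepNsOrGiveUp(long waitIntervalMs, long remainingTimeNs) {
    long expectedSleepNs = TimeUnit.MILLISECONDS.toNanos(waitIntervalMs);
    if (expectedSleepNs > remainingTimeNs) {
      // Say why we stop retrying: the server-requested wait exceeds the time we have left.
      LOG.warning("Giving up retrying: server is throttling us and asked for a wait of "
        + expectedSleepNs + "ns, but only " + remainingTimeNs + "ns remain");
      return OptionalLong.empty();
    }
    return OptionalLong.of(expectedSleepNs);
  }
}
```

With these assumptions, a 5 ms wait fits inside a 10 ms budget and is returned as nanoseconds, while a 50 ms wait logs the warning and returns empty.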
      this.pauseNsForServerOverloaded = pauseNsForServerOverloaded;
    }

    public OptionalLong getPauseNsFromException(Throwable error, long remainingTimeNs) {
Please add javadoc here to describe the meaning of the return value. I think returning a long directly may be a better choice, as we could return -1 if we should fail. Returning OptionalLong seems to indicate that, if we return OptionalLong.empty, the upper layer should decide the pauseNs on its own?
I'll definitely add a javadoc. Regarding a normal long and -1 to represent exceeding the timeout, I see what you mean. But I like that the empty Optional case is tightly controlled here — there's no opportunity for other code to decide to use -1 arbitrarily which bubbles up into throwing an exception erroneously early. For example we've historically returned -1 (granted, representing millis so it might "just work" here) from RpcThrottlingException#getWaitInterval in cases which cannot be parsed:
hbase/hbase-client/src/main/java/org/apache/hadoop/hbase/quotas/RpcThrottlingException.java
Lines 163 to 182 in bfdf1c0
    // Visible for TestRpcThrottlingException
    protected static long timeFromString(String timeDiff) {
      Pattern pattern =
        Pattern.compile("^(?:(\\d+)hrs?, )?(?:(\\d+)mins?, )?(?:(\\d+)sec[, ]{0,2})?(?:(\\d+)ms)?");
      long[] factors = new long[] { 60 * 60 * 1000, 60 * 1000, 1000, 1 };
      Matcher m = pattern.matcher(timeDiff);
      if (m.find()) {
        int numGroups = m.groupCount();
        long time = 0;
        for (int j = 1; j <= numGroups; j++) {
          String group = m.group(j);
          if (group == null) {
            continue;
          }
          time += Math.round(Float.parseFloat(group) * factors[j - 1]);
        }
        return time;
      }
      return -1;
    }
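To make the trade-off in this thread concrete, here is a small sketch comparing the two return styles. Both methods and the class name are hypothetical, and the -1 input simply stands in for a wait interval that leaked in as a sentinel from elsewhere.

```java
import java.util.OptionalLong;
import java.util.concurrent.TimeUnit;

// Hypothetical comparison of the two return styles discussed above; not the HBase source.
final class SentinelVsOptionalSketch {
  // Sentinel style: a negative result means "give up". A stray -1 wait interval
  // becomes a negative number of nanoseconds, which a caller checking `< 0`
  // cannot tell apart from a deliberate "give up".
  static long pauseNsWithSentinel(long waitIntervalMs, long remainingTimeNs) {
    long sleepNs = TimeUnit.MILLISECONDS.toNanos(waitIntervalMs);
    return sleepNs > remainingTimeNs ? -1 : sleepNs;
  }

  // Optional style: "give up" can only be produced by this one branch, so a stray
  // negative interval stays visible as a present (if odd) value.
  static OptionalLong pauseNsWithOptional(long waitIntervalMs, long remainingTimeNs) {
    long sleepNs = TimeUnit.MILLISECONDS.toNanos(waitIntervalMs);
    return sleepNs > remainingTimeNs ? OptionalLong.empty() : OptionalLong.of(sleepNs);
  }
}
```

Whether a negative pause "just works" downstream depends on how the caller schedules the retry, which is outside this sketch.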
...t/src/main/java/org/apache/hadoop/hbase/client/backoff/HBaseServerExceptionPauseManager.java
|
🎊 +1 overall
This message was automatically generated. |
6c63929 to 761efd0
Force pushed to fix checkstyle. @bbeaudreault we're going to want 761efd0 on branch-2 as well.
|
🎊 +1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
I don't think the failure for |
|
🎊 +1 overall
This message was automatically generated. |
      return;
    }

    boolean isServerOverloaded = HBaseServerException.isServerOverloaded(error);
Do we need to test this here?
I believe this logic is covered by existing tests in TestAsyncClientPauseForServerOverloaded
      delayNs = pauseNsToUse;
    }

    if (isServerOverloaded) {
Seems we only need isServerOverloaded here?
    if (!(error instanceof RpcThrottlingException)) {
      // RpcThrottlingException tells us exactly how long the client should wait for,
      // so we should not factor in the retry count for said exception
      pauseNsToUse = getPauseTime(pauseNsToUse, tries - 1);
I think this means the pauseNsToUse is just delayNs now?
The small distinction is that they may still differ if the requested pause is longer than the operationTimeout allows. I've also pushed that logic to the PauseManager class
    boolean isServerOverloaded = HBaseServerException.isServerOverloaded(error);
    OptionalLong maybePauseNsToUse =
      pauseManager.getPauseNsFromException(error, remainingTimeNs() - SLEEP_DELTA_NS);
Let's just pass the tries to this method so it could return the delayNs directly? The instanceof below seems strange...
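A rough sketch of that suggestion: one method takes the retry count and returns the final delay, so the caller no longer needs its own instanceof check. Everything here is hypothetical, including the exception type, the backoff multipliers (no jitter), and the choice to cap non-throttling pauses at the remaining time instead of giving up.

```java
import java.util.OptionalLong;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a pause calculation that takes the retry count; not the HBase source.
final class PauseFromExceptionSketch {
  // Illustrative backoff multipliers, assumed for this example only.
  private static final int[] BACKOFF = { 1, 2, 3, 5, 10, 20, 40, 100 };

  // Stand-in for an exception that carries a server-requested wait in milliseconds.
  static final class ThrottledException extends RuntimeException {
    final long waitIntervalMs;

    ThrottledException(long waitIntervalMs) {
      super("throttled, wait " + waitIntervalMs + "ms");
      this.waitIntervalMs = waitIntervalMs;
    }
  }

  /** Returns the delay in nanoseconds, or empty if the caller should stop retrying. */
  static OptionalLong getPauseNs(Throwable error, long basePauseNs, long remainingTimeNs,
    int tries) {
    if (remainingTimeNs <= 0) {
      return OptionalLong.empty();
    }
    if (error instanceof ThrottledException) {
      // The server said exactly how long to wait, so the retry count is ignored.
      long sleepNs = TimeUnit.MILLISECONDS.toNanos(((ThrottledException) error).waitIntervalMs);
      return sleepNs > remainingTimeNs ? OptionalLong.empty() : OptionalLong.of(sleepNs);
    }
    // Any other error: scale the base pause by the retry count (tries is 1-based).
    int idx = Math.min(Math.max(tries - 1, 0), BACKOFF.length - 1);
    long sleepNs = basePauseNs * BACKOFF[idx];
    // Assumption: cap at the remaining time rather than failing early.
    return OptionalLong.of(Math.min(sleepNs, remainingTimeNs));
  }
}
```

Under these assumptions, a throttling error with a 5 ms wait yields 5,000,000 ns regardless of `tries`, while a generic error on the fourth try yields the base pause multiplied by 5.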
|
🎊 +1 overall
This message was automatically generated. |
      delayNs = getPauseTime(pauseNsToUse, tries - 1);

    OptionalLong maybePauseNsToUse = pauseManager.getPauseNsFromException(error,
      remainingTimeNs() - SLEEP_DELTA_NS, tries, scanTimeoutNs > 0);
Passing in scanTimeoutNs > 0 along with the remainingTimeNs() seems a bit awkward/redundant.
We are already constructing our PauseManager in the caller construction, and passing in the various pauseNs. Should we also pass in the correct operationTimeNs/scanTimeNs in the constructor? Then we can move remainingTimeNs() into pause manager and simplify the arguments here, along with the awkward Boolean?
👍 good idea, I just pushed this change
      return OptionalLong.of(expectedSleepNs);
    }

    private long remainingTimeNs() {
Do you think we can delete the remainingTimeNs() methods in the various caller classes now? Or are they still necessary? We could make this public if necessary and have any callers of the other to-be-deleted methods call this instead
Ah actually, I'm realizing this gets a little bit awkward too because the scanner's startNs obviously isn't final. So we either need to support mutability of that field in the pause manager, reconstruct the pause manager on each call, or continue to leave the timeout responsibility outside of the pause manager. Do you have a preference?
Depending on the answer here we can certainly remove some of these remainingTimeNs methods
I suppose you could pass in startNs as an argument, and only finalize timeoutNs in the constructor
Regarding the redundancy of remainingTimeNs methods in the client, there's also a little bit of nuance in what these represent. Often at the client level the remainingTimeNs is just the difference between the timeout and the elapsed time, but in the context of the pause manager we also have to subtract the SLEEP_DELTA_NS (presumably to account for the lack of precision in sleeping for a number of millis). So just subbing one for the other isn't exactly 1:1
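The arithmetic behind that nuance, as a small sketch. The names are hypothetical and the 1 ms value for SLEEP_DELTA_NS is an assumption made for illustration; the thread does not state the real constant.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of the two "remaining time" notions; not the HBase source.
final class RemainingTimeSketch {
  // Assumed value for illustration only.
  static final long SLEEP_DELTA_NS = TimeUnit.MILLISECONDS.toNanos(1);

  // Client-level view: timeout minus elapsed time.
  static long remainingTimeNs(long timeoutNs, long startNs, long nowNs) {
    return timeoutNs - (nowNs - startNs);
  }

  // Pause-manager view: the same value minus a small margin for imprecise sleeps.
  static long sleepBudgetNs(long timeoutNs, long startNs, long nowNs) {
    return remainingTimeNs(timeoutNs, startNs, nowNs) - SLEEP_DELTA_NS;
  }
}
```

With a 10 ms timeout and 4 ms elapsed, the first method gives 6 ms and the second gives 5 ms, which is why swapping one for the other is not a 1:1 change.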
Correct me if I'm wrong, but in the context of AsyncScanSingleRegionRpcRetryingCaller, remainingTimeNs() is only called once prior to this patch. The call moved into our new PauseManager code, so I assume that method is now unused.
In the context of AsyncBatchRpcRetryingCaller, it is called 3 times... so not unused at this point, but maybe could still be unified. I feel like we could do this:
- Remove the startNs arg from the PauseManager constructor, since as you said it's not final for scans.
- Expose PauseManager.remainingTimeNs() as public or package protected, and add a long startNs argument.
- Also add long startNs as an argument in getPauseNsFromException(), since it seems we need that for scans.
- In getPauseNsFromException(), use remainingTimeNs(startNs) - SLEEP_DELTA_NS (so the SLEEP_DELTA_NS gets removed from the PauseManager.remainingTimeNs(long startNs) method).
- Replace all the calls to existing remainingTimeNs() methods in AsyncBatchCaller with pauseManager.remainingTimeNs(startNs).
Thoughts?
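The shape proposed in the list above could look roughly like this. It is only a sketch under stated assumptions: the class name, the injected clock (added for testability), the 1 ms SLEEP_DELTA_NS, and the simplified signature that takes an already computed pause are all hypothetical.

```java
import java.util.OptionalLong;
import java.util.function.LongSupplier;

// Hypothetical sketch of the proposed API shape; not the HBase source.
final class PauseManagerShapeSketch {
  // Assumed 1 ms margin, for illustration only.
  static final long SLEEP_DELTA_NS = 1_000_000L;

  // Only the timeout is fixed at construction time.
  private final long timeoutNs;
  private final LongSupplier nanoClock;

  PauseManagerShapeSketch(long timeoutNs, LongSupplier nanoClock) {
    this.timeoutNs = timeoutNs;
    this.nanoClock = nanoClock;
  }

  // startNs is an argument because a scan may reset its start time between calls.
  long remainingTimeNs(long startNs) {
    return timeoutNs - (nanoClock.getAsLong() - startNs);
  }

  // The sleep margin is subtracted here, not inside remainingTimeNs, so callers
  // can reuse remainingTimeNs(startNs) for their plain timeout checks.
  OptionalLong getPauseNs(long requestedPauseNs, long startNs) {
    long budgetNs = remainingTimeNs(startNs) - SLEEP_DELTA_NS;
    return requestedPauseNs > budgetNs ? OptionalLong.empty()
      : OptionalLong.of(requestedPauseNs);
  }
}
```

A caller would construct this once with `System::nanoTime` as the clock and pass its current `startNs` on each retry.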
totally onboard, just pushed 902f8db
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncBatchRpcRetryingCaller.java
...ent/src/main/java/org/apache/hadoop/hbase/client/AsyncScanSingleRegionRpcRetryingCaller.java
+1. This is exactly what I expect. Thanks for taking care of this. @bbeaudreault Do you have any concerns?
All good on my end! I can handle merging this, Duo; I was just waiting to make sure you were OK. Thanks for the review here.
…ThrottlingException (apache#5275) Signed-off-by: Bryan Beaudreault <[email protected]> Signed-off-by: Duo Zhang <[email protected]>
The RpcThrottlingException tells the client how much to back off, but right now the recommendation is ignored. This PR introduces logic that respects said back off recommendation.
This feature was added to branch-2 via #5226
@bbeaudreault