HBASE-25329 Dump region hashes in logs for the regions that are stuck in transition for more than a configured amount of time #2762
67198af to
61c1278
Compare
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
| if (oldestRITTime < ritTime) { | ||
| oldestRITTime = ritTime; | ||
| } | ||
| if (counter < 500) { // Record 500 oldest RITs |
500 seems high to me. Would 20 be enough? Should this be a constant in the file, even if not configurable?
|
|
||
| LOG.debug("Oldest RIT hashes and states: " + oldestRITHashesAndStates.toString()); | ||
| long time = EnvironmentEdgeManager.currentTime(); | ||
| if ((time - ritThreshold / 2) >= this.lastRITHashMetricUpdate) { |
Why do we update the metrics hashes only if more than ritThreshold / 2 has elapsed since the last update? If we do the work to find the oldest, it seems like we should always update the metrics. Then any query to get metrics will always have the most recent results (~3 seconds old at max, with the default hbase.regionserver.msginterval).
Though I do think it's a good idea for limiting the logging. Should the LOG.debug statement be in here?
Then this statement may deserve a comment.
| if ((time - ritThreshold / 2) >= this.lastRITHashMetricUpdate) { | |
| // Only log if it has been long enough since the last update (default 30 seconds) | |
| if ((time - ritThreshold / 2) >= this.lastRITHashMetricUpdate) { |
> Though I do think it's a good idea for limiting the logging. Should the LOG.debug statement be in here?
Yes, it is for the reason you stated. I will add the comment and move LOG.debug under this condition. Maybe the other stuff can be moved outside.
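To illustrate the time gate discussed in this thread, here is a minimal sketch. `RitLogThrottle` and its fields are hypothetical names, not HBase code; it only reuses the condition from the diff and assumes the last-update timestamp is refreshed each time the gate passes.

```java
/** Hypothetical helper, not part of HBase: gates an action to at most once per ritThreshold / 2. */
class RitLogThrottle {
  private final long ritThresholdMs;
  private long lastUpdateMs; // stand-in for lastRITHashMetricUpdate in the diff

  RitLogThrottle(long ritThresholdMs, long startMs) {
    this.ritThresholdMs = ritThresholdMs;
    this.lastUpdateMs = startMs;
  }

  /** Returns true (and records nowMs) when at least ritThreshold / 2 has elapsed since the last update. */
  boolean shouldLog(long nowMs) {
    // Same condition as in the diff: (time - ritThreshold / 2) >= lastRITHashMetricUpdate
    if ((nowMs - ritThresholdMs / 2) >= lastUpdateMs) {
      lastUpdateMs = nowMs; // assumption: the timestamp is refreshed when the gate passes
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    RitLogThrottle throttle = new RitLogThrottle(60_000L, 0L);
    System.out.println(throttle.shouldLog(10_000L)); // too soon after start
    System.out.println(throttle.shouldLog(30_000L)); // half the threshold has elapsed
    System.out.println(throttle.shouldLog(45_000L)); // only 15 s since the last pass
  }
}
```

With the 60000 ms default threshold from the diff, the gate opens at most once every 30 seconds, which matches the "default 30 seconds" wording in the suggested comment.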
| this.metricsAssignmentManager.updateRITCountOverThreshold(totalRITsOverThreshold); | ||
|
|
||
| LOG.debug("Oldest RIT hashes and states: " + oldestRITHashesAndStates.toString()); | ||
| long time = EnvironmentEdgeManager.currentTime(); |
can we just reuse currentTime here from the beginning of the method?
| getInt(HConstants.METRICS_RIT_STUCK_WARNING_THRESHOLD, 60000); | ||
| for (RegionState state: regionStates.getRegionsInTransition()) { | ||
| int counter = 0; | ||
| for (RegionState state: regionStates.getRegionsInTransitionOrderedByDuration()) { |
since this method uses the state.getStamp() to determine whether the RIT is over the threshold or the oldest time... should we just use getRegionsInTransitionOrderedByTimestamp instead?
This new method getRegionsInTransitionOrderedByDuration is tracking total duration instead of duration since last state transition, so the longest RITs may not necessarily be those that are reported over threshold. I think it's probably good to be consistent... I'm not sure which way is better to be consistent, but since these metrics in this method are reporting time since last change instead of total duration, I'd favor that approach.
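A small sketch of why the two orderings can disagree. The `Rit` class and its fields are hypothetical stand-ins, not the HBase `RegionState` API: `firstSeenMs` models when the region first entered transition (total duration) and `stampMs` models the time of the last state change.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class RitOrdering {
  /** Hypothetical stand-in for a region in transition. */
  static final class Rit {
    final String name;
    final long firstSeenMs; // when the region first entered transition
    final long stampMs;     // when its state last changed

    Rit(String name, long firstSeenMs, long stampMs) {
      this.name = name;
      this.firstSeenMs = firstSeenMs;
      this.stampMs = stampMs;
    }
  }

  /** Oldest last-state-change first: the quantity the threshold check in the diff uses. */
  static List<String> byStamp(List<Rit> rits) {
    return names(rits, Comparator.comparingLong((Rit r) -> r.stampMs));
  }

  /** Longest total time in transition first. */
  static List<String> byTotalDuration(List<Rit> rits) {
    return names(rits, Comparator.comparingLong((Rit r) -> r.firstSeenMs));
  }

  private static List<String> names(List<Rit> rits, Comparator<Rit> order) {
    List<Rit> sorted = new ArrayList<>(rits);
    sorted.sort(order);
    List<String> result = new ArrayList<>();
    for (Rit r : sorted) {
      result.add(r.name);
    }
    return result;
  }
}
```

A region that has been in transition longest overall but changed state recently sorts first by total duration and last by timestamp, so a "top N" list built from one ordering may not match the over-threshold set computed from the other.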
| oldestRITTime = ritTime; | ||
| } | ||
| if (counter < 500) { // Record 500 oldest RITs | ||
| oldestRITHashesAndStates.add( |
should we also filter these to only those RITs that are over the threshold? Or do we want to display every RIT, even if it's only been transitioning for 100ms, if it's the longest/only RIT in the cluster?
I think it makes sense to only record RITs over threshold, since those would be the problematic ones
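A rough sketch of what recording only over-threshold RITs, with a named cap, could look like. `OverThresholdRits`, `MAX_RITS_TO_RECORD`, and the use of plain `Long` ages are all assumptions for illustration; the value 20 is the number floated earlier in this review, not a decided default.

```java
import java.util.ArrayList;
import java.util.List;

class OverThresholdRits {
  // Hypothetical named constant replacing the inline 500.
  static final int MAX_RITS_TO_RECORD = 20;

  /**
   * ritAgesMs holds the time since the last state change for each RIT
   * (a stand-in for currentTime - state.getStamp()).
   * Returns only the ages above the threshold, in input order, capped at MAX_RITS_TO_RECORD.
   */
  static List<Long> select(List<Long> ritAgesMs, long ritThresholdMs) {
    List<Long> selected = new ArrayList<>();
    for (long age : ritAgesMs) {
      if (selected.size() >= MAX_RITS_TO_RECORD) {
        break; // cap reached, stop recording
      }
      if (age > ritThresholdMs) { // same strict comparison as the threshold check in the diff
        selected.add(age);
      }
    }
    return selected;
  }
}
```

With this shape, a lone RIT that has been transitioning for 100 ms is never recorded, which addresses the question raised above.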
| } | ||
| if (counter < 500) { // Record 500 oldest RITs | ||
| oldestRITHashesAndStates.add( | ||
| state.getRegion().getRegionNameAsString() + ":" + state.getState().name() |
Do we have to use name() as I see other references use state.getState() directly.
region names could be quite long - when you refer to hash, did you want the md5 encoded name?
| state.getRegion().getRegionNameAsString() + ":" + state.getState().name() | |
| state.getRegion().getEncodedName() + ":" + state.getState() |
Or should the entire region name be available?
if the md5 encoded name is enough info for debugging (which I think yes), then it should suffice to use that 👍
as for state, that is an enum, and we want the state name, which is why I added .name(). see this SO
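For context on the `name()` question: Java string concatenation calls `toString()` on an enum, and `Enum.toString()` returns the same value as `name()` unless the enum overrides it. The sketch below uses hypothetical enums (`State`, `Labeled`) rather than the HBase `RegionState.State` type, so whether the two forms agree in this PR depends on whether that enum overrides `toString()`.

```java
class EnumNameDemo {
  // Hypothetical stand-in for a state enum that does not override toString().
  enum State { OPENING, PENDING_CLOSE }

  // Hypothetical enum that overrides toString(), where the two forms differ.
  enum Labeled {
    OFFLINE;

    @Override
    public String toString() {
      return "offline (custom)";
    }
  }

  // Explicit name(): always the declared constant name.
  static String viaName(State s) {
    return "abc123:" + s.name();
  }

  // Plain concatenation: uses toString(), which defaults to name().
  static String viaConcat(State s) {
    return "abc123:" + s;
  }

  public static void main(String[] args) {
    System.out.println(viaName(State.OPENING));
    System.out.println(viaConcat(State.OPENING));
    System.out.println("x:" + Labeled.OFFLINE);        // uses the overridden toString()
    System.out.println("x:" + Labeled.OFFLINE.name()); // still the constant name
  }
}
```

So `.name()` is only required when the enum's `toString()` is overridden or may be in the future; otherwise both spellings produce the same log text.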
| this.metricsAssignmentManager.updateRITCount(totalRITs); | ||
| this.metricsAssignmentManager.updateRITCountOverThreshold(totalRITsOverThreshold); | ||
|
|
||
| LOG.debug("Oldest RIT hashes and states: " + oldestRITHashesAndStates.toString()); |
we should only log if oldestRITHashesAndStates is non-empty.
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
@virajjasani @bharathv do you mind taking a look at this? thanks! |
|
Thanks for putting this together. Definitely useful to have more insights into RIT states. I added a comment in the jira. Would like to know your thoughts. |
|
I just realized that this branch-1 backport PR also has review comments. |
| if (ritTime > ritThreshold) { // more than the threshold | ||
| totalRITsOverThreshold++; | ||
| if (ritsOverThreshold == null) { | ||
| ritsOverThreshold = new HashMap<String, RegionState>(); |
nit: new HashMap<>()
| if (LOG.isDebugEnabled() && ritsOverThreshold != null && !ritsOverThreshold.isEmpty()) { | ||
| StringBuilder sb = new StringBuilder(); | ||
| for (String regionName: ritsOverThreshold.keySet()) { | ||
| sb.append(regionName + ":" + ritsOverThreshold.get(regionName).getState().name() + "\n"); |
nit: use chain of append() to cover String concatenation?
sb.append(regionName).append(":")
.append(ritsOverThreshold.get(regionName).getState().name()).append("\n");
| for (String regionName: ritsOverThreshold.keySet()) { | ||
| sb.append(regionName + ":" + ritsOverThreshold.get(regionName).getState().name() + "\n"); | ||
| } | ||
| sb.delete(sb.length()-2, sb.length()); |
Do we want to remove the last appended newline character? If so, the start index should be sb.length()-1, right?
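A quick sketch of the off-by-one in question. `"\n"` is a single character in Java, so deleting from `sb.length()-2` also removes the last character of the final entry. The class and method names are hypothetical and the entries are sample strings, not real region names.

```java
class TrailingNewline {
  /** Joins entries with "\n" and removes only the trailing newline. */
  static String joinThenTrim(String[] entries) {
    StringBuilder sb = new StringBuilder();
    for (String entry : entries) {
      sb.append(entry).append("\n");
    }
    if (sb.length() > 0) {
      sb.delete(sb.length() - 1, sb.length()); // "\n" is one char
    }
    return sb.toString();
  }

  /** Same loop, but with the length()-2 start index from the diff, to show what it removes. */
  static String joinThenDeleteTwo(String[] entries) {
    StringBuilder sb = new StringBuilder();
    for (String entry : entries) {
      sb.append(entry).append("\n");
    }
    if (sb.length() > 1) {
      sb.delete(sb.length() - 2, sb.length()); // also drops the last char of the final entry
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String[] entries = {"a:OPENING", "b:CLOSING"};
    System.out.println(joinThenTrim(entries));
    System.out.println(joinThenDeleteTwo(entries));
  }
}
```

The second variant ends with `b:CLOSIN`, which is the truncation the question above is pointing at.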
Left a few nits and one question about what we want to remove from the StringBuilder.
| int totalRITs = 0; | ||
| int totalRITsOverThreshold = 0; | ||
| long oldestRITTime = 0; | ||
| HashMap<String, RegionState> ritsOverThreshold = null; |
nit: Map<String, RegionState> ritsOverThreshold = null
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
| } | ||
| if (LOG.isDebugEnabled() && ritsOverThreshold != null && !ritsOverThreshold.isEmpty()) { | ||
| StringBuilder sb = new StringBuilder(); | ||
| for (String regionName: ritsOverThreshold.keySet()) { |
If you could convert this to entrySet(), that would be great (as per findbugs)
3d173cb to
36e0655
Compare
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
Pending one minor update after last iteration, else good to go
| sb.append(regionName).append(":") | ||
| .append(ritsOverThreshold.get(regionName).getState().name()).append("\n"); |
This should be:
sb.append(regionName.getKey()).append(":")
.append(regionName.getValue().getState().name()).append("\n");
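For reference, a self-contained sketch of the `entrySet()` form with chained `append()` calls. The map values here are plain state-name strings standing in for `RegionState` objects, and `RitEntryFormatter` is a hypothetical name; in the PR the value would go through `getState().name()` as shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RitEntryFormatter {
  /** Formats each entry as "key:value\n" in the map's iteration order. */
  static String format(Map<String, String> ritsOverThreshold) {
    StringBuilder sb = new StringBuilder();
    // Iterating entrySet() avoids a second lookup per key (the findbugs complaint about keySet() + get()).
    for (Map.Entry<String, String> entry : ritsOverThreshold.entrySet()) {
      sb.append(entry.getKey()).append(":")
        .append(entry.getValue()).append("\n");
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Map<String, String> rits = new LinkedHashMap<>(); // insertion order, for a stable demo
    rits.put("abc123", "OPENING");
    rits.put("def456", "PENDING_CLOSE");
    System.out.print(format(rits));
  }
}
```

Naming the loop variable `entry` rather than `regionName` also makes it clearer that it is a `Map.Entry` and not the key.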
… in transition for more than a configured amount of time
111c16f to
295f851
Compare
+1, pending QA
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |