-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-24485][SS] Measure and log elapsed time for filesystem operations in HDFSBackedStateStoreProvider #21506
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ons in HDFSBackedStateStoreProvider
|
There're plenty of other debug messages which might hide the log messages added from this patch. Would we want to log them with INFO instead of DEBUG? |
|
Test build #91525 has finished for PR 21506 at commit
|
|
add to whitelist |
|
ok to test |
|
Test build #91626 has finished for PR 21506 at commit
|
|
Test build #91627 has finished for PR 21506 at commit
|
jose-torres
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I agree that these logs are worth the extra code.
I would argue that debug is the right logging level - these are pretty detailed metrics. But I don't feel super strongly about it.
| lastAvailableMap = | ||
| synchronized { loadedMaps.get(lastAvailableVersion) } | ||
| .orElse(readSnapshotFile(lastAvailableVersion)) | ||
| val (result, elapsedMs) = Utils.timeTakenMs { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Github has an... interesting idea of how to display this diff. The only change was the existing code moving inside timeTakenMs and adding the logWarning statements, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup right. Most of the code change is just wrapping codes into timeTakenMs.
|
|
||
| synchronized { loadedMaps.put(version, resultMap) } | ||
| resultMap | ||
| logWarning(s"Loading state for $version takes $elapsedMs ms.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this should be a warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just thought about making a pair between warning message above and this, but once we are guiding end users to turn on DEBUG level to see information regarding addition latencies, turning this to DEBUG would be also OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed log level to DEBUG.
|
Test build #91649 has finished for PR 21506 at commit
|
|
retest this please |
|
Test build #91651 has finished for PR 21506 at commit
|
|
Kindly ping again to @tdas |
|
lgtm |
|
Merged to master. |
| var lastAvailableMap: Option[MapType] = None | ||
| while (lastAvailableMap.isEmpty) { | ||
| lastAvailableVersion -= 1 | ||
| logWarning(s"The state for version $version doesn't exist in loadedMaps. " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HeartSaVioR this is normal operation and not yet understand why use logWarning then. Can we lower this to debug or was there a reason to use warning? It's kinda overkill in unit tests...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a normal if it's not just restored from checkpoint. If someone encounters the warning message while batches are running, it should be considered seriously because full state is being loaded from HDFS now though we expect state cache should contain it.
What changes were proposed in this pull request?
This patch measures and logs elapsed time for each operation which communicate with file system (mostly remote HDFS in production) in HDFSBackedStateStoreProvider to help investigating any latency issue.
How was this patch tested?
Manually tested.