-
Notifications
You must be signed in to change notification settings - Fork 15
fix: Guard against negative CPU utilization metrics #929
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
76f03af to
424a71b
Compare
424a71b to
c1c6a73
Compare
|
When the metric is not negative, is it also different from before? If so, the change may confuse customers, e.g. trigger monitors that have been quiet. Do we consider reverting #894 for this release? |
No significant difference is observed when excluding the negative values. Reverting #894 would be our last resort, as it still provides value. On the other hand, PR #930 appears to be a more promising fix. |
## Problem CPU utilization metrics could report negative values due to measurement timing issues. When per-CPU idle time measurements are taken at slightly different moments, the calculated idle_time delta could be negative, resulting in invalid CPU utilization percentages. ## Solution Implemented defensive validation in the CPU utilization calculation to prevent negative values while allowing overflow of upper bounds: 1. **Uptime Validation**: Skip metrics if uptime delta is invalid (≤ 0) - Prevents division by zero - Catches timing anomalies early 2. **Per-CPU Idle Time Guard**: Prevent negative idle time values - Changed from `.clamp(0.0, uptime)` to `.max(0.0)` - Allows idle_time to exceed uptime if it occurs - Handles negative idle_time from measurement errors 3. **Utilization Guards**: Prevent negative utilization percentages only - cpu_max_utilization: Changed from `.clamp(0.0, 100.0)` to `.max(0.0)` - cpu_min_utilization: Changed from `.clamp(0.0, 100.0)` to `.max(0.0)` - cpu_total_utilization_decimal: Changed from `.clamp(0.0, 1.0)` to `.max(0.0)` - Values can now exceed 100% if measurement timing causes this ## Rationale This approach prioritizes data accuracy over artificial constraints: - Prevents mathematically invalid negative percentages - Allows overflow (>100%) to be reported if it genuinely occurs - Provides visibility into measurement anomalies rather than hiding them - Maintains clean, idiomatic Rust code with `.max()` method The fix includes debug logging when invalid uptime deltas are detected.
c1c6a73 to
03620f5
Compare
https://datadoghq.atlassian.net/browse/SVLS-7991 ## Overview PR #894 introduced an asynchronous message-passing architecture that changed the timing of CPU metrics collection. This timing difference exposed a pre-existing issue where per-CPU idle time measurements could exceed the wall-clock uptime delta, resulting in negative CPU utilization values being reported to customers. The root cause is a fundamental mismatch between measurement domains: - `/proc/uptime` measures wall-clock system uptime (single value) - `/proc/stat` measures per-CPU cumulative idle time (one value per core) When these measurements are taken at slightly different times (especially in the async processing model), the per-CPU idle time delta can exceed the uptime delta, causing the formula `((uptime - idle_time) / uptime) * 100` to produce negative results. ## Solution Implemented defensive validation and enforce in the CPU utilization calculation to ensure metrics are always within valid ranges: 1. **Uptime Validation**: Skip metrics if uptime delta is invalid (≤ 0) - Prevents division by zero - Catches timing anomalies early 2. **Per-CPU Idle Time Check**: Enforce each CPU's idle time to non-negative value - Handles cases where idle_time > uptime due to measurement timing - Handles negative idle_time from measurement errors 3. **Utilization Calculation**: Force all utilization values to be non-negative value PS: Since this is a complex issue without a definitive solution yet, this fix serves only as a temporary measure to unblock our release commitment. We’ll continue investigating a long-term solution as a follow-up. ## Test - Deployed the layer w/ the change and applied them to [these stacks](https://docs.google.com/spreadsheets/d/1oF60PBhYvwdOfFn6yz3zBUhZCZUP7Z43VVE6XE2QGWQ/edit?gid=0#gid=0) as they were the main contributors of the negative stats - Observe the stats and expect [no more negative stats](https://ddserverless.datadoghq.com/dashboard/35c-u5f-8jm?fromUser=true&fullscreen_end_ts=1763142342127&fullscreen_paused=false&fullscreen_refresh_mode=sliding&fullscreen_section=overview&fullscreen_start_ts=1763138742127&fullscreen_widget=2129549723517146&refresh_mode=paused&tpl_var_runtime%5B0%5D=python3.10&tpl_var_runtime%5B1%5D=dotnet8&tpl_var_runtime%5B2%5D=java11&tpl_var_runtime%5B3%5D=nodejs20.x&tpl_var_runtime%5B4%5D=python3.13&tpl_var_runtime%5B5%5D=ruby3.2&tpl_var_service%5B0%5D=d1&from_ts=1763136901167&to_ts=1763137625000&live=false) are reported <img width="2262" height="1698" alt="image" src="https://github.com/user-attachments/assets/a4482590-64ac-4322-8169-103f82706ddf" />
https://datadoghq.atlassian.net/browse/SVLS-7991
Overview
PR #894 introduced an asynchronous message-passing architecture that changed
the timing of CPU metrics collection. This timing difference exposed a
pre-existing issue where per-CPU idle time measurements could exceed the
wall-clock uptime delta, resulting in negative CPU utilization values being
reported to customers.
The root cause is a fundamental mismatch between measurement domains:
/proc/uptimemeasures wall-clock system uptime (single value)/proc/statmeasures per-CPU cumulative idle time (one value per core)When these measurements are taken at slightly different times (especially in
the async processing model), the per-CPU idle time delta can exceed the uptime
delta, causing the formula
((uptime - idle_time) / uptime) * 100to producenegative results.
Solution
Implemented defensive validation and enforce in the CPU utilization
calculation to ensure metrics are always within valid ranges:
Uptime Validation: Skip metrics if uptime delta is invalid (≤ 0)
Per-CPU Idle Time Check: Enforce each CPU's idle time to non-negative value
Utilization Calculation: Force all utilization values to be non-negative value
PS: Since this is a complex issue without a definitive solution yet, this fix serves only as a temporary measure to unblock our release commitment. We’ll continue investigating a long-term solution as a follow-up.
Test