Commit ebf1aa1

fix: Guard against negative CPU utilization metrics (#929)
https://datadoghq.atlassian.net/browse/SVLS-7991

## Overview

PR #894 introduced an asynchronous message-passing architecture that changed the timing of CPU metrics collection. This timing difference exposed a pre-existing issue where per-CPU idle time measurements could exceed the wall-clock uptime delta, resulting in negative CPU utilization values being reported to customers.

The root cause is a fundamental mismatch between measurement domains:

- `/proc/uptime` measures wall-clock system uptime (single value)
- `/proc/stat` measures per-CPU cumulative idle time (one value per core)

When these measurements are taken at slightly different times (especially in the async processing model), the per-CPU idle time delta can exceed the uptime delta, causing the formula `((uptime - idle_time) / uptime) * 100` to produce negative results.

## Solution

Added defensive validation and clamping to the CPU utilization calculation so that metrics always fall within valid ranges:

1. **Uptime Validation**: Skip metrics if the uptime delta is invalid (≤ 0)
   - Prevents division by zero
   - Catches timing anomalies early
2. **Per-CPU Idle Time Check**: Clamp each CPU's idle time to a non-negative value
   - Handles cases where idle_time > uptime due to measurement timing
   - Handles negative idle_time from measurement errors
3. **Utilization Calculation**: Force all utilization values to be non-negative

PS: Since this is a complex issue without a definitive solution yet, this fix serves only as a temporary measure to unblock our release commitment. We’ll continue investigating a long-term solution as a follow-up.
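The three guards described in the Solution section can be sketched roughly as follows. This is an illustrative Python sketch only, not the code from this commit: the function name `cpu_utilization_percent` and its parameters are hypothetical, and the inputs are assumed to be deltas already computed from two consecutive `/proc/uptime` and `/proc/stat` readings. Without the guards, an uptime delta of 1.0 with a per-CPU idle delta of 1.2 would give `((1.0 - 1.2) / 1.0) * 100`, roughly -20%.

```python
from typing import List, Optional


def cpu_utilization_percent(
    uptime_delta: float, idle_deltas: List[float]
) -> Optional[List[float]]:
    """Per-CPU utilization in percent, or None if the sample should be skipped.

    Hypothetical helper: both arguments are assumed to be deltas between
    two consecutive samples, expressed in the same time unit.
    """
    # Guard 1: skip the sample when the uptime delta is invalid (<= 0).
    # This also prevents division by zero below.
    if uptime_delta <= 0:
        return None

    utilizations = []
    for idle in idle_deltas:
        # Guard 2: clamp the per-CPU idle delta to a non-negative value.
        idle = max(idle, 0.0)
        utilization = ((uptime_delta - idle) / uptime_delta) * 100
        # Guard 3: force the result to be non-negative, which covers the
        # case where the idle delta exceeds the uptime delta.
        utilizations.append(max(utilization, 0.0))
    return utilizations
```

Calling `cpu_utilization_percent(1.0, [1.2, 0.5, -0.1])` under these assumptions reports the first core as 0% rather than a negative value, while a non-positive uptime delta returns `None` so that no metric is emitted for that sample.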
## Test

- Deployed the layer with the change and applied it to [these stacks](https://docs.google.com/spreadsheets/d/1oF60PBhYvwdOfFn6yz3zBUhZCZUP7Z43VVE6XE2QGWQ/edit?gid=0#gid=0), as they were the main contributors of the negative stats
- Observed the stats, expecting [no more negative stats](https://ddserverless.datadoghq.com/dashboard/35c-u5f-8jm?fromUser=true&fullscreen_end_ts=1763142342127&fullscreen_paused=false&fullscreen_refresh_mode=sliding&fullscreen_section=overview&fullscreen_start_ts=1763138742127&fullscreen_widget=2129549723517146&refresh_mode=paused&tpl_var_runtime%5B0%5D=python3.10&tpl_var_runtime%5B1%5D=dotnet8&tpl_var_runtime%5B2%5D=java11&tpl_var_runtime%5B3%5D=nodejs20.x&tpl_var_runtime%5B4%5D=python3.13&tpl_var_runtime%5B5%5D=ruby3.2&tpl_var_service%5B0%5D=d1&from_ts=1763136901167&to_ts=1763137625000&live=false) to be reported

<img width="2262" height="1698" alt="image" src="https://github.com/user-attachments/assets/a4482590-64ac-4322-8169-103f82706ddf" />

1 parent 95ab4b6 commit ebf1aa1

1 file changed: +390 -3 lines changed

0 commit comments
