Commit ebf1aa1
fix: Guard against negative CPU utilization metrics (#929)
https://datadoghq.atlassian.net/browse/SVLS-7991
## Overview
PR #894 introduced an asynchronous message-passing architecture that changed
the timing of CPU metrics collection. This timing difference exposed a
pre-existing issue where per-CPU idle time measurements could exceed the
wall-clock uptime delta, resulting in negative CPU utilization values being
reported to customers.

The root cause is a fundamental mismatch between measurement domains:
- `/proc/uptime` measures wall-clock system uptime (single value)
- `/proc/stat` measures per-CPU cumulative idle time (one value per core)

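To make the two measurement domains concrete, here is a minimal Python sketch of pulling both values out of sample file contents. It is not the code from this commit: the sample strings, the function names, and the `USER_HZ = 100` tick rate are assumptions for illustration only.

```python
# Sample contents; on a real Linux host these would be read from the files.
SAMPLE_UPTIME = "12345.67 45678.90\n"
SAMPLE_STAT = """cpu  400 0 200 2400 0 0 0 0 0 0
cpu0 200 0 100 1200 0 0 0 0 0 0
cpu1 200 0 100 1200 0 0 0 0 0 0
intr 12345
"""

# Assumption: 100 ticks per second is typical; a real host should be
# queried with os.sysconf("SC_CLK_TCK") instead of hard-coding this.
USER_HZ = 100


def parse_uptime_seconds(text):
    """First field of /proc/uptime: wall-clock seconds since boot."""
    return float(text.split()[0])


def parse_per_cpu_idle_seconds(text):
    """Idle column of each `cpuN` line in /proc/stat, in seconds."""
    idle = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep only per-core lines (cpu0, cpu1, ...); skip the aggregate
        # "cpu" line and unrelated lines such as "intr".
        if fields and fields[0].startswith("cpu") and fields[0][3:].isdigit():
            # Fields after the label: user, nice, system, idle, ...
            idle[fields[0]] = int(fields[4]) / USER_HZ
    return idle
```

The two parsers read from different files at different instants, which is why the resulting deltas are not guaranteed to cover the same time window.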
When these measurements are taken at slightly different times (especially in
the async processing model), the per-CPU idle time delta can exceed the uptime
delta, causing the formula `((uptime - idle_time) / uptime) * 100` to produce
negative results.

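As a small worked example of that formula, the following Python sketch shows how a slight sampling skew turns into a negative value. The function name and the delta values are hypothetical and chosen only to illustrate the arithmetic.

```python
def naive_utilization(uptime_delta, idle_delta):
    """Unguarded formula from the description above, as a percentage."""
    return ((uptime_delta - idle_delta) / uptime_delta) * 100


# Both deltas cover the same 1.0 s window: the core was idle for 0.25 s.
in_sync = naive_utilization(1.0, 0.25)

# The idle counter was sampled slightly later than uptime, so its delta
# (1.02 s) is larger than the 1.0 s uptime delta and the result is negative.
skewed = naive_utilization(1.0, 1.02)

print(in_sync, skewed)
```

In the skewed case the numerator `uptime - idle_time` is already below zero, so no amount of scaling afterwards can bring the percentage back into range.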
## Solution
Implemented defensive validation and clamping in the CPU utilization
calculation to ensure metrics are always within valid ranges:
1. **Uptime Validation**: Skip metrics if the uptime delta is invalid (≤ 0)
   - Prevents division by zero
   - Catches timing anomalies early
2. **Per-CPU Idle Time Check**: Clamp each CPU's idle time to a
   non-negative value
   - Handles cases where idle_time > uptime due to measurement timing
   - Handles negative idle_time from measurement errors
3. **Utilization Calculation**: Force all utilization values to be
   non-negative

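The three guards above can be sketched roughly as follows. This is an illustrative Python sketch, not the code changed in this commit: the function name, the dictionary-based input, and returning `None` to signal "skip" are all assumptions, and the real implementation may differ in detail.

```python
def guarded_utilization(uptime_delta, idle_deltas):
    """Per-CPU utilization in percent, or None if the sample is skipped.

    `idle_deltas` maps a CPU label (e.g. "cpu0") to its idle-time delta,
    in the same unit as `uptime_delta`.
    """
    # 1. Uptime validation: a non-positive delta would divide by zero
    #    or signal a timing anomaly, so skip this round of metrics.
    if uptime_delta <= 0:
        return None

    result = {}
    for cpu, idle in idle_deltas.items():
        # 2. Per-CPU idle time check: never let idle time go below zero.
        idle = max(idle, 0.0)
        # 3. Utilization calculation: floor the final value at zero, which
        #    covers the case where idle time exceeds the uptime delta.
        utilization = ((uptime_delta - idle) / uptime_delta) * 100
        result[cpu] = max(utilization, 0.0)
    return result


# Hypothetical deltas: one healthy core, one skewed core, one bad reading.
print(guarded_utilization(1.0, {"cpu0": 0.25, "cpu1": 1.02, "cpu2": -0.1}))
```

Note that flooring at zero hides the skew rather than correcting it, which is consistent with the description of this fix as a temporary measure.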
PS: Since this is a complex issue without a definitive solution yet,
this fix serves only as a temporary measure to unblock our release
commitment. We’ll continue investigating a long-term solution as a
follow-up.
## Test
- Deployed the layer with the change and applied it to [these
stacks](https://docs.google.com/spreadsheets/d/1oF60PBhYvwdOfFn6yz3zBUhZCZUP7Z43VVE6XE2QGWQ/edit?gid=0#gid=0),
as they were the main contributors of the negative stats
- Observed the stats, expecting [no more negative
stats](https://ddserverless.datadoghq.com/dashboard/35c-u5f-8jm?fromUser=true&fullscreen_end_ts=1763142342127&fullscreen_paused=false&fullscreen_refresh_mode=sliding&fullscreen_section=overview&fullscreen_start_ts=1763138742127&fullscreen_widget=2129549723517146&refresh_mode=paused&tpl_var_runtime%5B0%5D=python3.10&tpl_var_runtime%5B1%5D=dotnet8&tpl_var_runtime%5B2%5D=java11&tpl_var_runtime%5B3%5D=nodejs20.x&tpl_var_runtime%5B4%5D=python3.13&tpl_var_runtime%5B5%5D=ruby3.2&tpl_var_service%5B0%5D=d1&from_ts=1763136901167&to_ts=1763137625000&live=false)
to be reported

<img width="2262" height="1698" alt="image"
src="https://github.com/user-attachments/assets/a4482590-64ac-4322-8169-103f82706ddf"
/>

1 parent 95ab4b6, commit ebf1aa1
1 file changed, +390 -3 lines changed