fix: Guard against negative CPU utilization metrics #929

litianningdatadog · 2025-11-14T14:34:11Z

https://datadoghq.atlassian.net/browse/SVLS-7991

Overview

PR #894 introduced an asynchronous message-passing architecture that changed
the timing of CPU metrics collection. This timing difference exposed a
pre-existing issue where per-CPU idle time measurements could exceed the
wall-clock uptime delta, resulting in negative CPU utilization values being
reported to customers.

The root cause is a fundamental mismatch between measurement domains:

- /proc/uptime measures wall-clock system uptime (single value)
- /proc/stat measures per-CPU cumulative idle time (one value per core)

When these measurements are taken at slightly different times (especially in
the async processing model), the per-CPU idle time delta can exceed the uptime
delta, causing the formula ((uptime - idle_time) / uptime) * 100 to produce
negative results.
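
To make the arithmetic concrete, here is a minimal sketch of the quoted formula applied to the deltas between two samples. The function name `utilization_pct`, its parameters, and the millisecond figures are illustrative assumptions, not identifiers or data taken from the codebase.

```rust
/// Utilization formula quoted in the description, applied to the deltas
/// between two consecutive samples. Hypothetical helper for illustration.
fn utilization_pct(uptime_delta: f64, idle_delta: f64) -> f64 {
    ((uptime_delta - idle_delta) / uptime_delta) * 100.0
}

/// Aligned samples: 1000 ms elapsed and the CPU was idle for 800 ms of it.
fn aligned_sample() -> f64 {
    utilization_pct(1000.0, 800.0)
}

/// Skewed samples: the idle counter was read slightly later than the uptime
/// counter, so the idle delta (1010 ms) exceeds the uptime delta (1000 ms).
fn skewed_sample() -> f64 {
    utilization_pct(1000.0, 1010.0)
}
```

With these made-up numbers the aligned sample works out to about 20% utilization, while the skewed sample works out to about -1%, which is the kind of value this PR guards against.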

Solution

Added defensive validation and enforcement to the CPU utilization
calculation so that reported metrics always stay within a valid (non-negative) range:

1. Uptime Validation: Skip metrics if the uptime delta is invalid (≤ 0)
   - Prevents division by zero
   - Catches timing anomalies early
2. Per-CPU Idle Time Check: Force each CPU's idle time to a non-negative value
   - Handles cases where idle_time > uptime due to measurement timing
   - Handles negative idle_time from measurement errors
3. Utilization Calculation: Force all utilization values to be non-negative
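
The three guards listed above could be sketched roughly as follows. This is an assumption-laden illustration, not the merged implementation: the function `guarded_utilization`, its signature, and the choice to return `None` for a skipped sample are all hypothetical.

```rust
/// Hypothetical helper sketching the three guards from the description.
/// Returns one utilization percentage per CPU, or `None` when the uptime
/// delta is unusable and the sample should be skipped.
fn guarded_utilization(uptime_delta: f64, idle_deltas: &[f64]) -> Option<Vec<f64>> {
    // Guard 1: skip the sample when the uptime delta is zero or negative,
    // which also avoids dividing by zero below.
    if uptime_delta <= 0.0 {
        return None;
    }
    Some(
        idle_deltas
            .iter()
            .map(|&idle| {
                // Guard 2: treat a negative idle delta as zero.
                let idle = idle.max(0.0);
                // Guard 3: an idle delta larger than the uptime delta would
                // yield a negative percentage, so floor the result at zero.
                (((uptime_delta - idle) / uptime_delta) * 100.0).max(0.0)
            })
            .collect(),
    )
}
```

Under this sketch, an uptime delta of 1000 with idle deltas of 800, 1010, and -5 yields roughly 20%, 0%, and 100% for the three CPUs, and an uptime delta of 0 yields no metrics at all.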

PS: Since this is a complex issue without a definitive solution yet, this fix serves only as a temporary measure to unblock our release commitment. We’ll continue investigating a long-term solution as a follow-up.

Test

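- Deployed the layer with the change and applied it to [these stacks](https://docs.google.com/spreadsheets/d/1oF60PBhYvwdOfFn6yz3zBUhZCZUP7Z43VVE6XE2QGWQ/edit?gid=0#gid=0), which were the main contributors of the negative stats
- Observed the stats, expecting that [no more negative stats](https://ddserverless.datadoghq.com/dashboard/35c-u5f-8jm?fromUser=true&fullscreen_end_ts=1763142342127&fullscreen_paused=false&fullscreen_refresh_mode=sliding&fullscreen_section=overview&fullscreen_start_ts=1763138742127&fullscreen_widget=2129549723517146&refresh_mode=paused&tpl_var_runtime%5B0%5D=python3.10&tpl_var_runtime%5B1%5D=dotnet8&tpl_var_runtime%5B2%5D=java11&tpl_var_runtime%5B3%5D=nodejs20.x&tpl_var_runtime%5B4%5D=python3.13&tpl_var_runtime%5B5%5D=ruby3.2&tpl_var_service%5B0%5D=d1&from_ts=1763136901167&to_ts=1763137625000&live=false) are reported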

lym953 · 2025-11-14T20:00:17Z

When the metric is not negative, is it also different from before? If so, the change may confuse customers, e.g. trigger monitors that have been quiet. Do we consider reverting #894 for this release?

litianningdatadog · 2025-11-14T21:57:53Z

> When the metric is not negative, is it also different from before? If so, the change may confuse customers, e.g. trigger monitors that have been quiet. Do we consider reverting #894 for this release?

No significant difference is observed when excluding the negative values. Reverting #894 would be our last resort, as it still provides value. On the other hand, PR #930 appears to be a more promising fix.

## Problem

CPU utilization metrics could report negative values due to measurement timing issues. When per-CPU idle time measurements are taken at slightly different moments, the calculated idle_time delta could be negative, resulting in invalid CPU utilization percentages.

## Solution

Implemented defensive validation in the CPU utilization calculation to prevent negative values while allowing overflow of upper bounds:

1. **Uptime Validation**: Skip metrics if uptime delta is invalid (≤ 0)
   - Prevents division by zero
   - Catches timing anomalies early
2. **Per-CPU Idle Time Guard**: Prevent negative idle time values
   - Changed from `.clamp(0.0, uptime)` to `.max(0.0)`
   - Allows idle_time to exceed uptime if it occurs
   - Handles negative idle_time from measurement errors
3. **Utilization Guards**: Prevent negative utilization percentages only
   - cpu_max_utilization: Changed from `.clamp(0.0, 100.0)` to `.max(0.0)`
   - cpu_min_utilization: Changed from `.clamp(0.0, 100.0)` to `.max(0.0)`
   - cpu_total_utilization_decimal: Changed from `.clamp(0.0, 1.0)` to `.max(0.0)`
   - Values can now exceed 100% if measurement timing causes this

## Rationale

This approach prioritizes data accuracy over artificial constraints:

- Prevents mathematically invalid negative percentages
- Allows overflow (>100%) to be reported if it genuinely occurs
- Provides visibility into measurement anomalies rather than hiding them
- Maintains clean, idiomatic Rust code with `.max()` method

The fix includes debug logging when invalid uptime deltas are detected.
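
The behavioral difference between the two guards named in this commit message can be illustrated with a small sketch. The wrapper functions `bounded` and `floored` are hypothetical names introduced here for comparison only; `f64::clamp` and `f64::max` are the standard library methods the message refers to.

```rust
/// Two-sided guard, as in the earlier `.clamp(0.0, 100.0)` form:
/// values below 0 become 0 and values above 100 become 100.
fn bounded(pct: f64) -> f64 {
    pct.clamp(0.0, 100.0)
}

/// One-sided guard, as in the revised `.max(0.0)` form:
/// values below 0 become 0, values above 100 pass through unchanged.
fn floored(pct: f64) -> f64 {
    pct.max(0.0)
}
```

Both guards map a reading of -1.0 to 0.0. For a reading of 104.0, `bounded` returns 100.0 while `floored` returns 104.0, so the overshoot stays visible in the reported metric, which matches the rationale stated above.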


litianningdatadog marked this pull request as ready for review November 14, 2025 14:46

litianningdatadog requested a review from a team as a code owner November 14, 2025 14:46

litianningdatadog marked this pull request as draft November 14, 2025 14:50

litianningdatadog force-pushed the tianning.li/negative-cpu-stats branch from 76f03af to 424a71b Compare November 14, 2025 17:39

litianningdatadog marked this pull request as ready for review November 14, 2025 17:49

litianningdatadog changed the title ~~[fix] Guard the potential negative CPU stats due to different measure of CPU idle time~~ fix: Guard against negative CPU utilization metrics Nov 14, 2025

litianningdatadog force-pushed the tianning.li/negative-cpu-stats branch from 424a71b to c1c6a73 Compare November 14, 2025 18:18

litianningdatadog marked this pull request as draft November 14, 2025 21:58

litianningdatadog closed this Nov 17, 2025

litianningdatadog reopened this Nov 18, 2025

litianningdatadog marked this pull request as ready for review November 19, 2025 19:10

duncanista approved these changes Nov 19, 2025

litianningdatadog force-pushed the tianning.li/negative-cpu-stats branch from c1c6a73 to 03620f5 Compare November 19, 2025 20:12

litianningdatadog merged commit ebf1aa1 into main Nov 19, 2025
39 checks passed

litianningdatadog deleted the tianning.li/negative-cpu-stats branch November 19, 2025 21:34
