@litianningdatadog
Contributor

@litianningdatadog litianningdatadog commented Nov 14, 2025

https://datadoghq.atlassian.net/browse/SVLS-7991

Overview

PR #894 introduced an asynchronous message-passing architecture that changed
the timing of CPU metrics collection. This timing difference exposed a
pre-existing issue where per-CPU idle time measurements could exceed the
wall-clock uptime delta, resulting in negative CPU utilization values being
reported to customers.

The root cause is a fundamental mismatch between measurement domains:

  • /proc/uptime measures wall-clock system uptime (single value)
  • /proc/stat measures per-CPU cumulative idle time (one value per core)

When these measurements are taken at slightly different times (especially in
the async processing model), the per-CPU idle time delta can exceed the uptime
delta, causing the formula ((uptime - idle_time) / uptime) * 100 to produce
negative results.
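
As a worked illustration of that formula, here is a minimal sketch. The function name `utilization_pct` and the millisecond figures are made up for this example and are not taken from the PR's code.

```rust
/// Hypothetical helper mirroring the formula quoted above:
/// ((uptime - idle_time) / uptime) * 100, applied to deltas between two samples.
fn utilization_pct(uptime_delta: f64, idle_delta: f64) -> f64 {
    ((uptime_delta - idle_delta) / uptime_delta) * 100.0
}

/// Two illustrative samples (values are invented for the example).
fn example_samples() -> (f64, f64) {
    // Consistent readings: 1000 ms of uptime, 400 ms of it idle.
    let consistent = utilization_pct(1000.0, 400.0);
    // Skewed readings: the idle delta (1020 ms) overshoots the uptime
    // delta (1000 ms), so the result drops below zero.
    let skewed = utilization_pct(1000.0, 1020.0);
    (consistent, skewed)
}
```

With these inputs, the consistent sample works out to 60% and the skewed sample to -2%, which is the kind of negative value described above.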

Solution

Added defensive validation and enforcement to the CPU utilization
calculation to ensure metrics always fall within valid ranges:

  1. Uptime Validation: Skip metrics if uptime delta is invalid (≤ 0)

    • Prevents division by zero
    • Catches timing anomalies early
  2. Per-CPU Idle Time Check: Force each CPU's idle time to be non-negative

    • Handles cases where idle_time > uptime due to measurement timing
    • Handles negative idle_time from measurement errors
  3. Utilization Calculation: Force all utilization values to be non-negative
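
The three guards above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the function name `guarded_utilization_pct`, its signature, and the choice to signal "skip" with `None` are all hypothetical.

```rust
/// Hypothetical per-CPU helper (name and signature are illustrative only).
/// Returns `None` when the uptime delta is unusable so the caller can skip the metric.
fn guarded_utilization_pct(uptime_delta: f64, idle_delta: f64) -> Option<f64> {
    // 1. Uptime validation: a delta <= 0 would divide by zero or flip the sign.
    if uptime_delta <= 0.0 {
        return None;
    }
    // 2. Per-CPU idle time check: never let idle time go negative.
    let idle = idle_delta.max(0.0);
    // 3. Utilization calculation: floor the result at zero.
    let pct = ((uptime_delta - idle) / uptime_delta) * 100.0;
    Some(pct.max(0.0))
}
```

Under this sketch, a sample whose idle delta overshoots the uptime delta (for example 1020 against 1000) is reported as 0 rather than a negative percentage, and a sample with a zero uptime delta is dropped.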

PS: Since this is a complex issue without a definitive solution yet, this fix serves only as a temporary measure to unblock our release commitment. We’ll continue investigating a long-term solution as a follow-up.

Test

  • Deployed the layer with the change and applied it to these stacks, since they were the main contributors of the negative stats
  • Observe the stats and expect that no more negative stats are reported
image

@litianningdatadog litianningdatadog marked this pull request as ready for review November 14, 2025 14:46
@litianningdatadog litianningdatadog requested a review from a team as a code owner November 14, 2025 14:46
@litianningdatadog litianningdatadog marked this pull request as draft November 14, 2025 14:50
@litianningdatadog litianningdatadog force-pushed the tianning.li/negative-cpu-stats branch from 76f03af to 424a71b Compare November 14, 2025 17:39
@litianningdatadog litianningdatadog marked this pull request as ready for review November 14, 2025 17:49
@litianningdatadog litianningdatadog changed the title from "[fix] Guard the potential negative CPU stats due to different measure of CPU idle time" to "fix: Guard against negative CPU utilization metrics" Nov 14, 2025
@litianningdatadog litianningdatadog force-pushed the tianning.li/negative-cpu-stats branch from 424a71b to c1c6a73 Compare November 14, 2025 18:18
@lym953
Contributor

lym953 commented Nov 14, 2025

When the metric is not negative, is it also different from before? If so, the change may confuse customers, e.g. trigger monitors that have been quiet. Do we consider reverting #894 for this release?

@litianningdatadog
Contributor Author

> When the metric is not negative, is it also different from before? If so, the change may confuse customers, e.g. trigger monitors that have been quiet. Do we consider reverting #894 for this release?

No significant difference is observed when excluding the negative values. Reverting #894 would be our last resort, as it still provides value. On the other hand, PR #930 appears to be a more promising fix.

@litianningdatadog litianningdatadog marked this pull request as draft November 14, 2025 21:58
@litianningdatadog litianningdatadog marked this pull request as ready for review November 19, 2025 19:10
## Problem

CPU utilization metrics could report negative values due to measurement
timing issues. When per-CPU idle time measurements are taken at slightly
different moments, the calculated idle_time delta could be negative,
resulting in invalid CPU utilization percentages.

## Solution

Implemented defensive validation in the CPU utilization calculation to
prevent negative values while still allowing values above the upper bound:

1. **Uptime Validation**: Skip metrics if uptime delta is invalid (≤ 0)
   - Prevents division by zero
   - Catches timing anomalies early

2. **Per-CPU Idle Time Guard**: Prevent negative idle time values
   - Changed from `.clamp(0.0, uptime)` to `.max(0.0)`
   - Allows idle_time to exceed uptime if it occurs
   - Handles negative idle_time from measurement errors

3. **Utilization Guards**: Prevent negative utilization percentages only
   - cpu_max_utilization: Changed from `.clamp(0.0, 100.0)` to `.max(0.0)`
   - cpu_min_utilization: Changed from `.clamp(0.0, 100.0)` to `.max(0.0)`
   - cpu_total_utilization_decimal: Changed from `.clamp(0.0, 1.0)` to `.max(0.0)`
   - Values can now exceed 100% if measurement timing causes this
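
For illustration, the behavioral difference between the two methods named above can be sketched as below. The wrapper names `bound_both_sides` and `floor_at_zero` are hypothetical and exist only for this example; `f64::clamp` and `f64::max` are the standard library methods.

```rust
/// Earlier behavior as described above: bounded on both sides.
fn bound_both_sides(pct: f64) -> f64 {
    pct.clamp(0.0, 100.0)
}

/// Behavior after this change as described above: floored at zero only.
fn floor_at_zero(pct: f64) -> f64 {
    pct.max(0.0)
}

/// Applies both guards to the same raw value so the results can be compared.
fn compare(pct: f64) -> (f64, f64) {
    (bound_both_sides(pct), floor_at_zero(pct))
}
```

Both guards map a negative input to 0, but only `clamp` caps a value such as 103.5 at 100. Note also that `f64::clamp` panics when `min > max`, so a call like `.clamp(0.0, uptime)` is only safe once the uptime delta is known to be non-negative.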

## Rationale

This approach prioritizes data accuracy over artificial constraints:
- Prevents mathematically invalid negative percentages
- Allows overflow (>100%) to be reported if it genuinely occurs
- Provides visibility into measurement anomalies rather than hiding them
- Keeps the code clean and idiomatic by using Rust's `.max()` method

The fix includes debug logging when invalid uptime deltas are detected.
@litianningdatadog litianningdatadog force-pushed the tianning.li/negative-cpu-stats branch from c1c6a73 to 03620f5 Compare November 19, 2025 20:12
@litianningdatadog litianningdatadog merged commit ebf1aa1 into main Nov 19, 2025
39 checks passed
@litianningdatadog litianningdatadog deleted the tianning.li/negative-cpu-stats branch November 19, 2025 21:34
lym953 pushed a commit that referenced this pull request Nov 21, 2025
https://datadoghq.atlassian.net/browse/SVLS-7991

## Overview

PR #894 introduced an asynchronous message-passing architecture that changed
the timing of CPU metrics collection. This timing difference exposed a
pre-existing issue where per-CPU idle time measurements could exceed the
wall-clock uptime delta, resulting in negative CPU utilization values being
reported to customers.

The root cause is a fundamental mismatch between measurement domains:
- `/proc/uptime` measures wall-clock system uptime (single value)
- `/proc/stat` measures per-CPU cumulative idle time (one value per core)

When these measurements are taken at slightly different times (especially in
the async processing model), the per-CPU idle time delta can exceed the uptime
delta, causing the formula `((uptime - idle_time) / uptime) * 100` to produce
negative results.

## Solution

Added defensive validation and enforcement to the CPU utilization
calculation to ensure metrics always fall within valid ranges:

1. **Uptime Validation**: Skip metrics if uptime delta is invalid (≤ 0)
   - Prevents division by zero
   - Catches timing anomalies early

2. **Per-CPU Idle Time Check**: Force each CPU's idle time to be non-negative
   - Handles cases where idle_time > uptime due to measurement timing
   - Handles negative idle_time from measurement errors

3. **Utilization Calculation**: Force all utilization values to be non-negative

PS: Since this is a complex issue without a definitive solution yet, this fix
serves only as a temporary measure to unblock our release commitment. We’ll
continue investigating a long-term solution as a follow-up.

## Test
- Deployed the layer with the change and applied it to [these stacks](https://docs.google.com/spreadsheets/d/1oF60PBhYvwdOfFn6yz3zBUhZCZUP7Z43VVE6XE2QGWQ/edit?gid=0#gid=0), since they were the main contributors of the negative stats
- Observe the stats and expect that [no more negative stats](https://ddserverless.datadoghq.com/dashboard/35c-u5f-8jm?fromUser=true&fullscreen_end_ts=1763142342127&fullscreen_paused=false&fullscreen_refresh_mode=sliding&fullscreen_section=overview&fullscreen_start_ts=1763138742127&fullscreen_widget=2129549723517146&refresh_mode=paused&tpl_var_runtime%5B0%5D=python3.10&tpl_var_runtime%5B1%5D=dotnet8&tpl_var_runtime%5B2%5D=java11&tpl_var_runtime%5B3%5D=nodejs20.x&tpl_var_runtime%5B4%5D=python3.13&tpl_var_runtime%5B5%5D=ruby3.2&tpl_var_service%5B0%5D=d1&from_ts=1763136901167&to_ts=1763137625000&live=false) are reported
<img width="2262" height="1698" alt="image"
src="https://github.com/user-attachments/assets/a4482590-64ac-4322-8169-103f82706ddf"
/>