HostMetrics process scraper high CPU usage during collection on Windows Server 2019 #32947

Closed
@drewftw

Description

Component(s)

receiver/hostmetrics

What happened?

Description

The OpenTelemetry Collector running on Windows Server 2019 was observed to spike to 3-7% CPU each time the hostmetrics receiver ran a collection, which was configured at a 1-minute interval.

Testing narrowed the issue down to the process scraper. The following shows the collector's CPU usage when only the process scraper is enabled.

With all other hostmetrics scrapers re-enabled and only the process scraper disabled, the magnitude of the CPU spikes drops significantly (<0.5%).

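In config terms, the two isolation runs differ only in the hostmetrics scrapers block; a sketch of the two variants (metric toggles elided, full config attached below):

# Run 1: only the process scraper enabled -> 3-7% spikes
scrapers:
  process:
    mute_process_exe_error: true
---
# Run 2: every other scraper enabled, process disabled -> <0.5% spikes
scrapers:
  cpu:
  disk:
  load:
  filesystem:
  memory:
  network:
  paging:
  processes: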

Steps to Reproduce

1. On a machine running Windows Server 2019, download v0.94.0 of the collector from https://github.com/open-telemetry/opentelemetry-collector-releases/releases/tag/v0.94.0.

2. Modify config.yaml to enable the hostmetrics process scraper and set the collection interval (the full config is attached below; a minimal repro variant is sketched after these steps).

3. Run the collector exe.

4. Monitor the collector's CPU usage in Task Manager, or graph it with perfmon.
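
A stripped-down config should be enough to reproduce the spikes; a minimal sketch, using the debug exporter in place of the OTLP endpoint from the full config:

receivers:
  hostmetrics:
    collection_interval: 1m
    scrapers:
      process:
        mute_process_exe_error: true

exporters:
  debug:
    verbosity: normal

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [debug]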

Expected Result

CPU usage comparable to the levels observed on collectors running on Linux (<0.5%)

Actual Result

CPU spikes to 3-7%

Collector version

v0.93.0

Environment information

Environment

OS: Windows Server 2019

OpenTelemetry Collector configuration

receivers:
  hostmetrics:
    collection_interval: 1m
    scrapers:
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
      disk:
      load:
      filesystem:
        metrics:
          system.filesystem.utilization:
            enabled: true
      memory:
        metrics:
          system.memory.utilization:
            enabled: true
      network:
      paging:
        metrics:
          system.paging.utilization:
            enabled: true
      processes:
      process:
        mute_process_exe_error: true
        metrics:
          process.cpu.utilization:
            enabled: true
          process.memory.utilization:
            enabled: true
  docker_stats:
    collection_interval: 1m
    metrics:
      container.cpu.throttling_data.periods:
        enabled: true
      container.cpu.throttling_data.throttled_periods:
        enabled: true
      container.cpu.throttling_data.throttled_time:
        enabled: true
  prometheus:
    config:
      scrape_configs:
        - job_name: $InstanceId/otel-self-metrics-collector-$Region
          scrape_interval: 1m
          static_configs:
            - targets: ['0.0.0.0:9999']
  otlp:
    protocols:
      grpc:
      http:

exporters:
  debug:
    verbosity: normal
  otlp:
    endpoint: <endpoint>

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 500
    spike_limit_mib: 100
  batch:
    send_batch_size: 8192
    send_batch_max_size: 8192
    timeout: 2000ms
  filter:
    metrics:
      exclude:
        match_type: strict
        metric_names:
          # comment out a metric to remove it from the exclusion rule
          - otelcol_exporter_queue_capacity
          - otelcol_exporter_enqueue_failed_spans
          - otelcol_exporter_enqueue_failed_log_records
          - otelcol_exporter_enqueue_failed_metric_points
          - otelcol_exporter_send_failed_metric_points
          - otelcol_process_runtime_heap_alloc_bytes
          - otelcol_process_runtime_total_alloc_bytes
          - otelcol_processor_batch_timeout_trigger_send
          - otelcol_process_runtime_total_sys_memory_bytes
          - otelcol_process_uptime
          - otelcol_scraper_errored_metric_points
          - otelcol_scraper_scraped_metric_points
          - scrape_samples_scraped
          - scrape_samples_post_metric_relabeling
          - scrape_series_added
          - scrape_duration_seconds
          # - up
  resourcedetection:
    detectors: [ec2, env, system]
    ec2:
      tags:
        - ^Environment$
    system:
      hostname_sources: ["os"]
      resource_attributes:
        host.id:
          enabled: true

extensions:
  health_check:
  pprof:
  zpages:

service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:9999
  extensions: [pprof, zpages, health_check]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug, otlp]
      processors: [memory_limiter, batch, resourcedetection]
    metrics:
      receivers: [otlp, hostmetrics, prometheus]
      exporters: [debug, otlp]
      processors: [memory_limiter, batch, resourcedetection, filter]
    logs:
      receivers: [otlp]
      exporters: [debug, otlp]
      processors: [memory_limiter, batch, resourcedetection]

Log output

No response

Additional context

Additional details: Windows Server 2019 was running on an m5x.large EC2 instance.
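
Since the attached config already enables the pprof extension, capturing a CPU profile while a collection is in flight should show exactly where the process scraper spends its time. A sketch, assuming the extension's default endpoint of localhost:1777:

extensions:
  pprof:
    endpoint: localhost:1777  # default; serves /debug/pprof/

service:
  extensions: [pprof]

# While a scrape is running, capture a 30-second CPU profile with:
#   go tool pprof http://localhost:1777/debug/pprof/profile?seconds=30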
