ECS task metadata API stats endpoint undercounts bytes transmitted/received on EC2 with awsvpc network mode #4618

@isker

I used this CDK stack to launch two ECS tasks, one running on Fargate and one on EC2 (with awsvpc network mode, to best mirror the Fargate task). Each task runs an alpine container, which I used with ECS Exec to get a shell inside the task.

In each task, I installed iperf3 and jq using apk add iperf3 jq. I then used iperf3 to send 1 GB of TCP traffic from one task to the other, and vice versa: the receiving task runs iperf3 -s to start a server, and the sending task runs iperf3 -c $OTHER_TASK_IPV4_ADDR -n 1G to send 1 GB to the receiving task.

In each task, I then printed the alpine container's network stats using the ECS task metadata API: wget -q -O- ${ECS_CONTAINER_METADATA_URI_V4}/stats | jq .networks.

The Fargate task reports reasonable numbers, namely a bit over 1 GB transmitted and received (we have of course done other things in these tasks beyond just invoking iperf3):

# wget -q -O- ${ECS_CONTAINER_METADATA_URI_V4}/stats | jq .networks
{
  "eth1": {
    "rx_bytes": 1096752652,
    "rx_packets": 159456,
    "rx_errors": 0,
    "rx_dropped": 0,
    "tx_bytes": 1083080110,
    "tx_packets": 139570,
    "tx_errors": 0,
    "tx_dropped": 0
  }
}

The EC2 task does not report the expected numbers:

# wget -q -O- ${ECS_CONTAINER_METADATA_URI_V4}/stats | jq .networks
{
  "eth0": {
    "rx_bytes": 361389142,
    "rx_packets": 46892,
    "rx_errors": 0,
    "rx_dropped": 0,
    "tx_bytes": 361363480,
    "tx_packets": 51774,
    "tx_errors": 0,
    "tx_dropped": 0
  }
}

That is ~361 MB transmitted and received, roughly a third of the ~1 GB actually sent in each direction, which is a substantial undercount.
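To put a number on the gap, the counters from the two outputs above can be compared directly. This is only a rough sketch: the helper name ratio_pct is made up for illustration, and it assumes the Fargate figures are a fair stand-in for the true traffic volume, since both tasks sent and received the same 1 GB.

```shell
# Hypothetical helper: print the first argument as a percentage of the second.
ratio_pct() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * a / b }'
}

# Figures copied from the EC2 (eth0) and Fargate (eth1) outputs above.
ratio_pct 361389142 1096752652   # rx_bytes, EC2 vs Fargate: prints 33.0
ratio_pct 361363480 1083080110   # tx_bytes, EC2 vs Fargate: prints 33.4
```

On these numbers the EC2 task reports about a third of what the Fargate task reports in each direction.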

The discrepancy shown by this synthetic test matches what we have observed from the same API endpoint for services we are running in ECS. We have been comparing Fargate and EC2 to determine which is better suited to our workloads. We rely on these container stats, as consumed by the Prometheus ECS exporter, to monitor our services, so the fact that the EC2 data is meaningfully incorrect makes that comparison difficult. Please fix the EC2 network stats.
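One possible way to narrow this down further would be to read the kernel's own interface counters from the same shell and compare them with what the stats endpoint returns. This was not part of the test above. The sketch below assumes that /proc/net/dev inside the container reflects the task's network namespace, and the helper name proc_net_bytes is hypothetical.

```shell
# Hypothetical helper: print "iface rx_bytes tx_bytes" for each interface,
# reading /proc/net/dev-formatted input from a file argument or stdin.
proc_net_bytes() {
  # Skip the two header lines, then separate the "iface:" prefix from the
  # first counter. After that, field 2 is rx bytes and field 10 is tx bytes.
  awk 'NR > 2 { sub(/:/, " "); print $1, $2, $10 }' "$@"
}

# Only attempt the read where the file exists (i.e. on Linux).
if [ -r /proc/net/dev ]; then
  proc_net_bytes /proc/net/dev
fi
```

If the eth0 line printed here shows roughly 1 GB while the stats endpoint shows ~361 MB, that would suggest the undercount is introduced when the stats are collected rather than in the kernel counters.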
