Skip to content

Conversation

@rakyll
Copy link
Contributor

@rakyll rakyll commented Sep 13, 2021

This change introduces a Prometheus exporter to be run a sidecar on ECS tasks. When run as a part of the ECS tasks, it fetches the metadata server to read ECS metrics such as CPU, memory and network usage by container and publishes a Prometheus metrics handler for Prometheus to scrape them directly. Users still need to publish their custom application metrics from their own containers.

The following changes will document the published container image and add instructions on how to add the exporter as a sidecar to ECS tasks.

Fixes prometheus-community/community#36.

Copy link

@roidelapluie roidelapluie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I have made remarks/asked questions about a few things.

@SuperQ: could you give this a review as well? Thanks!

@roidelapluie
Copy link

Note that I have given a few reviews for the metrics name, I did not review all of them, just giving you my opinions that we could have more labels (like for CPU) and getting names closer to what we do with e.g. the node_exporter. I am not sure if cadvisor/kube-stats-metrics could also be used as a base? @SuperQ

@roidelapluie
Copy link

I also notice that a lot of metrics here are gauges but might just be counters (like cpu usage, dropped trafic...)

@rakyll
Copy link
Contributor Author

rakyll commented Sep 14, 2021

PTAL, I also converted several to counters.

@rakyll rakyll force-pushed the initial branch 2 times, most recently from 8118a88 to 15f15e2 Compare September 14, 2021 05:17
Copy link
Contributor

@SuperQ SuperQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good so far, comments and questions on the metrics.

I don't know enough about ECS tho. Is this is running 1:1 with ECS containers? Do they run in their own cgroup within the ECS deployment (Like K8s pods)?

@rakyll
Copy link
Contributor Author

rakyll commented Sep 14, 2021

Some of these questions baffled me as well. Let me work with someone who is working on the metadata server to find out. I'll update the PR once I have answers.

Copy link
Contributor

@SuperQ SuperQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment about the ecs_cpu_online metric. Since we have per-CPU metrics, we can derive this value from ecs_cpu_seconds_total in PromQL. For example

The number of CPUs:

count without (cpu) (ecs_cpu_seconds_total)

The CPU utilization:

avg without (cpu) (rate(ecs_cpu_seconds_total[5m]))

It's actually easier to get the utilization with one metric than two:

sum without (cpu) (rate(ecs_cpu_seconds_total[5m]))
/
ecs_cpu_online

@rakyll
Copy link
Contributor Author

rakyll commented Sep 19, 2021

I don't know enough about ECS tho. Is this is running 1:1 with ECS containers? Do they run in their own cgroup within the ECS deployment (Like K8s pods)?

ECS has a few layers of abstractions. Users can create clusters, services and tasks. Task is similar to Kubernetes pods where multiple containers can be grouped to run together. The metadata server is accessible in tasks hence the exporter will be deployed as a sidecar. Metadata server can report stats for each container running in a task, hence we make a single request to /task/stats and retrieve them all.

@rakyll
Copy link
Contributor Author

rakyll commented Sep 19, 2021

Apparently, CPU system usage was representing all system usage from the node the containers are running. It is not very useful to ECS users and removing it.

The kernel + user mode was not adding up to the total usage because other modes are not reported, we can calculate them by total_cpu - (user + kernel). I removed the reporting of user and kernel space usage for now and I'll do some research before taking any action here.

CPU total seconds are now broken by CPU, removed ecs_cpu_online because it's now possible to query it from ecs_cpu_seconds_total.

Cumulative metrics are named as requested and are counters.

Copy link
Contributor

@SuperQ SuperQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work. Thanks for researching all the internal specifics of the ECS API data.

LGTM

@SuperQ
Copy link
Contributor

SuperQ commented Sep 20, 2021

@rakyll One last request, would you mind squashing the commits to a commit history that makes sense?

This change introduces a Prometheus exporter to publish
ECS metrics in the Prometheus exposition format.

The ecs_exporter reads the ECS metadata server to read
ECS task and container stats. This is why
the exporter needs to run as a sidecar container
in an ECS task, or it should be included as a binary in
a container deployed to be able to fetch from the
metadata server.

The exporter publishes metrics from the container runtime
related to CPU usage, networking and more.
Users still need to publish their custom application
metrics from their own containers if they have any.

Currently, these metrics are available from CloudWatch and
can be ingested into Prometheus by using the cloudwatch_exporter,
but there are various users who want to be able to directly
export them to Prometheus to avoid delay in data collection.

In the future, the exporter will report more CPU, memory,
and I/O metrics.

Fixes prometheus-community/community#36.

Signed-off-by: JBD <[email protected]>
@rakyll
Copy link
Contributor Author

rakyll commented Sep 20, 2021

Thanks so much for the review. @SuperQ, commits are now squashed into a single commit.

@roidelapluie roidelapluie merged commit 8e0df6f into prometheus-community:main Sep 20, 2021
@roidelapluie
Copy link

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

An ECS exporter

4 participants