Initial commit #2

rakyll · 2021-09-13T17:19:27Z

This change introduces a Prometheus exporter to be run as a sidecar on ECS tasks. When run as part of an ECS task, it queries the metadata server to read ECS metrics such as CPU, memory, and network usage by container, and exposes a Prometheus metrics handler so that Prometheus can scrape them directly. Users still need to publish their custom application metrics from their own containers.

Follow-up changes will document the published container image and add instructions on how to add the exporter as a sidecar to ECS tasks.

Fixes prometheus-community/community#36.

roidelapluie

Thanks!

I have made remarks/asked questions about a few things.

@SuperQ: could you give this a review as well? Thanks!

README.md

ecscollector/collector.go

ecsmetadata/client.go

roidelapluie · 2021-09-13T22:59:34Z

Note that I have reviewed only some of the metric names, not all of them. I am just giving my opinion that we could have more labels (for CPU, for example) and bring the names closer to what we do in, e.g., the node_exporter. I am not sure whether cadvisor/kube-state-metrics could also be used as a base? @SuperQ

roidelapluie · 2021-09-13T23:00:44Z

I also noticed that a lot of the metrics here are gauges but might just be counters (like CPU usage, dropped traffic...)

ecscollector/collector.go

rakyll · 2021-09-14T05:16:52Z

PTAL, I also converted several to counters.

SuperQ

Looking good so far, comments and questions on the metrics.

I don't know enough about ECS, though. Is this running 1:1 with ECS containers? Do they run in their own cgroup within the ECS deployment (like K8s pods)?

ecscollector/collector.go

rakyll · 2021-09-14T18:11:04Z

Some of these questions baffled me as well. Let me work with someone who is working on the metadata server to find out. I'll update the PR once I have answers.

ecscollector/collector.go

SuperQ

One comment about the ecs_cpu_online metric. Since we have per-CPU metrics, we can derive this value from ecs_cpu_seconds_total in PromQL. For example:

The number of CPUs:

count without (cpu) (ecs_cpu_seconds_total)

The CPU utilization:

avg without (cpu) (rate(ecs_cpu_seconds_total[5m]))

It's actually easier to get the utilization with one metric than with two, which would look like this:

sum without (cpu) (rate(ecs_cpu_seconds_total[5m]))
/
ecs_cpu_online
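To make the equivalence concrete, here is a small Go sketch of the same arithmetic applied to a hypothetical set of per-CPU rates (the values and function names are made up for illustration; PromQL does this server-side). It shows that counting the per-CPU series gives the number of CPUs, and that averaging the per-CPU rates matches dividing their sum by a separately reported online-CPU count.

```go
package main

import "fmt"

// cpuCount mirrors `count without (cpu) (...)`: one series per CPU.
func cpuCount(rates map[string]float64) int {
	return len(rates)
}

// sumRates mirrors `sum without (cpu) (rate(...))`.
func sumRates(rates map[string]float64) float64 {
	var total float64
	for _, r := range rates {
		total += r
	}
	return total
}

// avgUtilization mirrors `avg without (cpu) (rate(...))`.
func avgUtilization(rates map[string]float64) float64 {
	if len(rates) == 0 {
		return 0
	}
	return sumRates(rates) / float64(len(rates))
}

// utilizationWithOnline mirrors the two-metric form: sum divided by a
// separately exported online-CPU count.
func utilizationWithOnline(rates map[string]float64, online float64) float64 {
	if online == 0 {
		return 0
	}
	return sumRates(rates) / online
}

func main() {
	// Hypothetical per-CPU rates, keyed by the value of the cpu label.
	rates := map[string]float64{"0": 0.25, "1": 0.75}
	fmt.Println("cpus:", cpuCount(rates))
	fmt.Println("avg utilization:", avgUtilization(rates))
	fmt.Println("sum / online:", utilizationWithOnline(rates, 2))
}
```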

rakyll · 2021-09-19T21:15:20Z

> I don't know enough about ECS, though. Is this running 1:1 with ECS containers? Do they run in their own cgroup within the ECS deployment (like K8s pods)?

ECS has a few layers of abstraction. Users can create clusters, services, and tasks. A task is similar to a Kubernetes pod, where multiple containers can be grouped to run together. The metadata server is accessible from within tasks, hence the exporter will be deployed as a sidecar. The metadata server can report stats for each container running in a task, so we make a single request to /task/stats and retrieve them all.

rakyll · 2021-09-19T21:40:41Z

Apparently, CPU system usage represented all system usage on the node the containers are running on. It is not very useful to ECS users, so I am removing it.

The kernel + user mode usage was not adding up to the total usage because other modes are not reported; we can calculate them as total_cpu - (user + kernel). I removed the reporting of user and kernel space usage for now, and I'll do some research before taking any action here.
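The subtraction mentioned above could be sketched as follows. This is only an illustration of the arithmetic, not code from the PR; the function name is hypothetical, and the guard against underflow is an added assumption for the case where the reported parts exceed the total.

```go
package main

import "fmt"

// otherModeUsage returns the CPU time not attributed to user or kernel mode,
// i.e. total - (user + kernel). All inputs must share the same unit.
// It returns 0 rather than wrapping around if the parts exceed the total.
func otherModeUsage(total, user, kernel uint64) uint64 {
	accounted := user + kernel
	if accounted > total {
		return 0
	}
	return total - accounted
}

func main() {
	// Made-up sample values in the same (arbitrary) unit.
	fmt.Println(otherModeUsage(1000, 600, 300))
}
```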

CPU total seconds are now broken down by CPU. I removed ecs_cpu_online because it can now be derived from ecs_cpu_seconds_total.
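A minimal sketch of what a per-CPU breakdown might look like, assuming the underlying stats report per-CPU usage as cumulative nanoseconds (as Docker-style stats do) and that each CPU becomes its own series with a `cpu` label. The `container` label and the helper names are hypothetical and not taken from the PR.

```go
package main

import "fmt"

// nanosPerSecond converts cumulative nanosecond counters to seconds.
const nanosPerSecond = 1e9

// perCPUSeconds converts cumulative per-CPU usage from nanoseconds to seconds.
func perCPUSeconds(usageNanos []uint64) []float64 {
	seconds := make([]float64, len(usageNanos))
	for i, ns := range usageNanos {
		seconds[i] = float64(ns) / nanosPerSecond
	}
	return seconds
}

// cpuSecondsLines renders one exposition-format sample per CPU, using the
// slice index as the cpu label value.
func cpuSecondsLines(container string, usageNanos []uint64) []string {
	lines := make([]string, 0, len(usageNanos))
	for cpu, seconds := range perCPUSeconds(usageNanos) {
		lines = append(lines, fmt.Sprintf(
			"ecs_cpu_seconds_total{container=%q,cpu=\"%d\"} %g", container, cpu, seconds))
	}
	return lines
}

func main() {
	// Made-up cumulative usage for two CPUs.
	for _, line := range cpuSecondsLines("app", []uint64{1500000000, 250000000}) {
		fmt.Println(line)
	}
}
```

Because each sample is a monotonically increasing total, `rate()` can be applied per CPU, and the number of series gives the CPU count.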

Cumulative metrics are named as requested and are counters.

ecscollector/collector.go

SuperQ

Awesome work. Thanks for researching all the internal specifics of the ECS API data.

LGTM

SuperQ · 2021-09-20T15:01:02Z

@rakyll One last request, would you mind squashing the commits to a commit history that makes sense?

This change introduces a Prometheus exporter to publish ECS metrics in the Prometheus exposition format.

The ecs_exporter queries the ECS metadata server to read ECS task and container stats. This is why the exporter needs to run as a sidecar container in an ECS task, or be included as a binary in a deployed container, so that it can fetch from the metadata server. The exporter publishes metrics from the container runtime related to CPU usage, networking, and more. Users still need to publish their custom application metrics from their own containers, if they have any.

Currently, these metrics are available from CloudWatch and can be ingested into Prometheus by using the cloudwatch_exporter, but various users want to be able to export them directly to Prometheus to avoid delays in data collection.

In the future, the exporter will report more CPU, memory, and I/O metrics.

Fixes prometheus-community/community#36.

Signed-off-by: JBD <[email protected]>

rakyll · 2021-09-20T17:26:18Z

Thanks so much for the review. @SuperQ, commits are now squashed into a single commit.

roidelapluie · 2021-09-20T17:39:29Z

Thanks!

rakyll requested a review from roidelapluie September 13, 2021 17:19

rakyll force-pushed the initial branch from 7b8c5b6 to 6bb86ac September 13, 2021 17:21

roidelapluie reviewed Sep 13, 2021

rakyll force-pushed the initial branch from d31e098 to bfeb7f7 September 14, 2021 05:15

rakyll requested a review from roidelapluie September 14, 2021 05:16

rakyll force-pushed the initial branch 2 times, most recently from 8118a88 to 15f15e2 September 14, 2021 05:17

SuperQ reviewed Sep 14, 2021

rakyll mentioned this pull request Sep 14, 2021

Discovery for ECS prometheus/prometheus#9310

Open

alvinlin123 reviewed Sep 15, 2021

SuperQ reviewed Sep 19, 2021

SuperQ reviewed Sep 19, 2021

rakyll force-pushed the initial branch from aea4c10 to 0ae04f4 September 19, 2021 21:11

rakyll force-pushed the initial branch from 5093f62 to fdd78cc September 19, 2021 21:31

rakyll force-pushed the initial branch from 7d76fb7 to 5efd7eb September 19, 2021 21:43

rakyll requested a review from SuperQ September 19, 2021 21:44

SuperQ reviewed Sep 20, 2021

SuperQ approved these changes Sep 20, 2021

roidelapluie approved these changes Sep 20, 2021

rakyll force-pushed the initial branch from 5efd7eb to b1888f4 September 20, 2021 17:25

roidelapluie merged commit 8e0df6f into prometheus-community:main Sep 20, 2021
