Initial commit #2
Conversation
roidelapluie
left a comment
Thanks!
I have made remarks/asked questions about a few things.
@SuperQ: could you give this a review as well? Thanks!
Note that I have given a few reviews of the metric names; I did not review all of them. I am just giving you my opinion that we could have more labels (like for CPU) and names closer to what we do in e.g. the node_exporter. I am not sure if cadvisor/kube-state-metrics could also be used as a base? @SuperQ
I also notice that a lot of metrics here are gauges but might really be counters (like CPU usage, dropped traffic, ...).
PTAL, I also converted several to counters.
SuperQ
left a comment
Looking good so far, comments and questions on the metrics.
I don't know enough about ECS, though. Is this running 1:1 with ECS containers? Do they run in their own cgroup within the ECS deployment (like K8s pods)?
Some of these questions baffled me as well. Let me work with someone who is working on the metadata server to find out. I'll update the PR once I have answers.
SuperQ
left a comment
One comment about the ecs_cpu_online metric. Since we have per-CPU metrics, we can derive this value from ecs_cpu_seconds_total in PromQL. For example, the number of CPUs:

```
count without (cpu) (ecs_cpu_seconds_total)
```

The CPU utilization:

```
avg without (cpu) (rate(ecs_cpu_seconds_total[5m]))
```

It's actually easier to get the utilization with one metric than with two:

```
sum without (cpu) (rate(ecs_cpu_seconds_total[5m]))
/
ecs_cpu_online
```
ECS has a few layers of abstraction. Users can create clusters, services and tasks. A task is similar to a Kubernetes pod, where multiple containers can be grouped to run together. The metadata server is accessible from within tasks, hence the exporter will be deployed as a sidecar. The metadata server can report stats for each container running in a task, hence we make a single request to
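As a rough sketch of what the sidecar does: ECS injects the metadata base URI via the documented `ECS_CONTAINER_METADATA_URI_V4` environment variable, and the per-task stats live under its `/task/stats` path. The helper name `statsURL` below is illustrative, not the exporter's actual code.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// statsURL builds the per-task stats endpoint from the metadata base URI.
// ECS sets the base URI in the ECS_CONTAINER_METADATA_URI_V4 env var.
func statsURL(base string) string {
	return base + "/task/stats"
}

func main() {
	base := os.Getenv("ECS_CONTAINER_METADATA_URI_V4")
	if base == "" {
		fmt.Println("not running inside an ECS task; no metadata URI set")
		return
	}
	resp, err := http.Get(statsURL(base))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	// The response is a JSON document keyed by container ID, one
	// stats object per container in the task.
	fmt.Printf("%s\n", body)
}
```

Because the endpoint returns stats for every container in the task at once, one scrape of the exporter translates to one metadata request.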
Apparently, CPU system usage was representing all system usage of the node the containers are running on. It is not very useful to ECS users, so I am removing it. The kernel + user mode usage was not adding up to the total usage because other modes are not reported; we can calculate them as total_cpu - (user + kernel). I removed the reporting of user and kernel space usage for now and I'll do some research before taking any action here. CPU total seconds are now broken down by CPU, and I removed ecs_cpu_online because it's now possible to derive it from ecs_cpu_seconds_total. Cumulative metrics are named as requested and are counters.
SuperQ
left a comment
Awesome work. Thanks for researching all the internal specifics of the ECS API data.
LGTM
@rakyll One last request, would you mind squashing the commits into a commit history that makes sense?
This change introduces a Prometheus exporter to publish ECS metrics in the Prometheus exposition format. The ecs_exporter queries the ECS metadata server to read ECS task and container stats. This is why the exporter needs to run as a sidecar container in an ECS task, or be included as a binary in a deployed container, to be able to fetch from the metadata server. The exporter publishes metrics from the container runtime related to CPU usage, networking and more. Users still need to publish their custom application metrics from their own containers if they have any. Currently, these metrics are available from CloudWatch and can be ingested into Prometheus by using the cloudwatch_exporter, but various users want to export them to Prometheus directly to avoid delays in data collection. In the future, the exporter will report more CPU, memory, and I/O metrics. Fixes prometheus-community/community#36. Signed-off-by: JBD <[email protected]>
Thanks so much for the review. @SuperQ, commits are now squashed into a single commit.
Thanks!
This change introduces a Prometheus exporter to be run as a sidecar on ECS tasks. When run as part of an ECS task, it queries the metadata server to read ECS metrics such as CPU, memory and network usage per container, and publishes a Prometheus metrics handler for Prometheus to scrape them directly. Users still need to publish their custom application metrics from their own containers.
The following changes will document the published container image and add instructions on how to add the exporter as a sidecar to ECS tasks.
Fixes prometheus-community/community#36.