Description
Hello, I would like to propose updating container_memory_usage_bytes from reading memory.usage_in_bytes / memory.current to calculating it from Cache+RSS as found in the memory.stat file. The reason is that memory.usage_in_bytes can show a discrepancy from the actual usage, and that discrepancy is exacerbated on multi-core systems.
Background
Metrics
My original investigation started when I was trying to debug the memory usage of an unrelated application running in a Kubernetes pod and was watching the container_memory_working_set_bytes metric, as suggested by the Kubernetes docs. I wanted to find the source of this value, which led me to cadvisor's GetStats, which told me it was container_memory_usage_bytes minus inactive file pages. My question then became: where does container_memory_usage_bytes come from? The cadvisor code calls out to runc's GetStats, which answered my question: the information is gathered from files in the /sys/fs/cgroup/memory directory, specifically memory.stat and either memory.usage_in_bytes or memory.current depending on the cgroup version.
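To make that chain concrete, here is a minimal sketch (not cadvisor's or runc's actual code) that reads the cgroup v1 files directly; the paths, the parsing, and the total_inactive_file subtraction are my own assumptions based on the reading above:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

const cgroupDir = "/sys/fs/cgroup/memory" // cgroup v1; v2 would use memory.current and different stat keys

// readUint reads a single-value cgroup file such as memory.usage_in_bytes.
func readUint(path string) uint64 {
	b, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	v, _ := strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
	return v
}

// readStat parses the key/value pairs in memory.stat into a map.
func readStat(path string) map[string]uint64 {
	stats := map[string]uint64{}
	b, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	for _, line := range strings.Split(strings.TrimSpace(string(b)), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		v, _ := strconv.ParseUint(fields[1], 10, 64)
		stats[fields[0]] = v
	}
	return stats
}

func main() {
	usage := readUint(cgroupDir + "/memory.usage_in_bytes")
	stat := readStat(cgroupDir + "/memory.stat")

	// container_memory_working_set_bytes is derived as usage minus inactive file
	// pages (clamped at zero, since the fuzzy usage value can drift).
	workingSet := uint64(0)
	if inactive := stat["total_inactive_file"]; usage > inactive {
		workingSet = usage - inactive
	}

	fmt.Printf("usage_in_bytes=%d working_set=%d cache+rss=%d\n",
		usage, workingSet, stat["total_cache"]+stat["total_rss"])
}
```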
Linux Kernel
Throughout this research I ran into a few other cadvisor memory issues, like #3197 and #3081, which discuss these values and what we should be subtracting. This made me assume that usage_in_bytes could be calculated from the stat file, so I became curious about the calculation. That led me to section 5.5 of the Kernel Docs, which says the following:
5.5 usage_in_bytes
For efficiency, as other kernel components, memory cgroup uses some optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the method and doesn’t show ‘exact’ value of memory (and swap) usage, it’s a fuzz value for efficient access. (Of course, when necessary, it’s synchronized.) If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP) value in memory.stat(see 5.2).
memory.usage_in_bytes isn't a calculation, it's a fuzz value! I was surprised to see that the kernel doesn't keep an exact value for memory usage, especially when RSS+CACHE is available to it in memory.stat, so I kept digging. Looking back at the commit that introduced that documentation, a111c966, we see the following:
These changes improved performance of memory cgroup very much, but made res_counter->usage usually have a bigger value than the actual value of memory usage. So, *.usage_in_bytes, which show res_counter->usage, are not desirable for precise values of memory(and swap) usage anymore.
Instead of removing these files completely(because we cannot know res_counter->usage without them), this patch updates the meaning of those files.
That tells us that usage_in_bytes is not precise and can be an unreliable metric for memory measurement. However, res_counter->usage should still be pretty close to the actual usage, right? From the kernel email discussion regarding the above change there is at least the guarantee that rss+cache <= usage_in_bytes. However, the difference between the two grows with the size of each per-CPU bulk pre-allocated charge. In other words, this difference can grow with the number of CPUs!
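As a back-of-the-envelope illustration of how that scales, here is a tiny sketch; the per-CPU batch size is a kernel implementation detail (MEMCG_CHARGE_BATCH in recent sources, if I'm reading it correctly), so the value below is only an assumed order of magnitude, not a guarantee:

```go
package main

import "fmt"

func main() {
	const (
		pageSize   = 4096 // bytes per page, typical on x86_64
		batchPages = 64   // assumed per-CPU pre-charged stock, in pages
	)
	// Worst-case slack between rss+cache and usage_in_bytes grows with CPU count.
	for _, cpus := range []int{4, 16, 64} {
		slack := cpus * batchPages * pageSize
		fmt.Printf("%2d CPUs -> up to ~%d KiB of pre-charged memory counted in usage_in_bytes\n",
			cpus, slack/1024)
	}
}
```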
At this point you might be wondering what a 12-year-old commit and email thread about a removed res_counter API have to do with the current Linux kernel. Well, lockless page counters are the replacement for resource counters, and that replacement tried its best to keep the usage_in_bytes and stat interfaces unchanged. In the most recent mm/memcontrol.c we find two functions, mem_cgroup_usage and memcg_stat_show, which back usage_in_bytes and stat respectively.
In the first function, mem_cgroup_usage, we can see that for non-root cgroups the return value is the current page counter value, read via page_counter_read(&memcg->memory). This page_counter_read is effectively the same value as res_counter->usage but without the lock requirement. On the flip side, memcg_stat_show pulls its stats from the memory controller's vmstats struct, which is synced either every 2 seconds or when a large enough stat change occurs, per the comments in memcontrol.c.
Throughout the above discussion I've been focusing on cgroup v1's usage_in_bytes. It turns out that cgroup v2's memory.current uses memory_current_read, which pulls from the same page_counter_read as v1, so it is affected in the same way.
Example
To illustrate this with a reproducible application, I've captured memory.stat and memory.usage_in_bytes from an nginx pod created from the example in the kube docs, copied below:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80
root@nginx:/sys/fs/cgroup/memory# cat memory.{stat,usage_in_bytes}
<redacted for readability can provide full output>
total_cache 270336
total_rss 1826816
<redacted for readability can provide full output>
4562944
As you can see, total_cache + total_rss equals 2097152, less than half of the reported usage_in_bytes of 4562944! Granted, the gap here is a measly ~2.4 MB, but that difference can add up. In the original cache- and memory-intensive application I was debugging, I noticed the following in my memory files:
bash-4.4$ cat memory.{stat,usage_in_bytes}
<redacted for readability can provide full output>
total_cache 2286772224
total_rss 565260288
<redacted for readability can provide full output>
3483992064
In this example CACHE+RSS equals 2852032512, which is 602.68 MiB less than the reported usage_in_bytes! That is only about a 20% overestimation, a smaller relative error than the nginx example (where usage_in_bytes was more than double CACHE+RSS), but the absolute impact is far more visible.
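For anyone who wants to double-check the arithmetic, here is a small throwaway program using the exact numbers quoted above:

```go
package main

import "fmt"

func main() {
	// Values copied from the two memory.stat / memory.usage_in_bytes dumps above.
	cases := []struct {
		name              string
		cache, rss, usage uint64
	}{
		{"nginx pod", 270336, 1826816, 4562944},
		{"cache-heavy app", 2286772224, 565260288, 3483992064},
	}
	for _, c := range cases {
		sum := c.cache + c.rss
		gap := c.usage - sum
		fmt.Printf("%s: cache+rss=%d usage=%d gap=%.2f MiB (usage is %.0f%% of cache+rss)\n",
			c.name, sum, c.usage, float64(gap)/(1<<20), 100*float64(c.usage)/float64(sum))
	}
}
```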
Proposal
I'd like to update container_memory_usage_bytes to be calculated as CACHE+RSS instead of read from usage_in_bytes. This is what the kernel docs quoted above suggest, and it is effectively what the kernel itself does for usage_in_bytes on the root cgroup, i.e. on the host/node. I also think this would benefit users, since container_memory_usage_bytes would then be easily derivable from, and consistent with, the other memory statistics.
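To make the proposal concrete, here is a rough sketch of the calculation I have in mind; the MemStats struct and its field names are illustrative stand-ins, not cadvisor's or runc's real types:

```go
package main

import "fmt"

// MemStats is a stand-in for whatever structure holds the parsed memory.stat
// counters (cgroup v1 names shown; v2 would use "file" and "anon").
type MemStats struct {
	TotalCache uint64 // total_cache from memory.stat
	TotalRSS   uint64 // total_rss from memory.stat
	TotalSwap  uint64 // total_swap, optional per the kernel docs' RSS+CACHE(+SWAP)
}

// usageBytes computes the proposed container_memory_usage_bytes value from
// memory.stat counters instead of the fuzzy memory.usage_in_bytes counter.
func usageBytes(s MemStats, includeSwap bool) uint64 {
	usage := s.TotalCache + s.TotalRSS
	if includeSwap {
		usage += s.TotalSwap
	}
	return usage
}

func main() {
	s := MemStats{TotalCache: 270336, TotalRSS: 1826816}
	fmt.Println(usageBytes(s, false)) // 2097152 for the nginx example above
}
```

Whether swap should be included (as the kernel docs' RSS+CACHE(+SWAP) wording suggests) is an open question I'd defer to the maintainers.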
Most of my understanding of this issue comes from a 12-year-old email discussion, so I'm still familiarizing myself with the current kernel. Any corrections or historical context for the current implementation would be greatly appreciated. I'd also love to know if this has been discussed before. I'd be happy to work on implementing this change if the proposal is accepted.