
Proposal to update container_memory_usage_bytes to Cache+RSS #3286

@HonakerM

Description


Hello, I would like to propose updating container_memory_usage_bytes so that, instead of reading memory.usage_in_bytes/memory.current, it is calculated as Cache+RSS from the memory.stat file. The reason is that memory.usage_in_bytes can diverge from actual usage, and the discrepancy is exacerbated on multi-core systems.

Background

Metrics

My original investigation started when I was trying to debug the memory usage of an unrelated application running in a Kubernetes pod, watching the container_memory_working_set_bytes metric as suggested by the Kubernetes docs. I wanted to find the source of this value, which led me to cadvisor's GetStats and told me it was container_memory_usage_bytes minus the inactive file pages. My question then became: where does container_memory_usage_bytes come from? The cadvisor code calls out to runc's GetStats, which answered my question: the information is gathered from files in the /sys/fs/cgroup/memory directory, specifically memory.stat and either memory.usage_in_bytes or memory.current depending on the cgroup version.
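To make that chain concrete, here is a minimal sketch in Go of the relationship as I understand it (my own illustration, not the actual cadvisor/runc code; the paths and stat keys assume cgroup v1): usage is read straight from memory.usage_in_bytes, and the working set is that value minus total_inactive_file from memory.stat.

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readUint reads a single integer counter, e.g. memory.usage_in_bytes.
func readUint(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

// readStat parses memory.stat into a key -> value map.
func readStat(path string) (map[string]uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	stats := make(map[string]uint64)
	for _, line := range strings.Split(strings.TrimSpace(string(b)), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			stats[fields[0]] = v
		}
	}
	return stats, nil
}

func main() {
	const cg = "/sys/fs/cgroup/memory" // cgroup v1 memory controller

	usage, _ := readUint(cg + "/memory.usage_in_bytes") // -> container_memory_usage_bytes
	stat, _ := readStat(cg + "/memory.stat")            // error handling elided for brevity

	// container_memory_working_set_bytes = usage - inactive file pages
	workingSet := usage
	if inactive := stat["total_inactive_file"]; inactive < workingSet {
		workingSet -= inactive
	}
	fmt.Printf("usage=%d working_set=%d\n", usage, workingSet)
}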

Linux Kernel

Throughout this research I ran into a few other cadvisor memory issues, like #3197 and #3081, which discuss these values and what we should be subtracting. This made me assume that usage_in_bytes could be calculated from the stat file, so I became curious about the calculation. That led me to section 5.5 of the kernel docs, which says the following:

5.5 usage_in_bytes

For efficiency, as other kernel components, memory cgroup uses some optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the method and doesn’t show ‘exact’ value of memory (and swap) usage, it’s a fuzz value for efficient access. (Of course, when necessary, it’s synchronized.) If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP) value in memory.stat(see 5.2).

memory.usage_in_bytes isn't a calculation, it's a fuzz value! I was surprised to see that the kernel doesn't report an exact value for memory usage, especially when RSS+CACHE is available to it in memory.stat, so I kept digging. Looking back at the commit that introduced that documentation, a111c966, we see the following:

These changes improved performance of memory cgroup very much, but made res_counter->usage usually have a bigger value than the actual value of memory usage. So, *.usage_in_bytes, which show res_counter->usage, are not desirable for precise values of memory(and swap) usage anymore.

Instead of removing these files completely(because we cannot know res_counter->usage without them), this patch updates the meaning of those files.

That tells us that usage_in_bytes is not precise and can be an unreliable metric for memory measurement. However, res_counter->usage should still be pretty close to actual usage, right? The kernel email discussion regarding the above change at least guarantees that rss+cache <= usage_in_bytes. However, the difference between the two grows with the size of each per-CPU bulk pre-allocated charge. In other words, the difference can grow with the number of CPUs!
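As a rough back-of-the-envelope illustration of that scaling (assuming 4 KiB pages and a per-CPU charge batch of 64 pages, which is what MEMCG_CHARGE_BATCH appears to be in recent kernels; the real drift can have other contributors, so treat this only as the bound for the per-CPU stock itself):

package main

import "fmt"

func main() {
	const (
		pageSize    = 4096 // bytes; typical x86-64 page size
		chargeBatch = 64   // pages pre-charged per CPU (assumed MEMCG_CHARGE_BATCH)
	)
	// Upper bound on how far usage_in_bytes can drift above RSS+CACHE
	// purely from per-CPU pre-charged stock.
	for _, cpus := range []int{1, 8, 64, 256} {
		slack := cpus * chargeBatch * pageSize
		fmt.Printf("%3d CPUs: up to %8d bytes (%5.1f MiB) of slack\n",
			cpus, slack, float64(slack)/(1<<20))
	}
}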

At this point you might be wondering what a 12-year-old commit and email thread about the removed res_counter API have to do with the current Linux kernel. Well, lockless page counters are the replacement for resource counters, and that replacement did its best to keep the behavior of usage_in_bytes and stat unchanged. In the most recent mm/memcontrol.c we see two functions, mem_cgroup_usage and memcg_stat_show, which back usage_in_bytes and stat respectively.

In the first function, mem_cgroup_usage, we can see that for non-root cgroups the return value is the current page counter value, read with page_counter_read(&memcg->memory). This page_counter_read is effectively the same value as res_counter->usage, just without the lock requirement. On the flip side, memcg_stat_show pulls its stats from the memory controller's vmstats struct, which is synced either every 2 seconds or when a large enough stat change occurs, per the comments in memcontrol.c.

Throughout the above discussion I've been focusing on the cgroup v1 usage_in_bytes. It turns out that cgroup v2's .current uses memory_current_read, which pulls from the same page_counter_read as v1, so it is similarly affected.
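For completeness, a small sketch of how one could pick the right counter file depending on the cgroup version from inside a container (detecting v2 via the presence of cgroup.controllers is a common heuristic; this is my own illustration, not how runc necessarily does it):

package main

import (
	"fmt"
	"os"
)

// usageFilePath returns the raw usage counter path as seen from inside
// the container's cgroup namespace: memory.current on the cgroup v2
// unified hierarchy, memory.usage_in_bytes on the v1 memory controller.
func usageFilePath() string {
	if _, err := os.Stat("/sys/fs/cgroup/cgroup.controllers"); err == nil {
		return "/sys/fs/cgroup/memory.current" // cgroup v2
	}
	return "/sys/fs/cgroup/memory/memory.usage_in_bytes" // cgroup v1
}

func main() {
	fmt.Println("usage counter:", usageFilePath())
}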

Example

To illustrate this with a reproducible application, I've copied the memory.stat and memory.usage_in_bytes from an nginx pod as described in the Kubernetes docs, shown below:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80
root@nginx:/sys/fs/cgroup/memory# cat memory.{stat,usage_in_bytes}
<redacted for readability can provide full output>
total_cache 270336
total_rss 1826816
<redacted for readability can provide full output>
4562944

As you can see, total_cache + total_rss equals 2097152, which is less than half of the reported usage_in_bytes of 4562944! Granted, the difference in this case is a measly ~2 MB, but it can add up. In the original cache- and memory-intensive application I was debugging, I noticed the following in my memory files:

bash-4.4$ cat memory.{stat,usage_in_bytes}
<redacted for readability can provide full output>
total_cache 2286772224
total_rss 565260288
<redacted for readability can provide full output>
3483992064

In this example CACHE+RSS equals 2852032512, which is 602.68 MiB less than the reported usage_in_bytes! While 602 MiB is only a roughly 20% overestimation, a smaller relative gap than nginx's ~50%, the absolute impact is far more visible.

Proposal

I'd like to update container_memory_usage_bytes to be calculated as CACHE+RSS instead of read from usage_in_bytes. This is what's suggested in the kernel docs quoted above, and it's also effectively what the kernel does for usage_in_bytes of the root cgroup, i.e. on the host/node. I also think this would benefit users, since container_memory_usage_bytes would then be easily calculable from, and understandable in terms of, the other statistics.
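A minimal sketch of what the proposed calculation could look like (my own illustration, not a patch against the actual cadvisor/runc code; the keys assume cgroup v1's memory.stat, and both whether swap should be added and what the v2 equivalents, presumably anon and file, should be are open questions):

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// usageFromStat computes usage as CACHE+RSS from memory.stat instead of
// reading the fuzzed memory.usage_in_bytes counter.
func usageFromStat(statPath string) (uint64, error) {
	b, err := os.ReadFile(statPath)
	if err != nil {
		return 0, err
	}
	var cache, rss uint64
	for _, line := range strings.Split(strings.TrimSpace(string(b)), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		v, _ := strconv.ParseUint(fields[1], 10, 64)
		switch fields[0] {
		case "total_cache": // page cache charged to this cgroup (v1)
			cache = v
		case "total_rss": // anonymous and swap cache memory (v1)
			rss = v
		}
	}
	return cache + rss, nil
}

func main() {
	usage, err := usageFromStat("/sys/fs/cgroup/memory/memory.stat")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("container_memory_usage_bytes (proposed):", usage)
}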

Most of my understanding of this issue comes from a 12-year-old email discussion, so I'm still familiarizing myself with the current kernel. Any corrections or historical context for the current implementation would be greatly appreciated. I'd also love to know whether this has been discussed before. I'd be happy to work on implementing this change if the proposal is accepted.
