
Proposal to update container_memory_usage_bytes to Cache+RSS #3286

@HonakerM

Description


Hello, I would like to propose updating container_memory_usage_bytes so that, instead of reading memory.usage_in_bytes/memory.current, it is calculated as Cache+RSS from the memory.stat file. The reason is that memory.usage_in_bytes can diverge from actual usage, and the discrepancy is exacerbated on multi-core systems.

Background

Metrics

My original investigation started when I was trying to debug the memory usage of an unrelated application running in a Kubernetes pod, watching the container_memory_working_set_bytes metric as suggested by the Kubernetes docs. I wanted to find the source of this value, which led me to cadvisor's GetStats and told me it was container_memory_usage_bytes minus the inactive file pages. My question then became: where does container_memory_usage_bytes come from? The cadvisor code calls out to runc's GetStats, which answered my question: the information is gathered from files in the /sys/fs/cgroup/memory directory, specifically memory.stat and either memory.usage_in_bytes or memory.current depending on the cgroup version.
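To make that chain concrete, here is a minimal sketch in Go of the relationship as I understand it (my own illustration, not the actual cadvisor/runc code; the paths and stat keys assume cgroup v1): usage is read straight from memory.usage_in_bytes, and the working set is that value minus total_inactive_file from memory.stat.

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readUint reads a single integer counter, e.g. memory.usage_in_bytes.
func readUint(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

// readStat parses memory.stat into a key -> value map.
func readStat(path string) (map[string]uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	stats := make(map[string]uint64)
	for _, line := range strings.Split(strings.TrimSpace(string(b)), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			stats[fields[0]] = v
		}
	}
	return stats, nil
}

func main() {
	const cg = "/sys/fs/cgroup/memory" // cgroup v1 memory controller

	usage, _ := readUint(cg + "/memory.usage_in_bytes") // -> container_memory_usage_bytes
	stat, _ := readStat(cg + "/memory.stat")            // error handling elided for brevity

	// container_memory_working_set_bytes = usage - inactive file pages
	workingSet := usage
	if inactive := stat["total_inactive_file"]; inactive < workingSet {
		workingSet -= inactive
	}
	fmt.Printf("usage=%d working_set=%d\n", usage, workingSet)
}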

Linux Kernel

Throughout this research I ran into a few other cadvisor memory issues, like #3197 and #3081, which discuss these values and what we should be subtracting. This made me assume that usage_in_bytes could be calculated from the stat file, so I became curious about the calculation. That led me to section 5.5 of the kernel docs, which says the following:

5.5 usage_in_bytes

For efficiency, as other kernel components, memory cgroup uses some optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the method and doesn’t show ‘exact’ value of memory (and swap) usage, it’s a fuzz value for efficient access. (Of course, when necessary, it’s synchronized.) If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP) value in memory.stat(see 5.2).

memory.usage_in_bytes isn't a calculation, it's a fuzz value! I was surprised to see that the kernel doesn't report an exact value for memory usage, especially when RSS+CACHE is available to it in memory.stat, so I kept digging. Looking back at the commit that introduced that documentation, a111c966, we see the following:

These changes improved performance of memory cgroup very much, but made res_counter->usage usually have a bigger value than the actual value of memory usage. So, *.usage_in_bytes, which show res_counter->usage, are not desirable for precise values of memory(and swap) usage anymore.

Instead of removing these files completely(because we cannot know res_counter->usage without them), this patch updates the meaning of those files.

That tells us that usage_in_bytes is not precise and can be an unreliable metric for memory measurement. However, res_counter->usage should still be pretty close to actual usage, right? The kernel email discussion regarding the above change at least guarantees that rss+cache <= usage_in_bytes. However, the difference between the two grows with the size of each per-CPU bulk pre-allocated charge. In other words, the difference can grow with the number of CPUs!
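As a rough back-of-the-envelope illustration of that scaling (assuming 4 KiB pages and a per-CPU charge batch of 64 pages, which is what MEMCG_CHARGE_BATCH appears to be in recent kernels; the real drift can have other contributors, so treat this only as the bound for the per-CPU stock itself):

package main

import "fmt"

func main() {
	const (
		pageSize    = 4096 // bytes; typical x86-64 page size
		chargeBatch = 64   // pages pre-charged per CPU (assumed MEMCG_CHARGE_BATCH)
	)
	// Upper bound on how far usage_in_bytes can drift above RSS+CACHE
	// purely from per-CPU pre-charged stock.
	for _, cpus := range []int{1, 8, 64, 256} {
		slack := cpus * chargeBatch * pageSize
		fmt.Printf("%3d CPUs: up to %8d bytes (%5.1f MiB) of slack\n",
			cpus, slack, float64(slack)/(1<<20))
	}
}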

At this point you might be wondering what a 12-year-old commit and email thread about the removed res_counter API have to do with the current Linux kernel. Well, lockless page counters are the replacement for resource counters, and that replacement did its best to keep the behavior of usage_in_bytes and stat unchanged. In the most recent mm/memcontrol.c we see two functions, mem_cgroup_usage and memcg_stat_show, which back usage_in_bytes and stat respectively.

In the first function, mem_cgroup_usage, we can see that for non-root cgroups the return value is the current page counter value, read with page_counter_read(&memcg->memory). This page_counter_read is effectively the same value as res_counter->usage, just without the lock requirement. On the flip side, memcg_stat_show pulls its stats from the memory controller's vmstats struct, which is synced either every 2 seconds or when a large enough stat change occurs, per the comments in memcontrol.c.

Throughout the above discussion I've been focusing on the cgroup v1 usage_in_bytes. It turns out that cgroup v2's .current uses memory_current_read, which pulls from the same page_counter_read as v1, so it is similarly affected.
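For completeness, a small sketch of how one could pick the right counter file depending on the cgroup version from inside a container (detecting v2 via the presence of cgroup.controllers is a common heuristic; this is my own illustration, not how runc necessarily does it):

package main

import (
	"fmt"
	"os"
)

// usageFilePath returns the raw usage counter path as seen from inside
// the container's cgroup namespace: memory.current on the cgroup v2
// unified hierarchy, memory.usage_in_bytes on the v1 memory controller.
func usageFilePath() string {
	if _, err := os.Stat("/sys/fs/cgroup/cgroup.controllers"); err == nil {
		return "/sys/fs/cgroup/memory.current" // cgroup v2
	}
	return "/sys/fs/cgroup/memory/memory.usage_in_bytes" // cgroup v1
}

func main() {
	fmt.Println("usage counter:", usageFilePath())
}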

Example

To illustrate this with a reproducible application, I've copied the memory.stat and memory.usage_in_bytes from an nginx pod as described in the Kubernetes docs, shown below:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80
root@nginx:/sys/fs/cgroup/memory# cat memory.{stat,usage_in_bytes}
<redacted for readability can provide full output>
total_cache 270336
total_rss 1826816
<redacted for readability can provide full output>
4562944

As you can see, total_cache + total_rss equals 2097152, which is less than half of the reported usage_in_bytes of 4562944! Granted, the difference in this case is a measly ~2 MB, but it can add up. In the original cache- and memory-intensive application I was debugging, I noticed the following in my memory files:

bash-4.4$ cat memory.{stat,usage_in_bytes}
<redacted for readability can provide full output>
total_cache 2286772224
total_rss 565260288
<redacted for readability can provide full output>
3483992064

In this example CACHE+RSS equals 2852032512, which is 602.68 MiB less than the reported usage_in_bytes! While 602 MiB is only a roughly 20% overestimation, a smaller relative gap than nginx's ~50%, the absolute impact is far more visible.

Proposal

I'd like to update container_memory_usage_bytes to be calculated as CACHE+RSS instead of read from usage_in_bytes. This is what's suggested in the kernel docs quoted above, and it's also effectively what the kernel does for usage_in_bytes of the root cgroup, i.e. on the host/node. I also think this would benefit users, since container_memory_usage_bytes would then be easily calculable from, and understandable in terms of, the other statistics.
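A minimal sketch of what the proposed calculation could look like (my own illustration, not a patch against the actual cadvisor/runc code; the keys assume cgroup v1's memory.stat, and both whether swap should be added and what the v2 equivalents, presumably anon and file, should be are open questions):

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// usageFromStat computes usage as CACHE+RSS from memory.stat instead of
// reading the fuzzed memory.usage_in_bytes counter.
func usageFromStat(statPath string) (uint64, error) {
	b, err := os.ReadFile(statPath)
	if err != nil {
		return 0, err
	}
	var cache, rss uint64
	for _, line := range strings.Split(strings.TrimSpace(string(b)), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		v, _ := strconv.ParseUint(fields[1], 10, 64)
		switch fields[0] {
		case "total_cache": // page cache charged to this cgroup (v1)
			cache = v
		case "total_rss": // anonymous and swap cache memory (v1)
			rss = v
		}
	}
	return cache + rss, nil
}

func main() {
	usage, err := usageFromStat("/sys/fs/cgroup/memory/memory.stat")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("container_memory_usage_bytes (proposed):", usage)
}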

Most of my understanding of this issue comes from a 12-year-old email discussion, so I'm still familiarizing myself with the current kernel. Any corrections or historical context for the current implementation would be greatly appreciated. I'd also love to know whether this has been discussed before. I'd be happy to work on implementing this change if the proposal is accepted.
