Commit 4b82ab4

kuba-moo authored and torvalds committed
mm/memcg: automatically penalize tasks with high swap use
Add a memory.swap.high knob, which can be used to protect the system from SWAP exhaustion. The mechanism used for penalizing is similar to memory.high penalty (sleep on return to user space).

That is not to say that the knob itself is equivalent to memory.high. The objective is more to protect the system from potentially buggy tasks consuming a lot of swap and impacting other tasks, or even bringing the whole system to stand still with complete SWAP exhaustion. Hopefully without the need to find per-task hard limits.

Slowing misbehaving tasks down gradually allows user space oom killers or other protection mechanisms to react. oomd and earlyoom already do killing based on swap exhaustion, and memory.swap.high protection will help implement such userspace oom policies more reliably.

We can use one counter for number of pages allocated under pressure to save struct task space and avoid two separate hierarchy walks on the hot path. The exact overage is calculated on return to user space, anyway.

Take the new high limit into account when determining if swap is "full". Borrowing the explanation from Johannes:

  The idea behind "swap full" is that as long as the workload has plenty of swap space available and it's not changing its memory contents, it makes sense to generously hold on to copies of data in the swap device, even after the swapin. A later reclaim cycle can drop the page without any IO. Trading disk space for IO.

  But the only two ways to reclaim a swap slot is when they're faulted in and the references go away, or by scanning the virtual address space like swapoff does - which is very expensive (one could argue it's too expensive even for swapoff, it's often more practical to just reboot).

  So at some point in the fill level, we have to start freeing up swap slots on fault/swapin. Otherwise we could eventually run out of swap slots while they're filled with copies of data that is also in RAM.

  We don't want to OOM a workload because its available swap space is filled with redundant cache.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Chris Down <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
1 parent d1663a9 commit 4b82ab4

3 files changed, +102 -7 lines changed


Documentation/admin-guide/cgroup-v2.rst

Lines changed: 20 additions & 0 deletions
@@ -1374,6 +1374,22 @@ PAGE_SIZE multiple when read back.
 	The total amount of swap currently being used by the cgroup
 	and its descendants.
 
+  memory.swap.high
+	A read-write single value file which exists on non-root
+	cgroups. The default is "max".
+
+	Swap usage throttle limit. If a cgroup's swap usage exceeds
+	this limit, all its further allocations will be throttled to
+	allow userspace to implement custom out-of-memory procedures.
+
+	This limit marks a point of no return for the cgroup. It is NOT
+	designed to manage the amount of swapping a workload does
+	during regular operation. Compare to memory.swap.max, which
+	prohibits swapping past a set amount, but lets the cgroup
+	continue unimpeded as long as other memory can be reclaimed.
+
+	Healthy workloads are not expected to reach this limit.
+
   memory.swap.max
 	A read-write single value file which exists on non-root
 	cgroups. The default is "max".
@@ -1387,6 +1403,10 @@ PAGE_SIZE multiple when read back.
 	otherwise, a value change in this file generates a file
 	modified event.
 
+	  high
+		The number of times the cgroup's swap usage was over
+		the high threshold.
+
 	  max
 		The number of times the cgroup's swap usage was about
 		to go over the max boundary and swap allocation
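
The new "high" entry is what a userspace OOM policy would most likely watch. As a rough sketch of how such a tool might pull one counter out of the text of memory.swap.events, assuming the "key value" per-line layout produced by swap_events_show() in this patch; the helper name swap_event_count() is hypothetical and not part of any kernel or library API:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look for a line of the form "<key> <count>" in the text read from
 * memory.swap.events. Returns 0 and stores the count on success,
 * -1 if the key is not present. Hypothetical helper for illustration.
 */
static int swap_event_count(const char *text, const char *key,
			    unsigned long *count)
{
	size_t klen;
	const char *line = text;

	assert(text && key && count);
	klen = strlen(key);

	while (line && *line) {
		/* Match the whole key followed by a single space. */
		if (strncmp(line, key, klen) == 0 && line[klen] == ' ' &&
		    sscanf(line + klen + 1, "%lu", count) == 1)
			return 0;

		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

A monitor could read the file periodically and treat a rising "high" count as a hint that the cgroup is being throttled; how it reacts is a policy decision outside the scope of this patch.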

include/linux/memcontrol.h

Lines changed: 1 addition & 0 deletions
@@ -45,6 +45,7 @@ enum memcg_memory_event {
 	MEMCG_MAX,
 	MEMCG_OOM,
 	MEMCG_OOM_KILL,
+	MEMCG_SWAP_HIGH,
 	MEMCG_SWAP_MAX,
 	MEMCG_SWAP_FAIL,
 	MEMCG_NR_MEMORY_EVENTS,

mm/memcontrol.c

Lines changed: 81 additions & 7 deletions
@@ -2354,6 +2354,22 @@ static u64 mem_find_max_overage(struct mem_cgroup *memcg)
 	return max_overage;
 }
 
+static u64 swap_find_max_overage(struct mem_cgroup *memcg)
+{
+	u64 overage, max_overage = 0;
+
+	do {
+		overage = calculate_overage(page_counter_read(&memcg->swap),
+					    READ_ONCE(memcg->swap.high));
+		if (overage)
+			memcg_memory_event(memcg, MEMCG_SWAP_HIGH);
+		max_overage = max(overage, max_overage);
+	} while ((memcg = parent_mem_cgroup(memcg)) &&
+		 !mem_cgroup_is_root(memcg));
+
+	return max_overage;
+}
+
 /*
  * Get the number of jiffies that we should penalise a mischievous cgroup which
  * is exceeding its memory.high by checking both it and its ancestors.
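
For readers who want to experiment with the hierarchy walk outside the kernel, here is a minimal userspace sketch of the same idea. Everything in it is an assumption made for illustration: struct cg_node, overage_of() and the 20-bit fixed-point shift are stand-ins, and calculate_overage() itself is not part of this diff, so its exact arithmetic may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed fixed-point precision; the kernel uses its own constant. */
#define SKETCH_PRECISION_SHIFT 20

/* Hypothetical stand-in for struct mem_cgroup (values in pages). */
struct cg_node {
	uint64_t swap_usage;
	uint64_t swap_high;
	struct cg_node *parent;		/* NULL once the root would be next */
	unsigned long swap_high_events;	/* models the MEMCG_SWAP_HIGH counter */
};

/* Relative overage, (usage - high) / high, as a fixed-point value. */
static uint64_t overage_of(uint64_t usage, uint64_t high)
{
	if (usage <= high)
		return 0;
	if (high == 0)
		high = 1;	/* avoid dividing by zero */
	return ((usage - high) << SKETCH_PRECISION_SHIFT) / high;
}

/* Walk from cg towards the root and return the worst overage seen. */
static uint64_t swap_max_overage(struct cg_node *cg)
{
	uint64_t worst = 0;

	assert(cg != NULL);
	for (; cg; cg = cg->parent) {
		uint64_t o = overage_of(cg->swap_usage, cg->swap_high);

		if (o)
			cg->swap_high_events++;	/* one event per level over its limit */
		if (o > worst)
			worst = o;
	}
	return worst;
}
```

The point of taking the maximum rather than the sum is that a task is penalized according to the most over-limit group on its path, not once per ancestor.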
@@ -2415,6 +2431,9 @@ void mem_cgroup_handle_over_high(void)
 	penalty_jiffies = calculate_high_delay(memcg, nr_pages,
 					       mem_find_max_overage(memcg));
 
+	penalty_jiffies += calculate_high_delay(memcg, nr_pages,
+						swap_find_max_overage(memcg));
+
 	/*
 	 * Clamp the max delay per usermode return so as to still keep the
 	 * application moving forwards and also permit diagnostics, albeit
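
The effect of this hunk is that the memory and swap penalties are summed before the existing clamp is applied. A tiny sketch of that arithmetic, under the assumption that both delays are already in the same unit; combined_penalty() and its max_delay parameter are hypothetical names, and the real cap is a kernel constant not shown in this diff:

```c
#include <assert.h>

/*
 * Add the memory and swap penalties, then clamp the total so a task
 * is never put to sleep for longer than max_delay per return to user
 * space. Hypothetical helper for illustration only.
 */
static unsigned long combined_penalty(unsigned long mem_delay,
				      unsigned long swap_delay,
				      unsigned long max_delay)
{
	unsigned long total;

	assert(max_delay > 0);	/* a zero cap would disable throttling */
	total = mem_delay + swap_delay;

	/* Treat unsigned wrap-around the same as exceeding the cap. */
	if (total < mem_delay || total > max_delay)
		total = max_delay;
	return total;
}
```

Because the sum is clamped once, a cgroup that is over both limits is not delayed for longer than the cap, which matches the stated goal of keeping the application moving forwards.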
@@ -2605,13 +2624,32 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * reclaim, the cost of mismatch is negligible.
 	 */
 	do {
-		if (page_counter_read(&memcg->memory) >
-		    READ_ONCE(memcg->memory.high)) {
-			/* Don't bother a random interrupted task */
-			if (in_interrupt()) {
+		bool mem_high, swap_high;
+
+		mem_high = page_counter_read(&memcg->memory) >
+			READ_ONCE(memcg->memory.high);
+		swap_high = page_counter_read(&memcg->swap) >
+			READ_ONCE(memcg->swap.high);
+
+		/* Don't bother a random interrupted task */
+		if (in_interrupt()) {
+			if (mem_high) {
 				schedule_work(&memcg->high_work);
 				break;
 			}
+			continue;
+		}
+
+		if (mem_high || swap_high) {
+			/*
+			 * The allocating tasks in this cgroup will need to do
+			 * reclaim or be throttled to prevent further growth
+			 * of the memory or swap footprints.
+			 *
+			 * Target some best-effort fairness between the tasks,
+			 * and distribute reclaim work and delay penalties
+			 * based on how much each task is actually allocating.
+			 */
 			current->memcg_nr_pages_over_high += batch;
 			set_notify_resume(current);
 			break;
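
The restructured loop body makes a three-way decision per cgroup level. It may be easier to review as a pure function; the sketch below restates it that way. The enum and function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum high_action {
	HIGH_NEXT_ANCESTOR,	/* nothing to do at this level, keep walking up */
	HIGH_SCHEDULE_WORK,	/* interrupt context: hand off to the high_work item */
	HIGH_MARK_TASK,		/* task context: account the batch and notify on resume */
};

/* Restates the per-level decision made in the try_charge() loop. */
static enum high_action over_high_action(bool in_irq, bool mem_high,
					 bool swap_high)
{
	if (in_irq)
		return mem_high ? HIGH_SCHEDULE_WORK : HIGH_NEXT_ANCESTOR;
	if (mem_high || swap_high)
		return HIGH_MARK_TASK;
	return HIGH_NEXT_ANCESTOR;
}

/* Small self-check that doubles as a usage example. */
void over_high_action_example(void)
{
	/* Swap overage alone does not schedule work from interrupt context. */
	assert(over_high_action(true, false, true) == HIGH_NEXT_ANCESTOR);
	/* In task context either limit is enough to mark the task. */
	assert(over_high_action(false, false, true) == HIGH_MARK_TASK);
}
```

Note that in interrupt context only a memory overage triggers deferred work; a swap-only overage falls through to the next ancestor.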
@@ -5076,6 +5114,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 
 	page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
 	memcg->soft_limit = PAGE_COUNTER_MAX;
+	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
 	if (parent) {
 		memcg->swappiness = mem_cgroup_swappiness(parent);
 		memcg->oom_kill_disable = parent->oom_kill_disable;
@@ -5229,6 +5268,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
 	page_counter_set_low(&memcg->memory, 0);
 	page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
 	memcg->soft_limit = PAGE_COUNTER_MAX;
+	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
 	memcg_wb_domain_size_changed(memcg);
 }
 
@@ -7142,10 +7182,13 @@ bool mem_cgroup_swap_full(struct page *page)
 	if (!memcg)
 		return false;
 
-	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
-		if (page_counter_read(&memcg->swap) * 2 >=
-		    READ_ONCE(memcg->swap.max))
+	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
+		unsigned long usage = page_counter_read(&memcg->swap);
+
+		if (usage * 2 >= READ_ONCE(memcg->swap.high) ||
+		    usage * 2 >= READ_ONCE(memcg->swap.max))
 			return true;
+	}
 
 	return false;
 }
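
The "swap full" heuristic now fires when usage reaches half of either limit at any level of the hierarchy. A self-contained sketch of that check, with the caveat that struct swap_cg and NO_LIMIT are hypothetical stand-ins (the kernel uses struct mem_cgroup and PAGE_COUNTER_MAX) and that the root-cgroup special case is modeled simply by a NULL parent:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_LIMIT ULONG_MAX	/* stand-in for a limit left at "max" */

/* Hypothetical stand-in for the swap counters of one cgroup (in pages). */
struct swap_cg {
	unsigned long swap_usage;
	unsigned long swap_high;
	unsigned long swap_max;
	const struct swap_cg *parent;	/* NULL where the root would be */
};

/* True if any level has used at least half of swap.high or swap.max. */
static bool swap_full(const struct swap_cg *cg)
{
	for (; cg; cg = cg->parent) {
		unsigned long usage = cg->swap_usage;

		if (usage * 2 >= cg->swap_high || usage * 2 >= cg->swap_max)
			return true;
	}
	return false;
}

/* Small self-check that doubles as a usage example. */
void swap_full_example(void)
{
	struct swap_cg parent = { 60, 100, NO_LIMIT, NULL };
	struct swap_cg child = { 40, NO_LIMIT, NO_LIMIT, &parent };

	/* The child has no limits, but its parent is at 60% of swap.high. */
	assert(swap_full(&child));
}
```

With both limits left at "max", the comparison cannot succeed for any realistic usage, so the previous behavior is preserved for cgroups that never set memory.swap.high.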
@@ -7175,6 +7218,29 @@ static u64 swap_current_read(struct cgroup_subsys_state *css,
 	return (u64)page_counter_read(&memcg->swap) * PAGE_SIZE;
 }
 
+static int swap_high_show(struct seq_file *m, void *v)
+{
+	return seq_puts_memcg_tunable(m,
+		READ_ONCE(mem_cgroup_from_seq(m)->swap.high));
+}
+
+static ssize_t swap_high_write(struct kernfs_open_file *of,
+			       char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long high;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "max", &high);
+	if (err)
+		return err;
+
+	page_counter_set_high(&memcg->swap, high);
+
+	return nbytes;
+}
+
 static int swap_max_show(struct seq_file *m, void *v)
 {
 	return seq_puts_memcg_tunable(m,
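
The write handler accepts either the literal "max" or a byte count and stores the limit in pages. The following userspace sketch approximates that parsing step; it is not the kernel's page_counter_memparse(), whose body is not part of this diff. The 4096-byte page size, the K/M/G suffix handling, and the names parse_limit(), SKETCH_PAGE_SIZE and SKETCH_PAGES_MAX are all assumptions made for illustration, and the input is assumed to be already stripped of surrounding whitespace.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL				/* assumed page size */
#define SKETCH_PAGES_MAX (ULONG_MAX / SKETCH_PAGE_SIZE)	/* stand-in for "no limit" */

/*
 * Parse "max" or a byte count with an optional K/M/G suffix and convert
 * it to whole pages, rounding down. Returns 0 on success or -EINVAL.
 * Overflow of the scaled value is not checked in this sketch.
 */
static int parse_limit(const char *buf, unsigned long *nr_pages)
{
	unsigned long long bytes;
	char *end;

	assert(buf && nr_pages);

	if (strcmp(buf, "max") == 0) {
		*nr_pages = SKETCH_PAGES_MAX;
		return 0;
	}

	/* Reject signs, whitespace and empty input up front. */
	if (!isdigit((unsigned char)buf[0]))
		return -EINVAL;

	errno = 0;
	bytes = strtoull(buf, &end, 0);
	if (errno)
		return -EINVAL;

	switch (*end) {
	case 'G': case 'g':
		bytes <<= 10;
		/* fall through */
	case 'M': case 'm':
		bytes <<= 10;
		/* fall through */
	case 'K': case 'k':
		bytes <<= 10;
		end++;
		break;
	default:
		break;
	}
	if (*end != '\0')
		return -EINVAL;

	bytes /= SKETCH_PAGE_SIZE;
	*nr_pages = bytes < SKETCH_PAGES_MAX ? (unsigned long)bytes
					     : SKETCH_PAGES_MAX;
	return 0;
}
```

Rounding down to whole pages is consistent with the documentation's note that values are a PAGE_SIZE multiple when read back.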
@@ -7202,6 +7268,8 @@ static int swap_events_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
+	seq_printf(m, "high %lu\n",
+		   atomic_long_read(&memcg->memory_events[MEMCG_SWAP_HIGH]));
 	seq_printf(m, "max %lu\n",
 		   atomic_long_read(&memcg->memory_events[MEMCG_SWAP_MAX]));
 	seq_printf(m, "fail %lu\n",
@@ -7216,6 +7284,12 @@ static struct cftype swap_files[] = {
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_u64 = swap_current_read,
 	},
+	{
+		.name = "swap.high",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_high_show,
+		.write = swap_high_write,
+	},
 	{
 		.name = "swap.max",
 		.flags = CFTYPE_NOT_ON_ROOT,
