
Commit 0a97c01

nhatsmrt authored and akpm00 committed
list_lru: allow explicit memcg and NUMA node selection
Patch series "workload-specific and memory pressure-driven zswap writeback", v8. There are currently several issues with zswap writeback: 1. There is only a single global LRU for zswap, making it impossible to perform worload-specific shrinking - an memcg under memory pressure cannot determine which pages in the pool it owns, and often ends up writing pages from other memcgs. This issue has been previously observed in practice and mitigated by simply disabling memcg-initiated shrinking: https://lore.kernel.org/all/[email protected]/T/#u But this solution leaves a lot to be desired, as we still do not have an avenue for an memcg to free up its own memory locked up in the zswap pool. 2. We only shrink the zswap pool when the user-defined limit is hit. This means that if we set the limit too high, cold data that are unlikely to be used again will reside in the pool, wasting precious memory. It is hard to predict how much zswap space will be needed ahead of time, as this depends on the workload (specifically, on factors such as memory access patterns and compressibility of the memory pages). This patch series solves these issues by separating the global zswap LRU into per-memcg and per-NUMA LRUs, and performs workload-specific (i.e memcg- and NUMA-aware) zswap writeback under memory pressure. The new shrinker does not have any parameter that must be tuned by the user, and can be opted in or out on a per-memcg basis. As a proof of concept, we ran the following synthetic benchmark: build the linux kernel in a memory-limited cgroup, and allocate some cold data in tmpfs to see if the shrinker could write them out and improved the overall performance. Depending on the amount of cold data generated, we observe from 14% to 35% reduction in kernel CPU time used in the kernel builds. This patch (of 6): The interface of list_lru is based on the assumption that the list node and the data it represents belong to the same allocated on the correct node/memcg. While this assumption is valid for existing slab objects LRU such as dentries and inodes, it is undocumented, and rather inflexible for certain potential list_lru users (such as the upcoming zswap shrinker and the THP shrinker). It has caused us a lot of issues during our development. This patch changes list_lru interface so that the caller must explicitly specify numa node and memcg when adding and removing objects. The old list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and list_lru_del_obj(), respectively. It also extends the list_lru API with a new function, list_lru_putback, which undoes a previous list_lru_isolate call. Unlike list_lru_add, it does not increment the LRU node count (as list_lru_isolate does not decrement the node count). list_lru_putback also allows for explicit memcg and NUMA node selection. 
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nhat Pham <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Tested-by: Bagas Sanjaya <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
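As a rough illustration of the interface change described above (this is not code from the series; struct my_entry and its fields are hypothetical), a caller that tracks the NUMA node and memcg of its objects itself can pass them explicitly to the new list_lru_add()/list_lru_del(), while slab-backed users keep the old behaviour through the *_obj() variants, which is what the file conversions below now call:

#include <linux/list_lru.h>
#include <linux/memcontrol.h>

/*
 * Hypothetical user of the new interface (for example, the upcoming zswap
 * shrinker). The node and memcg are recorded in the entry itself rather
 * than derived from where the list_head was allocated.
 */
struct my_entry {
        struct list_head lru;
        int nid;                        /* NUMA node of the data this entry tracks */
        struct mem_cgroup *memcg;       /* owning cgroup */
};

static void my_entry_lru_add(struct list_lru *lru, struct my_entry *e)
{
        /* New signature: the caller selects the sublist explicitly. */
        list_lru_add(lru, &e->lru, e->nid, e->memcg);
}

static void my_entry_lru_del(struct list_lru *lru, struct my_entry *e)
{
        list_lru_del(lru, &e->lru, e->nid, e->memcg);
}

/*
 * Slab-backed users (dentries, inodes, ...) keep the old semantics: the
 * node and memcg are derived from the list_head's own allocation.
 */
static bool my_slab_lru_add(struct list_lru *lru, struct list_head *item)
{
        return list_lru_add_obj(lru, item);
}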
1 parent 330018f commit 0a97c01


12 files changed, +117 -36 lines


drivers/android/binder_alloc.c

Lines changed: 3 additions & 4 deletions
@@ -234,7 +234,7 @@ static int binder_update_page_range(struct binder_alloc *alloc, int allocate,
 		if (page->page_ptr) {
 			trace_binder_alloc_lru_start(alloc, index);
 
-			on_lru = list_lru_del(&binder_alloc_lru, &page->lru);
+			on_lru = list_lru_del_obj(&binder_alloc_lru, &page->lru);
 			WARN_ON(!on_lru);
 
 			trace_binder_alloc_lru_end(alloc, index);
@@ -285,7 +285,7 @@ static int binder_update_page_range(struct binder_alloc *alloc, int allocate,
 
 		trace_binder_free_lru_start(alloc, index);
 
-		ret = list_lru_add(&binder_alloc_lru, &page->lru);
+		ret = list_lru_add_obj(&binder_alloc_lru, &page->lru);
 		WARN_ON(!ret);
 
 		trace_binder_free_lru_end(alloc, index);
@@ -848,7 +848,7 @@ void binder_alloc_deferred_release(struct binder_alloc *alloc)
 		if (!alloc->pages[i].page_ptr)
 			continue;
 
-		on_lru = list_lru_del(&binder_alloc_lru,
+		on_lru = list_lru_del_obj(&binder_alloc_lru,
 				      &alloc->pages[i].lru);
 		page_addr = alloc->buffer + i * PAGE_SIZE;
 		binder_alloc_debug(BINDER_DEBUG_BUFFER_ALLOC,
@@ -1287,4 +1287,3 @@ int binder_alloc_copy_from_buffer(struct binder_alloc *alloc,
 	return binder_alloc_do_buffer_copy(alloc, false, buffer, buffer_offset,
 					   dest, bytes);
 }
-

fs/dcache.c

Lines changed: 5 additions & 3 deletions
@@ -428,7 +428,8 @@ static void d_lru_add(struct dentry *dentry)
 	this_cpu_inc(nr_dentry_unused);
 	if (d_is_negative(dentry))
 		this_cpu_inc(nr_dentry_negative);
-	WARN_ON_ONCE(!list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
+	WARN_ON_ONCE(!list_lru_add_obj(
+			&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
 }
 
 static void d_lru_del(struct dentry *dentry)
@@ -438,7 +439,8 @@ static void d_lru_del(struct dentry *dentry)
 	this_cpu_dec(nr_dentry_unused);
 	if (d_is_negative(dentry))
 		this_cpu_dec(nr_dentry_negative);
-	WARN_ON_ONCE(!list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
+	WARN_ON_ONCE(!list_lru_del_obj(
+			&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
 }
 
 static void d_shrink_del(struct dentry *dentry)
@@ -1240,7 +1242,7 @@ static enum lru_status dentry_lru_isolate(struct list_head *item,
 	 *
 	 * This is guaranteed by the fact that all LRU management
 	 * functions are intermediated by the LRU API calls like
-	 * list_lru_add and list_lru_del. List movement in this file
+	 * list_lru_add_obj and list_lru_del_obj. List movement in this file
 	 * only ever occur through this functions or through callbacks
 	 * like this one, that are called from the LRU API.
 	 *

fs/gfs2/quota.c

Lines changed: 3 additions & 3 deletions
@@ -271,7 +271,7 @@ static struct gfs2_quota_data *gfs2_qd_search_bucket(unsigned int hash,
 		if (qd->qd_sbd != sdp)
 			continue;
 		if (lockref_get_not_dead(&qd->qd_lockref)) {
-			list_lru_del(&gfs2_qd_lru, &qd->qd_lru);
+			list_lru_del_obj(&gfs2_qd_lru, &qd->qd_lru);
 			return qd;
 		}
 	}
@@ -344,7 +344,7 @@ static void qd_put(struct gfs2_quota_data *qd)
 	}
 
 	qd->qd_lockref.count = 0;
-	list_lru_add(&gfs2_qd_lru, &qd->qd_lru);
+	list_lru_add_obj(&gfs2_qd_lru, &qd->qd_lru);
 	spin_unlock(&qd->qd_lockref.lock);
 }
 
@@ -1517,7 +1517,7 @@ void gfs2_quota_cleanup(struct gfs2_sbd *sdp)
 		lockref_mark_dead(&qd->qd_lockref);
 		spin_unlock(&qd->qd_lockref.lock);
 
-		list_lru_del(&gfs2_qd_lru, &qd->qd_lru);
+		list_lru_del_obj(&gfs2_qd_lru, &qd->qd_lru);
 		list_add(&qd->qd_lru, &dispose);
 	}
 	spin_unlock(&qd_lock);

fs/inode.c

Lines changed: 2 additions & 2 deletions
@@ -464,7 +464,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
 	if (!mapping_shrinkable(&inode->i_data))
 		return;
 
-	if (list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru))
+	if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru))
 		this_cpu_inc(nr_unused);
 	else if (rotate)
 		inode->i_state |= I_REFERENCED;
@@ -482,7 +482,7 @@ void inode_add_lru(struct inode *inode)
 
 static void inode_lru_list_del(struct inode *inode)
 {
-	if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
+	if (list_lru_del_obj(&inode->i_sb->s_inode_lru, &inode->i_lru))
 		this_cpu_dec(nr_unused);
 }
 

fs/nfs/nfs42xattr.c

Lines changed: 4 additions & 4 deletions
@@ -132,7 +132,7 @@ nfs4_xattr_entry_lru_add(struct nfs4_xattr_entry *entry)
 	lru = (entry->flags & NFS4_XATTR_ENTRY_EXTVAL) ?
 		&nfs4_xattr_large_entry_lru : &nfs4_xattr_entry_lru;
 
-	return list_lru_add(lru, &entry->lru);
+	return list_lru_add_obj(lru, &entry->lru);
 }
 
 static bool
@@ -143,7 +143,7 @@ nfs4_xattr_entry_lru_del(struct nfs4_xattr_entry *entry)
 	lru = (entry->flags & NFS4_XATTR_ENTRY_EXTVAL) ?
 		&nfs4_xattr_large_entry_lru : &nfs4_xattr_entry_lru;
 
-	return list_lru_del(lru, &entry->lru);
+	return list_lru_del_obj(lru, &entry->lru);
 }
 
 /*
@@ -349,7 +349,7 @@ nfs4_xattr_cache_unlink(struct inode *inode)
 
 	oldcache = nfsi->xattr_cache;
 	if (oldcache != NULL) {
-		list_lru_del(&nfs4_xattr_cache_lru, &oldcache->lru);
+		list_lru_del_obj(&nfs4_xattr_cache_lru, &oldcache->lru);
 		oldcache->inode = NULL;
 	}
 	nfsi->xattr_cache = NULL;
@@ -474,7 +474,7 @@ nfs4_xattr_get_cache(struct inode *inode, int add)
 		kref_get(&cache->ref);
 		nfsi->xattr_cache = cache;
 		cache->inode = inode;
-		list_lru_add(&nfs4_xattr_cache_lru, &cache->lru);
+		list_lru_add_obj(&nfs4_xattr_cache_lru, &cache->lru);
 	}
 
 	spin_unlock(&inode->i_lock);

fs/nfsd/filecache.c

Lines changed: 2 additions & 2 deletions
@@ -322,7 +322,7 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
 static bool nfsd_file_lru_add(struct nfsd_file *nf)
 {
 	set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
-	if (list_lru_add(&nfsd_file_lru, &nf->nf_lru)) {
+	if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
 		trace_nfsd_file_lru_add(nf);
 		return true;
 	}
@@ -331,7 +331,7 @@ static bool nfsd_file_lru_add(struct nfsd_file *nf)
 
 static bool nfsd_file_lru_remove(struct nfsd_file *nf)
 {
-	if (list_lru_del(&nfsd_file_lru, &nf->nf_lru)) {
+	if (list_lru_del_obj(&nfsd_file_lru, &nf->nf_lru)) {
 		trace_nfsd_file_lru_del(nf);
 		return true;
 	}

fs/xfs/xfs_buf.c

Lines changed: 3 additions & 3 deletions
@@ -169,7 +169,7 @@ xfs_buf_stale(
 
 	atomic_set(&bp->b_lru_ref, 0);
 	if (!(bp->b_state & XFS_BSTATE_DISPOSE) &&
-	    (list_lru_del(&bp->b_target->bt_lru, &bp->b_lru)))
+	    (list_lru_del_obj(&bp->b_target->bt_lru, &bp->b_lru)))
 		atomic_dec(&bp->b_hold);
 
 	ASSERT(atomic_read(&bp->b_hold) >= 1);
@@ -1047,7 +1047,7 @@ xfs_buf_rele(
 		 * buffer for the LRU and clear the (now stale) dispose list
 		 * state flag
 		 */
-		if (list_lru_add(&bp->b_target->bt_lru, &bp->b_lru)) {
+		if (list_lru_add_obj(&bp->b_target->bt_lru, &bp->b_lru)) {
 			bp->b_state &= ~XFS_BSTATE_DISPOSE;
 			atomic_inc(&bp->b_hold);
 		}
@@ -1060,7 +1060,7 @@ xfs_buf_rele(
 		 * was on was the disposal list
 		 */
 		if (!(bp->b_state & XFS_BSTATE_DISPOSE)) {
-			list_lru_del(&bp->b_target->bt_lru, &bp->b_lru);
+			list_lru_del_obj(&bp->b_target->bt_lru, &bp->b_lru);
 		} else {
 			ASSERT(list_empty(&bp->b_lru));
 		}

fs/xfs/xfs_dquot.c

Lines changed: 1 addition & 1 deletion
@@ -1065,7 +1065,7 @@ xfs_qm_dqput(
 		struct xfs_quotainfo	*qi = dqp->q_mount->m_quotainfo;
 		trace_xfs_dqput_free(dqp);
 
-		if (list_lru_add(&qi->qi_lru, &dqp->q_lru))
+		if (list_lru_add_obj(&qi->qi_lru, &dqp->q_lru))
 			XFS_STATS_INC(dqp->q_mount, xs_qm_dquot_unused);
 	}
 	xfs_dqunlock(dqp);

fs/xfs/xfs_qm.c

Lines changed: 1 addition & 1 deletion
@@ -171,7 +171,7 @@ xfs_qm_dqpurge(
 	 * hits zero, so it really should be on the freelist here.
 	 */
 	ASSERT(!list_empty(&dqp->q_lru));
-	list_lru_del(&qi->qi_lru, &dqp->q_lru);
+	list_lru_del_obj(&qi->qi_lru, &dqp->q_lru);
 	XFS_STATS_DEC(dqp->q_mount, xs_qm_dquot_unused);
 
 	xfs_qm_dqdestroy(dqp);

include/linux/list_lru.h

Lines changed: 51 additions & 3 deletions
@@ -75,6 +75,8 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
  * list_lru_add: add an element to the lru list's tail
  * @lru: the lru pointer
  * @item: the item to be added.
+ * @nid: the node id of the sublist to add the item to.
+ * @memcg: the cgroup of the sublist to add the item to.
  *
  * If the element is already part of a list, this function returns doing
  * nothing. Therefore the caller does not need to keep state about whether or
@@ -87,20 +89,50 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
  *
  * Return: true if the list was updated, false otherwise
  */
-bool list_lru_add(struct list_lru *lru, struct list_head *item);
+bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
+		  struct mem_cgroup *memcg);
 
 /**
- * list_lru_del: delete an element to the lru list
+ * list_lru_add_obj: add an element to the lru list's tail
+ * @lru: the lru pointer
+ * @item: the item to be added.
+ *
+ * This function is similar to list_lru_add(), but the NUMA node and the
+ * memcg of the sublist is determined by @item list_head. This assumption is
+ * valid for slab objects LRU such as dentries, inodes, etc.
+ *
+ * Return value: true if the list was updated, false otherwise
+ */
+bool list_lru_add_obj(struct list_lru *lru, struct list_head *item);
+
+/**
+ * list_lru_del: delete an element from the lru list
  * @lru: the lru pointer
  * @item: the item to be deleted.
+ * @nid: the node id of the sublist to delete the item from.
+ * @memcg: the cgroup of the sublist to delete the item from.
  *
  * This function works analogously as list_lru_add() in terms of list
  * manipulation. The comments about an element already pertaining to
  * a list are also valid for list_lru_del().
 *
 * Return: true if the list was updated, false otherwise
 */
-bool list_lru_del(struct list_lru *lru, struct list_head *item);
+bool list_lru_del(struct list_lru *lru, struct list_head *item, int nid,
+		  struct mem_cgroup *memcg);
+
+/**
+ * list_lru_del_obj: delete an element from the lru list
+ * @lru: the lru pointer
+ * @item: the item to be deleted.
+ *
+ * This function is similar to list_lru_del(), but the NUMA node and the
+ * memcg of the sublist is determined by @item list_head. This assumption is
+ * valid for slab objects LRU such as dentries, inodes, etc.
+ *
+ * Return value: true if the list was updated, false otherwise.
+ */
+bool list_lru_del_obj(struct list_lru *lru, struct list_head *item);
 
 /**
  * list_lru_count_one: return the number of objects currently held by @lru
@@ -138,6 +170,22 @@ static inline unsigned long list_lru_count(struct list_lru *lru)
 void list_lru_isolate(struct list_lru_one *list, struct list_head *item);
 void list_lru_isolate_move(struct list_lru_one *list, struct list_head *item,
 			   struct list_head *head);
+/**
+ * list_lru_putback: undo list_lru_isolate
+ * @lru: the lru pointer.
+ * @item: the item to put back.
+ * @nid: the node id of the sublist to put the item back to.
+ * @memcg: the cgroup of the sublist to put the item back to.
+ *
+ * Put back an isolated item into its original LRU. Note that unlike
+ * list_lru_add, this does not increment the node LRU count (as
+ * list_lru_isolate does not originally decrement this count).
+ *
+ * Since we might have dropped the LRU lock in between, recompute list_lru_one
+ * from the node's id and memcg.
+ */
+void list_lru_putback(struct list_lru *lru, struct list_head *item, int nid,
+		      struct mem_cgroup *memcg);
 
 typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
 		struct list_lru_one *list, spinlock_t *lock, void *cb_arg);
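The list_lru_putback() addition above is easiest to see in the context of an LRU walk. Below is a minimal sketch (again not code from this series; it reuses the hypothetical struct my_entry from the earlier example, my_try_reclaim() is likewise assumed, and passing the struct list_lru as cb_arg is just a convention chosen for the example) of an isolate callback that drops the LRU lock to attempt reclaim and puts the entry back if that fails:

static enum lru_status my_isolate(struct list_head *item,
                                  struct list_lru_one *list,
                                  spinlock_t *lock, void *cb_arg)
{
        struct list_lru *lru = cb_arg;  /* passed in by the walk caller */
        struct my_entry *e = container_of(item, struct my_entry, lru);
        int nid = e->nid;
        struct mem_cgroup *memcg = e->memcg;

        /* Unlink the entry; the node-level LRU count is left untouched. */
        list_lru_isolate(list, item);
        spin_unlock(lock);

        if (my_try_reclaim(e)) {                /* hypothetical reclaim attempt */
                spin_lock(lock);
                return LRU_REMOVED_RETRY;       /* removed, and the lock was dropped */
        }

        /*
         * Reclaim failed: return the entry to its LRU. Since the lock was
         * dropped, list_lru_putback() recomputes the sublist from nid and
         * memcg instead of reusing @list, and it does not bump the node
         * count that list_lru_isolate() never decremented.
         */
        list_lru_putback(lru, item, nid, memcg);
        spin_lock(lock);
        return LRU_RETRY;                       /* lock was dropped: restart the walk */
}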
