Commit ffcb5f5

nhatsmrt authored and akpm00 committed
workingset: refactor LRU refault to expose refault recency check

Patch series "cachestat: a new syscall for page cache state of files", v13.

There is currently no good way to query the page cache statistics of large
files and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really does not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.

Some use cases where this information could come in handy:
  * Allowing a database to decide whether to perform an index scan or
    direct table queries based on the in-memory cache state of the index.
  * Visibility into the writeback algorithm, for diagnosing performance
    issues.
  * Workload-aware writeback pacing: estimating IO fulfilled by page cache
    (and IO to be done) within a range of a file, allowing for more
    frequent syncing when and where there is IO capacity, and batching
    when there is not.
  * Computing memory usage of large files/directory trees, analogous to
    the du tool for disk usage.

More information about these use cases can be found in this thread:
https://lore.kernel.org/lkml/[email protected]/

This series of patches introduces a new system call, cachestat, that
summarizes the page cache statistics (number of cached pages, dirty pages,
pages marked for writeback, evicted pages etc.) of a file, in a specified
range of bytes. It also includes a selftest suite that tests some typical
usage. Currently, the syscall is only wired in for the x86 architecture.

This interface is inspired by past discussion and concerns with fincore,
which has a similar design (and as a result, similar issues) as mincore.
Relevant links:

https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html
https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html

I have also developed a small tool that computes the memory usage of files
and directories, analogous to the du utility. The user can choose between
mincore or cachestat (with cachestat exporting more information than
mincore). To compare the performance of these two options, I benchmarked
the tool on the root directory of a Meta server machine, five runs each:

Using cachestat:
real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602
user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742
sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689

Using mincore:
real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059
user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162
sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046

I also ran both syscalls on a 2TB sparse file:

Using cachestat:
real    0m0.009s
user    0m0.000s
sys     0m0.009s

Using mincore:
real    0m37.510s
user    0m2.934s
sys     0m34.558s

Very large files like this are the pathological case for mincore. In fact,
to compute the stats for a single 2TB file, mincore takes as long as
cachestat takes to compute the stats for the entire tree! This could
easily happen inadvertently when we run it on subdirectories. Mincore is
clearly not suitable for a general-purpose command line tool.

Regarding security concerns, cachestat() should not pose any additional
issues. The caller already has read permission to the file itself (since
they need an fd to that file to call cachestat). This means that the
caller can access the underlying data in its entirety, which is a much
greater source of information (and as a result, a much greater security
risk) than the cache status itself.

The latest API change (in v13 of the patch series) was suggested by Jens
Axboe. It allows for a 64-bit length argument, even on 32-bit
architectures (which was previously not possible due to the limit on the
number of syscall arguments). Furthermore, it eliminates the need for
compatibility handling - every user can use the same ABI.

This patch (of 4):

In preparation for computing recently evicted pages in cachestat, refactor
workingset_refault and lru_gen_refault to expose a helper function that
tests whether an evicted page was recently evicted.

[[email protected]: add missing rcu_read_unlock() in lru_gen_refault()]
  Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nhat Pham <[email protected]>
Signed-off-by: Tetsuo Handa <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Brian Foster <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

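
For readers unfamiliar with the mincore() workflow that the cover letter criticizes, here is a minimal userspace sketch of it. It is not part of this patch or of the tool mentioned above; the helper name count_resident_pages is hypothetical. It shows the three costs the message describes: the file has to be mapped, the kernel fills one byte per page, and userspace has to sum those bytes itself.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: count the pages of an open file that mincore()
 * reports as resident. Stores the number of pages spanned by the file in
 * *total_pages and returns the resident count, or -1 on error.
 */
long count_resident_pages(int fd, size_t *total_pages)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char *vec;
	struct stat st;
	long resident = 0;
	size_t npages, i;
	void *map;

	assert(total_pages != NULL);
	*total_pages = 0;

	if (page_size <= 0 || fstat(fd, &st) != 0)
		return -1;
	if (st.st_size == 0)
		return 0;	/* nothing to map */

	npages = ((size_t)st.st_size + (size_t)page_size - 1) / (size_t)page_size;
	*total_pages = npages;

	/* Cost 1: the whole file has to be mapped before it can be queried. */
	map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return -1;

	/* Cost 2: the kernel writes one byte per page into this vector. */
	vec = malloc(npages);
	if (vec == NULL) {
		munmap(map, (size_t)st.st_size);
		return -1;
	}

	if (mincore(map, (size_t)st.st_size, vec) != 0) {
		resident = -1;
	} else {
		/* Cost 3: userspace aggregates the per-page bits itself. */
		for (i = 0; i < npages; i++)
			resident += vec[i] & 1;
	}

	free(vec);
	munmap(map, (size_t)st.st_size);
	return resident;
}
```

For a 2TB file with 4KiB pages the vector alone is 512MiB, which is consistent with the sparse-file timings quoted above.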
1 parent 18b1d18 commit ffcb5f5

2 files changed, 103 insertions(+), 48 deletions(-)

include/linux/swap.h

Lines changed: 1 addition & 0 deletions
@@ -368,6 +368,7 @@ static inline void folio_set_swap_entry(struct folio *folio, swp_entry_t entry)
 }
 
 /* linux/mm/workingset.c */
+bool workingset_test_recent(void *shadow, bool file, bool *workingset);
 void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
 void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg);
 void workingset_refault(struct folio *folio, void *shadow);

mm/workingset.c

Lines changed: 102 additions & 48 deletions
@@ -255,6 +255,29 @@ static void *lru_gen_eviction(struct folio *folio)
 	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs);
 }
 
+/*
+ * Tests if the shadow entry is for a folio that was recently evicted.
+ * Fills in @memcgid, @pglist_data, @token, @workingset with the values
+ * unpacked from shadow.
+ */
+static bool lru_gen_test_recent(void *shadow, bool file, int *memcgid,
+		struct pglist_data **pgdat, unsigned long *token, bool *workingset)
+{
+	struct mem_cgroup *eviction_memcg;
+	struct lruvec *lruvec;
+	struct lru_gen_folio *lrugen;
+	unsigned long min_seq;
+
+	unpack_shadow(shadow, memcgid, pgdat, token, workingset);
+	eviction_memcg = mem_cgroup_from_id(*memcgid);
+
+	lruvec = mem_cgroup_lruvec(eviction_memcg, *pgdat);
+	lrugen = &lruvec->lrugen;
+
+	min_seq = READ_ONCE(lrugen->min_seq[file]);
+	return (*token >> LRU_REFS_WIDTH) == (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH));
+}
+
 static void lru_gen_refault(struct folio *folio, void *shadow)
 {
 	int hist, tier, refs;
@@ -269,23 +292,22 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
 	int type = folio_is_file_lru(folio);
 	int delta = folio_nr_pages(folio);
 
-	unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset);
-
-	if (pgdat != folio_pgdat(folio))
-		return;
-
 	rcu_read_lock();
 
+	if (!lru_gen_test_recent(shadow, type, &memcg_id, &pgdat, &token,
+			&workingset))
+		goto unlock;
+
 	memcg = folio_memcg_rcu(folio);
 	if (memcg_id != mem_cgroup_id(memcg))
 		goto unlock;
 
+	if (pgdat != folio_pgdat(folio))
+		goto unlock;
+
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
 	lrugen = &lruvec->lrugen;
-
 	min_seq = READ_ONCE(lrugen->min_seq[type]);
-	if ((token >> LRU_REFS_WIDTH) != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH)))
-		goto unlock;
 
 	hist = lru_hist_from_seq(min_seq);
 	/* see the comment in folio_lru_refs() */
@@ -317,6 +339,12 @@ static void *lru_gen_eviction(struct folio *folio)
 	return NULL;
 }
 
+static bool lru_gen_test_recent(void *shadow, bool file, int *memcgid,
+		struct pglist_data **pgdat, unsigned long *token, bool *workingset)
+{
+	return false;
+}
+
 static void lru_gen_refault(struct folio *folio, void *shadow)
 {
 }
@@ -385,42 +413,34 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
 }
 
 /**
- * workingset_refault - Evaluate the refault of a previously evicted folio.
- * @folio: The freshly allocated replacement folio.
- * @shadow: Shadow entry of the evicted folio.
- *
- * Calculates and evaluates the refault distance of the previously
- * evicted folio in the context of the node and the memcg whose memory
- * pressure caused the eviction.
+ * workingset_test_recent - tests if the shadow entry is for a folio that was
+ * recently evicted. Also fills in @workingset with the value unpacked from
+ * shadow.
+ * @shadow: the shadow entry to be tested.
+ * @file: whether the corresponding folio is from the file lru.
+ * @workingset: where the workingset value unpacked from shadow should
+ * be stored.
+ *
+ * Return: true if the shadow is for a recently evicted folio; false otherwise.
  */
-void workingset_refault(struct folio *folio, void *shadow)
+bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 {
-	bool file = folio_is_file_lru(folio);
 	struct mem_cgroup *eviction_memcg;
 	struct lruvec *eviction_lruvec;
 	unsigned long refault_distance;
 	unsigned long workingset_size;
-	struct pglist_data *pgdat;
-	struct mem_cgroup *memcg;
-	unsigned long eviction;
-	struct lruvec *lruvec;
 	unsigned long refault;
-	bool workingset;
 	int memcgid;
-	long nr;
+	struct pglist_data *pgdat;
+	unsigned long eviction;
 
-	if (lru_gen_enabled()) {
-		lru_gen_refault(folio, shadow);
-		return;
-	}
+	if (lru_gen_enabled())
+		return lru_gen_test_recent(shadow, file, &memcgid, &pgdat, &eviction,
+					workingset);
 
-	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
+	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
 	eviction <<= bucket_order;
 
-	/* Flush stats (and potentially sleep) before holding RCU read lock */
-	mem_cgroup_flush_stats_ratelimited();
-
-	rcu_read_lock();
 	/*
 	 * Look up the memcg associated with the stored ID. It might
 	 * have been deleted since the folio's eviction.
@@ -439,7 +459,8 @@ void workingset_refault(struct folio *folio, void *shadow)
 	 */
 	eviction_memcg = mem_cgroup_from_id(memcgid);
 	if (!mem_cgroup_disabled() && !eviction_memcg)
-		goto out;
+		return false;
+
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
 	refault = atomic_long_read(&eviction_lruvec->nonresident_age);
 
@@ -461,20 +482,6 @@ void workingset_refault(struct folio *folio, void *shadow)
 	 */
 	refault_distance = (refault - eviction) & EVICTION_MASK;
 
-	/*
-	 * The activation decision for this folio is made at the level
-	 * where the eviction occurred, as that is where the LRU order
-	 * during folio reclaim is being determined.
-	 *
-	 * However, the cgroup that will own the folio is the one that
-	 * is actually experiencing the refault event.
-	 */
-	nr = folio_nr_pages(folio);
-	memcg = folio_memcg(folio);
-	pgdat = folio_pgdat(folio);
-	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-
-	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
 	/*
 	 * Compare the distance to the existing workingset size. We
 	 * don't activate pages that couldn't stay resident even if
@@ -495,7 +502,54 @@ void workingset_refault(struct folio *folio, void *shadow)
 						     NR_INACTIVE_ANON);
 		}
 	}
-	if (refault_distance > workingset_size)
+
+	return refault_distance <= workingset_size;
+}
+
+/**
+ * workingset_refault - Evaluate the refault of a previously evicted folio.
+ * @folio: The freshly allocated replacement folio.
+ * @shadow: Shadow entry of the evicted folio.
+ *
+ * Calculates and evaluates the refault distance of the previously
+ * evicted folio in the context of the node and the memcg whose memory
+ * pressure caused the eviction.
+ */
+void workingset_refault(struct folio *folio, void *shadow)
+{
+	bool file = folio_is_file_lru(folio);
+	struct pglist_data *pgdat;
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
+	bool workingset;
+	long nr;
+
+	if (lru_gen_enabled()) {
+		lru_gen_refault(folio, shadow);
+		return;
+	}
+
+	/* Flush stats (and potentially sleep) before holding RCU read lock */
+	mem_cgroup_flush_stats_ratelimited();
+
+	rcu_read_lock();
+
+	/*
+	 * The activation decision for this folio is made at the level
+	 * where the eviction occurred, as that is where the LRU order
+	 * during folio reclaim is being determined.
+	 *
+	 * However, the cgroup that will own the folio is the one that
+	 * is actually experiencing the refault event.
+	 */
+	nr = folio_nr_pages(folio);
+	memcg = folio_memcg(folio);
+	pgdat = folio_pgdat(folio);
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
+
+	if (!workingset_test_recent(shadow, file, &workingset))
 		goto out;
 
 	folio_set_active(folio);
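
The two recency tests that this patch factors out can be tried in userspace with the sketch below. It is an illustration only, not kernel code: the sketch_* names and the SKETCH_* widths are hypothetical (the kernel derives EVICTION_MASK and LRU_REFS_WIDTH from its configuration), bucket_order is assumed to be 0, and the memcg lookup and workingset size computation are replaced by plain arguments.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical widths, chosen only to keep the example readable. */
#define SKETCH_EVICTION_BITS	16
#define SKETCH_EVICTION_MASK	((1UL << SKETCH_EVICTION_BITS) - 1)
#define SKETCH_LRU_REFS_WIDTH	2

/*
 * Classic LRU path, mirroring the tail of workingset_test_recent():
 * @eviction is the masked age counter stored in the shadow entry,
 * @refault is the current age counter. Unsigned subtraction under the
 * mask keeps the distance correct across counter wraparound.
 */
bool sketch_test_recent(unsigned long eviction, unsigned long refault,
			unsigned long workingset_size)
{
	unsigned long refault_distance =
		(refault - eviction) & SKETCH_EVICTION_MASK;

	return refault_distance <= workingset_size;
}

/*
 * MGLRU path, mirroring the return statement of lru_gen_test_recent():
 * the shadow is recent if the sequence number packed into @token still
 * equals the current (truncated) min_seq.
 */
bool sketch_lru_gen_test_recent(unsigned long token, unsigned long min_seq)
{
	return (token >> SKETCH_LRU_REFS_WIDTH) ==
	       (min_seq & (SKETCH_EVICTION_MASK >> SKETCH_LRU_REFS_WIDTH));
}

/* Usage example with a counter that wrapped between eviction and refault. */
void sketch_self_check(void)
{
	unsigned long eviction = 0xFFF0UL & SKETCH_EVICTION_MASK;
	unsigned long refault = 0x10010UL;	/* distance is 0x20 = 32 */
	unsigned long token = (5UL << SKETCH_LRU_REFS_WIDTH) | 1UL;

	assert(sketch_test_recent(eviction, refault, 100));
	assert(!sketch_test_recent(eviction, refault, 10));
	assert(sketch_lru_gen_test_recent(token, 5));
	assert(!sketch_lru_gen_test_recent(token, 6));
}
```

Returning a boolean instead of jumping to a label is what lets a caller without a folio, such as the cachestat code this series prepares for, reuse the check.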
