
Commit ac39cf8

akpm00 authored and torvalds committed
memcg: fix mis-accounting of file mapped racy with migration
FILE_MAPPED per memcg of migrated file cache is not properly updated, because our hook in page_add_file_rmap() can't know to which memcg FILE_MAPPED should be counted.

Basically, this patch fixes that bug, but it also includes some larger changes to clean up related messes.

Currently, when migrating a mapped file, events happen in the following sequence:

 1. Allocate a new page.
 2. Get the memcg of the old page.
 3. Charge against the new page before migration. At this point, no
    changes are made to the new page's page_cgroup and the charge is
    not committed (IOW, the PCG_USED bit is not set).
 4. Page migration replaces the radix-tree entry, swapping the old
    page for the new page.
 5. Page migration remaps the new page if the old page was mapped.
 6. Here, the new page is unlocked.
 7. memcg commits the charge for the new page, marking the new page's
    page_cgroup as PCG_USED.

Because "commit" happens after the page is remapped, we cannot count FILE_MAPPED at step 5: we must not trust page_cgroup->mem_cgroup while the PCG_USED bit is unset. (Note: memcg's LRU removal code does trust it, but LRU-isolation logic helps there. When we overwrite page_cgroup->mem_cgroup, the page_cgroup is not on the LRU, or page_cgroup->mem_cgroup is NULL.)

So we lose the FILE_MAPPED accounting information at step 5, because FILE_MAPPED is updated only when mapcount changes 0->1. We have to catch that event.

BTW, historically, the above implementation comes from handling migration failure of anonymous pages. Because we charge both the old page and the new page with mapcount == 0, we can't catch
 - the page being really freed before remap, or
 - migration failing but the page being freed before remap,
or other corner cases.

The new migration sequence with memcg is:

 1. Allocate a new page.
 2. Mark PageCgroupMigration on the old page.
 3. Charge against the new page onto the old page's memcg (here, the
    new page's pc is marked as PageCgroupUsed).
 4. Page migration replaces the radix-tree entry, page table, etc.
 5. At remapping, the new page's page_cgroup is already marked as
    "USED", so we can catch the 0->1 event and FILE_MAPPED will be
    properly updated. We can also catch a SWAPOUT event after unlock,
    and freeing of the page by unmap() can be caught.
 6. Clear PageCgroupMigration on the old page.

So, FILE_MAPPED will be correctly updated.

Then, what is the MIGRATION flag for? Without it, on migration failure we might have to charge the old page again, because it may have been fully unmapped. "Charge" means we may have to dive into memory reclaim or something similarly complicated, so it's better to avoid charging it again. Before this patch, __commit_charge() worked on both the old and the new page and fixed everything up, but that technique had racy conditions around FILE_MAPPED, SWAPOUT, etc. Now the kernel uses the MIGRATION flag and doesn't uncharge the old page until the end of migration.

I hope this change makes memcg's page migration much simpler. Page migration has caused several troubles; it is worth adding a flag for simplification.

Reviewed-by: Daisuke Nishimura <[email protected]>
Tested-by: Daisuke Nishimura <[email protected]>
Reported-by: Daisuke Nishimura <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
1 parent 315c199 commit ac39cf8
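The ordering argument in the message above can be illustrated with a minimal userspace simulation (not kernel code): a FILE_MAPPED-style counter is only credited when the page_cgroup is committed (PCG_USED set), so committing after the remap loses the 0->1 mapcount event, while committing before it does not. All struct and function names here are hypothetical stand-ins for the kernel's page_add_file_rmap() hook and commit step.

```c
#include <assert.h>

#define PCG_USED (1u << 0)      /* page_cgroup is committed to a memcg */

/* Simplified stand-in for struct page_cgroup plus its memcg counter. */
struct sim_page_cgroup {
        unsigned int flags;
        int file_mapped;        /* per-memcg FILE_MAPPED count (simplified) */
};

/* Hook akin to page_add_file_rmap(): only a committed (USED)
 * page_cgroup can be credited; otherwise the 0->1 event is lost. */
static void remap_hook(struct sim_page_cgroup *pc)
{
        if (pc->flags & PCG_USED)
                pc->file_mapped++;
}

static void commit_charge(struct sim_page_cgroup *pc)
{
        pc->flags |= PCG_USED;
}

/* Old sequence: remap (step 5) happens before commit (step 7),
 * so the accounting event is lost. */
static int old_sequence(void)
{
        struct sim_page_cgroup pc = {0, 0};
        remap_hook(&pc);        /* USED not yet set: increment skipped */
        commit_charge(&pc);
        return pc.file_mapped;
}

/* New sequence: commit (step 3) happens before remap (step 5),
 * so the 0->1 event is caught. */
static int new_sequence(void)
{
        struct sim_page_cgroup pc = {0, 0};
        commit_charge(&pc);
        remap_hook(&pc);        /* USED already set: counted */
        return pc.file_mapped;
}
```

The old ordering yields a counter of 0 despite the page being mapped; the new ordering yields 1, which is the mis-accounting the patch fixes.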

4 files changed: +107 -41 lines changed

include/linux/memcontrol.h

Lines changed: 4 additions & 2 deletions
@@ -90,7 +90,8 @@ int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
 extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *mem);
 
 extern int
-mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr);
+mem_cgroup_prepare_migration(struct page *page,
+	struct page *newpage, struct mem_cgroup **ptr);
 extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
 	struct page *oldpage, struct page *newpage);
 
@@ -227,7 +228,8 @@ static inline struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *mem)
 }
 
 static inline int
-mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
+mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
+	struct mem_cgroup **ptr)
 {
 	return 0;
 }

include/linux/page_cgroup.h

Lines changed: 5 additions & 0 deletions
@@ -40,6 +40,7 @@ enum {
 	PCG_USED, /* this object is in use. */
 	PCG_ACCT_LRU, /* page has been accounted for */
 	PCG_FILE_MAPPED, /* page is accounted as "mapped" */
+	PCG_MIGRATION, /* under page migration */
 };
 
 #define TESTPCGFLAG(uname, lname) \
@@ -79,6 +80,10 @@ SETPCGFLAG(FileMapped, FILE_MAPPED)
 CLEARPCGFLAG(FileMapped, FILE_MAPPED)
 TESTPCGFLAG(FileMapped, FILE_MAPPED)
 
+SETPCGFLAG(Migration, MIGRATION)
+CLEARPCGFLAG(Migration, MIGRATION)
+TESTPCGFLAG(Migration, MIGRATION)
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
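The SETPCGFLAG/CLEARPCGFLAG/TESTPCGFLAG family used above generates accessor functions via token pasting. Here is a userspace sketch of the same pattern, assuming plain (non-atomic) bit arithmetic where the kernel uses atomic bitops on pc->flags; only the macro structure mirrors the real header.

```c
#include <assert.h>

/* Bit indices, mirroring the enum in page_cgroup.h. */
enum {
        PCG_USED,
        PCG_ACCT_LRU,
        PCG_FILE_MAPPED,
        PCG_MIGRATION,          /* under page migration (added by this patch) */
};

struct page_cgroup {
        unsigned long flags;
};

/* Each macro expands to one accessor via ## token pasting; the kernel
 * versions use test_bit/set_bit/clear_bit instead of raw arithmetic. */
#define TESTPCGFLAG(uname, lname) \
static int PageCgroup##uname(struct page_cgroup *pc) \
        { return !!(pc->flags & (1UL << PCG_##lname)); }

#define SETPCGFLAG(uname, lname) \
static void SetPageCgroup##uname(struct page_cgroup *pc) \
        { pc->flags |= (1UL << PCG_##lname); }

#define CLEARPCGFLAG(uname, lname) \
static void ClearPageCgroup##uname(struct page_cgroup *pc) \
        { pc->flags &= ~(1UL << PCG_##lname); }

/* The three lines the patch adds generate SetPageCgroupMigration(),
 * ClearPageCgroupMigration() and PageCgroupMigration(). */
SETPCGFLAG(Migration, MIGRATION)
CLEARPCGFLAG(Migration, MIGRATION)
TESTPCGFLAG(Migration, MIGRATION)
```

This is why the patch only needs three one-line macro invocations to get a full set of flag accessors for PCG_MIGRATION.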

mm/memcontrol.c

Lines changed: 97 additions & 38 deletions
@@ -2258,7 +2258,8 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	switch (ctype) {
 	case MEM_CGROUP_CHARGE_TYPE_MAPPED:
 	case MEM_CGROUP_CHARGE_TYPE_DROP:
-		if (page_mapped(page))
+		/* See mem_cgroup_prepare_migration() */
+		if (page_mapped(page) || PageCgroupMigration(pc))
 			goto unlock_out;
 		break;
 	case MEM_CGROUP_CHARGE_TYPE_SWAPOUT:
@@ -2481,10 +2482,12 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
  * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
  * page belongs to.
  */
-int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
+int mem_cgroup_prepare_migration(struct page *page,
+	struct page *newpage, struct mem_cgroup **ptr)
 {
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
+	enum charge_type ctype;
 	int ret = 0;
 
 	if (mem_cgroup_disabled())
@@ -2495,69 +2498,125 @@ int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
 	if (PageCgroupUsed(pc)) {
 		mem = pc->mem_cgroup;
 		css_get(&mem->css);
+		/*
+		 * At migrating an anonymous page, its mapcount goes down
+		 * to 0 and uncharge() will be called. But, even if it's fully
+		 * unmapped, migration may fail and this page has to be
+		 * charged again. We set MIGRATION flag here and delay uncharge
+		 * until end_migration() is called
+		 *
+		 * Corner Case Thinking
+		 * A)
+		 * When the old page was mapped as Anon and it's unmap-and-freed
+		 * while migration was ongoing.
+		 * If unmap finds the old page, uncharge() of it will be delayed
+		 * until end_migration(). If unmap finds a new page, it's
+		 * uncharged when it make mapcount to be 1->0. If unmap code
+		 * finds swap_migration_entry, the new page will not be mapped
+		 * and end_migration() will find it(mapcount==0).
+		 *
+		 * B)
+		 * When the old page was mapped but migraion fails, the kernel
+		 * remaps it. A charge for it is kept by MIGRATION flag even
+		 * if mapcount goes down to 0. We can do remap successfully
+		 * without charging it again.
+		 *
+		 * C)
+		 * The "old" page is under lock_page() until the end of
+		 * migration, so, the old page itself will not be swapped-out.
+		 * If the new page is swapped out before end_migraton, our
+		 * hook to usual swap-out path will catch the event.
+		 */
+		if (PageAnon(page))
+			SetPageCgroupMigration(pc);
 	}
 	unlock_page_cgroup(pc);
+	/*
+	 * If the page is not charged at this point,
+	 * we return here.
+	 */
+	if (!mem)
+		return 0;
 
 	*ptr = mem;
-	if (mem) {
-		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, ptr, false);
-		css_put(&mem->css);
+	ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, ptr, false);
+	css_put(&mem->css);/* drop extra refcnt */
+	if (ret || *ptr == NULL) {
+		if (PageAnon(page)) {
+			lock_page_cgroup(pc);
+			ClearPageCgroupMigration(pc);
+			unlock_page_cgroup(pc);
+			/*
+			 * The old page may be fully unmapped while we kept it.
+			 */
+			mem_cgroup_uncharge_page(page);
+		}
+		return -ENOMEM;
 	}
+	/*
+	 * We charge new page before it's used/mapped. So, even if unlock_page()
+	 * is called before end_migration, we can catch all events on this new
+	 * page. In the case new page is migrated but not remapped, new page's
+	 * mapcount will be finally 0 and we call uncharge in end_migration().
+	 */
+	pc = lookup_page_cgroup(newpage);
+	if (PageAnon(page))
+		ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
+	else if (page_is_file_cache(page))
+		ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
+	else
+		ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
+	__mem_cgroup_commit_charge(mem, pc, ctype);
 	return ret;
 }
 
 /* remove redundant charge if migration failed*/
 void mem_cgroup_end_migration(struct mem_cgroup *mem,
-	struct page *oldpage, struct page *newpage)
+				struct page *oldpage, struct page *newpage)
 {
-	struct page *target, *unused;
+	struct page *used, *unused;
 	struct page_cgroup *pc;
-	enum charge_type ctype;
 
 	if (!mem)
 		return;
+	/* blocks rmdir() */
 	cgroup_exclude_rmdir(&mem->css);
 	/* at migration success, oldpage->mapping is NULL. */
 	if (oldpage->mapping) {
-		target = oldpage;
-		unused = NULL;
+		used = oldpage;
+		unused = newpage;
 	} else {
-		target = newpage;
+		used = newpage;
 		unused = oldpage;
 	}
-
-	if (PageAnon(target))
-		ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
-	else if (page_is_file_cache(target))
-		ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
-	else
-		ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
-
-	/* unused page is not on radix-tree now. */
-	if (unused)
-		__mem_cgroup_uncharge_common(unused, ctype);
-
-	pc = lookup_page_cgroup(target);
 	/*
-	 * __mem_cgroup_commit_charge() check PCG_USED bit of page_cgroup.
-	 * So, double-counting is effectively avoided.
+	 * We disallowed uncharge of pages under migration because mapcount
+	 * of the page goes down to zero, temporarly.
+	 * Clear the flag and check the page should be charged.
 	 */
-	__mem_cgroup_commit_charge(mem, pc, ctype);
+	pc = lookup_page_cgroup(oldpage);
+	lock_page_cgroup(pc);
+	ClearPageCgroupMigration(pc);
+	unlock_page_cgroup(pc);
 
+	if (unused != oldpage)
+		pc = lookup_page_cgroup(unused);
+	__mem_cgroup_uncharge_common(unused, MEM_CGROUP_CHARGE_TYPE_FORCE);
+
+	pc = lookup_page_cgroup(used);
 	/*
-	 * Both of oldpage and newpage are still under lock_page().
-	 * Then, we don't have to care about race in radix-tree.
-	 * But we have to be careful that this page is unmapped or not.
-	 *
-	 * There is a case for !page_mapped(). At the start of
-	 * migration, oldpage was mapped. But now, it's zapped.
-	 * But we know *target* page is not freed/reused under us.
-	 * mem_cgroup_uncharge_page() does all necessary checks.
+	 * If a page is a file cache, radix-tree replacement is very atomic
+	 * and we can skip this check. When it was an Anon page, its mapcount
+	 * goes down to 0. But because we added MIGRATION flage, it's not
+	 * uncharged yet. There are several case but page->mapcount check
+	 * and USED bit check in mem_cgroup_uncharge_page() will do enough
	 * check. (see prepare_charge() also)
 	 */
-	if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
-		mem_cgroup_uncharge_page(target);
+	if (PageAnon(used))
+		mem_cgroup_uncharge_page(used);
 	/*
-	 * At migration, we may charge account against cgroup which has no tasks
+	 * At migration, we may charge account against cgroup which has no
+	 * tasks.
 	 * So, rmdir()->pre_destroy() can be called while we do this charge.
 	 * In that case, we need to call pre_destroy() again. check it here.
 	 */
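The used/unused selection at the top of the rewritten mem_cgroup_end_migration() hinges on one fact noted in the diff: on migration success, oldpage->mapping is NULL. A standalone sketch of just that selection logic (hypothetical helper name, minimal struct with only the field the check needs):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in: only the field the selection logic inspects. */
struct page {
        void *mapping;
};

/* Returns the page whose charge must be kept ("used"); the other page
 * is stored in *unused.  Mirrors the oldpage->mapping check in
 * mem_cgroup_end_migration(): NULL mapping means migration succeeded
 * and the new page is the live one. */
static struct page *pick_used(struct page *oldpage, struct page *newpage,
                              struct page **unused)
{
        if (oldpage->mapping) {         /* migration failed */
                *unused = newpage;
                return oldpage;
        }
        *unused = oldpage;              /* migration succeeded */
        return newpage;
}
```

The unused page is then force-uncharged (MEM_CGROUP_CHARGE_TYPE_FORCE), while the used page is uncharged only if it is anonymous and fully unmapped.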

mm/migrate.c

Lines changed: 1 addition & 1 deletion
@@ -590,7 +590,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	}
 
 	/* charge against new page */
-	charge = mem_cgroup_prepare_migration(page, &mem);
+	charge = mem_cgroup_prepare_migration(page, newpage, &mem);
 	if (charge == -ENOMEM) {
 		rc = -ENOMEM;
 		goto unlock;
