Commit 9a1ea43

Hugh Dickins authored and torvalds committed
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along since 2006: but you cannot safely wait_on_page_locked() without holding a reference to the page, and that extra reference is enough to make migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults on the entry before migrate_page_move_mapping() gets there.

And that failure is retried nine times, amplifying the pain when trying to migrate a popular page. With a single persistent faulter, migration sometimes succeeds; with two or three concurrent faulters, success becomes much less likely (and the more the page was mapped, the worse the overhead of unmapping and remapping it on each try).

This is especially a problem for memory offlining, where the outer level retries forever (or until terminated from userspace), because a heavy refault workload can trigger an endless loop of migration failures. wait_on_page_locked() is the wrong tool for the job.

David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2

Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2
where Kan Liang went on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2
and the thread ended up with the v4.14 commits:
2554db9 ("sched/wait: Break up long wake list walk")
11a19c7 ("sched/wait: Introduce wakeup boomark in wake_up_page_bit")

Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2

We have all assumed that it is essential to hold a page reference while waiting on a page lock: partly to guarantee that there is still a struct page when MEMORY_HOTREMOVE is configured, but also to protect against reuse of the struct page going to someone who then holds the page locked indefinitely, when the waiter can reasonably expect timely unlocking.

But in fact, so long as wait_on_page_bit_common() does the put_page(), and is careful not to rely on struct page contents thereafter, there is no need to hold a reference to the page while waiting on it. That does mean that this case cannot go back through the loop: but that's fine for the page migration case, and even if used more widely, is limited by the "Stop walking if it's locked" optimization in wake_page_function().

Add interface put_and_wait_on_page_locked() to do this, using "behavior" enum in place of "lock" arg to wait_on_page_bit_common() to implement it. No interruptible or killable variant needed yet, but they might follow: I have a vague notion that reporting -EINTR should take precedence over return from wait_on_page_bit_common() without knowing the page state, so arrange it accordingly - but that may be nothing but pedantic.

__migration_entry_wait() still has to take a brief reference to the page, prior to calling put_and_wait_on_page_locked(): but now that it is dropped before waiting, the chance of impeding page migration is very much reduced. Should we perhaps disable preemption across this?

shrink_page_list()'s __ClearPageLocked(): that was a surprise! This survived a lot of testing before that showed up. PageWaiters may have been set by wait_on_page_bit_common(), and the reference dropped, just before shrink_page_list() succeeds in freezing its last page reference: in such a case, unlock_page() must be used. Follow the suggestion from Michal Hocko, just revert a978d6f ("mm: unlockless reclaim") now: that optimization predates PageWaiters, and won't buy much these days; but we can reinstate it for the !PageWaiters case if anyone notices.

It does raise the question: should vmscan.c's is_page_cache_freeable() and __remove_mapping() now treat a PageWaiters page as if an extra reference were held? Perhaps, but I don't think it matters much, since shrink_page_list() already had to win its trylock_page(), so waiters are not very common there: I noticed no difference when trying the bigger change, and it's surely not needed while put_and_wait_on_page_locked() is only used for page migration.

[[email protected]: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Reported-by: Baoquan He <[email protected]>
Tested-by: Baoquan He <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: David Herrmann <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Nick Piggin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
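Condensed from the mm/migrate.c hunk below, the caller-side change in __migration_entry_wait() boils down to the pattern sketched here (an illustrative before/after fragment, not the complete function):

	/* Before: the waiter's extra reference was held across the whole wait,
	 * which is what let it make migrate_page_move_mapping() fail with -EAGAIN. */
	if (!get_page_unless_zero(page))
		goto out;		/* the fault will simply be retried */
	pte_unmap_unlock(ptep, ptl);
	wait_on_page_locked(page);
	put_page(page);

	/* After: wait_on_page_bit_common(..., DROP) does the put_page() itself,
	 * before sleeping, so the waiter no longer inflates the refcount that
	 * migration checks; the page must not be dereferenced afterwards. */
	if (!get_page_unless_zero(page))
		goto out;
	pte_unmap_unlock(ptep, ptl);
	put_and_wait_on_page_locked(page);
	return;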
1 parent f0c867d commit 9a1ea43

5 files changed: +84 −33 lines

include/linux/pagemap.h

Lines changed: 2 additions & 0 deletions
@@ -537,6 +537,8 @@ static inline int wait_on_page_locked_killable(struct page *page)
 	return wait_on_page_bit_killable(compound_head(page), PG_locked);
 }
 
+extern void put_and_wait_on_page_locked(struct page *page);
+
 /*
  * Wait for a page to complete writeback
  */

mm/filemap.c

Lines changed: 74 additions & 13 deletions
@@ -981,7 +981,14 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
 	if (wait_page->bit_nr != key->bit_nr)
 		return 0;
 
-	/* Stop walking if it's locked */
+	/*
+	 * Stop walking if it's locked.
+	 * Is this safe if put_and_wait_on_page_locked() is in use?
+	 * Yes: the waker must hold a reference to this page, and if PG_locked
+	 * has now already been set by another task, that task must also hold
+	 * a reference to the *same usage* of this page; so there is no need
+	 * to walk on to wake even the put_and_wait_on_page_locked() callers.
+	 */
 	if (test_bit(key->bit_nr, &key->page->flags))
 		return -1;
 
@@ -1049,25 +1056,44 @@ static void wake_up_page(struct page *page, int bit)
 	wake_up_page_bit(page, bit);
 }
 
+/*
+ * A choice of three behaviors for wait_on_page_bit_common():
+ */
+enum behavior {
+	EXCLUSIVE,	/* Hold ref to page and take the bit when woken, like
+			 * __lock_page() waiting on then setting PG_locked.
+			 */
+	SHARED,		/* Hold ref to page and check the bit when woken, like
+			 * wait_on_page_writeback() waiting on PG_writeback.
+			 */
+	DROP,		/* Drop ref to page before wait, no check when woken,
+			 * like put_and_wait_on_page_locked() on PG_locked.
+			 */
+};
+
 static inline int wait_on_page_bit_common(wait_queue_head_t *q,
-	struct page *page, int bit_nr, int state, bool lock)
+	struct page *page, int bit_nr, int state, enum behavior behavior)
 {
 	struct wait_page_queue wait_page;
 	wait_queue_entry_t *wait = &wait_page.wait;
+	bool bit_is_set;
 	bool thrashing = false;
+	bool delayacct = false;
 	unsigned long pflags;
 	int ret = 0;
 
 	if (bit_nr == PG_locked &&
 	    !PageUptodate(page) && PageWorkingset(page)) {
-		if (!PageSwapBacked(page))
+		if (!PageSwapBacked(page)) {
 			delayacct_thrashing_start();
+			delayacct = true;
+		}
 		psi_memstall_enter(&pflags);
 		thrashing = true;
 	}
 
 	init_wait(wait);
-	wait->flags = lock ? WQ_FLAG_EXCLUSIVE : 0;
+	wait->flags = behavior == EXCLUSIVE ? WQ_FLAG_EXCLUSIVE : 0;
 	wait->func = wake_page_function;
 	wait_page.page = page;
 	wait_page.bit_nr = bit_nr;
@@ -1084,14 +1110,17 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 
 		spin_unlock_irq(&q->lock);
 
-		if (likely(test_bit(bit_nr, &page->flags))) {
+		bit_is_set = test_bit(bit_nr, &page->flags);
+		if (behavior == DROP)
+			put_page(page);
+
+		if (likely(bit_is_set))
 			io_schedule();
-		}
 
-		if (lock) {
+		if (behavior == EXCLUSIVE) {
 			if (!test_and_set_bit_lock(bit_nr, &page->flags))
 				break;
-		} else {
+		} else if (behavior == SHARED) {
 			if (!test_bit(bit_nr, &page->flags))
 				break;
 		}
@@ -1100,12 +1129,23 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 			ret = -EINTR;
 			break;
 		}
+
+		if (behavior == DROP) {
+			/*
+			 * We can no longer safely access page->flags:
+			 * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
+			 * there is a risk of waiting forever on a page reused
+			 * for something that keeps it locked indefinitely.
+			 * But best check for -EINTR above before breaking.
+			 */
+			break;
+		}
 	}
 
 	finish_wait(q, wait);
 
 	if (thrashing) {
-		if (!PageSwapBacked(page))
+		if (delayacct)
 			delayacct_thrashing_end();
 		psi_memstall_leave(&pflags);
 	}
@@ -1124,17 +1164,36 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 void wait_on_page_bit(struct page *page, int bit_nr)
 {
 	wait_queue_head_t *q = page_waitqueue(page);
-	wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, false);
+	wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, SHARED);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
 int wait_on_page_bit_killable(struct page *page, int bit_nr)
 {
 	wait_queue_head_t *q = page_waitqueue(page);
-	return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, false);
+	return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, SHARED);
 }
 EXPORT_SYMBOL(wait_on_page_bit_killable);
 
+/**
+ * put_and_wait_on_page_locked - Drop a reference and wait for it to be unlocked
+ * @page: The page to wait for.
+ *
+ * The caller should hold a reference on @page.  They expect the page to
+ * become unlocked relatively soon, but do not wish to hold up migration
+ * (for example) by holding the reference while waiting for the page to
+ * come unlocked.  After this function returns, the caller should not
+ * dereference @page.
+ */
+void put_and_wait_on_page_locked(struct page *page)
+{
+	wait_queue_head_t *q;
+
+	page = compound_head(page);
+	q = page_waitqueue(page);
+	wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, DROP);
+}
+
 /**
  * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
  * @page: Page defining the wait queue of interest
@@ -1264,15 +1323,17 @@ void __lock_page(struct page *__page)
 {
 	struct page *page = compound_head(__page);
 	wait_queue_head_t *q = page_waitqueue(page);
-	wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, true);
+	wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE,
+				EXCLUSIVE);
 }
 EXPORT_SYMBOL(__lock_page);
 
 int __lock_page_killable(struct page *__page)
 {
 	struct page *page = compound_head(__page);
 	wait_queue_head_t *q = page_waitqueue(page);
-	return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE, true);
+	return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE,
+				EXCLUSIVE);
 }
 EXPORT_SYMBOL_GPL(__lock_page_killable);
 
mm/huge_memory.c

Lines changed: 2 additions & 4 deletions
@@ -1490,8 +1490,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	if (!get_page_unless_zero(page))
 		goto out_unlock;
 	spin_unlock(vmf->ptl);
-	wait_on_page_locked(page);
-	put_page(page);
+	put_and_wait_on_page_locked(page);
 	goto out;
 }
 
@@ -1527,8 +1526,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	if (!get_page_unless_zero(page))
 		goto out_unlock;
 	spin_unlock(vmf->ptl);
-	wait_on_page_locked(page);
-	put_page(page);
+	put_and_wait_on_page_locked(page);
 	goto out;
 }
 
mm/migrate.c

Lines changed: 4 additions & 8 deletions
@@ -327,16 +327,13 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 
 	/*
 	 * Once page cache replacement of page migration started, page_count
-	 * *must* be zero. And, we don't want to call wait_on_page_locked()
-	 * against a page without get_page().
-	 * So, we use get_page_unless_zero(), here. Even failed, page fault
-	 * will occur again.
+	 * is zero; but we must not call put_and_wait_on_page_locked() without
+	 * a ref. Use get_page_unless_zero(), and just fault again if it fails.
 	 */
 	if (!get_page_unless_zero(page))
 		goto out;
 	pte_unmap_unlock(ptep, ptl);
-	wait_on_page_locked(page);
-	put_page(page);
+	put_and_wait_on_page_locked(page);
 	return;
 out:
 	pte_unmap_unlock(ptep, ptl);
@@ -370,8 +367,7 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
 	if (!get_page_unless_zero(page))
 		goto unlock;
 	spin_unlock(ptl);
-	wait_on_page_locked(page);
-	put_page(page);
+	put_and_wait_on_page_locked(page);
 	return;
 unlock:
 	spin_unlock(ptl);

mm/vmscan.c

Lines changed: 2 additions & 8 deletions
@@ -1460,14 +1460,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			count_memcg_page_event(page, PGLAZYFREED);
 		} else if (!mapping || !__remove_mapping(mapping, page, true))
 			goto keep_locked;
-		/*
-		 * At this point, we have no other references and there is
-		 * no way to pick any more up (removed from LRU, removed
-		 * from pagecache). Can use non-atomic bitops now (and
-		 * we obviously don't have to worry about waking up a process
-		 * waiting on the page lock, because there are no references.
-		 */
-		__ClearPageLocked(page);
+
+		unlock_page(page);
 free_it:
 		nr_reclaimed++;
 
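A note on the vmscan.c hunk above: __ClearPageLocked() is a non-atomic clear that never looks at PG_waiters, so a put_and_wait_on_page_locked() sleeper that set PG_waiters and dropped its reference just before reclaim froze the last page reference would never be woken. Roughly, and simplified from the kernel's unlock path:

	/* Old: safe only if nobody can be waiting; wakes nobody. */
	__ClearPageLocked(page);

	/* New: unlock_page() clears PG_locked atomically and, when PG_waiters
	 * is set (e.g. by a DROP-mode waiter), calls wake_up_page_bit(page,
	 * PG_locked) so that sleeper is not left waiting forever. */
	unlock_page(page);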
