Commit e286781

Nick Piggin authored and torvalds committed
mm: speculative page references
If we can be sure that elevating the page_count on a pagecache page will pin it, we can speculatively run this operation, and subsequently check to see if we hit the right page rather than relying on holding a lock or otherwise pinning a reference to the page.

This can be done if get_page/put_page behaves consistently throughout the whole tree (ie. if we "get" the page after it has been used for something else, we must be able to free it with a put_page).

Actually, there is a period where the count behaves differently: when the page is free or if it is a constituent page of a compound page. We need an atomic_inc_not_zero operation to ensure we don't try to grab the page in either case.

This patch introduces the core locking protocol to the pagecache (ie. adds page_cache_get_speculative, and tweaks some update-side code to make it work).

Thanks to Hugh for pointing out an improvement to the algorithm setting page_count to zero when we have control of all references, in order to hold off speculative getters.

[[email protected]: fix migration_entry_wait()]
[[email protected]: fix add_to_page_cache]
[[email protected]: repair a comment]
Signed-off-by: Nick Piggin <[email protected]>
Cc: Jeff Garzik <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Reviewed-by: Peter Zijlstra <[email protected]>
Signed-off-by: Daisuke Nishimura <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Nick Piggin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
1 parent 47feff2 commit e286781
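For orientation, the lookup side that this protocol enables looks roughly like the sketch below. It is not part of this commit: the conversion of find_get_page and friends is not in the diff that follows, the function name find_get_page_speculative is made up for illustration, and the sketch assumes the mapping's radix tree can already be walked safely under rcu_read_lock().

/* Hypothetical illustration only; not taken from this commit. */
static struct page *find_get_page_speculative(struct address_space *mapping,
					      pgoff_t offset)
{
	struct page *page;

	rcu_read_lock();
repeat:
	/* 1. find page in radix tree (assumes an RCU-safe lookup) */
	page = radix_tree_lookup(&mapping->page_tree, offset);
	if (page) {
		/* 2. conditionally increment refcount */
		if (!page_cache_get_speculative(page))
			goto repeat;	/* page was free or being freed: retry */

		/* 3. check the page is still in pagecache (if not, goto 1) */
		if (unlikely(page != radix_tree_lookup(&mapping->page_tree,
						       offset))) {
			/* drop the invalid speculative reference and retry */
			page_cache_release(page);
			goto repeat;
		}
	}
	rcu_read_unlock();
	return page;
}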

File tree

7 files changed: +227, -45 lines changed

drivers/net/cassini.c

Lines changed: 12 additions & 0 deletions
@@ -576,6 +576,18 @@ static void cas_spare_recover(struct cas *cp, const gfp_t flags)
 	list_for_each_safe(elem, tmp, &list) {
 		cas_page_t *page = list_entry(elem, cas_page_t, list);
 
+		/*
+		 * With the lockless pagecache, cassini buffering scheme gets
+		 * slightly less accurate: we might find that a page has an
+		 * elevated reference count here, due to a speculative ref,
+		 * and skip it as in-use. Ideally we would be able to reclaim
+		 * it. However this would be such a rare case, it doesn't
+		 * matter too much as we should pick it up the next time round.
+		 *
+		 * Importantly, if we find that the page has a refcount of 1
+		 * here (our refcount), then we know it is definitely not inuse
+		 * so we can reuse it.
+		 */
 		if (page_count(page->buffer) > 1)
 			continue;
 
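To restate the rule that the comment above encodes, here is a purely illustrative helper; it is not part of the patch and cas_buffer_reusable is a made-up name.

/* Hypothetical reading aid; not from this commit. */
static inline int cas_buffer_reusable(cas_page_t *page)
{
	/*
	 * A count of 1 means only cassini's own buffer reference remains,
	 * so the buffer is definitely unused and can be reused.  A higher
	 * count may be a real user or a transient speculative reference,
	 * so the caller skips the buffer and retries on a later pass.
	 */
	return page_count(page->buffer) == 1;
}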

include/linux/pagemap.h

Lines changed: 110 additions & 1 deletion
@@ -12,6 +12,7 @@
 #include <asm/uaccess.h>
 #include <linux/gfp.h>
 #include <linux/bitops.h>
+#include <linux/hardirq.h> /* for in_interrupt() */
 
 /*
  * Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page
@@ -62,6 +63,98 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 #define page_cache_release(page)	put_page(page)
 void release_pages(struct page **pages, int nr, int cold);
 
+/*
+ * speculatively take a reference to a page.
+ * If the page is free (_count == 0), then _count is untouched, and 0
+ * is returned. Otherwise, _count is incremented by 1 and 1 is returned.
+ *
+ * This function must be called inside the same rcu_read_lock() section as has
+ * been used to lookup the page in the pagecache radix-tree (or page table):
+ * this allows allocators to use a synchronize_rcu() to stabilize _count.
+ *
+ * Unless an RCU grace period has passed, the count of all pages coming out
+ * of the allocator must be considered unstable. page_count may return higher
+ * than expected, and put_page must be able to do the right thing when the
+ * page has been finished with, no matter what it is subsequently allocated
+ * for (because put_page is what is used here to drop an invalid speculative
+ * reference).
+ *
+ * This is the interesting part of the lockless pagecache (and lockless
+ * get_user_pages) locking protocol, where the lookup-side (eg. find_get_page)
+ * has the following pattern:
+ * 1. find page in radix tree
+ * 2. conditionally increment refcount
+ * 3. check the page is still in pagecache (if no, goto 1)
+ *
+ * Remove-side that cares about stability of _count (eg. reclaim) has the
+ * following (with tree_lock held for write):
+ * A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg)
+ * B. remove page from pagecache
+ * C. free the page
+ *
+ * There are 2 critical interleavings that matter:
+ * - 2 runs before A: in this case, A sees elevated refcount and bails out
+ * - A runs before 2: in this case, 2 sees zero refcount and retries;
+ *   subsequently, B will complete and 1 will find no page, causing the
+ *   lookup to return NULL.
+ *
+ * It is possible that between 1 and 2, the page is removed then the exact same
+ * page is inserted into the same position in pagecache. That's OK: the
+ * old find_get_page using tree_lock could equally have run before or after
+ * such a re-insertion, depending on order that locks are granted.
+ *
+ * Lookups racing against pagecache insertion isn't a big problem: either 1
+ * will find the page or it will not. Likewise, the old find_get_page could run
+ * either before the insertion or afterwards, depending on timing.
+ */
+static inline int page_cache_get_speculative(struct page *page)
+{
+	VM_BUG_ON(in_interrupt());
+
+#if !defined(CONFIG_SMP) && defined(CONFIG_CLASSIC_RCU)
+# ifdef CONFIG_PREEMPT
+	VM_BUG_ON(!in_atomic());
+# endif
+	/*
+	 * Preempt must be disabled here - we rely on rcu_read_lock doing
+	 * this for us.
+	 *
+	 * Pagecache won't be truncated from interrupt context, so if we have
+	 * found a page in the radix tree here, we have pinned its refcount by
+	 * disabling preempt, and hence no need for the "speculative get" that
+	 * SMP requires.
+	 */
+	VM_BUG_ON(page_count(page) == 0);
+	atomic_inc(&page->_count);
+
+#else
+	if (unlikely(!get_page_unless_zero(page))) {
+		/*
+		 * Either the page has been freed, or will be freed.
+		 * In either case, retry here and the caller should
+		 * do the right thing (see comments above).
+		 */
+		return 0;
+	}
+#endif
+	VM_BUG_ON(PageTail(page));
+
+	return 1;
+}
+
+static inline int page_freeze_refs(struct page *page, int count)
+{
+	return likely(atomic_cmpxchg(&page->_count, count, 0) == count);
+}
+
+static inline void page_unfreeze_refs(struct page *page, int count)
+{
+	VM_BUG_ON(page_count(page) != 0);
+	VM_BUG_ON(count == 0);
+
+	atomic_set(&page->_count, count);
+}
+
 #ifdef CONFIG_NUMA
 extern struct page *__page_cache_alloc(gfp_t gfp);
 #else
@@ -133,13 +226,29 @@ static inline struct page *read_mapping_page(struct address_space *mapping,
 	return read_cache_page(mapping, index, filler, data);
 }
 
-int add_to_page_cache(struct page *page, struct address_space *mapping,
+int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 extern void remove_from_page_cache(struct page *page);
 extern void __remove_from_page_cache(struct page *page);
 
+/*
+ * Like add_to_page_cache_locked, but used to add newly allocated pages:
+ * the page is new, so we can just run SetPageLocked() against it.
+ */
+static inline int add_to_page_cache(struct page *page,
+		struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask)
+{
+	int error;
+
+	SetPageLocked(page);
+	error = add_to_page_cache_locked(page, mapping, offset, gfp_mask);
+	if (unlikely(error))
+		ClearPageLocked(page);
+	return error;
+}
+
 /*
  * Return byte-offset into filesystem object for page.
  */
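The comment block added above only sketches the remove side (steps A to C) in words. Below is an assumed illustration, not quoted from this commit, of how a reclaim-style remover could apply page_freeze_refs()/page_unfreeze_refs(); the function name remove_mapping_sketch is hypothetical.

/* Assumed sketch; hypothetical name, not taken from this commit. */
static int remove_mapping_sketch(struct address_space *mapping,
				 struct page *page)
{
	BUG_ON(!PageLocked(page));

	write_lock_irq(&mapping->tree_lock);

	/*
	 * A. Expect exactly two references: the caller's and the pagecache's.
	 * Freezing _count to 0 holds off speculative getters; if someone
	 * already took a speculative reference, bail out and keep the page.
	 */
	if (!page_freeze_refs(page, 2)) {
		write_unlock_irq(&mapping->tree_lock);
		return 0;
	}

	/* B. Remove from pagecache while _count is pinned at zero. */
	__remove_from_page_cache(page);
	write_unlock_irq(&mapping->tree_lock);

	/*
	 * C. Unfreeze to 1 rather than 2: that silently drops the pagecache
	 * reference, and the caller's remaining reference is used to free
	 * the page.
	 */
	page_unfreeze_refs(page, 1);
	return 1;
}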

mm/filemap.c

Lines changed: 18 additions & 14 deletions
@@ -442,39 +442,43 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 }
 
 /**
- * add_to_page_cache - add newly allocated pagecache pages
+ * add_to_page_cache_locked - add a locked page to the pagecache
  * @page:	page to add
  * @mapping:	the page's address_space
  * @offset:	page index
  * @gfp_mask:	page allocation mode
  *
- * This function is used to add newly allocated pagecache pages;
- * the page is new, so we can just run SetPageLocked() against it.
- * The other page state flags were set by rmqueue().
- *
+ * This function is used to add a page to the pagecache. It must be locked.
  * This function does not add the page to the LRU.  The caller must do that.
  */
-int add_to_page_cache(struct page *page, struct address_space *mapping,
+int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 		pgoff_t offset, gfp_t gfp_mask)
 {
-	int error = mem_cgroup_cache_charge(page, current->mm,
+	int error;
+
+	VM_BUG_ON(!PageLocked(page));
+
+	error = mem_cgroup_cache_charge(page, current->mm,
 					gfp_mask & ~__GFP_HIGHMEM);
 	if (error)
 		goto out;
 
 	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error == 0) {
+		page_cache_get(page);
+		page->mapping = mapping;
+		page->index = offset;
+
 		write_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
-		if (!error) {
-			page_cache_get(page);
-			SetPageLocked(page);
-			page->mapping = mapping;
-			page->index = offset;
+		if (likely(!error)) {
 			mapping->nrpages++;
 			__inc_zone_page_state(page, NR_FILE_PAGES);
-		} else
+		} else {
+			page->mapping = NULL;
 			mem_cgroup_uncharge_cache_page(page);
+			page_cache_release(page);
+		}
 
 		write_unlock_irq(&mapping->tree_lock);
 		radix_tree_preload_end();
@@ -483,7 +487,7 @@ int add_to_page_cache(struct page *page, struct address_space *mapping,
 out:
 	return error;
 }
-EXPORT_SYMBOL(add_to_page_cache);
+EXPORT_SYMBOL(add_to_page_cache_locked);
 
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t offset, gfp_t gfp_mask)

mm/migrate.c

Lines changed: 18 additions & 2 deletions
@@ -285,7 +285,15 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 
 	page = migration_entry_to_page(entry);
 
-	get_page(page);
+	/*
+	 * Once radix-tree replacement of page migration started, page_count
+	 * *must* be zero. And, we don't want to call wait_on_page_locked()
+	 * against a page without get_page().
+	 * So, we use get_page_unless_zero(), here. Even failed, page fault
+	 * will occur again.
+	 */
+	if (!get_page_unless_zero(page))
+		goto out;
 	pte_unmap_unlock(ptep, ptl);
 	wait_on_page_locked(page);
 	put_page(page);
@@ -305,6 +313,7 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 static int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page)
 {
+	int expected_count;
 	void **pslot;
 
 	if (!mapping) {
@@ -319,12 +328,18 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 	pslot = radix_tree_lookup_slot(&mapping->page_tree,
 					page_index(page));
 
-	if (page_count(page) != 2 + !!PagePrivate(page) ||
+	expected_count = 2 + !!PagePrivate(page);
+	if (page_count(page) != expected_count ||
 			(struct page *)radix_tree_deref_slot(pslot) != page) {
 		write_unlock_irq(&mapping->tree_lock);
 		return -EAGAIN;
 	}
 
+	if (!page_freeze_refs(page, expected_count)) {
+		write_unlock_irq(&mapping->tree_lock);
+		return -EAGAIN;
+	}
+
 	/*
 	 * Now we know that no one else is looking at the page.
 	 */
@@ -338,6 +353,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 
 	radix_tree_replace_slot(pslot, newpage);
 
+	page_unfreeze_refs(page, expected_count);
 	/*
 	 * Drop cache reference from old page.
 	 * We know this isn't the last reference.

mm/shmem.c

Lines changed: 3 additions & 3 deletions
@@ -936,7 +936,7 @@ static int shmem_unuse_inode(struct shmem_inode_info *info, swp_entry_t entry, s
 	spin_lock(&info->lock);
 	ptr = shmem_swp_entry(info, idx, NULL);
 	if (ptr && ptr->val == entry.val) {
-		error = add_to_page_cache(page, inode->i_mapping,
+		error = add_to_page_cache_locked(page, inode->i_mapping,
 						idx, GFP_NOWAIT);
 		/* does mem_cgroup_uncharge_cache_page on error */
 	} else	/* we must compensate for our precharge above */
@@ -1301,8 +1301,8 @@ static int shmem_getpage(struct inode *inode, unsigned long idx,
 			SetPageUptodate(filepage);
 			set_page_dirty(filepage);
 			swap_free(swap);
-		} else if (!(error = add_to_page_cache(
-				swappage, mapping, idx, GFP_NOWAIT))) {
+		} else if (!(error = add_to_page_cache_locked(swappage, mapping,
+					idx, GFP_NOWAIT))) {
 			info->flags |= SHMEM_PAGEIN;
 			shmem_swp_set(info, entry, 0);
 			shmem_swp_unmap(entry);

mm/swap_state.c

Lines changed: 12 additions & 5 deletions
@@ -64,7 +64,7 @@ void show_swap_cache_info(void)
 }
 
 /*
- * add_to_swap_cache resembles add_to_page_cache on swapper_space,
+ * add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
  * but sets SwapCache flag and private instead of mapping and index.
  */
 int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
@@ -76,19 +76,26 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
 	BUG_ON(PagePrivate(page));
 	error = radix_tree_preload(gfp_mask);
 	if (!error) {
+		page_cache_get(page);
+		SetPageSwapCache(page);
+		set_page_private(page, entry.val);
+
 		write_lock_irq(&swapper_space.tree_lock);
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
-		if (!error) {
-			page_cache_get(page);
-			SetPageSwapCache(page);
-			set_page_private(page, entry.val);
+		if (likely(!error)) {
 			total_swapcache_pages++;
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 			INC_CACHE_INFO(add_total);
 		}
 		write_unlock_irq(&swapper_space.tree_lock);
 		radix_tree_preload_end();
+
+		if (unlikely(error)) {
+			set_page_private(page, 0UL);
+			ClearPageSwapCache(page);
+			page_cache_release(page);
+		}
 	}
 	return error;
 }
