Commit 2c653d0

aagit authored and torvalds committed
ksm: introduce ksm_max_page_sharing per page deduplication limit
Without a max deduplication limit for each KSM page, the list of the rmap_items associated to each stable_node can grow infinitely large.

During the rmap walk each entry can take up to ~10usec to process because of IPIs for the TLB flushing (both for the primary MMU and the secondary MMUs with the MMU notifier). With only 16GB of address space shared in the same KSM page, that would amount to dozens of seconds of kernel runtime.

A ~256 max deduplication factor will reduce the latencies of the rmap walks on KSM pages to the order of a few msec. Just doing the cond_resched() during the rmap walks is not enough; the list size must have a limit too, otherwise the caller could get blocked in (schedule friendly) kernel computations for seconds, unexpectedly.

There's room for optimization to significantly reduce the IPI delivery cost during the page_referenced(), but at least for page_migration in the KSM case (used by hard NUMA bindings, compaction and NUMA balancing) it may be inevitable to send lots of IPIs if each rmap_item->mm is active on a different CPU and there are lots of CPUs. Even if we ignore the IPI delivery cost, we still have to walk the whole KSM rmap list, so we can't allow millions or billions (unlimited) number of entries in the KSM stable_node rmap_item lists.

The limit is enforced efficiently by adding a second dimension to the stable rbtree. So there are three types of stable_nodes: the regular ones (identical as before, living in the first flat dimension of the stable rbtree), the "chains" and the "dups".

Every "chain" and all "dups" linked into a "chain" enforce the invariant that they represent the same write protected memory content, even if each "dup" will be pointed to by a different KSM page copy of that content. This way the stable rbtree lookup computational complexity is unaffected if compared to an unlimited max_sharing_limit. It is still enforced that there cannot be KSM page content duplicates in the stable rbtree itself.
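
The latency figures quoted earlier in this message can be sanity-checked with a quick calculation. The sketch below is only a back-of-the-envelope estimate: it assumes 4 KiB base pages and reads "16GB" as 16 GiB, neither of which the message states, and it uses the ~10usec per-entry upper bound as if every entry hit it.

```python
# Back-of-the-envelope rmap walk latency, using the figures quoted in the message.
PAGE_SIZE = 4096        # assumption: 4 KiB base pages
PER_ENTRY_USEC = 10     # upper bound per rmap_item quoted in the message


def rmap_walk_seconds(n_rmap_items, per_entry_usec=PER_ENTRY_USEC):
    """Worst-case time to walk n_rmap_items entries, in seconds."""
    return n_rmap_items * per_entry_usec / 1_000_000


# 16 GiB of virtual address space all mapping the same KSM page.
unlimited_items = (16 * 1024**3) // PAGE_SIZE
unlimited = rmap_walk_seconds(unlimited_items)

# The same walk with the list capped at 256 entries.
capped = rmap_walk_seconds(256)

print(f"unlimited: {unlimited_items} rmap_items, ~{unlimited:.1f} s")
print(f"capped at 256: ~{capped * 1000:.2f} ms")
```

Under these assumptions the uncapped walk covers about 4.2 million entries and takes roughly 42 seconds, while the capped walk stays under 3 msec, which matches the "dozens of seconds" versus "a few msec" framing.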
Adding the second dimension to the stable rbtree only after the max_page_sharing limit hits, provides for a zero memory footprint increase on 64bit archs. The memory overhead of the per-KSM page stable_node and per virtual mapping rmap_item is unchanged. Only after the max_page_sharing limit hits, we need to allocate a stable_node "chain" and rb_replace() the "regular" stable_node with the newly allocated stable_node "chain". After that we simply add the "regular" stable_node to the chain as a stable_node "dup" by linking hlist_dup in the stable_node_chain->hlist. This way the "regular" (flat) stable_node is converted to a stable_node "dup" living in the second dimension of the stable rbtree.

During stable rbtree lookups the stable_node "chain" is identified as stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka is_stable_node_chain()).

When dropping stable_nodes, the stable_node "dup" is identified as stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()).

The STABLE_NODE_DUP_HEAD must be a unique valid pointer never used elsewhere in any stable_node->head/node to avoid a clash with the stable_node->node.rb_parent_color pointer, and different from &migrate_nodes. So the second field of &migrate_nodes is picked and verified as always safe with a BUILD_BUG_ON in case the list_head implementation changes in the future.

The STABLE_NODE_CHAIN is picked as a random negative value in stable_node->rmap_hlist_len. rmap_hlist_len cannot become negative when it's a "regular" stable_node or a stable_node "dup".

The stable_node_chain->nid is irrelevant. The stable_node_chain->kpfn is aliased in a union with a time field used to rate limit the stable_node_chain->hlist prunes.

The garbage collection of the stable_node_chain happens lazily during stable rbtree lookups (as for all other kinds of stable_nodes), or while disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run" while collecting the entire stable rbtree.
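
The conversion of a "regular" stable_node into a "chain" holding "dups" can be illustrated with a small userspace toy model. This is not the kernel implementation: the dict stands in for the stable rbtree, the `is_dup` flag stands in for the STABLE_NODE_DUP_HEAD pointer check, the sentinel value is arbitrary, and `add_sharer` is a hypothetical helper. It also lets a node start with a single sharer, which the kernel does not do.

```python
from dataclasses import dataclass, field
from typing import List

STABLE_NODE_CHAIN = -1024   # assumption: any negative sentinel works for this model


@dataclass
class StableNode:
    """A "regular" node or a "dup": one KSM page copy plus its sharer count."""
    rmap_hlist_len: int = 0
    is_dup: bool = False    # stands in for head == STABLE_NODE_DUP_HEAD


@dataclass
class StableNodeChain:
    """Metadata-only node; its dups all represent the same page content."""
    rmap_hlist_len: int = STABLE_NODE_CHAIN
    hlist: List[StableNode] = field(default_factory=list)


def is_stable_node_chain(node):
    return node.rmap_hlist_len == STABLE_NODE_CHAIN


def add_sharer(tree, content, max_page_sharing):
    """Attach one more rmap_item for `content`, honoring the sharing limit."""
    node = tree.get(content)
    if node is None:
        tree[content] = StableNode(rmap_hlist_len=1)
        return
    if not is_stable_node_chain(node):
        if node.rmap_hlist_len < max_page_sharing:
            node.rmap_hlist_len += 1
            return
        # Limit hit: replace the regular node with a chain, demote it to a dup.
        node.is_dup = True
        node = StableNodeChain(hlist=[node])
        tree[content] = node
    # Reuse a dup that still has room, else add a new dup at the head.
    for dup in node.hlist:
        if dup.rmap_hlist_len < max_page_sharing:
            dup.rmap_hlist_len += 1
            return
    node.hlist.insert(0, StableNode(rmap_hlist_len=1, is_dup=True))


tree = {}
for _ in range(600):
    add_sharer(tree, b"same-content", max_page_sharing=256)
chain = tree[b"same-content"]
print(is_stable_node_chain(chain), sorted(d.rmap_hlist_len for d in chain.hlist))
```

In this model the first dimension still holds exactly one entry per page content, so lookups by content are unaffected, while no single dup ever carries more than max_page_sharing sharers.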
While the "regular" stable_nodes and the stable_node "dups" must wait for their underlying tree_page to be freed before they can be freed themselves, the stable_node "chains" can be freed immediately if the stable_node->hlist turns empty. This is because the "chains" are never pointed by any page->mapping and they're effectively stable rbtree KSM self contained metadata.

[[email protected]: fix non-NUMA build]
Signed-off-by: Andrea Arcangeli <[email protected]>
Tested-by: Petr Holasek <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Evgheni Dereveanchin <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Gavin Guo <[email protected]>
Cc: Jay Vosburgh <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
1 parent 172ffeb commit 2c653d0

2 files changed, +730 -66 lines changed

Documentation/vm/ksm.txt

Lines changed: 63 additions & 0 deletions
@@ -98,6 +98,50 @@ use_zero_pages - specifies whether empty pages (i.e. allocated pages
                    it is only effective for pages merged after the change.
                    Default: 0 (normal KSM behaviour as in earlier releases)

+max_page_sharing - Maximum sharing allowed for each KSM page. This
+                   enforces a deduplication limit to keep the virtual
+                   memory rmap lists from growing too large. The minimum
+                   value is 2 as a newly created KSM page will have at
+                   least two sharers. The rmap walk has O(N)
+                   complexity where N is the number of rmap_items
+                   (i.e. virtual mappings) that are sharing the page,
+                   which is in turn capped by max_page_sharing. So
+                   this effectively spreads the linear O(N)
+                   computational complexity from rmap walk context
+                   over different KSM pages. The ksmd walk over the
+                   stable_node "chains" is also O(N), but N is the
+                   number of stable_node "dups", not the number of
+                   rmap_items, so it does not have a significant impact
+                   on ksmd performance. In practice the best stable_node
+                   "dup" candidate will be kept and found at the head
+                   of the "dups" list. The higher this value the
+                   faster KSM will merge the memory (because there
+                   will be fewer stable_node dups queued into the
+                   stable_node chain->hlist to check for pruning) and
+                   the higher the deduplication factor will be, but
+                   the slower the worst case rmap walk could be for
+                   any given KSM page. Slowing down the rmap_walk
+                   means there will be higher latency for certain
+                   virtual memory operations happening during
+                   swapping, compaction, NUMA balancing and page
+                   migration, in turn decreasing responsiveness for
+                   the caller of those virtual memory operations. The
+                   scheduler latency of other tasks not involved with
+                   the VM operations doing the rmap walk is not
+                   affected by this parameter as the rmap walks are
+                   always schedule friendly themselves.
+
+stable_node_chains_prune_millisecs - How frequently to walk the whole
+                   list of stable_node "dups" linked in the
+                   stable_node "chains" in order to prune stale
+                   stable_nodes. Smaller millisecs values will free
+                   up the KSM metadata with lower latency, but they
+                   will make ksmd use more CPU during the scan. This
+                   only applies to the stable_node chains so it's a
+                   no-op if no KSM page has hit the
+                   max_page_sharing limit yet (there would be no
+                   stable_node chains in such case).
+
 The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/:

 pages_shared - how many shared pages are being used
@@ -106,10 +150,29 @@ pages_unshared - how many pages unique but repeatedly checked for merging
 pages_volatile - how many pages changing too fast to be placed in a tree
 full_scans - how many times all mergeable areas have been scanned

+stable_node_chains - number of stable node chains allocated, this is
+                     effectively the number of KSM pages that hit the
+                     max_page_sharing limit
+stable_node_dups - number of stable node dups queued into the
+                   stable_node chains
+
 A high ratio of pages_sharing to pages_shared indicates good sharing, but
 a high ratio of pages_unshared to pages_sharing indicates wasted effort.
 pages_volatile embraces several different kinds of activity, but a high
 proportion there would also indicate poor use of madvise MADV_MERGEABLE.

+The maximum possible pages_sharing/pages_shared ratio is limited by the
+max_page_sharing tunable. To increase the ratio max_page_sharing must
+be increased accordingly.
+
+The stable_node_dups/stable_node_chains ratio is also affected by the
+max_page_sharing tunable, and a high ratio may indicate fragmentation
+in the stable_node dups, which could be solved by introducing
+defragmentation algorithms in ksmd which would refile rmap_items from
+one stable_node dup to another stable_node dup, in order to free up
+stable_node "dups" with few rmap_items in them, but that may increase
+the ksmd CPU usage and possibly slow down the readonly computations on
+the KSM pages of the applications.
+
 Izik Eidus,
 Hugh Dickins, 17 Nov 2009
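
The two ratios discussed in the documentation change can be computed from the sysfs counters it names. The following is a minimal sketch, not part of the commit: the helper names (`read_counter`, `ratio`, `ksm_ratios`) are hypothetical, and it assumes each file under /sys/kernel/mm/ksm/ holds a single integer. Missing files or zero denominators yield `None` rather than an error, so it also runs on systems without KSM.

```python
from pathlib import Path

KSM_DIR = Path("/sys/kernel/mm/ksm")


def read_counter(name, base=KSM_DIR):
    """Return the integer stored in base/name, or None if it is unavailable."""
    try:
        return int((base / name).read_text().strip())
    except (OSError, ValueError):
        return None


def ratio(numerator, denominator):
    """numerator/denominator, or None if a value is missing or the denominator is 0."""
    if numerator is None or not denominator:
        return None
    return numerator / denominator


def ksm_ratios(base=KSM_DIR):
    """Collect the sharing and fragmentation ratios described in ksm.txt."""
    names = ("pages_sharing", "pages_shared", "stable_node_dups",
             "stable_node_chains", "max_page_sharing")
    c = {n: read_counter(n, base) for n in names}
    return {
        # Bounded above by max_page_sharing.
        "sharing_per_shared": ratio(c["pages_sharing"], c["pages_shared"]),
        # A high value may indicate fragmentation among the dups.
        "dups_per_chain": ratio(c["stable_node_dups"], c["stable_node_chains"]),
        "max_page_sharing": c["max_page_sharing"],
    }


if __name__ == "__main__":
    print(ksm_ratios())
```

If `sharing_per_shared` sits close to `max_page_sharing`, the tunable is likely the limiting factor for deduplication and raising it trades rmap walk latency for a higher ratio.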
