
Commit f1eca35

djbw authored and torvalds committed
mm/sparsemem: introduce struct mem_section_usage
Patch series "mm: Sub-section memory hotplug support", v10.

The memory hotplug section is an arbitrary / convenient unit for memory hotplug. 'Section-size' units have bled into the user interface ('memblock' sysfs) and can not be changed without breaking existing userspace. The section-size constraint, while mostly benign for typical memory hotplug, has wreaked, and continues to wreak, havoc with 'device-memory' use cases, persistent memory (pmem) in particular. Recall that pmem uses devm_memremap_pages(), and subsequently arch_add_memory(), to allocate a 'struct page' memmap for pmem. However, it does not use the 'bottom half' of memory hotplug, i.e. it never marks pmem pages online and never exposes the userspace memblock interface for pmem. This leaves an opening to redress the section-size constraint.

To date, the libnvdimm subsystem has attempted to inject padding to satisfy the internal constraints of arch_add_memory(). Beyond complicating the code, leading to bugs [2], wasting memory, and limiting configuration flexibility, the padding hack is broken when the platform changes the physical memory alignment of pmem from one boot to the next. Device failure (intermittent or permanent) and physical reconfiguration are events that can cause the platform firmware to change the physical placement of pmem on a subsequent boot, and device failure is an everyday event in a data-center.

It turns out that sections are only a hard requirement of the user-facing interface for memory hotplug; with a bit more infrastructure, sub-section arch_add_memory() support can be added for kernel-internal usages like devm_memremap_pages(). Here is an analysis of the design assumptions in the current code and how they are addressed in the new implementation:

Current design assumptions:

- Sections that describe boot memory (early sections) are never unplugged / removed.
- pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y case, devolves to a valid_section() check.
- __add_pages() and helper routines assume all operations occur in PAGES_PER_SECTION units.
- The memblock sysfs interface only comprehends full sections.

New design assumptions:

- Sections are instrumented with a sub-section bitmask to track (on x86) individual 2MB sub-divisions of a 128MB section.
- Partially populated early sections can be extended with additional sub-sections, and those sub-sections can be removed with arch_remove_memory(). With this in place we no longer lose usable memory capacity to padding.
- pfn_valid() is updated to look deeper than valid_section() to also check the active-sub-section mask. This indication is in the same cacheline as valid_section(), so the performance impact is expected to be negligible. So far the lkp robot has not reported any regressions.
- Outside of the core vmemmap population routines, which are replaced, other helper routines like shrink_{zone,pgdat}_span() are updated to handle the smaller granularity. Core memory hotplug routines that deal with online memory are not touched.
- The existing memblock sysfs user api guarantees / assumptions are not touched since this capability is limited to !online !memblock-sysfs-accessible sections.

Meanwhile the issue reports continue to roll in from users that do not understand when and how the 128MB constraint will bite them. The current implementation relied on being able to support at least one misaligned namespace, but that immediately falls over on any moderately complex namespace creation attempt. Beyond the initial problem of 'System RAM' colliding with pmem, and the unsolvable problem of physical alignment changes, Linux is now being exposed to platforms that collide pmem ranges with other pmem ranges by default [3]. In short, devm_memremap_pages() has pushed the venerable section-size constraint past the breaking point, and the simplicity of section-aligned arch_add_memory() is no longer tenable.

These patches are exposed to the kbuild robot on a subsection-v10 branch [4], and a preview of the unit test for this functionality is available on the 'subsection-pending' branch of ndctl [5].

[2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
[3]: pmem/ndctl#76
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
[5]: pmem/ndctl@7c59b4867e1c

This patch (of 13):

Towards enabling memory hotplug to track partial population of a section, introduce 'struct mem_section_usage'. A pointer to a 'struct mem_section_usage' instance replaces the existing pointer to a 'pageblock_flags' bitmap. Effectively it adds one more 'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house a new 'subsection_map' bitmap. The new bitmap enables the memory hot{plug,remove} implementation to act on incremental sub-divisions of a section.

SUBSECTION_SHIFT is defined as a global constant instead of a per-architecture value like SECTION_SIZE_BITS in order to allow cross-arch compatibility of subsection users. Specifically, a common subsection size allows for the possibility that persistent memory namespace configurations can be made compatible across architectures.

The primary motivation for this functionality is to support platforms that mix "System RAM" and "Persistent Memory" within a single section, or multiple PMEM ranges with different mapping lifetimes within a single section. The section restriction for hotplug has caused an ongoing saga of hacks and bugs for devm_memremap_pages() users.

Beyond the fixups to teach existing paths how to retrieve the 'usemap' from a section, and updates to the usemap allocation path, there are no expected behavior changes.

Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]> [ppc64]
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
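For concreteness, the standalone sketch below (a userspace program, not part of the patch) evaluates the new subsection constants under assumed x86_64 values, PAGE_SHIFT = 12 and SECTION_SIZE_BITS = 27 (128MB sections):

/* A minimal sketch, not from the patch: evaluate the subsection constants
 * assuming x86_64 values (PAGE_SHIFT = 12, SECTION_SIZE_BITS = 27). */
#include <stdio.h>

#define PAGE_SHIFT              12      /* assumption: x86_64 */
#define SECTION_SIZE_BITS       27      /* assumption: x86_64, 128MB sections */

#define SUBSECTION_SHIFT        21      /* the global constant introduced here */
#define PFN_SUBSECTION_SHIFT    (SUBSECTION_SHIFT - PAGE_SHIFT)
#define PAGES_PER_SUBSECTION    (1UL << PFN_SUBSECTION_SHIFT)
#define SUBSECTIONS_PER_SECTION (1UL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT))

int main(void)
{
        printf("subsection size:         %lu MB\n", (1UL << SUBSECTION_SHIFT) >> 20); /* 2 */
        printf("pages per subsection:    %lu\n", PAGES_PER_SUBSECTION);               /* 512 */
        printf("subsections per section: %lu\n", SUBSECTIONS_PER_SECTION);            /* 64 */
        return 0;
}

With 64 subsections per 128MB section, the 'subsection_map' bitmap fits in a single unsigned long on 64-bit targets, which is the one extra word per section that 'struct mem_section_usage' adds in front of the usemap.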
1 parent dd62528 commit f1eca35

4 files changed: +76 additions, -53 deletions

include/linux/mmzone.h

Lines changed: 25 additions & 3 deletions
@@ -1160,6 +1160,24 @@ static inline unsigned long section_nr_to_pfn(unsigned long sec)
 #define SECTION_ALIGN_UP(pfn)	(((pfn) + PAGES_PER_SECTION - 1) & PAGE_SECTION_MASK)
 #define SECTION_ALIGN_DOWN(pfn)	((pfn) & PAGE_SECTION_MASK)
 
+#define SUBSECTION_SHIFT 21
+
+#define PFN_SUBSECTION_SHIFT (SUBSECTION_SHIFT - PAGE_SHIFT)
+#define PAGES_PER_SUBSECTION (1UL << PFN_SUBSECTION_SHIFT)
+#define PAGE_SUBSECTION_MASK (~(PAGES_PER_SUBSECTION-1))
+
+#if SUBSECTION_SHIFT > SECTION_SIZE_BITS
+#error Subsection size exceeds section size
+#else
+#define SUBSECTIONS_PER_SECTION (1UL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT))
+#endif
+
+struct mem_section_usage {
+	DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
+	/* See declaration of similar field in struct zone */
+	unsigned long pageblock_flags[0];
+};
+
 struct page;
 struct page_ext;
 struct mem_section {
@@ -1177,8 +1195,7 @@ struct mem_section {
 	 */
 	unsigned long section_mem_map;
 
-	/* See declaration of similar field in struct zone */
-	unsigned long *pageblock_flags;
+	struct mem_section_usage *usage;
 #ifdef CONFIG_PAGE_EXTENSION
 	/*
 	 * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
@@ -1209,6 +1226,11 @@ extern struct mem_section **mem_section;
 extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
 #endif
 
+static inline unsigned long *section_to_usemap(struct mem_section *ms)
+{
+	return ms->usage->pageblock_flags;
+}
+
 static inline struct mem_section *__nr_to_section(unsigned long nr)
 {
 #ifdef CONFIG_SPARSEMEM_EXTREME
@@ -1220,7 +1242,7 @@ static inline struct mem_section *__nr_to_section(unsigned long nr)
 	return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
 }
 extern unsigned long __section_nr(struct mem_section *ms);
-extern unsigned long usemap_size(void);
+extern size_t mem_section_usage_size(void);
 
 /*
  * We use the lower bits of the mem_map pointer to store
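As a reading aid for the hunk above, here is a standalone sketch of how the new macros relate a pfn to a bit position in 'subsection_map'. It is illustrative only: subsection_index_of() is a hypothetical helper, not something this patch adds, and the PAGE_SHIFT / SECTION_SIZE_BITS values are x86_64 assumptions.

#include <stdio.h>

#define PAGE_SHIFT              12      /* assumption: x86_64 */
#define SECTION_SIZE_BITS       27      /* assumption: x86_64 */
#define PFN_SECTION_SHIFT       (SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION       (1UL << PFN_SECTION_SHIFT)

#define SUBSECTION_SHIFT        21
#define PFN_SUBSECTION_SHIFT    (SUBSECTION_SHIFT - PAGE_SHIFT)
#define PAGES_PER_SUBSECTION    (1UL << PFN_SUBSECTION_SHIFT)
#define SUBSECTIONS_PER_SECTION (1UL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT))

/* hypothetical helper: offset of a pfn within its section, in subsections */
static unsigned long subsection_index_of(unsigned long pfn)
{
        return (pfn & (PAGES_PER_SECTION - 1)) >> PFN_SUBSECTION_SHIFT;
}

int main(void)
{
        /* a pfn 6MB (three 2MB subsections) into an arbitrary section */
        unsigned long pfn = 42UL * PAGES_PER_SECTION + 3 * PAGES_PER_SUBSECTION;

        printf("subsection index %lu of %lu\n",
               subsection_index_of(pfn), SUBSECTIONS_PER_SECTION); /* 3 of 64 */
        return 0;
}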

mm/memory_hotplug.c

Lines changed: 10 additions & 8 deletions
@@ -166,9 +166,10 @@ void put_page_bootmem(struct page *page)
 #ifndef CONFIG_SPARSEMEM_VMEMMAP
 static void register_page_bootmem_info_section(unsigned long start_pfn)
 {
-	unsigned long *usemap, mapsize, section_nr, i;
+	unsigned long mapsize, section_nr, i;
 	struct mem_section *ms;
 	struct page *page, *memmap;
+	struct mem_section_usage *usage;
 
 	section_nr = pfn_to_section_nr(start_pfn);
 	ms = __nr_to_section(section_nr);
@@ -188,10 +189,10 @@ static void register_page_bootmem_info_section(unsigned long start_pfn)
 	for (i = 0; i < mapsize; i++, page++)
 		get_page_bootmem(section_nr, page, SECTION_INFO);
 
-	usemap = ms->pageblock_flags;
-	page = virt_to_page(usemap);
+	usage = ms->usage;
+	page = virt_to_page(usage);
 
-	mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
+	mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
 
 	for (i = 0; i < mapsize; i++, page++)
 		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
@@ -200,9 +201,10 @@ static void register_page_bootmem_info_section(unsigned long start_pfn)
 #else /* CONFIG_SPARSEMEM_VMEMMAP */
 static void register_page_bootmem_info_section(unsigned long start_pfn)
 {
-	unsigned long *usemap, mapsize, section_nr, i;
+	unsigned long mapsize, section_nr, i;
 	struct mem_section *ms;
 	struct page *page, *memmap;
+	struct mem_section_usage *usage;
 
 	section_nr = pfn_to_section_nr(start_pfn);
 	ms = __nr_to_section(section_nr);
@@ -211,10 +213,10 @@ static void register_page_bootmem_info_section(unsigned long start_pfn)
 
 	register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION);
 
-	usemap = ms->pageblock_flags;
-	page = virt_to_page(usemap);
+	usage = ms->usage;
+	page = virt_to_page(usage);
 
-	mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
+	mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
 
 	for (i = 0; i < mapsize; i++, page++)
 		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
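The registration paths above keep the same shape; the only change is that the backing pages are now counted over the whole 'struct mem_section_usage' rather than just the old usemap. A standalone check of that PAGE_ALIGN arithmetic, using an assumed byte size for mem_section_usage_size():

#include <stdio.h>

#define PAGE_SHIFT      12      /* assumption: x86_64 */
#define PAGE_SIZE       (1UL << PAGE_SHIFT)
#define PAGE_ALIGN(x)   (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

int main(void)
{
        /* assumed example value for mem_section_usage_size(), in bytes */
        unsigned long usage_bytes = 40;

        /* number of pages walked by the MIX_SECTION_INFO loop above */
        printf("mapsize = %lu page(s)\n", PAGE_ALIGN(usage_bytes) >> PAGE_SHIFT); /* 1 */
        return 0;
}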

mm/page_alloc.c

Lines changed: 1 addition & 1 deletion
@@ -450,7 +450,7 @@ static inline unsigned long *get_pageblock_bitmap(struct page *page,
 							unsigned long pfn)
 {
 #ifdef CONFIG_SPARSEMEM
-	return __pfn_to_section(pfn)->pageblock_flags;
+	return section_to_usemap(__pfn_to_section(pfn));
 #else
 	return page_zone(page)->pageblock_flags;
 #endif /* CONFIG_SPARSEMEM */

mm/sparse.c

Lines changed: 40 additions & 41 deletions
@@ -288,33 +288,31 @@ struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pn
 
 static void __meminit sparse_init_one_section(struct mem_section *ms,
 		unsigned long pnum, struct page *mem_map,
-		unsigned long *pageblock_bitmap)
+		struct mem_section_usage *usage)
 {
 	ms->section_mem_map &= ~SECTION_MAP_MASK;
 	ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) |
 							SECTION_HAS_MEM_MAP;
-	ms->pageblock_flags = pageblock_bitmap;
+	ms->usage = usage;
 }
 
-unsigned long usemap_size(void)
+static unsigned long usemap_size(void)
 {
 	return BITS_TO_LONGS(SECTION_BLOCKFLAGS_BITS) * sizeof(unsigned long);
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG
-static unsigned long *__kmalloc_section_usemap(void)
+size_t mem_section_usage_size(void)
 {
-	return kmalloc(usemap_size(), GFP_KERNEL);
+	return sizeof(struct mem_section_usage) + usemap_size();
 }
-#endif /* CONFIG_MEMORY_HOTPLUG */
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-static unsigned long * __init
+static struct mem_section_usage * __init
 sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat,
 					 unsigned long size)
 {
+	struct mem_section_usage *usage;
 	unsigned long goal, limit;
-	unsigned long *p;
 	int nid;
 	/*
 	 * A page may contain usemaps for other sections preventing the
@@ -330,15 +328,16 @@ sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat,
 	limit = goal + (1UL << PA_SECTION_SHIFT);
 	nid = early_pfn_to_nid(goal >> PAGE_SHIFT);
 again:
-	p = memblock_alloc_try_nid(size, SMP_CACHE_BYTES, goal, limit, nid);
-	if (!p && limit) {
+	usage = memblock_alloc_try_nid(size, SMP_CACHE_BYTES, goal, limit, nid);
+	if (!usage && limit) {
 		limit = 0;
 		goto again;
 	}
-	return p;
+	return usage;
 }
 
-static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
+static void __init check_usemap_section_nr(int nid,
+		struct mem_section_usage *usage)
 {
 	unsigned long usemap_snr, pgdat_snr;
 	static unsigned long old_usemap_snr;
@@ -352,7 +351,7 @@ static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
 		old_pgdat_snr = NR_MEM_SECTIONS;
 	}
 
-	usemap_snr = pfn_to_section_nr(__pa(usemap) >> PAGE_SHIFT);
+	usemap_snr = pfn_to_section_nr(__pa(usage) >> PAGE_SHIFT);
 	pgdat_snr = pfn_to_section_nr(__pa(pgdat) >> PAGE_SHIFT);
 	if (usemap_snr == pgdat_snr)
 		return;
@@ -380,14 +379,15 @@ static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
 		usemap_snr, pgdat_snr, nid);
 }
 #else
-static unsigned long * __init
+static struct mem_section_usage * __init
 sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat,
 					 unsigned long size)
 {
 	return memblock_alloc_node(size, SMP_CACHE_BYTES, pgdat->node_id);
 }
 
-static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
+static void __init check_usemap_section_nr(int nid,
+		struct mem_section_usage *usage)
 {
 }
 #endif /* CONFIG_MEMORY_HOTREMOVE */
@@ -474,14 +474,13 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 				   unsigned long pnum_end,
 				   unsigned long map_count)
 {
-	unsigned long pnum, usemap_longs, *usemap;
+	struct mem_section_usage *usage;
+	unsigned long pnum;
 	struct page *map;
 
-	usemap_longs = BITS_TO_LONGS(SECTION_BLOCKFLAGS_BITS);
-	usemap = sparse_early_usemaps_alloc_pgdat_section(NODE_DATA(nid),
-							  usemap_size() *
-							  map_count);
-	if (!usemap) {
+	usage = sparse_early_usemaps_alloc_pgdat_section(NODE_DATA(nid),
+			mem_section_usage_size() * map_count);
+	if (!usage) {
 		pr_err("%s: node[%d] usemap allocation failed", __func__, nid);
 		goto failed;
 	}
@@ -497,9 +496,9 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 			pnum_begin = pnum;
 			goto failed;
 		}
-		check_usemap_section_nr(nid, usemap);
-		sparse_init_one_section(__nr_to_section(pnum), pnum, map, usemap);
-		usemap += usemap_longs;
+		check_usemap_section_nr(nid, usage);
+		sparse_init_one_section(__nr_to_section(pnum), pnum, map, usage);
+		usage = (void *) usage + mem_section_usage_size();
 	}
 	sparse_buffer_fini();
 	return;
@@ -697,9 +696,9 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
 				     struct vmem_altmap *altmap)
 {
 	unsigned long section_nr = pfn_to_section_nr(start_pfn);
+	struct mem_section_usage *usage;
 	struct mem_section *ms;
 	struct page *memmap;
-	unsigned long *usemap;
 	int ret;
 
 	/*
@@ -713,8 +712,8 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
 	memmap = kmalloc_section_memmap(section_nr, nid, altmap);
 	if (!memmap)
 		return -ENOMEM;
-	usemap = __kmalloc_section_usemap();
-	if (!usemap) {
+	usage = kzalloc(mem_section_usage_size(), GFP_KERNEL);
+	if (!usage) {
 		__kfree_section_memmap(memmap, altmap);
 		return -ENOMEM;
 	}
@@ -733,11 +732,11 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
 
 	set_section_nid(section_nr, nid);
 	section_mark_present(ms);
-	sparse_init_one_section(ms, section_nr, memmap, usemap);
+	sparse_init_one_section(ms, section_nr, memmap, usage);
 
 out:
 	if (ret < 0) {
-		kfree(usemap);
+		kfree(usage);
 		__kfree_section_memmap(memmap, altmap);
 	}
 	return ret;
@@ -773,20 +772,20 @@ static inline void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
 }
 #endif
 
-static void free_section_usemap(struct page *memmap, unsigned long *usemap,
-		struct vmem_altmap *altmap)
+static void free_section_usage(struct page *memmap,
+		struct mem_section_usage *usage, struct vmem_altmap *altmap)
 {
-	struct page *usemap_page;
+	struct page *usage_page;
 
-	if (!usemap)
+	if (!usage)
 		return;
 
-	usemap_page = virt_to_page(usemap);
+	usage_page = virt_to_page(usage);
 	/*
 	 * Check to see if allocation came from hot-plug-add
 	 */
-	if (PageSlab(usemap_page) || PageCompound(usemap_page)) {
-		kfree(usemap);
+	if (PageSlab(usage_page) || PageCompound(usage_page)) {
+		kfree(usage);
 		if (memmap)
 			__kfree_section_memmap(memmap, altmap);
 		return;
@@ -805,18 +804,18 @@ void sparse_remove_one_section(struct mem_section *ms, unsigned long map_offset,
 				struct vmem_altmap *altmap)
 {
 	struct page *memmap = NULL;
-	unsigned long *usemap = NULL;
+	struct mem_section_usage *usage = NULL;
 
 	if (ms->section_mem_map) {
-		usemap = ms->pageblock_flags;
+		usage = ms->usage;
 		memmap = sparse_decode_mem_map(ms->section_mem_map,
 					__section_nr(ms));
 		ms->section_mem_map = 0;
-		ms->pageblock_flags = NULL;
+		ms->usage = NULL;
 	}
 
 	clear_hwpoisoned_pages(memmap + map_offset,
 			PAGES_PER_SECTION - map_offset);
-	free_section_usemap(memmap, usemap, altmap);
+	free_section_usage(memmap, usage, altmap);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
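To visualize the sparse_init_nid() change above, here is a standalone sketch (not kernel code) of the layout it relies on: one contiguous allocation holds map_count usage records, each mem_section_usage_size() bytes long (the fixed struct header plus the variable-length usemap), and the per-section cursor advances by that size, mirroring 'usage = (void *) usage + mem_section_usage_size();'. The usemap_size() value below is a placeholder assumption.

#include <stdio.h>
#include <stdlib.h>

struct mem_section_usage {
        unsigned long subsection_map[1];        /* 64 subsections on 64-bit */
        unsigned long pageblock_flags[];        /* variable-length usemap tail */
};

static size_t usemap_size(void)
{
        return 4 * sizeof(unsigned long);       /* assumption: placeholder size */
}

static size_t mem_section_usage_size(void)
{
        return sizeof(struct mem_section_usage) + usemap_size();
}

int main(void)
{
        unsigned long map_count = 8;            /* sections on one node, for illustration */
        char *block = calloc(map_count, mem_section_usage_size());
        struct mem_section_usage *usage = (struct mem_section_usage *)block;
        unsigned long pnum;

        if (!block)
                return 1;
        for (pnum = 0; pnum < map_count; pnum++) {
                /* each section's usage record lives at a fixed stride in the block */
                printf("section %lu -> usage at byte offset %td\n",
                       pnum, (char *)usage - block);
                usage = (struct mem_section_usage *)((char *)usage + mem_section_usage_size());
        }
        free(block);
        return 0;
}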
