
Commit 4917f55

jpemartins authored and akpm00 committed
mm/sparse-vmemmap: improve memory savings for compound devmaps
A compound devmap is a dev_pagemap with @vmemmap_shift > 0, meaning that pages are mapped at a given huge page alignment and use compound pages, as opposed to order-0 pages.

Take advantage of the fact that most tail pages look the same (except the first two) to minimize struct page overhead. Allocate a separate page for the vmemmap area that contains the head page, and a separate one for the next 64 pages. The rest of the subsections then reuse this tail vmemmap page to initialize the rest of the tail pages.

Sections are arch-dependent (e.g. on x86 they are 64M, 128M or 512M), and when initializing a compound devmap with a big enough @vmemmap_shift (e.g. a 1G PUD) it may cross multiple sections. The vmemmap code needs to consult @pgmap so that multiple sections that all map the same tail data can refer back to the first copy of that data for a given gigantic page.

On compound devmaps with 2M alignment, this mechanism saves 6 of the 8 vmemmap pages needed to map a subsection's 512 struct pages. On a 1G compound devmap it saves 4094 pages.

Altmap isn't supported yet, given various restrictions of the altmap pfn allocator, so fall back to the already-in-use vmemmap_populate(). It is worth noting that altmap for devmap mappings was there to relieve the pressure of inordinate amounts of memmap space to map terabytes of pmem. With compound pages, the motivation for altmaps for pmem is reduced.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joao Martins <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
1 parent 60a427d commit 4917f55
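
To make the savings arithmetic above concrete, here is a minimal user-space sketch (illustrative only, not part of the patch), assuming x86_64 defaults of 4K base pages and a 64-byte struct page; with deduplication each compound page ends up backed by one head vmemmap page plus one reused tail vmemmap page:

#include <stdio.h>

#define PAGE_SIZE		4096UL
#define STRUCT_PAGE_SIZE	64UL	/* sizeof(struct page) on x86_64 */

int main(void)
{
	/* base pages per compound devmap page: 2M and 1G on x86_64 */
	unsigned long vmemmap_nr[] = { 512UL, 262144UL };

	for (int i = 0; i < 2; i++) {
		unsigned long nr = vmemmap_nr[i];
		/* vmemmap pages needed without deduplication */
		unsigned long total = nr * STRUCT_PAGE_SIZE / PAGE_SIZE;
		/* with deduplication: one head page plus one reused tail page */
		unsigned long used = 2;

		printf("%6lu base pages: %4lu vmemmap pages -> %lu kept, %lu saved\n",
		       nr, total, used, total - used);
	}
	return 0;
}

Running it prints 8 vmemmap pages reduced to 2 (6 saved) for the 2M case and 4096 reduced to 2 (4094 saved) for the 1G case, matching the numbers in the commit message.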

4 files changed, 177 insertions(+), 14 deletions(-)

Documentation/vm/vmemmap_dedup.rst

Lines changed: 53 additions & 3 deletions
@@ -1,8 +1,11 @@
 .. SPDX-License-Identifier: GPL-2.0
 
-==================================
-Free some vmemmap pages of HugeTLB
-==================================
+=========================================
+A vmemmap diet for HugeTLB and Device DAX
+=========================================
+
+HugeTLB
+=======
 
 The struct page structures (page structs) are used to describe a physical
 page frame. By default, there is a one-to-one mapping from a page frame to
@@ -171,3 +174,50 @@ tail vmemmap pages are mapped to the head vmemmap page frame. So we can see
 more than one struct page struct with PG_head (e.g. 8 per 2 MB HugeTLB page)
 associated with each HugeTLB page. The compound_head() can handle this
 correctly (more details refer to the comment above compound_head()).
+
+Device DAX
+==========
+
+The device-dax interface uses the same tail deduplication technique explained
+in the previous chapter, except when used with the vmemmap in
+the device (altmap).
+
+The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
+PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
+
+The differences with HugeTLB are relatively minor.
+
+It only use 3 page structs for storing all information as opposed
+to 4 on HugeTLB pages.
+
+There's no remapping of vmemmap given that device-dax memory is not part of
+System RAM ranges initialized at boot. Thus the tail page deduplication
+happens at a later stage when we populate the sections. HugeTLB reuses the
+the head vmemmap page representing, whereas device-dax reuses the tail
+vmemmap page. This results in only half of the savings compared to HugeTLB.
+
+Deduplicated tail pages are not mapped read-only.
+
+Here's how things look like on device-dax after the sections are populated::
+
+ +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
+ |           |                     |     0     | -------------> |     0     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     1     | -------------> |     1     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
+ |           |                     +-----------+                   | | | | |
+ |           |                     |     3     | ------------------+ | | | |
+ |           |                     +-----------+                     | | | |
+ |           |                     |     4     | --------------------+ | | |
+ |    PMD    |                     +-----------+                       | | |
+ |   level   |                     |     5     | ----------------------+ | |
+ |  mapping  |                     +-----------+                         | |
+ |           |                     |     6     | ------------------------+ |
+ |           |                     +-----------+                           |
+ |           |                     |     7     | --------------------------+
+ |           |                     +-----------+
+ |           |
+ |           |
+ |           |
+ +-----------+
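
To read the diagram: for one 2M device-dax compound page there are eight vmemmap pages' worth of struct pages, and after population only the first two are backed by distinct frames; every later vmemmap PTE points back at the tail frame. A tiny illustrative sketch of that mapping (not from the patch, assuming the 8-page layout shown above):

#include <stdio.h>

/* Which backing frame holds the i-th vmemmap page of one 2M device-dax
 * compound page: frame 0 (head struct page plus first 63 tails) and
 * frame 1 (next 64 tails) are real allocations; frames 2..7 reuse frame 1. */
static unsigned long backing_frame(unsigned long i)
{
	return i < 2 ? i : 1;
}

int main(void)
{
	for (unsigned long i = 0; i < 8; i++)
		printf("vmemmap page %lu -> frame %lu\n", i, backing_frame(i));
	return 0;
}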

include/linux/mm.h

Lines changed: 1 addition & 1 deletion
@@ -3161,7 +3161,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
 pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-			    struct vmem_altmap *altmap);
+			    struct vmem_altmap *altmap, struct page *reuse);
 void *vmemmap_alloc_block(unsigned long size, int node);
 struct vmem_altmap;
 void *vmemmap_alloc_block_buf(unsigned long size, int node,

mm/memremap.c

Lines changed: 1 addition & 0 deletions
@@ -287,6 +287,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 {
 	struct mhp_params params = {
 		.altmap = pgmap_altmap(pgmap),
+		.pgmap = pgmap,
 		.pgprot = PAGE_KERNEL,
 	};
 	const int nr_range = pgmap->nr_range;
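
With @pgmap now carried in mhp_params, a ZONE_DEVICE driver opts into the compound vmemmap path simply by setting @vmemmap_shift (and not passing an altmap) before calling memremap_pages(). A hedged driver-side sketch follows; the function and variable names are hypothetical, only the dev_pagemap fields and memremap_pages() come from the existing API:

#include <linux/memremap.h>
#include <linux/pgtable.h>

/* Hypothetical driver-side setup: request 2M compound pages so that
 * pgmap_vmemmap_nr() returns 512 and, with no altmap, section memmap
 * population goes through vmemmap_populate_compound_pages(). */
static struct dev_pagemap example_pgmap;

static void *example_map_compound(int nid, struct range *range)
{
	example_pgmap.type = MEMORY_DEVICE_GENERIC;
	example_pgmap.range = *range;
	example_pgmap.nr_range = 1;
	example_pgmap.vmemmap_shift = PMD_SHIFT - PAGE_SHIFT;	/* 9 on x86_64 */

	return memremap_pages(&example_pgmap, nid);
}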

mm/sparse-vmemmap.c

Lines changed: 122 additions & 10 deletions
@@ -533,16 +533,31 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 }
 
 pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-				       struct vmem_altmap *altmap)
+				       struct vmem_altmap *altmap,
+				       struct page *reuse)
 {
 	pte_t *pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte)) {
 		pte_t entry;
 		void *p;
 
-		p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
-		if (!p)
-			return NULL;
+		if (!reuse) {
+			p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
+			if (!p)
+				return NULL;
+		} else {
+			/*
+			 * When a PTE/PMD entry is freed from the init_mm
+			 * there's a free_pages() call to this page allocated
+			 * above. Thus this get_page() is paired with the
+			 * put_page_testzero() on the freeing path.
+			 * This can only be called by certain ZONE_DEVICE path,
+			 * and through vmemmap_populate_compound_pages() when
+			 * slab is available.
+			 */
+			get_page(reuse);
+			p = page_to_virt(reuse);
+		}
 		entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
 		set_pte_at(&init_mm, addr, pte, entry);
 	}
@@ -609,7 +624,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 }
 
 static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
-						  struct vmem_altmap *altmap)
+						  struct vmem_altmap *altmap,
+						  struct page *reuse)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -629,7 +645,7 @@ static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
 	pmd = vmemmap_pmd_populate(pud, addr, node);
 	if (!pmd)
 		return NULL;
-	pte = vmemmap_pte_populate(pmd, addr, node, altmap);
+	pte = vmemmap_pte_populate(pmd, addr, node, altmap, reuse);
 	if (!pte)
 		return NULL;
 	vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
@@ -639,13 +655,14 @@ static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
 
 static int __meminit vmemmap_populate_range(unsigned long start,
 					    unsigned long end, int node,
-					    struct vmem_altmap *altmap)
+					    struct vmem_altmap *altmap,
+					    struct page *reuse)
 {
 	unsigned long addr = start;
 	pte_t *pte;
 
 	for (; addr < end; addr += PAGE_SIZE) {
-		pte = vmemmap_populate_address(addr, node, altmap);
+		pte = vmemmap_populate_address(addr, node, altmap, reuse);
 		if (!pte)
 			return -ENOMEM;
 	}
@@ -656,7 +673,95 @@ static int __meminit vmemmap_populate_range(unsigned long start,
 int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 					 int node, struct vmem_altmap *altmap)
 {
-	return vmemmap_populate_range(start, end, node, altmap);
+	return vmemmap_populate_range(start, end, node, altmap, NULL);
+}
+
+/*
+ * For compound pages bigger than section size (e.g. x86 1G compound
+ * pages with 2M subsection size) fill the rest of sections as tail
+ * pages.
+ *
+ * Note that memremap_pages() resets @nr_range value and will increment
+ * it after each range successful onlining. Thus the value of @nr_range
+ * at section memmap populate corresponds to the in-progress range
+ * being onlined here.
+ */
+static bool __meminit reuse_compound_section(unsigned long start_pfn,
+					     struct dev_pagemap *pgmap)
+{
+	unsigned long nr_pages = pgmap_vmemmap_nr(pgmap);
+	unsigned long offset = start_pfn -
+		PHYS_PFN(pgmap->ranges[pgmap->nr_range].start);
+
+	return !IS_ALIGNED(offset, nr_pages) && nr_pages > PAGES_PER_SUBSECTION;
+}
+
+static pte_t * __meminit compound_section_tail_page(unsigned long addr)
+{
+	pte_t *pte;
+
+	addr -= PAGE_SIZE;
+
+	/*
+	 * Assuming sections are populated sequentially, the previous section's
+	 * page data can be reused.
+	 */
+	pte = pte_offset_kernel(pmd_off_k(addr), addr);
+	if (!pte)
+		return NULL;
+
+	return pte;
+}
+
+static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
+						     unsigned long start,
+						     unsigned long end, int node,
+						     struct dev_pagemap *pgmap)
+{
+	unsigned long size, addr;
+	pte_t *pte;
+	int rc;
+
+	if (reuse_compound_section(start_pfn, pgmap)) {
+		pte = compound_section_tail_page(start);
+		if (!pte)
+			return -ENOMEM;
+
+		/*
+		 * Reuse the page that was populated in the prior iteration
+		 * with just tail struct pages.
+		 */
+		return vmemmap_populate_range(start, end, node, NULL,
+					      pte_page(*pte));
+	}
+
+	size = min(end - start, pgmap_vmemmap_nr(pgmap) * sizeof(struct page));
+	for (addr = start; addr < end; addr += size) {
+		unsigned long next = addr, last = addr + size;
+
+		/* Populate the head page vmemmap page */
+		pte = vmemmap_populate_address(addr, node, NULL, NULL);
+		if (!pte)
+			return -ENOMEM;
+
+		/* Populate the tail pages vmemmap page */
+		next = addr + PAGE_SIZE;
+		pte = vmemmap_populate_address(next, node, NULL, NULL);
+		if (!pte)
+			return -ENOMEM;
+
+		/*
+		 * Reuse the previous page for the rest of tail pages
+		 * See layout diagram in Documentation/vm/vmemmap_dedup.rst
+		 */
+		next += PAGE_SIZE;
+		rc = vmemmap_populate_range(next, last, node, NULL,
+					    pte_page(*pte));
+		if (rc)
+			return -ENOMEM;
+	}
+
+	return 0;
 }
 
 struct page * __meminit __populate_section_memmap(unsigned long pfn,
@@ -665,12 +770,19 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
+	int r;
 
 	if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
 		!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
 		return NULL;
 
-	if (vmemmap_populate(start, end, nid, altmap))
+	if (is_power_of_2(sizeof(struct page)) &&
+	    pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
+		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
+	else
+		r = vmemmap_populate(start, end, nid, altmap);
+
+	if (r < 0)
 		return NULL;
 
 	return pfn_to_page(pfn);
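
As a cross-check of the cross-section behaviour described in the commit message, the following is a small user-space model (not kernel code) of the reuse_compound_section() decision for one 1G compound devmap, assuming x86_64 defaults of 128M sections and 2M subsections, with sections populated in ascending PFN order:

#include <stdbool.h>
#include <stdio.h>

#define PAGES_PER_SECTION	32768UL	/* 128M / 4K */
#define PAGES_PER_SUBSECTION	512UL	/* 2M / 4K */

/* Model of reuse_compound_section(): true when this section sits in the
 * middle of a compound page larger than a subsection, so its tail struct
 * pages can be mapped onto the previous section's tail vmemmap page. */
static bool reuse_tail(unsigned long start_pfn, unsigned long range_start_pfn,
		       unsigned long vmemmap_nr)
{
	unsigned long offset = start_pfn - range_start_pfn;

	return (offset % vmemmap_nr) != 0 && vmemmap_nr > PAGES_PER_SUBSECTION;
}

int main(void)
{
	unsigned long vmemmap_nr = 262144;	/* one 1G compound page */

	for (unsigned long pfn = 0; pfn < vmemmap_nr; pfn += PAGES_PER_SECTION)
		printf("section at pfn %6lu: %s\n", pfn,
		       reuse_tail(pfn, 0, vmemmap_nr) ?
		       "reuse previous tail vmemmap page" :
		       "allocate head + tail vmemmap pages");
	return 0;
}

Only the first section of the gigantic page allocates the head and tail vmemmap pages; the remaining seven sections map all of their struct pages onto the tail page populated by the first one.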
