
Commit 5535be3

davidhildenbrand authored and akpm00 committed
mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know that FOLL_FORCE can be possibly dangerous, especially if there are races that can be exploited by user space. Right now, it would be sufficient to have some code that sets a PTE of a R/O-mapped shared page dirty, in order for it to erroneously become writable by FOLL_FORCE. The implications of setting a write-protected PTE dirty might not be immediately obvious to everyone.

And in fact, ever since commit 9ae0f87 ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map a shmem page R/O while marking the pte dirty. This can be used by unprivileged user space to modify tmpfs/shmem file content even if the user does not have write permissions to the file, and to bypass memfd write sealing -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590).

To fix such security issues for good, the insight is that we really only need that fancy retry logic (FOLL_COW) for COW mappings that are not writable (!VM_WRITE). And in a COW mapping, we really only broke COW if we have an exclusive anonymous page mapped. If we have something else mapped, or the mapped anonymous page might be shared (!PageAnonExclusive), we have to trigger a write fault to break COW. If we don't find an exclusive anonymous page when we retry, we have to trigger COW breaking once again because something intervened.

Let's move away from this mandatory-retry + dirty handling and rely on our PageAnonExclusive() flag for making a similar decision, to use the same COW logic as in other kernel parts here as well. In case we stumble over a PTE in a COW mapping that does not map an exclusive anonymous page, COW was not properly broken and we have to trigger a fake write fault to break COW.
Just like we do in can_change_pte_writable() added via commit 64fe24a ("mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection") and commit 76aefad ("mm/mprotect: fix soft-dirty check in can_change_pte_writable()"), take care of softdirty and uffd-wp manually. For example, a write() via /proc/self/mem to a uffd-wp-protected range has to fail instead of silently granting write access and bypassing the userspace fault handler.

Note that FOLL_FORCE is not only used for debug access, but is also triggered by applications without debug intentions, for example, when pinning pages via RDMA.

This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.

Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE, so let's just get rid of it.

Thanks to Nadav Amit for pointing out that the pte_dirty() check in FOLL_FORCE code is problematic and might be exploitable.

Note 1: We don't check for the PTE being dirty because it doesn't matter for making a "was COWed" decision anymore, and whoever modifies the page has to set the page dirty either way.

Note 2: Kernels before extended uffd-wp support and before PageAnonExclusive (< 5.19) can simply revert the problematic commit instead and be safe regarding UFFDIO_CONTINUE. A backport to v5.19 requires minor adjustments due to lack of vma_soft_dirty_enabled().
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 9ae0f87 ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte")
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: David Laight <[email protected]>
Cc: <[email protected]> [5.16]
Signed-off-by: Andrew Morton <[email protected]>
1 parent 3788778 commit 5535be3

3 files changed: 89 additions, 44 deletions

include/linux/mm.h

Lines changed: 0 additions & 1 deletion
@@ -2885,7 +2885,6 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
 #define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
 #define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
-#define FOLL_COW	0x4000	/* internal GUP flag */
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */

mm/gup.c

Lines changed: 46 additions & 22 deletions
@@ -478,14 +478,42 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 	return -EEXIST;
 }
 
-/*
- * FOLL_FORCE can write to even unwritable pte's, but only
- * after we've gone through a COW cycle and they are dirty.
- */
-static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
+/* FOLL_FORCE can write to even unwritable PTEs in COW mappings. */
+static inline bool can_follow_write_pte(pte_t pte, struct page *page,
+					struct vm_area_struct *vma,
+					unsigned int flags)
 {
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+	/* If the pte is writable, we can write to the page. */
+	if (pte_write(pte))
+		return true;
+
+	/* Maybe FOLL_FORCE is set to override it? */
+	if (!(flags & FOLL_FORCE))
+		return false;
+
+	/* But FOLL_FORCE has no effect on shared mappings */
+	if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
+		return false;
+
+	/* ... or read-only private ones */
+	if (!(vma->vm_flags & VM_MAYWRITE))
+		return false;
+
+	/* ... or already writable ones that just need to take a write fault */
+	if (vma->vm_flags & VM_WRITE)
+		return false;
+
+	/*
+	 * See can_change_pte_writable(): we broke COW and could map the page
+	 * writable if we have an exclusive anonymous page ...
+	 */
+	if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+		return false;
+
+	/* ... and a write-fault isn't required for other reasons. */
+	if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
+		return false;
+	return !userfaultfd_pte_wp(vma, pte);
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -528,12 +556,19 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 	if ((flags & FOLL_NUMA) && pte_protnone(pte))
 		goto no_page;
-	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
-		pte_unmap_unlock(ptep, ptl);
-		return NULL;
-	}
 
 	page = vm_normal_page(vma, address, pte);
+
+	/*
+	 * We only care about anon pages in can_follow_write_pte() and don't
+	 * have to worry about pte_devmap() because they are never anon.
+	 */
+	if ((flags & FOLL_WRITE) &&
+	    !can_follow_write_pte(pte, page, vma, flags)) {
+		page = NULL;
+		goto out;
+	}
+
 	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
 		/*
 		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
@@ -986,17 +1021,6 @@ static int faultin_page(struct vm_area_struct *vma,
 		return -EBUSY;
 	}
 
-	/*
-	 * The VM_FAULT_WRITE bit tells us that do_wp_page has broken COW when
-	 * necessary, even if maybe_mkwrite decided not to set pte_write. We
-	 * can thus safely do subsequent page lookups as if they were reads.
-	 * But only do so when looping for pte_write is futile: in some cases
-	 * userspace may also be wanting to write to the gotten user page,
-	 * which a read fault here might prevent (a readonly page might get
-	 * reCOWed by userspace write).
-	 */
-	if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
-		*flags |= FOLL_COW;
 	return 0;
 }
 

mm/huge_memory.c

Lines changed: 43 additions & 21 deletions
@@ -1040,12 +1040,6 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
-	/*
-	 * When we COW a devmap PMD entry, we split it into PTEs, so we should
-	 * not be in this function with `flags & FOLL_COW` set.
-	 */
-	WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");
-
 	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
 	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
 			 (FOLL_PIN | FOLL_GET)))
@@ -1395,14 +1389,42 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	return VM_FAULT_FALLBACK;
 }
 
-/*
- * FOLL_FORCE can write to even unwritable pmd's, but only
- * after we've gone through a COW cycle and they are dirty.
- */
-static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+/* FOLL_FORCE can write to even unwritable PMDs in COW mappings. */
+static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
+					struct vm_area_struct *vma,
+					unsigned int flags)
 {
-	return pmd_write(pmd) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+	/* If the pmd is writable, we can write to the page. */
+	if (pmd_write(pmd))
+		return true;
+
+	/* Maybe FOLL_FORCE is set to override it? */
+	if (!(flags & FOLL_FORCE))
+		return false;
+
+	/* But FOLL_FORCE has no effect on shared mappings */
+	if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
+		return false;
+
+	/* ... or read-only private ones */
+	if (!(vma->vm_flags & VM_MAYWRITE))
+		return false;
+
+	/* ... or already writable ones that just need to take a write fault */
+	if (vma->vm_flags & VM_WRITE)
+		return false;
+
+	/*
+	 * See can_change_pte_writable(): we broke COW and could map the page
+	 * writable if we have an exclusive anonymous page ...
+	 */
+	if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+		return false;
+
+	/* ... and a write-fault isn't required for other reasons. */
+	if (vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd))
+		return false;
+	return !userfaultfd_huge_pmd_wp(vma, pmd);
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
@@ -1411,23 +1433,24 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 					unsigned int flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	struct page *page = NULL;
+	struct page *page;
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
-	if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
-		goto out;
+	page = pmd_page(*pmd);
+	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+
+	if ((flags & FOLL_WRITE) &&
+	    !can_follow_write_pmd(*pmd, page, vma, flags))
+		return NULL;
 
 	/* Avoid dumping huge zero page */
 	if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
 		return ERR_PTR(-EFAULT);
 
 	/* Full NUMA hinting faults to serialise migration in fault paths */
 	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
-		goto out;
-
-	page = pmd_page(*pmd);
-	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+		return NULL;
 
 	if (!pmd_write(*pmd) && gup_must_unshare(flags, page))
 		return ERR_PTR(-EMLINK);
@@ -1444,7 +1467,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
 
-out:
 	return page;
 }
