mm/swap: fix race when skipping swapcache

jira LE-1907
cve CVE-2024-26759
Rebuild_History Non-Buildable kernel-4.18.0-553.8.1.el8_10
commit-author Kairui Song <[email protected]>
commit 13ddaf26be324a7f951891ecd9ccd04466d27458
Empty-Commit: Cherry-Pick Conflicts during history rebuild.
Will be included in final tarball splat. Ref for failed cherry-pick at:
ciq/ciq_backports/kernel-4.18.0-553.8.1.el8_10/13ddaf26.failed

When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
swap in the same entry at the same time, they get different pages (A, B).
Before one thread (T0) finishes the swapin and installs page (A) to the
PTE, another thread (T1) could finish swapin of page (B), swap_free the
entry, then swap out the possibly modified page, reusing the same entry.
This breaks the pte_same check in (T0) because the PTE value is unchanged,
causing an ABA problem. Thread (T0) will then install a stale page (A)
into the PTE and cause data corruption.

One possible callstack is like this:

CPU0                                 CPU1
----                                 ----
do_swap_page()                       do_swap_page() with same entry
<direct swapin path>                 <direct swapin path>
<alloc page A>                       <alloc page B>
swap_read_folio() <- read to page A  swap_read_folio() <- read to page B
<slow on later locks or interrupt>   <finished swapin first>
...                                  set_pte_at()
                                     swap_free() <- entry is free
                                     <write to page B, now page A stale>
                                     <swap out page B to same swap entry>
pte_same() <- check passes, PTE seems
              unchanged, but page A
              is stale!
swap_free() <- page B content lost!
set_pte_at() <- stale page A installed!

Besides, for ZRAM, swap_free() allows the swap device to discard the
entry content, so even if page (B) is not modified, if swap_read_folio()
on CPU0 happens later than swap_free() on CPU1, it may also cause data
loss.

To fix this, reuse swapcache_prepare(), which pins the swap entry using
the cache flag, allows only one thread to swap it in, and prevents any
parallel code from putting the entry into the swap cache. The pin is
released after the page table lock is dropped.

Racers just loop and wait, since it's a rare and very short event. A
schedule_timeout_uninterruptible(1) call is added to avoid repeated page
faults wasting too much CPU, causing livelock, or adding too much noise
to perf statistics. A similar livelock issue was described in commit
029c4628b2eb ("mm: swap: get rid of livelock in swapin readahead").
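
Condensed, the fixed direct swapin path looks like the sketch below. This
is a simplified extract of the mm/memory.c hunks further down, not the
verbatim kernel code; page allocation and most error handling are omitted:

	if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
	    __swap_count(entry) == 1) {
		/*
		 * Pin the entry with SWAP_HAS_CACHE so only one thread
		 * performs the direct swapin; losers back off briefly
		 * and retry the fault instead of spinning.
		 */
		if (swapcache_prepare(entry)) {
			schedule_timeout_uninterruptible(1);
			goto out;
		}
		need_clear_cache = true;
		/* skip swapcache: allocate the page and read it in */
	}
	/* ... install the PTE under the page table lock ... */
out:
	/* Drop the pin only after the page table lock is released. */
	if (need_clear_cache)
		swapcache_clear(si, entry);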

Reproducer:

This race issue can be triggered easily using a well-constructed
reproducer and a patched brd (with a delay in the read path) [1]:

With the latest 6.8 mainline, race-caused data loss can be observed
easily:
$ gcc -g -lpthread test-thread-swap-race.c && ./a.out
  Polulating 32MB of memory region...
  Keep swapping out...
  Starting round 0...
  Spawning 65536 workers...
  32746 workers spawned, wait for done...
  Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss!
  Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss!
  Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss!
  Round 0 Failed, 15 data loss!

This reproducer spawns multiple threads sharing the same memory region,
using a small swap device. Every two threads update mapped pages one by
one in opposite directions, trying to create a race, with one dedicated
thread that keeps swapping the data out using madvise.
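
The actual reproducer lives at [1]; the sketch below only illustrates the
structure described above. The region size, page size, and thread layout
are illustrative assumptions, not the real test (which also runs many
rounds and verifies the data afterwards):

	/* gcc -g -lpthread repro-sketch.c; needs a kernel with MADV_PAGEOUT */
	#include <pthread.h>
	#include <string.h>
	#include <sys/mman.h>

	#ifndef MADV_PAGEOUT
	#define MADV_PAGEOUT 21		/* from the Linux uapi headers */
	#endif

	#define REGION (32UL << 20)	/* 32MB shared region */
	static char *region;

	/* Two writers walk the region in opposite directions. */
	static void *writer(void *arg)
	{
		unsigned long forward = (unsigned long)arg;

		for (unsigned long i = 0; i < REGION; i += 4096) {
			unsigned long off = forward ? i : REGION - 4096 - i;
			region[off]++;	/* update mapped pages one by one */
		}
		return NULL;
	}

	/* One dedicated thread keeps forcing the data out to swap. */
	static void *swapper(void *arg)
	{
		for (;;)
			madvise(region, REGION, MADV_PAGEOUT);
		return NULL;
	}

	int main(void)
	{
		pthread_t t[3];

		region = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
			      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		memset(region, 1, REGION);	/* populate the region */
		pthread_create(&t[0], NULL, swapper, NULL);
		pthread_create(&t[1], NULL, writer, (void *)1);
		pthread_create(&t[2], NULL, writer, (void *)0);
		pthread_join(t[1], NULL);
		pthread_join(t[2], NULL);
		return 0;
	}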

The reproducer hit the race about once every 5 minutes, so the race
should be entirely possible in production.

After this patch, I ran the reproducer for a few hundred rounds and
observed no data loss.

Performance overhead is minimal; a microbenchmark swapping in 10G from a
32G zram device shows:

Before:  10934698 us
After:   11157121 us
Cached:  13155355 us (dropping the SWP_SYNCHRONOUS_IO flag)
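
That is, the fix costs (11157121 - 10934698) / 10934698, about 2%, while
always going through the swapcache ("Cached") would cost about 20%.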

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device")
Reported-by: "Huang, Ying" <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]/
Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1]
Signed-off-by: Kairui Song <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Acked-by: Yu Zhao <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Chris Li <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Barry Song <[email protected]>
Cc: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 13ddaf26be324a7f951891ecd9ccd04466d27458)
Signed-off-by: Jonathan Maple <[email protected]>

# Conflicts:
#	mm/memory.c
#	mm/swap.h
diff --cc mm/memory.c
index 151674cf8f0e,0bfc8b007c01..000000000000
--- a/mm/memory.c
+++ b/mm/memory.c
@@@ -3178,12 -3795,14 +3178,18 @@@ EXPORT_SYMBOL(unmap_mapping_range)
  vm_fault_t do_swap_page(struct vm_fault *vmf)
  {
  	struct vm_area_struct *vma = vmf->vma;
 -	struct folio *swapcache, *folio = NULL;
 -	struct page *page;
 +	struct page *page = NULL, *swapcache;
  	struct swap_info_struct *si = NULL;
++<<<<<<< HEAD
++=======
+ 	rmap_t rmap_flags = RMAP_NONE;
+ 	bool need_clear_cache = false;
+ 	bool exclusive = false;
++>>>>>>> 13ddaf26be32 (mm/swap: fix race when skipping swapcache)
  	swp_entry_t entry;
  	pte_t pte;
 +	int locked;
 +	int exclusive = 0;
  	vm_fault_t ret = 0;
  	void *shadow = NULL;
@@@ -3212,22 -3860,39 +3218,36 @@@
  	if (unlikely(!si))
  		goto out;
  
 -	folio = swap_cache_get_folio(entry, vma, vmf->address);
 -	if (folio)
 -		page = folio_file_page(folio, swp_offset(entry));
 -	swapcache = folio;
 +	delayacct_set_flag(current, DELAYACCT_PF_SWAPIN);
 +	page = lookup_swap_cache(entry, vma, vmf->address);
 +	swapcache = page;
  
 -	if (!folio) {
 +	if (!page) {
  		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
  		    __swap_count(entry) == 1) {
+ 			/*
+ 			 * Prevent parallel swapin from proceeding with
+ 			 * the cache flag. Otherwise, another thread may
+ 			 * finish swapin first, free the entry, and swapout
+ 			 * reusing the same entry. It's undetectable as
+ 			 * pte_same() returns true due to entry reuse.
+ 			 */
+ 			if (swapcache_prepare(entry)) {
+ 				/* Relax a bit to prevent rapid repeated page faults */
+ 				schedule_timeout_uninterruptible(1);
+ 				goto out;
+ 			}
+ 			need_clear_cache = true;
+ 
  			/* skip swapcache */
 -			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
 -						vma, vmf->address, false);
 -			page = &folio->page;
 -			if (folio) {
 -				__folio_set_locked(folio);
 -				__folio_set_swapbacked(folio);
 -
 -				if (mem_cgroup_swapin_charge_folio(folio,
 -							vma->vm_mm, GFP_KERNEL,
 -							entry)) {
 +			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
 +					vmf->address);
 +			if (page) {
 +				__SetPageLocked(page);
 +				__SetPageSwapBacked(page);
 +
 +				if (mem_cgroup_swapin_charge_page(page,
 +						vma->vm_mm, GFP_KERNEL, entry)) {
  					ret = VM_FAULT_OOM;
  					goto out_page;
  				}
@@@ -3381,23 -4127,30 +3401,28 @@@
  	}
  
  	/* No need to invalidate - it was non-present before */
 -	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
 +	update_mmu_cache(vma, vmf->address, vmf->pte);
  unlock:
 -	if (vmf->pte)
 -		pte_unmap_unlock(vmf->pte, vmf->ptl);
 +	pte_unmap_unlock(vmf->pte, vmf->ptl);
  out:
+ 	/* Clear the swap cache pin for direct swapin after PTL unlock */
+ 	if (need_clear_cache)
+ 		swapcache_clear(si, entry);
  	if (si)
  		put_swap_device(si);
  	return ret;
  out_nomap:
 -	if (vmf->pte)
 -		pte_unmap_unlock(vmf->pte, vmf->ptl);
 +	pte_unmap_unlock(vmf->pte, vmf->ptl);
  out_page:
 -	folio_unlock(folio);
 +	unlock_page(page);
  out_release:
 -	folio_put(folio);
 -	if (folio != swapcache && swapcache) {
 -		folio_unlock(swapcache);
 -		folio_put(swapcache);
 +	put_page(page);
 +	if (page != swapcache && swapcache) {
 +		unlock_page(swapcache);
 +		put_page(swapcache);
  	}
+ 	if (need_clear_cache)
+ 		swapcache_clear(si, entry);
  	if (si)
  		put_swap_device(si);
  	return ret;
* Unmerged path mm/swap.h
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3b25e695d36c..095b25807ea7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -541,6 +541,11 @@ static inline int swap_duplicate(swp_entry_t swp)
 	return 0;
 }
 
+static inline int swapcache_prepare(swp_entry_t swp)
+{
+	return 0;
+}
+
 static inline void swap_free(swp_entry_t swp)
 {
 }
* Unmerged path mm/memory.c
* Unmerged path mm/swap.h
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 003334905ddc..d334cf54dcca 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3592,6 +3592,19 @@ int swapcache_prepare(swp_entry_t entry)
 	return __swap_duplicate(entry, SWAP_HAS_CACHE);
 }
 
+void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry)
+{
+	struct swap_cluster_info *ci;
+	unsigned long offset = swp_offset(entry);
+	unsigned char usage;
+
+	ci = lock_cluster_or_swap_info(si, offset);
+	usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE);
+	unlock_cluster_or_swap_info(si, ci);
+	if (!usage)
+		free_swap_slot(entry);
+}
+
 struct swap_info_struct *swp_swap_info(swp_entry_t entry)
 {
 	return swap_type_to_swap_info(swp_type(entry));