Conversation

@bmastbergen (Collaborator)
jira VULN-1303
cve CVE-2024-36971

commit-author Eric Dumazet <[email protected]>
commit 92f1655aa2b2294d0b49925f3b875a634bd3b59e
upstream-diff This change breaks the kabi.  Use the RH_KABI_REPLACE
              macro to define the negative_advice function such that
              check-kabi will still pass.  From rh_kabi.h:

              "The RH_KABI_REPLACE* macros attempt to add the ability
               to use the '_new' element while preserving size
               alignment and kabi agreement with the '_orig' element."

              Also, this version of the kernel does not have rcu
              locking in ip6_negative_advice like the upstream does
              so the rcu_read_lock/rcu_read_unlock calls are not
              in this changeset.

__dst_negative_advice() does not enforce proper RCU rules when sk->dst_cache must be cleared, leading to possible UAF.

RCU rules are that we must first clear sk->sk_dst_cache, then call dst_release(old_dst).

Note that sk_dst_reset(sk) is implementing this protocol correctly, while __dst_negative_advice() uses the wrong order.

Given that ip6_negative_advice() has special logic against RTF_CACHE, this means each of the three ->negative_advice() existing methods must perform the sk_dst_reset() themselves.

Note the check against NULL dst is centralized in __dst_negative_advice();
there is no need to duplicate it in various callbacks.

Many thanks to Clement Lecigne for tracking this issue.

This old bug became visible after the blamed commit, using UDP sockets.

Fixes: a87cb3e48ee8 ("net: Facility to report route quality of connected sockets")
Reported-by: Clement Lecigne <[email protected]>
Diagnosed-by: Clement Lecigne <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Tom Herbert <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit 92f1655aa2b2294d0b49925f3b875a634bd3b59e)
Signed-off-by: Brett Mastbergen <[email protected]>

Build log

build.log

Testing

The very limited kselftests were run before and after the change was applied:

selftests-before.log

selftests-after.log

brett@lycia ~/ciq/vuln-1303 % grep ^ok selftests-before.log | wc -l
2
brett@lycia ~/ciq/vuln-1303 % grep ^ok selftests-after.log | wc -l
2
brett@lycia ~/ciq/vuln-1303 %

@PlaidCat (Collaborator) left a comment:

:shipit:

@kerneltoast (Collaborator) left a comment:

I think a KABI break is required, otherwise an OOT module which really depends on the old KABI will introduce the CVE via its negative_advice callback.

@kerneltoast self-requested a review on March 11, 2025 at 16:46
@kerneltoast (Collaborator) left a comment:

Does indeed break the KABI as desired, sorry for the confusion on my part.

@bmastbergen (Collaborator, Author)

I think a KABI break is required, otherwise an OOT module which really depends on the old KABI will introduce the CVE via its negative_advice callback.

Just to sync up the Slack discussion here: the RH_KABI_REPLACE macro does force users to use the new call, but it also fools the check-kabi tool into thinking everything is OK. In reality, the kABI is broken by this change.

@bmastbergen merged commit be03ca1 into ciqcbr7_9 on Mar 11, 2025
@bmastbergen deleted the bmastbergen_ciqcbr7_9/VULN-1303 branch on March 11, 2025 at 16:55
github-actions bot pushed a commit that referenced this pull request May 26, 2025
apply_to_pte_range() enters the lazy MMU mode and then invokes the
kasan_populate_vmalloc_pte() callback on each page table walk iteration.
However, the callback can sleep when trying to allocate a single page,
e.g. if an architecture disables preemption on lazy MMU mode enter.

On s390, if arch_enter_lazy_mmu_mode() is made to call preempt_disable()
and arch_leave_lazy_mmu_mode() to call preempt_enable(), the following
crash occurs:

[    0.663336] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321
[    0.663348] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 2, name: kthreadd
[    0.663358] preempt_count: 1, expected: 0
[    0.663366] RCU nest depth: 0, expected: 0
[    0.663375] no locks held by kthreadd/2.
[    0.663383] Preemption disabled at:
[    0.663386] [<0002f3284cbb4eda>] apply_to_pte_range+0xfa/0x4a0
[    0.663405] CPU: 0 UID: 0 PID: 2 Comm: kthreadd Not tainted 6.15.0-rc5-gcc-kasan-00043-gd76bb1ebb558-dirty #162 PREEMPT
[    0.663408] Hardware name: IBM 3931 A01 701 (KVM/Linux)
[    0.663409] Call Trace:
[    0.663410]  [<0002f3284c385f58>] dump_stack_lvl+0xe8/0x140
[    0.663413]  [<0002f3284c507b9e>] __might_resched+0x66e/0x700
[    0.663415]  [<0002f3284cc4f6c0>] __alloc_frozen_pages_noprof+0x370/0x4b0
[    0.663419]  [<0002f3284ccc73c0>] alloc_pages_mpol+0x1a0/0x4a0
[    0.663421]  [<0002f3284ccc8518>] alloc_frozen_pages_noprof+0x88/0xc0
[    0.663424]  [<0002f3284ccc8572>] alloc_pages_noprof+0x22/0x120
[    0.663427]  [<0002f3284cc341ac>] get_free_pages_noprof+0x2c/0xc0
[    0.663429]  [<0002f3284cceba70>] kasan_populate_vmalloc_pte+0x50/0x120
[    0.663433]  [<0002f3284cbb4ef8>] apply_to_pte_range+0x118/0x4a0
[    0.663435]  [<0002f3284cbc7c14>] apply_to_pmd_range+0x194/0x3e0
[    0.663437]  [<0002f3284cbc99be>] __apply_to_page_range+0x2fe/0x7a0
[    0.663440]  [<0002f3284cbc9e88>] apply_to_page_range+0x28/0x40
[    0.663442]  [<0002f3284ccebf12>] kasan_populate_vmalloc+0x82/0xa0
[    0.663445]  [<0002f3284cc1578c>] alloc_vmap_area+0x34c/0xc10
[    0.663448]  [<0002f3284cc1c2a6>] __get_vm_area_node+0x186/0x2a0
[    0.663451]  [<0002f3284cc1e696>] __vmalloc_node_range_noprof+0x116/0x310
[    0.663454]  [<0002f3284cc1d950>] __vmalloc_node_noprof+0xd0/0x110
[    0.663457]  [<0002f3284c454b88>] alloc_thread_stack_node+0xf8/0x330
[    0.663460]  [<0002f3284c458d56>] dup_task_struct+0x66/0x4d0
[    0.663463]  [<0002f3284c45be90>] copy_process+0x280/0x4b90
[    0.663465]  [<0002f3284c460940>] kernel_clone+0xd0/0x4b0
[    0.663467]  [<0002f3284c46115e>] kernel_thread+0xbe/0xe0
[    0.663469]  [<0002f3284c4e440e>] kthreadd+0x50e/0x7f0
[    0.663472]  [<0002f3284c38c04a>] __ret_from_fork+0x8a/0xf0
[    0.663475]  [<0002f3284ed57ff2>] ret_from_fork+0xa/0x38

Instead of allocating single pages per PTE, bulk-allocate the shadow
memory prior to applying the kasan_populate_vmalloc_pte() callback to
the page range.

Link: https://lkml.kernel.org/r/c61d3560297c93ed044f0b1af085610353a06a58.1747316918.git.agordeev@linux.ibm.com
Fixes: 3c5c3cf ("kasan: support backing vmalloc space with real shadow memory")
Signed-off-by: Alexander Gordeev <[email protected]>
Suggested-by: Andrey Ryabinin <[email protected]>
Reviewed-by: Harry Yoo <[email protected]>
Cc: Daniel Axtens <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>