Skip to content

[LTS 8.8 RT] CVE-2023-4623, VULN-6710 #138

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Feb 21, 2025

Conversation

pvts-mat
Copy link
Contributor

[LTS 8.8 RT]
CVE-2023-4623
VULN-6710

Problem

https://www.cve.org/CVERecord?id=CVE-2023-4623

A use-after-free vulnerability in the Linux kernel's net/sched: sch_hfsc (HFSC qdisc traffic control) component can be exploited to achieve local privilege escalation. If a class with a link-sharing curve (i.e. with the HFSC_FSC flag set) has a parent without a link-sharing curve, then init_vf() will call vttree_insert() on the parent, but vttree_remove() will be skipped in update_vf(). This leaves a dangling pointer that can cause a use-after-free.

Analysis and solution

A single commit was identified as a fix for this issue: b3d26c5702c7d6c45456326e56d2ccf3f103e60f net/sched: sch_hfsc: Ensure inner classes have fsc curve.

The solution consisted of rejecting the addition of a class with a link-sharing curve to the class without it (see Specific tests for details):

[root@ciqlts8_8-rt pvts]# (
    tc qdisc add dev lo root handle 1: hfsc default 10
    tc class add dev lo parent 1: classid 1:1 hfsc rt m2 100kbps
    tc class add dev lo parent 1:1 classid 1:10 hfsc ls m2 50kbps
    echo $?
)

Error: Invalid parent - parent class must have FSC.
2

The fix introduced a problem with existing network setup scripts for some users https://lore.kernel.org/all/[email protected]/:

I upgraded from 6.1.38 to 6.1.55 this morning and it broke my traffic shaping script, leaving me with a non-functional uplink on a remote router.

It was decided to fix the problem without breaking backwards compatibility https://lore.kernel.org/all/[email protected]/:

Setting 'rt' as a parent is incorrect and the man page is explicit about
it as it doesn't make sense 'qdisc wise'. Being able to set it has
always been wrong unfortunately…

Sure but unfortunately "we don't break backward compat" means
we can't really argue. It will take us more time to debate this
than to fix it (assuming we understand the initial problem).

Frankly one can even argue whether "exploitable by root / userns"
is more important than single user's init scripts breaking.
The "security" issues for root are dime a dozen.

The solution was to change the erroneous qdisc hierarchy to a correct one when the possible UAF condition was detected https://lore.kernel.org/all/[email protected]/:

As reported [1] disallowing 'rt' inner curves breaks already existing
scripts. Even though it doesn't make sense 'qdisc wise' users have been
relying on this behaviour since the qdisc inception.

We need users to update the scripts/applications to use 'sc' or 'ls'
as a inner curve instead of 'rt', but also avoid the UAF found by
Budimir, which was present since the qdisc inception.

The fix of the fix is given in the commit a13b67c9a015c4e21601ef9aa4ec9c5d972df1b4 net/sched: sch_hfsc: upgrade 'rt' to 'sc' when it becomes a inner curve

While the changes could be squashed into a single commit it was decided to retain the sequence of two commits for more straightforward LTS 8.8 RT - mainline patches correspondence.

The same solution was used already in other "version 8" branches centos8, fips-8-complaint/4.18.0-553.16.1, fips-8/4.18.0-553.16.1, rocky8_10, sig-cloud-8/4.18.0-553.22.1.el8_10, sig-cloud-8/4.18.0-553.33.1.el8_10, sig-cloud-8/4.18.0-553.36.1.el8_10:

git --no-pager log --decorate  --format="%h %cd %d %s" --date=short -n 2 ae71642ce

ae71642ce 2024-09-11  net/sched: sch_hfsc: upgrade 'rt' to 'sc' when it becomes a inner curve
1a61f0a5d 2024-09-11  net/sched: sch_hfsc: Ensure inner classes have fsc curve

kABI check: omitted (unstable ABI of RT kernels)

Boot test: passed

Refer to Specific tests for implicit boot test passing.

Kselftests: passed relative

Methodology

A mix of kernel-selftests-internal and source-compiled tests were used:

  • kernel-selftests-internal: bpf tests, except:
    • bpf:test_kmod.sh: takes very long time to finish and always fails anyway,
    • bpf:test_progs: unstable, can crash the machine,
    • bpf:test_progs-no_alu32: unstable, can crash the machine.
  • source-compiled: all the rest.

Coverage (including tests skipped during execution)

android, bpf, breakpoints, capabilities, cgroup, core, cpu-hotplug, cpufreq, drivers/net/bonding, drivers/net/team, efivarfs, exec, filesystems, firmware, fpu, ftrace, futex, gpio, intel_pstate, ipc, kcmp, kvm, lib, livepatch, membarrier, memfd, memory-hotplug, mount, mqueue, net, net/forwarding, net/mptcp, netfilter, nsfs, proc, pstore, ptrace, rseq, rtc, sgx, sigaltstack, size, splice, static_keys, sync, sysctl, tc-testing, tdx, timens, timers, tpm2, user, vm, x86, zram

Reference ciqlts8_8-rt (65d2b53323ce7b26a298afe3f8f7222f33d69f03)

Two test runs were conducted on the reference kernel.
kselftests–mixed–ciqlts8_8-rt–run1.log
kselftests–mixed–ciqlts8_8-rt–run2.log

Patch ciqlts8_8-rt-CVE-2023-4623 (6e54bd4657815ea115cd4e58ab1afcb8648f9277)

Two test runs were conducted on the patched kernel.
kselftests–mixed–ciqlts8_8-rt-CVE-2023-4623–run1.log
kselftests–mixed–ciqlts8_8-rt-CVE-2023-4623–run2.log

Comparison

ktests.xsh  table --where "Summary = 'diff'"  kselftests*.log

Column    File
--------  -------------------------------------------------------
Status0   kselftests--mixed--ciqlts8_8-rt--run1.log
Status1   kselftests--mixed--ciqlts8_8-rt--run2.log
Status2   kselftests--mixed--ciqlts8_8-rt-CVE-2023-4623--run1.log
Status3   kselftests--mixed--ciqlts8_8-rt-CVE-2023-4623--run2.log

TestCase                   Status0  Status1  Status2  Status3  Summary
bpf:test_tcpnotify_user    fail     fail     pass     fail     diff
kvm:hardware_disable_test  pass     fail     pass     pass     diff
net/mptcp:simult_flows.sh  fail     fail     fail     pass     diff
net:gro.sh                 pass     pass     fail     fail     diff
net:reuseport_addr_any.sh  pass     fail     fail     fail     diff
net:tls                    pass     pass     pass     fail     diff

Tests bpf:test_tcpnotify_user, kvm:hardware_disable_test, net:gro.sh have already shown inconsistent behavior before https://docs.google.com/spreadsheets/d/1tUwJ2rV57cYZXh7momPtraSjZcHDjMYHLeHA3DYWrUU/edit?pli=1&gid=0#gid=0&range=H:I

The net:reuseport_addr_any.sh test showed inconsistent behavior in the reference tests set itself. Added to the list of "flappy" tests for the ciqlts8_8-rt platform.

Specific tests: passed

The potential UAF condition was found to be reproducible with the following tc commands sequence:

tc qdisc add dev lo root handle 1: hfsc default 10
# Inner hfsc class constructed with 'rt' - realtime service
tc class add dev lo parent 1: classid 1:1 hfsc rt m2 100kbps
# Leaf hfsc class constructed with 'ls' - linksharing service
tc class add dev lo parent 1:1 classid 1:10 hfsc ls m2 50kbps

The "100kbps", "50kbps" parts are arbitrary. What's important is the use of rt for the inner class and ls for the leaf class. While the exact UAF was not obtained the commands helped confirm the efficacy of the patch.

Reference

The incorrect qdisc hierarchy can be created without any guardrails.

[root@ciqlts8_8-rt pvts]# (
    tc qdisc add dev lo root handle 1: hfsc default 10
    tc class add dev lo parent 1: classid 1:1 hfsc rt m2 100kbps
    tc class add dev lo parent 1:1 classid 1:10 hfsc ls m2 50kbps
    echo $?
)
0
[root@ciqlts8_8-rt pvts]# tc -g class show dev lo
tc -g class show dev lo
+---(1:) hfsc 
     +---(1:1) hfsc rt m1 0bit d 0us m2 800Kbit 
          +---(1:10) hfsc ls m1 0bit d 0us m2 400Kbit 

Full logs:
fix-replicate–ciqlts8_8-rt.log

Patch

Creating the incorrect qdisc hierarchy raises a warning, but succeeds. Notice the type of inner class being sc instead of rt as shown by tc -g class show dev lo command.

[root@ciqlts8_8-rt pvts]# (
    tc qdisc add dev lo root handle 1: hfsc default 10
    tc class add dev lo parent 1: classid 1:1 hfsc rt m2 100kbps
    tc class add dev lo parent 1:1 classid 1:10 hfsc ls m2 50kbps
    echo $?
)
Warning: Forced curve change on parent 'rt' to 'sc'.
0
[root@ciqlts8_8-rt pvts]# tc -g class show dev lo
tc -g class show dev lo
+---(1:) hfsc 
     +---(1:1) hfsc sc m1 0bit d 0us m2 800Kbit 
          +---(1:10) hfsc ls m1 0bit d 0us m2 400Kbit 

Full logs:
fix-replicate–ciqlts8_8-rt-CVE-2023-4623.log

jira VULN-6710
cve CVE-2023-4623
commit-author Budimir Markovic <[email protected]>
commit b3d26c5

HFSC assumes that inner classes have an fsc curve, but it is currently
possible for classes without an fsc curve to become parents. This leads
to bugs including a use-after-free.

Don't allow non-root classes without HFSC_FSC to become parents.

Fixes: 1da177e ("Linux-2.6.12-rc2")
	Reported-by: Budimir Markovic <[email protected]>
	Signed-off-by: Budimir Markovic <[email protected]>
	Acked-by: Jamal Hadi Salim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
	Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit b3d26c5)
	Signed-off-by: Marcin Wcisło <[email protected]>
jira VULN-6710
cve CVE-2023-4623
commit-author Pedro Tammela <[email protected]>
commit a13b67c

Christian Theune says:
   I upgraded from 6.1.38 to 6.1.55 this morning and it broke my traffic shaping script,
   leaving me with a non-functional uplink on a remote router.

A 'rt' curve cannot be used as a inner curve (parent class), but we were
allowing such configurations since the qdisc was introduced. Such
configurations would trigger a UAF as Budimir explains:
   The parent will have vttree_insert() called on it in init_vf(),
   but will not have vttree_remove() called on it in update_vf()
   because it does not have the HFSC_FSC flag set.

The qdisc always assumes that inner classes have the HFSC_FSC flag set.
This is by design as it doesn't make sense 'qdisc wise' for an 'rt'
curve to be an inner curve.

Budimir's original patch disallows users to add classes with a 'rt'
parent, but this is too strict as it breaks users that have been using
'rt' as a inner class. Another approach, taken by this patch, is to
upgrade the inner 'rt' into a 'sc', warning the user in the process.
It avoids the UAF reported by Budimir while also being more permissive
to bad scripts/users/code using 'rt' as a inner class.

Users checking the `tc class ls [...]` or `tc class get [...]` dumps would
observe the curve change and are potentially breaking with this change.

v1->v2: https://lore.kernel.org/all/[email protected]/
- Correct 'Fixes' tag and merge with revert (Jakub)

	Cc: Christian Theune <[email protected]>
	Cc: Budimir Markovic <[email protected]>
Fixes: b3d26c5 ("net/sched: sch_hfsc: Ensure inner classes have fsc curve")
	Signed-off-by: Pedro Tammela <[email protected]>
	Acked-by: Jamal Hadi Salim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
	Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit a13b67c)
	Signed-off-by: Marcin Wcisło <[email protected]>
Copy link

@gvrose8192 gvrose8192 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - Putting into review so that it can be merged later after reviews. Checks have passed.

Copy link
Collaborator

@bmastbergen bmastbergen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥌

@gvrose8192
Copy link

Checks passed and approved - Rebasing and merging.

@gvrose8192 gvrose8192 merged commit 839c613 into ctrliq:ciqlts8_8-rt Feb 21, 2025
1 check passed
github-actions bot pushed a commit that referenced this pull request Jun 4, 2025
JIRA: https://issues.redhat.com/browse/RHEL-83595

commit 9730763
Author: Nilay Shroff <[email protected]>
Date:   Wed Mar 19 16:23:46 2025 +0530

    block: correct locking order for protecting blk-wbt parameters

    The commit '245618f8e45f ("block: protect wbt_lat_usec using q->
    elevator_lock")' introduced q->elevator_lock to protect updates
    to blk-wbt parameters when writing to the sysfs attribute wbt_
    lat_usec and the cgroup attribute io.cost.qos.  However, both
    these attributes also acquire q->rq_qos_mutex, leading to the
    following lockdep warning:

    ======================================================
    WARNING: possible circular locking dependency detected
    6.14.0-rc5+ #138 Not tainted
    ------------------------------------------------------
    bash/5902 is trying to acquire lock:
    c000000085d495a0 (&q->rq_qos_mutex){+.+.}-{4:4}, at: wbt_init+0x164/0x238

    but task is already holding lock:
    c000000085d498c8 (&q->elevator_lock){+.+.}-{4:4}, at: queue_wb_lat_store+0xb0/0x20c

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&q->elevator_lock){+.+.}-{4:4}:
            __mutex_lock+0xf0/0xa58
            ioc_qos_write+0x16c/0x85c
            cgroup_file_write+0xc4/0x32c
            kernfs_fop_write_iter+0x1b8/0x29c
            vfs_write+0x410/0x584
            ksys_write+0x84/0x140
            system_call_exception+0x134/0x360
            system_call_vectored_common+0x15c/0x2ec

    -> #0 (&q->rq_qos_mutex){+.+.}-{4:4}:
            __lock_acquire+0x1b6c/0x2ae0
            lock_acquire+0x140/0x430
            __mutex_lock+0xf0/0xa58
            wbt_init+0x164/0x238
            queue_wb_lat_store+0x1dc/0x20c
            queue_attr_store+0x12c/0x164
            sysfs_kf_write+0x6c/0xb0
            kernfs_fop_write_iter+0x1b8/0x29c
            vfs_write+0x410/0x584
            ksys_write+0x84/0x140
            system_call_exception+0x134/0x360
            system_call_vectored_common+0x15c/0x2ec

    other info that might help us debug this:

        Possible unsafe locking scenario:

            CPU0                    CPU1
            ----                    ----
        lock(&q->elevator_lock);
                                    lock(&q->rq_qos_mutex);
                                    lock(&q->elevator_lock);
        lock(&q->rq_qos_mutex);

        *** DEADLOCK ***

    6 locks held by bash/5902:
        #0: c000000051122400 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x84/0x140
        #1: c00000007383f088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x174/0x29c
        #2: c000000008550428 (kn->active#182){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x180/0x29c
        #3: c000000085d493a8 (&q->q_usage_counter(io)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
        #4: c000000085d493e0 (&q->q_usage_counter(queue)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
        #5: c000000085d498c8 (&q->elevator_lock){+.+.}-{4:4}, at: queue_wb_lat_store+0xb0/0x20c

    stack backtrace:
    CPU: 17 UID: 0 PID: 5902 Comm: bash Kdump: loaded Not tainted 6.14.0-rc5+ #138
    Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
    Call Trace:
    [c0000000721ef590] [c00000000118f8a8] dump_stack_lvl+0x108/0x18c (unreliable)
    [c0000000721ef5c0] [c00000000022563c] print_circular_bug+0x448/0x604
    [c0000000721ef670] [c000000000225a44] check_noncircular+0x24c/0x26c
    [c0000000721ef740] [c00000000022bf28] __lock_acquire+0x1b6c/0x2ae0
    [c0000000721ef870] [c000000000229240] lock_acquire+0x140/0x430
    [c0000000721ef970] [c0000000011cfbec] __mutex_lock+0xf0/0xa58
    [c0000000721efaa0] [c00000000096c46c] wbt_init+0x164/0x238
    [c0000000721efaf0] [c0000000008f8cd8] queue_wb_lat_store+0x1dc/0x20c
    [c0000000721efb50] [c0000000008f8fa0] queue_attr_store+0x12c/0x164
    [c0000000721efc60] [c0000000007c11cc] sysfs_kf_write+0x6c/0xb0
    [c0000000721efca0] [c0000000007bfa4c] kernfs_fop_write_iter+0x1b8/0x29c
    [c0000000721efcf0] [c0000000006a281c] vfs_write+0x410/0x584
    [c0000000721efdc0] [c0000000006a2cc8] ksys_write+0x84/0x140
    [c0000000721efe10] [c000000000031b64] system_call_exception+0x134/0x360
    [c0000000721efe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec

    >From the above log it's apparent that method which writes to sysfs attr
    wbt_lat_usec acquires q->elevator_lock first, and then acquires q->rq_
    qos_mutex. However the another method which writes to io.cost.qos,
    acquires q->rq_qos_mutex first, and then acquires q->rq_qos_mutex. So
    this could potentially cause the deadlock.

    A closer look at ioc_qos_write shows that correcting the lock order is
    non-trivial because q->rq_qos_mutex is acquired in blkg_conf_open_bdev
    and released in blkg_conf_exit. The function blkg_conf_open_bdev is
    responsible for parsing user input and finding the corresponding block
    device (bdev) from the user provided major:minor number.

    Since we do not know the bdev until blkg_conf_open_bdev completes, we
    cannot simply move q->elevator_lock acquisition before blkg_conf_open_
    bdev. So to address this, we intoduce new helpers blkg_conf_open_bdev_
    frozen and blkg_conf_exit_frozen which are just wrappers around blkg_
    conf_open_bdev and blkg_conf_exit respectively. The helper blkg_conf_
    open_bdev_frozen is similar to blkg_conf_open_bdev, but additionally
    freezes the queue, acquires q->elevator_lock and ensures the correct
    locking order is followed between q->elevator_lock and q->rq_qos_mutex.
    Similarly another helper blkg_conf_exit_frozen in addition to unfreezing
    the queue ensures that we release the locks in correct order.

    By using these helpers, now we maintain the same locking order in all
    code paths where we update blk-wbt parameters.

    Fixes: 245618f ("block: protect wbt_lat_usec using q->elevator_lock")
    Reported-by: kernel test robot <[email protected]>
    Closes: https://lore.kernel.org/oe-lkp/[email protected]
    Signed-off-by: Nilay Shroff <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Jens Axboe <[email protected]>

Signed-off-by: Ming Lei <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants