Skip to content

Conversation

@PlaidCat
Copy link
Collaborator

Bug fix requested by SIG-CLOUD user to address users needs with GPUs, CUDA and HMM.
This is the integration and temporary fix to allow the HMM kselftest to not live in an infinite loop.

Internal Ticket: SECO-170

BUILD

[maple@r9-sigcloud-builder kernel-src-tree]$ ../kernel-tools/kernel_build.sh
/mnt/code/kernel-src-tree
  CLEAN   arch/x86/boot/compressed
  CLEAN   arch/x86/boot

  CLEAN   include/config include/generated arch/x86/include/generated .config .config.old .version Module.symvers certs/signing_key.pem certs/signing_key.x509 certs/x509.genkey
[TIMER]{MRPROPER}: 7s
x86_64 architecture detected, copying config
'configs/kernel-x86_64-rhel.config' -> '.config'
Setting Local Version for build
CONFIG_LOCALVERSION="-jmaple_cuda_bug_fix"
Making olddefconfig
  HOSTCC  scripts/basic/fixdep
  HOSTCC  scripts/kconfig/conf.o
  HOSTCC  scripts/kconfig/confdata.o
  HOSTCC  scripts/kconfig/expr.o
  LEX     scripts/kconfig/lexer.lex.c
  YACC    scripts/kconfig/parser.tab.[ch]
  HOSTCC  scripts/kconfig/lexer.lex.o
  HOSTCC  scripts/kconfig/menu.o
  HOSTCC  scripts/kconfig/parser.tab.o
  HOSTCC  scripts/kconfig/preprocess.o
  HOSTCC  scripts/kconfig/symbol.o
  HOSTCC  scripts/kconfig/util.o
  HOSTLD  scripts/kconfig/conf
#
# configuration written to .config
#
Starting Build

  BTF [M] sound/xen/snd_xen_front.ko
  BTF [M] virt/lib/irqbypass.ko
[TIMER]{BUILD}: 1494s
Making Modules
  INSTALL /lib/modules/5.14.0-jmaple_cuda_bug_fix+/kernel/arch/x86/crypto/blake2s-x86_64.ko
  STRIP   /lib/modules/5.14.0-jmaple_cuda_bug_fix+/kernel/arch/x86/crypto/blake2s-x86_64.ko
  SIGN    /lib/modules/5.14.0-jmaple_cuda_bug_fix+/kernel/arch/x86/crypto/blake2s-x86_64.ko

  SIGN    /lib/modules/5.14.0-jmaple_cuda_bug_fix+/kernel/virt/lib/irqbypass.ko
  DEPMOD  /lib/modules/5.14.0-jmaple_cuda_bug_fix+
[TIMER]{MODULES}: 31s
Making Install
sh ./arch/x86/boot/install.sh 5.14.0-jmaple_cuda_bug_fix+ \
	arch/x86/boot/bzImage System.map "/boot"
[TIMER]{INSTALL}: 20s
Checking kABI
kABI check passed
Setting Default Kernel to /boot/vmlinuz-5.14.0-jmaple_cuda_bug_fix+ and Index to 4
The default is /boot/loader/entries/9d52c46c412f49e287269f012ed1d6ff-5.14.0-jmaple_cuda_bug_fix+.conf with index 4 and kernel /boot/vmlinuz-5.14.0-jmaple_cuda_bug_fix+
The default is /boot/loader/entries/9d52c46c412f49e287269f012ed1d6ff-5.14.0-jmaple_cuda_bug_fix+.conf with index 4 and kernel /boot/vmlinuz-5.14.0-jmaple_cuda_bug_fix+
Generating grub configuration file ...
Adding boot menu entry for UEFI Firmware Settings ...
done
Hopefully Grub2.0 took everything ... rebooting after time metrices
[TIMER]{MRPROPER}: 7s
[TIMER]{BUILD}: 1494s
[TIMER]{MODULES}: 31s
[TIMER]{INSTALL}: 20s
[TIMER]{TOTAL} 1556s

kABI Check

See in the build log

Checking kABI
kABI check passed

Kernel Self Tests

There have been issues with kselftests for Rocky9 so in this case we're only testing the HMM situation. Which also has its own problems in TEARDOWN on Failure.

5 Runs per pre and post update

hmm_update_wrong.txt is intentionally wrong to show that there is data that is different

[jmaple@devbox code]$ for i in $(seq 1 $(( ${#files[@]} - 1)) ); do echo ${files[i]}; diff ${files[0]} ${files[i]}; done
hmm_default_test2.txt
hmm_default_test3.txt
hmm_default_test4.txt
hmm_default_test5.txt
hmm_update_test1.txt
hmm_update_test2.txt
hmm_update_test3.txt
hmm_update_test4.txt
hmm_update_test5.txt
hmm_update_wrong.txt
240a241
> # Maple changed below here
242,245c243,246
< #            OK  hmm2.hmm2_device_coherent.double_map
< ok 56 hmm2.hmm2_device_coherent.double_map
< # FAILED: 53 / 56 tests passed.
< # Totals: pass:51 fail:3 xfail:0 xpass:0 skip:2 error:0
---
> #            FAIL  hmm2.hmm2_device_coherent.double_map
> fail 56 hmm2.hmm2_device_coherent.double_map
> # FAILED: 52 / 56 tests passed.
> # Totals: pass:50 fail:4 xfail:0 xpass:0 skip:2 error:0

Example

[maple@r9-sigcloud-builder mm]$ cat /mnt/code/hmm_default_test1.txt
running: bash
--------------------------------
running bash ./test_hmm.sh smoke
--------------------------------
Running smoke test. Note, this test provides basic coverage.
TAP version 13
1..56
# Starting 56 tests from 5 test cases.
#  RUN           hmm.hmm_device_private.open_close ...
#            OK  hmm.hmm_device_private.open_close
ok 1 hmm.hmm_device_private.open_close
#  RUN           hmm.hmm_device_private.anon_read ...
#            OK  hmm.hmm_device_private.anon_read
ok 2 hmm.hmm_device_private.anon_read
#  RUN           hmm.hmm_device_private.anon_read_prot ...
#            OK  hmm.hmm_device_private.anon_read_prot
ok 3 hmm.hmm_device_private.anon_read_prot
#  RUN           hmm.hmm_device_private.anon_write ...
#            OK  hmm.hmm_device_private.anon_write
ok 4 hmm.hmm_device_private.anon_write
#  RUN           hmm.hmm_device_private.anon_write_prot ...
#            OK  hmm.hmm_device_private.anon_write_prot
ok 5 hmm.hmm_device_private.anon_write_prot
#  RUN           hmm.hmm_device_private.anon_write_child ...
#            OK  hmm.hmm_device_private.anon_write_child
ok 6 hmm.hmm_device_private.anon_write_child
#  RUN           hmm.hmm_device_private.anon_write_child_shared ...
#            OK  hmm.hmm_device_private.anon_write_child_shared
ok 7 hmm.hmm_device_private.anon_write_child_shared
#  RUN           hmm.hmm_device_private.anon_write_huge ...
#            OK  hmm.hmm_device_private.anon_write_huge
ok 8 hmm.hmm_device_private.anon_write_huge
#  RUN           hmm.hmm_device_private.anon_write_hugetlbfs ...
#            OK  hmm.hmm_device_private.anon_write_hugetlbfs
ok 9 # SKIP Huge page could not be allocated
#  RUN           hmm.hmm_device_private.file_read ...
#            OK  hmm.hmm_device_private.file_read
ok 10 hmm.hmm_device_private.file_read
#  RUN           hmm.hmm_device_private.file_write ...
#            OK  hmm.hmm_device_private.file_write
ok 11 hmm.hmm_device_private.file_write
#  RUN           hmm.hmm_device_private.migrate ...
#            OK  hmm.hmm_device_private.migrate
ok 12 hmm.hmm_device_private.migrate
#  RUN           hmm.hmm_device_private.migrate_fault ...
#            OK  hmm.hmm_device_private.migrate_fault
ok 13 hmm.hmm_device_private.migrate_fault
#  RUN           hmm.hmm_device_private.migrate_release ...
#            OK  hmm.hmm_device_private.migrate_release
ok 14 hmm.hmm_device_private.migrate_release
#  RUN           hmm.hmm_device_private.migrate_shared ...
#            OK  hmm.hmm_device_private.migrate_shared
ok 15 hmm.hmm_device_private.migrate_shared
#  RUN           hmm.hmm_device_private.migrate_multiple ...
#            OK  hmm.hmm_device_private.migrate_multiple
ok 16 hmm.hmm_device_private.migrate_multiple
#  RUN           hmm.hmm_device_private.anon_read_multiple ...
#            OK  hmm.hmm_device_private.anon_read_multiple
ok 17 hmm.hmm_device_private.anon_read_multiple
#  RUN           hmm.hmm_device_private.anon_teardown ...
#            OK  hmm.hmm_device_private.anon_teardown
ok 18 hmm.hmm_device_private.anon_teardown
#  RUN           hmm.hmm_device_private.mixedmap ...
#            OK  hmm.hmm_device_private.mixedmap
ok 19 hmm.hmm_device_private.mixedmap
#  RUN           hmm.hmm_device_private.compound ...
#            OK  hmm.hmm_device_private.compound
ok 20 hmm.hmm_device_private.compound
#  RUN           hmm.hmm_device_private.exclusive ...
#          FAIL  hmm.hmm_device_private.exclusive
not ok 21 hmm.hmm_device_private.exclusive
#  RUN           hmm.hmm_device_private.exclusive_mprotect ...
#          FAIL  hmm.hmm_device_private.exclusive_mprotect
not ok 22 hmm.hmm_device_private.exclusive_mprotect
#  RUN           hmm.hmm_device_private.exclusive_cow ...
#          FAIL  hmm.hmm_device_private.exclusive_cow
not ok 23 hmm.hmm_device_private.exclusive_cow
#  RUN           hmm.hmm_device_private.hmm_gup_test ...
#            OK  hmm.hmm_device_private.hmm_gup_test
ok 24 # SKIP Skipping test, could not find gup_test driver
#  RUN           hmm.hmm_device_private.hmm_cow_in_device ...
#            OK  hmm.hmm_device_private.hmm_cow_in_device
ok 25 hmm.hmm_device_private.hmm_cow_in_device
#  RUN           hmm.hmm_device_coherent.open_close ...
#            OK  hmm.hmm_device_coherent.open_close
ok 26 hmm.hmm_device_coherent.open_close
#  RUN           hmm.hmm_device_coherent.anon_read ...
#            OK  hmm.hmm_device_coherent.anon_read
ok 27 hmm.hmm_device_coherent.anon_read
#  RUN           hmm.hmm_device_coherent.anon_read_prot ...
#            OK  hmm.hmm_device_coherent.anon_read_prot
ok 28 hmm.hmm_device_coherent.anon_read_prot
#  RUN           hmm.hmm_device_coherent.anon_write ...
#            OK  hmm.hmm_device_coherent.anon_write
ok 29 hmm.hmm_device_coherent.anon_write
#  RUN           hmm.hmm_device_coherent.anon_write_prot ...
#            OK  hmm.hmm_device_coherent.anon_write_prot
ok 30 hmm.hmm_device_coherent.anon_write_prot
#  RUN           hmm.hmm_device_coherent.anon_write_child ...
#            OK  hmm.hmm_device_coherent.anon_write_child
ok 31 hmm.hmm_device_coherent.anon_write_child
#  RUN           hmm.hmm_device_coherent.anon_write_child_shared ...
#            OK  hmm.hmm_device_coherent.anon_write_child_shared
ok 32 hmm.hmm_device_coherent.anon_write_child_shared
#  RUN           hmm.hmm_device_coherent.anon_write_huge ...
#            OK  hmm.hmm_device_coherent.anon_write_huge
ok 33 hmm.hmm_device_coherent.anon_write_huge
#  RUN           hmm.hmm_device_coherent.anon_write_hugetlbfs ...
#            OK  hmm.hmm_device_coherent.anon_write_hugetlbfs
ok 34 hmm.hmm_device_coherent.anon_write_hugetlbfs
#  RUN           hmm.hmm_device_coherent.file_read ...
#            OK  hmm.hmm_device_coherent.file_read
ok 35 hmm.hmm_device_coherent.file_read
#  RUN           hmm.hmm_device_coherent.file_write ...
#            OK  hmm.hmm_device_coherent.file_write
ok 36 hmm.hmm_device_coherent.file_write
#  RUN           hmm.hmm_device_coherent.migrate ...
#            OK  hmm.hmm_device_coherent.migrate
ok 37 hmm.hmm_device_coherent.migrate
#  RUN           hmm.hmm_device_coherent.migrate_fault ...
#            OK  hmm.hmm_device_coherent.migrate_fault
ok 38 hmm.hmm_device_coherent.migrate_fault
#  RUN           hmm.hmm_device_coherent.migrate_release ...
#            OK  hmm.hmm_device_coherent.migrate_release
ok 39 hmm.hmm_device_coherent.migrate_release
#  RUN           hmm.hmm_device_coherent.migrate_shared ...
#            OK  hmm.hmm_device_coherent.migrate_shared
ok 40 hmm.hmm_device_coherent.migrate_shared
#  RUN           hmm.hmm_device_coherent.migrate_multiple ...
#            OK  hmm.hmm_device_coherent.migrate_multiple
ok 41 hmm.hmm_device_coherent.migrate_multiple
#  RUN           hmm.hmm_device_coherent.anon_read_multiple ...
#            OK  hmm.hmm_device_coherent.anon_read_multiple
ok 42 hmm.hmm_device_coherent.anon_read_multiple
#  RUN           hmm.hmm_device_coherent.anon_teardown ...
#            OK  hmm.hmm_device_coherent.anon_teardown
ok 43 hmm.hmm_device_coherent.anon_teardown
#  RUN           hmm.hmm_device_coherent.mixedmap ...
#            OK  hmm.hmm_device_coherent.mixedmap
ok 44 hmm.hmm_device_coherent.mixedmap
#  RUN           hmm.hmm_device_coherent.compound ...
#            OK  hmm.hmm_device_coherent.compound
ok 45 hmm.hmm_device_coherent.compound
#  RUN           hmm.hmm_device_coherent.exclusive ...
#            OK  hmm.hmm_device_coherent.exclusive
ok 46 hmm.hmm_device_coherent.exclusive
#  RUN           hmm.hmm_device_coherent.exclusive_mprotect ...
#            OK  hmm.hmm_device_coherent.exclusive_mprotect
ok 47 hmm.hmm_device_coherent.exclusive_mprotect
#  RUN           hmm.hmm_device_coherent.exclusive_cow ...
#            OK  hmm.hmm_device_coherent.exclusive_cow
ok 48 hmm.hmm_device_coherent.exclusive_cow
#  RUN           hmm.hmm_device_coherent.hmm_gup_test ...
#            OK  hmm.hmm_device_coherent.hmm_gup_test
ok 49 hmm.hmm_device_coherent.hmm_gup_test
#  RUN           hmm.hmm_device_coherent.hmm_cow_in_device ...
#            OK  hmm.hmm_device_coherent.hmm_cow_in_device
ok 50 hmm.hmm_device_coherent.hmm_cow_in_device
#  RUN           hmm2.hmm2_device_private.migrate_mixed ...
#            OK  hmm2.hmm2_device_private.migrate_mixed
ok 51 hmm2.hmm2_device_private.migrate_mixed
#  RUN           hmm2.hmm2_device_private.snapshot ...
#            OK  hmm2.hmm2_device_private.snapshot
ok 52 hmm2.hmm2_device_private.snapshot
#  RUN           hmm2.hmm2_device_private.double_map ...
#            OK  hmm2.hmm2_device_private.double_map
ok 53 hmm2.hmm2_device_private.double_map
#  RUN           hmm2.hmm2_device_coherent.migrate_mixed ...
#            OK  hmm2.hmm2_device_coherent.migrate_mixed
ok 54 hmm2.hmm2_device_coherent.migrate_mixed
#  RUN           hmm2.hmm2_device_coherent.snapshot ...
#            OK  hmm2.hmm2_device_coherent.snapshot
ok 55 hmm2.hmm2_device_coherent.snapshot
#  RUN           hmm2.hmm2_device_coherent.double_map ...
#            OK  hmm2.hmm2_device_coherent.double_map
ok 56 hmm2.hmm2_device_coherent.double_map
# FAILED: 53 / 56 tests passed.
# Totals: pass:51 fail:3 xfail:0 xpass:0 skip:2 error:0
[PASS]
0
SUMMARY: PASS=1 SKIP=0 FAIL=0

jira SECO-170
bugfix cuda hangs
commit-author Thomas Gleixner <[email protected]>
commit ea72ce5
upstream-diff Missing CONFIG_KMSAN commits in
arch/x86/include/asm/pgtable_64_types.h

iounmap() on x86 occasionally fails to unmap because the provided valid
ioremap address is not below high_memory. It turned out that this
happens due to KASLR.

KASLR uses the full address space between PAGE_OFFSET and vaddr_end to
randomize the starting points of the direct map, vmalloc and vmemmap
regions.  It thereby limits the size of the direct map by using the
installed memory size plus an extra configurable margin for hot-plug
memory.  This limitation is done to gain more randomization space
because otherwise only the holes between the direct map, vmalloc,
vmemmap and vaddr_end would be usable for randomizing.

The limited direct map size is not exposed to the rest of the kernel, so
the memory hot-plug and resource management related code paths still
operate under the assumption that the available address space can be
determined with MAX_PHYSMEM_BITS.

request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1
downwards.  That means the first allocation happens past the end of the
direct map and if unlucky this address is in the vmalloc space, which
causes high_memory to become greater than VMALLOC_START and consequently
causes iounmap() to fail for valid ioremap addresses.

MAX_PHYSMEM_BITS cannot be changed for that because the randomization
does not align with address bit boundaries and there are other places
which actually require to know the maximum number of address bits.  All
remaining usage sites of MAX_PHYSMEM_BITS have been analyzed and found
to be correct.

Cure this by exposing the end of the direct map via PHYSMEM_END and use
that for the memory hot-plug and resource management related places
instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END
maps to a variable which is initialized by the KASLR initialization and
otherwise it is based on MAX_PHYSMEM_BITS as before.

To prevent future hickups add a check into add_pages() to catch callers
trying to add memory above PHYSMEM_END.

Fixes: 0483e1f ("x86/mm: Implement ASLR for kernel memory regions")
	Reported-by: Max Ramanouski <[email protected]>
	Reported-by: Alistair Popple <[email protected]>
	Signed-off-by: Thomas Gleixner <[email protected]>
Tested-By: Max Ramanouski <[email protected]>
	Tested-by: Alistair Popple <[email protected]>
	Reviewed-by: Dan Williams <[email protected]>
	Reviewed-by: Alistair Popple <[email protected]>
	Reviewed-by: Kees Cook <[email protected]>
	Cc: [email protected]
Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx
(cherry picked from commit ea72ce5)
	Signed-off-by: Jonathan Maple <[email protected]>
jira SECO-170

In Rocky9 if you run ./run_vmtests.sh -t hmm it will fail and cause an
infinite loop on ASSERTs in FIXTURE_TEARDOWN()
This temporary fix is based on the discussion here
https://patchwork.kernel.org/project/linux-kselftest/patch/[email protected]/#25046055

We will investigate further kselftest updates that will resolve the root
causes of this.
@PlaidCat PlaidCat requested a review from gvrose8192 October 23, 2024 20:51
Copy link

@gvrose8192 gvrose8192 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@PlaidCat PlaidCat merged commit 9b46c72 into sig-cloud-9/5.14.0-427.40.1.el9_4 Oct 23, 2024
@PlaidCat PlaidCat deleted the jmaple_r9-sigcloud-cuda_bug_fix branch October 23, 2024 21:05
@PlaidCat PlaidCat restored the jmaple_r9-sigcloud-cuda_bug_fix branch October 24, 2024 14:26
@PlaidCat PlaidCat deleted the jmaple_r9-sigcloud-cuda_bug_fix branch October 24, 2024 14:26
PlaidCat added a commit that referenced this pull request Nov 1, 2024
jira LE-2015
cve CVE-2024-40904
Rebuild_History Non-Buildable kernel-5.14.0-427.42.1.el9_4
commit-author Alan Stern <[email protected]>
commit 22f0081

The syzbot fuzzer found that the interrupt-URB completion callback in
the cdc-wdm driver was taking too long, and the driver's immediate
resubmission of interrupt URBs with -EPROTO status combined with the
dummy-hcd emulation to cause a CPU lockup:

cdc_wdm 1-1:1.0: nonzero urb status received: -71
cdc_wdm 1-1:1.0: wdm_int_callback - 0 bytes
watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [syz-executor782:6625]
CPU#0 Utilization every 4s during lockup:
	#1:  98% system,	  0% softirq,	  3% hardirq,	  0% idle
	#2:  98% system,	  0% softirq,	  3% hardirq,	  0% idle
	#3:  98% system,	  0% softirq,	  3% hardirq,	  0% idle
	#4:  98% system,	  0% softirq,	  3% hardirq,	  0% idle
	#5:  98% system,	  1% softirq,	  3% hardirq,	  0% idle
Modules linked in:
irq event stamp: 73096
hardirqs last  enabled at (73095): [<ffff80008037bc00>] console_emit_next_record kernel/printk/printk.c:2935 [inline]
hardirqs last  enabled at (73095): [<ffff80008037bc00>] console_flush_all+0x650/0xb74 kernel/printk/printk.c:2994
hardirqs last disabled at (73096): [<ffff80008af10b00>] __el1_irq arch/arm64/kernel/entry-common.c:533 [inline]
hardirqs last disabled at (73096): [<ffff80008af10b00>] el1_interrupt+0x24/0x68 arch/arm64/kernel/entry-common.c:551
softirqs last  enabled at (73048): [<ffff8000801ea530>] softirq_handle_end kernel/softirq.c:400 [inline]
softirqs last  enabled at (73048): [<ffff8000801ea530>] handle_softirqs+0xa60/0xc34 kernel/softirq.c:582
softirqs last disabled at (73043): [<ffff800080020de8>] __do_softirq+0x14/0x20 kernel/softirq.c:588
CPU: 0 PID: 6625 Comm: syz-executor782 Tainted: G        W          6.10.0-rc2-syzkaller-g8867bbd4a056 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024

Testing showed that the problem did not occur if the two error
messages -- the first two lines above -- were removed; apparently adding
material to the kernel log takes a surprisingly large amount of time.

In any case, the best approach for preventing these lockups and to
avoid spamming the log with thousands of error messages per second is
to ratelimit the two dev_err() calls.  Therefore we replace them with
dev_err_ratelimited().

	Signed-off-by: Alan Stern <[email protected]>
	Suggested-by: Greg KH <[email protected]>
Reported-and-tested-by: [email protected]
Closes: https://lore.kernel.org/linux-usb/[email protected]/
Reported-and-tested-by: [email protected]
Closes: https://lore.kernel.org/linux-usb/[email protected]/
Fixes: 9908a32 ("USB: remove err() macro from usb class drivers")
Link: https://lore.kernel.org/linux-usb/[email protected]/
	Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
	Signed-off-by: Greg Kroah-Hartman <[email protected]>
(cherry picked from commit 22f0081)
	Signed-off-by: Jonathan Maple <[email protected]>
PlaidCat added a commit that referenced this pull request Dec 17, 2024
jira LE-2169
cve CVE-2024-38540
Rebuild_History Non-Buildable kernel-4.18.0-553.27.1.el8_10
commit-author Michal Schmidt <[email protected]>
commit 78cfd17
Empty-Commit: Cherry-Pick Conflicts during history rebuild.
Will be included in final tarball splat. Ref for failed cherry-pick at:
ciq/ciq_backports/kernel-4.18.0-553.27.1.el8_10/78cfd171.failed

Undefined behavior is triggered when bnxt_qplib_alloc_init_hwq is called
with hwq_attr->aux_depth != 0 and hwq_attr->aux_stride == 0.
In that case, "roundup_pow_of_two(hwq_attr->aux_stride)" gets called.
roundup_pow_of_two is documented as undefined for 0.

Fix it in the one caller that had this combination.

The undefined behavior was detected by UBSAN:
  UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
  shift exponent 64 is too large for 64-bit type 'long unsigned int'
  CPU: 24 PID: 1075 Comm: (udev-worker) Not tainted 6.9.0-rc6+ #4
  Hardware name: Abacus electric, s.r.o. - [email protected] Super Server/H12SSW-iN, BIOS 2.7 10/25/2023
  Call Trace:
   <TASK>
   dump_stack_lvl+0x5d/0x80
   ubsan_epilogue+0x5/0x30
   __ubsan_handle_shift_out_of_bounds.cold+0x61/0xec
   __roundup_pow_of_two+0x25/0x35 [bnxt_re]
   bnxt_qplib_alloc_init_hwq+0xa1/0x470 [bnxt_re]
   bnxt_qplib_create_qp+0x19e/0x840 [bnxt_re]
   bnxt_re_create_qp+0x9b1/0xcd0 [bnxt_re]
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __kmalloc+0x1b6/0x4f0
   ? create_qp.part.0+0x128/0x1c0 [ib_core]
   ? __pfx_bnxt_re_create_qp+0x10/0x10 [bnxt_re]
   create_qp.part.0+0x128/0x1c0 [ib_core]
   ib_create_qp_kernel+0x50/0xd0 [ib_core]
   create_mad_qp+0x8e/0xe0 [ib_core]
   ? __pfx_qp_event_handler+0x10/0x10 [ib_core]
   ib_mad_init_device+0x2be/0x680 [ib_core]
   add_client_context+0x10d/0x1a0 [ib_core]
   enable_device_and_get+0xe0/0x1d0 [ib_core]
   ib_register_device+0x53c/0x630 [ib_core]
   ? srso_alias_return_thunk+0x5/0xfbef5
   bnxt_re_probe+0xbd8/0xe50 [bnxt_re]
   ? __pfx_bnxt_re_probe+0x10/0x10 [bnxt_re]
   auxiliary_bus_probe+0x49/0x80
   ? driver_sysfs_add+0x57/0xc0
   really_probe+0xde/0x340
   ? pm_runtime_barrier+0x54/0x90
   ? __pfx___driver_attach+0x10/0x10
   __driver_probe_device+0x78/0x110
   driver_probe_device+0x1f/0xa0
   __driver_attach+0xba/0x1c0
   bus_for_each_dev+0x8f/0xe0
   bus_add_driver+0x146/0x220
   driver_register+0x72/0xd0
   __auxiliary_driver_register+0x6e/0xd0
   ? __pfx_bnxt_re_mod_init+0x10/0x10 [bnxt_re]
   bnxt_re_mod_init+0x3e/0xff0 [bnxt_re]
   ? __pfx_bnxt_re_mod_init+0x10/0x10 [bnxt_re]
   do_one_initcall+0x5b/0x310
   do_init_module+0x90/0x250
   init_module_from_file+0x86/0xc0
   idempotent_init_module+0x121/0x2b0
   __x64_sys_finit_module+0x5e/0xb0
   do_syscall_64+0x82/0x160
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? syscall_exit_to_user_mode_prepare+0x149/0x170
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? syscall_exit_to_user_mode+0x75/0x230
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? do_syscall_64+0x8e/0x160
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __count_memcg_events+0x69/0x100
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? count_memcg_events.constprop.0+0x1a/0x30
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? handle_mm_fault+0x1f0/0x300
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? do_user_addr_fault+0x34e/0x640
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f4e5132821d
  Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e3 db 0c 00 f7 d8 64 89 01 48
  RSP: 002b:00007ffca9c906a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
  RAX: ffffffffffffffda RBX: 0000563ec8a8f130 RCX: 00007f4e5132821d
  RDX: 0000000000000000 RSI: 00007f4e518fa07d RDI: 000000000000003b
  RBP: 00007ffca9c90760 R08: 00007f4e513f6b20 R09: 00007ffca9c906f0
  R10: 0000563ec8a8faa0 R11: 0000000000000246 R12: 00007f4e518fa07d
  R13: 0000000000020000 R14: 0000563ec8409e90 R15: 0000563ec8a8fa60
   </TASK>
  ---[ end trace ]---

Fixes: 0c4dcd6 ("RDMA/bnxt_re: Refactor hardware queue memory allocation")
	Signed-off-by: Michal Schmidt <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
	Acked-by: Selvin Xavier <[email protected]>
	Signed-off-by: Leon Romanovsky <[email protected]>
(cherry picked from commit 78cfd17)
	Signed-off-by: Jonathan Maple <[email protected]>

# Conflicts:
#	drivers/infiniband/hw/bnxt_re/qplib_fp.c
PlaidCat pushed a commit that referenced this pull request Dec 19, 2024
Its used from trace__run(), for the 'perf trace' live mode, i.e. its
strace-like, non-perf.data file processing mode, the most common one.

The trace__run() function will set trace->host using machine__new_host()
that is supposed to give a machine instance representing the running
machine, and since we'll use perf_env__arch_strerrno() to get the right
errno -> string table, we need to use machine->env, so initialize it in
machine__new_host().

Before the patch:

  (gdb) run trace --errno-summary -a sleep 1
  <SNIP>
   Summary of events:

   gvfs-afc-volume (3187), 2 events, 0.0%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     pselect6               1      0     0.000     0.000     0.000     0.000      0.00%

   GUsbEventThread (3519), 2 events, 0.0%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     poll                   1      0     0.000     0.000     0.000     0.000      0.00%
  <SNIP>
  Program received signal SIGSEGV, Segmentation fault.
  0x00000000005caba0 in perf_env__arch_strerrno (env=0x0, err=110) at util/env.c:478
  478		if (env->arch_strerrno == NULL)
  (gdb) bt
  #0  0x00000000005caba0 in perf_env__arch_strerrno (env=0x0, err=110) at util/env.c:478
  #1  0x00000000004b75d2 in thread__dump_stats (ttrace=0x14f58f0, trace=0x7fffffffa5b0, fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>) at builtin-trace.c:4673
  #2  0x00000000004b78bf in trace__fprintf_thread (fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>, thread=0x10fa0b0, trace=0x7fffffffa5b0) at builtin-trace.c:4708
  #3  0x00000000004b7ad9 in trace__fprintf_thread_summary (trace=0x7fffffffa5b0, fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>) at builtin-trace.c:4747
  #4  0x00000000004b656e in trace__run (trace=0x7fffffffa5b0, argc=2, argv=0x7fffffffde60) at builtin-trace.c:4456
  #5  0x00000000004ba43e in cmd_trace (argc=2, argv=0x7fffffffde60) at builtin-trace.c:5487
  #6  0x00000000004c0414 in run_builtin (p=0xec3068 <commands+648>, argc=5, argv=0x7fffffffde60) at perf.c:351
  #7  0x00000000004c06bb in handle_internal_command (argc=5, argv=0x7fffffffde60) at perf.c:404
  #8  0x00000000004c0814 in run_argv (argcp=0x7fffffffdc4c, argv=0x7fffffffdc40) at perf.c:448
  #9  0x00000000004c0b5d in main (argc=5, argv=0x7fffffffde60) at perf.c:560
  (gdb)

After:

  root@number:~# perf trace -a --errno-summary sleep 1
  <SNIP>
     pw-data-loop (2685), 1410 events, 16.0%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     epoll_wait           188      0   983.428     0.000     5.231    15.595      8.68%
     ioctl                 94      0     0.811     0.004     0.009     0.016      2.82%
     read                 188      0     0.322     0.001     0.002     0.006      5.15%
     write                141      0     0.280     0.001     0.002     0.018      8.39%
     timerfd_settime       94      0     0.138     0.001     0.001     0.007      6.47%

   gnome-control-c (179406), 1848 events, 20.9%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     poll                 222      0   959.577     0.000     4.322    21.414     11.40%
     recvmsg              150      0     0.539     0.001     0.004     0.013      5.12%
     write                300      0     0.442     0.001     0.001     0.007      3.29%
     read                 150      0     0.183     0.001     0.001     0.009      5.53%
     getpid               102      0     0.101     0.000     0.001     0.008      7.82%

  root@number:~#

Fixes: 54373b5 ("perf env: Introduce perf_env__arch_strerrno()")
Reported-by: Veronika Molnarova <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Acked-by: Veronika Molnarova <[email protected]>
Acked-by: Michael Petlan <[email protected]>
Tested-by: Michael Petlan <[email protected]>
Link: https://lore.kernel.org/r/Z0XffUgNSv_9OjOi@x1
Signed-off-by: Namhyung Kim <[email protected]>
PlaidCat pushed a commit that referenced this pull request Dec 19, 2024
…s_lock

For storing a value to a queue attribute, the queue_attr_store function
first freezes the queue (->q_usage_counter(io)) and then acquire
->sysfs_lock. This seems not correct as the usual ordering should be to
acquire ->sysfs_lock before freezing the queue. This incorrect ordering
causes the following lockdep splat which we are able to reproduce always
simply by accessing /sys/kernel/debug file using ls command:

[   57.597146] WARNING: possible circular locking dependency detected
[   57.597154] 6.12.0-10553-gb86545e02e8c #20 Tainted: G        W
[   57.597162] ------------------------------------------------------
[   57.597168] ls/4605 is trying to acquire lock:
[   57.597176] c00000003eb56710 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x58/0xc0
[   57.597200]
               but task is already holding lock:
[   57.597207] c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4
[   57.597226]
               which lock already depends on the new lock.

[   57.597233]
               the existing dependency chain (in reverse order) is:
[   57.597241]
               -> #5 (&sb->s_type->i_mutex_key#3){++++}-{4:4}:
[   57.597255]        down_write+0x6c/0x18c
[   57.597264]        start_creating+0xb4/0x24c
[   57.597274]        debugfs_create_dir+0x2c/0x1e8
[   57.597283]        blk_register_queue+0xec/0x294
[   57.597292]        add_disk_fwnode+0x2e4/0x548
[   57.597302]        brd_alloc+0x2c8/0x338
[   57.597309]        brd_init+0x100/0x178
[   57.597317]        do_one_initcall+0x88/0x3e4
[   57.597326]        kernel_init_freeable+0x3cc/0x6e0
[   57.597334]        kernel_init+0x34/0x1cc
[   57.597342]        ret_from_kernel_user_thread+0x14/0x1c
[   57.597350]
               -> #4 (&q->debugfs_mutex){+.+.}-{4:4}:
[   57.597362]        __mutex_lock+0xfc/0x12a0
[   57.597370]        blk_register_queue+0xd4/0x294
[   57.597379]        add_disk_fwnode+0x2e4/0x548
[   57.597388]        brd_alloc+0x2c8/0x338
[   57.597395]        brd_init+0x100/0x178
[   57.597402]        do_one_initcall+0x88/0x3e4
[   57.597410]        kernel_init_freeable+0x3cc/0x6e0
[   57.597418]        kernel_init+0x34/0x1cc
[   57.597426]        ret_from_kernel_user_thread+0x14/0x1c
[   57.597434]
               -> #3 (&q->sysfs_lock){+.+.}-{4:4}:
[   57.597446]        __mutex_lock+0xfc/0x12a0
[   57.597454]        queue_attr_store+0x9c/0x110
[   57.597462]        sysfs_kf_write+0x70/0xb0
[   57.597471]        kernfs_fop_write_iter+0x1b0/0x2ac
[   57.597480]        vfs_write+0x3dc/0x6e8
[   57.597488]        ksys_write+0x84/0x140
[   57.597495]        system_call_exception+0x130/0x360
[   57.597504]        system_call_common+0x160/0x2c4
[   57.597516]
               -> #2 (&q->q_usage_counter(io)#21){++++}-{0:0}:
[   57.597530]        __submit_bio+0x5ec/0x828
[   57.597538]        submit_bio_noacct_nocheck+0x1e4/0x4f0
[   57.597547]        iomap_readahead+0x2a0/0x448
[   57.597556]        xfs_vm_readahead+0x28/0x3c
[   57.597564]        read_pages+0x88/0x41c
[   57.597571]        page_cache_ra_unbounded+0x1ac/0x2d8
[   57.597580]        filemap_get_pages+0x188/0x984
[   57.597588]        filemap_read+0x13c/0x4bc
[   57.597596]        xfs_file_buffered_read+0x88/0x17c
[   57.597605]        xfs_file_read_iter+0xac/0x158
[   57.597614]        vfs_read+0x2d4/0x3b4
[   57.597622]        ksys_read+0x84/0x144
[   57.597629]        system_call_exception+0x130/0x360
[   57.597637]        system_call_common+0x160/0x2c4
[   57.597647]
               -> #1 (mapping.invalidate_lock#2){++++}-{4:4}:
[   57.597661]        down_read+0x6c/0x220
[   57.597669]        filemap_fault+0x870/0x100c
[   57.597677]        xfs_filemap_fault+0xc4/0x18c
[   57.597684]        __do_fault+0x64/0x164
[   57.597693]        __handle_mm_fault+0x1274/0x1dac
[   57.597702]        handle_mm_fault+0x248/0x484
[   57.597711]        ___do_page_fault+0x428/0xc0c
[   57.597719]        hash__do_page_fault+0x30/0x68
[   57.597727]        do_hash_fault+0x90/0x35c
[   57.597736]        data_access_common_virt+0x210/0x220
[   57.597745]        _copy_from_user+0xf8/0x19c
[   57.597754]        sel_write_load+0x178/0xd54
[   57.597762]        vfs_write+0x108/0x6e8
[   57.597769]        ksys_write+0x84/0x140
[   57.597777]        system_call_exception+0x130/0x360
[   57.597785]        system_call_common+0x160/0x2c4
[   57.597794]
               -> #0 (&mm->mmap_lock){++++}-{4:4}:
[   57.597806]        __lock_acquire+0x17cc/0x2330
[   57.597814]        lock_acquire+0x138/0x400
[   57.597822]        __might_fault+0x7c/0xc0
[   57.597830]        filldir64+0xe8/0x390
[   57.597839]        dcache_readdir+0x80/0x2d4
[   57.597846]        iterate_dir+0xd8/0x1d4
[   57.597855]        sys_getdents64+0x88/0x2d4
[   57.597864]        system_call_exception+0x130/0x360
[   57.597872]        system_call_common+0x160/0x2c4
[   57.597881]
               other info that might help us debug this:

[   57.597888] Chain exists of:
                 &mm->mmap_lock --> &q->debugfs_mutex --> &sb->s_type->i_mutex_key#3

[   57.597905]  Possible unsafe locking scenario:

[   57.597911]        CPU0                    CPU1
[   57.597917]        ----                    ----
[   57.597922]   rlock(&sb->s_type->i_mutex_key#3);
[   57.597932]                                lock(&q->debugfs_mutex);
[   57.597940]                                lock(&sb->s_type->i_mutex_key#3);
[   57.597950]   rlock(&mm->mmap_lock);
[   57.597958]
                *** DEADLOCK ***

[   57.597965] 2 locks held by ls/4605:
[   57.597971]  #0: c0000000137c12f8 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0xcc/0x154
[   57.597989]  #1: c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4

Prevent the above lockdep warning by acquiring ->sysfs_lock before
freezing the queue while storing a queue attribute in queue_attr_store
function. Later, we also found[1] another function __blk_mq_update_nr_
hw_queues where we first freeze queue and then acquire the ->sysfs_lock.
So we've also updated lock ordering in __blk_mq_update_nr_hw_queues
function and ensured that in all code paths we follow the correct lock
ordering i.e. acquire ->sysfs_lock before freezing the queue.

[1] https://lore.kernel.org/all/CAFj5m9Ke8+EHKQBs_Nk6hqd=LGXtk4mUxZUN5==ZcCjnZSBwHw@mail.gmail.com/

Reported-by: [email protected]
Fixes: af28141 ("block: freeze the queue in queue_attr_store")
Tested-by: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Nilay Shroff <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
gvrose8192 added a commit that referenced this pull request Jan 13, 2025
jira VULN-6730
cve CVE-2023-4921
commit-author valis <[email protected]>
commit 8fc134f

When the plug qdisc is used as a class of the qfq qdisc it could trigger a
UAF. This issue can be reproduced with following commands:

  tc qdisc add dev lo root handle 1: qfq
  tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
  tc qdisc add dev lo parent 1:1 handle 2: plug
  tc filter add dev lo parent 1: basic classid 1:1
  ping -c1 127.0.0.1

and boom:

[  285.353793] BUG: KASAN: slab-use-after-free in qfq_dequeue+0xa7/0x7f0
[  285.354910] Read of size 4 at addr ffff8880bad312a8 by task ping/144
[  285.355903]
[  285.356165] CPU: 1 PID: 144 Comm: ping Not tainted 6.5.0-rc3+ #4
[  285.357112] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[  285.358376] Call Trace:
[  285.358773]  <IRQ>
[  285.359109]  dump_stack_lvl+0x44/0x60
[  285.359708]  print_address_description.constprop.0+0x2c/0x3c0
[  285.360611]  kasan_report+0x10c/0x120
[  285.361195]  ? qfq_dequeue+0xa7/0x7f0
[  285.361780]  qfq_dequeue+0xa7/0x7f0
[  285.362342]  __qdisc_run+0xf1/0x970
[  285.362903]  net_tx_action+0x28e/0x460
[  285.363502]  __do_softirq+0x11b/0x3de
[  285.364097]  do_softirq.part.0+0x72/0x90
[  285.364721]  </IRQ>
[  285.365072]  <TASK>
[  285.365422]  __local_bh_enable_ip+0x77/0x90
[  285.366079]  __dev_queue_xmit+0x95f/0x1550
[  285.366732]  ? __pfx_csum_and_copy_from_iter+0x10/0x10
[  285.367526]  ? __pfx___dev_queue_xmit+0x10/0x10
[  285.368259]  ? __build_skb_around+0x129/0x190
[  285.368960]  ? ip_generic_getfrag+0x12c/0x170
[  285.369653]  ? __pfx_ip_generic_getfrag+0x10/0x10
[  285.370390]  ? csum_partial+0x8/0x20
[  285.370961]  ? raw_getfrag+0xe5/0x140
[  285.371559]  ip_finish_output2+0x539/0xa40
[  285.372222]  ? __pfx_ip_finish_output2+0x10/0x10
[  285.372954]  ip_output+0x113/0x1e0
[  285.373512]  ? __pfx_ip_output+0x10/0x10
[  285.374130]  ? icmp_out_count+0x49/0x60
[  285.374739]  ? __pfx_ip_finish_output+0x10/0x10
[  285.375457]  ip_push_pending_frames+0xf3/0x100
[  285.376173]  raw_sendmsg+0xef5/0x12d0
[  285.376760]  ? do_syscall_64+0x40/0x90
[  285.377359]  ? __static_call_text_end+0x136578/0x136578
[  285.378173]  ? do_syscall_64+0x40/0x90
[  285.378772]  ? kasan_enable_current+0x11/0x20
[  285.379469]  ? __pfx_raw_sendmsg+0x10/0x10
[  285.380137]  ? __sock_create+0x13e/0x270
[  285.380673]  ? __sys_socket+0xf3/0x180
[  285.381174]  ? __x64_sys_socket+0x3d/0x50
[  285.381725]  ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.382425]  ? __rcu_read_unlock+0x48/0x70
[  285.382975]  ? ip4_datagram_release_cb+0xd8/0x380
[  285.383608]  ? __pfx_ip4_datagram_release_cb+0x10/0x10
[  285.384295]  ? preempt_count_sub+0x14/0xc0
[  285.384844]  ? __list_del_entry_valid+0x76/0x140
[  285.385467]  ? _raw_spin_lock_bh+0x87/0xe0
[  285.386014]  ? __pfx__raw_spin_lock_bh+0x10/0x10
[  285.386645]  ? release_sock+0xa0/0xd0
[  285.387148]  ? preempt_count_sub+0x14/0xc0
[  285.387712]  ? freeze_secondary_cpus+0x348/0x3c0
[  285.388341]  ? aa_sk_perm+0x177/0x390
[  285.388856]  ? __pfx_aa_sk_perm+0x10/0x10
[  285.389441]  ? check_stack_object+0x22/0x70
[  285.390032]  ? inet_send_prepare+0x2f/0x120
[  285.390603]  ? __pfx_inet_sendmsg+0x10/0x10
[  285.391172]  sock_sendmsg+0xcc/0xe0
[  285.391667]  __sys_sendto+0x190/0x230
[  285.392168]  ? __pfx___sys_sendto+0x10/0x10
[  285.392727]  ? kvm_clock_get_cycles+0x14/0x30
[  285.393328]  ? set_normalized_timespec64+0x57/0x70
[  285.393980]  ? _raw_spin_unlock_irq+0x1b/0x40
[  285.394578]  ? __x64_sys_clock_gettime+0x11c/0x160
[  285.395225]  ? __pfx___x64_sys_clock_gettime+0x10/0x10
[  285.395908]  ? _copy_to_user+0x3e/0x60
[  285.396432]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.397086]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.397734]  ? do_syscall_64+0x71/0x90
[  285.398258]  __x64_sys_sendto+0x74/0x90
[  285.398786]  do_syscall_64+0x64/0x90
[  285.399273]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.399949]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.400605]  ? do_syscall_64+0x71/0x90
[  285.401124]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.401807] RIP: 0033:0x495726
[  285.402233] Code: ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 09
[  285.404683] RSP: 002b:00007ffcc25fb618 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[  285.405677] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 0000000000495726
[  285.406628] RDX: 0000000000000040 RSI: 0000000002518750 RDI: 0000000000000000
[  285.407565] RBP: 00000000005205ef R08: 00000000005f8838 R09: 000000000000001c
[  285.408523] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000002517634
[  285.409460] R13: 00007ffcc25fb6f0 R14: 0000000000000003 R15: 0000000000000000
[  285.410403]  </TASK>
[  285.410704]
[  285.410929] Allocated by task 144:
[  285.411402]  kasan_save_stack+0x1e/0x40
[  285.411926]  kasan_set_track+0x21/0x30
[  285.412442]  __kasan_slab_alloc+0x55/0x70
[  285.412973]  kmem_cache_alloc_node+0x187/0x3d0
[  285.413567]  __alloc_skb+0x1b4/0x230
[  285.414060]  __ip_append_data+0x17f7/0x1b60
[  285.414633]  ip_append_data+0x97/0xf0
[  285.415144]  raw_sendmsg+0x5a8/0x12d0
[  285.415640]  sock_sendmsg+0xcc/0xe0
[  285.416117]  __sys_sendto+0x190/0x230
[  285.416626]  __x64_sys_sendto+0x74/0x90
[  285.417145]  do_syscall_64+0x64/0x90
[  285.417624]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.418306]
[  285.418531] Freed by task 144:
[  285.418960]  kasan_save_stack+0x1e/0x40
[  285.419469]  kasan_set_track+0x21/0x30
[  285.419988]  kasan_save_free_info+0x27/0x40
[  285.420556]  ____kasan_slab_free+0x109/0x1a0
[  285.421146]  kmem_cache_free+0x1c2/0x450
[  285.421680]  __netif_receive_skb_core+0x2ce/0x1870
[  285.422333]  __netif_receive_skb_one_core+0x97/0x140
[  285.423003]  process_backlog+0x100/0x2f0
[  285.423537]  __napi_poll+0x5c/0x2d0
[  285.424023]  net_rx_action+0x2be/0x560
[  285.424510]  __do_softirq+0x11b/0x3de
[  285.425034]
[  285.425254] The buggy address belongs to the object at ffff8880bad31280
[  285.425254]  which belongs to the cache skbuff_head_cache of size 224
[  285.426993] The buggy address is located 40 bytes inside of
[  285.426993]  freed 224-byte region [ffff8880bad31280, ffff8880bad31360)
[  285.428572]
[  285.428798] The buggy address belongs to the physical page:
[  285.429540] page:00000000f4b77674 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xbad31
[  285.430758] flags: 0x100000000000200(slab|node=0|zone=1)
[  285.431447] page_type: 0xffffffff()
[  285.431934] raw: 0100000000000200 ffff88810094a8c0 dead000000000122 0000000000000000
[  285.432757] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
[  285.433562] page dumped because: kasan: bad access detected
[  285.434144]
[  285.434320] Memory state around the buggy address:
[  285.434828]  ffff8880bad31180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.435580]  ffff8880bad31200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.436264] >ffff8880bad31280: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  285.436777]                                   ^
[  285.437106]  ffff8880bad31300: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[  285.437616]  ffff8880bad31380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.438126] ==================================================================
[  285.438662] Disabling lock debugging due to kernel taint

Fix this by:
1. Changing sch_plug's .peek handler to qdisc_peek_dequeued(), a
function compatible with non-work-conserving qdiscs
2. Checking the return value of qdisc_dequeue_peeked() in sch_qfq.

Fixes: 462dbc9 ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
	Reported-by: valis <[email protected]>
	Signed-off-by: valis <[email protected]>
	Signed-off-by: Jamal Hadi Salim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
	Signed-off-by: Paolo Abeni <[email protected]>
(cherry picked from commit 8fc134f)
	Signed-off-by: Greg Rose <[email protected]>
gvrose8192 added a commit that referenced this pull request Jan 13, 2025
jira VULN-6730
cve CVE-2023-4921
commit-author valis <[email protected]>
commit 8fc134f

When the plug qdisc is used as a class of the qfq qdisc it could trigger a
UAF. This issue can be reproduced with following commands:

  tc qdisc add dev lo root handle 1: qfq
  tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
  tc qdisc add dev lo parent 1:1 handle 2: plug
  tc filter add dev lo parent 1: basic classid 1:1
  ping -c1 127.0.0.1

and boom:

[  285.353793] BUG: KASAN: slab-use-after-free in qfq_dequeue+0xa7/0x7f0
[  285.354910] Read of size 4 at addr ffff8880bad312a8 by task ping/144
[  285.355903]
[  285.356165] CPU: 1 PID: 144 Comm: ping Not tainted 6.5.0-rc3+ #4
[  285.357112] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[  285.358376] Call Trace:
[  285.358773]  <IRQ>
[  285.359109]  dump_stack_lvl+0x44/0x60
[  285.359708]  print_address_description.constprop.0+0x2c/0x3c0
[  285.360611]  kasan_report+0x10c/0x120
[  285.361195]  ? qfq_dequeue+0xa7/0x7f0
[  285.361780]  qfq_dequeue+0xa7/0x7f0
[  285.362342]  __qdisc_run+0xf1/0x970
[  285.362903]  net_tx_action+0x28e/0x460
[  285.363502]  __do_softirq+0x11b/0x3de
[  285.364097]  do_softirq.part.0+0x72/0x90
[  285.364721]  </IRQ>
[  285.365072]  <TASK>
[  285.365422]  __local_bh_enable_ip+0x77/0x90
[  285.366079]  __dev_queue_xmit+0x95f/0x1550
[  285.366732]  ? __pfx_csum_and_copy_from_iter+0x10/0x10
[  285.367526]  ? __pfx___dev_queue_xmit+0x10/0x10
[  285.368259]  ? __build_skb_around+0x129/0x190
[  285.368960]  ? ip_generic_getfrag+0x12c/0x170
[  285.369653]  ? __pfx_ip_generic_getfrag+0x10/0x10
[  285.370390]  ? csum_partial+0x8/0x20
[  285.370961]  ? raw_getfrag+0xe5/0x140
[  285.371559]  ip_finish_output2+0x539/0xa40
[  285.372222]  ? __pfx_ip_finish_output2+0x10/0x10
[  285.372954]  ip_output+0x113/0x1e0
[  285.373512]  ? __pfx_ip_output+0x10/0x10
[  285.374130]  ? icmp_out_count+0x49/0x60
[  285.374739]  ? __pfx_ip_finish_output+0x10/0x10
[  285.375457]  ip_push_pending_frames+0xf3/0x100
[  285.376173]  raw_sendmsg+0xef5/0x12d0
[  285.376760]  ? do_syscall_64+0x40/0x90
[  285.377359]  ? __static_call_text_end+0x136578/0x136578
[  285.378173]  ? do_syscall_64+0x40/0x90
[  285.378772]  ? kasan_enable_current+0x11/0x20
[  285.379469]  ? __pfx_raw_sendmsg+0x10/0x10
[  285.380137]  ? __sock_create+0x13e/0x270
[  285.380673]  ? __sys_socket+0xf3/0x180
[  285.381174]  ? __x64_sys_socket+0x3d/0x50
[  285.381725]  ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.382425]  ? __rcu_read_unlock+0x48/0x70
[  285.382975]  ? ip4_datagram_release_cb+0xd8/0x380
[  285.383608]  ? __pfx_ip4_datagram_release_cb+0x10/0x10
[  285.384295]  ? preempt_count_sub+0x14/0xc0
[  285.384844]  ? __list_del_entry_valid+0x76/0x140
[  285.385467]  ? _raw_spin_lock_bh+0x87/0xe0
[  285.386014]  ? __pfx__raw_spin_lock_bh+0x10/0x10
[  285.386645]  ? release_sock+0xa0/0xd0
[  285.387148]  ? preempt_count_sub+0x14/0xc0
[  285.387712]  ? freeze_secondary_cpus+0x348/0x3c0
[  285.388341]  ? aa_sk_perm+0x177/0x390
[  285.388856]  ? __pfx_aa_sk_perm+0x10/0x10
[  285.389441]  ? check_stack_object+0x22/0x70
[  285.390032]  ? inet_send_prepare+0x2f/0x120
[  285.390603]  ? __pfx_inet_sendmsg+0x10/0x10
[  285.391172]  sock_sendmsg+0xcc/0xe0
[  285.391667]  __sys_sendto+0x190/0x230
[  285.392168]  ? __pfx___sys_sendto+0x10/0x10
[  285.392727]  ? kvm_clock_get_cycles+0x14/0x30
[  285.393328]  ? set_normalized_timespec64+0x57/0x70
[  285.393980]  ? _raw_spin_unlock_irq+0x1b/0x40
[  285.394578]  ? __x64_sys_clock_gettime+0x11c/0x160
[  285.395225]  ? __pfx___x64_sys_clock_gettime+0x10/0x10
[  285.395908]  ? _copy_to_user+0x3e/0x60
[  285.396432]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.397086]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.397734]  ? do_syscall_64+0x71/0x90
[  285.398258]  __x64_sys_sendto+0x74/0x90
[  285.398786]  do_syscall_64+0x64/0x90
[  285.399273]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.399949]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.400605]  ? do_syscall_64+0x71/0x90
[  285.401124]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.401807] RIP: 0033:0x495726
[  285.402233] Code: ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 09
[  285.404683] RSP: 002b:00007ffcc25fb618 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[  285.405677] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 0000000000495726
[  285.406628] RDX: 0000000000000040 RSI: 0000000002518750 RDI: 0000000000000000
[  285.407565] RBP: 00000000005205ef R08: 00000000005f8838 R09: 000000000000001c
[  285.408523] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000002517634
[  285.409460] R13: 00007ffcc25fb6f0 R14: 0000000000000003 R15: 0000000000000000
[  285.410403]  </TASK>
[  285.410704]
[  285.410929] Allocated by task 144:
[  285.411402]  kasan_save_stack+0x1e/0x40
[  285.411926]  kasan_set_track+0x21/0x30
[  285.412442]  __kasan_slab_alloc+0x55/0x70
[  285.412973]  kmem_cache_alloc_node+0x187/0x3d0
[  285.413567]  __alloc_skb+0x1b4/0x230
[  285.414060]  __ip_append_data+0x17f7/0x1b60
[  285.414633]  ip_append_data+0x97/0xf0
[  285.415144]  raw_sendmsg+0x5a8/0x12d0
[  285.415640]  sock_sendmsg+0xcc/0xe0
[  285.416117]  __sys_sendto+0x190/0x230
[  285.416626]  __x64_sys_sendto+0x74/0x90
[  285.417145]  do_syscall_64+0x64/0x90
[  285.417624]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.418306]
[  285.418531] Freed by task 144:
[  285.418960]  kasan_save_stack+0x1e/0x40
[  285.419469]  kasan_set_track+0x21/0x30
[  285.419988]  kasan_save_free_info+0x27/0x40
[  285.420556]  ____kasan_slab_free+0x109/0x1a0
[  285.421146]  kmem_cache_free+0x1c2/0x450
[  285.421680]  __netif_receive_skb_core+0x2ce/0x1870
[  285.422333]  __netif_receive_skb_one_core+0x97/0x140
[  285.423003]  process_backlog+0x100/0x2f0
[  285.423537]  __napi_poll+0x5c/0x2d0
[  285.424023]  net_rx_action+0x2be/0x560
[  285.424510]  __do_softirq+0x11b/0x3de
[  285.425034]
[  285.425254] The buggy address belongs to the object at ffff8880bad31280
[  285.425254]  which belongs to the cache skbuff_head_cache of size 224
[  285.426993] The buggy address is located 40 bytes inside of
[  285.426993]  freed 224-byte region [ffff8880bad31280, ffff8880bad31360)
[  285.428572]
[  285.428798] The buggy address belongs to the physical page:
[  285.429540] page:00000000f4b77674 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xbad31
[  285.430758] flags: 0x100000000000200(slab|node=0|zone=1)
[  285.431447] page_type: 0xffffffff()
[  285.431934] raw: 0100000000000200 ffff88810094a8c0 dead000000000122 0000000000000000
[  285.432757] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
[  285.433562] page dumped because: kasan: bad access detected
[  285.434144]
[  285.434320] Memory state around the buggy address:
[  285.434828]  ffff8880bad31180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.435580]  ffff8880bad31200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.436264] >ffff8880bad31280: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  285.436777]                                   ^
[  285.437106]  ffff8880bad31300: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[  285.437616]  ffff8880bad31380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.438126] ==================================================================
[  285.438662] Disabling lock debugging due to kernel taint

Fix this by:
1. Changing sch_plug's .peek handler to qdisc_peek_dequeued(), a
function compatible with non-work-conserving qdiscs
2. Checking the return value of qdisc_dequeue_peeked() in sch_qfq.

Fixes: 462dbc9 ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
	Reported-by: valis <[email protected]>
	Signed-off-by: valis <[email protected]>
	Signed-off-by: Jamal Hadi Salim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
	Signed-off-by: Paolo Abeni <[email protected]>
(cherry picked from commit 8fc134f)
	Signed-off-by: Greg Rose <[email protected]>
gvrose8192 added a commit that referenced this pull request Jan 14, 2025
jira VULN-6730
cve CVE-2023-4921
commit-author valis <[email protected]>
commit 8fc134f

When the plug qdisc is used as a class of the qfq qdisc it could trigger a
UAF. This issue can be reproduced with following commands:

  tc qdisc add dev lo root handle 1: qfq
  tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
  tc qdisc add dev lo parent 1:1 handle 2: plug
  tc filter add dev lo parent 1: basic classid 1:1
  ping -c1 127.0.0.1

and boom:

[  285.353793] BUG: KASAN: slab-use-after-free in qfq_dequeue+0xa7/0x7f0
[  285.354910] Read of size 4 at addr ffff8880bad312a8 by task ping/144
[  285.355903]
[  285.356165] CPU: 1 PID: 144 Comm: ping Not tainted 6.5.0-rc3+ #4
[  285.357112] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[  285.358376] Call Trace:
[  285.358773]  <IRQ>
[  285.359109]  dump_stack_lvl+0x44/0x60
[  285.359708]  print_address_description.constprop.0+0x2c/0x3c0
[  285.360611]  kasan_report+0x10c/0x120
[  285.361195]  ? qfq_dequeue+0xa7/0x7f0
[  285.361780]  qfq_dequeue+0xa7/0x7f0
[  285.362342]  __qdisc_run+0xf1/0x970
[  285.362903]  net_tx_action+0x28e/0x460
[  285.363502]  __do_softirq+0x11b/0x3de
[  285.364097]  do_softirq.part.0+0x72/0x90
[  285.364721]  </IRQ>
[  285.365072]  <TASK>
[  285.365422]  __local_bh_enable_ip+0x77/0x90
[  285.366079]  __dev_queue_xmit+0x95f/0x1550
[  285.366732]  ? __pfx_csum_and_copy_from_iter+0x10/0x10
[  285.367526]  ? __pfx___dev_queue_xmit+0x10/0x10
[  285.368259]  ? __build_skb_around+0x129/0x190
[  285.368960]  ? ip_generic_getfrag+0x12c/0x170
[  285.369653]  ? __pfx_ip_generic_getfrag+0x10/0x10
[  285.370390]  ? csum_partial+0x8/0x20
[  285.370961]  ? raw_getfrag+0xe5/0x140
[  285.371559]  ip_finish_output2+0x539/0xa40
[  285.372222]  ? __pfx_ip_finish_output2+0x10/0x10
[  285.372954]  ip_output+0x113/0x1e0
[  285.373512]  ? __pfx_ip_output+0x10/0x10
[  285.374130]  ? icmp_out_count+0x49/0x60
[  285.374739]  ? __pfx_ip_finish_output+0x10/0x10
[  285.375457]  ip_push_pending_frames+0xf3/0x100
[  285.376173]  raw_sendmsg+0xef5/0x12d0
[  285.376760]  ? do_syscall_64+0x40/0x90
[  285.377359]  ? __static_call_text_end+0x136578/0x136578
[  285.378173]  ? do_syscall_64+0x40/0x90
[  285.378772]  ? kasan_enable_current+0x11/0x20
[  285.379469]  ? __pfx_raw_sendmsg+0x10/0x10
[  285.380137]  ? __sock_create+0x13e/0x270
[  285.380673]  ? __sys_socket+0xf3/0x180
[  285.381174]  ? __x64_sys_socket+0x3d/0x50
[  285.381725]  ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.382425]  ? __rcu_read_unlock+0x48/0x70
[  285.382975]  ? ip4_datagram_release_cb+0xd8/0x380
[  285.383608]  ? __pfx_ip4_datagram_release_cb+0x10/0x10
[  285.384295]  ? preempt_count_sub+0x14/0xc0
[  285.384844]  ? __list_del_entry_valid+0x76/0x140
[  285.385467]  ? _raw_spin_lock_bh+0x87/0xe0
[  285.386014]  ? __pfx__raw_spin_lock_bh+0x10/0x10
[  285.386645]  ? release_sock+0xa0/0xd0
[  285.387148]  ? preempt_count_sub+0x14/0xc0
[  285.387712]  ? freeze_secondary_cpus+0x348/0x3c0
[  285.388341]  ? aa_sk_perm+0x177/0x390
[  285.388856]  ? __pfx_aa_sk_perm+0x10/0x10
[  285.389441]  ? check_stack_object+0x22/0x70
[  285.390032]  ? inet_send_prepare+0x2f/0x120
[  285.390603]  ? __pfx_inet_sendmsg+0x10/0x10
[  285.391172]  sock_sendmsg+0xcc/0xe0
[  285.391667]  __sys_sendto+0x190/0x230
[  285.392168]  ? __pfx___sys_sendto+0x10/0x10
[  285.392727]  ? kvm_clock_get_cycles+0x14/0x30
[  285.393328]  ? set_normalized_timespec64+0x57/0x70
[  285.393980]  ? _raw_spin_unlock_irq+0x1b/0x40
[  285.394578]  ? __x64_sys_clock_gettime+0x11c/0x160
[  285.395225]  ? __pfx___x64_sys_clock_gettime+0x10/0x10
[  285.395908]  ? _copy_to_user+0x3e/0x60
[  285.396432]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.397086]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.397734]  ? do_syscall_64+0x71/0x90
[  285.398258]  __x64_sys_sendto+0x74/0x90
[  285.398786]  do_syscall_64+0x64/0x90
[  285.399273]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.399949]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.400605]  ? do_syscall_64+0x71/0x90
[  285.401124]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.401807] RIP: 0033:0x495726
[  285.402233] Code: ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 09
[  285.404683] RSP: 002b:00007ffcc25fb618 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[  285.405677] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 0000000000495726
[  285.406628] RDX: 0000000000000040 RSI: 0000000002518750 RDI: 0000000000000000
[  285.407565] RBP: 00000000005205ef R08: 00000000005f8838 R09: 000000000000001c
[  285.408523] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000002517634
[  285.409460] R13: 00007ffcc25fb6f0 R14: 0000000000000003 R15: 0000000000000000
[  285.410403]  </TASK>
[  285.410704]
[  285.410929] Allocated by task 144:
[  285.411402]  kasan_save_stack+0x1e/0x40
[  285.411926]  kasan_set_track+0x21/0x30
[  285.412442]  __kasan_slab_alloc+0x55/0x70
[  285.412973]  kmem_cache_alloc_node+0x187/0x3d0
[  285.413567]  __alloc_skb+0x1b4/0x230
[  285.414060]  __ip_append_data+0x17f7/0x1b60
[  285.414633]  ip_append_data+0x97/0xf0
[  285.415144]  raw_sendmsg+0x5a8/0x12d0
[  285.415640]  sock_sendmsg+0xcc/0xe0
[  285.416117]  __sys_sendto+0x190/0x230
[  285.416626]  __x64_sys_sendto+0x74/0x90
[  285.417145]  do_syscall_64+0x64/0x90
[  285.417624]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.418306]
[  285.418531] Freed by task 144:
[  285.418960]  kasan_save_stack+0x1e/0x40
[  285.419469]  kasan_set_track+0x21/0x30
[  285.419988]  kasan_save_free_info+0x27/0x40
[  285.420556]  ____kasan_slab_free+0x109/0x1a0
[  285.421146]  kmem_cache_free+0x1c2/0x450
[  285.421680]  __netif_receive_skb_core+0x2ce/0x1870
[  285.422333]  __netif_receive_skb_one_core+0x97/0x140
[  285.423003]  process_backlog+0x100/0x2f0
[  285.423537]  __napi_poll+0x5c/0x2d0
[  285.424023]  net_rx_action+0x2be/0x560
[  285.424510]  __do_softirq+0x11b/0x3de
[  285.425034]
[  285.425254] The buggy address belongs to the object at ffff8880bad31280
[  285.425254]  which belongs to the cache skbuff_head_cache of size 224
[  285.426993] The buggy address is located 40 bytes inside of
[  285.426993]  freed 224-byte region [ffff8880bad31280, ffff8880bad31360)
[  285.428572]
[  285.428798] The buggy address belongs to the physical page:
[  285.429540] page:00000000f4b77674 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xbad31
[  285.430758] flags: 0x100000000000200(slab|node=0|zone=1)
[  285.431447] page_type: 0xffffffff()
[  285.431934] raw: 0100000000000200 ffff88810094a8c0 dead000000000122 0000000000000000
[  285.432757] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
[  285.433562] page dumped because: kasan: bad access detected
[  285.434144]
[  285.434320] Memory state around the buggy address:
[  285.434828]  ffff8880bad31180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.435580]  ffff8880bad31200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.436264] >ffff8880bad31280: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  285.436777]                                   ^
[  285.437106]  ffff8880bad31300: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[  285.437616]  ffff8880bad31380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.438126] ==================================================================
[  285.438662] Disabling lock debugging due to kernel taint

Fix this by:
1. Changing sch_plug's .peek handler to qdisc_peek_dequeued(), a
function compatible with non-work-conserving qdiscs
2. Checking the return value of qdisc_dequeue_peeked() in sch_qfq.

Fixes: 462dbc9 ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
	Reported-by: valis <[email protected]>
	Signed-off-by: valis <[email protected]>
	Signed-off-by: Jamal Hadi Salim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
	Signed-off-by: Paolo Abeni <[email protected]>
(cherry picked from commit 8fc134f)
	Signed-off-by: Greg Rose <[email protected]>
pvts-mat pushed a commit to pvts-mat/kernel-src-tree that referenced this pull request Jan 14, 2025
jira LE-1907
Rebuild_History Non-Buildable kernel-rt-5.14.0-284.30.1.rt14.315.el9_2
commit-author minoura makoto <[email protected]>
commit b18cba0

Commit 9130b8d ("SUNRPC: allow for upcalls for the same uid
but different gss service") introduced `auth` argument to
__gss_find_upcall(), but in gss_pipe_downcall() it was left as NULL
since it (and auth->service) was not (yet) determined.

When multiple upcalls with the same uid and different service are
ongoing, it could happen that __gss_find_upcall(), which returns the
first match found in the pipe->in_downcall list, could not find the
correct gss_msg corresponding to the downcall we are looking for.
Moreover, it might return a msg which is not sent to rpc.gssd yet.

We could see mount.nfs process hung in D state with multiple mount.nfs
are executed in parallel.  The call trace below is of CentOS 7.9
kernel-3.10.0-1160.24.1.el7.x86_64 but we observed the same hang w/
elrepo kernel-ml-6.0.7-1.el7.

PID: 71258  TASK: ffff91ebd4be0000  CPU: 36  COMMAND: "mount.nfs"
 #0 [ffff9203ca3234f8] __schedule at ffffffffa3b8899f
 ctrliq#1 [ffff9203ca323580] schedule at ffffffffa3b88eb9
 ctrliq#2 [ffff9203ca323590] gss_cred_init at ffffffffc0355818 [auth_rpcgss]
 ctrliq#3 [ffff9203ca323658] rpcauth_lookup_credcache at ffffffffc0421ebc
[sunrpc]
 ctrliq#4 [ffff9203ca3236d8] gss_lookup_cred at ffffffffc0353633 [auth_rpcgss]
 ctrliq#5 [ffff9203ca3236e8] rpcauth_lookupcred at ffffffffc0421581 [sunrpc]
 ctrliq#6 [ffff9203ca323740] rpcauth_refreshcred at ffffffffc04223d3 [sunrpc]
 ctrliq#7 [ffff9203ca3237a0] call_refresh at ffffffffc04103dc [sunrpc]
 ctrliq#8 [ffff9203ca3237b8] __rpc_execute at ffffffffc041e1c9 [sunrpc]
 ctrliq#9 [ffff9203ca323820] rpc_execute at ffffffffc0420a48 [sunrpc]

The scenario is like this. Let's say there are two upcalls for
services A and B, A -> B in pipe->in_downcall, B -> A in pipe->pipe.

When rpc.gssd reads pipe to get the upcall msg corresponding to
service B from pipe->pipe and then writes the response, in
gss_pipe_downcall the msg corresponding to service A will be picked
because only uid is used to find the msg and it is before the one for
B in pipe->in_downcall.  And the process waiting for the msg
corresponding to service A will be woken up.

Actual scheduing of that process might be after rpc.gssd processes the
next msg.  In rpc_pipe_generic_upcall it clears msg->errno (for A).
The process is scheduled to see gss_msg->ctx == NULL and
gss_msg->msg.errno == 0, therefore it cannot break the loop in
gss_create_upcall and is never woken up after that.

This patch adds a simple check to ensure that a msg which is not
sent to rpc.gssd yet is not chosen as the matching upcall upon
receiving a downcall.

	Signed-off-by: minoura makoto <[email protected]>
	Signed-off-by: Hiroshi Shimamoto <[email protected]>
	Tested-by: Hiroshi Shimamoto <[email protected]>
	Cc: Trond Myklebust <[email protected]>
Fixes: 9130b8d ("SUNRPC: allow for upcalls for same uid but different gss service")
	Signed-off-by: Trond Myklebust <[email protected]>
(cherry picked from commit b18cba0)
	Signed-off-by: Jonathan Maple <[email protected]>
pvts-mat pushed a commit to pvts-mat/kernel-src-tree that referenced this pull request Jan 14, 2025
jira LE-1907
Rebuild_History Non-Buildable kernel-rt-5.14.0-284.30.1.rt14.315.el9_2
commit-author Stefan Assmann <[email protected]>
commit 4e264be

When a system with E810 with existing VFs gets rebooted the following
hang may be observed.

 Pid 1 is hung in iavf_remove(), part of a network driver:
 PID: 1        TASK: ffff965400e5a340  CPU: 24   COMMAND: "systemd-shutdow"
  #0 [ffffaad04005fa50] __schedule at ffffffff8b3239cb
  ctrliq#1 [ffffaad04005fae8] schedule at ffffffff8b323e2d
  ctrliq#2 [ffffaad04005fb00] schedule_hrtimeout_range_clock at ffffffff8b32cebc
  ctrliq#3 [ffffaad04005fb80] usleep_range_state at ffffffff8b32c930
  ctrliq#4 [ffffaad04005fbb0] iavf_remove at ffffffffc12b9b4c [iavf]
  ctrliq#5 [ffffaad04005fbf0] pci_device_remove at ffffffff8add7513
  ctrliq#6 [ffffaad04005fc10] device_release_driver_internal at ffffffff8af08baa
  ctrliq#7 [ffffaad04005fc40] pci_stop_bus_device at ffffffff8adcc5fc
  ctrliq#8 [ffffaad04005fc60] pci_stop_and_remove_bus_device at ffffffff8adcc81e
  ctrliq#9 [ffffaad04005fc70] pci_iov_remove_virtfn at ffffffff8adf9429
 ctrliq#10 [ffffaad04005fca8] sriov_disable at ffffffff8adf98e4
 ctrliq#11 [ffffaad04005fcc8] ice_free_vfs at ffffffffc04bb2c8 [ice]
 ctrliq#12 [ffffaad04005fd10] ice_remove at ffffffffc04778fe [ice]
 ctrliq#13 [ffffaad04005fd38] ice_shutdown at ffffffffc0477946 [ice]
 ctrliq#14 [ffffaad04005fd50] pci_device_shutdown at ffffffff8add58f1
 ctrliq#15 [ffffaad04005fd70] device_shutdown at ffffffff8af05386
 ctrliq#16 [ffffaad04005fd98] kernel_restart at ffffffff8a92a870
 ctrliq#17 [ffffaad04005fda8] __do_sys_reboot at ffffffff8a92abd6
 ctrliq#18 [ffffaad04005fee0] do_syscall_64 at ffffffff8b317159
 ctrliq#19 [ffffaad04005ff08] __context_tracking_enter at ffffffff8b31b6fc
 ctrliq#20 [ffffaad04005ff18] syscall_exit_to_user_mode at ffffffff8b31b50d
 ctrliq#21 [ffffaad04005ff28] do_syscall_64 at ffffffff8b317169
 ctrliq#22 [ffffaad04005ff50] entry_SYSCALL_64_after_hwframe at ffffffff8b40009b
     RIP: 00007f1baa5c13d7  RSP: 00007fffbcc55a98  RFLAGS: 00000202
     RAX: ffffffffffffffda  RBX: 0000000000000000  RCX: 00007f1baa5c13d7
     RDX: 0000000001234567  RSI: 0000000028121969  RDI: 00000000fee1dead
     RBP: 00007fffbcc55ca0   R8: 0000000000000000   R9: 00007fffbcc54e90
     R10: 00007fffbcc55050  R11: 0000000000000202  R12: 0000000000000005
     R13: 0000000000000000  R14: 00007fffbcc55af0  R15: 0000000000000000
     ORIG_RAX: 00000000000000a9  CS: 0033  SS: 002b

During reboot all drivers PM shutdown callbacks are invoked.
In iavf_shutdown() the adapter state is changed to __IAVF_REMOVE.
In ice_shutdown() the call chain above is executed, which at some point
calls iavf_remove(). However iavf_remove() expects the VF to be in one
of the states __IAVF_RUNNING, __IAVF_DOWN or __IAVF_INIT_FAILED. If
that's not the case it sleeps forever.
So if iavf_shutdown() gets invoked before iavf_remove() the system will
hang indefinitely because the adapter is already in state __IAVF_REMOVE.

Fix this by returning from iavf_remove() if the state is __IAVF_REMOVE,
as we already went through iavf_shutdown().

Fixes: 9745780 ("iavf: Add waiting so the port is initialized in remove")
Fixes: a841733 ("iavf: Fix race condition between iavf_shutdown and iavf_remove")
	Reported-by: Marius Cornea <[email protected]>
	Signed-off-by: Stefan Assmann <[email protected]>
	Reviewed-by: Michal Kubiak <[email protected]>
	Tested-by: Rafal Romanowski <[email protected]>
	Signed-off-by: Tony Nguyen <[email protected]>
(cherry picked from commit 4e264be)
	Signed-off-by: Jonathan Maple <[email protected]>
pvts-mat pushed a commit to pvts-mat/kernel-src-tree that referenced this pull request Jan 14, 2025
jira LE-1907
Rebuild_History Non-Buildable kernel-rt-5.14.0-284.30.1.rt14.315.el9_2
commit-author Michael Ellerman <[email protected]>
commit 6d65028

As reported by Alan, the CFI (Call Frame Information) in the VDSO time
routines is incorrect since commit ce7d805 ("powerpc/vdso: Prepare
for switching VDSO to generic C implementation.").

DWARF has a concept called the CFA (Canonical Frame Address), which on
powerpc is calculated as an offset from the stack pointer (r1). That
means when the stack pointer is changed there must be a corresponding
CFI directive to update the calculation of the CFA.

The current code is missing those directives for the changes to r1,
which prevents gdb from being able to generate a backtrace from inside
VDSO functions, eg:

  Breakpoint 1, 0x00007ffff7f804dc in __kernel_clock_gettime ()
  (gdb) bt
  #0  0x00007ffff7f804dc in __kernel_clock_gettime ()
  ctrliq#1  0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6
  ctrliq#2  0x00007fffffffd960 in ?? ()
  ctrliq#3  0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6
  Backtrace stopped: frame did not save the PC

Alan helpfully describes some rules for correctly maintaining the CFI information:

  1) Every adjustment to the current frame address reg (ie. r1) must be
     described, and exactly at the instruction where r1 changes. Why?
     Because stack unwinding might want to access previous frames.

  2) If a function changes LR or any non-volatile register, the save
     location for those regs must be given. The CFI can be at any
     instruction after the saves up to the point that the reg is
     changed.
     (Exception: LR save should be described before a bl. not after)

  3) If asychronous unwind info is needed then restores of LR and
     non-volatile regs must also be described. The CFI can be at any
     instruction after the reg is restored up to the point where the
     save location is (potentially) trashed.

Fix the inability to backtrace by adding CFI directives describing the
changes to r1, ie. satisfying rule 1.

Also change the information for LR to point to the copy saved on the
stack, not the value in r0 that will be overwritten by the function
call.

Finally, add CFI directives describing the save/restore of r2.

With the fix gdb can correctly back trace and navigate up and down the stack:

  Breakpoint 1, 0x00007ffff7f804dc in __kernel_clock_gettime ()
  (gdb) bt
  #0  0x00007ffff7f804dc in __kernel_clock_gettime ()
  ctrliq#1  0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6
  ctrliq#2  0x0000000100015b60 in gettime ()
  ctrliq#3  0x000000010000c8bc in print_long_format ()
  ctrliq#4  0x000000010000d180 in print_current_files ()
  ctrliq#5  0x00000001000054ac in main ()
  (gdb) up
  ctrliq#1  0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6
  (gdb)
  ctrliq#2  0x0000000100015b60 in gettime ()
  (gdb)
  ctrliq#3  0x000000010000c8bc in print_long_format ()
  (gdb)
  ctrliq#4  0x000000010000d180 in print_current_files ()
  (gdb)
  ctrliq#5  0x00000001000054ac in main ()
  (gdb)
  Initial frame selected; you cannot go up.
  (gdb) down
  ctrliq#4  0x000000010000d180 in print_current_files ()
  (gdb)
  ctrliq#3  0x000000010000c8bc in print_long_format ()
  (gdb)
  ctrliq#2  0x0000000100015b60 in gettime ()
  (gdb)
  ctrliq#1  0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6
  (gdb)
  #0  0x00007ffff7f804dc in __kernel_clock_gettime ()
  (gdb)

Fixes: ce7d805 ("powerpc/vdso: Prepare for switching VDSO to generic C implementation.")
	Cc: [email protected] # v5.11+
	Reported-by: Alan Modra <[email protected]>
	Signed-off-by: Michael Ellerman <[email protected]>
	Reviewed-by: Segher Boessenkool <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
(cherry picked from commit 6d65028)
	Signed-off-by: Jonathan Maple <[email protected]>
pvts-mat pushed a commit to pvts-mat/kernel-src-tree that referenced this pull request Jan 14, 2025
jira LE-1907
Rebuild_History Non-Buildable kernel-rt-5.14.0-284.30.1.rt14.315.el9_2
commit-author Eelco Chaudron <[email protected]>
commit de9df6c

Currently, the per cpu upcall counters are allocated after the vport is
created and inserted into the system. This could lead to the datapath
accessing the counters before they are allocated resulting in a kernel
Oops.

Here is an example:

  PID: 59693    TASK: ffff0005f4f51500  CPU: 0    COMMAND: "ovs-vswitchd"
   #0 [ffff80000a39b5b0] __switch_to at ffffb70f0629f2f4
   ctrliq#1 [ffff80000a39b5d0] __schedule at ffffb70f0629f5cc
   ctrliq#2 [ffff80000a39b650] preempt_schedule_common at ffffb70f0629fa60
   ctrliq#3 [ffff80000a39b670] dynamic_might_resched at ffffb70f0629fb58
   ctrliq#4 [ffff80000a39b680] mutex_lock_killable at ffffb70f062a1388
   ctrliq#5 [ffff80000a39b6a0] pcpu_alloc at ffffb70f0594460c
   ctrliq#6 [ffff80000a39b750] __alloc_percpu_gfp at ffffb70f05944e68
   ctrliq#7 [ffff80000a39b760] ovs_vport_cmd_new at ffffb70ee6961b90 [openvswitch]
   ...

  PID: 58682    TASK: ffff0005b2f0bf00  CPU: 0    COMMAND: "kworker/0:3"
   #0 [ffff80000a5d2f40] machine_kexec at ffffb70f056a0758
   ctrliq#1 [ffff80000a5d2f70] __crash_kexec at ffffb70f057e2994
   ctrliq#2 [ffff80000a5d3100] crash_kexec at ffffb70f057e2ad8
   ctrliq#3 [ffff80000a5d3120] die at ffffb70f0628234c
   ctrliq#4 [ffff80000a5d31e0] die_kernel_fault at ffffb70f062828a8
   ctrliq#5 [ffff80000a5d3210] __do_kernel_fault at ffffb70f056a31f4
   ctrliq#6 [ffff80000a5d3240] do_bad_area at ffffb70f056a32a4
   ctrliq#7 [ffff80000a5d3260] do_translation_fault at ffffb70f062a9710
   ctrliq#8 [ffff80000a5d3270] do_mem_abort at ffffb70f056a2f74
   ctrliq#9 [ffff80000a5d32a0] el1_abort at ffffb70f06297dac
  ctrliq#10 [ffff80000a5d32d0] el1h_64_sync_handler at ffffb70f06299b24
  ctrliq#11 [ffff80000a5d3410] el1h_64_sync at ffffb70f056812dc
  ctrliq#12 [ffff80000a5d3430] ovs_dp_upcall at ffffb70ee6963c84 [openvswitch]
  ctrliq#13 [ffff80000a5d3470] ovs_dp_process_packet at ffffb70ee6963fdc [openvswitch]
  ctrliq#14 [ffff80000a5d34f0] ovs_vport_receive at ffffb70ee6972c78 [openvswitch]
  ctrliq#15 [ffff80000a5d36f0] netdev_port_receive at ffffb70ee6973948 [openvswitch]
  ctrliq#16 [ffff80000a5d3720] netdev_frame_hook at ffffb70ee6973a28 [openvswitch]
  ctrliq#17 [ffff80000a5d3730] __netif_receive_skb_core.constprop.0 at ffffb70f06079f90

We moved the per cpu upcall counter allocation to the existing vport
alloc and free functions to solve this.

Fixes: 95637d9 ("net: openvswitch: release vport resources on failure")
Fixes: 1933ea3 ("net: openvswitch: Add support to count upcall packets")
	Signed-off-by: Eelco Chaudron <[email protected]>
	Reviewed-by: Simon Horman <[email protected]>
	Acked-by: Aaron Conole <[email protected]>
	Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit de9df6c)
	Signed-off-by: Jonathan Maple <[email protected]>
gvrose8192 added a commit that referenced this pull request Jan 16, 2025
jira VULN-6730
cve CVE-2023-4921
commit-author valis <[email protected]>
commit 8fc134f

When the plug qdisc is used as a class of the qfq qdisc it could trigger a
UAF. This issue can be reproduced with following commands:

  tc qdisc add dev lo root handle 1: qfq
  tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
  tc qdisc add dev lo parent 1:1 handle 2: plug
  tc filter add dev lo parent 1: basic classid 1:1
  ping -c1 127.0.0.1

and boom:

[  285.353793] BUG: KASAN: slab-use-after-free in qfq_dequeue+0xa7/0x7f0
[  285.354910] Read of size 4 at addr ffff8880bad312a8 by task ping/144
[  285.355903]
[  285.356165] CPU: 1 PID: 144 Comm: ping Not tainted 6.5.0-rc3+ #4
[  285.357112] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[  285.358376] Call Trace:
[  285.358773]  <IRQ>
[  285.359109]  dump_stack_lvl+0x44/0x60
[  285.359708]  print_address_description.constprop.0+0x2c/0x3c0
[  285.360611]  kasan_report+0x10c/0x120
[  285.361195]  ? qfq_dequeue+0xa7/0x7f0
[  285.361780]  qfq_dequeue+0xa7/0x7f0
[  285.362342]  __qdisc_run+0xf1/0x970
[  285.362903]  net_tx_action+0x28e/0x460
[  285.363502]  __do_softirq+0x11b/0x3de
[  285.364097]  do_softirq.part.0+0x72/0x90
[  285.364721]  </IRQ>
[  285.365072]  <TASK>
[  285.365422]  __local_bh_enable_ip+0x77/0x90
[  285.366079]  __dev_queue_xmit+0x95f/0x1550
[  285.366732]  ? __pfx_csum_and_copy_from_iter+0x10/0x10
[  285.367526]  ? __pfx___dev_queue_xmit+0x10/0x10
[  285.368259]  ? __build_skb_around+0x129/0x190
[  285.368960]  ? ip_generic_getfrag+0x12c/0x170
[  285.369653]  ? __pfx_ip_generic_getfrag+0x10/0x10
[  285.370390]  ? csum_partial+0x8/0x20
[  285.370961]  ? raw_getfrag+0xe5/0x140
[  285.371559]  ip_finish_output2+0x539/0xa40
[  285.372222]  ? __pfx_ip_finish_output2+0x10/0x10
[  285.372954]  ip_output+0x113/0x1e0
[  285.373512]  ? __pfx_ip_output+0x10/0x10
[  285.374130]  ? icmp_out_count+0x49/0x60
[  285.374739]  ? __pfx_ip_finish_output+0x10/0x10
[  285.375457]  ip_push_pending_frames+0xf3/0x100
[  285.376173]  raw_sendmsg+0xef5/0x12d0
[  285.376760]  ? do_syscall_64+0x40/0x90
[  285.377359]  ? __static_call_text_end+0x136578/0x136578
[  285.378173]  ? do_syscall_64+0x40/0x90
[  285.378772]  ? kasan_enable_current+0x11/0x20
[  285.379469]  ? __pfx_raw_sendmsg+0x10/0x10
[  285.380137]  ? __sock_create+0x13e/0x270
[  285.380673]  ? __sys_socket+0xf3/0x180
[  285.381174]  ? __x64_sys_socket+0x3d/0x50
[  285.381725]  ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.382425]  ? __rcu_read_unlock+0x48/0x70
[  285.382975]  ? ip4_datagram_release_cb+0xd8/0x380
[  285.383608]  ? __pfx_ip4_datagram_release_cb+0x10/0x10
[  285.384295]  ? preempt_count_sub+0x14/0xc0
[  285.384844]  ? __list_del_entry_valid+0x76/0x140
[  285.385467]  ? _raw_spin_lock_bh+0x87/0xe0
[  285.386014]  ? __pfx__raw_spin_lock_bh+0x10/0x10
[  285.386645]  ? release_sock+0xa0/0xd0
[  285.387148]  ? preempt_count_sub+0x14/0xc0
[  285.387712]  ? freeze_secondary_cpus+0x348/0x3c0
[  285.388341]  ? aa_sk_perm+0x177/0x390
[  285.388856]  ? __pfx_aa_sk_perm+0x10/0x10
[  285.389441]  ? check_stack_object+0x22/0x70
[  285.390032]  ? inet_send_prepare+0x2f/0x120
[  285.390603]  ? __pfx_inet_sendmsg+0x10/0x10
[  285.391172]  sock_sendmsg+0xcc/0xe0
[  285.391667]  __sys_sendto+0x190/0x230
[  285.392168]  ? __pfx___sys_sendto+0x10/0x10
[  285.392727]  ? kvm_clock_get_cycles+0x14/0x30
[  285.393328]  ? set_normalized_timespec64+0x57/0x70
[  285.393980]  ? _raw_spin_unlock_irq+0x1b/0x40
[  285.394578]  ? __x64_sys_clock_gettime+0x11c/0x160
[  285.395225]  ? __pfx___x64_sys_clock_gettime+0x10/0x10
[  285.395908]  ? _copy_to_user+0x3e/0x60
[  285.396432]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.397086]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.397734]  ? do_syscall_64+0x71/0x90
[  285.398258]  __x64_sys_sendto+0x74/0x90
[  285.398786]  do_syscall_64+0x64/0x90
[  285.399273]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.399949]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.400605]  ? do_syscall_64+0x71/0x90
[  285.401124]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.401807] RIP: 0033:0x495726
[  285.402233] Code: ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 09
[  285.404683] RSP: 002b:00007ffcc25fb618 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[  285.405677] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 0000000000495726
[  285.406628] RDX: 0000000000000040 RSI: 0000000002518750 RDI: 0000000000000000
[  285.407565] RBP: 00000000005205ef R08: 00000000005f8838 R09: 000000000000001c
[  285.408523] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000002517634
[  285.409460] R13: 00007ffcc25fb6f0 R14: 0000000000000003 R15: 0000000000000000
[  285.410403]  </TASK>
[  285.410704]
[  285.410929] Allocated by task 144:
[  285.411402]  kasan_save_stack+0x1e/0x40
[  285.411926]  kasan_set_track+0x21/0x30
[  285.412442]  __kasan_slab_alloc+0x55/0x70
[  285.412973]  kmem_cache_alloc_node+0x187/0x3d0
[  285.413567]  __alloc_skb+0x1b4/0x230
[  285.414060]  __ip_append_data+0x17f7/0x1b60
[  285.414633]  ip_append_data+0x97/0xf0
[  285.415144]  raw_sendmsg+0x5a8/0x12d0
[  285.415640]  sock_sendmsg+0xcc/0xe0
[  285.416117]  __sys_sendto+0x190/0x230
[  285.416626]  __x64_sys_sendto+0x74/0x90
[  285.417145]  do_syscall_64+0x64/0x90
[  285.417624]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.418306]
[  285.418531] Freed by task 144:
[  285.418960]  kasan_save_stack+0x1e/0x40
[  285.419469]  kasan_set_track+0x21/0x30
[  285.419988]  kasan_save_free_info+0x27/0x40
[  285.420556]  ____kasan_slab_free+0x109/0x1a0
[  285.421146]  kmem_cache_free+0x1c2/0x450
[  285.421680]  __netif_receive_skb_core+0x2ce/0x1870
[  285.422333]  __netif_receive_skb_one_core+0x97/0x140
[  285.423003]  process_backlog+0x100/0x2f0
[  285.423537]  __napi_poll+0x5c/0x2d0
[  285.424023]  net_rx_action+0x2be/0x560
[  285.424510]  __do_softirq+0x11b/0x3de
[  285.425034]
[  285.425254] The buggy address belongs to the object at ffff8880bad31280
[  285.425254]  which belongs to the cache skbuff_head_cache of size 224
[  285.426993] The buggy address is located 40 bytes inside of
[  285.426993]  freed 224-byte region [ffff8880bad31280, ffff8880bad31360)
[  285.428572]
[  285.428798] The buggy address belongs to the physical page:
[  285.429540] page:00000000f4b77674 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xbad31
[  285.430758] flags: 0x100000000000200(slab|node=0|zone=1)
[  285.431447] page_type: 0xffffffff()
[  285.431934] raw: 0100000000000200 ffff88810094a8c0 dead000000000122 0000000000000000
[  285.432757] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
[  285.433562] page dumped because: kasan: bad access detected
[  285.434144]
[  285.434320] Memory state around the buggy address:
[  285.434828]  ffff8880bad31180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.435580]  ffff8880bad31200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.436264] >ffff8880bad31280: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  285.436777]                                   ^
[  285.437106]  ffff8880bad31300: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[  285.437616]  ffff8880bad31380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.438126] ==================================================================
[  285.438662] Disabling lock debugging due to kernel taint

Fix this by:
1. Changing sch_plug's .peek handler to qdisc_peek_dequeued(), a
function compatible with non-work-conserving qdiscs
2. Checking the return value of qdisc_dequeue_peeked() in sch_qfq.

Fixes: 462dbc9 ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
	Reported-by: valis <[email protected]>
	Signed-off-by: valis <[email protected]>
	Signed-off-by: Jamal Hadi Salim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
	Signed-off-by: Paolo Abeni <[email protected]>
(cherry picked from commit 8fc134f)
	Signed-off-by: Greg Rose <[email protected]>
PlaidCat added a commit that referenced this pull request Aug 25, 2025
jira LE-3622
cve CVE-2024-57980
Rebuild_History Non-Buildable kernel-6.12.0-55.22.1.el10_0
commit-author Laurent Pinchart <[email protected]>
commit c6ef3a7

If the uvc_status_init() function fails to allocate the int_urb, it will
free the dev->status pointer but doesn't reset the pointer to NULL. This
results in the kfree() call in uvc_status_cleanup() trying to
double-free the memory. Fix it by resetting the dev->status pointer to
NULL after freeing it.

Fixes: a31a405 ("V4L/DVB:usbvideo:don't use part of buffer for USB transfer #4")
	Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
	Signed-off-by: Laurent Pinchart <[email protected]>
Reviewed by: Ricardo Ribalda <[email protected]>
	Signed-off-by: Mauro Carvalho Chehab <[email protected]>
(cherry picked from commit c6ef3a7)
	Signed-off-by: Jonathan Maple <[email protected]>
PlaidCat added a commit that referenced this pull request Aug 25, 2025
jira LE-3622
Rebuild_History Non-Buildable kernel-6.12.0-55.22.1.el10_0
commit-author David Hildenbrand <[email protected]>
commit 2ccd42b

If we finds a vq without a name in our input array in
virtio_ccw_find_vqs(), we treat it as "non-existing" and set the vq pointer
to NULL; we will not call virtio_ccw_setup_vq() to allocate/setup a vq.

Consequently, we create only a queue if it actually exists (name != NULL)
and assign an incremental queue index to each such existing queue.

However, in virtio_ccw_register_adapter_ind()->get_airq_indicator() we
will not ignore these "non-existing queues", but instead assign an airq
indicator to them.

Besides never releasing them in virtio_ccw_drop_indicators() (because
there is no virtqueue), the bigger issue seems to be that there will be a
disagreement between the device and the Linux guest about the airq
indicator to be used for notifying a queue, because the indicator bit
for adapter I/O interrupt is derived from the queue index.

The virtio spec states under "Setting Up Two-Stage Queue Indicators":

	... indicator contains the guest address of an area wherein the
	indicators for the devices are contained, starting at bit_nr, one
	bit per virtqueue of the device.

And further in "Notification via Adapter I/O Interrupts":

	For notifying the driver of virtqueue buffers, the device sets the
	bit in the guest-provided indicator area at the corresponding
	offset.

For example, QEMU uses in virtio_ccw_notify() the queue index (passed as
"vector") to select the relevant indicator bit. If a queue does not exist,
it does not have a corresponding indicator bit assigned, because it
effectively doesn't have a queue index.

Using a virtio-balloon-ccw device under QEMU with free-page-hinting
disabled ("free-page-hint=off") but free-page-reporting enabled
("free-page-reporting=on") will result in free page reporting
not working as expected: in the virtio_balloon driver, we'll be stuck
forever in virtballoon_free_page_report()->wait_event(), because the
waitqueue will not be woken up as the notification from the device is
lost: it would use the wrong indicator bit.

Free page reporting stops working and we get splats (when configured to
detect hung wqs) like:

 INFO: task kworker/1:3:463 blocked for more than 61 seconds.
       Not tainted 6.14.0 #4
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/1:3 [...]
 Workqueue: events page_reporting_process
 Call Trace:
  [<000002f404e6dfb2>] __schedule+0x402/0x1640
  [<000002f404e6f22e>] schedule+0x3e/0xe0
  [<000002f3846a88fa>] virtballoon_free_page_report+0xaa/0x110 [virtio_balloon]
  [<000002f40435c8a4>] page_reporting_process+0x2e4/0x740
  [<000002f403fd3ee2>] process_one_work+0x1c2/0x400
  [<000002f403fd4b96>] worker_thread+0x296/0x420
  [<000002f403fe10b4>] kthread+0x124/0x290
  [<000002f403f4e0dc>] __ret_from_fork+0x3c/0x60
  [<000002f404e77272>] ret_from_fork+0xa/0x38

There was recently a discussion [1] whether the "holes" should be
treated differently again, effectively assigning also non-existing
queues a queue index: that should also fix the issue, but requires other
workarounds to not break existing setups.

Let's fix it without affecting existing setups for now by properly ignoring
the non-existing queues, so the indicator bits will match the queue
indexes.

[1] https://lore.kernel.org/all/[email protected]/

Fixes: a229989 ("virtio: don't allocate vqs when names[i] = NULL")
	Reported-by: Chandra Merla <[email protected]>
	Cc: [email protected]
	Signed-off-by: David Hildenbrand <[email protected]>
	Tested-by: Thomas Huth <[email protected]>
	Reviewed-by: Thomas Huth <[email protected]>
	Reviewed-by: Cornelia Huck <[email protected]>
	Acked-by: Michael S. Tsirkin <[email protected]>
	Acked-by: Christian Borntraeger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
	Signed-off-by: Heiko Carstens <[email protected]>
(cherry picked from commit 2ccd42b)
	Signed-off-by: Jonathan Maple <[email protected]>
PlaidCat added a commit that referenced this pull request Aug 25, 2025
jira LE-3622
cve CVE-2025-37958
Rebuild_History Non-Buildable kernel-6.12.0-55.22.1.el10_0
commit-author Gavin Guo <[email protected]>
commit be6e843

When migrating a THP, concurrent access to the PMD migration entry during
a deferred split scan can lead to an invalid address access, as
illustrated below.  To prevent this invalid access, it is necessary to
check the PMD migration entry and return early.  In this context, there is
no need to use pmd_to_swp_entry and pfn_swap_entry_to_page to verify the
equality of the target folio.  Since the PMD migration entry is locked, it
cannot be served as the target.

Mailing list discussion and explanation from Hugh Dickins: "An anon_vma
lookup points to a location which may contain the folio of interest, but
might instead contain another folio: and weeding out those other folios is
precisely what the "folio != pmd_folio((*pmd)" check (and the "risk of
replacing the wrong folio" comment a few lines above it) is for."

BUG: unable to handle page fault for address: ffffea60001db008
CPU: 0 UID: 0 PID: 2199114 Comm: tee Not tainted 6.14.0+ #4 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:split_huge_pmd_locked+0x3b5/0x2b60
Call Trace:
<TASK>
try_to_migrate_one+0x28c/0x3730
rmap_walk_anon+0x4f6/0x770
unmap_folio+0x196/0x1f0
split_huge_page_to_list_to_order+0x9f6/0x1560
deferred_split_scan+0xac5/0x12a0
shrinker_debugfs_scan_write+0x376/0x470
full_proxy_write+0x15c/0x220
vfs_write+0x2fc/0xcb0
ksys_write+0x146/0x250
do_syscall_64+0x6a/0x120
entry_SYSCALL_64_after_hwframe+0x76/0x7e

The bug is found by syzkaller on an internal kernel, then confirmed on
upstream.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/[email protected]/
Link: https://lore.kernel.org/all/[email protected]/
Fixes: 84c3fc4 ("mm: thp: check pmd migration entry in common path")
	Signed-off-by: Gavin Guo <[email protected]>
	Acked-by: David Hildenbrand <[email protected]>
	Acked-by: Hugh Dickins <[email protected]>
	Acked-by: Zi Yan <[email protected]>
	Reviewed-by: Gavin Shan <[email protected]>
	Cc: Florent Revest <[email protected]>
	Cc: Matthew Wilcox (Oracle) <[email protected]>
	Cc: Miaohe Lin <[email protected]>
	Cc: <[email protected]>
	Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit be6e843)
	Signed-off-by: Jonathan Maple <[email protected]>
github-actions bot pushed a commit that referenced this pull request Aug 29, 2025
[ Upstream commit d9cef55 ]

BPF CI testing report a UAF issue:

  [   16.446633] BUG: kernel NULL pointer dereference, address: 000000000000003  0
  [   16.447134] #PF: supervisor read access in kernel mod  e
  [   16.447516] #PF: error_code(0x0000) - not-present pag  e
  [   16.447878] PGD 0 P4D   0
  [   16.448063] Oops: Oops: 0000 [#1] PREEMPT SMP NOPT  I
  [   16.448409] CPU: 0 UID: 0 PID: 9 Comm: kworker/0:1 Tainted: G           OE      6.13.0-rc3-g89e8a75fda73-dirty #4  2
  [   16.449124] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODUL  E
  [   16.449502] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/201  4
  [   16.450201] Workqueue: smc_hs_wq smc_listen_wor  k
  [   16.450531] RIP: 0010:smc_listen_work+0xc02/0x159  0
  [   16.452158] RSP: 0018:ffffb5ab40053d98 EFLAGS: 0001024  6
  [   16.452526] RAX: 0000000000000001 RBX: 0000000000000002 RCX: 000000000000030  0
  [   16.452994] RDX: 0000000000000280 RSI: 00003513840053f0 RDI: 000000000000000  0
  [   16.453492] RBP: ffffa097808e3800 R08: ffffa09782dba1e0 R09: 000000000000000  5
  [   16.453987] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa0978274640  0
  [   16.454497] R13: 0000000000000000 R14: 0000000000000000 R15: ffffa09782d4092  0
  [   16.454996] FS:  0000000000000000(0000) GS:ffffa097bbc00000(0000) knlGS:000000000000000  0
  [   16.455557] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003  3
  [   16.455961] CR2: 0000000000000030 CR3: 0000000102788004 CR4: 0000000000770ef  0
  [   16.456459] PKRU: 5555555  4
  [   16.456654] Call Trace  :
  [   16.456832]  <TASK  >
  [   16.456989]  ? __die+0x23/0x7  0
  [   16.457215]  ? page_fault_oops+0x180/0x4c  0
  [   16.457508]  ? __lock_acquire+0x3e6/0x249  0
  [   16.457801]  ? exc_page_fault+0x68/0x20  0
  [   16.458080]  ? asm_exc_page_fault+0x26/0x3  0
  [   16.458389]  ? smc_listen_work+0xc02/0x159  0
  [   16.458689]  ? smc_listen_work+0xc02/0x159  0
  [   16.458987]  ? lock_is_held_type+0x8f/0x10  0
  [   16.459284]  process_one_work+0x1ea/0x6d  0
  [   16.459570]  worker_thread+0x1c3/0x38  0
  [   16.459839]  ? __pfx_worker_thread+0x10/0x1  0
  [   16.460144]  kthread+0xe0/0x11  0
  [   16.460372]  ? __pfx_kthread+0x10/0x1  0
  [   16.460640]  ret_from_fork+0x31/0x5  0
  [   16.460896]  ? __pfx_kthread+0x10/0x1  0
  [   16.461166]  ret_from_fork_asm+0x1a/0x3  0
  [   16.461453]  </TASK  >
  [   16.461616] Modules linked in: bpf_testmod(OE) [last unloaded: bpf_testmod(OE)  ]
  [   16.462134] CR2: 000000000000003  0
  [   16.462380] ---[ end trace 0000000000000000 ]---
  [   16.462710] RIP: 0010:smc_listen_work+0xc02/0x1590

The direct cause of this issue is that after smc_listen_out_connected(),
newclcsock->sk may be NULL since it will releases the smcsk. Therefore,
if the application closes the socket immediately after accept,
newclcsock->sk can be NULL. A possible execution order could be as
follows:

smc_listen_work                                 | userspace
-----------------------------------------------------------------
lock_sock(sk)                                   |
smc_listen_out_connected()                      |
| \- smc_listen_out                             |
|    | \- release_sock                          |
     | |- sk->sk_data_ready()                   |
                                                | fd = accept();
                                                | close(fd);
                                                |  \- socket->sk = NULL;
/* newclcsock->sk is NULL now */
SMC_STAT_SERV_SUCC_INC(sock_net(newclcsock->sk))

Since smc_listen_out_connected() will not fail, simply swapping the order
of the code can easily fix this issue.

Fixes: 3b2dec2 ("net/smc: restructure client and server code in af_smc")
Signed-off-by: D. Wythe <[email protected]>
Reviewed-by: Guangguan Wang <[email protected]>
Reviewed-by: Alexandra Winter <[email protected]>
Reviewed-by: Dust Li <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
github-actions bot pushed a commit that referenced this pull request Aug 30, 2025
These iterations require the read lock, otherwise RCU
lockdep will splat:

=============================
WARNING: suspicious RCU usage
6.17.0-rc3-00014-g31419c045d64 #6 Tainted: G           O
-----------------------------
drivers/base/power/main.c:1333 RCU-list traversed in non-reader section!!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
5 locks held by rtcwake/547:
 #0: 00000000643ab418 (sb_writers#6){.+.+}-{0:0}, at: file_start_write+0x2b/0x3a
 #1: 0000000067a0ca88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x181/0x24b
 #2: 00000000631eac40 (kn->active#3){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x191/0x24b
 #3: 00000000609a1308 (system_transition_mutex){+.+.}-{4:4}, at: pm_suspend+0xaf/0x30b
 #4: 0000000060c0fdb0 (device_links_srcu){.+.+}-{0:0}, at: device_links_read_lock+0x75/0x98

stack backtrace:
CPU: 0 UID: 0 PID: 547 Comm: rtcwake Tainted: G           O        6.17.0-rc3-00014-g31419c045d64 #6 VOLUNTARY
Tainted: [O]=OOT_MODULE
Stack:
 223721b3a80 6089eac6 00000001 00000001
 ffffff00 6089eac6 00000535 6086e528
 721b3ac0 6003c294 00000000 60031fc0
Call Trace:
 [<600407ed>] show_stack+0x10e/0x127
 [<6003c294>] dump_stack_lvl+0x77/0xc6
 [<6003c2fd>] dump_stack+0x1a/0x20
 [<600bc2f8>] lockdep_rcu_suspicious+0x116/0x13e
 [<603d8ea1>] dpm_async_suspend_superior+0x117/0x17e
 [<603d980f>] device_suspend+0x528/0x541
 [<603da24b>] dpm_suspend+0x1a2/0x267
 [<603da837>] dpm_suspend_start+0x5d/0x72
 [<600ca0c9>] suspend_devices_and_enter+0xab/0x736
 [...]

Add the fourth argument to the iteration to annotate
this and avoid the splat.

Fixes: 0679963 ("PM: sleep: Make async suspend handle suppliers like parents")
Fixes: ed18738 ("PM: sleep: Make async resume handle consumers like children")
Signed-off-by: Johannes Berg <[email protected]>
Link: https://patch.msgid.link/20250826134348.aba79f6e6299.I9ecf55da46ccf33778f2c018a82e1819d815b348@changeid
Signed-off-by: Rafael J. Wysocki <[email protected]>
PlaidCat added a commit that referenced this pull request Sep 4, 2025
jira LE-4034
Rebuild_History Non-Buildable kernel-4.18.0-553.72.1.el8_10
commit-author Li Lingfeng <[email protected]>
commit b313a8c
Empty-Commit: Cherry-Pick Conflicts during history rebuild.
Will be included in final tarball splat. Ref for failed cherry-pick at:
ciq/ciq_backports/kernel-4.18.0-553.72.1.el8_10/b313a8c8.failed

Lockdep reported a warning in Linux version 6.6:

[  414.344659] ================================
[  414.345155] WARNING: inconsistent lock state
[  414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted
[  414.346221] --------------------------------
[  414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[  414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes:
[  414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[  414.351204] {IN-SOFTIRQ-W} state was registered at:
[  414.351751]   lock_acquire+0x18d/0x460
[  414.352218]   _raw_spin_lock_irqsave+0x39/0x60
[  414.352769]   __wake_up_common_lock+0x22/0x60
[  414.353289]   sbitmap_queue_wake_up+0x375/0x4f0
[  414.353829]   sbitmap_queue_clear+0xdd/0x270
[  414.354338]   blk_mq_put_tag+0xdf/0x170
[  414.354807]   __blk_mq_free_request+0x381/0x4d0
[  414.355335]   blk_mq_free_request+0x28b/0x3e0
[  414.355847]   __blk_mq_end_request+0x242/0xc30
[  414.356367]   scsi_end_request+0x2c1/0x830
[  414.345155] WARNING: inconsistent lock state
[  414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted
[  414.346221] --------------------------------
[  414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[  414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes:
[  414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[  414.351204] {IN-SOFTIRQ-W} state was registered at:
[  414.351751]   lock_acquire+0x18d/0x460
[  414.352218]   _raw_spin_lock_irqsave+0x39/0x60
[  414.352769]   __wake_up_common_lock+0x22/0x60
[  414.353289]   sbitmap_queue_wake_up+0x375/0x4f0
[  414.353829]   sbitmap_queue_clear+0xdd/0x270
[  414.354338]   blk_mq_put_tag+0xdf/0x170
[  414.354807]   __blk_mq_free_request+0x381/0x4d0
[  414.355335]   blk_mq_free_request+0x28b/0x3e0
[  414.355847]   __blk_mq_end_request+0x242/0xc30
[  414.356367]   scsi_end_request+0x2c1/0x830
[  414.356863]   scsi_io_completion+0x177/0x1610
[  414.357379]   scsi_complete+0x12f/0x260
[  414.357856]   blk_complete_reqs+0xba/0xf0
[  414.358338]   __do_softirq+0x1b0/0x7a2
[  414.358796]   irq_exit_rcu+0x14b/0x1a0
[  414.359262]   sysvec_call_function_single+0xaf/0xc0
[  414.359828]   asm_sysvec_call_function_single+0x1a/0x20
[  414.360426]   default_idle+0x1e/0x30
[  414.360873]   default_idle_call+0x9b/0x1f0
[  414.361390]   do_idle+0x2d2/0x3e0
[  414.361819]   cpu_startup_entry+0x55/0x60
[  414.362314]   start_secondary+0x235/0x2b0
[  414.362809]   secondary_startup_64_no_verify+0x18f/0x19b
[  414.363413] irq event stamp: 428794
[  414.363825] hardirqs last  enabled at (428793): [<ffffffff816bfd1c>] ktime_get+0x1dc/0x200
[  414.364694] hardirqs last disabled at (428794): [<ffffffff85470177>] _raw_spin_lock_irq+0x47/0x50
[  414.365629] softirqs last  enabled at (428444): [<ffffffff85474780>] __do_softirq+0x540/0x7a2
[  414.366522] softirqs last disabled at (428419): [<ffffffff813f65ab>] irq_exit_rcu+0x14b/0x1a0
[  414.367425]
               other info that might help us debug this:
[  414.368194]  Possible unsafe locking scenario:
[  414.368900]        CPU0
[  414.369225]        ----
[  414.369548]   lock(&sbq->ws[i].wait);
[  414.370000]   <Interrupt>
[  414.370342]     lock(&sbq->ws[i].wait);
[  414.370802]
                *** DEADLOCK ***
[  414.371569] 5 locks held by kworker/u10:3/1152:
[  414.372088]  #0: ffff88810130e938 ((wq_completion)writeback){+.+.}-{0:0}, at: process_scheduled_works+0x357/0x13f0
[  414.373180]  #1: ffff88810201fdb8 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_scheduled_works+0x3a3/0x13f0
[  414.374384]  #2: ffffffff86ffbdc0 (rcu_read_lock){....}-{1:2}, at: blk_mq_run_hw_queue+0x637/0xa00
[  414.375342]  #3: ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[  414.376377]  #4: ffff888106205a08 (&hctx->dispatch_wait_lock){+.-.}-{2:2}, at: blk_mq_dispatch_rq_list+0x1337/0x1ee0
[  414.378607]
               stack backtrace:
[  414.379177] CPU: 0 PID: 1152 Comm: kworker/u10:3 Not tainted 6.6.0-07439-gba2303cacfda #6
[  414.380032] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[  414.381177] Workqueue: writeback wb_workfn (flush-253:0)
[  414.381805] Call Trace:
[  414.382136]  <TASK>
[  414.382429]  dump_stack_lvl+0x91/0xf0
[  414.382884]  mark_lock_irq+0xb3b/0x1260
[  414.383367]  ? __pfx_mark_lock_irq+0x10/0x10
[  414.383889]  ? stack_trace_save+0x8e/0xc0
[  414.384373]  ? __pfx_stack_trace_save+0x10/0x10
[  414.384903]  ? graph_lock+0xcf/0x410
[  414.385350]  ? save_trace+0x3d/0xc70
[  414.385808]  mark_lock.part.20+0x56d/0xa90
[  414.386317]  mark_held_locks+0xb0/0x110
[  414.386791]  ? __pfx_do_raw_spin_lock+0x10/0x10
[  414.387320]  lockdep_hardirqs_on_prepare+0x297/0x3f0
[  414.387901]  ? _raw_spin_unlock_irq+0x28/0x50
[  414.388422]  trace_hardirqs_on+0x58/0x100
[  414.388917]  _raw_spin_unlock_irq+0x28/0x50
[  414.389422]  __blk_mq_tag_busy+0x1d6/0x2a0
[  414.389920]  __blk_mq_get_driver_tag+0x761/0x9f0
[  414.390899]  blk_mq_dispatch_rq_list+0x1780/0x1ee0
[  414.391473]  ? __pfx_blk_mq_dispatch_rq_list+0x10/0x10
[  414.392070]  ? sbitmap_get+0x2b8/0x450
[  414.392533]  ? __blk_mq_get_driver_tag+0x210/0x9f0
[  414.393095]  __blk_mq_sched_dispatch_requests+0xd99/0x1690
[  414.393730]  ? elv_attempt_insert_merge+0x1b1/0x420
[  414.394302]  ? __pfx___blk_mq_sched_dispatch_requests+0x10/0x10
[  414.394970]  ? lock_acquire+0x18d/0x460
[  414.395456]  ? blk_mq_run_hw_queue+0x637/0xa00
[  414.395986]  ? __pfx_lock_acquire+0x10/0x10
[  414.396499]  blk_mq_sched_dispatch_requests+0x109/0x190
[  414.397100]  blk_mq_run_hw_queue+0x66e/0xa00
[  414.397616]  blk_mq_flush_plug_list.part.17+0x614/0x2030
[  414.398244]  ? __pfx_blk_mq_flush_plug_list.part.17+0x10/0x10
[  414.398897]  ? writeback_sb_inodes+0x241/0xcc0
[  414.399429]  blk_mq_flush_plug_list+0x65/0x80
[  414.399957]  __blk_flush_plug+0x2f1/0x530
[  414.400458]  ? __pfx___blk_flush_plug+0x10/0x10
[  414.400999]  blk_finish_plug+0x59/0xa0
[  414.401467]  wb_writeback+0x7cc/0x920
[  414.401935]  ? __pfx_wb_writeback+0x10/0x10
[  414.402442]  ? mark_held_locks+0xb0/0x110
[  414.402931]  ? __pfx_do_raw_spin_lock+0x10/0x10
[  414.403462]  ? lockdep_hardirqs_on_prepare+0x297/0x3f0
[  414.404062]  wb_workfn+0x2b3/0xcf0
[  414.404500]  ? __pfx_wb_workfn+0x10/0x10
[  414.404989]  process_scheduled_works+0x432/0x13f0
[  414.405546]  ? __pfx_process_scheduled_works+0x10/0x10
[  414.406139]  ? do_raw_spin_lock+0x101/0x2a0
[  414.406641]  ? assign_work+0x19b/0x240
[  414.407106]  ? lock_is_held_type+0x9d/0x110
[  414.407604]  worker_thread+0x6f2/0x1160
[  414.408075]  ? __kthread_parkme+0x62/0x210
[  414.408572]  ? lockdep_hardirqs_on_prepare+0x297/0x3f0
[  414.409168]  ? __kthread_parkme+0x13c/0x210
[  414.409678]  ? __pfx_worker_thread+0x10/0x10
[  414.410191]  kthread+0x33c/0x440
[  414.410602]  ? __pfx_kthread+0x10/0x10
[  414.411068]  ret_from_fork+0x4d/0x80
[  414.411526]  ? __pfx_kthread+0x10/0x10
[  414.411993]  ret_from_fork_asm+0x1b/0x30
[  414.412489]  </TASK>

When interrupt is turned on while a lock holding by spin_lock_irq it
throws a warning because of potential deadlock.

blk_mq_prep_dispatch_rq
 blk_mq_get_driver_tag
  __blk_mq_get_driver_tag
   __blk_mq_alloc_driver_tag
    blk_mq_tag_busy -> tag is already busy
    // failed to get driver tag
 blk_mq_mark_tag_wait
  spin_lock_irq(&wq->lock) -> lock A (&sbq->ws[i].wait)
  __add_wait_queue(wq, wait) -> wait queue active
  blk_mq_get_driver_tag
  __blk_mq_tag_busy
-> 1) tag must be idle, which means there can't be inflight IO
   spin_lock_irq(&tags->lock) -> lock B (hctx->tags)
   spin_unlock_irq(&tags->lock) -> unlock B, turn on interrupt accidentally
-> 2) context must be preempt by IO interrupt to trigger deadlock.

As shown above, the deadlock is not possible in theory, but the warning
still need to be fixed.

Fix it by using spin_lock_irqsave to get lockB instead of spin_lock_irq.

Fixes: 4f1731d ("blk-mq: fix potential io hang by wrong 'wake_batch'")
	Signed-off-by: Li Lingfeng <[email protected]>
	Reviewed-by: Ming Lei <[email protected]>
	Reviewed-by: Yu Kuai <[email protected]>
	Reviewed-by: Bart Van Assche <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
	Signed-off-by: Jens Axboe <[email protected]>
(cherry picked from commit b313a8c)
	Signed-off-by: Jonathan Maple <[email protected]>

# Conflicts:
#	block/blk-mq-tag.c
PlaidCat added a commit that referenced this pull request Sep 4, 2025
jira LE-4066
Rebuild_History Non-Buildable kernel-4.18.0-553.72.1.el8_10
commit-author Li Lingfeng <[email protected]>
commit b313a8c
Empty-Commit: Cherry-Pick Conflicts during history rebuild.
Will be included in final tarball splat. Ref for failed cherry-pick at:
ciq/ciq_backports/kernel-4.18.0-553.72.1.el8_10/b313a8c8.failed

Lockdep reported a warning in Linux version 6.6:

[  414.344659] ================================
[  414.345155] WARNING: inconsistent lock state
[  414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted
[  414.346221] --------------------------------
[  414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[  414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes:
[  414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[  414.351204] {IN-SOFTIRQ-W} state was registered at:
[  414.351751]   lock_acquire+0x18d/0x460
[  414.352218]   _raw_spin_lock_irqsave+0x39/0x60
[  414.352769]   __wake_up_common_lock+0x22/0x60
[  414.353289]   sbitmap_queue_wake_up+0x375/0x4f0
[  414.353829]   sbitmap_queue_clear+0xdd/0x270
[  414.354338]   blk_mq_put_tag+0xdf/0x170
[  414.354807]   __blk_mq_free_request+0x381/0x4d0
[  414.355335]   blk_mq_free_request+0x28b/0x3e0
[  414.355847]   __blk_mq_end_request+0x242/0xc30
[  414.356367]   scsi_end_request+0x2c1/0x830
[  414.345155] WARNING: inconsistent lock state
[  414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted
[  414.346221] --------------------------------
[  414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[  414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes:
[  414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[  414.351204] {IN-SOFTIRQ-W} state was registered at:
[  414.351751]   lock_acquire+0x18d/0x460
[  414.352218]   _raw_spin_lock_irqsave+0x39/0x60
[  414.352769]   __wake_up_common_lock+0x22/0x60
[  414.353289]   sbitmap_queue_wake_up+0x375/0x4f0
[  414.353829]   sbitmap_queue_clear+0xdd/0x270
[  414.354338]   blk_mq_put_tag+0xdf/0x170
[  414.354807]   __blk_mq_free_request+0x381/0x4d0
[  414.355335]   blk_mq_free_request+0x28b/0x3e0
[  414.355847]   __blk_mq_end_request+0x242/0xc30
[  414.356367]   scsi_end_request+0x2c1/0x830
[  414.356863]   scsi_io_completion+0x177/0x1610
[  414.357379]   scsi_complete+0x12f/0x260
[  414.357856]   blk_complete_reqs+0xba/0xf0
[  414.358338]   __do_softirq+0x1b0/0x7a2
[  414.358796]   irq_exit_rcu+0x14b/0x1a0
[  414.359262]   sysvec_call_function_single+0xaf/0xc0
[  414.359828]   asm_sysvec_call_function_single+0x1a/0x20
[  414.360426]   default_idle+0x1e/0x30
[  414.360873]   default_idle_call+0x9b/0x1f0
[  414.361390]   do_idle+0x2d2/0x3e0
[  414.361819]   cpu_startup_entry+0x55/0x60
[  414.362314]   start_secondary+0x235/0x2b0
[  414.362809]   secondary_startup_64_no_verify+0x18f/0x19b
[  414.363413] irq event stamp: 428794
[  414.363825] hardirqs last  enabled at (428793): [<ffffffff816bfd1c>] ktime_get+0x1dc/0x200
[  414.364694] hardirqs last disabled at (428794): [<ffffffff85470177>] _raw_spin_lock_irq+0x47/0x50
[  414.365629] softirqs last  enabled at (428444): [<ffffffff85474780>] __do_softirq+0x540/0x7a2
[  414.366522] softirqs last disabled at (428419): [<ffffffff813f65ab>] irq_exit_rcu+0x14b/0x1a0
[  414.367425]
               other info that might help us debug this:
[  414.368194]  Possible unsafe locking scenario:
[  414.368900]        CPU0
[  414.369225]        ----
[  414.369548]   lock(&sbq->ws[i].wait);
[  414.370000]   <Interrupt>
[  414.370342]     lock(&sbq->ws[i].wait);
[  414.370802]
                *** DEADLOCK ***
[  414.371569] 5 locks held by kworker/u10:3/1152:
[  414.372088]  #0: ffff88810130e938 ((wq_completion)writeback){+.+.}-{0:0}, at: process_scheduled_works+0x357/0x13f0
[  414.373180]  #1: ffff88810201fdb8 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_scheduled_works+0x3a3/0x13f0
[  414.374384]  #2: ffffffff86ffbdc0 (rcu_read_lock){....}-{1:2}, at: blk_mq_run_hw_queue+0x637/0xa00
[  414.375342]  #3: ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[  414.376377]  #4: ffff888106205a08 (&hctx->dispatch_wait_lock){+.-.}-{2:2}, at: blk_mq_dispatch_rq_list+0x1337/0x1ee0
[  414.378607]
               stack backtrace:
[  414.379177] CPU: 0 PID: 1152 Comm: kworker/u10:3 Not tainted 6.6.0-07439-gba2303cacfda #6
[  414.380032] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[  414.381177] Workqueue: writeback wb_workfn (flush-253:0)
[  414.381805] Call Trace:
[  414.382136]  <TASK>
[  414.382429]  dump_stack_lvl+0x91/0xf0
[  414.382884]  mark_lock_irq+0xb3b/0x1260
[  414.383367]  ? __pfx_mark_lock_irq+0x10/0x10
[  414.383889]  ? stack_trace_save+0x8e/0xc0
[  414.384373]  ? __pfx_stack_trace_save+0x10/0x10
[  414.384903]  ? graph_lock+0xcf/0x410
[  414.385350]  ? save_trace+0x3d/0xc70
[  414.385808]  mark_lock.part.20+0x56d/0xa90
[  414.386317]  mark_held_locks+0xb0/0x110
[  414.386791]  ? __pfx_do_raw_spin_lock+0x10/0x10
[  414.387320]  lockdep_hardirqs_on_prepare+0x297/0x3f0
[  414.387901]  ? _raw_spin_unlock_irq+0x28/0x50
[  414.388422]  trace_hardirqs_on+0x58/0x100
[  414.388917]  _raw_spin_unlock_irq+0x28/0x50
[  414.389422]  __blk_mq_tag_busy+0x1d6/0x2a0
[  414.389920]  __blk_mq_get_driver_tag+0x761/0x9f0
[  414.390899]  blk_mq_dispatch_rq_list+0x1780/0x1ee0
[  414.391473]  ? __pfx_blk_mq_dispatch_rq_list+0x10/0x10
[  414.392070]  ? sbitmap_get+0x2b8/0x450
[  414.392533]  ? __blk_mq_get_driver_tag+0x210/0x9f0
[  414.393095]  __blk_mq_sched_dispatch_requests+0xd99/0x1690
[  414.393730]  ? elv_attempt_insert_merge+0x1b1/0x420
[  414.394302]  ? __pfx___blk_mq_sched_dispatch_requests+0x10/0x10
[  414.394970]  ? lock_acquire+0x18d/0x460
[  414.395456]  ? blk_mq_run_hw_queue+0x637/0xa00
[  414.395986]  ? __pfx_lock_acquire+0x10/0x10
[  414.396499]  blk_mq_sched_dispatch_requests+0x109/0x190
[  414.397100]  blk_mq_run_hw_queue+0x66e/0xa00
[  414.397616]  blk_mq_flush_plug_list.part.17+0x614/0x2030
[  414.398244]  ? __pfx_blk_mq_flush_plug_list.part.17+0x10/0x10
[  414.398897]  ? writeback_sb_inodes+0x241/0xcc0
[  414.399429]  blk_mq_flush_plug_list+0x65/0x80
[  414.399957]  __blk_flush_plug+0x2f1/0x530
[  414.400458]  ? __pfx___blk_flush_plug+0x10/0x10
[  414.400999]  blk_finish_plug+0x59/0xa0
[  414.401467]  wb_writeback+0x7cc/0x920
[  414.401935]  ? __pfx_wb_writeback+0x10/0x10
[  414.402442]  ? mark_held_locks+0xb0/0x110
[  414.402931]  ? __pfx_do_raw_spin_lock+0x10/0x10
[  414.403462]  ? lockdep_hardirqs_on_prepare+0x297/0x3f0
[  414.404062]  wb_workfn+0x2b3/0xcf0
[  414.404500]  ? __pfx_wb_workfn+0x10/0x10
[  414.404989]  process_scheduled_works+0x432/0x13f0
[  414.405546]  ? __pfx_process_scheduled_works+0x10/0x10
[  414.406139]  ? do_raw_spin_lock+0x101/0x2a0
[  414.406641]  ? assign_work+0x19b/0x240
[  414.407106]  ? lock_is_held_type+0x9d/0x110
[  414.407604]  worker_thread+0x6f2/0x1160
[  414.408075]  ? __kthread_parkme+0x62/0x210
[  414.408572]  ? lockdep_hardirqs_on_prepare+0x297/0x3f0
[  414.409168]  ? __kthread_parkme+0x13c/0x210
[  414.409678]  ? __pfx_worker_thread+0x10/0x10
[  414.410191]  kthread+0x33c/0x440
[  414.410602]  ? __pfx_kthread+0x10/0x10
[  414.411068]  ret_from_fork+0x4d/0x80
[  414.411526]  ? __pfx_kthread+0x10/0x10
[  414.411993]  ret_from_fork_asm+0x1b/0x30
[  414.412489]  </TASK>

When interrupt is turned on while a lock holding by spin_lock_irq it
throws a warning because of potential deadlock.

blk_mq_prep_dispatch_rq
 blk_mq_get_driver_tag
  __blk_mq_get_driver_tag
   __blk_mq_alloc_driver_tag
    blk_mq_tag_busy -> tag is already busy
    // failed to get driver tag
 blk_mq_mark_tag_wait
  spin_lock_irq(&wq->lock) -> lock A (&sbq->ws[i].wait)
  __add_wait_queue(wq, wait) -> wait queue active
  blk_mq_get_driver_tag
  __blk_mq_tag_busy
-> 1) tag must be idle, which means there can't be inflight IO
   spin_lock_irq(&tags->lock) -> lock B (hctx->tags)
   spin_unlock_irq(&tags->lock) -> unlock B, turn on interrupt accidentally
-> 2) context must be preempt by IO interrupt to trigger deadlock.

As shown above, the deadlock is not possible in theory, but the warning
still need to be fixed.

Fix it by using spin_lock_irqsave to get lockB instead of spin_lock_irq.

Fixes: 4f1731d ("blk-mq: fix potential io hang by wrong 'wake_batch'")
	Signed-off-by: Li Lingfeng <[email protected]>
	Reviewed-by: Ming Lei <[email protected]>
	Reviewed-by: Yu Kuai <[email protected]>
	Reviewed-by: Bart Van Assche <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
	Signed-off-by: Jens Axboe <[email protected]>
(cherry picked from commit b313a8c)
	Signed-off-by: Jonathan Maple <[email protected]>

# Conflicts:
#	block/blk-mq-tag.c
github-actions bot pushed a commit that referenced this pull request Sep 5, 2025
… context

JIRA: https://issues.redhat.com/browse/RHEL-102692

commit 9ca7a42
Author: Sudeep Holla <[email protected]>
Date: Wed, 28 May 2025 09:49:43 +0100

    The current use of a mutex to protect the notifier hashtable accesses
    can lead to issues in the atomic context. It results in the below
    kernel warnings:

      |  BUG: sleeping function called from invalid context at kernel/locking/mutex.c:258
      |  in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 9, name: kworker/0:0
      |  preempt_count: 1, expected: 0
      |  RCU nest depth: 0, expected: 0
      |  CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Not tainted 6.14.0 #4
      |  Workqueue: ffa_pcpu_irq_notification notif_pcpu_irq_work_fn
      |  Call trace:
      |   show_stack+0x18/0x24 (C)
      |   dump_stack_lvl+0x78/0x90
      |   dump_stack+0x18/0x24
      |   __might_resched+0x114/0x170
      |   __might_sleep+0x48/0x98
      |   mutex_lock+0x24/0x80
      |   handle_notif_callbacks+0x54/0xe0
      |   notif_get_and_handle+0x40/0x88
      |   generic_exec_single+0x80/0xc0
      |   smp_call_function_single+0xfc/0x1a0
      |   notif_pcpu_irq_work_fn+0x2c/0x38
      |   process_one_work+0x14c/0x2b4
      |   worker_thread+0x2e4/0x3e0
      |   kthread+0x13c/0x210
      |   ret_from_fork+0x10/0x20

    To address this, replace the mutex with an rwlock to protect the notifier
    hashtable accesses. This ensures that read-side locking does not sleep and
    multiple readers can acquire the lock concurrently, avoiding unnecessary
    contention and potential deadlocks. Writer access remains exclusive,
    preserving correctness.

    This change resolves warnings from lockdep about potential sleep in
    atomic context.

    Cc: Jens Wiklander <[email protected]>
    Reported-by: Jérôme Forissier <[email protected]>
    Closes: OP-TEE/optee_os#7394
    Fixes: e057344 ("firmware: arm_ffa: Add interfaces to request notification callbacks")
    Message-Id: <[email protected]>
    Reviewed-by: Jens Wiklander <[email protected]>
    Tested-by: Jens Wiklander <[email protected]>
    Signed-off-by: Sudeep Holla <[email protected]>

Signed-off-by: Marcin Juszkiewicz <[email protected]>
Message-ID: <d2632a513f05ec79f4fffa92d618792eb3cea12f.1752653152.git.mjuszkiewicz@redhat.com>
Message-ID: <b4e3e2e1248ca9909bdea280d9009bdb659d0c81.1752659027.git.mjuszkiewicz@redhat.com>
github-actions bot pushed a commit that referenced this pull request Sep 7, 2025
… context

JIRA: https://issues.redhat.com/browse/RHEL-102691

commit 9ca7a42
Author: Sudeep Holla <[email protected]>
Date: Wed, 28 May 2025 09:49:43 +0100

    The current use of a mutex to protect the notifier hashtable accesses
    can lead to issues in the atomic context. It results in the below
    kernel warnings:

      |  BUG: sleeping function called from invalid context at kernel/locking/mutex.c:258
      |  in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 9, name: kworker/0:0
      |  preempt_count: 1, expected: 0
      |  RCU nest depth: 0, expected: 0
      |  CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Not tainted 6.14.0 #4
      |  Workqueue: ffa_pcpu_irq_notification notif_pcpu_irq_work_fn
      |  Call trace:
      |   show_stack+0x18/0x24 (C)
      |   dump_stack_lvl+0x78/0x90
      |   dump_stack+0x18/0x24
      |   __might_resched+0x114/0x170
      |   __might_sleep+0x48/0x98
      |   mutex_lock+0x24/0x80
      |   handle_notif_callbacks+0x54/0xe0
      |   notif_get_and_handle+0x40/0x88
      |   generic_exec_single+0x80/0xc0
      |   smp_call_function_single+0xfc/0x1a0
      |   notif_pcpu_irq_work_fn+0x2c/0x38
      |   process_one_work+0x14c/0x2b4
      |   worker_thread+0x2e4/0x3e0
      |   kthread+0x13c/0x210
      |   ret_from_fork+0x10/0x20

    To address this, replace the mutex with an rwlock to protect the notifier
    hashtable accesses. This ensures that read-side locking does not sleep and
    multiple readers can acquire the lock concurrently, avoiding unnecessary
    contention and potential deadlocks. Writer access remains exclusive,
    preserving correctness.

    This change resolves warnings from lockdep about potential sleep in
    atomic context.

    Cc: Jens Wiklander <[email protected]>
    Reported-by: Jérôme Forissier <[email protected]>
    Closes: OP-TEE/optee_os#7394
    Fixes: e057344 ("firmware: arm_ffa: Add interfaces to request notification callbacks")
    Message-Id: <[email protected]>
    Reviewed-by: Jens Wiklander <[email protected]>
    Tested-by: Jens Wiklander <[email protected]>
    Signed-off-by: Sudeep Holla <[email protected]>

Signed-off-by: Marcin Juszkiewicz <[email protected]>
PlaidCat added a commit that referenced this pull request Sep 10, 2025
jira VULN-33329
jira VULN-33328
cve CVE-2022-48919
commit-author Ronnie Sahlberg <[email protected]>
commit 3d6cc98

When cifs_get_root() fails during cifs_smb3_do_mount() we call
deactivate_locked_super() which eventually will call delayed_free() which
will free the context.
In this situation we should not proceed to enter the out: section in
cifs_smb3_do_mount() and free the same resources a second time.

[Thu Feb 10 12:59:06 2022] BUG: KASAN: use-after-free in rcu_cblist_dequeue+0x32/0x60
[Thu Feb 10 12:59:06 2022] Read of size 8 at addr ffff888364f4d110 by task swapper/1/0

[Thu Feb 10 12:59:06 2022] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           OE     5.17.0-rc3+ #4
[Thu Feb 10 12:59:06 2022] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
[Thu Feb 10 12:59:06 2022] Call Trace:
[Thu Feb 10 12:59:06 2022]  <IRQ>
[Thu Feb 10 12:59:06 2022]  dump_stack_lvl+0x5d/0x78
[Thu Feb 10 12:59:06 2022]  print_address_description.constprop.0+0x24/0x150
[Thu Feb 10 12:59:06 2022]  ? rcu_cblist_dequeue+0x32/0x60
[Thu Feb 10 12:59:06 2022]  kasan_report.cold+0x7d/0x117
[Thu Feb 10 12:59:06 2022]  ? rcu_cblist_dequeue+0x32/0x60
[Thu Feb 10 12:59:06 2022]  __asan_load8+0x86/0xa0
[Thu Feb 10 12:59:06 2022]  rcu_cblist_dequeue+0x32/0x60
[Thu Feb 10 12:59:06 2022]  rcu_core+0x547/0xca0
[Thu Feb 10 12:59:06 2022]  ? call_rcu+0x3c0/0x3c0
[Thu Feb 10 12:59:06 2022]  ? __this_cpu_preempt_check+0x13/0x20
[Thu Feb 10 12:59:06 2022]  ? lock_is_held_type+0xea/0x140
[Thu Feb 10 12:59:06 2022]  rcu_core_si+0xe/0x10
[Thu Feb 10 12:59:06 2022]  __do_softirq+0x1d4/0x67b
[Thu Feb 10 12:59:06 2022]  __irq_exit_rcu+0x100/0x150
[Thu Feb 10 12:59:06 2022]  irq_exit_rcu+0xe/0x30
[Thu Feb 10 12:59:06 2022]  sysvec_hyperv_stimer0+0x9d/0xc0
...
[Thu Feb 10 12:59:07 2022] Freed by task 58179:
[Thu Feb 10 12:59:07 2022]  kasan_save_stack+0x26/0x50
[Thu Feb 10 12:59:07 2022]  kasan_set_track+0x25/0x30
[Thu Feb 10 12:59:07 2022]  kasan_set_free_info+0x24/0x40
[Thu Feb 10 12:59:07 2022]  ____kasan_slab_free+0x137/0x170
[Thu Feb 10 12:59:07 2022]  __kasan_slab_free+0x12/0x20
[Thu Feb 10 12:59:07 2022]  slab_free_freelist_hook+0xb3/0x1d0
[Thu Feb 10 12:59:07 2022]  kfree+0xcd/0x520
[Thu Feb 10 12:59:07 2022]  cifs_smb3_do_mount+0x149/0xbe0 [cifs]
[Thu Feb 10 12:59:07 2022]  smb3_get_tree+0x1a0/0x2e0 [cifs]
[Thu Feb 10 12:59:07 2022]  vfs_get_tree+0x52/0x140
[Thu Feb 10 12:59:07 2022]  path_mount+0x635/0x10c0
[Thu Feb 10 12:59:07 2022]  __x64_sys_mount+0x1bf/0x210
[Thu Feb 10 12:59:07 2022]  do_syscall_64+0x5c/0xc0
[Thu Feb 10 12:59:07 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae

[Thu Feb 10 12:59:07 2022] Last potentially related work creation:
[Thu Feb 10 12:59:07 2022]  kasan_save_stack+0x26/0x50
[Thu Feb 10 12:59:07 2022]  __kasan_record_aux_stack+0xb6/0xc0
[Thu Feb 10 12:59:07 2022]  kasan_record_aux_stack_noalloc+0xb/0x10
[Thu Feb 10 12:59:07 2022]  call_rcu+0x76/0x3c0
[Thu Feb 10 12:59:07 2022]  cifs_umount+0xce/0xe0 [cifs]
[Thu Feb 10 12:59:07 2022]  cifs_kill_sb+0xc8/0xe0 [cifs]
[Thu Feb 10 12:59:07 2022]  deactivate_locked_super+0x5d/0xd0
[Thu Feb 10 12:59:07 2022]  cifs_smb3_do_mount+0xab9/0xbe0 [cifs]
[Thu Feb 10 12:59:07 2022]  smb3_get_tree+0x1a0/0x2e0 [cifs]
[Thu Feb 10 12:59:07 2022]  vfs_get_tree+0x52/0x140
[Thu Feb 10 12:59:07 2022]  path_mount+0x635/0x10c0
[Thu Feb 10 12:59:07 2022]  __x64_sys_mount+0x1bf/0x210
[Thu Feb 10 12:59:07 2022]  do_syscall_64+0x5c/0xc0
[Thu Feb 10 12:59:07 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae

	Reported-by: Shyam Prasad N <[email protected]>
	Reviewed-by: Shyam Prasad N <[email protected]>
	Signed-off-by: Ronnie Sahlberg <[email protected]>
	Signed-off-by: Steve French <[email protected]>
(cherry picked from commit 3d6cc98)
	Signed-off-by: Jonathan Maple <[email protected]>
github-actions bot pushed a commit that referenced this pull request Sep 16, 2025
JIRA: https://issues.redhat.com/browse/RHEL-106845

commit a9c83a0
Author: Jens Axboe <[email protected]>
Date:   Mon Dec 30 14:15:17 2024 -0700

    io_uring/timeout: flush timeouts outside of the timeout lock

    syzbot reports that a recent fix causes nesting issues between the (now)
    raw timeoutlock and the eventfd locking:

    =============================
    [ BUG: Invalid wait context ]
    6.13.0-rc4-00080-g9828a4c0901f #29 Not tainted
    -----------------------------
    kworker/u32:0/68094 is trying to lock:
    ffff000014d7a520 (&ctx->wqh#2){..-.}-{3:3}, at: eventfd_signal_mask+0x64/0x180
    other info that might help us debug this:
    context-{5:5}
    6 locks held by kworker/u32:0/68094:
     #0: ffff0000c1d98148 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work+0x4e8/0xfc0
     #1: ffff80008d927c78 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x53c/0xfc0
     #2: ffff0000c59bc3d8 (&ctx->completion_lock){+.+.}-{3:3}, at: io_kill_timeouts+0x40/0x180
     #3: ffff0000c59bc358 (&ctx->timeout_lock){-.-.}-{2:2}, at: io_kill_timeouts+0x48/0x180
     #4: ffff800085127aa0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x8/0x38
     #5: ffff800085127aa0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x8/0x38
    stack backtrace:
    CPU: 7 UID: 0 PID: 68094 Comm: kworker/u32:0 Not tainted 6.13.0-rc4-00080-g9828a4c0901f #29
    Hardware name: linux,dummy-virt (DT)
    Workqueue: iou_exit io_ring_exit_work
    Call trace:
     show_stack+0x1c/0x30 (C)
     __dump_stack+0x24/0x30
     dump_stack_lvl+0x60/0x80
     dump_stack+0x14/0x20
     __lock_acquire+0x19f8/0x60c8
     lock_acquire+0x1a4/0x540
     _raw_spin_lock_irqsave+0x90/0xd0
     eventfd_signal_mask+0x64/0x180
     io_eventfd_signal+0x64/0x108
     io_req_local_work_add+0x294/0x430
     __io_req_task_work_add+0x1c0/0x270
     io_kill_timeout+0x1f0/0x288
     io_kill_timeouts+0xd4/0x180
     io_uring_try_cancel_requests+0x2e8/0x388
     io_ring_exit_work+0x150/0x550
     process_one_work+0x5e8/0xfc0
     worker_thread+0x7ec/0xc80
     kthread+0x24c/0x300
     ret_from_fork+0x10/0x20

    because after the preempt-rt fix for the timeout lock nesting inside
    the io-wq lock, we now have the eventfd spinlock nesting inside the
    raw timeout spinlock.

    Rather than play whack-a-mole with other nesting on the timeout lock,
    split the deletion and killing of timeouts so queueing the task_work
    for the timeout cancelations can get done outside of the timeout lock.

    Reported-by: [email protected]
    Fixes: 020b40f ("io_uring: make ctx->timeout_lock a raw spinlock")
    Signed-off-by: Jens Axboe <[email protected]>

Signed-off-by: Ming Lei <[email protected]>
github-actions bot pushed a commit that referenced this pull request Sep 16, 2025
…xit()

JIRA: https://issues.redhat.com/browse/RHEL-106845

commit 78c2713
Author: Ming Lei <[email protected]>
Date:   Mon May 5 22:18:03 2025 +0800

    block: move wbt_enable_default() out of queue freezing from sched ->exit()

    scheduler's ->exit() is called with queue frozen and elevator lock is held, and
    wbt_enable_default() can't be called with queue frozen, otherwise the
    following lockdep warning is triggered:

            #6 (&q->rq_qos_mutex){+.+.}-{4:4}:
            #5 (&eq->sysfs_lock){+.+.}-{4:4}:
            #4 (&q->elevator_lock){+.+.}-{4:4}:
            #3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
            #2 (fs_reclaim){+.+.}-{0:0}:
            #1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
            #0 (&q->debugfs_mutex){+.+.}-{4:4}:

    Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
    call it from elevator_change_done().

    Meantime add disk->rqos_state_mutex for covering wbt state change, which
    matches the purpose more than ->elevator_lock.

    Reviewed-by: Hannes Reinecke <[email protected]>
    Reviewed-by: Nilay Shroff <[email protected]>
    Signed-off-by: Ming Lei <[email protected]>
    Reviewed-by: Christoph Hellwig <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Jens Axboe <[email protected]>

Signed-off-by: Ming Lei <[email protected]>
PlaidCat added a commit that referenced this pull request Sep 17, 2025
jira LE-4159
Rebuild_History Non-Buildable kernel-5.14.0-570.41.1.el9_6
commit-author Dave Marquardt <[email protected]>
commit 053f3ff

v2:
- Created a single error handling unlock and exit in veth_pool_store
- Greatly expanded commit message with previous explanatory-only text

Summary: Use rtnl_mutex to synchronize veth_pool_store with itself,
ibmveth_close and ibmveth_open, preventing multiple calls in a row to
napi_disable.

Background: Two (or more) threads could call veth_pool_store through
writing to /sys/devices/vio/30000002/pool*/*. You can do this easily
with a little shell script. This causes a hang.

I configured LOCKDEP, compiled ibmveth.c with DEBUG, and built a new
kernel. I ran this test again and saw:

    Setting pool0/active to 0
    Setting pool1/active to 1
    [   73.911067][ T4365] ibmveth 30000002 eth0: close starting
    Setting pool1/active to 1
    Setting pool1/active to 0
    [   73.911367][ T4366] ibmveth 30000002 eth0: close starting
    [   73.916056][ T4365] ibmveth 30000002 eth0: close complete
    [   73.916064][ T4365] ibmveth 30000002 eth0: open starting
    [  110.808564][  T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
    [  230.808495][  T712] systemd-journald[712]: Sent WATCHDOG=1 notification.
    [  243.683786][  T123] INFO: task stress.sh:4365 blocked for more than 122 seconds.
    [  243.683827][  T123]       Not tainted 6.14.0-01103-g2df0c02dab82-dirty #8
    [  243.683833][  T123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [  243.683838][  T123] task:stress.sh       state:D stack:28096 pid:4365  tgid:4365  ppid:4364   task_flags:0x400040 flags:0x00042000
    [  243.683852][  T123] Call Trace:
    [  243.683857][  T123] [c00000000c38f690] [0000000000000001] 0x1 (unreliable)
    [  243.683868][  T123] [c00000000c38f840] [c00000000001f908] __switch_to+0x318/0x4e0
    [  243.683878][  T123] [c00000000c38f8a0] [c000000001549a70] __schedule+0x500/0x12a0
    [  243.683888][  T123] [c00000000c38f9a0] [c00000000154a878] schedule+0x68/0x210
    [  243.683896][  T123] [c00000000c38f9d0] [c00000000154ac80] schedule_preempt_disabled+0x30/0x50
    [  243.683904][  T123] [c00000000c38fa00] [c00000000154dbb0] __mutex_lock+0x730/0x10f0
    [  243.683913][  T123] [c00000000c38fb10] [c000000001154d40] napi_enable+0x30/0x60
    [  243.683921][  T123] [c00000000c38fb40] [c000000000f4ae94] ibmveth_open+0x68/0x5dc
    [  243.683928][  T123] [c00000000c38fbe0] [c000000000f4aa20] veth_pool_store+0x220/0x270
    [  243.683936][  T123] [c00000000c38fc70] [c000000000826278] sysfs_kf_write+0x68/0xb0
    [  243.683944][  T123] [c00000000c38fcb0] [c0000000008240b8] kernfs_fop_write_iter+0x198/0x2d0
    [  243.683951][  T123] [c00000000c38fd00] [c00000000071b9ac] vfs_write+0x34c/0x650
    [  243.683958][  T123] [c00000000c38fdc0] [c00000000071bea8] ksys_write+0x88/0x150
    [  243.683966][  T123] [c00000000c38fe10] [c0000000000317f4] system_call_exception+0x124/0x340
    [  243.683973][  T123] [c00000000c38fe50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
    ...
    [  243.684087][  T123] Showing all locks held in the system:
    [  243.684095][  T123] 1 lock held by khungtaskd/123:
    [  243.684099][  T123]  #0: c00000000278e370 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x50/0x248
    [  243.684114][  T123] 4 locks held by stress.sh/4365:
    [  243.684119][  T123]  #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
    [  243.684132][  T123]  #1: c000000041aea888 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
    [  243.684143][  T123]  #2: c0000000366fb9a8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
    [  243.684155][  T123]  #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_enable+0x30/0x60
    [  243.684166][  T123] 5 locks held by stress.sh/4366:
    [  243.684170][  T123]  #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150
    [  243.684183][  T123]  #1: c00000000aee2288 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0
    [  243.684194][  T123]  #2: c0000000366f4ba8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0
    [  243.684205][  T123]  #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_disable+0x30/0x60
    [  243.684216][  T123]  #4: c0000003ff9bbf18 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x138/0x12a0

From the ibmveth debug, two threads are calling veth_pool_store, which
calls ibmveth_close and ibmveth_open. Here's the sequence:

  T4365             T4366
  ----------------- ----------------- ---------
  veth_pool_store   veth_pool_store
                    ibmveth_close
  ibmveth_close
  napi_disable
                    napi_disable
  ibmveth_open
  napi_enable                         <- HANG

ibmveth_close calls napi_disable at the top and ibmveth_open calls
napi_enable at the top.

https://docs.kernel.org/networking/napi.html]] says

  The control APIs are not idempotent. Control API calls are safe
  against concurrent use of datapath APIs but an incorrect sequence of
  control API calls may result in crashes, deadlocks, or race
  conditions. For example, calling napi_disable() multiple times in a
  row will deadlock.

In the normal open and close paths, rtnl_mutex is acquired to prevent
other callers. This is missing from veth_pool_store. Use rtnl_mutex in
veth_pool_store fixes these hangs.

	Signed-off-by: Dave Marquardt <[email protected]>
Fixes: 860f242 ("[PATCH] ibmveth change buffer pools dynamically")
	Reviewed-by: Nick Child <[email protected]>
	Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
	Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit 053f3ff)
	Signed-off-by: Jonathan Maple <[email protected]>
github-actions bot pushed a commit that referenced this pull request Oct 1, 2025
…mplaint

JIRA: https://issues.redhat.com/browse/RHEL-36251

commit bbaa6ff
Author: Kai-Heng Feng <[email protected]>
Date:   Wed Sep 13 11:32:33 2023 +0800

AMD PMF driver can cause the following warning:
[  196.159546] ------------[ cut here ]------------
[  196.159556] Voluntary context switch within RCU read-side critical section!
[  196.159571] WARNING: CPU: 0 PID: 9 at kernel/rcu/tree_plugin.h:320 rcu_note_context_switch+0x43d/0x560
[  196.159604] Modules linked in: nvme_fabrics ccm rfcomm snd_hda_scodec_cs35l41_spi cmac algif_hash algif_skcipher af_alg bnep joydev btusb btrtl uvcvideo btintel btbcm videobuf2_vmalloc intel_rapl_msr btmtk videobuf2_memops uvc videobuf2_v4l2 intel_rapl_common binfmt_misc hid_sensor_als snd_sof_amd_vangogh hid_sensor_trigger bluetooth industrialio_triggered_buffer videodev snd_sof_amd_rembrandt hid_sensor_iio_common amdgpu ecdh_generic kfifo_buf videobuf2_common hp_wmi kvm_amd sparse_keymap snd_sof_amd_renoir wmi_bmof industrialio ecc mc nls_iso8859_1 kvm snd_sof_amd_acp irqbypass snd_sof_xtensa_dsp crct10dif_pclmul crc32_pclmul mt7921e snd_sof_pci snd_ctl_led polyval_clmulni mt7921_common polyval_generic snd_sof ghash_clmulni_intel mt792x_lib mt76_connac_lib sha512_ssse3 snd_sof_utils aesni_intel snd_hda_codec_realtek crypto_simd mt76 snd_hda_codec_generic cryptd snd_soc_core snd_hda_codec_hdmi rapl ledtrig_audio input_leds snd_compress i2c_algo_bit drm_ttm_helper mac80211 snd_pci_ps hid_multitouch ttm drm_exec
[  196.159970]  drm_suballoc_helper snd_rpl_pci_acp6x amdxcp drm_buddy snd_hda_intel snd_acp_pci snd_hda_scodec_cs35l41_i2c serio_raw gpu_sched snd_hda_scodec_cs35l41 snd_acp_legacy_common snd_intel_dspcfg snd_hda_cs_dsp_ctls snd_hda_codec libarc4 drm_display_helper snd_pci_acp6x cs_dsp snd_hwdep snd_soc_cs35l41_lib video k10temp snd_pci_acp5x thunderbolt snd_hda_core drm_kms_helper cfg80211 snd_seq snd_rn_pci_acp3x snd_pcm snd_acp_config cec snd_soc_acpi snd_seq_device rc_core ccp snd_pci_acp3x snd_timer snd soundcore wmi amd_pmf platform_profile amd_pmc mac_hid serial_multi_instantiate wireless_hotkey hid_sensor_hub sch_fq_codel msr parport_pc ppdev lp parport efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor raid6_pq raid1 raid0 multipath linear dm_mirror dm_region_hash dm_log cdc_ether usbnet r8152 mii hid_generic nvme i2c_hid_acpi i2c_hid nvme_core i2c_piix4 xhci_pci amd_sfh drm xhci_pci_renesas nvme_common hid
[  196.160382] CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.6.0-rc1 #4
[  196.160397] Hardware name: HP HP EliteBook 845 14 inch G10 Notebook PC/8B6E, BIOS V82 Ver. 01.02.00 08/24/2023
[  196.160405] Workqueue: events power_supply_changed_work
[  196.160426] RIP: 0010:rcu_note_context_switch+0x43d/0x560
[  196.160440] Code: 00 48 89 be 40 08 00 00 48 89 86 48 08 00 00 48 89 10 e9 63 fe ff ff 48 c7 c7 10 e7 b0 9e c6 05 e8 d8 20 02 01 e8 13 0f f3 ff <0f> 0b e9 27 fc ff ff a9 ff ff ff 7f 0f 84 cf fc ff ff 65 48 8b 3c
[  196.160450] RSP: 0018:ffffc900001878f0 EFLAGS: 00010046
[  196.160462] RAX: 0000000000000000 RBX: ffff88885e834040 RCX: 0000000000000000
[  196.160470] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[  196.160476] RBP: ffffc90000187910 R08: 0000000000000000 R09: 0000000000000000
[  196.160482] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  196.160488] R13: 0000000000000000 R14: ffff888100990000 R15: ffff888100990000
[  196.160495] FS:  0000000000000000(0000) GS:ffff88885e800000(0000) knlGS:0000000000000000
[  196.160504] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  196.160512] CR2: 000055cb053c8246 CR3: 000000013443a000 CR4: 0000000000750ef0
[  196.160520] PKRU: 55555554
[  196.160526] Call Trace:
[  196.160532]  <TASK>
[  196.160548]  ? show_regs+0x72/0x90
[  196.160570]  ? rcu_note_context_switch+0x43d/0x560
[  196.160580]  ? __warn+0x8d/0x160
[  196.160600]  ? rcu_note_context_switch+0x43d/0x560
[  196.160613]  ? report_bug+0x1bb/0x1d0
[  196.160637]  ? handle_bug+0x46/0x90
[  196.160658]  ? exc_invalid_op+0x19/0x80
[  196.160675]  ? asm_exc_invalid_op+0x1b/0x20
[  196.160709]  ? rcu_note_context_switch+0x43d/0x560
[  196.160727]  __schedule+0xb9/0x15f0
[  196.160746]  ? srso_alias_return_thunk+0x5/0x7f
[  196.160765]  ? srso_alias_return_thunk+0x5/0x7f
[  196.160778]  ? acpi_ns_search_one_scope+0xbe/0x270
[  196.160806]  schedule+0x68/0x110
[  196.160820]  schedule_timeout+0x151/0x160
[  196.160829]  ? srso_alias_return_thunk+0x5/0x7f
[  196.160842]  ? srso_alias_return_thunk+0x5/0x7f
[  196.160855]  ? acpi_ns_lookup+0x3c5/0xa90
[  196.160878]  __down_common+0xff/0x220
[  196.160905]  __down_timeout+0x16/0x30
[  196.160920]  down_timeout+0x64/0x70
[  196.160938]  acpi_os_wait_semaphore+0x85/0x200
[  196.160959]  acpi_ut_acquire_mutex+0x9e/0x280
[  196.160979]  acpi_ex_enter_interpreter+0x2d/0xb0
[  196.160992]  acpi_ns_evaluate+0x2f0/0x5f0
[  196.161005]  acpi_evaluate_object+0x172/0x490
[  196.161018]  ? acpi_os_signal_semaphore+0x8a/0xd0
[  196.161038]  acpi_evaluate_integer+0x52/0xe0
[  196.161055]  ? kfree+0x79/0x120
[  196.161071]  ? srso_alias_return_thunk+0x5/0x7f
[  196.161089]  acpi_ac_get_state.part.0+0x27/0x80
[  196.161110]  get_ac_property+0x5c/0x70
[  196.161127]  ? __pfx___power_supply_is_system_supplied+0x10/0x10
[  196.161146]  __power_supply_is_system_supplied+0x44/0xb0
[  196.161166]  class_for_each_device+0x124/0x160
[  196.161184]  ? acpi_ac_get_state.part.0+0x27/0x80
[  196.161203]  ? srso_alias_return_thunk+0x5/0x7f
[  196.161223]  power_supply_is_system_supplied+0x3c/0x70
[  196.161243]  amd_pmf_get_power_source+0xe/0x20 [amd_pmf]
[  196.161276]  amd_pmf_power_slider_update_event+0x49/0x90 [amd_pmf]
[  196.161310]  amd_pmf_pwr_src_notify_call+0xe7/0x100 [amd_pmf]
[  196.161340]  notifier_call_chain+0x5f/0xe0
[  196.161362]  atomic_notifier_call_chain+0x33/0x60
[  196.161378]  power_supply_changed_work+0x84/0x110
[  196.161394]  process_one_work+0x178/0x360
[  196.161412]  ? __pfx_worker_thread+0x10/0x10
[  196.161424]  worker_thread+0x307/0x430
[  196.161440]  ? __pfx_worker_thread+0x10/0x10
[  196.161451]  kthread+0xf4/0x130
[  196.161467]  ? __pfx_kthread+0x10/0x10
[  196.161486]  ret_from_fork+0x43/0x70
[  196.161502]  ? __pfx_kthread+0x10/0x10
[  196.161518]  ret_from_fork_asm+0x1b/0x30
[  196.161558]  </TASK>
[  196.161562] ---[ end trace 0000000000000000 ]---

Since there's no guarantee that all the callbacks can work in atomic
context, switch to use blocking_notifier_call_chain to relax the
constraint.

Signed-off-by: Kai-Heng Feng <[email protected]>
Reported-by: Allen Zhong <[email protected]>
Fixes: 4c71ae4 ("platform/x86/amd/pmf: Add support SPS PMF feature")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217571
Reviewed-by: Mario Limonciello <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sebastian Reichel <[email protected]>
Signed-off-by: Mark Langsdorf <[email protected]>
github-actions bot pushed a commit that referenced this pull request Oct 7, 2025
Before disabling SR-IOV via config space accesses to the parent PF,
sriov_disable() first removes the PCI devices representing the VFs.

Since commit 9d16947 ("PCI: Add global pci_lock_rescan_remove()")
such removal operations are serialized against concurrent remove and
rescan using the pci_rescan_remove_lock. No such locking was ever added
in sriov_disable() however. In particular when commit 18f9e9d
("PCI/IOV: Factor out sriov_add_vfs()") factored out the PCI device
removal into sriov_del_vfs() there was still no locking around the
pci_iov_remove_virtfn() calls.

On s390 the lack of serialization in sriov_disable() may cause double
remove and list corruption with the below (amended) trace being observed:

  PSW:  0704c00180000000 0000000c914e4b38 (klist_put+56)
  GPRS: 000003800313fb48 0000000000000000 0000000100000001 0000000000000001
	00000000f9b520a8 0000000000000000 0000000000002fbd 00000000f4cc9480
	0000000000000001 0000000000000000 0000000000000000 0000000180692828
	00000000818e8000 000003800313fe2c 000003800313fb20 000003800313fad8
  #0 [3800313fb20] device_del at c9158ad5c
  #1 [3800313fb88] pci_remove_bus_device at c915105ba
  #2 [3800313fbd0] pci_iov_remove_virtfn at c9152f198
  #3 [3800313fc28] zpci_iov_remove_virtfn at c90fb67c0
  #4 [3800313fc60] zpci_bus_remove_device at c90fb6104
  #5 [3800313fca0] __zpci_event_availability at c90fb3dca
  #6 [3800313fd08] chsc_process_sei_nt0 at c918fe4a2
  #7 [3800313fd60] crw_collect_info at c91905822
  #8 [3800313fe10] kthread at c90feb390
  #9 [3800313fe68] __ret_from_fork at c90f6aa64
  #10 [3800313fe98] ret_from_fork at c9194f3f2.

This is because in addition to sriov_disable() removing the VFs, the
platform also generates hot-unplug events for the VFs. This being the
reverse operation to the hotplug events generated by sriov_enable() and
handled via pdev->no_vf_scan. And while the event processing takes
pci_rescan_remove_lock and checks whether the struct pci_dev still exists,
the lack of synchronization makes this checking racy.

Other races may also be possible of course though given that this lack of
locking persisted so long observable races seem very rare. Even on s390 the
list corruption was only observed with certain devices since the platform
events are only triggered by config accesses after the removal, so as long
as the removal finished synchronously they would not race. Either way the
locking is missing so fix this by adding it to the sriov_del_vfs() helper.

Just like PCI rescan-remove, locking is also missing in sriov_add_vfs()
including for the error case where pci_stop_and_remove_bus_device() is
called without the PCI rescan-remove lock being held. Even in the non-error
case, adding new PCI devices and buses should be serialized via the PCI
rescan-remove lock. Add the necessary locking.

Fixes: 18f9e9d ("PCI/IOV: Factor out sriov_add_vfs()")
Signed-off-by: Niklas Schnelle <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Benjamin Block <[email protected]>
Reviewed-by: Farhan Ali <[email protected]>
Reviewed-by: Julian Ruess <[email protected]>
Cc: [email protected]
Link: https://patch.msgid.link/[email protected]
github-actions bot pushed a commit that referenced this pull request Oct 9, 2025
…uctions

JIRA: https://issues.redhat.com/browse/RHEL-78202

commit ff3afe5
Author: Peilin Ye <[email protected]>
Date:   Tue Mar 4 01:06:46 2025 +0000

    selftests/bpf: Add selftests for load-acquire and store-release instructions
    
    Add several ./test_progs tests:
    
      - arena_atomics/load_acquire
      - arena_atomics/store_release
      - verifier_load_acquire/*
      - verifier_store_release/*
      - verifier_precision/bpf_load_acquire
      - verifier_precision/bpf_store_release
    
    The last two tests are added to check if backtrack_insn() handles the
    new instructions correctly.
    
    Additionally, the last test also makes sure that the verifier
    "remembers" the value (in src_reg) we store-release into e.g. a stack
    slot.  For example, if we take a look at the test program:
    
        #0:  r1 = 8;
          /* store_release((u64 *)(r10 - 8), r1); */
        #1:  .8byte %[store_release];
        #2:  r1 = *(u64 *)(r10 - 8);
        #3:  r2 = r10;
        #4:  r2 += r1;
        #5:  r0 = 0;
        #6:  exit;
    
    At #1, if the verifier doesn't remember that we wrote 8 to the stack,
    then later at #4 we would be adding an unbounded scalar value to the
    stack pointer, which would cause the program to be rejected:
    
      VERIFIER LOG:
      =============
    ...
      math between fp pointer and register with unbounded min value is not allowed
    
    For easier CI integration, instead of using built-ins like
    __atomic_{load,store}_n() which depend on the new
    __BPF_FEATURE_LOAD_ACQ_STORE_REL pre-defined macro, manually craft
    load-acquire/store-release instructions using __imm_insn(), as suggested
    by Eduard.
    
    All new tests depend on:
    
      (1) Clang major version >= 18, and
      (2) ENABLE_ATOMICS_TESTS is defined (currently implies -mcpu=v3 or
          v4), and
      (3) JIT supports load-acquire/store-release (currently arm64 and
          x86-64)
    
    In .../progs/arena_atomics.c:
    
      /* 8-byte-aligned */
      __u8 __arena_global load_acquire8_value = 0x12;
      /* 1-byte hole */
      __u16 __arena_global load_acquire16_value = 0x1234;
    
    That 1-byte hole in the .addr_space.1 ELF section caused clang-17 to
    crash:
    
      fatal error: error in backend: unable to write nop sequence of 1 bytes
    
    To work around such llvm-17 CI job failures, conditionally define
    __arena_global variables as 64-bit if __clang_major__ < 18, to make sure
    .addr_space.1 has no holes.  Ideally we should avoid compiling this file
    using clang-17 at all (arena tests depend on
    __BPF_FEATURE_ADDR_SPACE_CAST, and are skipped for llvm-17 anyway), but
    that is a separate topic.
    
    Acked-by: Eduard Zingerman <[email protected]>
    Signed-off-by: Peilin Ye <[email protected]>
    Link: https://lore.kernel.org/r/1b46c6feaf0f1b6984d9ec80e500cc7383e9da1a.1741049567.git.yepeilin@google.com
    Signed-off-by: Alexei Starovoitov <[email protected]>

Signed-off-by: Gregory Bell <[email protected]>
github-actions bot pushed a commit that referenced this pull request Oct 9, 2025
The test starts a workload and then opens events. If the events fail
to open, for example because of perf_event_paranoid, the gopipe of the
workload is leaked and the file descriptor leak check fails when the
test exits. To avoid this cancel the workload when opening the events
fails.

Before:
```
$ perf test -vv 7
  7: PERF_RECORD_* events & perf_sample fields:
 --- start ---
test child forked, pid 1189568
Using CPUID GenuineIntel-6-B7-1
 ------------------------------------------------------------
perf_event_attr:
  type                    	   0 (PERF_TYPE_HARDWARE)
  config                  	   0xa00000000 (cpu_atom/PERF_COUNT_HW_CPU_CYCLES/)
  disabled                	   1
 ------------------------------------------------------------
sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8
sys_perf_event_open failed, error -13
 ------------------------------------------------------------
perf_event_attr:
  type                             0 (PERF_TYPE_HARDWARE)
  config                           0xa00000000 (cpu_atom/PERF_COUNT_HW_CPU_CYCLES/)
  disabled                         1
  exclude_kernel                   1
 ------------------------------------------------------------
sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8 = 3
 ------------------------------------------------------------
perf_event_attr:
  type                             0 (PERF_TYPE_HARDWARE)
  config                           0x400000000 (cpu_core/PERF_COUNT_HW_CPU_CYCLES/)
  disabled                         1
 ------------------------------------------------------------
sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8
sys_perf_event_open failed, error -13
 ------------------------------------------------------------
perf_event_attr:
  type                             0 (PERF_TYPE_HARDWARE)
  config                           0x400000000 (cpu_core/PERF_COUNT_HW_CPU_CYCLES/)
  disabled                         1
  exclude_kernel                   1
 ------------------------------------------------------------
sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8 = 3
Attempt to add: software/cpu-clock/
..after resolving event: software/config=0/
cpu-clock -> software/cpu-clock/
 ------------------------------------------------------------
perf_event_attr:
  type                             1 (PERF_TYPE_SOFTWARE)
  size                             136
  config                           0x9 (PERF_COUNT_SW_DUMMY)
  sample_type                      IP|TID|TIME|CPU
  read_format                      ID|LOST
  disabled                         1
  inherit                          1
  mmap                             1
  comm                             1
  enable_on_exec                   1
  task                             1
  sample_id_all                    1
  mmap2                            1
  comm_exec                        1
  ksymbol                          1
  bpf_event                        1
  { wakeup_events, wakeup_watermark } 1
 ------------------------------------------------------------
sys_perf_event_open: pid 1189569  cpu 0  group_fd -1  flags 0x8
sys_perf_event_open failed, error -13
perf_evlist__open: Permission denied
 ---- end(-2) ----
Leak of file descriptor 6 that opened: 'pipe:[14200347]'
 ---- unexpected signal (6) ----
iFailed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
    #0 0x565358f6666e in child_test_sig_handler builtin-test.c:311
    #1 0x7f29ce849df0 in __restore_rt libc_sigaction.c:0
    #2 0x7f29ce89e95c in __pthread_kill_implementation pthread_kill.c:44
    #3 0x7f29ce849cc2 in raise raise.c:27
    #4 0x7f29ce8324ac in abort abort.c:81
    #5 0x565358f662d4 in check_leaks builtin-test.c:226
    #6 0x565358f6682e in run_test_child builtin-test.c:344
    #7 0x565358ef7121 in start_command run-command.c:128
    #8 0x565358f67273 in start_test builtin-test.c:545
    #9 0x565358f6771d in __cmd_test builtin-test.c:647
    #10 0x565358f682bd in cmd_test builtin-test.c:849
    #11 0x565358ee5ded in run_builtin perf.c:349
    #12 0x565358ee6085 in handle_internal_command perf.c:401
    #13 0x565358ee61de in run_argv perf.c:448
    #14 0x565358ee6527 in main perf.c:555
    #15 0x7f29ce833ca8 in __libc_start_call_main libc_start_call_main.h:74
    #16 0x7f29ce833d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128
    #17 0x565358e391c1 in _start perf[851c1]
  7: PERF_RECORD_* events & perf_sample fields                       : FAILED!
```

After:
```
$ perf test 7
  7: PERF_RECORD_* events & perf_sample fields                       : Skip (permissions)
```

Fixes: 16d00fe ("perf tests: Move test__PERF_RECORD into separate object")
Signed-off-by: Ian Rogers <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Chun-Tse Shao <[email protected]>
Cc: Howard Chu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
github-actions bot pushed a commit that referenced this pull request Oct 17, 2025
Since blamed commit, unregister_netdevice_many_notify() takes the netdev
mutex if the device needs it.

If the device list is too long, this will lock more device mutexes than
lockdep can handle:

unshare -n \
 bash -c 'for i in $(seq 1 100);do ip link add foo$i type dummy;done'

BUG: MAX_LOCK_DEPTH too low!
turning off the locking correctness validator.
depth: 48  max: 48!
48 locks held by kworker/u16:1/69:
 #0: ..148 ((wq_completion)netns){+.+.}-{0:0}, at: process_one_work
 #1: ..d40 (net_cleanup_work){+.+.}-{0:0}, at: process_one_work
 #2: ..bd0 (pernet_ops_rwsem){++++}-{4:4}, at: cleanup_net
 #3: ..aa8 (rtnl_mutex){+.+.}-{4:4}, at: default_device_exit_batch
 #4: ..cb0 (&dev_instance_lock_key#3){+.+.}-{4:4}, at: unregister_netdevice_many_notify
[..]

Add a helper to close and then unlock a list of net_devices.
Devices that are not up have to be skipped - netif_close_many always
removes them from the list without any other actions taken, so they'd
remain in locked state.

Close devices whenever we've used up half of the tracking slots or we
processed entire list without hitting the limit.

Fixes: 7e4d784 ("net: hold netdev instance lock during rtnetlink operations")
Signed-off-by: Florian Westphal <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
github-actions bot pushed a commit that referenced this pull request Oct 20, 2025
[ Upstream commit 48918ca ]

The test starts a workload and then opens events. If the events fail
to open, for example because of perf_event_paranoid, the gopipe of the
workload is leaked and the file descriptor leak check fails when the
test exits. To avoid this cancel the workload when opening the events
fails.

Before:
```
$ perf test -vv 7
  7: PERF_RECORD_* events & perf_sample fields:
 --- start ---
test child forked, pid 1189568
Using CPUID GenuineIntel-6-B7-1
 ------------------------------------------------------------
perf_event_attr:
  type                    	   0 (PERF_TYPE_HARDWARE)
  config                  	   0xa00000000 (cpu_atom/PERF_COUNT_HW_CPU_CYCLES/)
  disabled                	   1
 ------------------------------------------------------------
sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8
sys_perf_event_open failed, error -13
 ------------------------------------------------------------
perf_event_attr:
  type                             0 (PERF_TYPE_HARDWARE)
  config                           0xa00000000 (cpu_atom/PERF_COUNT_HW_CPU_CYCLES/)
  disabled                         1
  exclude_kernel                   1
 ------------------------------------------------------------
sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8 = 3
 ------------------------------------------------------------
perf_event_attr:
  type                             0 (PERF_TYPE_HARDWARE)
  config                           0x400000000 (cpu_core/PERF_COUNT_HW_CPU_CYCLES/)
  disabled                         1
 ------------------------------------------------------------
sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8
sys_perf_event_open failed, error -13
 ------------------------------------------------------------
perf_event_attr:
  type                             0 (PERF_TYPE_HARDWARE)
  config                           0x400000000 (cpu_core/PERF_COUNT_HW_CPU_CYCLES/)
  disabled                         1
  exclude_kernel                   1
 ------------------------------------------------------------
sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8 = 3
Attempt to add: software/cpu-clock/
..after resolving event: software/config=0/
cpu-clock -> software/cpu-clock/
 ------------------------------------------------------------
perf_event_attr:
  type                             1 (PERF_TYPE_SOFTWARE)
  size                             136
  config                           0x9 (PERF_COUNT_SW_DUMMY)
  sample_type                      IP|TID|TIME|CPU
  read_format                      ID|LOST
  disabled                         1
  inherit                          1
  mmap                             1
  comm                             1
  enable_on_exec                   1
  task                             1
  sample_id_all                    1
  mmap2                            1
  comm_exec                        1
  ksymbol                          1
  bpf_event                        1
  { wakeup_events, wakeup_watermark } 1
 ------------------------------------------------------------
sys_perf_event_open: pid 1189569  cpu 0  group_fd -1  flags 0x8
sys_perf_event_open failed, error -13
perf_evlist__open: Permission denied
 ---- end(-2) ----
Leak of file descriptor 6 that opened: 'pipe:[14200347]'
 ---- unexpected signal (6) ----
iFailed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
    #0 0x565358f6666e in child_test_sig_handler builtin-test.c:311
    #1 0x7f29ce849df0 in __restore_rt libc_sigaction.c:0
    #2 0x7f29ce89e95c in __pthread_kill_implementation pthread_kill.c:44
    #3 0x7f29ce849cc2 in raise raise.c:27
    #4 0x7f29ce8324ac in abort abort.c:81
    #5 0x565358f662d4 in check_leaks builtin-test.c:226
    #6 0x565358f6682e in run_test_child builtin-test.c:344
    #7 0x565358ef7121 in start_command run-command.c:128
    #8 0x565358f67273 in start_test builtin-test.c:545
    #9 0x565358f6771d in __cmd_test builtin-test.c:647
    #10 0x565358f682bd in cmd_test builtin-test.c:849
    #11 0x565358ee5ded in run_builtin perf.c:349
    #12 0x565358ee6085 in handle_internal_command perf.c:401
    #13 0x565358ee61de in run_argv perf.c:448
    #14 0x565358ee6527 in main perf.c:555
    #15 0x7f29ce833ca8 in __libc_start_call_main libc_start_call_main.h:74
    #16 0x7f29ce833d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128
    #17 0x565358e391c1 in _start perf[851c1]
  7: PERF_RECORD_* events & perf_sample fields                       : FAILED!
```

After:
```
$ perf test 7
  7: PERF_RECORD_* events & perf_sample fields                       : Skip (permissions)
```

Fixes: 16d00fe ("perf tests: Move test__PERF_RECORD into separate object")
Signed-off-by: Ian Rogers <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Chun-Tse Shao <[email protected]>
Cc: Howard Chu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
github-actions bot pushed a commit that referenced this pull request Oct 20, 2025
commit 0570327 upstream.

Before disabling SR-IOV via config space accesses to the parent PF,
sriov_disable() first removes the PCI devices representing the VFs.

Since commit 9d16947 ("PCI: Add global pci_lock_rescan_remove()")
such removal operations are serialized against concurrent remove and
rescan using the pci_rescan_remove_lock. No such locking was ever added
in sriov_disable() however. In particular when commit 18f9e9d
("PCI/IOV: Factor out sriov_add_vfs()") factored out the PCI device
removal into sriov_del_vfs() there was still no locking around the
pci_iov_remove_virtfn() calls.

On s390 the lack of serialization in sriov_disable() may cause double
remove and list corruption with the below (amended) trace being observed:

  PSW:  0704c00180000000 0000000c914e4b38 (klist_put+56)
  GPRS: 000003800313fb48 0000000000000000 0000000100000001 0000000000000001
	00000000f9b520a8 0000000000000000 0000000000002fbd 00000000f4cc9480
	0000000000000001 0000000000000000 0000000000000000 0000000180692828
	00000000818e8000 000003800313fe2c 000003800313fb20 000003800313fad8
  #0 [3800313fb20] device_del at c9158ad5c
  #1 [3800313fb88] pci_remove_bus_device at c915105ba
  #2 [3800313fbd0] pci_iov_remove_virtfn at c9152f198
  #3 [3800313fc28] zpci_iov_remove_virtfn at c90fb67c0
  #4 [3800313fc60] zpci_bus_remove_device at c90fb6104
  #5 [3800313fca0] __zpci_event_availability at c90fb3dca
  #6 [3800313fd08] chsc_process_sei_nt0 at c918fe4a2
  #7 [3800313fd60] crw_collect_info at c91905822
  #8 [3800313fe10] kthread at c90feb390
  #9 [3800313fe68] __ret_from_fork at c90f6aa64
  #10 [3800313fe98] ret_from_fork at c9194f3f2.

This is because in addition to sriov_disable() removing the VFs, the
platform also generates hot-unplug events for the VFs. This being the
reverse operation to the hotplug events generated by sriov_enable() and
handled via pdev->no_vf_scan. And while the event processing takes
pci_rescan_remove_lock and checks whether the struct pci_dev still exists,
the lack of synchronization makes this checking racy.

Other races may also be possible of course though given that this lack of
locking persisted so long observable races seem very rare. Even on s390 the
list corruption was only observed with certain devices since the platform
events are only triggered by config accesses after the removal, so as long
as the removal finished synchronously they would not race. Either way the
locking is missing so fix this by adding it to the sriov_del_vfs() helper.

Just like PCI rescan-remove, locking is also missing in sriov_add_vfs()
including for the error case where pci_stop_and_remove_bus_device() is
called without the PCI rescan-remove lock being held. Even in the non-error
case, adding new PCI devices and buses should be serialized via the PCI
rescan-remove lock. Add the necessary locking.

Fixes: 18f9e9d ("PCI/IOV: Factor out sriov_add_vfs()")
Signed-off-by: Niklas Schnelle <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Benjamin Block <[email protected]>
Reviewed-by: Farhan Ali <[email protected]>
Reviewed-by: Julian Ruess <[email protected]>
Cc: [email protected]
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
github-actions bot pushed a commit that referenced this pull request Oct 23, 2025
JIRA: https://issues.redhat.com/browse/RHEL-78203

commit ee684de
Author: Viktor Malik <[email protected]>
Date:   Tue Apr 15 17:50:14 2025 +0200

    libbpf: Fix buffer overflow in bpf_object__init_prog
    
    As shown in [1], it is possible to corrupt a BPF ELF file such that
    arbitrary BPF instructions are loaded by libbpf. This can be done by
    setting a symbol (BPF program) section offset to a large (unsigned)
    number such that <section start + symbol offset> overflows and points
    before the section data in the memory.
    
    Consider the situation below where:
    - prog_start = sec_start + symbol_offset    <-- size_t overflow here
    - prog_end   = prog_start + prog_size
    
        prog_start        sec_start        prog_end        sec_end
            |                |                 |              |
            v                v                 v              v
        .....................|################################|............
    
    The report in [1] also provides a corrupted BPF ELF which can be used as
    a reproducer:
    
        $ readelf -S crash
        Section Headers:
          [Nr] Name              Type             Address           Offset
               Size              EntSize          Flags  Link  Info  Align
        ...
          [ 2] uretprobe.mu[...] PROGBITS         0000000000000000  00000040
               0000000000000068  0000000000000000  AX       0     0     8
    
        $ readelf -s crash
        Symbol table '.symtab' contains 8 entries:
           Num:    Value          Size Type    Bind   Vis      Ndx Name
        ...
             6: ffffffffffffffb8   104 FUNC    GLOBAL DEFAULT    2 handle_tp
    
    Here, the handle_tp prog has section offset ffffffffffffffb8, i.e. will
    point before the actual memory where section 2 is allocated.
    
    This is also reported by AddressSanitizer:
    
        =================================================================
        ==1232==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7c7302fe0000 at pc 0x7fc3046e4b77 bp 0x7ffe64677cd0 sp 0x7ffe64677490
        READ of size 104 at 0x7c7302fe0000 thread T0
            #0 0x7fc3046e4b76 in memcpy (/lib64/libasan.so.8+0xe4b76)
            #1 0x00000040df3e in bpf_object__init_prog /src/libbpf/src/libbpf.c:856
            #2 0x00000040df3e in bpf_object__add_programs /src/libbpf/src/libbpf.c:928
            #3 0x00000040df3e in bpf_object__elf_collect /src/libbpf/src/libbpf.c:3930
            #4 0x00000040df3e in bpf_object_open /src/libbpf/src/libbpf.c:8067
            #5 0x00000040f176 in bpf_object__open_file /src/libbpf/src/libbpf.c:8090
            #6 0x000000400c16 in main /poc/poc.c:8
            #7 0x7fc3043d25b4 in __libc_start_call_main (/lib64/libc.so.6+0x35b4)
            #8 0x7fc3043d2667 in __libc_start_main@@GLIBC_2.34 (/lib64/libc.so.6+0x3667)
            #9 0x000000400b34 in _start (/poc/poc+0x400b34)
    
        0x7c7302fe0000 is located 64 bytes before 104-byte region [0x7c7302fe0040,0x7c7302fe00a8)
        allocated by thread T0 here:
            #0 0x7fc3046e716b in malloc (/lib64/libasan.so.8+0xe716b)
            #1 0x7fc3045ee600 in __libelf_set_rawdata_wrlock (/lib64/libelf.so.1+0xb600)
            #2 0x7fc3045ef018 in __elf_getdata_rdlock (/lib64/libelf.so.1+0xc018)
            #3 0x00000040642f in elf_sec_data /src/libbpf/src/libbpf.c:3740
    
    The problem here is that currently, libbpf only checks that the program
    end is within the section bounds. There used to be a check
    `while (sec_off < sec_sz)` in bpf_object__add_programs, however, it was
    removed by commit 6245947 ("libbpf: Allow gaps in BPF program
    sections to support overriden weak functions").
    
    Add a check for detecting the overflow of `sec_off + prog_sz` to
    bpf_object__init_prog to fix this issue.
    
    [1] https://github.com/lmarch2/poc/blob/main/libbpf/libbpf.md
    
    Fixes: 6245947 ("libbpf: Allow gaps in BPF program sections to support overriden weak functions")
    Reported-by: lmarch2 <[email protected]>
    Signed-off-by: Viktor Malik <[email protected]>
    Signed-off-by: Andrii Nakryiko <[email protected]>
    Reviewed-by: Shung-Hsi Yu <[email protected]>
    Link: https://github.com/lmarch2/poc/blob/main/libbpf/libbpf.md
    Link: https://lore.kernel.org/bpf/[email protected]

Signed-off-by: Viktor Malik <[email protected]>
github-actions bot pushed a commit that referenced this pull request Oct 24, 2025
JIRA: https://issues.redhat.com/browse/RHEL-112997

commit ffa1e7a
Author: Thomas Hellström <[email protected]>
Date:   Tue Mar 18 10:55:48 2025 +0100

    block: Make request_queue lockdep splats show up earlier

    In recent kernels, there are lockdep splats around the
    struct request_queue::io_lockdep_map, similar to [1], but they
    typically don't show up until reclaim with writeback happens.

    Having multiple kernel versions released with a known risc of kernel
    deadlock during reclaim writeback should IMHO be addressed and
    backported to -stable with the highest priority.

    In order to have these lockdep splats show up earlier,
    preferrably during system initialization, prime the
    struct request_queue::io_lockdep_map as GFP_KERNEL reclaim-
    tainted. This will instead lead to lockdep splats looking similar
    to [2], but without the need for reclaim + writeback
    happening.

    [1]:
    [  189.762244] ======================================================
    [  189.762432] WARNING: possible circular locking dependency detected
    [  189.762441] 6.14.0-rc6-xe+ #6 Tainted: G     U
    [  189.762450] ------------------------------------------------------
    [  189.762459] kswapd0/119 is trying to acquire lock:
    [  189.762467] ffff888110ceb710 (&q->q_usage_counter(io)#26){++++}-{0:0}, at: __submit_bio+0x76/0x230
    [  189.762485]
                   but task is already holding lock:
    [  189.762494] ffffffff834c97c0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xbe/0xb00
    [  189.762507]
                   which lock already depends on the new lock.

    [  189.762519]
                   the existing dependency chain (in reverse order) is:
    [  189.762529]
                   -> #2 (fs_reclaim){+.+.}-{0:0}:
    [  189.762540]        fs_reclaim_acquire+0xc5/0x100
    [  189.762548]        kmem_cache_alloc_lru_noprof+0x4a/0x480
    [  189.762558]        alloc_inode+0xaa/0xe0
    [  189.762566]        iget_locked+0x157/0x330
    [  189.762573]        kernfs_get_inode+0x1b/0x110
    [  189.762582]        kernfs_get_tree+0x1b0/0x2e0
    [  189.762590]        sysfs_get_tree+0x1f/0x60
    [  189.762597]        vfs_get_tree+0x2a/0xf0
    [  189.762605]        path_mount+0x4cd/0xc00
    [  189.762613]        __x64_sys_mount+0x119/0x150
    [  189.762621]        x64_sys_call+0x14f2/0x2310
    [  189.762630]        do_syscall_64+0x91/0x180
    [  189.762637]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
    [  189.762647]
                   -> #1 (&root->kernfs_rwsem){++++}-{3:3}:
    [  189.762659]        down_write+0x3e/0xf0
    [  189.762667]        kernfs_remove+0x32/0x60
    [  189.762676]        sysfs_remove_dir+0x4f/0x60
    [  189.762685]        __kobject_del+0x33/0xa0
    [  189.762709]        kobject_del+0x13/0x30
    [  189.762716]        elv_unregister_queue+0x52/0x80
    [  189.762725]        elevator_switch+0x68/0x360
    [  189.762733]        elv_iosched_store+0x14b/0x1b0
    [  189.762756]        queue_attr_store+0x181/0x1e0
    [  189.762765]        sysfs_kf_write+0x49/0x80
    [  189.762773]        kernfs_fop_write_iter+0x17d/0x250
    [  189.762781]        vfs_write+0x281/0x540
    [  189.762790]        ksys_write+0x72/0xf0
    [  189.762798]        __x64_sys_write+0x19/0x30
    [  189.762807]        x64_sys_call+0x2a3/0x2310
    [  189.762815]        do_syscall_64+0x91/0x180
    [  189.762823]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
    [  189.762833]
                   -> #0 (&q->q_usage_counter(io)#26){++++}-{0:0}:
    [  189.762845]        __lock_acquire+0x1525/0x2760
    [  189.762854]        lock_acquire+0xca/0x310
    [  189.762861]        blk_mq_submit_bio+0x8a2/0xba0
    [  189.762870]        __submit_bio+0x76/0x230
    [  189.762878]        submit_bio_noacct_nocheck+0x323/0x430
    [  189.762888]        submit_bio_noacct+0x2cc/0x620
    [  189.762896]        submit_bio+0x38/0x110
    [  189.762904]        __swap_writepage+0xf5/0x380
    [  189.762912]        swap_writepage+0x3c7/0x600
    [  189.762920]        shmem_writepage+0x3da/0x4f0
    [  189.762929]        pageout+0x13f/0x310
    [  189.762937]        shrink_folio_list+0x61c/0xf60
    [  189.763261]        evict_folios+0x378/0xcd0
    [  189.763584]        try_to_shrink_lruvec+0x1b0/0x360
    [  189.763946]        shrink_one+0x10e/0x200
    [  189.764266]        shrink_node+0xc02/0x1490
    [  189.764586]        balance_pgdat+0x563/0xb00
    [  189.764934]        kswapd+0x1e8/0x430
    [  189.765249]        kthread+0x10b/0x260
    [  189.765559]        ret_from_fork+0x44/0x70
    [  189.765889]        ret_from_fork_asm+0x1a/0x30
    [  189.766198]
                   other info that might help us debug this:

    [  189.767089] Chain exists of:
                     &q->q_usage_counter(io)#26 --> &root->kernfs_rwsem --> fs_reclaim

    [  189.767971]  Possible unsafe locking scenario:

    [  189.768555]        CPU0                    CPU1
    [  189.768849]        ----                    ----
    [  189.769136]   lock(fs_reclaim);
    [  189.769421]                                lock(&root->kernfs_rwsem);
    [  189.769714]                                lock(fs_reclaim);
    [  189.770016]   rlock(&q->q_usage_counter(io)#26);
    [  189.770305]
                    *** DEADLOCK ***

    [  189.771167] 1 lock held by kswapd0/119:
    [  189.771453]  #0: ffffffff834c97c0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xbe/0xb00
    [  189.771770]
                   stack backtrace:
    [  189.772351] CPU: 4 UID: 0 PID: 119 Comm: kswapd0 Tainted: G     U             6.14.0-rc6-xe+ #6
    [  189.772353] Tainted: [U]=USER
    [  189.772354] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023
    [  189.772354] Call Trace:
    [  189.772355]  <TASK>
    [  189.772356]  dump_stack_lvl+0x6e/0xa0
    [  189.772359]  dump_stack+0x10/0x18
    [  189.772360]  print_circular_bug.cold+0x17a/0x1b7
    [  189.772363]  check_noncircular+0x13a/0x150
    [  189.772365]  ? __pfx_stack_trace_consume_entry+0x10/0x10
    [  189.772368]  __lock_acquire+0x1525/0x2760
    [  189.772368]  ? ret_from_fork_asm+0x1a/0x30
    [  189.772371]  lock_acquire+0xca/0x310
    [  189.772372]  ? __submit_bio+0x76/0x230
    [  189.772375]  ? lock_release+0xd5/0x2c0
    [  189.772376]  blk_mq_submit_bio+0x8a2/0xba0
    [  189.772378]  ? __submit_bio+0x76/0x230
    [  189.772380]  __submit_bio+0x76/0x230
    [  189.772382]  ? trace_hardirqs_on+0x1e/0xe0
    [  189.772384]  submit_bio_noacct_nocheck+0x323/0x430
    [  189.772386]  ? submit_bio_noacct_nocheck+0x323/0x430
    [  189.772387]  ? __might_sleep+0x58/0xa0
    [  189.772390]  submit_bio_noacct+0x2cc/0x620
    [  189.772391]  ? count_memcg_events+0x68/0x90
    [  189.772393]  submit_bio+0x38/0x110
    [  189.772395]  __swap_writepage+0xf5/0x380
    [  189.772396]  swap_writepage+0x3c7/0x600
    [  189.772397]  shmem_writepage+0x3da/0x4f0
    [  189.772401]  pageout+0x13f/0x310
    [  189.772406]  shrink_folio_list+0x61c/0xf60
    [  189.772409]  ? isolate_folios+0xe80/0x16b0
    [  189.772410]  ? mark_held_locks+0x46/0x90
    [  189.772412]  evict_folios+0x378/0xcd0
    [  189.772414]  ? evict_folios+0x34a/0xcd0
    [  189.772415]  ? lock_is_held_type+0xa3/0x130
    [  189.772417]  try_to_shrink_lruvec+0x1b0/0x360
    [  189.772420]  shrink_one+0x10e/0x200
    [  189.772421]  shrink_node+0xc02/0x1490
    [  189.772423]  ? shrink_node+0xa08/0x1490
    [  189.772424]  ? shrink_node+0xbd8/0x1490
    [  189.772425]  ? mem_cgroup_iter+0x366/0x480
    [  189.772427]  balance_pgdat+0x563/0xb00
    [  189.772428]  ? balance_pgdat+0x563/0xb00
    [  189.772430]  ? trace_hardirqs_on+0x1e/0xe0
    [  189.772431]  ? finish_task_switch.isra.0+0xcb/0x330
    [  189.772433]  ? __switch_to_asm+0x33/0x70
    [  189.772437]  kswapd+0x1e8/0x430
    [  189.772438]  ? __pfx_autoremove_wake_function+0x10/0x10
    [  189.772440]  ? __pfx_kswapd+0x10/0x10
    [  189.772441]  kthread+0x10b/0x260
    [  189.772443]  ? __pfx_kthread+0x10/0x10
    [  189.772444]  ret_from_fork+0x44/0x70
    [  189.772446]  ? __pfx_kthread+0x10/0x10
    [  189.772447]  ret_from_fork_asm+0x1a/0x30
    [  189.772450]  </TASK>

    [2]:
    [    8.760253] ======================================================
    [    8.760254] WARNING: possible circular locking dependency detected
    [    8.760255] 6.14.0-rc6-xe+ #7 Tainted: G     U
    [    8.760256] ------------------------------------------------------
    [    8.760257] (udev-worker)/674 is trying to acquire lock:
    [    8.760259] ffff888100e39148 (&root->kernfs_rwsem){++++}-{3:3}, at: kernfs_remove+0x32/0x60
    [    8.760265]
                   but task is already holding lock:
    [    8.760266] ffff888110dc7680 (&q->q_usage_counter(io)#27){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x12/0x30
    [    8.760272]
                   which lock already depends on the new lock.

    [    8.760272]
                   the existing dependency chain (in reverse order) is:
    [    8.760273]
                   -> #2 (&q->q_usage_counter(io)#27){++++}-{0:0}:
    [    8.760276]        blk_alloc_queue+0x30a/0x350
    [    8.760279]        blk_mq_alloc_queue+0x6b/0xe0
    [    8.760281]        scsi_alloc_sdev+0x276/0x3c0
    [    8.760284]        scsi_probe_and_add_lun+0x22a/0x440
    [    8.760286]        __scsi_scan_target+0x109/0x230
    [    8.760288]        scsi_scan_channel+0x65/0xc0
    [    8.760290]        scsi_scan_host_selected+0xff/0x140
    [    8.760292]        do_scsi_scan_host+0xa7/0xc0
    [    8.760293]        do_scan_async+0x1c/0x160
    [    8.760295]        async_run_entry_fn+0x32/0x150
    [    8.760299]        process_one_work+0x224/0x5f0
    [    8.760302]        worker_thread+0x1d4/0x3e0
    [    8.760304]        kthread+0x10b/0x260
    [    8.760306]        ret_from_fork+0x44/0x70
    [    8.760309]        ret_from_fork_asm+0x1a/0x30
    [    8.760312]
                   -> #1 (fs_reclaim){+.+.}-{0:0}:
    [    8.760315]        fs_reclaim_acquire+0xc5/0x100
    [    8.760317]        kmem_cache_alloc_lru_noprof+0x4a/0x480
    [    8.760319]        alloc_inode+0xaa/0xe0
    [    8.760322]        iget_locked+0x157/0x330
    [    8.760323]        kernfs_get_inode+0x1b/0x110
    [    8.760325]        kernfs_get_tree+0x1b0/0x2e0
    [    8.760327]        sysfs_get_tree+0x1f/0x60
    [    8.760329]        vfs_get_tree+0x2a/0xf0
    [    8.760332]        path_mount+0x4cd/0xc00
    [    8.760334]        __x64_sys_mount+0x119/0x150
    [    8.760336]        x64_sys_call+0x14f2/0x2310
    [    8.760338]        do_syscall_64+0x91/0x180
    [    8.760340]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
    [    8.760342]
                   -> #0 (&root->kernfs_rwsem){++++}-{3:3}:
    [    8.760345]        __lock_acquire+0x1525/0x2760
    [    8.760347]        lock_acquire+0xca/0x310
    [    8.760348]        down_write+0x3e/0xf0
    [    8.760350]        kernfs_remove+0x32/0x60
    [    8.760351]        sysfs_remove_dir+0x4f/0x60
    [    8.760353]        __kobject_del+0x33/0xa0
    [    8.760355]        kobject_del+0x13/0x30
    [    8.760356]        elv_unregister_queue+0x52/0x80
    [    8.760358]        elevator_switch+0x68/0x360
    [    8.760360]        elv_iosched_store+0x14b/0x1b0
    [    8.760362]        queue_attr_store+0x181/0x1e0
    [    8.760364]        sysfs_kf_write+0x49/0x80
    [    8.760366]        kernfs_fop_write_iter+0x17d/0x250
    [    8.760367]        vfs_write+0x281/0x540
    [    8.760370]        ksys_write+0x72/0xf0
    [    8.760372]        __x64_sys_write+0x19/0x30
    [    8.760374]        x64_sys_call+0x2a3/0x2310
    [    8.760376]        do_syscall_64+0x91/0x180
    [    8.760377]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
    [    8.760380]
                   other info that might help us debug this:

    [    8.760380] Chain exists of:
                     &root->kernfs_rwsem --> fs_reclaim --> &q->q_usage_counter(io)#27

    [    8.760384]  Possible unsafe locking scenario:

    [    8.760384]        CPU0                    CPU1
    [    8.760385]        ----                    ----
    [    8.760385]   lock(&q->q_usage_counter(io)#27);
    [    8.760387]                                lock(fs_reclaim);
    [    8.760388]                                lock(&q->q_usage_counter(io)#27);
    [    8.760390]   lock(&root->kernfs_rwsem);
    [    8.760391]
                    *** DEADLOCK ***

    [    8.760391] 6 locks held by (udev-worker)/674:
    [    8.760392]  #0: ffff8881209ac420 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0x72/0xf0
    [    8.760398]  #1: ffff88810c80f488 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x136/0x250
    [    8.760402]  #2: ffff888125d1d330 (kn->active#101){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x13f/0x250
    [    8.760406]  #3: ffff888110dc7bb0 (&q->sysfs_lock){+.+.}-{3:3}, at: queue_attr_store+0x148/0x1e0
    [    8.760411]  #4: ffff888110dc7680 (&q->q_usage_counter(io)#27){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x12/0x30
    [    8.760416]  #5: ffff888110dc76b8 (&q->q_usage_counter(queue)#27){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x12/0x30
    [    8.760421]
                   stack backtrace:
    [    8.760422] CPU: 7 UID: 0 PID: 674 Comm: (udev-worker) Tainted: G     U             6.14.0-rc6-xe+ #7
    [    8.760424] Tainted: [U]=USER
    [    8.760425] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023
    [    8.760426] Call Trace:
    [    8.760427]  <TASK>
    [    8.760428]  dump_stack_lvl+0x6e/0xa0
    [    8.760431]  dump_stack+0x10/0x18
    [    8.760433]  print_circular_bug.cold+0x17a/0x1b7
    [    8.760437]  check_noncircular+0x13a/0x150
    [    8.760441]  ? save_trace+0x54/0x360
    [    8.760445]  __lock_acquire+0x1525/0x2760
    [    8.760446]  ? irqentry_exit+0x3a/0xb0
    [    8.760448]  ? sysvec_apic_timer_interrupt+0x57/0xc0
    [    8.760452]  lock_acquire+0xca/0x310
    [    8.760453]  ? kernfs_remove+0x32/0x60
    [    8.760457]  down_write+0x3e/0xf0
    [    8.760459]  ? kernfs_remove+0x32/0x60
    [    8.760460]  kernfs_remove+0x32/0x60
    [    8.760462]  sysfs_remove_dir+0x4f/0x60
    [    8.760464]  __kobject_del+0x33/0xa0
    [    8.760466]  kobject_del+0x13/0x30
    [    8.760467]  elv_unregister_queue+0x52/0x80
    [    8.760470]  elevator_switch+0x68/0x360
    [    8.760472]  elv_iosched_store+0x14b/0x1b0
    [    8.760475]  queue_attr_store+0x181/0x1e0
    [    8.760479]  ? lock_acquire+0xca/0x310
    [    8.760480]  ? kernfs_fop_write_iter+0x13f/0x250
    [    8.760482]  ? lock_is_held_type+0xa3/0x130
    [    8.760485]  sysfs_kf_write+0x49/0x80
    [    8.760487]  kernfs_fop_write_iter+0x17d/0x250
    [    8.760489]  vfs_write+0x281/0x540
    [    8.760494]  ksys_write+0x72/0xf0
    [    8.760497]  __x64_sys_write+0x19/0x30
    [    8.760499]  x64_sys_call+0x2a3/0x2310
    [    8.760502]  do_syscall_64+0x91/0x180
    [    8.760504]  ? trace_hardirqs_off+0x5d/0xe0
    [    8.760506]  ? handle_softirqs+0x479/0x4d0
    [    8.760508]  ? hrtimer_interrupt+0x13f/0x280
    [    8.760511]  ? irqentry_exit_to_user_mode+0x8b/0x260
    [    8.760513]  ? clear_bhb_loop+0x15/0x70
    [    8.760515]  ? clear_bhb_loop+0x15/0x70
    [    8.760516]  ? clear_bhb_loop+0x15/0x70
    [    8.760518]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
    [    8.760520] RIP: 0033:0x7aa3bf2f5504
    [    8.760522] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 8b 10 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
    [    8.760523] RSP: 002b:00007ffc1e3697d8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
    [    8.760526] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007aa3bf2f5504
    [    8.760527] RDX: 0000000000000003 RSI: 00007ffc1e369ae0 RDI: 000000000000001c
    [    8.760528] RBP: 00007ffc1e369800 R08: 00007aa3bf3f51c8 R09: 00007ffc1e3698b0
    [    8.760528] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003
    [    8.760529] R13: 00007ffc1e369ae0 R14: 0000613ccf21f2f0 R15: 00007aa3bf3f4e80
    [    8.760533]  </TASK>

    v2:
    - Update a code comment to increase readability (Ming Lei).

    Cc: Jens Axboe <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: Ming Lei <[email protected]>
    Signed-off-by: Thomas Hellström <[email protected]>
    Reviewed-by: Ming Lei <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Jens Axboe <[email protected]>

Signed-off-by: Ming Lei <[email protected]>
github-actions bot pushed a commit that referenced this pull request Oct 24, 2025
JIRA: https://issues.redhat.com/browse/RHEL-112997

commit 9730763
Author: Nilay Shroff <[email protected]>
Date:   Wed Mar 19 16:23:46 2025 +0530

    block: correct locking order for protecting blk-wbt parameters

    The commit '245618f8e45f ("block: protect wbt_lat_usec using q->
    elevator_lock")' introduced q->elevator_lock to protect updates
    to blk-wbt parameters when writing to the sysfs attribute wbt_
    lat_usec and the cgroup attribute io.cost.qos.  However, both
    these attributes also acquire q->rq_qos_mutex, leading to the
    following lockdep warning:

    ======================================================
    WARNING: possible circular locking dependency detected
    6.14.0-rc5+ #138 Not tainted
    ------------------------------------------------------
    bash/5902 is trying to acquire lock:
    c000000085d495a0 (&q->rq_qos_mutex){+.+.}-{4:4}, at: wbt_init+0x164/0x238

    but task is already holding lock:
    c000000085d498c8 (&q->elevator_lock){+.+.}-{4:4}, at: queue_wb_lat_store+0xb0/0x20c

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&q->elevator_lock){+.+.}-{4:4}:
            __mutex_lock+0xf0/0xa58
            ioc_qos_write+0x16c/0x85c
            cgroup_file_write+0xc4/0x32c
            kernfs_fop_write_iter+0x1b8/0x29c
            vfs_write+0x410/0x584
            ksys_write+0x84/0x140
            system_call_exception+0x134/0x360
            system_call_vectored_common+0x15c/0x2ec

    -> #0 (&q->rq_qos_mutex){+.+.}-{4:4}:
            __lock_acquire+0x1b6c/0x2ae0
            lock_acquire+0x140/0x430
            __mutex_lock+0xf0/0xa58
            wbt_init+0x164/0x238
            queue_wb_lat_store+0x1dc/0x20c
            queue_attr_store+0x12c/0x164
            sysfs_kf_write+0x6c/0xb0
            kernfs_fop_write_iter+0x1b8/0x29c
            vfs_write+0x410/0x584
            ksys_write+0x84/0x140
            system_call_exception+0x134/0x360
            system_call_vectored_common+0x15c/0x2ec

    other info that might help us debug this:

        Possible unsafe locking scenario:

            CPU0                    CPU1
            ----                    ----
        lock(&q->elevator_lock);
                                    lock(&q->rq_qos_mutex);
                                    lock(&q->elevator_lock);
        lock(&q->rq_qos_mutex);

        *** DEADLOCK ***

    6 locks held by bash/5902:
        #0: c000000051122400 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x84/0x140
        #1: c00000007383f088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x174/0x29c
        #2: c000000008550428 (kn->active#182){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x180/0x29c
        #3: c000000085d493a8 (&q->q_usage_counter(io)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
        #4: c000000085d493e0 (&q->q_usage_counter(queue)#5){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
        #5: c000000085d498c8 (&q->elevator_lock){+.+.}-{4:4}, at: queue_wb_lat_store+0xb0/0x20c

    stack backtrace:
    CPU: 17 UID: 0 PID: 5902 Comm: bash Kdump: loaded Not tainted 6.14.0-rc5+ #138
    Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
    Call Trace:
    [c0000000721ef590] [c00000000118f8a8] dump_stack_lvl+0x108/0x18c (unreliable)
    [c0000000721ef5c0] [c00000000022563c] print_circular_bug+0x448/0x604
    [c0000000721ef670] [c000000000225a44] check_noncircular+0x24c/0x26c
    [c0000000721ef740] [c00000000022bf28] __lock_acquire+0x1b6c/0x2ae0
    [c0000000721ef870] [c000000000229240] lock_acquire+0x140/0x430
    [c0000000721ef970] [c0000000011cfbec] __mutex_lock+0xf0/0xa58
    [c0000000721efaa0] [c00000000096c46c] wbt_init+0x164/0x238
    [c0000000721efaf0] [c0000000008f8cd8] queue_wb_lat_store+0x1dc/0x20c
    [c0000000721efb50] [c0000000008f8fa0] queue_attr_store+0x12c/0x164
    [c0000000721efc60] [c0000000007c11cc] sysfs_kf_write+0x6c/0xb0
    [c0000000721efca0] [c0000000007bfa4c] kernfs_fop_write_iter+0x1b8/0x29c
    [c0000000721efcf0] [c0000000006a281c] vfs_write+0x410/0x584
    [c0000000721efdc0] [c0000000006a2cc8] ksys_write+0x84/0x140
    [c0000000721efe10] [c000000000031b64] system_call_exception+0x134/0x360
    [c0000000721efe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec

    >From the above log it's apparent that method which writes to sysfs attr
    wbt_lat_usec acquires q->elevator_lock first, and then acquires q->rq_
    qos_mutex. However the another method which writes to io.cost.qos,
    acquires q->rq_qos_mutex first, and then acquires q->rq_qos_mutex. So
    this could potentially cause the deadlock.

    A closer look at ioc_qos_write shows that correcting the lock order is
    non-trivial because q->rq_qos_mutex is acquired in blkg_conf_open_bdev
    and released in blkg_conf_exit. The function blkg_conf_open_bdev is
    responsible for parsing user input and finding the corresponding block
    device (bdev) from the user provided major:minor number.

    Since we do not know the bdev until blkg_conf_open_bdev completes, we
    cannot simply move q->elevator_lock acquisition before blkg_conf_open_
    bdev. So to address this, we intoduce new helpers blkg_conf_open_bdev_
    frozen and blkg_conf_exit_frozen which are just wrappers around blkg_
    conf_open_bdev and blkg_conf_exit respectively. The helper blkg_conf_
    open_bdev_frozen is similar to blkg_conf_open_bdev, but additionally
    freezes the queue, acquires q->elevator_lock and ensures the correct
    locking order is followed between q->elevator_lock and q->rq_qos_mutex.
    Similarly another helper blkg_conf_exit_frozen in addition to unfreezing
    the queue ensures that we release the locks in correct order.

    By using these helpers, now we maintain the same locking order in all
    code paths where we update blk-wbt parameters.

    Fixes: 245618f ("block: protect wbt_lat_usec using q->elevator_lock")
    Reported-by: kernel test robot <[email protected]>
    Closes: https://lore.kernel.org/oe-lkp/[email protected]
    Signed-off-by: Nilay Shroff <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Jens Axboe <[email protected]>

Signed-off-by: Ming Lei <[email protected]>
github-actions bot pushed a commit that referenced this pull request Oct 24, 2025
…xit()

JIRA: https://issues.redhat.com/browse/RHEL-112997

commit 78c2713
Author: Ming Lei <[email protected]>
Date:   Mon May 5 22:18:03 2025 +0800

    block: move wbt_enable_default() out of queue freezing from sched ->exit()

    scheduler's ->exit() is called with queue frozen and elevator lock is held, and
    wbt_enable_default() can't be called with queue frozen, otherwise the
    following lockdep warning is triggered:

            #6 (&q->rq_qos_mutex){+.+.}-{4:4}:
            #5 (&eq->sysfs_lock){+.+.}-{4:4}:
            #4 (&q->elevator_lock){+.+.}-{4:4}:
            #3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
            #2 (fs_reclaim){+.+.}-{0:0}:
            #1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
            #0 (&q->debugfs_mutex){+.+.}-{4:4}:

    Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
    call it from elevator_change_done().

    Meantime add disk->rqos_state_mutex for covering wbt state change, which
    matches the purpose more than ->elevator_lock.

    Reviewed-by: Hannes Reinecke <[email protected]>
    Reviewed-by: Nilay Shroff <[email protected]>
    Signed-off-by: Ming Lei <[email protected]>
    Reviewed-by: Christoph Hellwig <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Jens Axboe <[email protected]>

Signed-off-by: Ming Lei <[email protected]>
github-actions bot pushed a commit that referenced this pull request Oct 24, 2025
…ked_roots()

If fs_info->super_copy or fs_info->super_for_commit allocated failed in
btrfs_get_tree_subvol(), then no need to call btrfs_free_fs_info().
Otherwise btrfs_check_leaked_roots() would access NULL pointer because
fs_info->allocated_roots had not been initialised.

syzkaller reported the following information:
  ------------[ cut here ]------------
  BUG: unable to handle page fault for address: fffffffffffffbb0
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 64c9067 P4D 64c9067 PUD 64cb067 PMD 0
  Oops: Oops: 0000 [#1] SMP KASAN PTI
  CPU: 0 UID: 0 PID: 1402 Comm: syz.1.35 Not tainted 6.15.8 #4 PREEMPT(lazy)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), (...)
  RIP: 0010:arch_atomic_read arch/x86/include/asm/atomic.h:23 [inline]
  RIP: 0010:raw_atomic_read include/linux/atomic/atomic-arch-fallback.h:457 [inline]
  RIP: 0010:atomic_read include/linux/atomic/atomic-instrumented.h:33 [inline]
  RIP: 0010:refcount_read include/linux/refcount.h:170 [inline]
  RIP: 0010:btrfs_check_leaked_roots+0x18f/0x2c0 fs/btrfs/disk-io.c:1230
  [...]
  Call Trace:
   <TASK>
   btrfs_free_fs_info+0x310/0x410 fs/btrfs/disk-io.c:1280
   btrfs_get_tree_subvol+0x592/0x6b0 fs/btrfs/super.c:2029
   btrfs_get_tree+0x63/0x80 fs/btrfs/super.c:2097
   vfs_get_tree+0x98/0x320 fs/super.c:1759
   do_new_mount+0x357/0x660 fs/namespace.c:3899
   path_mount+0x716/0x19c0 fs/namespace.c:4226
   do_mount fs/namespace.c:4239 [inline]
   __do_sys_mount fs/namespace.c:4450 [inline]
   __se_sys_mount fs/namespace.c:4427 [inline]
   __x64_sys_mount+0x28c/0x310 fs/namespace.c:4427
   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0x92/0x180 arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f032eaffa8d
  [...]

Fixes: 3bb17a2 ("btrfs: add get_tree callback for new mount API")
CC: [email protected] # 6.12+
Reviewed-by: Daniel Vacek <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Dewei Meng <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
github-actions bot pushed a commit that referenced this pull request Oct 25, 2025
JIRA: https://issues.redhat.com/browse/RHEL-110301
Conflicts: Minor fuzz in core.c in hunk #4.

commit 676e8cf
Author: Peter Zijlstra <[email protected]>
Date:   Fri May 9 13:36:59 2025 +0200

    sched,livepatch: Untangle cond_resched() and live-patching

    With the goal of deprecating / removing VOLUNTARY preempt, live-patch
    needs to stop relying on cond_resched() to make forward progress.

    Instead, rely on schedule() with TASK_FREEZABLE set. Just like
    live-patching, the freezer needs to be able to stop tasks in a safe /
    known state.

    [bigeasy: use likely() in __klp_sched_try_switch() and update comments]

    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Reviewed-by: Thomas Gleixner <[email protected]>
    Reviewed-by: Petr Mladek <[email protected]>
    Tested-by: Petr Mladek <[email protected]>
    Tested-by: Miroslav Benes <[email protected]>
    Acked-by: Miroslav Benes <[email protected]>
    Acked-by: Josh Poimboeuf <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]

Signed-off-by: Phil Auld <[email protected]>
github-actions bot pushed a commit that referenced this pull request Oct 26, 2025
The original code causes a circular locking dependency found by lockdep.

======================================================
WARNING: possible circular locking dependency detected
6.16.0-rc6-lgci-xe-xe-pw-151626v3+ #1 Tainted: G S   U
------------------------------------------------------
xe_fault_inject/5091 is trying to acquire lock:
ffff888156815688 ((work_completion)(&(&devcd->del_wk)->work)){+.+.}-{0:0}, at: __flush_work+0x25d/0x660

but task is already holding lock:

ffff888156815620 (&devcd->mutex){+.+.}-{3:3}, at: dev_coredump_put+0x3f/0xa0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (&devcd->mutex){+.+.}-{3:3}:
       mutex_lock_nested+0x4e/0xc0
       devcd_data_write+0x27/0x90
       sysfs_kf_bin_write+0x80/0xf0
       kernfs_fop_write_iter+0x169/0x220
       vfs_write+0x293/0x560
       ksys_write+0x72/0xf0
       __x64_sys_write+0x19/0x30
       x64_sys_call+0x2bf/0x2660
       do_syscall_64+0x93/0xb60
       entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #1 (kn->active#236){++++}-{0:0}:
       kernfs_drain+0x1e2/0x200
       __kernfs_remove+0xae/0x400
       kernfs_remove_by_name_ns+0x5d/0xc0
       remove_files+0x54/0x70
       sysfs_remove_group+0x3d/0xa0
       sysfs_remove_groups+0x2e/0x60
       device_remove_attrs+0xc7/0x100
       device_del+0x15d/0x3b0
       devcd_del+0x19/0x30
       process_one_work+0x22b/0x6f0
       worker_thread+0x1e8/0x3d0
       kthread+0x11c/0x250
       ret_from_fork+0x26c/0x2e0
       ret_from_fork_asm+0x1a/0x30
-> #0 ((work_completion)(&(&devcd->del_wk)->work)){+.+.}-{0:0}:
       __lock_acquire+0x1661/0x2860
       lock_acquire+0xc4/0x2f0
       __flush_work+0x27a/0x660
       flush_delayed_work+0x5d/0xa0
       dev_coredump_put+0x63/0xa0
       xe_driver_devcoredump_fini+0x12/0x20 [xe]
       devm_action_release+0x12/0x30
       release_nodes+0x3a/0x120
       devres_release_all+0x8a/0xd0
       device_unbind_cleanup+0x12/0x80
       device_release_driver_internal+0x23a/0x280
       device_driver_detach+0x14/0x20
       unbind_store+0xaf/0xc0
       drv_attr_store+0x21/0x50
       sysfs_kf_write+0x4a/0x80
       kernfs_fop_write_iter+0x169/0x220
       vfs_write+0x293/0x560
       ksys_write+0x72/0xf0
       __x64_sys_write+0x19/0x30
       x64_sys_call+0x2bf/0x2660
       do_syscall_64+0x93/0xb60
       entry_SYSCALL_64_after_hwframe+0x76/0x7e
other info that might help us debug this:
Chain exists of: (work_completion)(&(&devcd->del_wk)->work) --> kn->active#236 --> &devcd->mutex
 Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&devcd->mutex);
                               lock(kn->active#236);
                               lock(&devcd->mutex);
  lock((work_completion)(&(&devcd->del_wk)->work));
 *** DEADLOCK ***
5 locks held by xe_fault_inject/5091:
 #0: ffff8881129f9488 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x72/0xf0
 #1: ffff88810c755078 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x123/0x220
 #2: ffff8881054811a0 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x55/0x280
 #3: ffff888156815620 (&devcd->mutex){+.+.}-{3:3}, at: dev_coredump_put+0x3f/0xa0
 #4: ffffffff8359e020 (rcu_read_lock){....}-{1:2}, at: __flush_work+0x72/0x660
stack backtrace:
CPU: 14 UID: 0 PID: 5091 Comm: xe_fault_inject Tainted: G S   U              6.16.0-rc6-lgci-xe-xe-pw-151626v3+ #1 PREEMPT_{RT,(lazy)}
Tainted: [S]=CPU_OUT_OF_SPEC, [U]=USER
Hardware name: Micro-Star International Co., Ltd. MS-7D25/PRO Z690-A DDR4(MS-7D25), BIOS 1.10 12/13/2021
Call Trace:
 <TASK>
 dump_stack_lvl+0x91/0xf0
 dump_stack+0x10/0x20
 print_circular_bug+0x285/0x360
 check_noncircular+0x135/0x150
 ? register_lock_class+0x48/0x4a0
 __lock_acquire+0x1661/0x2860
 lock_acquire+0xc4/0x2f0
 ? __flush_work+0x25d/0x660
 ? mark_held_locks+0x46/0x90
 ? __flush_work+0x25d/0x660
 __flush_work+0x27a/0x660
 ? __flush_work+0x25d/0x660
 ? trace_hardirqs_on+0x1e/0xd0
 ? __pfx_wq_barrier_func+0x10/0x10
 flush_delayed_work+0x5d/0xa0
 dev_coredump_put+0x63/0xa0
 xe_driver_devcoredump_fini+0x12/0x20 [xe]
 devm_action_release+0x12/0x30
 release_nodes+0x3a/0x120
 devres_release_all+0x8a/0xd0
 device_unbind_cleanup+0x12/0x80
 device_release_driver_internal+0x23a/0x280
 ? bus_find_device+0xa8/0xe0
 device_driver_detach+0x14/0x20
 unbind_store+0xaf/0xc0
 drv_attr_store+0x21/0x50
 sysfs_kf_write+0x4a/0x80
 kernfs_fop_write_iter+0x169/0x220
 vfs_write+0x293/0x560
 ksys_write+0x72/0xf0
 __x64_sys_write+0x19/0x30
 x64_sys_call+0x2bf/0x2660
 do_syscall_64+0x93/0xb60
 ? __f_unlock_pos+0x15/0x20
 ? __x64_sys_getdents64+0x9b/0x130
 ? __pfx_filldir64+0x10/0x10
 ? do_syscall_64+0x1a2/0xb60
 ? clear_bhb_loop+0x30/0x80
 ? clear_bhb_loop+0x30/0x80
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x76e292edd574
Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d d5 ea 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
RSP: 002b:00007fffe247a828 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000076e292edd574
RDX: 000000000000000c RSI: 00006267f6306063 RDI: 000000000000000b
RBP: 000000000000000c R08: 000076e292fc4b20 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 00006267f6306063
R13: 000000000000000b R14: 00006267e6859c00 R15: 000076e29322a000
 </TASK>
xe 0000:03:00.0: [drm] Xe device coredump has been deleted.

Fixes: 01daccf ("devcoredump : Serialize devcd_del work")
Cc: Mukesh Ojha <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Danilo Krummrich <[email protected]>
Cc: [email protected]
Cc: [email protected] # v6.1+
Signed-off-by: Maarten Lankhorst <[email protected]>
Cc: Matthew Brost <[email protected]>
Acked-by: Mukesh Ojha <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
github-actions bot pushed a commit that referenced this pull request Oct 30, 2025
JIRA: https://issues.redhat.com/browse/RHEL-115358

commit c98cc97
Author: Steven Rostedt <[email protected]>
Date:   Tue May 27 10:58:20 2025 -0400

    ring-buffer: Move cpus_read_lock() outside of buffer->mutex

    Running a modified trace-cmd record --nosplice where it does a mmap of the
    ring buffer when '--nosplice' is set, caused the following lockdep splat:

     ======================================================
     WARNING: possible circular locking dependency detected
     6.15.0-rc7-test-00002-gfb7d03d8a82f #551 Not tainted
     ------------------------------------------------------
     trace-cmd/1113 is trying to acquire lock:
     ffff888100062888 (&buffer->mutex){+.+.}-{4:4}, at: ring_buffer_map+0x11c/0xe70

     but task is already holding lock:
     ffff888100a5f9f8 (&cpu_buffer->mapping_lock){+.+.}-{4:4}, at: ring_buffer_map+0xcf/0xe70

     which lock already depends on the new lock.

     the existing dependency chain (in reverse order) is:

     -> #5 (&cpu_buffer->mapping_lock){+.+.}-{4:4}:
            __mutex_lock+0x192/0x18c0
            ring_buffer_map+0xcf/0xe70
            tracing_buffers_mmap+0x1c4/0x3b0
            __mmap_region+0xd8d/0x1f70
            do_mmap+0x9d7/0x1010
            vm_mmap_pgoff+0x20b/0x390
            ksys_mmap_pgoff+0x2e9/0x440
            do_syscall_64+0x79/0x1c0
            entry_SYSCALL_64_after_hwframe+0x76/0x7e

     -> #4 (&mm->mmap_lock){++++}-{4:4}:
            __might_fault+0xa5/0x110
            _copy_to_user+0x22/0x80
            _perf_ioctl+0x61b/0x1b70
            perf_ioctl+0x62/0x90
            __x64_sys_ioctl+0x134/0x190
            do_syscall_64+0x79/0x1c0
            entry_SYSCALL_64_after_hwframe+0x76/0x7e

     -> #3 (&cpuctx_mutex){+.+.}-{4:4}:
            __mutex_lock+0x192/0x18c0
            perf_event_init_cpu+0x325/0x7c0
            perf_event_init+0x52a/0x5b0
            start_kernel+0x263/0x3e0
            x86_64_start_reservations+0x24/0x30
            x86_64_start_kernel+0x95/0xa0
            common_startup_64+0x13e/0x141

     -> #2 (pmus_lock){+.+.}-{4:4}:
            __mutex_lock+0x192/0x18c0
            perf_event_init_cpu+0xb7/0x7c0
            cpuhp_invoke_callback+0x2c0/0x1030
            __cpuhp_invoke_callback_range+0xbf/0x1f0
            _cpu_up+0x2e7/0x690
            cpu_up+0x117/0x170
            cpuhp_bringup_mask+0xd5/0x120
            bringup_nonboot_cpus+0x13d/0x170
            smp_init+0x2b/0xf0
            kernel_init_freeable+0x441/0x6d0
            kernel_init+0x1e/0x160
            ret_from_fork+0x34/0x70
            ret_from_fork_asm+0x1a/0x30

     -> #1 (cpu_hotplug_lock){++++}-{0:0}:
            cpus_read_lock+0x2a/0xd0
            ring_buffer_resize+0x610/0x14e0
            __tracing_resize_ring_buffer.part.0+0x42/0x120
            tracing_set_tracer+0x7bd/0xa80
            tracing_set_trace_write+0x132/0x1e0
            vfs_write+0x21c/0xe80
            ksys_write+0xf9/0x1c0
            do_syscall_64+0x79/0x1c0
            entry_SYSCALL_64_after_hwframe+0x76/0x7e

     -> #0 (&buffer->mutex){+.+.}-{4:4}:
            __lock_acquire+0x1405/0x2210
            lock_acquire+0x174/0x310
            __mutex_lock+0x192/0x18c0
            ring_buffer_map+0x11c/0xe70
            tracing_buffers_mmap+0x1c4/0x3b0
            __mmap_region+0xd8d/0x1f70
            do_mmap+0x9d7/0x1010
            vm_mmap_pgoff+0x20b/0x390
            ksys_mmap_pgoff+0x2e9/0x440
            do_syscall_64+0x79/0x1c0
            entry_SYSCALL_64_after_hwframe+0x76/0x7e

     other info that might help us debug this:

     Chain exists of:
       &buffer->mutex --> &mm->mmap_lock --> &cpu_buffer->mapping_lock

      Possible unsafe locking scenario:

            CPU0                    CPU1
            ----                    ----
       lock(&cpu_buffer->mapping_lock);
                                    lock(&mm->mmap_lock);
                                    lock(&cpu_buffer->mapping_lock);
       lock(&buffer->mutex);

      *** DEADLOCK ***

     2 locks held by trace-cmd/1113:
      #0: ffff888106b847e0 (&mm->mmap_lock){++++}-{4:4}, at: vm_mmap_pgoff+0x192/0x390
      #1: ffff888100a5f9f8 (&cpu_buffer->mapping_lock){+.+.}-{4:4}, at: ring_buffer_map+0xcf/0xe70

     stack backtrace:
     CPU: 5 UID: 0 PID: 1113 Comm: trace-cmd Not tainted 6.15.0-rc7-test-00002-gfb7d03d8a82f #551 PREEMPT
     Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
     Call Trace:
      <TASK>
      dump_stack_lvl+0x6e/0xa0
      print_circular_bug.cold+0x178/0x1be
      check_noncircular+0x146/0x160
      __lock_acquire+0x1405/0x2210
      lock_acquire+0x174/0x310
      ? ring_buffer_map+0x11c/0xe70
      ? ring_buffer_map+0x11c/0xe70
      ? __mutex_lock+0x169/0x18c0
      __mutex_lock+0x192/0x18c0
      ? ring_buffer_map+0x11c/0xe70
      ? ring_buffer_map+0x11c/0xe70
      ? function_trace_call+0x296/0x370
      ? __pfx___mutex_lock+0x10/0x10
      ? __pfx_function_trace_call+0x10/0x10
      ? __pfx___mutex_lock+0x10/0x10
      ? _raw_spin_unlock+0x2d/0x50
      ? ring_buffer_map+0x11c/0xe70
      ? ring_buffer_map+0x11c/0xe70
      ? __mutex_lock+0x5/0x18c0
      ring_buffer_map+0x11c/0xe70
      ? do_raw_spin_lock+0x12d/0x270
      ? find_held_lock+0x2b/0x80
      ? _raw_spin_unlock+0x2d/0x50
      ? rcu_is_watching+0x15/0xb0
      ? _raw_spin_unlock+0x2d/0x50
      ? trace_preempt_on+0xd0/0x110
      tracing_buffers_mmap+0x1c4/0x3b0
      __mmap_region+0xd8d/0x1f70
      ? ring_buffer_lock_reserve+0x99/0xff0
      ? __pfx___mmap_region+0x10/0x10
      ? ring_buffer_lock_reserve+0x99/0xff0
      ? __pfx_ring_buffer_lock_reserve+0x10/0x10
      ? __pfx_ring_buffer_lock_reserve+0x10/0x10
      ? bpf_lsm_mmap_addr+0x4/0x10
      ? security_mmap_addr+0x46/0xd0
      ? lock_is_held_type+0xd9/0x130
      do_mmap+0x9d7/0x1010
      ? 0xffffffffc0370095
      ? __pfx_do_mmap+0x10/0x10
      vm_mmap_pgoff+0x20b/0x390
      ? __pfx_vm_mmap_pgoff+0x10/0x10
      ? 0xffffffffc0370095
      ksys_mmap_pgoff+0x2e9/0x440
      do_syscall_64+0x79/0x1c0
      entry_SYSCALL_64_after_hwframe+0x76/0x7e
     RIP: 0033:0x7fb0963a7de2
     Code: 00 00 00 0f 1f 44 00 00 41 f7 c1 ff 0f 00 00 75 27 55 89 cd 53 48 89 fb 48 85 ff 74 3b 41 89 ea 48 89 df b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 5b 5d c3 0f 1f 00 48 8b 05 e1 9f 0d 00 64
     RSP: 002b:00007ffdcc8fb878 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
     RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb0963a7de2
     RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000
     RBP: 0000000000000001 R08: 0000000000000006 R09: 0000000000000000
     R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
     R13: 00007ffdcc8fbe68 R14: 00007fb096628000 R15: 00005633e01a5c90
      </TASK>

    The issue is that cpus_read_lock() is taken within buffer->mutex. The
    memory mapped pages are taken with the mmap_lock held. The buffer->mutex
    is taken within the cpu_buffer->mapping_lock. There's quite a chain with
    all these locks, where the deadlock can be fixed by moving the
    cpus_read_lock() outside the taking of the buffer->mutex.

    Cc: [email protected]
    Cc: Masami Hiramatsu <[email protected]>
    Cc: Mathieu Desnoyers <[email protected]>
    Cc: Vincent Donnefort <[email protected]>
    Link: https://lore.kernel.org/[email protected]
    Fixes: 117c392 ("ring-buffer: Introducing ring-buffer mapping functions")
    Signed-off-by: Steven Rostedt (Google) <[email protected]>

Signed-off-by: Jerome Marchand <[email protected]>
github-actions bot pushed a commit that referenced this pull request Oct 30, 2025
…ked_roots()

commit 17679ac upstream.

If fs_info->super_copy or fs_info->super_for_commit allocated failed in
btrfs_get_tree_subvol(), then no need to call btrfs_free_fs_info().
Otherwise btrfs_check_leaked_roots() would access NULL pointer because
fs_info->allocated_roots had not been initialised.

syzkaller reported the following information:
  ------------[ cut here ]------------
  BUG: unable to handle page fault for address: fffffffffffffbb0
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 64c9067 P4D 64c9067 PUD 64cb067 PMD 0
  Oops: Oops: 0000 [#1] SMP KASAN PTI
  CPU: 0 UID: 0 PID: 1402 Comm: syz.1.35 Not tainted 6.15.8 #4 PREEMPT(lazy)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), (...)
  RIP: 0010:arch_atomic_read arch/x86/include/asm/atomic.h:23 [inline]
  RIP: 0010:raw_atomic_read include/linux/atomic/atomic-arch-fallback.h:457 [inline]
  RIP: 0010:atomic_read include/linux/atomic/atomic-instrumented.h:33 [inline]
  RIP: 0010:refcount_read include/linux/refcount.h:170 [inline]
  RIP: 0010:btrfs_check_leaked_roots+0x18f/0x2c0 fs/btrfs/disk-io.c:1230
  [...]
  Call Trace:
   <TASK>
   btrfs_free_fs_info+0x310/0x410 fs/btrfs/disk-io.c:1280
   btrfs_get_tree_subvol+0x592/0x6b0 fs/btrfs/super.c:2029
   btrfs_get_tree+0x63/0x80 fs/btrfs/super.c:2097
   vfs_get_tree+0x98/0x320 fs/super.c:1759
   do_new_mount+0x357/0x660 fs/namespace.c:3899
   path_mount+0x716/0x19c0 fs/namespace.c:4226
   do_mount fs/namespace.c:4239 [inline]
   __do_sys_mount fs/namespace.c:4450 [inline]
   __se_sys_mount fs/namespace.c:4427 [inline]
   __x64_sys_mount+0x28c/0x310 fs/namespace.c:4427
   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0x92/0x180 arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f032eaffa8d
  [...]

Fixes: 3bb17a2 ("btrfs: add get_tree callback for new mount API")
CC: [email protected] # 6.12+
Reviewed-by: Daniel Vacek <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Dewei Meng <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
github-actions bot pushed a commit that referenced this pull request Oct 30, 2025
[ Upstream commit a91c809 ]

The original code causes a circular locking dependency found by lockdep.

======================================================
WARNING: possible circular locking dependency detected
6.16.0-rc6-lgci-xe-xe-pw-151626v3+ #1 Tainted: G S   U
------------------------------------------------------
xe_fault_inject/5091 is trying to acquire lock:
ffff888156815688 ((work_completion)(&(&devcd->del_wk)->work)){+.+.}-{0:0}, at: __flush_work+0x25d/0x660

but task is already holding lock:

ffff888156815620 (&devcd->mutex){+.+.}-{3:3}, at: dev_coredump_put+0x3f/0xa0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (&devcd->mutex){+.+.}-{3:3}:
       mutex_lock_nested+0x4e/0xc0
       devcd_data_write+0x27/0x90
       sysfs_kf_bin_write+0x80/0xf0
       kernfs_fop_write_iter+0x169/0x220
       vfs_write+0x293/0x560
       ksys_write+0x72/0xf0
       __x64_sys_write+0x19/0x30
       x64_sys_call+0x2bf/0x2660
       do_syscall_64+0x93/0xb60
       entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #1 (kn->active#236){++++}-{0:0}:
       kernfs_drain+0x1e2/0x200
       __kernfs_remove+0xae/0x400
       kernfs_remove_by_name_ns+0x5d/0xc0
       remove_files+0x54/0x70
       sysfs_remove_group+0x3d/0xa0
       sysfs_remove_groups+0x2e/0x60
       device_remove_attrs+0xc7/0x100
       device_del+0x15d/0x3b0
       devcd_del+0x19/0x30
       process_one_work+0x22b/0x6f0
       worker_thread+0x1e8/0x3d0
       kthread+0x11c/0x250
       ret_from_fork+0x26c/0x2e0
       ret_from_fork_asm+0x1a/0x30
-> #0 ((work_completion)(&(&devcd->del_wk)->work)){+.+.}-{0:0}:
       __lock_acquire+0x1661/0x2860
       lock_acquire+0xc4/0x2f0
       __flush_work+0x27a/0x660
       flush_delayed_work+0x5d/0xa0
       dev_coredump_put+0x63/0xa0
       xe_driver_devcoredump_fini+0x12/0x20 [xe]
       devm_action_release+0x12/0x30
       release_nodes+0x3a/0x120
       devres_release_all+0x8a/0xd0
       device_unbind_cleanup+0x12/0x80
       device_release_driver_internal+0x23a/0x280
       device_driver_detach+0x14/0x20
       unbind_store+0xaf/0xc0
       drv_attr_store+0x21/0x50
       sysfs_kf_write+0x4a/0x80
       kernfs_fop_write_iter+0x169/0x220
       vfs_write+0x293/0x560
       ksys_write+0x72/0xf0
       __x64_sys_write+0x19/0x30
       x64_sys_call+0x2bf/0x2660
       do_syscall_64+0x93/0xb60
       entry_SYSCALL_64_after_hwframe+0x76/0x7e
other info that might help us debug this:
Chain exists of: (work_completion)(&(&devcd->del_wk)->work) --> kn->active#236 --> &devcd->mutex
 Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&devcd->mutex);
                               lock(kn->active#236);
                               lock(&devcd->mutex);
  lock((work_completion)(&(&devcd->del_wk)->work));
 *** DEADLOCK ***
5 locks held by xe_fault_inject/5091:
 #0: ffff8881129f9488 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x72/0xf0
 #1: ffff88810c755078 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x123/0x220
 #2: ffff8881054811a0 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x55/0x280
 #3: ffff888156815620 (&devcd->mutex){+.+.}-{3:3}, at: dev_coredump_put+0x3f/0xa0
 #4: ffffffff8359e020 (rcu_read_lock){....}-{1:2}, at: __flush_work+0x72/0x660
stack backtrace:
CPU: 14 UID: 0 PID: 5091 Comm: xe_fault_inject Tainted: G S   U              6.16.0-rc6-lgci-xe-xe-pw-151626v3+ #1 PREEMPT_{RT,(lazy)}
Tainted: [S]=CPU_OUT_OF_SPEC, [U]=USER
Hardware name: Micro-Star International Co., Ltd. MS-7D25/PRO Z690-A DDR4(MS-7D25), BIOS 1.10 12/13/2021
Call Trace:
 <TASK>
 dump_stack_lvl+0x91/0xf0
 dump_stack+0x10/0x20
 print_circular_bug+0x285/0x360
 check_noncircular+0x135/0x150
 ? register_lock_class+0x48/0x4a0
 __lock_acquire+0x1661/0x2860
 lock_acquire+0xc4/0x2f0
 ? __flush_work+0x25d/0x660
 ? mark_held_locks+0x46/0x90
 ? __flush_work+0x25d/0x660
 __flush_work+0x27a/0x660
 ? __flush_work+0x25d/0x660
 ? trace_hardirqs_on+0x1e/0xd0
 ? __pfx_wq_barrier_func+0x10/0x10
 flush_delayed_work+0x5d/0xa0
 dev_coredump_put+0x63/0xa0
 xe_driver_devcoredump_fini+0x12/0x20 [xe]
 devm_action_release+0x12/0x30
 release_nodes+0x3a/0x120
 devres_release_all+0x8a/0xd0
 device_unbind_cleanup+0x12/0x80
 device_release_driver_internal+0x23a/0x280
 ? bus_find_device+0xa8/0xe0
 device_driver_detach+0x14/0x20
 unbind_store+0xaf/0xc0
 drv_attr_store+0x21/0x50
 sysfs_kf_write+0x4a/0x80
 kernfs_fop_write_iter+0x169/0x220
 vfs_write+0x293/0x560
 ksys_write+0x72/0xf0
 __x64_sys_write+0x19/0x30
 x64_sys_call+0x2bf/0x2660
 do_syscall_64+0x93/0xb60
 ? __f_unlock_pos+0x15/0x20
 ? __x64_sys_getdents64+0x9b/0x130
 ? __pfx_filldir64+0x10/0x10
 ? do_syscall_64+0x1a2/0xb60
 ? clear_bhb_loop+0x30/0x80
 ? clear_bhb_loop+0x30/0x80
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x76e292edd574
Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d d5 ea 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
RSP: 002b:00007fffe247a828 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000076e292edd574
RDX: 000000000000000c RSI: 00006267f6306063 RDI: 000000000000000b
RBP: 000000000000000c R08: 000076e292fc4b20 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 00006267f6306063
R13: 000000000000000b R14: 00006267e6859c00 R15: 000076e29322a000
 </TASK>
xe 0000:03:00.0: [drm] Xe device coredump has been deleted.

Fixes: 01daccf ("devcoredump : Serialize devcd_del work")
Cc: Mukesh Ojha <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Danilo Krummrich <[email protected]>
Cc: [email protected]
Cc: [email protected] # v6.1+
Signed-off-by: Maarten Lankhorst <[email protected]>
Cc: Matthew Brost <[email protected]>
Acked-by: Mukesh Ojha <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
[ removed const qualifier from bin_attribute callback parameters ]
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants