Skip to content

Conversation

kernel-patches-bot
Copy link

branch: master_test
base:bpf-next
version: edc21dc

Dave Chinner and others added 30 commits February 1, 2022 14:14
Since we've started treating fallocate more like a file write, we
should flush the log to disk if the user has asked for synchronous
writes either by setting it via fcntl flags, or inode flags, or with
the sync mount option.  We've already got a helper for this, so use
it.

[The original patch by Darrick was massaged by Dave to fit this patchset]

Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Unlike .queue_rq, in .submit_async_event drivers may not check the ctrl
readiness for AER submission. This may lead to a use-after-free
condition that was observed with nvme-tcp.

The race condition may happen in the following scenario:
1. driver executes its reset_ctrl_work
2. -> nvme_stop_ctrl - flushes ctrl async_event_work
3. ctrl sends AEN which is received by the host, which in turn
   schedules AEN handling
4. teardown admin queue (which releases the queue socket)
5. AEN processed, submits another AER, calling the driver to submit
6. driver attempts to send the cmd
==> use-after-free

In order to fix that, add ctrl state check to validate the ctrl
is actually able to accept the AER submission.

This addresses the above race in controller resets because the driver
during teardown should:
1. change ctrl state to RESETTING
2. flush async_event_work (as well as other async work elements)

So after 1,2, any other AER command will find the
ctrl state to be RESETTING and bail out without submitting the AER.

Signed-off-by: Sagi Grimberg <[email protected]>
While nvme_tcp_submit_async_event_work is checking the ctrl and queue
state before preparing the AER command and scheduling io_work, in order
to fully prevent a race where this check is not reliable the error
recovery work must flush async_event_work before continuing to destroy
the admin queue after setting the ctrl state to RESETTING such that
there is no race .submit_async_event and the error recovery handler
itself changing the ctrl state.

Tested-by: Chris Leech <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
While nvme_rdma_submit_async_event_work is checking the ctrl and queue
state before preparing the AER command and scheduling io_work, in order
to fully prevent a race where this check is not reliable the error
recovery work must flush async_event_work before continuing to destroy
the admin queue after setting the ctrl state to RESETTING such that
there is no race .submit_async_event and the error recovery handler
itself changing the ctrl state.

Signed-off-by: Sagi Grimberg <[email protected]>
Refuse SIDA memops on guests which are not protected.
For normal guests, the secure instruction data address designation,
which determines the location we access, is not under control of KVM.

Fixes: 19e1227 (KVM: S390: protvirt: Introduce instruction data area bounce buffer)
Signed-off-by: Janis Schoetterl-Glausch <[email protected]>
Cc: [email protected]
Signed-off-by: Christian Borntraeger <[email protected]>
Kyle reported that rr[0] has started to malfunction on Comet Lake and
later CPUs due to EFI starting to make use of CPL3 [1] and the PMU
event filtering not distinguishing between regular CPL3 and SMM CPL3.

Since this is a privilege violation, default disable SMM visibility
where possible.

Administrators wanting to observe SMM cycles can easily change this
using the sysfs attribute while regular users don't have access to
this file.

[0] https://rr-project.org/

[1] See the Intel white paper "Trustworthy SMM on the Intel vPro Platform"
at https://bugzilla.kernel.org/attachment.cgi?id=300300, particularly the
end of page 5.

Reported-by: Kyle Huey <[email protected]>
Suggested-by: Andrew Cooper <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
The intent has always been that perf_event_attr::sig_data should also be
modifiable along with PERF_EVENT_IOC_MODIFY_ATTRIBUTES, because it is
observable by user space if SIGTRAP on events is requested.

Currently only PERF_TYPE_BREAKPOINT is modifiable, and explicitly copies
relevant breakpoint-related attributes in hw_breakpoint_copy_attr().
This misses copying perf_event_attr::sig_data.

Since sig_data is not specific to PERF_TYPE_BREAKPOINT, introduce a
helper to copy generic event-type-independent attributes on
modification.

Fixes: 97ba62b ("perf: Add support for SIGTRAP on perf events")
Reported-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dmitry Vyukov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Test that PERF_EVENT_IOC_MODIFY_ATTRIBUTES correctly modifies
perf_event_attr::sig_data as well.

Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dmitry Vyukov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
…rchitectures

Due to the alignment requirements of siginfo_t, as described in
3ddb3fd ("signal, perf: Fix siginfo_t by avoiding u64 on 32-bit
architectures"), siginfo_t::si_perf_data is limited to an unsigned long.

However, perf_event_attr::sig_data is an u64, to avoid having to deal
with compat conversions. Due to being an u64, it may not immediately be
clear to users that sig_data is truncated on 32 bit architectures.

Add a comment to explicitly point this out, and hopefully help some
users save time by not having to deduce themselves what's happening.

Reported-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dmitry Vyukov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Add a check for !buf->single before calling pt_buffer_region_size in a
place where a missing check can cause a kernel crash.

Fixes a bug introduced by commit 6706384 ("perf/x86/intel/pt:
Opportunistically use single range output mode"), which added a
support for PT single-range output mode. Since that commit if a PT
stop filter range is hit while tracing, the kernel will crash because
of a null pointer dereference in pt_handle_status due to calling
pt_buffer_region_size without a ToPA configured.

The commit which introduced single-range mode guarded almost all uses of
the ToPA buffer variables with checks of the buf->single variable, but
missed the case where tracing was stopped by the PT hardware, which
happens when execution hits a configured stop filter.

Tested that hitting a stop filter while PT recording successfully
records a trace with this patch but crashes without this patch.

Fixes: 6706384 ("perf/x86/intel/pt: Opportunistically use single range output mode")
Signed-off-by: Tristan Hume <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Adrian Hunter <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
In kvm_arch_vcpu_ioctl_run() we enter an RCU extended quiescent state
(EQS) by calling guest_enter_irqoff(), and unmask IRQs prior to exiting
the EQS by calling guest_exit(). As the IRQ entry code will not wake RCU
in this case, we may run the core IRQ code and IRQ handler without RCU
watching, leading to various potential problems.

Additionally, we do not inform lockdep or tracing that interrupts will
be enabled during guest execution, which caan lead to misleading traces
and warnings that interrupts have been enabled for overly-long periods.

This patch fixes these issues by using the new timing and context
entry/exit helpers to ensure that interrupts are handled during guest
vtime but with RCU watching, with a sequence:

	guest_timing_enter_irqoff();

	guest_state_enter_irqoff();
	< run the vcpu >
	guest_state_exit_irqoff();

	< take any pending IRQs >

	guest_timing_exit_irqoff();

Since instrumentation may make use of RCU, we must also ensure that no
instrumented code is run during the EQS. I've split out the critical
section into a new kvm_riscv_enter_exit_vcpu() helper which is marked
noinstr.

Fixes: 99cdc6c ("RISC-V: Add initial skeletal KVM support")
Signed-off-by: Mark Rutland <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Anup Patel <[email protected]>
Cc: Atish Patra <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Paul Walmsley <[email protected]>
Tested-by: Anup Patel <[email protected]>
Signed-off-by: Anup Patel <[email protected]>
Those applications that run in VU mode and access the time CSR cause
a virtual instruction trap as Guest kernel currently does not
initialize the scounteren CSR.

To fix this, we should make CY, TM, and IR counters accessibile
by default in VU mode (similar to OpenSBI).

Fixes: a33c72f ("RISC-V: KVM: Implement VCPU create, init and
destroy functions")
Cc: [email protected]
Signed-off-by: Mayuresh Chitale <[email protected]>
Signed-off-by: Anup Patel <[email protected]>
The SBI implementation version returned by KVM RISC-V should be the
Host Linux version code.

Fixes: c62a768 ("RISC-V: KVM: Add SBI v0.2 base extension")
Signed-off-by: Anup Patel <[email protected]>
Reviewed-by: Atish Patra <[email protected]>
Signed-off-by: Anup Patel <[email protected]>
…from TODO list)"

This reverts commit b3ec8cd.

Revert the second (of 2) commits which disabled scrolling acceleration
in fbcon/fbdev.  It introduced a regression for fbdev-supported graphic
cards because of the performance penalty by doing screen scrolling by
software instead of using the existing graphic card 2D hardware
acceleration.

Console scrolling acceleration was disabled by dropping code which
checked at runtime the driver hardware capabilities for the
BINFO_HWACCEL_COPYAREA or FBINFO_HWACCEL_FILLRECT flags and if set, it
enabled scrollmode SCROLL_MOVE which uses hardware acceleration to move
screen contents.  After dropping those checks scrollmode was hard-wired
to SCROLL_REDRAW instead, which forces all graphic cards to redraw every
character at the new screen position when scrolling.

This change effectively disabled all hardware-based scrolling acceleration for
ALL drivers, because now all kind of 2D hardware acceleration (bitblt,
fillrect) in the drivers isn't used any longer.

The original commit message mentions that only 3 DRM drivers (nouveau, omapdrm
and gma500) used hardware acceleration in the past and thus code for checking
and using scrolling acceleration is obsolete.

This statement is NOT TRUE, because beside the DRM drivers there are around 35
other fbdev drivers which depend on fbdev/fbcon and still provide hardware
acceleration for fbdev/fbcon.

The original commit message also states that syzbot found lots of bugs in fbcon
and thus it's "often the solution to just delete code and remove features".
This is true, and the bugs - which actually affected all users of fbcon,
including DRM - were fixed, or code was dropped like e.g. the support for
software scrollback in vgacon (commit 973c096).

So to further analyze which bugs were found by syzbot, I've looked through all
patches in drivers/video which were tagged with syzbot or syzkaller back to
year 2005. The vast majority fixed the reported issues on a higher level, e.g.
when screen is to be resized, or when font size is to be changed. The few ones
which touched driver code fixed a real driver bug, e.g. by adding a check.

But NONE of those patches touched code of either the SCROLL_MOVE or the
SCROLL_REDRAW case.

That means, there was no real reason why SCROLL_MOVE had to be ripped-out and
just SCROLL_REDRAW had to be used instead. The only reason I can imagine so far
was that SCROLL_MOVE wasn't used by DRM and as such it was assumed that it
could go away. That argument completely missed the fact that SCROLL_MOVE is
still heavily used by fbdev (non-DRM) drivers.

Some people mention that using memcpy() instead of the hardware acceleration is
pretty much the same speed. But that's not true, at least not for older graphic
cards and machines where we see speed decreases by factor 10 and more and thus
this change leads to console responsiveness way worse than before.

That's why the original commit is to be reverted. By reverting we
reintroduce hardware-based scrolling acceleration and fix the
performance regression for fbdev drivers.

There isn't any impact on DRM when reverting those patches.

Signed-off-by: Helge Deller <[email protected]>
Acked-by: Geert Uytterhoeven <[email protected]>
Acked-by: Sven Schnelle <[email protected]>
Cc: [email protected] # v5.16+
Signed-off-by: Helge Deller <[email protected]>
Signed-off-by: Daniel Vetter <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
This reverts commit 39aead8.

Revert the first (of 2) commits which disabled scrolling acceleration in
fbcon/fbdev.  It introduced a regression for fbdev-supported graphic cards
because of the performance penalty by doing screen scrolling by software
instead of using the existing graphic card 2D hardware acceleration.

Console scrolling acceleration was disabled by dropping code which
checked at runtime the driver hardware capabilities for the
BINFO_HWACCEL_COPYAREA or FBINFO_HWACCEL_FILLRECT flags and if set, it
enabled scrollmode SCROLL_MOVE which uses hardware acceleration to move
screen contents.  After dropping those checks scrollmode was hard-wired
to SCROLL_REDRAW instead, which forces all graphic cards to redraw every
character at the new screen position when scrolling.

This change effectively disabled all hardware-based scrolling acceleration for
ALL drivers, because now all kind of 2D hardware acceleration (bitblt,
fillrect) in the drivers isn't used any longer.

The original commit message mentions that only 3 DRM drivers (nouveau, omapdrm
and gma500) used hardware acceleration in the past and thus code for checking
and using scrolling acceleration is obsolete.

This statement is NOT TRUE, because beside the DRM drivers there are around 35
other fbdev drivers which depend on fbdev/fbcon and still provide hardware
acceleration for fbdev/fbcon.

The original commit message also states that syzbot found lots of bugs in fbcon
and thus it's "often the solution to just delete code and remove features".
This is true, and the bugs - which actually affected all users of fbcon,
including DRM - were fixed, or code was dropped like e.g. the support for
software scrollback in vgacon (commit 973c096).

So to further analyze which bugs were found by syzbot, I've looked through all
patches in drivers/video which were tagged with syzbot or syzkaller back to
year 2005. The vast majority fixed the reported issues on a higher level, e.g.
when screen is to be resized, or when font size is to be changed. The few ones
which touched driver code fixed a real driver bug, e.g. by adding a check.

But NONE of those patches touched code of either the SCROLL_MOVE or the
SCROLL_REDRAW case.

That means, there was no real reason why SCROLL_MOVE had to be ripped-out and
just SCROLL_REDRAW had to be used instead. The only reason I can imagine so far
was that SCROLL_MOVE wasn't used by DRM and as such it was assumed that it
could go away. That argument completely missed the fact that SCROLL_MOVE is
still heavily used by fbdev (non-DRM) drivers.

Some people mention that using memcpy() instead of the hardware acceleration is
pretty much the same speed. But that's not true, at least not for older graphic
cards and machines where we see speed decreases by factor 10 and more and thus
this change leads to console responsiveness way worse than before.

That's why the original commit is to be reverted. By reverting we
reintroduce hardware-based scrolling acceleration and fix the
performance regression for fbdev drivers.

There isn't any impact on DRM when reverting those patches.

Signed-off-by: Helge Deller <[email protected]>
Acked-by: Geert Uytterhoeven <[email protected]>
Acked-by: Sven Schnelle <[email protected]>
Cc: [email protected] # v5.10+
Signed-off-by: Helge Deller <[email protected]>
Signed-off-by: Daniel Vetter <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Add a config option CONFIG_FRAMEBUFFER_CONSOLE_LEGACY_ACCELERATION to
enable bitblt and fillrect hardware acceleration in the framebuffer
console. If disabled, such acceleration will not be used, even if it is
supported by the graphics hardware driver.

If you plan to use DRM as your main graphics output system, you should
disable this option since it will prevent compiling in code which isn't
used later on when DRM takes over.

For all other configurations, e.g. if none of your graphic cards support
DRM (yet), DRM isn't available for your architecture, or you can't be
sure that the graphic card in the target system will support DRM, you
most likely want to enable this option.

In the non-accelerated case (e.g. when DRM is used), the inlined
fb_scrollmode() function is hardcoded to return SCROLL_REDRAW and as such the
compiler is able to optimize much unneccesary code away.

In this v3 patch version I additionally changed the GETVYRES() and GETVXRES()
macros to take a pointer to the fbcon_display struct. This fixes the build when
console rotation is enabled and helps the compiler again to optimize out code.

Signed-off-by: Helge Deller <[email protected]>
Cc: [email protected] # v5.10+
Signed-off-by: Helge Deller <[email protected]>
Signed-off-by: Daniel Vetter <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Commit ceaa762 ("block: move direct_IO into our own read_iter
handler") introduced several regressions for bdev DIO:

1. read spanning EOF always returns 0 instead of the number of bytes
   read.  This is because "count" is assigned early and isn't updated
   when the iterator is truncated:

     $ lsblk -o name,size /dev/vdb
     NAME SIZE
     vdb    1G
     $ xfs_io -d -c 'pread -b 4M 1021M 4M' /dev/vdb
     read 0/4194304 bytes at offset 1070596096
     0.000000 bytes, 0 ops; 0.0007 sec (0.000000 bytes/sec and 0.0000 ops/sec)

     instead of

     $ xfs_io -d -c 'pread -b 4M 1021M 4M' /dev/vdb
     read 3145728/4194304 bytes at offset 1070596096
     3 MiB, 1 ops; 0.0007 sec (3.865 GiB/sec and 1319.2612 ops/sec)

2. truncated iterator isn't reexpanded
3. iterator isn't reverted on blkdev_direct_IO() error
4. zero size read no longer skips atime update

Fixes: ceaa762 ("block: move direct_IO into our own read_iter handler")
Signed-off-by: Ilya Dryomov <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
 into HEAD

KVM/riscv fixes for 5.17, take #1

- Rework guest entry logic

- Make CY, TM, and IR counters accessible in VU mode

- Fix SBI implementation version
If we're doing an uncached read of the directory, then we ideally want
to read only the exact set of entries that will fit in the buffer
supplied by the getdents() system call. So unlike the case where we're
reading into the page cache, let's send only one READDIR call, before
trying to fill up the buffer.

Fixes: 35df59d ("NFS: Reduce number of RPC calls when doing uncached readdir")
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
Ensure that we initialise desc->cache_entry_index correctly in
uncached_readdir().

Fixes: d1bacf9 ("NFS: add readdir cache array")
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
If we've reached the end of the directory, then cache that information
in the context so that we don't need to do an uncached readdir in order
to rediscover that fact.

Fixes: 794092c ("NFS: Do uncached readdir when we're seeking a cookie in an empty page cache")
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
audit_log_start() returns audit_buffer pointer on success or NULL on
error, so it is better to check the return value of it.

Fixes: 3323eec ("integrity: IMA as an integrity service provider")
Signed-off-by: Xiaoke Wang <[email protected]>
Cc: <[email protected]>
Reviewed-by: Paul Moore <[email protected]>
Signed-off-by: Mimi Zohar <[email protected]>
The removal of ima_dir currently fails since ima_policy still exists, so
remove the ima_policy file before removing the directory.

Fixes: 4af4662 ("integrity: IMA policy")
Signed-off-by: Stefan Berger <[email protected]>
Cc: <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Signed-off-by: Mimi Zohar <[email protected]>
Commit c2426d2 ("ima: added support for new kernel cmdline parameter
ima_template_fmt") introduced an additional check on the ima_template
variable to avoid multiple template selection.

Unfortunately, ima_template could be also set by the setup function of the
ima_hash= parameter, when it calls ima_template_desc_current(). This causes
attempts to choose a new template with ima_template= or with
ima_template_fmt=, after ima_hash=, to be ignored.

Achieve the goal of the commit mentioned with the new static variable
template_setup_done, so that template selection requests after ima_hash=
are not ignored.

Finally, call ima_init_template_list(), if not already done, to initialize
the list of templates before lookup_template_desc() is called.

Reported-by: Guo Zihua <[email protected]>
Signed-off-by: Roberto Sassu <[email protected]>
Cc: [email protected]
Fixes: c2426d2 ("ima: added support for new kernel cmdline parameter ima_template_fmt")
Signed-off-by: Mimi Zohar <[email protected]>
Before printing a policy rule scan for inactive LSM labels in the policy
rule. Inactive LSM labels are identified by args_p != NULL and
rule == NULL.

Fixes: 483ec26 ("ima: ima/lsm policy rule loading logic bug fixes")
Signed-off-by: Stefan Berger <[email protected]>
Cc: <[email protected]> # v5.6+
Acked-by: Christian Brauner <[email protected]>
[[email protected]: Updated "Fixes" tag]
Signed-off-by: Mimi Zohar <[email protected]>
The recv path of secure mode is intertwined with that of crc mode.
While it's slightly more efficient that way (the ciphertext is read
into the destination buffer and decrypted in place, thus avoiding
two potentially heavy memory allocations for the bounce buffer and
the corresponding sg array), it isn't really amenable to changes.
Sacrifice that edge and align with the send path which always uses
a full-sized bounce buffer (currently there is no other way -- if
the kernel crypto API ever grows support for streaming (piecewise)
en/decryption for GCM [1], we would be able to easily take advantage
of that on both sides).

[1] https://lore.kernel.org/all/[email protected]/

Signed-off-by: Ilya Dryomov <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Both msgr1 and msgr2 in crc mode are zero copy in the sense that
message data is read from the socket directly into the destination
buffer.  We assume that the destination buffer is stable (i.e. remains
unchanged while it is being read to) though.  Otherwise, CRC errors
ensue:

  libceph: read_partial_message 0000000048edf8ad data crc 1063286393 != exp. 228122706
  libceph: osd1 (1)192.168.122.1:6843 bad crc/signature

  libceph: bad data crc, calculated 57958023, expected 1805382778
  libceph: osd2 (2)192.168.122.1:6876 integrity error, bad crc

Introduce rxbounce option to enable use of a bounce buffer when
receiving message data.  In particular this is needed if a mapped
image is a Windows VM disk, passed to QEMU.  Windows has a system-wide
"dummy" page that may be mapped into the destination buffer (potentially
more than once into the same buffer) by the Windows Memory Manager in
an effort to generate a single large I/O [1][2].  QEMU makes a point of
preserving overlap relationships when cloning I/O vectors, so krbd gets
exposed to this behaviour.

[1] "What Is Really in That MDL?"
    https://docs.microsoft.com/en-us/previous-versions/windows/hardware/design/dn614012(v=vs.85)
[2] https://blogs.msmvps.com/kernelmustard/2005/05/04/dummy-pages/

URL: https://bugzilla.redhat.com/show_bug.cgi?id=1973317
Signed-off-by: Ilya Dryomov <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
We're missing the `f` prefix to have python do string interpolation, so
we'd never end up printing what the actual "unexpected" error is.

Fixes: ee92ed3 ("kunit: add run_checks.py script to validate kunit changes")
Signed-off-by: Daniel Latypov <[email protected]>
Reviewed-by: David Gow <[email protected]>
Reviewed-by: Brendan Higgins <[email protected]>
Signed-off-by: Shuah Khan <[email protected]>
Leon reported NULL pointer deref with nowait support:

[   15.123761] device-mapper: raid: Loading target version 1.15.1
[   15.124185] device-mapper: raid: Ignoring chunk size parameter for RAID 1
[   15.124192] device-mapper: raid: Choosing default region size of 4MiB
[   15.129524] BUG: kernel NULL pointer dereference, address: 0000000000000060
[   15.129530] #PF: supervisor write access in kernel mode
[   15.129533] #PF: error_code(0x0002) - not-present page
[   15.129535] PGD 0 P4D 0
[   15.129538] Oops: 0002 [#1] PREEMPT SMP NOPTI
[   15.129541] CPU: 5 PID: 494 Comm: ldmtool Not tainted 5.17.0-rc2-1-mainline #1 9fe89d43dfcb215d2731e6f8851740520778615e
[   15.129546] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F36e 10/14/2021
[   15.129549] RIP: 0010:blk_queue_flag_set+0x7/0x20
[   15.129555] Code: 00 00 00 0f 1f 44 00 00 48 8b 35 e4 e0 04 02 48 8d 57 28 bf 40 01 \
       00 00 e9 16 c1 be ff 66 0f 1f 44 00 00 0f 1f 44 00 00 89 ff <f0> 48 0f ab 7e 60 \
       31 f6 89 f7 c3 66 66 2e 0f 1f 84 00 00 00 00 00
[   15.129559] RSP: 0018:ffff966b81987a88 EFLAGS: 00010202
[   15.129562] RAX: ffff8b11c363a0d0 RBX: ffff8b11e294b070 RCX: 0000000000000000
[   15.129564] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000001d
[   15.129566] RBP: ffff8b11e294b058 R08: 0000000000000000 R09: 0000000000000000
[   15.129568] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b11e294b070
[   15.129570] R13: 0000000000000000 R14: ffff8b11e294b000 R15: 0000000000000001
[   15.129572] FS:  00007fa96e826780(0000) GS:ffff8b18deb40000(0000) knlGS:0000000000000000
[   15.129575] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.129577] CR2: 0000000000000060 CR3: 000000010b8ce000 CR4: 00000000003506e0
[   15.129580] Call Trace:
[   15.129582]  <TASK>
[   15.129584]  md_run+0x67c/0xc70 [md_mod 1e470c1b6bcf1114198109f42682f5a2740e9531]
[   15.129597]  raid_ctr+0x134a/0x28ea [dm_raid 6a645dd7519e72834bd7e98c23497eeade14cd63]
[   15.129604]  ? dm_split_args+0x63/0x150 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
[   15.129615]  dm_table_add_target+0x188/0x380 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
[   15.129625]  table_load+0x13b/0x370 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
[   15.129635]  ? dev_suspend+0x2d0/0x2d0 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
[   15.129644]  ctl_ioctl+0x1bd/0x460 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
[   15.129655]  dm_ctl_ioctl+0xa/0x20 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e]
[   15.129663]  __x64_sys_ioctl+0x8e/0xd0
[   15.129667]  do_syscall_64+0x5c/0x90
[   15.129672]  ? syscall_exit_to_user_mode+0x23/0x50
[   15.129675]  ? do_syscall_64+0x69/0x90
[   15.129677]  ? do_syscall_64+0x69/0x90
[   15.129679]  ? syscall_exit_to_user_mode+0x23/0x50
[   15.129682]  ? do_syscall_64+0x69/0x90
[   15.129684]  ? do_syscall_64+0x69/0x90
[   15.129686]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   15.129689] RIP: 0033:0x7fa96ecd559b
[   15.129692] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c \
    c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff \
    ff 73 01 c3 48 8b 0d a5 a8 0c 00 f7 d8 64 89 01 48
[   15.129696] RSP: 002b:00007ffcaf85c258 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[   15.129699] RAX: ffffffffffffffda RBX: 00007fa96f1b48f0 RCX: 00007fa96ecd559b
[   15.129701] RDX: 00007fa97017e610 RSI: 00000000c138fd09 RDI: 0000000000000003
[   15.129702] RBP: 00007fa96ebab583 R08: 00007fa97017c9e0 R09: 00007ffcaf85bf27
[   15.129704] R10: 0000000000000001 R11: 0000000000000206 R12: 00007fa97017e610
[   15.129706] R13: 00007fa97017e640 R14: 00007fa97017e6c0 R15: 00007fa97017e530
[   15.129709]  </TASK>

This is caused by missing mddev->queue check for setting QUEUE_FLAG_NOWAIT
Fix this by moving the QUEUE_FLAG_NOWAIT logic to under mddev->queue check.

Fixes: f51d46d ("md: add support for REQ_NOWAIT")
Reported-by: Leon Möller <[email protected]>
Tested-by: Leon Möller <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Song Liu <[email protected]>
This will be used to help make decisions on what to do in
misconfigured systems.

v2: squash in semicolon fix from Stephen Rothwell

Signed-off-by: Mario Limonciello <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 11, 2025
A crash was observed with the following output:

BUG: kernel NULL pointer dereference, address: 0000000000000010
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 2 UID: 0 PID: 92 Comm: osnoise_cpus Not tainted 6.17.0-rc4-00201-gd69eb204c255 #138 PREEMPT(voluntary)
RIP: 0010:bitmap_parselist+0x53/0x3e0
Call Trace:
 <TASK>
 osnoise_cpus_write+0x7a/0x190
 vfs_write+0xf8/0x410
 ? do_sys_openat2+0x88/0xd0
 ksys_write+0x60/0xd0
 do_syscall_64+0xa4/0x260
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
 </TASK>

This issue can be reproduced by below code:

fd=open("/sys/kernel/debug/tracing/osnoise/cpus", O_WRONLY);
write(fd, "0-2", 0);

When user pass 'count=0' to osnoise_cpus_write(), kmalloc() will return
ZERO_SIZE_PTR (16) and cpulist_parse() treat it as a normal value, which
trigger the null pointer dereference. Add check for the parameter 'count'.

Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/[email protected]
Fixes: 17f8910 ("tracing/osnoise: Allow arbitrarily long CPU string")
Signed-off-by: Wang Liang <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 11, 2025
…on memory

When I did memory failure tests, below panic occurs:

page dumped because: VM_BUG_ON_PAGE(PagePoisoned(page))
kernel BUG at include/linux/page-flags.h:616!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 3 PID: 720 Comm: bash Not tainted 6.10.0-rc1-00195-g148743902568 #40
RIP: 0010:unpoison_memory+0x2f3/0x590
RSP: 0018:ffffa57fc8787d60 EFLAGS: 00000246
RAX: 0000000000000037 RBX: 0000000000000009 RCX: ffff9be25fcdc9c8
RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff9be25fcdc9c0
RBP: 0000000000300000 R08: ffffffffb4956f88 R09: 0000000000009ffb
R10: 0000000000000284 R11: ffffffffb4926fa0 R12: ffffe6b00c000000
R13: ffff9bdb453dfd00 R14: 0000000000000000 R15: fffffffffffffffe
FS:  00007f08f04e4740(0000) GS:ffff9be25fcc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000564787a30410 CR3: 000000010d4e2000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 unpoison_memory+0x2f3/0x590
 simple_attr_write_xsigned.constprop.0.isra.0+0xb3/0x110
 debugfs_attr_write+0x42/0x60
 full_proxy_write+0x5b/0x80
 vfs_write+0xd5/0x540
 ksys_write+0x64/0xe0
 do_syscall_64+0xb9/0x1d0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f08f0314887
RSP: 002b:00007ffece710078 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007f08f0314887
RDX: 0000000000000009 RSI: 0000564787a30410 RDI: 0000000000000001
RBP: 0000564787a30410 R08: 000000000000fefe R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000009
R13: 00007f08f041b780 R14: 00007f08f0417600 R15: 00007f08f0416a00
 </TASK>
Modules linked in: hwpoison_inject
---[ end trace 0000000000000000 ]---
RIP: 0010:unpoison_memory+0x2f3/0x590
RSP: 0018:ffffa57fc8787d60 EFLAGS: 00000246
RAX: 0000000000000037 RBX: 0000000000000009 RCX: ffff9be25fcdc9c8
RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff9be25fcdc9c0
RBP: 0000000000300000 R08: ffffffffb4956f88 R09: 0000000000009ffb
R10: 0000000000000284 R11: ffffffffb4926fa0 R12: ffffe6b00c000000
R13: ffff9bdb453dfd00 R14: 0000000000000000 R15: fffffffffffffffe
FS:  00007f08f04e4740(0000) GS:ffff9be25fcc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000564787a30410 CR3: 000000010d4e2000 CR4: 00000000000006f0
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x31c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Fatal exception ]---

The root cause is that unpoison_memory() tries to check the PG_HWPoison
flags of an uninitialized page.  So VM_BUG_ON_PAGE(PagePoisoned(page)) is
triggered.  This can be reproduced by below steps:

1.Offline memory block:

 echo offline > /sys/devices/system/memory/memory12/state

2.Get offlined memory pfn:

 page-types -b n -rlN

3.Write pfn to unpoison-pfn

 echo <pfn> > /sys/kernel/debug/hwpoison/unpoison-pfn

This scenario can be identified by pfn_to_online_page() returning NULL. 
And ZONE_DEVICE pages are never expected, so we can simply fail if
pfn_to_online_page() == NULL to fix the bug.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: f1dd2cd ("mm, memory_hotplug: do not associate hotadded memory to zones until online")
Signed-off-by: Miaohe Lin <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 11, 2025
The ns_bpf_qdisc selftest triggers a kernel panic:

    Unable to handle kernel paging request at virtual address ffffffffa38dbf58
    Current test_progs pgtable: 4K pagesize, 57-bit VAs, pgdp=0x00000001109cc000
    [ffffffffa38dbf58] pgd=000000011fffd801, p4d=000000011fffd401, pud=000000011fffd001, pmd=0000000000000000
    Oops [#1]
    Modules linked in: bpf_testmod(OE) xt_conntrack nls_iso8859_1 dm_mod drm drm_panel_orientation_quirks configfs backlight btrfs blake2b_generic xor lzo_compress zlib_deflate raid6_pq efivarfs [last unloaded: bpf_testmod(OE)]
    CPU: 1 UID: 0 PID: 23584 Comm: test_progs Tainted: G        W  OE       6.17.0-rc1-g2465bb83e0b4 #1 NONE
    Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
    Hardware name: Unknown Unknown Product/Unknown Product, BIOS 2024.01+dfsg-1ubuntu5.1 01/01/2024
    epc : __qdisc_run+0x82/0x6f0
     ra : __qdisc_run+0x6e/0x6f0
    epc : ffffffff80bd5c7a ra : ffffffff80bd5c66 sp : ff2000000eecb550
     gp : ffffffff82472098 tp : ff60000096895940 t0 : ffffffff8001f180
     t1 : ffffffff801e1664 t2 : 0000000000000000 s0 : ff2000000eecb5d0
     s1 : ff60000093a6a600 a0 : ffffffffa38dbee8 a1 : 0000000000000001
     a2 : ff2000000eecb510 a3 : 0000000000000001 a4 : 0000000000000000
     a5 : 0000000000000010 a6 : 0000000000000000 a7 : 0000000000735049
     s2 : ffffffffa38dbee8 s3 : 0000000000000040 s4 : ff6000008bcda000
     s5 : 0000000000000008 s6 : ff60000093a6a680 s7 : ff60000093a6a6f0
     s8 : ff60000093a6a6ac s9 : ff60000093140000 s10: 0000000000000000
     s11: ff2000000eecb9d0 t3 : 0000000000000000 t4 : 0000000000ff0000
     t5 : 0000000000000000 t6 : ff60000093a6a8b6
    status: 0000000200000120 badaddr: ffffffffa38dbf58 cause: 000000000000000d
    [<ffffffff80bd5c7a>] __qdisc_run+0x82/0x6f0
    [<ffffffff80b6fe58>] __dev_queue_xmit+0x4c0/0x1128
    [<ffffffff80b80ae0>] neigh_resolve_output+0xd0/0x170
    [<ffffffff80d2daf6>] ip6_finish_output2+0x226/0x6c8
    [<ffffffff80d31254>] ip6_finish_output+0x10c/0x2a0
    [<ffffffff80d31446>] ip6_output+0x5e/0x178
    [<ffffffff80d2e232>] ip6_xmit+0x29a/0x608
    [<ffffffff80d6f4c6>] inet6_csk_xmit+0xe6/0x140
    [<ffffffff80c985e4>] __tcp_transmit_skb+0x45c/0xaa8
    [<ffffffff80c995fe>] tcp_connect+0x9ce/0xd10
    [<ffffffff80d66524>] tcp_v6_connect+0x4ac/0x5e8
    [<ffffffff80cc19b8>] __inet_stream_connect+0xd8/0x318
    [<ffffffff80cc1c36>] inet_stream_connect+0x3e/0x68
    [<ffffffff80b42b20>] __sys_connect_file+0x50/0x88
    [<ffffffff80b42bee>] __sys_connect+0x96/0xc8
    [<ffffffff80b42c40>] __riscv_sys_connect+0x20/0x30
    [<ffffffff80e5bcae>] do_trap_ecall_u+0x256/0x378
    [<ffffffff80e69af2>] handle_exception+0x14a/0x156
    Code: 892a 0363 1205 489c 8bc1 c7e5 2d03 084a 2703 080a (2783) 0709
    ---[ end trace 0000000000000000 ]---

The bpf_fifo_dequeue prog returns a skb which is a pointer.
The pointer is treated as a 32bit value and sign extend to
64bit in epilogue. This behavior is right for most bpf prog
types but wrong for struct ops which requires RISC-V ABI.

So let's sign extend struct ops return values according to
the function model and RISC-V ABI([0]).

  [0]: https://riscv.org/wp-content/uploads/2024/12/riscv-calling.pdf

Fixes: 25ad106 ("riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework")
Signed-off-by: Hengqi Chen <[email protected]>
Reviewed-by: Pu Lehui <[email protected]>
Tested-by: Pu Lehui <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 11, 2025
The ns_bpf_qdisc selftest triggers a kernel panic:

    Unable to handle kernel paging request at virtual address ffffffffa38dbf58
    Current test_progs pgtable: 4K pagesize, 57-bit VAs, pgdp=0x00000001109cc000
    [ffffffffa38dbf58] pgd=000000011fffd801, p4d=000000011fffd401, pud=000000011fffd001, pmd=0000000000000000
    Oops [#1]
    Modules linked in: bpf_testmod(OE) xt_conntrack nls_iso8859_1 dm_mod drm drm_panel_orientation_quirks configfs backlight btrfs blake2b_generic xor lzo_compress zlib_deflate raid6_pq efivarfs [last unloaded: bpf_testmod(OE)]
    CPU: 1 UID: 0 PID: 23584 Comm: test_progs Tainted: G        W  OE       6.17.0-rc1-g2465bb83e0b4 #1 NONE
    Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
    Hardware name: Unknown Unknown Product/Unknown Product, BIOS 2024.01+dfsg-1ubuntu5.1 01/01/2024
    epc : __qdisc_run+0x82/0x6f0
     ra : __qdisc_run+0x6e/0x6f0
    epc : ffffffff80bd5c7a ra : ffffffff80bd5c66 sp : ff2000000eecb550
     gp : ffffffff82472098 tp : ff60000096895940 t0 : ffffffff8001f180
     t1 : ffffffff801e1664 t2 : 0000000000000000 s0 : ff2000000eecb5d0
     s1 : ff60000093a6a600 a0 : ffffffffa38dbee8 a1 : 0000000000000001
     a2 : ff2000000eecb510 a3 : 0000000000000001 a4 : 0000000000000000
     a5 : 0000000000000010 a6 : 0000000000000000 a7 : 0000000000735049
     s2 : ffffffffa38dbee8 s3 : 0000000000000040 s4 : ff6000008bcda000
     s5 : 0000000000000008 s6 : ff60000093a6a680 s7 : ff60000093a6a6f0
     s8 : ff60000093a6a6ac s9 : ff60000093140000 s10: 0000000000000000
     s11: ff2000000eecb9d0 t3 : 0000000000000000 t4 : 0000000000ff0000
     t5 : 0000000000000000 t6 : ff60000093a6a8b6
    status: 0000000200000120 badaddr: ffffffffa38dbf58 cause: 000000000000000d
    [<ffffffff80bd5c7a>] __qdisc_run+0x82/0x6f0
    [<ffffffff80b6fe58>] __dev_queue_xmit+0x4c0/0x1128
    [<ffffffff80b80ae0>] neigh_resolve_output+0xd0/0x170
    [<ffffffff80d2daf6>] ip6_finish_output2+0x226/0x6c8
    [<ffffffff80d31254>] ip6_finish_output+0x10c/0x2a0
    [<ffffffff80d31446>] ip6_output+0x5e/0x178
    [<ffffffff80d2e232>] ip6_xmit+0x29a/0x608
    [<ffffffff80d6f4c6>] inet6_csk_xmit+0xe6/0x140
    [<ffffffff80c985e4>] __tcp_transmit_skb+0x45c/0xaa8
    [<ffffffff80c995fe>] tcp_connect+0x9ce/0xd10
    [<ffffffff80d66524>] tcp_v6_connect+0x4ac/0x5e8
    [<ffffffff80cc19b8>] __inet_stream_connect+0xd8/0x318
    [<ffffffff80cc1c36>] inet_stream_connect+0x3e/0x68
    [<ffffffff80b42b20>] __sys_connect_file+0x50/0x88
    [<ffffffff80b42bee>] __sys_connect+0x96/0xc8
    [<ffffffff80b42c40>] __riscv_sys_connect+0x20/0x30
    [<ffffffff80e5bcae>] do_trap_ecall_u+0x256/0x378
    [<ffffffff80e69af2>] handle_exception+0x14a/0x156
    Code: 892a 0363 1205 489c 8bc1 c7e5 2d03 084a 2703 080a (2783) 0709
    ---[ end trace 0000000000000000 ]---

The bpf_fifo_dequeue prog returns a skb which is a pointer.
The pointer is treated as a 32bit value and sign extend to
64bit in epilogue. This behavior is right for most bpf prog
types but wrong for struct ops which requires RISC-V ABI.

So let's sign extend struct ops return values according to
the function model and RISC-V ABI([0]).

  [0]: https://riscv.org/wp-content/uploads/2024/12/riscv-calling.pdf

Fixes: 25ad106 ("riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework")
Signed-off-by: Hengqi Chen <[email protected]>
Reviewed-by: Pu Lehui <[email protected]>
Tested-by: Pu Lehui <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 12, 2025
Avoid below overlapping mappings by using a contiguous
non-cacheable buffer.

[    4.077708] DMA-API: stm32_fmc2_nfc 48810000.nand-controller: cacheline tracking EEXIST,
overlapping mappings aren't supported
[    4.089103] WARNING: CPU: 1 PID: 44 at kernel/dma/debug.c:568 add_dma_entry+0x23c/0x300
[    4.097071] Modules linked in:
[    4.100101] CPU: 1 PID: 44 Comm: kworker/u4:2 Not tainted 6.1.82 #1
[    4.106346] Hardware name: STMicroelectronics STM32MP257F VALID1 SNOR / MB1704 (LPDDR4 Power discrete) + MB1703 + MB1708 (SNOR MB1730) (DT)
[    4.118824] Workqueue: events_unbound deferred_probe_work_func
[    4.124674] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    4.131624] pc : add_dma_entry+0x23c/0x300
[    4.135658] lr : add_dma_entry+0x23c/0x300
[    4.139792] sp : ffff800009dbb490
[    4.143016] x29: ffff800009dbb4a0 x28: 0000000004008022 x27: ffff8000098a6000
[    4.150174] x26: 0000000000000000 x25: ffff8000099e7000 x24: ffff8000099e7de8
[    4.157231] x23: 00000000ffffffff x22: 0000000000000000 x21: ffff8000098a6a20
[    4.164388] x20: ffff000080964180 x19: ffff800009819ba0 x18: 0000000000000006
[    4.171545] x17: 6361727420656e69 x16: 6c6568636163203a x15: 72656c6c6f72746e
[    4.178602] x14: 6f632d646e616e2e x13: ffff800009832f58 x12: 00000000000004ec
[    4.185759] x11: 00000000000001a4 x10: ffff80000988af58 x9 : ffff800009832f58
[    4.192916] x8 : 00000000ffffefff x7 : ffff80000988af58 x6 : 80000000fffff000
[    4.199972] x5 : 000000000000bff4 x4 : 0000000000000000 x3 : 0000000000000000
[    4.207128] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000812d2c40
[    4.214185] Call trace:
[    4.216605]  add_dma_entry+0x23c/0x300
[    4.220338]  debug_dma_map_sg+0x198/0x350
[    4.224373]  __dma_map_sg_attrs+0xa0/0x110
[    4.228411]  dma_map_sg_attrs+0x10/0x2c
[    4.232247]  stm32_fmc2_nfc_xfer.isra.0+0x1c8/0x3fc
[    4.237088]  stm32_fmc2_nfc_seq_read_page+0xc8/0x174
[    4.242127]  nand_read_oob+0x1d4/0x8e0
[    4.245861]  mtd_read_oob_std+0x58/0x84
[    4.249596]  mtd_read_oob+0x90/0x150
[    4.253231]  mtd_read+0x68/0xac

Signed-off-by: Christophe Kerello <[email protected]>
Cc: [email protected]
Fixes: 2cd457f ("mtd: rawnand: stm32_fmc2: add STM32 FMC2 NAND flash controller driver")
Signed-off-by: Miquel Raynal <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 12, 2025
Problem description
===================

Lockdep reports a possible circular locking dependency (AB/BA) between
&pl->state_mutex and &phy->lock, as follows.

phylink_resolve() // acquires &pl->state_mutex
-> phylink_major_config()
   -> phy_config_inband() // acquires &pl->phydev->lock

whereas all the other call sites where &pl->state_mutex and
&pl->phydev->lock have the locking scheme reversed. Everywhere else,
&pl->phydev->lock is acquired at the top level, and &pl->state_mutex at
the lower level. A clear example is phylink_bringup_phy().

The outlier is the newly introduced phy_config_inband() and the existing
lock order is the correct one. To understand why it cannot be the other
way around, it is sufficient to consider phylink_phy_change(), phylink's
callback from the PHY device's phy->phy_link_change() virtual method,
invoked by the PHY state machine.

phy_link_up() and phy_link_down(), the (indirect) callers of
phylink_phy_change(), are called with &phydev->lock acquired.
Then phylink_phy_change() acquires its own &pl->state_mutex, to
serialize changes made to its pl->phy_state and pl->link_config.
So all other instances of &pl->state_mutex and &phydev->lock must be
consistent with this order.

Problem impact
==============

I think the kernel runs a serious deadlock risk if an existing
phylink_resolve() thread, which results in a phy_config_inband() call,
is concurrent with a phy_link_up() or phy_link_down() call, which will
deadlock on &pl->state_mutex in phylink_phy_change(). Practically
speaking, the impact may be limited by the slow speed of the medium
auto-negotiation protocol, which makes it unlikely for the current state
to still be unresolved when a new one is detected, but I think the
problem is there. Nonetheless, the problem was discovered using lockdep.

Proposed solution
=================

Practically speaking, the phy_config_inband() requirement of having
phydev->lock acquired must transfer to the caller (phylink is the only
caller). There, it must bubble up until immediately before
&pl->state_mutex is acquired, for the cases where that takes place.

Solution details, considerations, notes
=======================================

This is the phy_config_inband() call graph:

                          sfp_upstream_ops :: connect_phy()
                          |
                          v
                          phylink_sfp_connect_phy()
                          |
                          v
                          phylink_sfp_config_phy()
                          |
                          |   sfp_upstream_ops :: module_insert()
                          |   |
                          |   v
                          |   phylink_sfp_module_insert()
                          |   |
                          |   |   sfp_upstream_ops :: module_start()
                          |   |   |
                          |   |   v
                          |   |   phylink_sfp_module_start()
                          |   |   |
                          |   v   v
                          |   phylink_sfp_config_optical()
 phylink_start()          |   |
   |   phylink_resume()   v   v
   |   |  phylink_sfp_set_config()
   |   |  |
   v   v  v
 phylink_mac_initial_config()
   |   phylink_resolve()
   |   |  phylink_ethtool_ksettings_set()
   v   v  v
   phylink_major_config()
            |
            v
    phy_config_inband()

phylink_major_config() caller #1, phylink_mac_initial_config(), does not
acquire &pl->state_mutex nor do its callers. It must acquire
&pl->phydev->lock prior to calling phylink_major_config().

phylink_major_config() caller #2, phylink_resolve() acquires
&pl->state_mutex, thus also needs to acquire &pl->phydev->lock.

phylink_major_config() caller #3, phylink_ethtool_ksettings_set(), is
completely uninteresting, because it only calls phylink_major_config()
if pl->phydev is NULL (otherwise it calls phy_ethtool_ksettings_set()).
We need to change nothing there.

Other solutions
===============

The lock inversion between &pl->state_mutex and &pl->phydev->lock has
occurred at least once before, as seen in commit c718af2 ("net:
phylink: fix ethtool -A with attached PHYs"). The solution there was to
simply not call phy_set_asym_pause() under the &pl->state_mutex. That
cannot be extended to our case though, where the phy_config_inband()
call is much deeper inside the &pl->state_mutex section.

Fixes: 5fd0f1a ("net: phylink: add negotiation of in-band capabilities")
Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Russell King (Oracle) <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 12, 2025
5da3d94 ("PCI: mvebu: Use for_each_of_range() iterator for parsing
"ranges"") simplified code by using the for_each_of_range() iterator, but
it broke PCI enumeration on Turris Omnia (and probably other mvebu
targets).

Issue #1:

To determine range.flags, of_pci_range_parser_one() uses bus->get_flags(),
which resolves to of_bus_pci_get_flags(), which already returns an
IORESOURCE bit field, and NOT the original flags from the "ranges"
resource.

Then mvebu_get_tgt_attr() attempts the very same conversion again.  Remove
the misinterpretation of range.flags in mvebu_get_tgt_attr(), to restore
the intended behavior.

Issue #2:

The driver needs target and attributes, which are encoded in the raw
address values of the "/soc/pcie/ranges" resource. According to
of_pci_range_parser_one(), the raw values are stored in range.bus_addr and
range.parent_bus_addr, respectively. range.cpu_addr is a translated version
of range.parent_bus_addr, and not relevant here.

Use the correct range structure member, to extract target and attributes.
This restores the intended behavior.

Fixes: 5da3d94 ("PCI: mvebu: Use for_each_of_range() iterator for parsing "ranges"")
Reported-by: Jan Palus <[email protected]>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220479
Signed-off-by: Klaus Kudielka <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Tested-by: Tony Dinh <[email protected]>
Tested-by: Jan Palus <[email protected]>
Link: https://patch.msgid.link/[email protected]
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 12, 2025
If request_irq() in i40e_vsi_request_irq_msix() fails in an iteration
later than the first, the error path wants to free the IRQs requested
so far. However, it uses the wrong dev_id argument for free_irq(), so
it does not free the IRQs correctly and instead triggers the warning:

 Trying to free already-free IRQ 173
 WARNING: CPU: 25 PID: 1091 at kernel/irq/manage.c:1829 __free_irq+0x192/0x2c0
 Modules linked in: i40e(+) [...]
 CPU: 25 UID: 0 PID: 1091 Comm: NetworkManager Not tainted 6.17.0-rc1+ #1 PREEMPT(lazy)
 Hardware name: [...]
 RIP: 0010:__free_irq+0x192/0x2c0
 [...]
 Call Trace:
  <TASK>
  free_irq+0x32/0x70
  i40e_vsi_request_irq_msix.cold+0x63/0x8b [i40e]
  i40e_vsi_request_irq+0x79/0x80 [i40e]
  i40e_vsi_open+0x21f/0x2f0 [i40e]
  i40e_open+0x63/0x130 [i40e]
  __dev_open+0xfc/0x210
  __dev_change_flags+0x1fc/0x240
  netif_change_flags+0x27/0x70
  do_setlink.isra.0+0x341/0xc70
  rtnl_newlink+0x468/0x860
  rtnetlink_rcv_msg+0x375/0x450
  netlink_rcv_skb+0x5c/0x110
  netlink_unicast+0x288/0x3c0
  netlink_sendmsg+0x20d/0x430
  ____sys_sendmsg+0x3a2/0x3d0
  ___sys_sendmsg+0x99/0xe0
  __sys_sendmsg+0x8a/0xf0
  do_syscall_64+0x82/0x2c0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [...]
  </TASK>
 ---[ end trace 0000000000000000 ]---

Use the same dev_id for free_irq() as for request_irq().

I tested this with inserting code to fail intentionally.

Fixes: 493fb30 ("i40e: Move q_vectors from pointer to array to array of pointers")
Signed-off-by: Michal Schmidt <[email protected]>
Reviewed-by: Aleksandr Loktionov <[email protected]>
Reviewed-by: Subbaraya Sundeep <[email protected]>
Tested-by: Rinitha S <[email protected]> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 12, 2025
Hangbin Liu says:

====================
hsr: fix lock warnings

hsr_for_each_port is called in many places without holding the RCU read
lock, this may trigger warnings on debug kernels like:

  [   40.457015] [  T201] WARNING: suspicious RCU usage
  [   40.457020] [  T201] 6.17.0-rc2-virtme #1 Not tainted
  [   40.457025] [  T201] -----------------------------
  [   40.457029] [  T201] net/hsr/hsr_main.c:137 RCU-list traversed in non-reader section!!
  [   40.457036] [  T201]
                          other info that might help us debug this:

  [   40.457040] [  T201]
                          rcu_scheduler_active = 2, debug_locks = 1
  [   40.457045] [  T201] 2 locks held by ip/201:
  [   40.457050] [  T201]  #0: ffffffff93040a40 (&ops->srcu){.+.+}-{0:0}, at: rtnl_link_ops_get+0xf2/0x280
  [   40.457080] [  T201]  #1: ffffffff92e7f968 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_newlink+0x5e1/0xb20
  [   40.457102] [  T201]
                          stack backtrace:
  [   40.457108] [  T201] CPU: 2 UID: 0 PID: 201 Comm: ip Not tainted 6.17.0-rc2-virtme #1 PREEMPT(full)
  [   40.457114] [  T201] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  [   40.457117] [  T201] Call Trace:
  [   40.457120] [  T201]  <TASK>
  [   40.457126] [  T201]  dump_stack_lvl+0x6f/0xb0
  [   40.457136] [  T201]  lockdep_rcu_suspicious.cold+0x4f/0xb1
  [   40.457148] [  T201]  hsr_port_get_hsr+0xfe/0x140
  [   40.457158] [  T201]  hsr_add_port+0x192/0x940
  [   40.457167] [  T201]  ? __pfx_hsr_add_port+0x10/0x10
  [   40.457176] [  T201]  ? lockdep_init_map_type+0x5c/0x270
  [   40.457189] [  T201]  hsr_dev_finalize+0x4bc/0xbf0
  [   40.457204] [  T201]  hsr_newlink+0x3c3/0x8f0
  [   40.457212] [  T201]  ? __pfx_hsr_newlink+0x10/0x10
  [   40.457222] [  T201]  ? rtnl_create_link+0x173/0xe40
  [   40.457233] [  T201]  rtnl_newlink_create+0x2cf/0x750
  [   40.457243] [  T201]  ? __pfx_rtnl_newlink_create+0x10/0x10
  [   40.457247] [  T201]  ? __dev_get_by_name+0x12/0x50
  [   40.457252] [  T201]  ? rtnl_dev_get+0xac/0x140
  [   40.457259] [  T201]  ? __pfx_rtnl_dev_get+0x10/0x10
  [   40.457285] [  T201]  __rtnl_newlink+0x22c/0xa50
  [   40.457305] [  T201]  rtnl_newlink+0x637/0xb20

Adding rcu_read_lock() for all hsr_for_each_port() looks confusing.

Introduce a new helper, hsr_for_each_port_rtnl(), that assumes the
RTNL lock is held. This allows callers in suitable contexts to iterate
ports safely without explicit RCU locking.

Other code paths that rely on RCU protection continue to use
hsr_for_each_port() with rcu_read_lock().
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 15, 2025
The ns_bpf_qdisc selftest triggers a kernel panic:

    Unable to handle kernel paging request at virtual address ffffffffa38dbf58
    Current test_progs pgtable: 4K pagesize, 57-bit VAs, pgdp=0x00000001109cc000
    [ffffffffa38dbf58] pgd=000000011fffd801, p4d=000000011fffd401, pud=000000011fffd001, pmd=0000000000000000
    Oops [#1]
    Modules linked in: bpf_testmod(OE) xt_conntrack nls_iso8859_1 dm_mod drm drm_panel_orientation_quirks configfs backlight btrfs blake2b_generic xor lzo_compress zlib_deflate raid6_pq efivarfs [last unloaded: bpf_testmod(OE)]
    CPU: 1 UID: 0 PID: 23584 Comm: test_progs Tainted: G        W  OE       6.17.0-rc1-g2465bb83e0b4 #1 NONE
    Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
    Hardware name: Unknown Unknown Product/Unknown Product, BIOS 2024.01+dfsg-1ubuntu5.1 01/01/2024
    epc : __qdisc_run+0x82/0x6f0
     ra : __qdisc_run+0x6e/0x6f0
    epc : ffffffff80bd5c7a ra : ffffffff80bd5c66 sp : ff2000000eecb550
     gp : ffffffff82472098 tp : ff60000096895940 t0 : ffffffff8001f180
     t1 : ffffffff801e1664 t2 : 0000000000000000 s0 : ff2000000eecb5d0
     s1 : ff60000093a6a600 a0 : ffffffffa38dbee8 a1 : 0000000000000001
     a2 : ff2000000eecb510 a3 : 0000000000000001 a4 : 0000000000000000
     a5 : 0000000000000010 a6 : 0000000000000000 a7 : 0000000000735049
     s2 : ffffffffa38dbee8 s3 : 0000000000000040 s4 : ff6000008bcda000
     s5 : 0000000000000008 s6 : ff60000093a6a680 s7 : ff60000093a6a6f0
     s8 : ff60000093a6a6ac s9 : ff60000093140000 s10: 0000000000000000
     s11: ff2000000eecb9d0 t3 : 0000000000000000 t4 : 0000000000ff0000
     t5 : 0000000000000000 t6 : ff60000093a6a8b6
    status: 0000000200000120 badaddr: ffffffffa38dbf58 cause: 000000000000000d
    [<ffffffff80bd5c7a>] __qdisc_run+0x82/0x6f0
    [<ffffffff80b6fe58>] __dev_queue_xmit+0x4c0/0x1128
    [<ffffffff80b80ae0>] neigh_resolve_output+0xd0/0x170
    [<ffffffff80d2daf6>] ip6_finish_output2+0x226/0x6c8
    [<ffffffff80d31254>] ip6_finish_output+0x10c/0x2a0
    [<ffffffff80d31446>] ip6_output+0x5e/0x178
    [<ffffffff80d2e232>] ip6_xmit+0x29a/0x608
    [<ffffffff80d6f4c6>] inet6_csk_xmit+0xe6/0x140
    [<ffffffff80c985e4>] __tcp_transmit_skb+0x45c/0xaa8
    [<ffffffff80c995fe>] tcp_connect+0x9ce/0xd10
    [<ffffffff80d66524>] tcp_v6_connect+0x4ac/0x5e8
    [<ffffffff80cc19b8>] __inet_stream_connect+0xd8/0x318
    [<ffffffff80cc1c36>] inet_stream_connect+0x3e/0x68
    [<ffffffff80b42b20>] __sys_connect_file+0x50/0x88
    [<ffffffff80b42bee>] __sys_connect+0x96/0xc8
    [<ffffffff80b42c40>] __riscv_sys_connect+0x20/0x30
    [<ffffffff80e5bcae>] do_trap_ecall_u+0x256/0x378
    [<ffffffff80e69af2>] handle_exception+0x14a/0x156
    Code: 892a 0363 1205 489c 8bc1 c7e5 2d03 084a 2703 080a (2783) 0709
    ---[ end trace 0000000000000000 ]---

The bpf_fifo_dequeue prog returns a skb which is a pointer.
The pointer is treated as a 32bit value and sign extend to
64bit in epilogue. This behavior is right for most bpf prog
types but wrong for struct ops which requires RISC-V ABI.

So let's sign extend struct ops return values according to
the function model and RISC-V ABI([0]).

  [0]: https://riscv.org/wp-content/uploads/2024/12/riscv-calling.pdf

Fixes: 25ad106 ("riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework")
Signed-off-by: Hengqi Chen <[email protected]>
Reviewed-by: Pu Lehui <[email protected]>
Tested-by: Pu Lehui <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 15, 2025
The ns_bpf_qdisc selftest triggers a kernel panic:

    Unable to handle kernel paging request at virtual address ffffffffa38dbf58
    Current test_progs pgtable: 4K pagesize, 57-bit VAs, pgdp=0x00000001109cc000
    [ffffffffa38dbf58] pgd=000000011fffd801, p4d=000000011fffd401, pud=000000011fffd001, pmd=0000000000000000
    Oops [#1]
    Modules linked in: bpf_testmod(OE) xt_conntrack nls_iso8859_1 [...] [last unloaded: bpf_testmod(OE)]
    CPU: 1 UID: 0 PID: 23584 Comm: test_progs Tainted: G        W  OE       6.17.0-rc1-g2465bb83e0b4 #1 NONE
    Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
    Hardware name: Unknown Unknown Product/Unknown Product, BIOS 2024.01+dfsg-1ubuntu5.1 01/01/2024
    epc : __qdisc_run+0x82/0x6f0
     ra : __qdisc_run+0x6e/0x6f0
    epc : ffffffff80bd5c7a ra : ffffffff80bd5c66 sp : ff2000000eecb550
     gp : ffffffff82472098 tp : ff60000096895940 t0 : ffffffff8001f180
     t1 : ffffffff801e1664 t2 : 0000000000000000 s0 : ff2000000eecb5d0
     s1 : ff60000093a6a600 a0 : ffffffffa38dbee8 a1 : 0000000000000001
     a2 : ff2000000eecb510 a3 : 0000000000000001 a4 : 0000000000000000
     a5 : 0000000000000010 a6 : 0000000000000000 a7 : 0000000000735049
     s2 : ffffffffa38dbee8 s3 : 0000000000000040 s4 : ff6000008bcda000
     s5 : 0000000000000008 s6 : ff60000093a6a680 s7 : ff60000093a6a6f0
     s8 : ff60000093a6a6ac s9 : ff60000093140000 s10: 0000000000000000
     s11: ff2000000eecb9d0 t3 : 0000000000000000 t4 : 0000000000ff0000
     t5 : 0000000000000000 t6 : ff60000093a6a8b6
    status: 0000000200000120 badaddr: ffffffffa38dbf58 cause: 000000000000000d
    [<ffffffff80bd5c7a>] __qdisc_run+0x82/0x6f0
    [<ffffffff80b6fe58>] __dev_queue_xmit+0x4c0/0x1128
    [<ffffffff80b80ae0>] neigh_resolve_output+0xd0/0x170
    [<ffffffff80d2daf6>] ip6_finish_output2+0x226/0x6c8
    [<ffffffff80d31254>] ip6_finish_output+0x10c/0x2a0
    [<ffffffff80d31446>] ip6_output+0x5e/0x178
    [<ffffffff80d2e232>] ip6_xmit+0x29a/0x608
    [<ffffffff80d6f4c6>] inet6_csk_xmit+0xe6/0x140
    [<ffffffff80c985e4>] __tcp_transmit_skb+0x45c/0xaa8
    [<ffffffff80c995fe>] tcp_connect+0x9ce/0xd10
    [<ffffffff80d66524>] tcp_v6_connect+0x4ac/0x5e8
    [<ffffffff80cc19b8>] __inet_stream_connect+0xd8/0x318
    [<ffffffff80cc1c36>] inet_stream_connect+0x3e/0x68
    [<ffffffff80b42b20>] __sys_connect_file+0x50/0x88
    [<ffffffff80b42bee>] __sys_connect+0x96/0xc8
    [<ffffffff80b42c40>] __riscv_sys_connect+0x20/0x30
    [<ffffffff80e5bcae>] do_trap_ecall_u+0x256/0x378
    [<ffffffff80e69af2>] handle_exception+0x14a/0x156
    Code: 892a 0363 1205 489c 8bc1 c7e5 2d03 084a 2703 080a (2783) 0709
    ---[ end trace 0000000000000000 ]---

The bpf_fifo_dequeue prog returns a skb which is a pointer. The pointer
is treated as a 32bit value and sign extend to 64bit in epilogue. This
behavior is right for most bpf prog types but wrong for struct ops which
requires RISC-V ABI.

So let's sign extend struct ops return values according to the function
model and RISC-V ABI([0]).

  [0]: https://riscv.org/wp-content/uploads/2024/12/riscv-calling.pdf

Fixes: 25ad106 ("riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework")
Signed-off-by: Hengqi Chen <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Pu Lehui <[email protected]>
Reviewed-by: Pu Lehui <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 19, 2025
…tack-analysis'

Eduard Zingerman says:

====================
bpf: replace path-sensitive with path-insensitive live stack analysis

Consider the following program, assuming checkpoint is created for a
state at instruction (3):

  1: call bpf_get_prandom_u32()
  2: *(u64 *)(r10 - 8) = 42
  -- checkpoint #1 --
  3: if r0 != 0 goto +1
  4: exit;
  5: r0 = *(u64 *)(r10 - 8)
  6: exit

The verifier processes this program by exploring two paths:
 - 1 -> 2 -> 3 -> 4
 - 1 -> 2 -> 3 -> 5 -> 6

When instruction (5) is processed, the current liveness tracking
mechanism moves up the register parent links and records a "read" mark
for stack slot -8 at checkpoint #1, stopping because of the "write"
mark recorded at instruction (2).

This patch set replaces the existing liveness tracking mechanism with
a path-insensitive data flow analysis. The program above is processed
as follows:
 - a data structure representing live stack slots for
   instructions 1-6 in frame #0 is allocated;
 - when instruction (2) is processed, record that slot -8 is written at
   instruction (2) in frame #0;
 - when instruction (5) is processed, record that slot -8 is read at
   instruction (5) in frame #0;
 - when instruction (6) is processed, propagate read mark for slot -8
   up the control flow graph to instructions 3 and 2.

The key difference is that the new mechanism operates on a control
flow graph and associates read and write marks with pairs of (call
chain, instruction index). In contrast, the old mechanism operates on
verifier states and register parent links, associating read and write
marks with verifier states.

Motivation
==========

As it stands, this patch set makes liveness tracking slightly less
precise, as it no longer distinguishes individual program paths taken
by the verifier during symbolic execution.
See the "Impact on verification performance" section for details.

However, this change is intended as a stepping stone toward the
following goals:
 - Short term, integrate precision tracking into liveness analysis and
   remove the following code:
   - verifier backedge states accumulation in is_state_visited();
   - most of the logic for precision tracking;
   - jump history tracking.
 - Long term, help with more efficient loop verification handling.

 Why integrating precision tracking?
 -----------------------------------

In a sense, precision tracking is very similar to liveness tracking.
The data flow equations for liveness tracking look as follows:

  live_after =
    U [state[s].live_before for s in insn_successors(i)]

  state[i].live_before =
    (live_after / state[i].must_write) U state[i].may_read

While data flow equations for precision tracking look as follows:

  precise_after =
    U [state[s].precise_before for s in insn_successors(i)]

  // if some of the instruction outputs are precise,
  // assume its inputs to be precise
  induced_precise =
    ⎧ state[i].may_read   if (state[i].may_write ∩ precise_after) ≠ ∅
    ⎨
    ⎩ ∅                   otherwise

  state[i].precise_before =
    (precise_after / state[i].must_write) ∩ induced_precise

Where:
 - `may_read` set represents a union of all possibly read slots
   (any slot in `may_read` set might be by the instruction);
 - `must_write` set represents an intersection of all possibly written slots
   (any slot in `must_write` set is guaranteed to be written by the instruction).
 - `may_write` set represents a union of all possibly written slots
   (any slot in `may_write` set might be written by the instruction).

This means that precision tracking can be implemented as a logical
extension of liveness tracking:
 - track registers as well as stack slots;
 - add bit masks to represent `precise_before` and `may_write`;
 - add above equations for `precise_before` computation;
 - (linked registers require some additional consideration).

Such extension would allow removal of:
 - precision propagation logic in verifier.c:
   - backtrack_insn()
   - mark_chain_precision()
   - propagate_{precision,backedges}()
 - push_jmp_history() and related data structures, which are only used
   by precision tracking;
 - add_scc_backedge() and related backedge state accumulation in
   is_state_visited(), superseded by per-callchain function state
   accumulated by liveness analysis.

The hope here is that unifying liveness and precision tracking will
reduce overall amount of code and make it easier to reason about.

 How this helps with loops?
 --------------------------

As it stands, this patch set shares the same deficiency as the current
liveness tracking mechanism. Liveness marks on stack slots cannot be
used to prune states when processing iterator-based loops:
 - such states still have branches to be explored;
 - meaning that not all stack slot reads have been discovered.

For example:

  1: while(iter_next()) {
  2:   if (...)
  3:     r0 = *(u64 *)(r10 - 8)
  4:   if (...)
  5:     r0 = *(u64 *)(r10 - 16)
  6:   ...
  7: }

For any checkpoint state created at instruction (1), it is only
possible to rely on read marks for slots fp[-8] and fp[-16] once all
child states of (1) have been explored. Thus, when the verifier
transitions from (7) to (1), it cannot rely on read marks.

However, sacrificing path-sensitivity makes it possible to run
analysis defined in this patch set before main verification pass,
if estimates for value ranges are available.
E.g. for the following program:

  1: while(iter_next()) {
  2:   r0 = r10
  3:   r0 += r2
  4:   r0 = *(u64 *)(r2 + 0)
  5:   ...
  6: }

If an estimate for `r2` range is available before the main
verification pass, it can be used to populate read marks at
instruction (4) and run the liveness analysis. Thus making
conservative liveness information available during loops verification.

Such estimates can be provided by some form of value range analysis.
Value range analysis is also necessary to address loop verification
from another angle: computing boundaries for loop induction variables
and iteration counts.

The hope here is that the new liveness tracking mechanism will support
the broader goal of making loop verification more efficient.

Validation
==========

The change was tested on three program sets:
 - bpf selftests
 - sched_ext
 - Meta's internal set of programs

Commit [#8] enables a special mode where both the current and new
liveness analyses are enabled simultaneously. This mode signals an
error if the new algorithm considers a stack slot dead while the
current algorithm assumes it is alive. This mode was very useful for
debugging. At the time of posting, no such errors have been reported
for the above program sets.

[#8] "bpf: signal error if old liveness is more conservative than new"

Impact on memory consumption
============================

Debug patch [1] extends the kernel and veristat to count the amount of
memory allocated for storing analysis data. This patch is not included
in the submission. The maximal observed impact for the above program
sets is 2.6Mb.

Data below is shown in bytes.

For bpf selftests top 5 consumers look as follows:

  File                     Program           liveness mem
  -----------------------  ----------------  ------------
  pyperf180.bpf.o          on_event               2629740
  pyperf600.bpf.o          on_event               2287662
  pyperf100.bpf.o          on_event               1427022
  test_verif_scale3.bpf.o  balancer_ingress       1121283
  pyperf_subprogs.bpf.o    on_event                756900

For sched_ext top 5 consumers loog as follows:

  File       Program                          liveness mem
  ---------  -------------------------------  ------------
  bpf.bpf.o  lavd_enqueue                           164686
  bpf.bpf.o  lavd_select_cpu                        157393
  bpf.bpf.o  layered_enqueue                        154817
  bpf.bpf.o  lavd_init                              127865
  bpf.bpf.o  layered_dispatch                       110129

For Meta's internal set of programs top consumer is 1Mb.

[1] kernel-patches/bpf@085588e

Impact on verification performance
==================================

Veristat results below are reported using
`-f insns_pct>1 -f !insns<500` filter and -t option
(BPF_F_TEST_STATE_FREQ flag).

 master vs patch-set, selftests (out of ~4K programs)
 ----------------------------------------------------

  File                              Program                                 Insns (A)  Insns (B)  Insns    (DIFF)
  --------------------------------  --------------------------------------  ---------  ---------  ---------------
  cpumask_success.bpf.o             test_global_mask_nested_deep_array_rcu       1622       1655     +33 (+2.03%)
  strobemeta_bpf_loop.bpf.o         on_event                                     2163       2684   +521 (+24.09%)
  test_cls_redirect.bpf.o           cls_redirect                                36001      42515  +6514 (+18.09%)
  test_cls_redirect_dynptr.bpf.o    cls_redirect                                 2299       2339     +40 (+1.74%)
  test_cls_redirect_subprogs.bpf.o  cls_redirect                                69545      78497  +8952 (+12.87%)
  test_l4lb_noinline.bpf.o          balancer_ingress                             2993       3084     +91 (+3.04%)
  test_xdp_noinline.bpf.o           balancer_ingress_v4                          3539       3616     +77 (+2.18%)
  test_xdp_noinline.bpf.o           balancer_ingress_v6                          3608       3685     +77 (+2.13%)

 master vs patch-set, sched_ext (out of 148 programs)
 ----------------------------------------------------

  File       Program           Insns (A)  Insns (B)  Insns    (DIFF)
  ---------  ----------------  ---------  ---------  ---------------
  bpf.bpf.o  chaos_dispatch         2257       2287     +30 (+1.33%)
  bpf.bpf.o  lavd_enqueue          20735      22101   +1366 (+6.59%)
  bpf.bpf.o  lavd_select_cpu       22100      24409  +2309 (+10.45%)
  bpf.bpf.o  layered_dispatch      25051      25606    +555 (+2.22%)
  bpf.bpf.o  p2dq_dispatch           961        990     +29 (+3.02%)
  bpf.bpf.o  rusty_quiescent         526        534      +8 (+1.52%)
  bpf.bpf.o  rusty_runnable          541        547      +6 (+1.11%)

Perf report
===========

In relative terms, the analysis does not consume much CPU time.
For example, here is a perf report collected for pyperf180 selftest:

 # Children      Self  Command   Shared Object         Symbol
 # ........  ........  ........  ....................  ........................................
        ...
      1.22%     1.22%  veristat  [kernel.kallsyms]     [k] bpf_update_live_stack
        ...

Changelog
=========

v1: https://lore.kernel.org/bpf/[email protected]/T/
v1 -> v2:
 - compute_postorder() fixed to handle jumps with offset -1 (syzbot).
 - is_state_visited() in patch #9 fixed access to uninitialized `err`
   (kernel test robot, Dan Carpenter).
 - Selftests added.
 - Fixed bug with write marks propagation from callee to caller,
   see verifier_live_stack.c:caller_stack_write() test case.
 - Added a patch for __not_msg() annotation for test_loader based
   tests.

v2: https://lore.kernel.org/bpf/20250918-callchain-sensitive-liveness-v2-0-214ed2653eee@gmail.com/
v2 -> v3:
 - Added __diag_ignore_all("-Woverride-init", ...) in liveness.c for
   bpf_insn_successors() (suggested by Alexei).

Signed-off-by: Eduard Zingerman <[email protected]>
====================

Link: https://patch.msgid.link/20250918-callchain-sensitive-liveness-v3-0-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 23, 2025
Leon Hwang says:

====================
bpf: Allow union argument in trampoline based programs

While tracing 'release_pages' with bpfsnoop[0], the verifier reports:

The function release_pages arg0 type UNION is unsupported.

However, it should be acceptable to trace functions that have 'union'
arguments.

This patch set enables such support in the verifier by allowing 'union'
as a valid argument type.

Changes:
v3 -> v4:
* Address comments from Alexei:
  * Trim bpftrace output in patch #1 log.
  * Drop the referenced commit info and the test output in patch #2 log.

v2 -> v3:
* Address comments from Alexei:
  * Reuse the existing flag BTF_FMODEL_STRUCT_ARG.
  * Update the comment of the flag BTF_FMODEL_STRUCT_ARG.

v1 -> v2:
* Add 16B 'union' argument support in x86_64 trampoline.
* Update selftests using bpf_testmod.
* Add test case about 16-bytes 'union' argument.
* Address comments from Alexei:
  * Study the patch set about 'struct' argument support.
  * Update selftests to cover more cases.
v1: https://lore.kernel.org/bpf/[email protected]/

Links:
[0] https://github.com/bpfsnoop/bpfsnoop
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 24, 2025
…ostcopy

When you run a KVM guest with vhost-net and migrate that guest to
another host, and you immediately enable postcopy after starting the
migration, there is a big chance that the network connection of the
guest won't work anymore on the destination side after the migration.

With a debug kernel v6.16.0, there is also a call trace that looks
like this:

 FAULT_FLAG_ALLOW_RETRY missing 881
 CPU: 6 UID: 0 PID: 549 Comm: kworker/6:2 Kdump: loaded Not tainted 6.16.0 #56 NONE
 Hardware name: IBM 3931 LA1 400 (LPAR)
 Workqueue: events irqfd_inject [kvm]
 Call Trace:
  [<00003173cbecc634>] dump_stack_lvl+0x104/0x168
  [<00003173cca69588>] handle_userfault+0xde8/0x1310
  [<00003173cc756f0c>] handle_pte_fault+0x4fc/0x760
  [<00003173cc759212>] __handle_mm_fault+0x452/0xa00
  [<00003173cc7599ba>] handle_mm_fault+0x1fa/0x6a0
  [<00003173cc73409a>] __get_user_pages+0x4aa/0xba0
  [<00003173cc7349e8>] get_user_pages_remote+0x258/0x770
  [<000031734be6f052>] get_map_page+0xe2/0x190 [kvm]
  [<000031734be6f910>] adapter_indicators_set+0x50/0x4a0 [kvm]
  [<000031734be7f674>] set_adapter_int+0xc4/0x170 [kvm]
  [<000031734be2f268>] kvm_set_irq+0x228/0x3f0 [kvm]
  [<000031734be27000>] irqfd_inject+0xd0/0x150 [kvm]
  [<00003173cc00c9ec>] process_one_work+0x87c/0x1490
  [<00003173cc00dda6>] worker_thread+0x7a6/0x1010
  [<00003173cc02dc36>] kthread+0x3b6/0x710
  [<00003173cbed2f0c>] __ret_from_fork+0xdc/0x7f0
  [<00003173cdd737ca>] ret_from_fork+0xa/0x30
 3 locks held by kworker/6:2/549:
  #0: 00000000800bc958 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x7ee/0x1490
  #1: 000030f3d527fbd0 ((work_completion)(&irqfd->inject)){+.+.}-{0:0}, at: process_one_work+0x81c/0x1490
  #2: 00000000f99862b0 (&mm->mmap_lock){++++}-{3:3}, at: get_map_page+0xa8/0x190 [kvm]

The "FAULT_FLAG_ALLOW_RETRY missing" indicates that handle_userfaultfd()
saw a page fault request without ALLOW_RETRY flag set, hence userfaultfd
cannot remotely resolve it (because the caller was asking for an immediate
resolution, aka, FAULT_FLAG_NOWAIT, while remote faults can take time).
With that, get_map_page() failed and the irq was lost.

We should not be strictly in an atomic environment here and the worker
should be sleepable (the call is done during an ioctl from userspace),
so we can allow adapter_indicators_set() to just sleep waiting for the
remote fault instead.

Link: https://issues.redhat.com/browse/RHEL-42486
Signed-off-by: Peter Xu <[email protected]>
[thuth: Assembled patch description and fixed some cosmetical issues]
Signed-off-by: Thomas Huth <[email protected]>
Reviewed-by: Claudio Imbrenda <[email protected]>
Acked-by: Janosch Frank <[email protected]>
Fixes: f654706 ("KVM: s390/interrupt: do not pin adapter interrupt pages")
[frankja: Added fixes tag]
Signed-off-by: Janosch Frank <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 24, 2025
The function ceph_process_folio_batch() sets folio_batch entries to
NULL, which is an illegal state.  Before folio_batch_release() crashes
due to this API violation, the function ceph_shift_unused_folios_left()
is supposed to remove those NULLs from the array.

However, since commit ce80b76 ("ceph: introduce
ceph_process_folio_batch() method"), this shifting doesn't happen
anymore because the "for" loop got moved to ceph_process_folio_batch(),
and now the `i` variable that remains in ceph_writepages_start()
doesn't get incremented anymore, making the shifting effectively
unreachable much of the time.

Later, commit 1551ec6 ("ceph: introduce ceph_submit_write()
method") added more preconditions for doing the shift, replacing the
`i` check (with something that is still just as broken):

- if ceph_process_folio_batch() fails, shifting never happens

- if ceph_move_dirty_page_in_page_array() was never called (because
  ceph_process_folio_batch() has returned early for some of various
  reasons), shifting never happens

- if `processed_in_fbatch` is zero (because ceph_process_folio_batch()
  has returned early for some of the reasons mentioned above or
  because ceph_move_dirty_page_in_page_array() has failed), shifting
  never happens

Since those two commits, any problem in ceph_process_folio_batch()
could crash the kernel, e.g. this way:

 BUG: kernel NULL pointer dereference, address: 0000000000000034
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 PGD 0 P4D 0
 Oops: Oops: 0002 [#1] SMP NOPTI
 CPU: 172 UID: 0 PID: 2342707 Comm: kworker/u778:8 Not tainted 6.15.10-cm4all1-es #714 NONE
 Hardware name: Dell Inc. PowerEdge R7615/0G9DHV, BIOS 1.6.10 12/08/2023
 Workqueue: writeback wb_workfn (flush-ceph-1)
 RIP: 0010:folios_put_refs+0x85/0x140
 Code: 83 c5 01 39 e8 7e 76 48 63 c5 49 8b 5c c4 08 b8 01 00 00 00 4d 85 ed 74 05 41 8b 44 ad 00 48 8b 15 b0 >
 RSP: 0018:ffffb880af8db778 EFLAGS: 00010207
 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000003
 RDX: ffffe377cc3b0000 RSI: 0000000000000000 RDI: ffffb880af8db8c0
 RBP: 0000000000000000 R08: 000000000000007d R09: 000000000102b86f
 R10: 0000000000000001 R11: 00000000000000ac R12: ffffb880af8db8c0
 R13: 0000000000000000 R14: 0000000000000000 R15: ffff9bd262c97000
 FS:  0000000000000000(0000) GS:ffff9c8efc303000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000034 CR3: 0000000160958004 CR4: 0000000000770ef0
 PKRU: 55555554
 Call Trace:
  <TASK>
  ceph_writepages_start+0xeb9/0x1410

The crash can be reproduced easily by changing the
ceph_check_page_before_write() return value to `-E2BIG`.

(Interestingly, the crash happens only if `huge_zero_folio` has
already been allocated; without `huge_zero_folio`,
is_huge_zero_folio(NULL) returns true and folios_put_refs() skips NULL
entries instead of dereferencing them.  That makes reproducing the bug
somewhat unreliable.  See
https://lore.kernel.org/[email protected]
for a discussion of this detail.)

My suggestion is to move the ceph_shift_unused_folios_left() to right
after ceph_process_folio_batch() to ensure it always gets called to
fix up the illegal folio_batch state.

Cc: [email protected]
Fixes: ce80b76 ("ceph: introduce ceph_process_folio_batch() method")
Link: https://lore.kernel.org/ceph-devel/[email protected]/
Signed-off-by: Max Kellermann <[email protected]>
Reviewed-by: Viacheslav Dubeyko <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 24, 2025
syzkaller has caught us red-handed once more, this time nesting regular
spinlocks behind raw spinlocks:

  =============================
  [ BUG: Invalid wait context ]
  6.16.0-rc3-syzkaller-g7b8346bd9fce #0 Not tainted
  -----------------------------
  syz.0.29/3743 is trying to lock:
  a3ff80008e2e9e18 (&xa->xa_lock#20){....}-{3:3}, at: vgic_put_irq+0xb4/0x190 arch/arm64/kvm/vgic/vgic.c:137
  other info that might help us debug this:
  context-{5:5}
  3 locks held by syz.0.29/3743:
   #0: a3ff80008e2e90a8 (&kvm->slots_lock){+.+.}-{4:4}, at: kvm_vgic_destroy+0x50/0x624 arch/arm64/kvm/vgic/vgic-init.c:499
   #1: a3ff80008e2e9fa0 (&kvm->arch.config_lock){+.+.}-{4:4}, at: kvm_vgic_destroy+0x5c/0x624 arch/arm64/kvm/vgic/vgic-init.c:500
   #2: 58f0000021be1428 (&vgic_cpu->ap_list_lock){....}-{2:2}, at: vgic_flush_pending_lpis+0x3c/0x31c arch/arm64/kvm/vgic/vgic.c:150
  stack backtrace:
  CPU: 0 UID: 0 PID: 3743 Comm: syz.0.29 Not tainted 6.16.0-rc3-syzkaller-g7b8346bd9fce #0 PREEMPT
  Hardware name: linux,dummy-virt (DT)
  Call trace:
   show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:466 (C)
   __dump_stack+0x30/0x40 lib/dump_stack.c:94
   dump_stack_lvl+0xd8/0x12c lib/dump_stack.c:120
   dump_stack+0x1c/0x28 lib/dump_stack.c:129
   print_lock_invalid_wait_context kernel/locking/lockdep.c:4833 [inline]
   check_wait_context kernel/locking/lockdep.c:4905 [inline]
   __lock_acquire+0x978/0x299c kernel/locking/lockdep.c:5190
   lock_acquire+0x14c/0x2e0 kernel/locking/lockdep.c:5871
   __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
   _raw_spin_lock_irqsave+0x5c/0x7c kernel/locking/spinlock.c:162
   vgic_put_irq+0xb4/0x190 arch/arm64/kvm/vgic/vgic.c:137
   vgic_flush_pending_lpis+0x24c/0x31c arch/arm64/kvm/vgic/vgic.c:158
   __kvm_vgic_vcpu_destroy+0x44/0x500 arch/arm64/kvm/vgic/vgic-init.c:455
   kvm_vgic_destroy+0x100/0x624 arch/arm64/kvm/vgic/vgic-init.c:505
   kvm_arch_destroy_vm+0x80/0x138 arch/arm64/kvm/arm.c:244
   kvm_destroy_vm virt/kvm/kvm_main.c:1308 [inline]
   kvm_put_kvm+0x800/0xff8 virt/kvm/kvm_main.c:1344
   kvm_vm_release+0x58/0x78 virt/kvm/kvm_main.c:1367
   __fput+0x4ac/0x980 fs/file_table.c:465
   ____fput+0x20/0x58 fs/file_table.c:493
   task_work_run+0x1bc/0x254 kernel/task_work.c:227
   resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
   do_notify_resume+0x1b4/0x270 arch/arm64/kernel/entry-common.c:151
   exit_to_user_mode_prepare arch/arm64/kernel/entry-common.c:169 [inline]
   exit_to_user_mode arch/arm64/kernel/entry-common.c:178 [inline]
   el0_svc+0xb4/0x160 arch/arm64/kernel/entry-common.c:768
   el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786
   el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600

This is of course no good, but is at odds with how LPI refcounts are
managed. Solve the locking mess by deferring the release of unreferenced
LPIs after the ap_list_lock is released. Mark these to-be-released LPIs
specially to avoid racing with vgic_put_irq() and causing a double-free.

Since references can only be taken on LPIs with a nonzero refcount,
extending the lifetime of freed LPIs is still safe.

Reviewed-by: Marc Zyngier <[email protected]>
Reported-by: [email protected]
Closes: https://lore.kernel.org/kvmarm/[email protected]/
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 24, 2025
Ido Schimmel says:

====================
ipv4: icmp: Fix source IP derivation in presence of VRFs

Align IPv4 with IPv6 and in the presence of VRFs generate ICMP error
messages with a source IP that is derived from the receiving interface
and not from its VRF master. This is especially important when the error
messages are "Time Exceeded" messages as it means that utilities like
traceroute will show an incorrect packet path.

Patches #1-#2 are preparations.

Patch #3 is the actual change.

Patches #4-#7 make small improvements in the existing traceroute test.

Patch #8 extends the traceroute test with VRF test cases for both IPv4
and IPv6.

Changes since v1 [1]:
* Rebase.

[1] https://lore.kernel.org/netdev/[email protected]/
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 24, 2025
Petr Machata says:

====================
bridge: Allow keeping local FDB entries only on VLAN 0

The bridge FDB contains one local entry per port per VLAN, for the MAC of
the port in question, and likewise for the bridge itself. This allows
bridge to locally receive and punt "up" any packets whose destination MAC
address matches that of one of the bridge interfaces or of the bridge
itself.

The number of these local "service" FDB entries grows linearly with number
of bridge-global VLAN memberships, but that in turn will tend to grow
quadratically with number of ports and per-port VLAN memberships. While
that does not cause issues during forwarding lookups, it does make dumps
impractically slow.

As an example, with 100 interfaces, each on 4K VLANs, a full dump of FDB
that just contains these 400K local entries, takes 6.5s. That's _without_
considering iproute2 formatting overhead, this is just how long it takes to
walk the FDB (repeatedly), serialize it into netlink messages, and parse
the messages back in userspace.

This is to illustrate that with growing number of ports and VLANs, the time
required to dump this repetitive information blows up. Arguably 4K VLANs
per interface is not a very realistic configuration, but then modern
switches can instead have several hundred interfaces, and we have fielded
requests for >1K VLAN memberships per port among customers.

FDB entries are currently all kept on a single linked list, and then
dumping uses this linked list to walk all entries and dump them in order.
When the message buffer is full, the iteration is cut short, and later
restarted. Of course, to restart the iteration, it's first necessary to
walk the already-dumped front part of the list before starting dumping
again. So one possibility is to organize the FDB entries in different
structure more amenable to walk restarts.

One option is to walk directly the hash table. The advantage is that no
auxiliary structure needs to be introduced. With a rough sketch of this
approach, the above scenario gets dumped in not quite 3 s, saving over 50 %
of time. However hash table iteration requires maintaining an active cursor
that must be collected when the dump is aborted. It looks like that would
require changes in the NDO protocol to allow to run this cleanup. Moreover,
on hash table resize the iteration is simply restarted. FDB dumps are
currently not guaranteed to correspond to any one particular state: entries
can be missed, or be duplicated. But with hash table iteration we would get
that plus the much less graceful resize behavior, where swaths of FDB are
duplicated.

Another option is to maintain the FDB entries in a red-black tree. We have
a PoC of this approach on hand, and the above scenario is dumped in about
2.5 s. Still not as snappy as we'd like it, but better than the hash table.
However the savings come at the expense of a more expensive insertion, and
require locking during dumps, which blocks insertion.

The upside of these approaches is that they provide benefits whatever the
FDB contents. But it does not seem like either of these is workable.
However we intend to clean up the RB tree PoC and present it for
consideration later on in case the trade-offs are considered acceptable.

Yet another option might be to use in-kernel FDB filtering, and to filter
the local entries when dumping. Unfortunately, this does not help all that
much either, because the linked-list walk still needs to happen. Also, with
the obvious filtering interface built around ndm_flags / ndm_state
filtering, one can't just exclude pure local entries in one query. One
needs to dump all non-local entries first, and then to get permanent
entries in another run filter local & added_by_user. I.e. one needs to pay
the iteration overhead twice, and then integrate the result in userspace.
To get significant savings, one would need a very specific knob like "dump,
but skip/only include local entries". But if we are adding a local-specific
knobs, maybe let's have an option to just not duplicate them in the first
place.

All this FDB duplication is there merely to make things snappy during
forwarding. But high-radix switches with thousands of VLANs typically do
not process much traffic in the SW datapath at all, but rather offload vast
majority of it. So we could exchange some of the runtime performance for a
neater FDB.

To that end, in this patchset, introduce a new bridge option,
BR_BOOLOPT_FDB_LOCAL_VLAN_0, which when enabled, has local FDB entries
installed only on VLAN 0, instead of duplicating them across all VLANs.
Then to maintain the local termination behavior, on FDB miss, the bridge
does a second lookup on VLAN 0.

Enabling this option changes the bridge behavior in expected ways. Since
the entries are only kept on VLAN 0, FDB get, flush and dump will not
perceive them on non-0 VLANs. And deleting the VLAN 0 entry affects
forwarding on all VLANs.

This patchset is loosely based on a privately circulated patch by Nikolay
Aleksandrov.

The patchset progresses as follows:

- Patch #1 introduces a bridge option to enable the above feature. Then
  patches #2 to #5 gradually patch the bridge to do the right thing when
  the option is enabled. Finally patch #6 adds the UAPI knob and the code
  for when the feature is enabled or disabled.
- Patches #7, #8 and #9 contain fixes and improvements to selftest
  libraries
- Patch #10 contains a new selftest
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 24, 2025
…PAIR

A NULL pointer dereference can occur in tcp_ao_finish_connect() during a
connect() system call on a socket with a TCP-AO key added and TCP_REPAIR
enabled.

The function is called with skb being NULL and attempts to dereference it
on tcp_hdr(skb)->seq without a prior skb validation.

Fix this by checking if skb is NULL before dereferencing it.

The commentary is taken from bpf_skops_established(), which is also called
in the same flow. Unlike the function being patched,
bpf_skops_established() validates the skb before dereferencing it.

int main(void){
	struct sockaddr_in sockaddr;
	struct tcp_ao_add tcp_ao;
	int sk;
	int one = 1;

	memset(&sockaddr,'\0',sizeof(sockaddr));
	memset(&tcp_ao,'\0',sizeof(tcp_ao));

	sk = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

	sockaddr.sin_family = AF_INET;

	memcpy(tcp_ao.alg_name,"cmac(aes128)",12);
	memcpy(tcp_ao.key,"ABCDEFGHABCDEFGH",16);
	tcp_ao.keylen = 16;

	memcpy(&tcp_ao.addr,&sockaddr,sizeof(sockaddr));

	setsockopt(sk, IPPROTO_TCP, TCP_AO_ADD_KEY, &tcp_ao,
	sizeof(tcp_ao));
	setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &one, sizeof(one));

	sockaddr.sin_family = AF_INET;
	sockaddr.sin_port = htobe16(123);

	inet_aton("127.0.0.1", &sockaddr.sin_addr);

	connect(sk,(struct sockaddr *)&sockaddr,sizeof(sockaddr));

return 0;
}

$ gcc tcp-ao-nullptr.c -o tcp-ao-nullptr -Wall
$ unshare -Urn

BUG: kernel NULL pointer dereference, address: 00000000000000b6
PGD 1f648d067 P4D 1f648d067 PUD 1982e8067 PMD 0
Oops: Oops: 0000 [#1] SMP NOPTI
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop
Reference Platform, BIOS 6.00 11/12/2020
RIP: 0010:tcp_ao_finish_connect (net/ipv4/tcp_ao.c:1182)

Fixes: 7c2ffaf ("net/tcp: Calculate TCP-AO traffic keys")
Signed-off-by: Anderson Nascimento <[email protected]>
Reviewed-by: Dmitry Safonov <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 24, 2025
When igc_led_setup() fails, igc_probe() fails and triggers kernel panic
in free_netdev() since unregister_netdev() is not called. [1]
This behavior can be tested using fault-injection framework, especially
the failslab feature. [2]

Since LED support is not mandatory, treat LED setup failures as
non-fatal and continue probe with a warning message, consequently
avoiding the kernel panic.

[1]
 kernel BUG at net/core/dev.c:12047!
 Oops: invalid opcode: 0000 [#1] SMP NOPTI
 CPU: 0 UID: 0 PID: 937 Comm: repro-igc-led-e Not tainted 6.17.0-rc4-enjuk-tnguy-00865-gc4940196ab02 #64 PREEMPT(voluntary)
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 RIP: 0010:free_netdev+0x278/0x2b0
 [...]
 Call Trace:
  <TASK>
  igc_probe+0x370/0x910
  local_pci_probe+0x3a/0x80
  pci_device_probe+0xd1/0x200
 [...]

[2]
 #!/bin/bash -ex

 FAILSLAB_PATH=/sys/kernel/debug/failslab/
 DEVICE=0000:00:05.0
 START_ADDR=$(grep " igc_led_setup" /proc/kallsyms \
         | awk '{printf("0x%s", $1)}')
 END_ADDR=$(printf "0x%x" $((START_ADDR + 0x100)))

 echo $START_ADDR > $FAILSLAB_PATH/require-start
 echo $END_ADDR > $FAILSLAB_PATH/require-end
 echo 1 > $FAILSLAB_PATH/times
 echo 100 > $FAILSLAB_PATH/probability
 echo N > $FAILSLAB_PATH/ignore-gfp-wait

 echo $DEVICE > /sys/bus/pci/drivers/igc/bind

Fixes: ea57870 ("igc: Add support for LEDs on i225/i226")
Signed-off-by: Kohei Enju <[email protected]>
Reviewed-by: Paul Menzel <[email protected]>
Reviewed-by: Aleksandr Loktionov <[email protected]>
Reviewed-by: Vitaly Lifshits <[email protected]>
Reviewed-by: Kurt Kanzenbach <[email protected]>
Tested-by: Mor Bar-Gabay <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 24, 2025
…rnal()

A crash was observed with the following output:

Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 1 UID: 0 PID: 2899 Comm: syz.2.399 Not tainted 6.17.0-rc5+ #5 PREEMPT(none)
RIP: 0010:trace_kprobe_create_internal+0x3fc/0x1440 kernel/trace/trace_kprobe.c:911
Call Trace:
 <TASK>
 trace_kprobe_create_cb+0xa2/0xf0 kernel/trace/trace_kprobe.c:1089
 trace_probe_create+0xf1/0x110 kernel/trace/trace_probe.c:2246
 dyn_event_create+0x45/0x70 kernel/trace/trace_dynevent.c:128
 create_or_delete_trace_kprobe+0x5e/0xc0 kernel/trace/trace_kprobe.c:1107
 trace_parse_run_command+0x1a5/0x330 kernel/trace/trace.c:10785
 vfs_write+0x2b6/0xd00 fs/read_write.c:684
 ksys_write+0x129/0x240 fs/read_write.c:738
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x5d/0x2d0 arch/x86/entry/syscall_64.c:94
 </TASK>

Function kmemdup() may return NULL in trace_kprobe_create_internal(), add
check for it's return value.

Link: https://lore.kernel.org/all/[email protected]/

Fixes: 33b4e38 ("tracing: kprobe-event: Allocate string buffers from heap")
Signed-off-by: Wang Liang <[email protected]>
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 25, 2025
clang-18.1.3 on Ubuntu 24.04.2 reports warning:

  thread_loop.c:41:23: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths]
     41 |                 : /* in */ [i] "r" (i), [len] "r" (len)
        |                                     ^
  thread_loop.c:37:8: note: use constraint modifier "w"
     37 |                 "add %[i], %[i], #1\n"
        |                      ^~~~
        |                      %w[i]
  thread_loop.c:41:23: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths]
     41 |                 : /* in */ [i] "r" (i), [len] "r" (len)
        |                                     ^
  thread_loop.c:37:14: note: use constraint modifier "w"
     37 |                 "add %[i], %[i], #1\n"
        |                            ^~~~
        |                            %w[i]
  thread_loop.c:41:23: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths]
     41 |                 : /* in */ [i] "r" (i), [len] "r" (len)
        |                                     ^
  thread_loop.c:38:8: note: use constraint modifier "w"
     38 |                 "cmp %[i], %[len]\n"
        |                      ^~~~
        |                      %w[i]
  thread_loop.c:41:38: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths]
     41 |                 : /* in */ [i] "r" (i), [len] "r" (len)
        |                                                    ^
  thread_loop.c:38:14: note: use constraint modifier "w"
     38 |                 "cmp %[i], %[len]\n"
        |                            ^~~~~~
        |                            %w[len]

Use the modifier "w" for 32-bit register access.

Signed-off-by: Leo Yan <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 25, 2025
Check in snd_intel_dsp_check_soundwire() that the pointer returned by
ACPI_HANDLE() is not NULL, before passing it on to other functions.

The original code assumed a non-NULL return, but if it was unexpectedly
NULL it would end up passed to acpi_walk_namespace() as the start
point, and would result in

[    3.219028] BUG: kernel NULL pointer dereference, address:
0000000000000018
[    3.219029] #PF: supervisor read access in kernel mode
[    3.219030] #PF: error_code(0x0000) - not-present page
[    3.219031] PGD 0 P4D 0
[    3.219032] Oops: Oops: 0000 [#1] SMP NOPTI
[    3.219035] CPU: 2 UID: 0 PID: 476 Comm: (udev-worker) Tainted: G S
AW   E       6.17.0-rc5-test #1 PREEMPT(voluntary)
[    3.219038] Tainted: [S]=CPU_OUT_OF_SPEC, [A]=OVERRIDDEN_ACPI_TABLE,
[W]=WARN, [E]=UNSIGNED_MODULE
[    3.219040] RIP: 0010:acpi_ns_walk_namespace+0xb5/0x480

This problem was triggered by a bugged DSDT that the kernel couldn't parse.
But it shouldn't be possible to SEGFAULT the kernel just because of some
bugs in ACPI.

Fixes: 0650857 ("ALSA: hda: add autodetection for SoundWire")
Signed-off-by: Richard Fitzgerald <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 25, 2025
…own_locked()

When

  9770b42 ("crypto: ccp - Move dev_info/err messages for SEV/SNP init and shutdown")

moved the error messages dumping so that they don't need to be issued by
the callers, it missed the case where __sev_firmware_shutdown() calls
__sev_platform_shutdown_locked() with a NULL argument which leads to
a NULL ptr deref on the shutdown path, during suspend to disk:

  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000 [#1] SMP NOPTI
  CPU: 0 UID: 0 PID: 983 Comm: hib.sh Not tainted 6.17.0-rc4+ #1 PREEMPT(voluntary)
  Hardware name: Supermicro Super Server/H12SSL-i, BIOS 2.5 09/08/2022
  RIP: 0010:__sev_platform_shutdown_locked.cold+0x0/0x21 [ccp]

That rIP is:

  00000000000006fd <__sev_platform_shutdown_locked.cold>:
   6fd:   8b 13                   mov    (%rbx),%edx
   6ff:   48 8b 7d 00             mov    0x0(%rbp),%rdi
   703:   89 c1                   mov    %eax,%ecx

  Code: 74 05 31 ff 41 89 3f 49 8b 3e 89 ea 48 c7 c6 a0 8e 54 a0 41 bf 92 ff ff ff e8 e5 2e 09 e1 c6 05 2a d4 38 00 01 e9 26 af ff ff <8b> 13 48 8b 7d 00 89 c1 48 c7 c6 18 90 54 a0 89 44 24 04 e8 c1 2e
  RSP: 0018:ffffc90005467d00 EFLAGS: 00010282
  RAX: 00000000ffffff92 RBX: 0000000000000000 RCX: 0000000000000000
  			     ^^^^^^^^^^^^^^^^
and %rbx is nice and clean.

  Call Trace:
   <TASK>
   __sev_firmware_shutdown.isra.0
   sev_dev_destroy
   psp_dev_destroy
   sp_destroy
   pci_device_shutdown
   device_shutdown
   kernel_power_off
   hibernate.cold
   state_store
   kernfs_fop_write_iter
   vfs_write
   ksys_write
   do_syscall_64
   entry_SYSCALL_64_after_hwframe

Pass in a pointer to the function-local error var in the caller.

With that addressed, suspending the ccp shows the error properly at
least:

  ccp 0000:47:00.1: sev command 0x2 timed out, disabling PSP
  ccp 0000:47:00.1: SEV: failed to SHUTDOWN error 0x0, rc -110
  SEV-SNP: Leaking PFN range 0x146800-0x146a00
  SEV-SNP: PFN 0x146800 unassigned, dumping non-zero entries in 2M PFN region: [0x146800 - 0x146a00]
  ...
  ccp 0000:47:00.1: SEV-SNP firmware shutdown failed, rc -16, error 0x0
  ACPI: PM: Preparing to enter system sleep state S5
  kvm: exiting hardware virtualization
  reboot: Power down

Btw, this driver is crying to be cleaned up to pass in a proper I/O
struct which can be used to store information between the different
functions, otherwise stuff like that will happen in the future again.

Fixes: 9770b42 ("crypto: ccp - Move dev_info/err messages for SEV/SNP init and shutdown")
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Cc: <[email protected]>
Reviewed-by: Ashish Kalra <[email protected]>
Acked-by: Tom Lendacky <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 25, 2025
clang-18.1.3 on Ubuntu 24.04.2 reports warning:

  thread_loop.c:41:23: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths]
     41 |                 : /* in */ [i] "r" (i), [len] "r" (len)
        |                                     ^
  thread_loop.c:37:8: note: use constraint modifier "w"
     37 |                 "add %[i], %[i], #1\n"
        |                      ^~~~
        |                      %w[i]
  thread_loop.c:41:23: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths]
     41 |                 : /* in */ [i] "r" (i), [len] "r" (len)
        |                                     ^
  thread_loop.c:37:14: note: use constraint modifier "w"
     37 |                 "add %[i], %[i], #1\n"
        |                            ^~~~
        |                            %w[i]
  thread_loop.c:41:23: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths]
     41 |                 : /* in */ [i] "r" (i), [len] "r" (len)
        |                                     ^
  thread_loop.c:38:8: note: use constraint modifier "w"
     38 |                 "cmp %[i], %[len]\n"
        |                      ^~~~
        |                      %w[i]
  thread_loop.c:41:38: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths]
     41 |                 : /* in */ [i] "r" (i), [len] "r" (len)
        |                                                    ^
  thread_loop.c:38:14: note: use constraint modifier "w"
     38 |                 "cmp %[i], %[len]\n"
        |                            ^~~~~~
        |                            %w[len]

Use the modifier "w" for 32-bit register access.

Signed-off-by: Leo Yan <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 29, 2025
Add 0x29 as the accelerometer address for the Dell Latitude E6530 to
lis3lv02d_devices[].

The address was verified as below:

    $ cd /sys/bus/pci/drivers/i801_smbus/0000:00:1f.3
    $ ls -d i2c-*
    i2c-20
    $ sudo modprobe i2c-dev
    $ sudo i2cdetect 20
    WARNING! This program can confuse your I2C bus, cause data loss and worse!
    I will probe file /dev/i2c-20.
    I will probe address range 0x08-0x77.
    Continue? [Y/n] Y
         0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
    00:                         08 -- -- -- -- -- -- --
    10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
    20: -- -- -- -- -- -- -- -- -- UU -- 2b -- -- -- --
    30: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
    40: -- -- -- -- 44 -- -- -- -- -- -- -- -- -- -- --
    50: UU -- 52 -- -- -- -- -- -- -- -- -- -- -- -- --
    60: -- 61 -- -- -- -- -- -- -- -- -- -- -- -- -- --
    70: -- -- -- -- -- -- -- --
    $ cat /proc/cmdline
    BOOT_IMAGE=/vmlinuz-linux-cachyos-bore root=UUID=<redacted> rw loglevel=3 quiet dell_lis3lv02d.probe_i2c_addr=1
    $ sudo dmesg
    [    0.000000] Linux version 6.16.6-2-cachyos-bore (linux-cachyos-bore@cachyos) (gcc (GCC) 15.2.1 20250813, GNU ld (GNU Binutils) 2.45.0) #1 SMP PREEMPT_DYNAMIC Thu, 11 Sep 2025 16:01:12 +0000
    […]
    [    0.000000] DMI: Dell Inc. Latitude E6530/07Y85M, BIOS A22 11/30/2018
    […]
    [    5.166442] i2c i2c-20: Probing for lis3lv02d on address 0x29
    [    5.167854] i2c i2c-20: Detected lis3lv02d on address 0x29, please report this upstream to [email protected] so that a quirk can be added

Signed-off-by: Nickolay Goppen <[email protected]>
Reviewed-by: Hans de Goede <[email protected]>
Link: https://patch.msgid.link/20250917-dell-lis3lv02d-latitude-e6530-v1-1-8a6dec4e51e9@mainlining.org
Reviewed-by: Ilpo Järvinen <[email protected]>
Signed-off-by: Ilpo Järvinen <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 29, 2025
The kernel forbids the creation of non-FDB nexthop groups with FDB
nexthops:

 # ip nexthop add id 1 via 192.0.2.1 fdb
 # ip nexthop add id 2 group 1
 Error: Non FDB nexthop group cannot have fdb nexthops.

And vice versa:

 # ip nexthop add id 3 via 192.0.2.2 dev dummy1
 # ip nexthop add id 4 group 3 fdb
 Error: FDB nexthop group can only have fdb nexthops.

However, as long as no routes are pointing to a non-FDB nexthop group,
the kernel allows changing the type of a nexthop from FDB to non-FDB and
vice versa:

 # ip nexthop add id 5 via 192.0.2.2 dev dummy1
 # ip nexthop add id 6 group 5
 # ip nexthop replace id 5 via 192.0.2.2 fdb
 # echo $?
 0

This configuration is invalid and can result in a NPD [1] since FDB
nexthops are not associated with a nexthop device:

 # ip route add 198.51.100.1/32 nhid 6
 # ping 198.51.100.1

Fix by preventing nexthop FDB status change while the nexthop is in a
group:

 # ip nexthop add id 7 via 192.0.2.2 dev dummy1
 # ip nexthop add id 8 group 7
 # ip nexthop replace id 7 via 192.0.2.2 fdb
 Error: Cannot change nexthop FDB status while in a group.

[1]
BUG: kernel NULL pointer dereference, address: 00000000000003c0
[...]
Oops: Oops: 0000 [#1] SMP
CPU: 6 UID: 0 PID: 367 Comm: ping Not tainted 6.17.0-rc6-virtme-gb65678cacc03 #1 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-4.fc41 04/01/2014
RIP: 0010:fib_lookup_good_nhc+0x1e/0x80
[...]
Call Trace:
 <TASK>
 fib_table_lookup+0x541/0x650
 ip_route_output_key_hash_rcu+0x2ea/0x970
 ip_route_output_key_hash+0x55/0x80
 __ip4_datagram_connect+0x250/0x330
 udp_connect+0x2b/0x60
 __sys_connect+0x9c/0xd0
 __x64_sys_connect+0x18/0x20
 do_syscall_64+0xa4/0x2a0
 entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fixes: 38428d6 ("nexthop: support for fdb ecmp nexthops")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Tested-by: [email protected]
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 29, 2025
Ido Schimmel says:

====================
nexthop: Various fixes

Patch #1 fixes a NPD that was recently reported by syzbot.

Patch #2 fixes an issue in the existing FIB nexthop selftest.

Patch #3 extends the selftest with test cases for the bug that was fixed
in the first patch.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 29, 2025
Running sha224_kunit on a KMSAN-enabled kernel results in a crash in
kmsan_internal_set_shadow_origin():

    BUG: unable to handle page fault for address: ffffbc3840291000
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 1810067 P4D 1810067 PUD 192d067 PMD 3c17067 PTE 0
    Oops: 0000 [#1] SMP NOPTI
    CPU: 0 UID: 0 PID: 81 Comm: kunit_try_catch Tainted: G                 N  6.17.0-rc3 #10 PREEMPT(voluntary)
    Tainted: [N]=TEST
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
    RIP: 0010:kmsan_internal_set_shadow_origin+0x91/0x100
    [...]
    Call Trace:
    <TASK>
    __msan_memset+0xee/0x1a0
    sha224_final+0x9e/0x350
    test_hash_buffer_overruns+0x46f/0x5f0
    ? kmsan_get_shadow_origin_ptr+0x46/0xa0
    ? __pfx_test_hash_buffer_overruns+0x10/0x10
    kunit_try_run_case+0x198/0xa00

This occurs when memset() is called on a buffer that is not 4-byte aligned
and extends to the end of a guard page, i.e.  the next page is unmapped.

The bug is that the loop at the end of kmsan_internal_set_shadow_origin()
accesses the wrong shadow memory bytes when the address is not 4-byte
aligned.  Since each 4 bytes are associated with an origin, it rounds the
address and size so that it can access all the origins that contain the
buffer.  However, when it checks the corresponding shadow bytes for a
particular origin, it incorrectly uses the original unrounded shadow
address.  This results in reads from shadow memory beyond the end of the
buffer's shadow memory, which crashes when that memory is not mapped.

To fix this, correctly align the shadow address before accessing the 4
shadow bytes corresponding to each origin.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 2ef3cec ("kmsan: do not wipe out origin when doing partial unpoisoning")
Signed-off-by: Eric Biggers <[email protected]>
Tested-by: Alexander Potapenko <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Sep 29, 2025
When the PAGEMAP_SCAN ioctl is invoked with vec_len = 0 reaches
pagemap_scan_backout_range(), kernel panics with null-ptr-deref:

[   44.936808] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   44.937797] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[   44.938391] CPU: 1 UID: 0 PID: 2480 Comm: reproducer Not tainted 6.17.0-rc6 #22 PREEMPT(none)
[   44.939062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   44.939935] RIP: 0010:pagemap_scan_thp_entry.isra.0+0x741/0xa80

<snip registers, unreliable trace>

[   44.946828] Call Trace:
[   44.947030]  <TASK>
[   44.949219]  pagemap_scan_pmd_entry+0xec/0xfa0
[   44.952593]  walk_pmd_range.isra.0+0x302/0x910
[   44.954069]  walk_pud_range.isra.0+0x419/0x790
[   44.954427]  walk_p4d_range+0x41e/0x620
[   44.954743]  walk_pgd_range+0x31e/0x630
[   44.955057]  __walk_page_range+0x160/0x670
[   44.956883]  walk_page_range_mm+0x408/0x980
[   44.958677]  walk_page_range+0x66/0x90
[   44.958984]  do_pagemap_scan+0x28d/0x9c0
[   44.961833]  do_pagemap_cmd+0x59/0x80
[   44.962484]  __x64_sys_ioctl+0x18d/0x210
[   44.962804]  do_syscall_64+0x5b/0x290
[   44.963111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

vec_len = 0 in pagemap_scan_init_bounce_buffer() means no buffers are
allocated and p->vec_buf remains set to NULL.

This breaks an assumption made later in pagemap_scan_backout_range(), that
page_region is always allocated for p->vec_buf_index.

Fix it by explicitly checking p->vec_buf for NULL before dereferencing.

Other sites that might run into same deref-issue are already (directly or
transitively) protected by checking p->vec_buf.

Note:
From PAGEMAP_SCAN man page, it seems vec_len = 0 is valid when no output
is requested and it's only the side effects caller is interested in,
hence it passes check in pagemap_scan_get_args().

This issue was found by syzkaller.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 52526ca ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Jakub Acs <[email protected]>
Reviewed-by: Muhammad Usama Anjum <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Jinjiang Tu <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Penglei Jiang <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: "Michał Mirosław" <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.