[test] master_test #1

kernel-patches-bot · 2022-02-15T17:04:46Z

branch: master_test
base:bpf-next
version: edc21dc

Since we've started treating fallocate more like a file write, we should flush the log to disk if the user has asked for synchronous writes either by setting it via fcntl flags, or inode flags, or with the sync mount option. We've already got a helper for this, so use it. [The original patch by Darrick was massaged by Dave to fit this patchset] Signed-off-by: Darrick J. Wong <[email protected]> Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]>

Unlike .queue_rq, in .submit_async_event drivers may not check the ctrl readiness for AER submission. This may lead to a use-after-free condition that was observed with nvme-tcp. The race condition may happen in the following scenario: 1. driver executes its reset_ctrl_work 2. -> nvme_stop_ctrl - flushes ctrl async_event_work 3. ctrl sends AEN which is received by the host, which in turn schedules AEN handling 4. teardown admin queue (which releases the queue socket) 5. AEN processed, submits another AER, calling the driver to submit 6. driver attempts to send the cmd ==> use-after-free In order to fix that, add ctrl state check to validate the ctrl is actually able to accept the AER submission. This addresses the above race in controller resets because the driver during teardown should: 1. change ctrl state to RESETTING 2. flush async_event_work (as well as other async work elements) So after 1,2, any other AER command will find the ctrl state to be RESETTING and bail out without submitting the AER. Signed-off-by: Sagi Grimberg <[email protected]>

While nvme_tcp_submit_async_event_work is checking the ctrl and queue state before preparing the AER command and scheduling io_work, in order to fully prevent a race where this check is not reliable the error recovery work must flush async_event_work before continuing to destroy the admin queue after setting the ctrl state to RESETTING such that there is no race .submit_async_event and the error recovery handler itself changing the ctrl state. Tested-by: Chris Leech <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]>

While nvme_rdma_submit_async_event_work is checking the ctrl and queue state before preparing the AER command and scheduling io_work, in order to fully prevent a race where this check is not reliable the error recovery work must flush async_event_work before continuing to destroy the admin queue after setting the ctrl state to RESETTING such that there is no race .submit_async_event and the error recovery handler itself changing the ctrl state. Signed-off-by: Sagi Grimberg <[email protected]>

Refuse SIDA memops on guests which are not protected. For normal guests, the secure instruction data address designation, which determines the location we access, is not under control of KVM. Fixes: 19e1227 (KVM: S390: protvirt: Introduce instruction data area bounce buffer) Signed-off-by: Janis Schoetterl-Glausch <[email protected]> Cc: [email protected] Signed-off-by: Christian Borntraeger <[email protected]>

Kyle reported that rr[0] has started to malfunction on Comet Lake and later CPUs due to EFI starting to make use of CPL3 [1] and the PMU event filtering not distinguishing between regular CPL3 and SMM CPL3. Since this is a privilege violation, default disable SMM visibility where possible. Administrators wanting to observe SMM cycles can easily change this using the sysfs attribute while regular users don't have access to this file. [0] https://rr-project.org/ [1] See the Intel white paper "Trustworthy SMM on the Intel vPro Platform" at https://bugzilla.kernel.org/attachment.cgi?id=300300, particularly the end of page 5. Reported-by: Kyle Huey <[email protected]> Suggested-by: Andrew Cooper <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]

The intent has always been that perf_event_attr::sig_data should also be modifiable along with PERF_EVENT_IOC_MODIFY_ATTRIBUTES, because it is observable by user space if SIGTRAP on events is requested. Currently only PERF_TYPE_BREAKPOINT is modifiable, and explicitly copies relevant breakpoint-related attributes in hw_breakpoint_copy_attr(). This misses copying perf_event_attr::sig_data. Since sig_data is not specific to PERF_TYPE_BREAKPOINT, introduce a helper to copy generic event-type-independent attributes on modification. Fixes: 97ba62b ("perf: Add support for SIGTRAP on perf events") Reported-by: Dmitry Vyukov <[email protected]> Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Dmitry Vyukov <[email protected]> Link: https://lore.kernel.org/r/[email protected]

Test that PERF_EVENT_IOC_MODIFY_ATTRIBUTES correctly modifies perf_event_attr::sig_data as well. Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Dmitry Vyukov <[email protected]> Link: https://lore.kernel.org/r/[email protected]

…rchitectures Due to the alignment requirements of siginfo_t, as described in 3ddb3fd ("signal, perf: Fix siginfo_t by avoiding u64 on 32-bit architectures"), siginfo_t::si_perf_data is limited to an unsigned long. However, perf_event_attr::sig_data is an u64, to avoid having to deal with compat conversions. Due to being an u64, it may not immediately be clear to users that sig_data is truncated on 32 bit architectures. Add a comment to explicitly point this out, and hopefully help some users save time by not having to deduce themselves what's happening. Reported-by: Dmitry Vyukov <[email protected]> Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Dmitry Vyukov <[email protected]> Link: https://lore.kernel.org/r/[email protected]

Add a check for !buf->single before calling pt_buffer_region_size in a place where a missing check can cause a kernel crash. Fixes a bug introduced by commit 6706384 ("perf/x86/intel/pt: Opportunistically use single range output mode"), which added a support for PT single-range output mode. Since that commit if a PT stop filter range is hit while tracing, the kernel will crash because of a null pointer dereference in pt_handle_status due to calling pt_buffer_region_size without a ToPA configured. The commit which introduced single-range mode guarded almost all uses of the ToPA buffer variables with checks of the buf->single variable, but missed the case where tracing was stopped by the PT hardware, which happens when execution hits a configured stop filter. Tested that hitting a stop filter while PT recording successfully records a trace with this patch but crashes without this patch. Fixes: 6706384 ("perf/x86/intel/pt: Opportunistically use single range output mode") Signed-off-by: Tristan Hume <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Adrian Hunter <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]

In kvm_arch_vcpu_ioctl_run() we enter an RCU extended quiescent state (EQS) by calling guest_enter_irqoff(), and unmask IRQs prior to exiting the EQS by calling guest_exit(). As the IRQ entry code will not wake RCU in this case, we may run the core IRQ code and IRQ handler without RCU watching, leading to various potential problems. Additionally, we do not inform lockdep or tracing that interrupts will be enabled during guest execution, which caan lead to misleading traces and warnings that interrupts have been enabled for overly-long periods. This patch fixes these issues by using the new timing and context entry/exit helpers to ensure that interrupts are handled during guest vtime but with RCU watching, with a sequence: guest_timing_enter_irqoff(); guest_state_enter_irqoff(); < run the vcpu > guest_state_exit_irqoff(); < take any pending IRQs > guest_timing_exit_irqoff(); Since instrumentation may make use of RCU, we must also ensure that no instrumented code is run during the EQS. I've split out the critical section into a new kvm_riscv_enter_exit_vcpu() helper which is marked noinstr. Fixes: 99cdc6c ("RISC-V: Add initial skeletal KVM support") Signed-off-by: Mark Rutland <[email protected]> Cc: Albert Ou <[email protected]> Cc: Anup Patel <[email protected]> Cc: Atish Patra <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Paul Walmsley <[email protected]> Tested-by: Anup Patel <[email protected]> Signed-off-by: Anup Patel <[email protected]>

Those applications that run in VU mode and access the time CSR cause a virtual instruction trap as Guest kernel currently does not initialize the scounteren CSR. To fix this, we should make CY, TM, and IR counters accessibile by default in VU mode (similar to OpenSBI). Fixes: a33c72f ("RISC-V: KVM: Implement VCPU create, init and destroy functions") Cc: [email protected] Signed-off-by: Mayuresh Chitale <[email protected]> Signed-off-by: Anup Patel <[email protected]>

The SBI implementation version returned by KVM RISC-V should be the Host Linux version code. Fixes: c62a768 ("RISC-V: KVM: Add SBI v0.2 base extension") Signed-off-by: Anup Patel <[email protected]> Reviewed-by: Atish Patra <[email protected]> Signed-off-by: Anup Patel <[email protected]>

…from TODO list)" This reverts commit b3ec8cd. Revert the second (of 2) commits which disabled scrolling acceleration in fbcon/fbdev. It introduced a regression for fbdev-supported graphic cards because of the performance penalty by doing screen scrolling by software instead of using the existing graphic card 2D hardware acceleration. Console scrolling acceleration was disabled by dropping code which checked at runtime the driver hardware capabilities for the BINFO_HWACCEL_COPYAREA or FBINFO_HWACCEL_FILLRECT flags and if set, it enabled scrollmode SCROLL_MOVE which uses hardware acceleration to move screen contents. After dropping those checks scrollmode was hard-wired to SCROLL_REDRAW instead, which forces all graphic cards to redraw every character at the new screen position when scrolling. This change effectively disabled all hardware-based scrolling acceleration for ALL drivers, because now all kind of 2D hardware acceleration (bitblt, fillrect) in the drivers isn't used any longer. The original commit message mentions that only 3 DRM drivers (nouveau, omapdrm and gma500) used hardware acceleration in the past and thus code for checking and using scrolling acceleration is obsolete. This statement is NOT TRUE, because beside the DRM drivers there are around 35 other fbdev drivers which depend on fbdev/fbcon and still provide hardware acceleration for fbdev/fbcon. The original commit message also states that syzbot found lots of bugs in fbcon and thus it's "often the solution to just delete code and remove features". This is true, and the bugs - which actually affected all users of fbcon, including DRM - were fixed, or code was dropped like e.g. the support for software scrollback in vgacon (commit 973c096). So to further analyze which bugs were found by syzbot, I've looked through all patches in drivers/video which were tagged with syzbot or syzkaller back to year 2005. The vast majority fixed the reported issues on a higher level, e.g. when screen is to be resized, or when font size is to be changed. The few ones which touched driver code fixed a real driver bug, e.g. by adding a check. But NONE of those patches touched code of either the SCROLL_MOVE or the SCROLL_REDRAW case. That means, there was no real reason why SCROLL_MOVE had to be ripped-out and just SCROLL_REDRAW had to be used instead. The only reason I can imagine so far was that SCROLL_MOVE wasn't used by DRM and as such it was assumed that it could go away. That argument completely missed the fact that SCROLL_MOVE is still heavily used by fbdev (non-DRM) drivers. Some people mention that using memcpy() instead of the hardware acceleration is pretty much the same speed. But that's not true, at least not for older graphic cards and machines where we see speed decreases by factor 10 and more and thus this change leads to console responsiveness way worse than before. That's why the original commit is to be reverted. By reverting we reintroduce hardware-based scrolling acceleration and fix the performance regression for fbdev drivers. There isn't any impact on DRM when reverting those patches. Signed-off-by: Helge Deller <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Acked-by: Sven Schnelle <[email protected]> Cc: [email protected] # v5.16+ Signed-off-by: Helge Deller <[email protected]> Signed-off-by: Daniel Vetter <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

This reverts commit 39aead8. Revert the first (of 2) commits which disabled scrolling acceleration in fbcon/fbdev. It introduced a regression for fbdev-supported graphic cards because of the performance penalty by doing screen scrolling by software instead of using the existing graphic card 2D hardware acceleration. Console scrolling acceleration was disabled by dropping code which checked at runtime the driver hardware capabilities for the BINFO_HWACCEL_COPYAREA or FBINFO_HWACCEL_FILLRECT flags and if set, it enabled scrollmode SCROLL_MOVE which uses hardware acceleration to move screen contents. After dropping those checks scrollmode was hard-wired to SCROLL_REDRAW instead, which forces all graphic cards to redraw every character at the new screen position when scrolling. This change effectively disabled all hardware-based scrolling acceleration for ALL drivers, because now all kind of 2D hardware acceleration (bitblt, fillrect) in the drivers isn't used any longer. The original commit message mentions that only 3 DRM drivers (nouveau, omapdrm and gma500) used hardware acceleration in the past and thus code for checking and using scrolling acceleration is obsolete. This statement is NOT TRUE, because beside the DRM drivers there are around 35 other fbdev drivers which depend on fbdev/fbcon and still provide hardware acceleration for fbdev/fbcon. The original commit message also states that syzbot found lots of bugs in fbcon and thus it's "often the solution to just delete code and remove features". This is true, and the bugs - which actually affected all users of fbcon, including DRM - were fixed, or code was dropped like e.g. the support for software scrollback in vgacon (commit 973c096). So to further analyze which bugs were found by syzbot, I've looked through all patches in drivers/video which were tagged with syzbot or syzkaller back to year 2005. The vast majority fixed the reported issues on a higher level, e.g. when screen is to be resized, or when font size is to be changed. The few ones which touched driver code fixed a real driver bug, e.g. by adding a check. But NONE of those patches touched code of either the SCROLL_MOVE or the SCROLL_REDRAW case. That means, there was no real reason why SCROLL_MOVE had to be ripped-out and just SCROLL_REDRAW had to be used instead. The only reason I can imagine so far was that SCROLL_MOVE wasn't used by DRM and as such it was assumed that it could go away. That argument completely missed the fact that SCROLL_MOVE is still heavily used by fbdev (non-DRM) drivers. Some people mention that using memcpy() instead of the hardware acceleration is pretty much the same speed. But that's not true, at least not for older graphic cards and machines where we see speed decreases by factor 10 and more and thus this change leads to console responsiveness way worse than before. That's why the original commit is to be reverted. By reverting we reintroduce hardware-based scrolling acceleration and fix the performance regression for fbdev drivers. There isn't any impact on DRM when reverting those patches. Signed-off-by: Helge Deller <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Acked-by: Sven Schnelle <[email protected]> Cc: [email protected] # v5.10+ Signed-off-by: Helge Deller <[email protected]> Signed-off-by: Daniel Vetter <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

Add a config option CONFIG_FRAMEBUFFER_CONSOLE_LEGACY_ACCELERATION to enable bitblt and fillrect hardware acceleration in the framebuffer console. If disabled, such acceleration will not be used, even if it is supported by the graphics hardware driver. If you plan to use DRM as your main graphics output system, you should disable this option since it will prevent compiling in code which isn't used later on when DRM takes over. For all other configurations, e.g. if none of your graphic cards support DRM (yet), DRM isn't available for your architecture, or you can't be sure that the graphic card in the target system will support DRM, you most likely want to enable this option. In the non-accelerated case (e.g. when DRM is used), the inlined fb_scrollmode() function is hardcoded to return SCROLL_REDRAW and as such the compiler is able to optimize much unneccesary code away. In this v3 patch version I additionally changed the GETVYRES() and GETVXRES() macros to take a pointer to the fbcon_display struct. This fixes the build when console rotation is enabled and helps the compiler again to optimize out code. Signed-off-by: Helge Deller <[email protected]> Cc: [email protected] # v5.10+ Signed-off-by: Helge Deller <[email protected]> Signed-off-by: Daniel Vetter <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

Commit ceaa762 ("block: move direct_IO into our own read_iter handler") introduced several regressions for bdev DIO: 1. read spanning EOF always returns 0 instead of the number of bytes read. This is because "count" is assigned early and isn't updated when the iterator is truncated: $ lsblk -o name,size /dev/vdb NAME SIZE vdb 1G $ xfs_io -d -c 'pread -b 4M 1021M 4M' /dev/vdb read 0/4194304 bytes at offset 1070596096 0.000000 bytes, 0 ops; 0.0007 sec (0.000000 bytes/sec and 0.0000 ops/sec) instead of $ xfs_io -d -c 'pread -b 4M 1021M 4M' /dev/vdb read 3145728/4194304 bytes at offset 1070596096 3 MiB, 1 ops; 0.0007 sec (3.865 GiB/sec and 1319.2612 ops/sec) 2. truncated iterator isn't reexpanded 3. iterator isn't reverted on blkdev_direct_IO() error 4. zero size read no longer skips atime update Fixes: ceaa762 ("block: move direct_IO into our own read_iter handler") Signed-off-by: Ilya Dryomov <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>

into HEAD KVM/riscv fixes for 5.17, take #1 - Rework guest entry logic - Make CY, TM, and IR counters accessible in VU mode - Fix SBI implementation version

If we're doing an uncached read of the directory, then we ideally want to read only the exact set of entries that will fit in the buffer supplied by the getdents() system call. So unlike the case where we're reading into the page cache, let's send only one READDIR call, before trying to fill up the buffer. Fixes: 35df59d ("NFS: Reduce number of RPC calls when doing uncached readdir") Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>

Ensure that we initialise desc->cache_entry_index correctly in uncached_readdir(). Fixes: d1bacf9 ("NFS: add readdir cache array") Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>

If we've reached the end of the directory, then cache that information in the context so that we don't need to do an uncached readdir in order to rediscover that fact. Fixes: 794092c ("NFS: Do uncached readdir when we're seeking a cookie in an empty page cache") Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>

audit_log_start() returns audit_buffer pointer on success or NULL on error, so it is better to check the return value of it. Fixes: 3323eec ("integrity: IMA as an integrity service provider") Signed-off-by: Xiaoke Wang <[email protected]> Cc: <[email protected]> Reviewed-by: Paul Moore <[email protected]> Signed-off-by: Mimi Zohar <[email protected]>

The removal of ima_dir currently fails since ima_policy still exists, so remove the ima_policy file before removing the directory. Fixes: 4af4662 ("integrity: IMA policy") Signed-off-by: Stefan Berger <[email protected]> Cc: <[email protected]> Acked-by: Christian Brauner <[email protected]> Signed-off-by: Mimi Zohar <[email protected]>

Commit c2426d2 ("ima: added support for new kernel cmdline parameter ima_template_fmt") introduced an additional check on the ima_template variable to avoid multiple template selection. Unfortunately, ima_template could be also set by the setup function of the ima_hash= parameter, when it calls ima_template_desc_current(). This causes attempts to choose a new template with ima_template= or with ima_template_fmt=, after ima_hash=, to be ignored. Achieve the goal of the commit mentioned with the new static variable template_setup_done, so that template selection requests after ima_hash= are not ignored. Finally, call ima_init_template_list(), if not already done, to initialize the list of templates before lookup_template_desc() is called. Reported-by: Guo Zihua <[email protected]> Signed-off-by: Roberto Sassu <[email protected]> Cc: [email protected] Fixes: c2426d2 ("ima: added support for new kernel cmdline parameter ima_template_fmt") Signed-off-by: Mimi Zohar <[email protected]>

Before printing a policy rule scan for inactive LSM labels in the policy rule. Inactive LSM labels are identified by args_p != NULL and rule == NULL. Fixes: 483ec26 ("ima: ima/lsm policy rule loading logic bug fixes") Signed-off-by: Stefan Berger <[email protected]> Cc: <[email protected]> # v5.6+ Acked-by: Christian Brauner <[email protected]> [[email protected]: Updated "Fixes" tag] Signed-off-by: Mimi Zohar <[email protected]>

The recv path of secure mode is intertwined with that of crc mode. While it's slightly more efficient that way (the ciphertext is read into the destination buffer and decrypted in place, thus avoiding two potentially heavy memory allocations for the bounce buffer and the corresponding sg array), it isn't really amenable to changes. Sacrifice that edge and align with the send path which always uses a full-sized bounce buffer (currently there is no other way -- if the kernel crypto API ever grows support for streaming (piecewise) en/decryption for GCM [1], we would be able to easily take advantage of that on both sides). [1] https://lore.kernel.org/all/[email protected]/ Signed-off-by: Ilya Dryomov <[email protected]> Reviewed-by: Jeff Layton <[email protected]>

Both msgr1 and msgr2 in crc mode are zero copy in the sense that message data is read from the socket directly into the destination buffer. We assume that the destination buffer is stable (i.e. remains unchanged while it is being read to) though. Otherwise, CRC errors ensue: libceph: read_partial_message 0000000048edf8ad data crc 1063286393 != exp. 228122706 libceph: osd1 (1)192.168.122.1:6843 bad crc/signature libceph: bad data crc, calculated 57958023, expected 1805382778 libceph: osd2 (2)192.168.122.1:6876 integrity error, bad crc Introduce rxbounce option to enable use of a bounce buffer when receiving message data. In particular this is needed if a mapped image is a Windows VM disk, passed to QEMU. Windows has a system-wide "dummy" page that may be mapped into the destination buffer (potentially more than once into the same buffer) by the Windows Memory Manager in an effort to generate a single large I/O [1][2]. QEMU makes a point of preserving overlap relationships when cloning I/O vectors, so krbd gets exposed to this behaviour. [1] "What Is Really in That MDL?" https://docs.microsoft.com/en-us/previous-versions/windows/hardware/design/dn614012(v=vs.85) [2] https://blogs.msmvps.com/kernelmustard/2005/05/04/dummy-pages/ URL: https://bugzilla.redhat.com/show_bug.cgi?id=1973317 Signed-off-by: Ilya Dryomov <[email protected]> Reviewed-by: Jeff Layton <[email protected]>

We're missing the `f` prefix to have python do string interpolation, so we'd never end up printing what the actual "unexpected" error is. Fixes: ee92ed3 ("kunit: add run_checks.py script to validate kunit changes") Signed-off-by: Daniel Latypov <[email protected]> Reviewed-by: David Gow <[email protected]> Reviewed-by: Brendan Higgins <[email protected]> Signed-off-by: Shuah Khan <[email protected]>

Leon reported NULL pointer deref with nowait support: [ 15.123761] device-mapper: raid: Loading target version 1.15.1 [ 15.124185] device-mapper: raid: Ignoring chunk size parameter for RAID 1 [ 15.124192] device-mapper: raid: Choosing default region size of 4MiB [ 15.129524] BUG: kernel NULL pointer dereference, address: 0000000000000060 [ 15.129530] #PF: supervisor write access in kernel mode [ 15.129533] #PF: error_code(0x0002) - not-present page [ 15.129535] PGD 0 P4D 0 [ 15.129538] Oops: 0002 [#1] PREEMPT SMP NOPTI [ 15.129541] CPU: 5 PID: 494 Comm: ldmtool Not tainted 5.17.0-rc2-1-mainline #1 9fe89d43dfcb215d2731e6f8851740520778615e [ 15.129546] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F36e 10/14/2021 [ 15.129549] RIP: 0010:blk_queue_flag_set+0x7/0x20 [ 15.129555] Code: 00 00 00 0f 1f 44 00 00 48 8b 35 e4 e0 04 02 48 8d 57 28 bf 40 01 \ 00 00 e9 16 c1 be ff 66 0f 1f 44 00 00 0f 1f 44 00 00 89 ff <f0> 48 0f ab 7e 60 \ 31 f6 89 f7 c3 66 66 2e 0f 1f 84 00 00 00 00 00 [ 15.129559] RSP: 0018:ffff966b81987a88 EFLAGS: 00010202 [ 15.129562] RAX: ffff8b11c363a0d0 RBX: ffff8b11e294b070 RCX: 0000000000000000 [ 15.129564] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000001d [ 15.129566] RBP: ffff8b11e294b058 R08: 0000000000000000 R09: 0000000000000000 [ 15.129568] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b11e294b070 [ 15.129570] R13: 0000000000000000 R14: ffff8b11e294b000 R15: 0000000000000001 [ 15.129572] FS: 00007fa96e826780(0000) GS:ffff8b18deb40000(0000) knlGS:0000000000000000 [ 15.129575] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 15.129577] CR2: 0000000000000060 CR3: 000000010b8ce000 CR4: 00000000003506e0 [ 15.129580] Call Trace: [ 15.129582] <TASK> [ 15.129584] md_run+0x67c/0xc70 [md_mod 1e470c1b6bcf1114198109f42682f5a2740e9531] [ 15.129597] raid_ctr+0x134a/0x28ea [dm_raid 6a645dd7519e72834bd7e98c23497eeade14cd63] [ 15.129604] ? dm_split_args+0x63/0x150 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129615] dm_table_add_target+0x188/0x380 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129625] table_load+0x13b/0x370 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129635] ? dev_suspend+0x2d0/0x2d0 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129644] ctl_ioctl+0x1bd/0x460 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129655] dm_ctl_ioctl+0xa/0x20 [dm_mod 0d7b0bc3414340a79c4553bae5ca97294b78336e] [ 15.129663] __x64_sys_ioctl+0x8e/0xd0 [ 15.129667] do_syscall_64+0x5c/0x90 [ 15.129672] ? syscall_exit_to_user_mode+0x23/0x50 [ 15.129675] ? do_syscall_64+0x69/0x90 [ 15.129677] ? do_syscall_64+0x69/0x90 [ 15.129679] ? syscall_exit_to_user_mode+0x23/0x50 [ 15.129682] ? do_syscall_64+0x69/0x90 [ 15.129684] ? do_syscall_64+0x69/0x90 [ 15.129686] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 15.129689] RIP: 0033:0x7fa96ecd559b [ 15.129692] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c \ c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff \ ff 73 01 c3 48 8b 0d a5 a8 0c 00 f7 d8 64 89 01 48 [ 15.129696] RSP: 002b:00007ffcaf85c258 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 [ 15.129699] RAX: ffffffffffffffda RBX: 00007fa96f1b48f0 RCX: 00007fa96ecd559b [ 15.129701] RDX: 00007fa97017e610 RSI: 00000000c138fd09 RDI: 0000000000000003 [ 15.129702] RBP: 00007fa96ebab583 R08: 00007fa97017c9e0 R09: 00007ffcaf85bf27 [ 15.129704] R10: 0000000000000001 R11: 0000000000000206 R12: 00007fa97017e610 [ 15.129706] R13: 00007fa97017e640 R14: 00007fa97017e6c0 R15: 00007fa97017e530 [ 15.129709] </TASK> This is caused by missing mddev->queue check for setting QUEUE_FLAG_NOWAIT Fix this by moving the QUEUE_FLAG_NOWAIT logic to under mddev->queue check. Fixes: f51d46d ("md: add support for REQ_NOWAIT") Reported-by: Leon Möller <[email protected]> Tested-by: Leon Möller <[email protected]> Cc: Vishal Verma <[email protected]> Signed-off-by: Song Liu <[email protected]>

This will be used to help make decisions on what to do in misconfigured systems. v2: squash in semicolon fix from Stephen Rothwell Signed-off-by: Mario Limonciello <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]>

A crash was observed with the following output: BUG: kernel NULL pointer dereference, address: 0000000000000010 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 2 UID: 0 PID: 92 Comm: osnoise_cpus Not tainted 6.17.0-rc4-00201-gd69eb204c255 #138 PREEMPT(voluntary) RIP: 0010:bitmap_parselist+0x53/0x3e0 Call Trace: <TASK> osnoise_cpus_write+0x7a/0x190 vfs_write+0xf8/0x410 ? do_sys_openat2+0x88/0xd0 ksys_write+0x60/0xd0 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> This issue can be reproduced by below code: fd=open("/sys/kernel/debug/tracing/osnoise/cpus", O_WRONLY); write(fd, "0-2", 0); When user pass 'count=0' to osnoise_cpus_write(), kmalloc() will return ZERO_SIZE_PTR (16) and cpulist_parse() treat it as a normal value, which trigger the null pointer dereference. Add check for the parameter 'count'. Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Link: https://lore.kernel.org/[email protected] Fixes: 17f8910 ("tracing/osnoise: Allow arbitrarily long CPU string") Signed-off-by: Wang Liang <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>

…on memory When I did memory failure tests, below panic occurs: page dumped because: VM_BUG_ON_PAGE(PagePoisoned(page)) kernel BUG at include/linux/page-flags.h:616! Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 3 PID: 720 Comm: bash Not tainted 6.10.0-rc1-00195-g148743902568 #40 RIP: 0010:unpoison_memory+0x2f3/0x590 RSP: 0018:ffffa57fc8787d60 EFLAGS: 00000246 RAX: 0000000000000037 RBX: 0000000000000009 RCX: ffff9be25fcdc9c8 RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff9be25fcdc9c0 RBP: 0000000000300000 R08: ffffffffb4956f88 R09: 0000000000009ffb R10: 0000000000000284 R11: ffffffffb4926fa0 R12: ffffe6b00c000000 R13: ffff9bdb453dfd00 R14: 0000000000000000 R15: fffffffffffffffe FS: 00007f08f04e4740(0000) GS:ffff9be25fcc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000564787a30410 CR3: 000000010d4e2000 CR4: 00000000000006f0 Call Trace: <TASK> unpoison_memory+0x2f3/0x590 simple_attr_write_xsigned.constprop.0.isra.0+0xb3/0x110 debugfs_attr_write+0x42/0x60 full_proxy_write+0x5b/0x80 vfs_write+0xd5/0x540 ksys_write+0x64/0xe0 do_syscall_64+0xb9/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f08f0314887 RSP: 002b:00007ffece710078 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007f08f0314887 RDX: 0000000000000009 RSI: 0000564787a30410 RDI: 0000000000000001 RBP: 0000564787a30410 R08: 000000000000fefe R09: 000000007fffffff R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000009 R13: 00007f08f041b780 R14: 00007f08f0417600 R15: 00007f08f0416a00 </TASK> Modules linked in: hwpoison_inject ---[ end trace 0000000000000000 ]--- RIP: 0010:unpoison_memory+0x2f3/0x590 RSP: 0018:ffffa57fc8787d60 EFLAGS: 00000246 RAX: 0000000000000037 RBX: 0000000000000009 RCX: ffff9be25fcdc9c8 RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff9be25fcdc9c0 RBP: 0000000000300000 R08: ffffffffb4956f88 R09: 0000000000009ffb R10: 0000000000000284 R11: ffffffffb4926fa0 R12: ffffe6b00c000000 R13: ffff9bdb453dfd00 R14: 0000000000000000 R15: fffffffffffffffe FS: 00007f08f04e4740(0000) GS:ffff9be25fcc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000564787a30410 CR3: 000000010d4e2000 CR4: 00000000000006f0 Kernel panic - not syncing: Fatal exception Kernel Offset: 0x31c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) ---[ end Kernel panic - not syncing: Fatal exception ]--- The root cause is that unpoison_memory() tries to check the PG_HWPoison flags of an uninitialized page. So VM_BUG_ON_PAGE(PagePoisoned(page)) is triggered. This can be reproduced by below steps: 1.Offline memory block: echo offline > /sys/devices/system/memory/memory12/state 2.Get offlined memory pfn: page-types -b n -rlN 3.Write pfn to unpoison-pfn echo <pfn> > /sys/kernel/debug/hwpoison/unpoison-pfn This scenario can be identified by pfn_to_online_page() returning NULL. And ZONE_DEVICE pages are never expected, so we can simply fail if pfn_to_online_page() == NULL to fix the bug. Link: https://lkml.kernel.org/r/[email protected] Fixes: f1dd2cd ("mm, memory_hotplug: do not associate hotadded memory to zones until online") Signed-off-by: Miaohe Lin <[email protected]> Suggested-by: David Hildenbrand <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

The ns_bpf_qdisc selftest triggers a kernel panic: Unable to handle kernel paging request at virtual address ffffffffa38dbf58 Current test_progs pgtable: 4K pagesize, 57-bit VAs, pgdp=0x00000001109cc000 [ffffffffa38dbf58] pgd=000000011fffd801, p4d=000000011fffd401, pud=000000011fffd001, pmd=0000000000000000 Oops [#1] Modules linked in: bpf_testmod(OE) xt_conntrack nls_iso8859_1 dm_mod drm drm_panel_orientation_quirks configfs backlight btrfs blake2b_generic xor lzo_compress zlib_deflate raid6_pq efivarfs [last unloaded: bpf_testmod(OE)] CPU: 1 UID: 0 PID: 23584 Comm: test_progs Tainted: G W OE 6.17.0-rc1-g2465bb83e0b4 #1 NONE Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: Unknown Unknown Product/Unknown Product, BIOS 2024.01+dfsg-1ubuntu5.1 01/01/2024 epc : __qdisc_run+0x82/0x6f0 ra : __qdisc_run+0x6e/0x6f0 epc : ffffffff80bd5c7a ra : ffffffff80bd5c66 sp : ff2000000eecb550 gp : ffffffff82472098 tp : ff60000096895940 t0 : ffffffff8001f180 t1 : ffffffff801e1664 t2 : 0000000000000000 s0 : ff2000000eecb5d0 s1 : ff60000093a6a600 a0 : ffffffffa38dbee8 a1 : 0000000000000001 a2 : ff2000000eecb510 a3 : 0000000000000001 a4 : 0000000000000000 a5 : 0000000000000010 a6 : 0000000000000000 a7 : 0000000000735049 s2 : ffffffffa38dbee8 s3 : 0000000000000040 s4 : ff6000008bcda000 s5 : 0000000000000008 s6 : ff60000093a6a680 s7 : ff60000093a6a6f0 s8 : ff60000093a6a6ac s9 : ff60000093140000 s10: 0000000000000000 s11: ff2000000eecb9d0 t3 : 0000000000000000 t4 : 0000000000ff0000 t5 : 0000000000000000 t6 : ff60000093a6a8b6 status: 0000000200000120 badaddr: ffffffffa38dbf58 cause: 000000000000000d [<ffffffff80bd5c7a>] __qdisc_run+0x82/0x6f0 [<ffffffff80b6fe58>] __dev_queue_xmit+0x4c0/0x1128 [<ffffffff80b80ae0>] neigh_resolve_output+0xd0/0x170 [<ffffffff80d2daf6>] ip6_finish_output2+0x226/0x6c8 [<ffffffff80d31254>] ip6_finish_output+0x10c/0x2a0 [<ffffffff80d31446>] ip6_output+0x5e/0x178 [<ffffffff80d2e232>] ip6_xmit+0x29a/0x608 [<ffffffff80d6f4c6>] inet6_csk_xmit+0xe6/0x140 [<ffffffff80c985e4>] __tcp_transmit_skb+0x45c/0xaa8 [<ffffffff80c995fe>] tcp_connect+0x9ce/0xd10 [<ffffffff80d66524>] tcp_v6_connect+0x4ac/0x5e8 [<ffffffff80cc19b8>] __inet_stream_connect+0xd8/0x318 [<ffffffff80cc1c36>] inet_stream_connect+0x3e/0x68 [<ffffffff80b42b20>] __sys_connect_file+0x50/0x88 [<ffffffff80b42bee>] __sys_connect+0x96/0xc8 [<ffffffff80b42c40>] __riscv_sys_connect+0x20/0x30 [<ffffffff80e5bcae>] do_trap_ecall_u+0x256/0x378 [<ffffffff80e69af2>] handle_exception+0x14a/0x156 Code: 892a 0363 1205 489c 8bc1 c7e5 2d03 084a 2703 080a (2783) 0709 ---[ end trace 0000000000000000 ]--- The bpf_fifo_dequeue prog returns a skb which is a pointer. The pointer is treated as a 32bit value and sign extend to 64bit in epilogue. This behavior is right for most bpf prog types but wrong for struct ops which requires RISC-V ABI. So let's sign extend struct ops return values according to the function model and RISC-V ABI([0]). [0]: https://riscv.org/wp-content/uploads/2024/12/riscv-calling.pdf Fixes: 25ad106 ("riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework") Signed-off-by: Hengqi Chen <[email protected]> Reviewed-by: Pu Lehui <[email protected]> Tested-by: Pu Lehui <[email protected]>

Avoid below overlapping mappings by using a contiguous non-cacheable buffer. [ 4.077708] DMA-API: stm32_fmc2_nfc 48810000.nand-controller: cacheline tracking EEXIST, overlapping mappings aren't supported [ 4.089103] WARNING: CPU: 1 PID: 44 at kernel/dma/debug.c:568 add_dma_entry+0x23c/0x300 [ 4.097071] Modules linked in: [ 4.100101] CPU: 1 PID: 44 Comm: kworker/u4:2 Not tainted 6.1.82 #1 [ 4.106346] Hardware name: STMicroelectronics STM32MP257F VALID1 SNOR / MB1704 (LPDDR4 Power discrete) + MB1703 + MB1708 (SNOR MB1730) (DT) [ 4.118824] Workqueue: events_unbound deferred_probe_work_func [ 4.124674] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 4.131624] pc : add_dma_entry+0x23c/0x300 [ 4.135658] lr : add_dma_entry+0x23c/0x300 [ 4.139792] sp : ffff800009dbb490 [ 4.143016] x29: ffff800009dbb4a0 x28: 0000000004008022 x27: ffff8000098a6000 [ 4.150174] x26: 0000000000000000 x25: ffff8000099e7000 x24: ffff8000099e7de8 [ 4.157231] x23: 00000000ffffffff x22: 0000000000000000 x21: ffff8000098a6a20 [ 4.164388] x20: ffff000080964180 x19: ffff800009819ba0 x18: 0000000000000006 [ 4.171545] x17: 6361727420656e69 x16: 6c6568636163203a x15: 72656c6c6f72746e [ 4.178602] x14: 6f632d646e616e2e x13: ffff800009832f58 x12: 00000000000004ec [ 4.185759] x11: 00000000000001a4 x10: ffff80000988af58 x9 : ffff800009832f58 [ 4.192916] x8 : 00000000ffffefff x7 : ffff80000988af58 x6 : 80000000fffff000 [ 4.199972] x5 : 000000000000bff4 x4 : 0000000000000000 x3 : 0000000000000000 [ 4.207128] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000812d2c40 [ 4.214185] Call trace: [ 4.216605] add_dma_entry+0x23c/0x300 [ 4.220338] debug_dma_map_sg+0x198/0x350 [ 4.224373] __dma_map_sg_attrs+0xa0/0x110 [ 4.228411] dma_map_sg_attrs+0x10/0x2c [ 4.232247] stm32_fmc2_nfc_xfer.isra.0+0x1c8/0x3fc [ 4.237088] stm32_fmc2_nfc_seq_read_page+0xc8/0x174 [ 4.242127] nand_read_oob+0x1d4/0x8e0 [ 4.245861] mtd_read_oob_std+0x58/0x84 [ 4.249596] mtd_read_oob+0x90/0x150 [ 4.253231] mtd_read+0x68/0xac Signed-off-by: Christophe Kerello <[email protected]> Cc: [email protected] Fixes: 2cd457f ("mtd: rawnand: stm32_fmc2: add STM32 FMC2 NAND flash controller driver") Signed-off-by: Miquel Raynal <[email protected]>

Problem description =================== Lockdep reports a possible circular locking dependency (AB/BA) between &pl->state_mutex and &phy->lock, as follows. phylink_resolve() // acquires &pl->state_mutex -> phylink_major_config() -> phy_config_inband() // acquires &pl->phydev->lock whereas all the other call sites where &pl->state_mutex and &pl->phydev->lock have the locking scheme reversed. Everywhere else, &pl->phydev->lock is acquired at the top level, and &pl->state_mutex at the lower level. A clear example is phylink_bringup_phy(). The outlier is the newly introduced phy_config_inband() and the existing lock order is the correct one. To understand why it cannot be the other way around, it is sufficient to consider phylink_phy_change(), phylink's callback from the PHY device's phy->phy_link_change() virtual method, invoked by the PHY state machine. phy_link_up() and phy_link_down(), the (indirect) callers of phylink_phy_change(), are called with &phydev->lock acquired. Then phylink_phy_change() acquires its own &pl->state_mutex, to serialize changes made to its pl->phy_state and pl->link_config. So all other instances of &pl->state_mutex and &phydev->lock must be consistent with this order. Problem impact ============== I think the kernel runs a serious deadlock risk if an existing phylink_resolve() thread, which results in a phy_config_inband() call, is concurrent with a phy_link_up() or phy_link_down() call, which will deadlock on &pl->state_mutex in phylink_phy_change(). Practically speaking, the impact may be limited by the slow speed of the medium auto-negotiation protocol, which makes it unlikely for the current state to still be unresolved when a new one is detected, but I think the problem is there. Nonetheless, the problem was discovered using lockdep. Proposed solution ================= Practically speaking, the phy_config_inband() requirement of having phydev->lock acquired must transfer to the caller (phylink is the only caller). There, it must bubble up until immediately before &pl->state_mutex is acquired, for the cases where that takes place. Solution details, considerations, notes ======================================= This is the phy_config_inband() call graph: sfp_upstream_ops :: connect_phy() | v phylink_sfp_connect_phy() | v phylink_sfp_config_phy() | | sfp_upstream_ops :: module_insert() | | | v | phylink_sfp_module_insert() | | | | sfp_upstream_ops :: module_start() | | | | | v | | phylink_sfp_module_start() | | | | v v | phylink_sfp_config_optical() phylink_start() | | | phylink_resume() v v | | phylink_sfp_set_config() | | | v v v phylink_mac_initial_config() | phylink_resolve() | | phylink_ethtool_ksettings_set() v v v phylink_major_config() | v phy_config_inband() phylink_major_config() caller #1, phylink_mac_initial_config(), does not acquire &pl->state_mutex nor do its callers. It must acquire &pl->phydev->lock prior to calling phylink_major_config(). phylink_major_config() caller #2, phylink_resolve() acquires &pl->state_mutex, thus also needs to acquire &pl->phydev->lock. phylink_major_config() caller #3, phylink_ethtool_ksettings_set(), is completely uninteresting, because it only calls phylink_major_config() if pl->phydev is NULL (otherwise it calls phy_ethtool_ksettings_set()). We need to change nothing there. Other solutions =============== The lock inversion between &pl->state_mutex and &pl->phydev->lock has occurred at least once before, as seen in commit c718af2 ("net: phylink: fix ethtool -A with attached PHYs"). The solution there was to simply not call phy_set_asym_pause() under the &pl->state_mutex. That cannot be extended to our case though, where the phy_config_inband() call is much deeper inside the &pl->state_mutex section. Fixes: 5fd0f1a ("net: phylink: add negotiation of in-band capabilities") Signed-off-by: Vladimir Oltean <[email protected]> Reviewed-by: Russell King (Oracle) <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>

5da3d94 ("PCI: mvebu: Use for_each_of_range() iterator for parsing "ranges"") simplified code by using the for_each_of_range() iterator, but it broke PCI enumeration on Turris Omnia (and probably other mvebu targets). Issue #1: To determine range.flags, of_pci_range_parser_one() uses bus->get_flags(), which resolves to of_bus_pci_get_flags(), which already returns an IORESOURCE bit field, and NOT the original flags from the "ranges" resource. Then mvebu_get_tgt_attr() attempts the very same conversion again. Remove the misinterpretation of range.flags in mvebu_get_tgt_attr(), to restore the intended behavior. Issue #2: The driver needs target and attributes, which are encoded in the raw address values of the "/soc/pcie/ranges" resource. According to of_pci_range_parser_one(), the raw values are stored in range.bus_addr and range.parent_bus_addr, respectively. range.cpu_addr is a translated version of range.parent_bus_addr, and not relevant here. Use the correct range structure member, to extract target and attributes. This restores the intended behavior. Fixes: 5da3d94 ("PCI: mvebu: Use for_each_of_range() iterator for parsing "ranges"") Reported-by: Jan Palus <[email protected]> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220479 Signed-off-by: Klaus Kudielka <[email protected]> Signed-off-by: Bjorn Helgaas <[email protected]> Tested-by: Tony Dinh <[email protected]> Tested-by: Jan Palus <[email protected]> Link: https://patch.msgid.link/[email protected]

If request_irq() in i40e_vsi_request_irq_msix() fails in an iteration later than the first, the error path wants to free the IRQs requested so far. However, it uses the wrong dev_id argument for free_irq(), so it does not free the IRQs correctly and instead triggers the warning: Trying to free already-free IRQ 173 WARNING: CPU: 25 PID: 1091 at kernel/irq/manage.c:1829 __free_irq+0x192/0x2c0 Modules linked in: i40e(+) [...] CPU: 25 UID: 0 PID: 1091 Comm: NetworkManager Not tainted 6.17.0-rc1+ #1 PREEMPT(lazy) Hardware name: [...] RIP: 0010:__free_irq+0x192/0x2c0 [...] Call Trace: <TASK> free_irq+0x32/0x70 i40e_vsi_request_irq_msix.cold+0x63/0x8b [i40e] i40e_vsi_request_irq+0x79/0x80 [i40e] i40e_vsi_open+0x21f/0x2f0 [i40e] i40e_open+0x63/0x130 [i40e] __dev_open+0xfc/0x210 __dev_change_flags+0x1fc/0x240 netif_change_flags+0x27/0x70 do_setlink.isra.0+0x341/0xc70 rtnl_newlink+0x468/0x860 rtnetlink_rcv_msg+0x375/0x450 netlink_rcv_skb+0x5c/0x110 netlink_unicast+0x288/0x3c0 netlink_sendmsg+0x20d/0x430 ____sys_sendmsg+0x3a2/0x3d0 ___sys_sendmsg+0x99/0xe0 __sys_sendmsg+0x8a/0xf0 do_syscall_64+0x82/0x2c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e [...] </TASK> ---[ end trace 0000000000000000 ]--- Use the same dev_id for free_irq() as for request_irq(). I tested this with inserting code to fail intentionally. Fixes: 493fb30 ("i40e: Move q_vectors from pointer to array to array of pointers") Signed-off-by: Michal Schmidt <[email protected]> Reviewed-by: Aleksandr Loktionov <[email protected]> Reviewed-by: Subbaraya Sundeep <[email protected]> Tested-by: Rinitha S <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>

Hangbin Liu says: ==================== hsr: fix lock warnings hsr_for_each_port is called in many places without holding the RCU read lock, this may trigger warnings on debug kernels like: [ 40.457015] [ T201] WARNING: suspicious RCU usage [ 40.457020] [ T201] 6.17.0-rc2-virtme #1 Not tainted [ 40.457025] [ T201] ----------------------------- [ 40.457029] [ T201] net/hsr/hsr_main.c:137 RCU-list traversed in non-reader section!! [ 40.457036] [ T201] other info that might help us debug this: [ 40.457040] [ T201] rcu_scheduler_active = 2, debug_locks = 1 [ 40.457045] [ T201] 2 locks held by ip/201: [ 40.457050] [ T201] #0: ffffffff93040a40 (&ops->srcu){.+.+}-{0:0}, at: rtnl_link_ops_get+0xf2/0x280 [ 40.457080] [ T201] #1: ffffffff92e7f968 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_newlink+0x5e1/0xb20 [ 40.457102] [ T201] stack backtrace: [ 40.457108] [ T201] CPU: 2 UID: 0 PID: 201 Comm: ip Not tainted 6.17.0-rc2-virtme #1 PREEMPT(full) [ 40.457114] [ T201] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 40.457117] [ T201] Call Trace: [ 40.457120] [ T201] <TASK> [ 40.457126] [ T201] dump_stack_lvl+0x6f/0xb0 [ 40.457136] [ T201] lockdep_rcu_suspicious.cold+0x4f/0xb1 [ 40.457148] [ T201] hsr_port_get_hsr+0xfe/0x140 [ 40.457158] [ T201] hsr_add_port+0x192/0x940 [ 40.457167] [ T201] ? __pfx_hsr_add_port+0x10/0x10 [ 40.457176] [ T201] ? lockdep_init_map_type+0x5c/0x270 [ 40.457189] [ T201] hsr_dev_finalize+0x4bc/0xbf0 [ 40.457204] [ T201] hsr_newlink+0x3c3/0x8f0 [ 40.457212] [ T201] ? __pfx_hsr_newlink+0x10/0x10 [ 40.457222] [ T201] ? rtnl_create_link+0x173/0xe40 [ 40.457233] [ T201] rtnl_newlink_create+0x2cf/0x750 [ 40.457243] [ T201] ? __pfx_rtnl_newlink_create+0x10/0x10 [ 40.457247] [ T201] ? __dev_get_by_name+0x12/0x50 [ 40.457252] [ T201] ? rtnl_dev_get+0xac/0x140 [ 40.457259] [ T201] ? __pfx_rtnl_dev_get+0x10/0x10 [ 40.457285] [ T201] __rtnl_newlink+0x22c/0xa50 [ 40.457305] [ T201] rtnl_newlink+0x637/0xb20 Adding rcu_read_lock() for all hsr_for_each_port() looks confusing. Introduce a new helper, hsr_for_each_port_rtnl(), that assumes the RTNL lock is held. This allows callers in suitable contexts to iterate ports safely without explicit RCU locking. Other code paths that rely on RCU protection continue to use hsr_for_each_port() with rcu_read_lock(). ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Paolo Abeni <[email protected]>

The ns_bpf_qdisc selftest triggers a kernel panic: Unable to handle kernel paging request at virtual address ffffffffa38dbf58 Current test_progs pgtable: 4K pagesize, 57-bit VAs, pgdp=0x00000001109cc000 [ffffffffa38dbf58] pgd=000000011fffd801, p4d=000000011fffd401, pud=000000011fffd001, pmd=0000000000000000 Oops [#1] Modules linked in: bpf_testmod(OE) xt_conntrack nls_iso8859_1 dm_mod drm drm_panel_orientation_quirks configfs backlight btrfs blake2b_generic xor lzo_compress zlib_deflate raid6_pq efivarfs [last unloaded: bpf_testmod(OE)] CPU: 1 UID: 0 PID: 23584 Comm: test_progs Tainted: G W OE 6.17.0-rc1-g2465bb83e0b4 #1 NONE Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: Unknown Unknown Product/Unknown Product, BIOS 2024.01+dfsg-1ubuntu5.1 01/01/2024 epc : __qdisc_run+0x82/0x6f0 ra : __qdisc_run+0x6e/0x6f0 epc : ffffffff80bd5c7a ra : ffffffff80bd5c66 sp : ff2000000eecb550 gp : ffffffff82472098 tp : ff60000096895940 t0 : ffffffff8001f180 t1 : ffffffff801e1664 t2 : 0000000000000000 s0 : ff2000000eecb5d0 s1 : ff60000093a6a600 a0 : ffffffffa38dbee8 a1 : 0000000000000001 a2 : ff2000000eecb510 a3 : 0000000000000001 a4 : 0000000000000000 a5 : 0000000000000010 a6 : 0000000000000000 a7 : 0000000000735049 s2 : ffffffffa38dbee8 s3 : 0000000000000040 s4 : ff6000008bcda000 s5 : 0000000000000008 s6 : ff60000093a6a680 s7 : ff60000093a6a6f0 s8 : ff60000093a6a6ac s9 : ff60000093140000 s10: 0000000000000000 s11: ff2000000eecb9d0 t3 : 0000000000000000 t4 : 0000000000ff0000 t5 : 0000000000000000 t6 : ff60000093a6a8b6 status: 0000000200000120 badaddr: ffffffffa38dbf58 cause: 000000000000000d [<ffffffff80bd5c7a>] __qdisc_run+0x82/0x6f0 [<ffffffff80b6fe58>] __dev_queue_xmit+0x4c0/0x1128 [<ffffffff80b80ae0>] neigh_resolve_output+0xd0/0x170 [<ffffffff80d2daf6>] ip6_finish_output2+0x226/0x6c8 [<ffffffff80d31254>] ip6_finish_output+0x10c/0x2a0 [<ffffffff80d31446>] ip6_output+0x5e/0x178 [<ffffffff80d2e232>] ip6_xmit+0x29a/0x608 [<ffffffff80d6f4c6>] inet6_csk_xmit+0xe6/0x140 [<ffffffff80c985e4>] __tcp_transmit_skb+0x45c/0xaa8 [<ffffffff80c995fe>] tcp_connect+0x9ce/0xd10 [<ffffffff80d66524>] tcp_v6_connect+0x4ac/0x5e8 [<ffffffff80cc19b8>] __inet_stream_connect+0xd8/0x318 [<ffffffff80cc1c36>] inet_stream_connect+0x3e/0x68 [<ffffffff80b42b20>] __sys_connect_file+0x50/0x88 [<ffffffff80b42bee>] __sys_connect+0x96/0xc8 [<ffffffff80b42c40>] __riscv_sys_connect+0x20/0x30 [<ffffffff80e5bcae>] do_trap_ecall_u+0x256/0x378 [<ffffffff80e69af2>] handle_exception+0x14a/0x156 Code: 892a 0363 1205 489c 8bc1 c7e5 2d03 084a 2703 080a (2783) 0709 ---[ end trace 0000000000000000 ]--- The bpf_fifo_dequeue prog returns a skb which is a pointer. The pointer is treated as a 32bit value and sign extend to 64bit in epilogue. This behavior is right for most bpf prog types but wrong for struct ops which requires RISC-V ABI. So let's sign extend struct ops return values according to the function model and RISC-V ABI([0]). [0]: https://riscv.org/wp-content/uploads/2024/12/riscv-calling.pdf Fixes: 25ad106 ("riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework") Signed-off-by: Hengqi Chen <[email protected]> Reviewed-by: Pu Lehui <[email protected]> Tested-by: Pu Lehui <[email protected]>

The ns_bpf_qdisc selftest triggers a kernel panic: Unable to handle kernel paging request at virtual address ffffffffa38dbf58 Current test_progs pgtable: 4K pagesize, 57-bit VAs, pgdp=0x00000001109cc000 [ffffffffa38dbf58] pgd=000000011fffd801, p4d=000000011fffd401, pud=000000011fffd001, pmd=0000000000000000 Oops [#1] Modules linked in: bpf_testmod(OE) xt_conntrack nls_iso8859_1 [...] [last unloaded: bpf_testmod(OE)] CPU: 1 UID: 0 PID: 23584 Comm: test_progs Tainted: G W OE 6.17.0-rc1-g2465bb83e0b4 #1 NONE Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: Unknown Unknown Product/Unknown Product, BIOS 2024.01+dfsg-1ubuntu5.1 01/01/2024 epc : __qdisc_run+0x82/0x6f0 ra : __qdisc_run+0x6e/0x6f0 epc : ffffffff80bd5c7a ra : ffffffff80bd5c66 sp : ff2000000eecb550 gp : ffffffff82472098 tp : ff60000096895940 t0 : ffffffff8001f180 t1 : ffffffff801e1664 t2 : 0000000000000000 s0 : ff2000000eecb5d0 s1 : ff60000093a6a600 a0 : ffffffffa38dbee8 a1 : 0000000000000001 a2 : ff2000000eecb510 a3 : 0000000000000001 a4 : 0000000000000000 a5 : 0000000000000010 a6 : 0000000000000000 a7 : 0000000000735049 s2 : ffffffffa38dbee8 s3 : 0000000000000040 s4 : ff6000008bcda000 s5 : 0000000000000008 s6 : ff60000093a6a680 s7 : ff60000093a6a6f0 s8 : ff60000093a6a6ac s9 : ff60000093140000 s10: 0000000000000000 s11: ff2000000eecb9d0 t3 : 0000000000000000 t4 : 0000000000ff0000 t5 : 0000000000000000 t6 : ff60000093a6a8b6 status: 0000000200000120 badaddr: ffffffffa38dbf58 cause: 000000000000000d [<ffffffff80bd5c7a>] __qdisc_run+0x82/0x6f0 [<ffffffff80b6fe58>] __dev_queue_xmit+0x4c0/0x1128 [<ffffffff80b80ae0>] neigh_resolve_output+0xd0/0x170 [<ffffffff80d2daf6>] ip6_finish_output2+0x226/0x6c8 [<ffffffff80d31254>] ip6_finish_output+0x10c/0x2a0 [<ffffffff80d31446>] ip6_output+0x5e/0x178 [<ffffffff80d2e232>] ip6_xmit+0x29a/0x608 [<ffffffff80d6f4c6>] inet6_csk_xmit+0xe6/0x140 [<ffffffff80c985e4>] __tcp_transmit_skb+0x45c/0xaa8 [<ffffffff80c995fe>] tcp_connect+0x9ce/0xd10 [<ffffffff80d66524>] tcp_v6_connect+0x4ac/0x5e8 [<ffffffff80cc19b8>] __inet_stream_connect+0xd8/0x318 [<ffffffff80cc1c36>] inet_stream_connect+0x3e/0x68 [<ffffffff80b42b20>] __sys_connect_file+0x50/0x88 [<ffffffff80b42bee>] __sys_connect+0x96/0xc8 [<ffffffff80b42c40>] __riscv_sys_connect+0x20/0x30 [<ffffffff80e5bcae>] do_trap_ecall_u+0x256/0x378 [<ffffffff80e69af2>] handle_exception+0x14a/0x156 Code: 892a 0363 1205 489c 8bc1 c7e5 2d03 084a 2703 080a (2783) 0709 ---[ end trace 0000000000000000 ]--- The bpf_fifo_dequeue prog returns a skb which is a pointer. The pointer is treated as a 32bit value and sign extend to 64bit in epilogue. This behavior is right for most bpf prog types but wrong for struct ops which requires RISC-V ABI. So let's sign extend struct ops return values according to the function model and RISC-V ABI([0]). [0]: https://riscv.org/wp-content/uploads/2024/12/riscv-calling.pdf Fixes: 25ad106 ("riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework") Signed-off-by: Hengqi Chen <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Tested-by: Pu Lehui <[email protected]> Reviewed-by: Pu Lehui <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]

…tack-analysis' Eduard Zingerman says: ==================== bpf: replace path-sensitive with path-insensitive live stack analysis Consider the following program, assuming checkpoint is created for a state at instruction (3): 1: call bpf_get_prandom_u32() 2: *(u64 *)(r10 - 8) = 42 -- checkpoint #1 -- 3: if r0 != 0 goto +1 4: exit; 5: r0 = *(u64 *)(r10 - 8) 6: exit The verifier processes this program by exploring two paths: - 1 -> 2 -> 3 -> 4 - 1 -> 2 -> 3 -> 5 -> 6 When instruction (5) is processed, the current liveness tracking mechanism moves up the register parent links and records a "read" mark for stack slot -8 at checkpoint #1, stopping because of the "write" mark recorded at instruction (2). This patch set replaces the existing liveness tracking mechanism with a path-insensitive data flow analysis. The program above is processed as follows: - a data structure representing live stack slots for instructions 1-6 in frame #0 is allocated; - when instruction (2) is processed, record that slot -8 is written at instruction (2) in frame #0; - when instruction (5) is processed, record that slot -8 is read at instruction (5) in frame #0; - when instruction (6) is processed, propagate read mark for slot -8 up the control flow graph to instructions 3 and 2. The key difference is that the new mechanism operates on a control flow graph and associates read and write marks with pairs of (call chain, instruction index). In contrast, the old mechanism operates on verifier states and register parent links, associating read and write marks with verifier states. Motivation ========== As it stands, this patch set makes liveness tracking slightly less precise, as it no longer distinguishes individual program paths taken by the verifier during symbolic execution. See the "Impact on verification performance" section for details. However, this change is intended as a stepping stone toward the following goals: - Short term, integrate precision tracking into liveness analysis and remove the following code: - verifier backedge states accumulation in is_state_visited(); - most of the logic for precision tracking; - jump history tracking. - Long term, help with more efficient loop verification handling. Why integrating precision tracking? ----------------------------------- In a sense, precision tracking is very similar to liveness tracking. The data flow equations for liveness tracking look as follows: live_after = U [state[s].live_before for s in insn_successors(i)] state[i].live_before = (live_after / state[i].must_write) U state[i].may_read While data flow equations for precision tracking look as follows: precise_after = U [state[s].precise_before for s in insn_successors(i)] // if some of the instruction outputs are precise, // assume its inputs to be precise induced_precise = ⎧ state[i].may_read if (state[i].may_write ∩ precise_after) ≠ ∅ ⎨ ⎩ ∅ otherwise state[i].precise_before = (precise_after / state[i].must_write) ∩ induced_precise Where: - `may_read` set represents a union of all possibly read slots (any slot in `may_read` set might be by the instruction); - `must_write` set represents an intersection of all possibly written slots (any slot in `must_write` set is guaranteed to be written by the instruction). - `may_write` set represents a union of all possibly written slots (any slot in `may_write` set might be written by the instruction). This means that precision tracking can be implemented as a logical extension of liveness tracking: - track registers as well as stack slots; - add bit masks to represent `precise_before` and `may_write`; - add above equations for `precise_before` computation; - (linked registers require some additional consideration). Such extension would allow removal of: - precision propagation logic in verifier.c: - backtrack_insn() - mark_chain_precision() - propagate_{precision,backedges}() - push_jmp_history() and related data structures, which are only used by precision tracking; - add_scc_backedge() and related backedge state accumulation in is_state_visited(), superseded by per-callchain function state accumulated by liveness analysis. The hope here is that unifying liveness and precision tracking will reduce overall amount of code and make it easier to reason about. How this helps with loops? -------------------------- As it stands, this patch set shares the same deficiency as the current liveness tracking mechanism. Liveness marks on stack slots cannot be used to prune states when processing iterator-based loops: - such states still have branches to be explored; - meaning that not all stack slot reads have been discovered. For example: 1: while(iter_next()) { 2: if (...) 3: r0 = *(u64 *)(r10 - 8) 4: if (...) 5: r0 = *(u64 *)(r10 - 16) 6: ... 7: } For any checkpoint state created at instruction (1), it is only possible to rely on read marks for slots fp[-8] and fp[-16] once all child states of (1) have been explored. Thus, when the verifier transitions from (7) to (1), it cannot rely on read marks. However, sacrificing path-sensitivity makes it possible to run analysis defined in this patch set before main verification pass, if estimates for value ranges are available. E.g. for the following program: 1: while(iter_next()) { 2: r0 = r10 3: r0 += r2 4: r0 = *(u64 *)(r2 + 0) 5: ... 6: } If an estimate for `r2` range is available before the main verification pass, it can be used to populate read marks at instruction (4) and run the liveness analysis. Thus making conservative liveness information available during loops verification. Such estimates can be provided by some form of value range analysis. Value range analysis is also necessary to address loop verification from another angle: computing boundaries for loop induction variables and iteration counts. The hope here is that the new liveness tracking mechanism will support the broader goal of making loop verification more efficient. Validation ========== The change was tested on three program sets: - bpf selftests - sched_ext - Meta's internal set of programs Commit [#8] enables a special mode where both the current and new liveness analyses are enabled simultaneously. This mode signals an error if the new algorithm considers a stack slot dead while the current algorithm assumes it is alive. This mode was very useful for debugging. At the time of posting, no such errors have been reported for the above program sets. [#8] "bpf: signal error if old liveness is more conservative than new" Impact on memory consumption ============================ Debug patch [1] extends the kernel and veristat to count the amount of memory allocated for storing analysis data. This patch is not included in the submission. The maximal observed impact for the above program sets is 2.6Mb. Data below is shown in bytes. For bpf selftests top 5 consumers look as follows: File Program liveness mem ----------------------- ---------------- ------------ pyperf180.bpf.o on_event 2629740 pyperf600.bpf.o on_event 2287662 pyperf100.bpf.o on_event 1427022 test_verif_scale3.bpf.o balancer_ingress 1121283 pyperf_subprogs.bpf.o on_event 756900 For sched_ext top 5 consumers loog as follows: File Program liveness mem --------- ------------------------------- ------------ bpf.bpf.o lavd_enqueue 164686 bpf.bpf.o lavd_select_cpu 157393 bpf.bpf.o layered_enqueue 154817 bpf.bpf.o lavd_init 127865 bpf.bpf.o layered_dispatch 110129 For Meta's internal set of programs top consumer is 1Mb. [1] kernel-patches/bpf@085588e Impact on verification performance ================================== Veristat results below are reported using `-f insns_pct>1 -f !insns<500` filter and -t option (BPF_F_TEST_STATE_FREQ flag). master vs patch-set, selftests (out of ~4K programs) ---------------------------------------------------- File Program Insns (A) Insns (B) Insns (DIFF) -------------------------------- -------------------------------------- --------- --------- --------------- cpumask_success.bpf.o test_global_mask_nested_deep_array_rcu 1622 1655 +33 (+2.03%) strobemeta_bpf_loop.bpf.o on_event 2163 2684 +521 (+24.09%) test_cls_redirect.bpf.o cls_redirect 36001 42515 +6514 (+18.09%) test_cls_redirect_dynptr.bpf.o cls_redirect 2299 2339 +40 (+1.74%) test_cls_redirect_subprogs.bpf.o cls_redirect 69545 78497 +8952 (+12.87%) test_l4lb_noinline.bpf.o balancer_ingress 2993 3084 +91 (+3.04%) test_xdp_noinline.bpf.o balancer_ingress_v4 3539 3616 +77 (+2.18%) test_xdp_noinline.bpf.o balancer_ingress_v6 3608 3685 +77 (+2.13%) master vs patch-set, sched_ext (out of 148 programs) ---------------------------------------------------- File Program Insns (A) Insns (B) Insns (DIFF) --------- ---------------- --------- --------- --------------- bpf.bpf.o chaos_dispatch 2257 2287 +30 (+1.33%) bpf.bpf.o lavd_enqueue 20735 22101 +1366 (+6.59%) bpf.bpf.o lavd_select_cpu 22100 24409 +2309 (+10.45%) bpf.bpf.o layered_dispatch 25051 25606 +555 (+2.22%) bpf.bpf.o p2dq_dispatch 961 990 +29 (+3.02%) bpf.bpf.o rusty_quiescent 526 534 +8 (+1.52%) bpf.bpf.o rusty_runnable 541 547 +6 (+1.11%) Perf report =========== In relative terms, the analysis does not consume much CPU time. For example, here is a perf report collected for pyperf180 selftest: # Children Self Command Shared Object Symbol # ........ ........ ........ .................... ........................................ ... 1.22% 1.22% veristat [kernel.kallsyms] [k] bpf_update_live_stack ... Changelog ========= v1: https://lore.kernel.org/bpf/[email protected]/T/ v1 -> v2: - compute_postorder() fixed to handle jumps with offset -1 (syzbot). - is_state_visited() in patch #9 fixed access to uninitialized `err` (kernel test robot, Dan Carpenter). - Selftests added. - Fixed bug with write marks propagation from callee to caller, see verifier_live_stack.c:caller_stack_write() test case. - Added a patch for __not_msg() annotation for test_loader based tests. v2: https://lore.kernel.org/bpf/20250918-callchain-sensitive-liveness-v2-0-214ed2653eee@gmail.com/ v2 -> v3: - Added __diag_ignore_all("-Woverride-init", ...) in liveness.c for bpf_insn_successors() (suggested by Alexei). Signed-off-by: Eduard Zingerman <[email protected]> ==================== Link: https://patch.msgid.link/20250918-callchain-sensitive-liveness-v3-0-c3cd27bacc60@gmail.com Signed-off-by: Alexei Starovoitov <[email protected]>

Leon Hwang says: ==================== bpf: Allow union argument in trampoline based programs While tracing 'release_pages' with bpfsnoop[0], the verifier reports: The function release_pages arg0 type UNION is unsupported. However, it should be acceptable to trace functions that have 'union' arguments. This patch set enables such support in the verifier by allowing 'union' as a valid argument type. Changes: v3 -> v4: * Address comments from Alexei: * Trim bpftrace output in patch #1 log. * Drop the referenced commit info and the test output in patch #2 log. v2 -> v3: * Address comments from Alexei: * Reuse the existing flag BTF_FMODEL_STRUCT_ARG. * Update the comment of the flag BTF_FMODEL_STRUCT_ARG. v1 -> v2: * Add 16B 'union' argument support in x86_64 trampoline. * Update selftests using bpf_testmod. * Add test case about 16-bytes 'union' argument. * Address comments from Alexei: * Study the patch set about 'struct' argument support. * Update selftests to cover more cases. v1: https://lore.kernel.org/bpf/[email protected]/ Links: [0] https://github.com/bpfsnoop/bpfsnoop ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>

…ostcopy When you run a KVM guest with vhost-net and migrate that guest to another host, and you immediately enable postcopy after starting the migration, there is a big chance that the network connection of the guest won't work anymore on the destination side after the migration. With a debug kernel v6.16.0, there is also a call trace that looks like this: FAULT_FLAG_ALLOW_RETRY missing 881 CPU: 6 UID: 0 PID: 549 Comm: kworker/6:2 Kdump: loaded Not tainted 6.16.0 #56 NONE Hardware name: IBM 3931 LA1 400 (LPAR) Workqueue: events irqfd_inject [kvm] Call Trace: [<00003173cbecc634>] dump_stack_lvl+0x104/0x168 [<00003173cca69588>] handle_userfault+0xde8/0x1310 [<00003173cc756f0c>] handle_pte_fault+0x4fc/0x760 [<00003173cc759212>] __handle_mm_fault+0x452/0xa00 [<00003173cc7599ba>] handle_mm_fault+0x1fa/0x6a0 [<00003173cc73409a>] __get_user_pages+0x4aa/0xba0 [<00003173cc7349e8>] get_user_pages_remote+0x258/0x770 [<000031734be6f052>] get_map_page+0xe2/0x190 [kvm] [<000031734be6f910>] adapter_indicators_set+0x50/0x4a0 [kvm] [<000031734be7f674>] set_adapter_int+0xc4/0x170 [kvm] [<000031734be2f268>] kvm_set_irq+0x228/0x3f0 [kvm] [<000031734be27000>] irqfd_inject+0xd0/0x150 [kvm] [<00003173cc00c9ec>] process_one_work+0x87c/0x1490 [<00003173cc00dda6>] worker_thread+0x7a6/0x1010 [<00003173cc02dc36>] kthread+0x3b6/0x710 [<00003173cbed2f0c>] __ret_from_fork+0xdc/0x7f0 [<00003173cdd737ca>] ret_from_fork+0xa/0x30 3 locks held by kworker/6:2/549: #0: 00000000800bc958 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x7ee/0x1490 #1: 000030f3d527fbd0 ((work_completion)(&irqfd->inject)){+.+.}-{0:0}, at: process_one_work+0x81c/0x1490 #2: 00000000f99862b0 (&mm->mmap_lock){++++}-{3:3}, at: get_map_page+0xa8/0x190 [kvm] The "FAULT_FLAG_ALLOW_RETRY missing" indicates that handle_userfaultfd() saw a page fault request without ALLOW_RETRY flag set, hence userfaultfd cannot remotely resolve it (because the caller was asking for an immediate resolution, aka, FAULT_FLAG_NOWAIT, while remote faults can take time). With that, get_map_page() failed and the irq was lost. We should not be strictly in an atomic environment here and the worker should be sleepable (the call is done during an ioctl from userspace), so we can allow adapter_indicators_set() to just sleep waiting for the remote fault instead. Link: https://issues.redhat.com/browse/RHEL-42486 Signed-off-by: Peter Xu <[email protected]> [thuth: Assembled patch description and fixed some cosmetical issues] Signed-off-by: Thomas Huth <[email protected]> Reviewed-by: Claudio Imbrenda <[email protected]> Acked-by: Janosch Frank <[email protected]> Fixes: f654706 ("KVM: s390/interrupt: do not pin adapter interrupt pages") [frankja: Added fixes tag] Signed-off-by: Janosch Frank <[email protected]>

The function ceph_process_folio_batch() sets folio_batch entries to NULL, which is an illegal state. Before folio_batch_release() crashes due to this API violation, the function ceph_shift_unused_folios_left() is supposed to remove those NULLs from the array. However, since commit ce80b76 ("ceph: introduce ceph_process_folio_batch() method"), this shifting doesn't happen anymore because the "for" loop got moved to ceph_process_folio_batch(), and now the `i` variable that remains in ceph_writepages_start() doesn't get incremented anymore, making the shifting effectively unreachable much of the time. Later, commit 1551ec6 ("ceph: introduce ceph_submit_write() method") added more preconditions for doing the shift, replacing the `i` check (with something that is still just as broken): - if ceph_process_folio_batch() fails, shifting never happens - if ceph_move_dirty_page_in_page_array() was never called (because ceph_process_folio_batch() has returned early for some of various reasons), shifting never happens - if `processed_in_fbatch` is zero (because ceph_process_folio_batch() has returned early for some of the reasons mentioned above or because ceph_move_dirty_page_in_page_array() has failed), shifting never happens Since those two commits, any problem in ceph_process_folio_batch() could crash the kernel, e.g. this way: BUG: kernel NULL pointer dereference, address: 0000000000000034 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: Oops: 0002 [#1] SMP NOPTI CPU: 172 UID: 0 PID: 2342707 Comm: kworker/u778:8 Not tainted 6.15.10-cm4all1-es #714 NONE Hardware name: Dell Inc. PowerEdge R7615/0G9DHV, BIOS 1.6.10 12/08/2023 Workqueue: writeback wb_workfn (flush-ceph-1) RIP: 0010:folios_put_refs+0x85/0x140 Code: 83 c5 01 39 e8 7e 76 48 63 c5 49 8b 5c c4 08 b8 01 00 00 00 4d 85 ed 74 05 41 8b 44 ad 00 48 8b 15 b0 > RSP: 0018:ffffb880af8db778 EFLAGS: 00010207 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000003 RDX: ffffe377cc3b0000 RSI: 0000000000000000 RDI: ffffb880af8db8c0 RBP: 0000000000000000 R08: 000000000000007d R09: 000000000102b86f R10: 0000000000000001 R11: 00000000000000ac R12: ffffb880af8db8c0 R13: 0000000000000000 R14: 0000000000000000 R15: ffff9bd262c97000 FS: 0000000000000000(0000) GS:ffff9c8efc303000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000034 CR3: 0000000160958004 CR4: 0000000000770ef0 PKRU: 55555554 Call Trace: <TASK> ceph_writepages_start+0xeb9/0x1410 The crash can be reproduced easily by changing the ceph_check_page_before_write() return value to `-E2BIG`. (Interestingly, the crash happens only if `huge_zero_folio` has already been allocated; without `huge_zero_folio`, is_huge_zero_folio(NULL) returns true and folios_put_refs() skips NULL entries instead of dereferencing them. That makes reproducing the bug somewhat unreliable. See https://lore.kernel.org/[email protected] for a discussion of this detail.) My suggestion is to move the ceph_shift_unused_folios_left() to right after ceph_process_folio_batch() to ensure it always gets called to fix up the illegal folio_batch state. Cc: [email protected] Fixes: ce80b76 ("ceph: introduce ceph_process_folio_batch() method") Link: https://lore.kernel.org/ceph-devel/[email protected]/ Signed-off-by: Max Kellermann <[email protected]> Reviewed-by: Viacheslav Dubeyko <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>

syzkaller has caught us red-handed once more, this time nesting regular spinlocks behind raw spinlocks: ============================= [ BUG: Invalid wait context ] 6.16.0-rc3-syzkaller-g7b8346bd9fce #0 Not tainted ----------------------------- syz.0.29/3743 is trying to lock: a3ff80008e2e9e18 (&xa->xa_lock#20){....}-{3:3}, at: vgic_put_irq+0xb4/0x190 arch/arm64/kvm/vgic/vgic.c:137 other info that might help us debug this: context-{5:5} 3 locks held by syz.0.29/3743: #0: a3ff80008e2e90a8 (&kvm->slots_lock){+.+.}-{4:4}, at: kvm_vgic_destroy+0x50/0x624 arch/arm64/kvm/vgic/vgic-init.c:499 #1: a3ff80008e2e9fa0 (&kvm->arch.config_lock){+.+.}-{4:4}, at: kvm_vgic_destroy+0x5c/0x624 arch/arm64/kvm/vgic/vgic-init.c:500 #2: 58f0000021be1428 (&vgic_cpu->ap_list_lock){....}-{2:2}, at: vgic_flush_pending_lpis+0x3c/0x31c arch/arm64/kvm/vgic/vgic.c:150 stack backtrace: CPU: 0 UID: 0 PID: 3743 Comm: syz.0.29 Not tainted 6.16.0-rc3-syzkaller-g7b8346bd9fce #0 PREEMPT Hardware name: linux,dummy-virt (DT) Call trace: show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:466 (C) __dump_stack+0x30/0x40 lib/dump_stack.c:94 dump_stack_lvl+0xd8/0x12c lib/dump_stack.c:120 dump_stack+0x1c/0x28 lib/dump_stack.c:129 print_lock_invalid_wait_context kernel/locking/lockdep.c:4833 [inline] check_wait_context kernel/locking/lockdep.c:4905 [inline] __lock_acquire+0x978/0x299c kernel/locking/lockdep.c:5190 lock_acquire+0x14c/0x2e0 kernel/locking/lockdep.c:5871 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0x5c/0x7c kernel/locking/spinlock.c:162 vgic_put_irq+0xb4/0x190 arch/arm64/kvm/vgic/vgic.c:137 vgic_flush_pending_lpis+0x24c/0x31c arch/arm64/kvm/vgic/vgic.c:158 __kvm_vgic_vcpu_destroy+0x44/0x500 arch/arm64/kvm/vgic/vgic-init.c:455 kvm_vgic_destroy+0x100/0x624 arch/arm64/kvm/vgic/vgic-init.c:505 kvm_arch_destroy_vm+0x80/0x138 arch/arm64/kvm/arm.c:244 kvm_destroy_vm virt/kvm/kvm_main.c:1308 [inline] kvm_put_kvm+0x800/0xff8 virt/kvm/kvm_main.c:1344 kvm_vm_release+0x58/0x78 virt/kvm/kvm_main.c:1367 __fput+0x4ac/0x980 fs/file_table.c:465 ____fput+0x20/0x58 fs/file_table.c:493 task_work_run+0x1bc/0x254 kernel/task_work.c:227 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] do_notify_resume+0x1b4/0x270 arch/arm64/kernel/entry-common.c:151 exit_to_user_mode_prepare arch/arm64/kernel/entry-common.c:169 [inline] exit_to_user_mode arch/arm64/kernel/entry-common.c:178 [inline] el0_svc+0xb4/0x160 arch/arm64/kernel/entry-common.c:768 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600 This is of course no good, but is at odds with how LPI refcounts are managed. Solve the locking mess by deferring the release of unreferenced LPIs after the ap_list_lock is released. Mark these to-be-released LPIs specially to avoid racing with vgic_put_irq() and causing a double-free. Since references can only be taken on LPIs with a nonzero refcount, extending the lifetime of freed LPIs is still safe. Reviewed-by: Marc Zyngier <[email protected]> Reported-by: [email protected] Closes: https://lore.kernel.org/kvmarm/[email protected]/ Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Oliver Upton <[email protected]>

Ido Schimmel says: ==================== ipv4: icmp: Fix source IP derivation in presence of VRFs Align IPv4 with IPv6 and in the presence of VRFs generate ICMP error messages with a source IP that is derived from the receiving interface and not from its VRF master. This is especially important when the error messages are "Time Exceeded" messages as it means that utilities like traceroute will show an incorrect packet path. Patches #1-#2 are preparations. Patch #3 is the actual change. Patches #4-#7 make small improvements in the existing traceroute test. Patch #8 extends the traceroute test with VRF test cases for both IPv4 and IPv6. Changes since v1 [1]: * Rebase. [1] https://lore.kernel.org/netdev/[email protected]/ ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Paolo Abeni <[email protected]>

Petr Machata says: ==================== bridge: Allow keeping local FDB entries only on VLAN 0 The bridge FDB contains one local entry per port per VLAN, for the MAC of the port in question, and likewise for the bridge itself. This allows bridge to locally receive and punt "up" any packets whose destination MAC address matches that of one of the bridge interfaces or of the bridge itself. The number of these local "service" FDB entries grows linearly with number of bridge-global VLAN memberships, but that in turn will tend to grow quadratically with number of ports and per-port VLAN memberships. While that does not cause issues during forwarding lookups, it does make dumps impractically slow. As an example, with 100 interfaces, each on 4K VLANs, a full dump of FDB that just contains these 400K local entries, takes 6.5s. That's _without_ considering iproute2 formatting overhead, this is just how long it takes to walk the FDB (repeatedly), serialize it into netlink messages, and parse the messages back in userspace. This is to illustrate that with growing number of ports and VLANs, the time required to dump this repetitive information blows up. Arguably 4K VLANs per interface is not a very realistic configuration, but then modern switches can instead have several hundred interfaces, and we have fielded requests for >1K VLAN memberships per port among customers. FDB entries are currently all kept on a single linked list, and then dumping uses this linked list to walk all entries and dump them in order. When the message buffer is full, the iteration is cut short, and later restarted. Of course, to restart the iteration, it's first necessary to walk the already-dumped front part of the list before starting dumping again. So one possibility is to organize the FDB entries in different structure more amenable to walk restarts. One option is to walk directly the hash table. The advantage is that no auxiliary structure needs to be introduced. With a rough sketch of this approach, the above scenario gets dumped in not quite 3 s, saving over 50 % of time. However hash table iteration requires maintaining an active cursor that must be collected when the dump is aborted. It looks like that would require changes in the NDO protocol to allow to run this cleanup. Moreover, on hash table resize the iteration is simply restarted. FDB dumps are currently not guaranteed to correspond to any one particular state: entries can be missed, or be duplicated. But with hash table iteration we would get that plus the much less graceful resize behavior, where swaths of FDB are duplicated. Another option is to maintain the FDB entries in a red-black tree. We have a PoC of this approach on hand, and the above scenario is dumped in about 2.5 s. Still not as snappy as we'd like it, but better than the hash table. However the savings come at the expense of a more expensive insertion, and require locking during dumps, which blocks insertion. The upside of these approaches is that they provide benefits whatever the FDB contents. But it does not seem like either of these is workable. However we intend to clean up the RB tree PoC and present it for consideration later on in case the trade-offs are considered acceptable. Yet another option might be to use in-kernel FDB filtering, and to filter the local entries when dumping. Unfortunately, this does not help all that much either, because the linked-list walk still needs to happen. Also, with the obvious filtering interface built around ndm_flags / ndm_state filtering, one can't just exclude pure local entries in one query. One needs to dump all non-local entries first, and then to get permanent entries in another run filter local & added_by_user. I.e. one needs to pay the iteration overhead twice, and then integrate the result in userspace. To get significant savings, one would need a very specific knob like "dump, but skip/only include local entries". But if we are adding a local-specific knobs, maybe let's have an option to just not duplicate them in the first place. All this FDB duplication is there merely to make things snappy during forwarding. But high-radix switches with thousands of VLANs typically do not process much traffic in the SW datapath at all, but rather offload vast majority of it. So we could exchange some of the runtime performance for a neater FDB. To that end, in this patchset, introduce a new bridge option, BR_BOOLOPT_FDB_LOCAL_VLAN_0, which when enabled, has local FDB entries installed only on VLAN 0, instead of duplicating them across all VLANs. Then to maintain the local termination behavior, on FDB miss, the bridge does a second lookup on VLAN 0. Enabling this option changes the bridge behavior in expected ways. Since the entries are only kept on VLAN 0, FDB get, flush and dump will not perceive them on non-0 VLANs. And deleting the VLAN 0 entry affects forwarding on all VLANs. This patchset is loosely based on a privately circulated patch by Nikolay Aleksandrov. The patchset progresses as follows: - Patch #1 introduces a bridge option to enable the above feature. Then patches #2 to #5 gradually patch the bridge to do the right thing when the option is enabled. Finally patch #6 adds the UAPI knob and the code for when the feature is enabled or disabled. - Patches #7, #8 and #9 contain fixes and improvements to selftest libraries - Patch #10 contains a new selftest ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>

…PAIR A NULL pointer dereference can occur in tcp_ao_finish_connect() during a connect() system call on a socket with a TCP-AO key added and TCP_REPAIR enabled. The function is called with skb being NULL and attempts to dereference it on tcp_hdr(skb)->seq without a prior skb validation. Fix this by checking if skb is NULL before dereferencing it. The commentary is taken from bpf_skops_established(), which is also called in the same flow. Unlike the function being patched, bpf_skops_established() validates the skb before dereferencing it. int main(void){ struct sockaddr_in sockaddr; struct tcp_ao_add tcp_ao; int sk; int one = 1; memset(&sockaddr,'\0',sizeof(sockaddr)); memset(&tcp_ao,'\0',sizeof(tcp_ao)); sk = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); sockaddr.sin_family = AF_INET; memcpy(tcp_ao.alg_name,"cmac(aes128)",12); memcpy(tcp_ao.key,"ABCDEFGHABCDEFGH",16); tcp_ao.keylen = 16; memcpy(&tcp_ao.addr,&sockaddr,sizeof(sockaddr)); setsockopt(sk, IPPROTO_TCP, TCP_AO_ADD_KEY, &tcp_ao, sizeof(tcp_ao)); setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &one, sizeof(one)); sockaddr.sin_family = AF_INET; sockaddr.sin_port = htobe16(123); inet_aton("127.0.0.1", &sockaddr.sin_addr); connect(sk,(struct sockaddr *)&sockaddr,sizeof(sockaddr)); return 0; } $ gcc tcp-ao-nullptr.c -o tcp-ao-nullptr -Wall $ unshare -Urn BUG: kernel NULL pointer dereference, address: 00000000000000b6 PGD 1f648d067 P4D 1f648d067 PUD 1982e8067 PMD 0 Oops: Oops: 0000 [#1] SMP NOPTI Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020 RIP: 0010:tcp_ao_finish_connect (net/ipv4/tcp_ao.c:1182) Fixes: 7c2ffaf ("net/tcp: Calculate TCP-AO traffic keys") Signed-off-by: Anderson Nascimento <[email protected]> Reviewed-by: Dmitry Safonov <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>

When igc_led_setup() fails, igc_probe() fails and triggers kernel panic in free_netdev() since unregister_netdev() is not called. [1] This behavior can be tested using fault-injection framework, especially the failslab feature. [2] Since LED support is not mandatory, treat LED setup failures as non-fatal and continue probe with a warning message, consequently avoiding the kernel panic. [1] kernel BUG at net/core/dev.c:12047! Oops: invalid opcode: 0000 [#1] SMP NOPTI CPU: 0 UID: 0 PID: 937 Comm: repro-igc-led-e Not tainted 6.17.0-rc4-enjuk-tnguy-00865-gc4940196ab02 #64 PREEMPT(voluntary) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:free_netdev+0x278/0x2b0 [...] Call Trace: <TASK> igc_probe+0x370/0x910 local_pci_probe+0x3a/0x80 pci_device_probe+0xd1/0x200 [...] [2] #!/bin/bash -ex FAILSLAB_PATH=/sys/kernel/debug/failslab/ DEVICE=0000:00:05.0 START_ADDR=$(grep " igc_led_setup" /proc/kallsyms \ | awk '{printf("0x%s", $1)}') END_ADDR=$(printf "0x%x" $((START_ADDR + 0x100))) echo $START_ADDR > $FAILSLAB_PATH/require-start echo $END_ADDR > $FAILSLAB_PATH/require-end echo 1 > $FAILSLAB_PATH/times echo 100 > $FAILSLAB_PATH/probability echo N > $FAILSLAB_PATH/ignore-gfp-wait echo $DEVICE > /sys/bus/pci/drivers/igc/bind Fixes: ea57870 ("igc: Add support for LEDs on i225/i226") Signed-off-by: Kohei Enju <[email protected]> Reviewed-by: Paul Menzel <[email protected]> Reviewed-by: Aleksandr Loktionov <[email protected]> Reviewed-by: Vitaly Lifshits <[email protected]> Reviewed-by: Kurt Kanzenbach <[email protected]> Tested-by: Mor Bar-Gabay <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>

…rnal() A crash was observed with the following output: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 1 UID: 0 PID: 2899 Comm: syz.2.399 Not tainted 6.17.0-rc5+ #5 PREEMPT(none) RIP: 0010:trace_kprobe_create_internal+0x3fc/0x1440 kernel/trace/trace_kprobe.c:911 Call Trace: <TASK> trace_kprobe_create_cb+0xa2/0xf0 kernel/trace/trace_kprobe.c:1089 trace_probe_create+0xf1/0x110 kernel/trace/trace_probe.c:2246 dyn_event_create+0x45/0x70 kernel/trace/trace_dynevent.c:128 create_or_delete_trace_kprobe+0x5e/0xc0 kernel/trace/trace_kprobe.c:1107 trace_parse_run_command+0x1a5/0x330 kernel/trace/trace.c:10785 vfs_write+0x2b6/0xd00 fs/read_write.c:684 ksys_write+0x129/0x240 fs/read_write.c:738 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x5d/0x2d0 arch/x86/entry/syscall_64.c:94 </TASK> Function kmemdup() may return NULL in trace_kprobe_create_internal(), add check for it's return value. Link: https://lore.kernel.org/all/[email protected]/ Fixes: 33b4e38 ("tracing: kprobe-event: Allocate string buffers from heap") Signed-off-by: Wang Liang <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>

clang-18.1.3 on Ubuntu 24.04.2 reports warning: thread_loop.c:41:23: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths] 41 | : /* in */ [i] "r" (i), [len] "r" (len) | ^ thread_loop.c:37:8: note: use constraint modifier "w" 37 | "add %[i], %[i], #1\n" | ^~~~ | %w[i] thread_loop.c:41:23: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths] 41 | : /* in */ [i] "r" (i), [len] "r" (len) | ^ thread_loop.c:37:14: note: use constraint modifier "w" 37 | "add %[i], %[i], #1\n" | ^~~~ | %w[i] thread_loop.c:41:23: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths] 41 | : /* in */ [i] "r" (i), [len] "r" (len) | ^ thread_loop.c:38:8: note: use constraint modifier "w" 38 | "cmp %[i], %[len]\n" | ^~~~ | %w[i] thread_loop.c:41:38: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths] 41 | : /* in */ [i] "r" (i), [len] "r" (len) | ^ thread_loop.c:38:14: note: use constraint modifier "w" 38 | "cmp %[i], %[len]\n" | ^~~~~~ | %w[len] Use the modifier "w" for 32-bit register access. Signed-off-by: Leo Yan <[email protected]>

Check in snd_intel_dsp_check_soundwire() that the pointer returned by ACPI_HANDLE() is not NULL, before passing it on to other functions. The original code assumed a non-NULL return, but if it was unexpectedly NULL it would end up passed to acpi_walk_namespace() as the start point, and would result in [ 3.219028] BUG: kernel NULL pointer dereference, address: 0000000000000018 [ 3.219029] #PF: supervisor read access in kernel mode [ 3.219030] #PF: error_code(0x0000) - not-present page [ 3.219031] PGD 0 P4D 0 [ 3.219032] Oops: Oops: 0000 [#1] SMP NOPTI [ 3.219035] CPU: 2 UID: 0 PID: 476 Comm: (udev-worker) Tainted: G S AW E 6.17.0-rc5-test #1 PREEMPT(voluntary) [ 3.219038] Tainted: [S]=CPU_OUT_OF_SPEC, [A]=OVERRIDDEN_ACPI_TABLE, [W]=WARN, [E]=UNSIGNED_MODULE [ 3.219040] RIP: 0010:acpi_ns_walk_namespace+0xb5/0x480 This problem was triggered by a bugged DSDT that the kernel couldn't parse. But it shouldn't be possible to SEGFAULT the kernel just because of some bugs in ACPI. Fixes: 0650857 ("ALSA: hda: add autodetection for SoundWire") Signed-off-by: Richard Fitzgerald <[email protected]> Signed-off-by: Takashi Iwai <[email protected]>

…own_locked() When 9770b42 ("crypto: ccp - Move dev_info/err messages for SEV/SNP init and shutdown") moved the error messages dumping so that they don't need to be issued by the callers, it missed the case where __sev_firmware_shutdown() calls __sev_platform_shutdown_locked() with a NULL argument which leads to a NULL ptr deref on the shutdown path, during suspend to disk: #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 0 UID: 0 PID: 983 Comm: hib.sh Not tainted 6.17.0-rc4+ #1 PREEMPT(voluntary) Hardware name: Supermicro Super Server/H12SSL-i, BIOS 2.5 09/08/2022 RIP: 0010:__sev_platform_shutdown_locked.cold+0x0/0x21 [ccp] That rIP is: 00000000000006fd <__sev_platform_shutdown_locked.cold>: 6fd: 8b 13 mov (%rbx),%edx 6ff: 48 8b 7d 00 mov 0x0(%rbp),%rdi 703: 89 c1 mov %eax,%ecx Code: 74 05 31 ff 41 89 3f 49 8b 3e 89 ea 48 c7 c6 a0 8e 54 a0 41 bf 92 ff ff ff e8 e5 2e 09 e1 c6 05 2a d4 38 00 01 e9 26 af ff ff <8b> 13 48 8b 7d 00 89 c1 48 c7 c6 18 90 54 a0 89 44 24 04 e8 c1 2e RSP: 0018:ffffc90005467d00 EFLAGS: 00010282 RAX: 00000000ffffff92 RBX: 0000000000000000 RCX: 0000000000000000 ^^^^^^^^^^^^^^^^ and %rbx is nice and clean. Call Trace: <TASK> __sev_firmware_shutdown.isra.0 sev_dev_destroy psp_dev_destroy sp_destroy pci_device_shutdown device_shutdown kernel_power_off hibernate.cold state_store kernfs_fop_write_iter vfs_write ksys_write do_syscall_64 entry_SYSCALL_64_after_hwframe Pass in a pointer to the function-local error var in the caller. With that addressed, suspending the ccp shows the error properly at least: ccp 0000:47:00.1: sev command 0x2 timed out, disabling PSP ccp 0000:47:00.1: SEV: failed to SHUTDOWN error 0x0, rc -110 SEV-SNP: Leaking PFN range 0x146800-0x146a00 SEV-SNP: PFN 0x146800 unassigned, dumping non-zero entries in 2M PFN region: [0x146800 - 0x146a00] ... ccp 0000:47:00.1: SEV-SNP firmware shutdown failed, rc -16, error 0x0 ACPI: PM: Preparing to enter system sleep state S5 kvm: exiting hardware virtualization reboot: Power down Btw, this driver is crying to be cleaned up to pass in a proper I/O struct which can be used to store information between the different functions, otherwise stuff like that will happen in the future again. Fixes: 9770b42 ("crypto: ccp - Move dev_info/err messages for SEV/SNP init and shutdown") Signed-off-by: Borislav Petkov (AMD) <[email protected]> Cc: <[email protected]> Reviewed-by: Ashish Kalra <[email protected]> Acked-by: Tom Lendacky <[email protected]> Signed-off-by: Herbert Xu <[email protected]>

clang-18.1.3 on Ubuntu 24.04.2 reports warning: thread_loop.c:41:23: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths] 41 | : /* in */ [i] "r" (i), [len] "r" (len) | ^ thread_loop.c:37:8: note: use constraint modifier "w" 37 | "add %[i], %[i], #1\n" | ^~~~ | %w[i] thread_loop.c:41:23: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths] 41 | : /* in */ [i] "r" (i), [len] "r" (len) | ^ thread_loop.c:37:14: note: use constraint modifier "w" 37 | "add %[i], %[i], #1\n" | ^~~~ | %w[i] thread_loop.c:41:23: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths] 41 | : /* in */ [i] "r" (i), [len] "r" (len) | ^ thread_loop.c:38:8: note: use constraint modifier "w" 38 | "cmp %[i], %[len]\n" | ^~~~ | %w[i] thread_loop.c:41:38: warning: value size does not match register size specified by the constraint and modifier [-Wasm-operand-widths] 41 | : /* in */ [i] "r" (i), [len] "r" (len) | ^ thread_loop.c:38:14: note: use constraint modifier "w" 38 | "cmp %[i], %[len]\n" | ^~~~~~ | %w[len] Use the modifier "w" for 32-bit register access. Signed-off-by: Leo Yan <[email protected]>

Add 0x29 as the accelerometer address for the Dell Latitude E6530 to lis3lv02d_devices[]. The address was verified as below: $ cd /sys/bus/pci/drivers/i801_smbus/0000:00:1f.3 $ ls -d i2c-* i2c-20 $ sudo modprobe i2c-dev $ sudo i2cdetect 20 WARNING! This program can confuse your I2C bus, cause data loss and worse! I will probe file /dev/i2c-20. I will probe address range 0x08-0x77. Continue? [Y/n] Y 0 1 2 3 4 5 6 7 8 9 a b c d e f 00: 08 -- -- -- -- -- -- -- 10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 20: -- -- -- -- -- -- -- -- -- UU -- 2b -- -- -- -- 30: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 40: -- -- -- -- 44 -- -- -- -- -- -- -- -- -- -- -- 50: UU -- 52 -- -- -- -- -- -- -- -- -- -- -- -- -- 60: -- 61 -- -- -- -- -- -- -- -- -- -- -- -- -- -- 70: -- -- -- -- -- -- -- -- $ cat /proc/cmdline BOOT_IMAGE=/vmlinuz-linux-cachyos-bore root=UUID=<redacted> rw loglevel=3 quiet dell_lis3lv02d.probe_i2c_addr=1 $ sudo dmesg [ 0.000000] Linux version 6.16.6-2-cachyos-bore (linux-cachyos-bore@cachyos) (gcc (GCC) 15.2.1 20250813, GNU ld (GNU Binutils) 2.45.0) #1 SMP PREEMPT_DYNAMIC Thu, 11 Sep 2025 16:01:12 +0000 […] [ 0.000000] DMI: Dell Inc. Latitude E6530/07Y85M, BIOS A22 11/30/2018 […] [ 5.166442] i2c i2c-20: Probing for lis3lv02d on address 0x29 [ 5.167854] i2c i2c-20: Detected lis3lv02d on address 0x29, please report this upstream to [email protected] so that a quirk can be added Signed-off-by: Nickolay Goppen <[email protected]> Reviewed-by: Hans de Goede <[email protected]> Link: https://patch.msgid.link/20250917-dell-lis3lv02d-latitude-e6530-v1-1-8a6dec4e51e9@mainlining.org Reviewed-by: Ilpo Järvinen <[email protected]> Signed-off-by: Ilpo Järvinen <[email protected]>

The kernel forbids the creation of non-FDB nexthop groups with FDB nexthops: # ip nexthop add id 1 via 192.0.2.1 fdb # ip nexthop add id 2 group 1 Error: Non FDB nexthop group cannot have fdb nexthops. And vice versa: # ip nexthop add id 3 via 192.0.2.2 dev dummy1 # ip nexthop add id 4 group 3 fdb Error: FDB nexthop group can only have fdb nexthops. However, as long as no routes are pointing to a non-FDB nexthop group, the kernel allows changing the type of a nexthop from FDB to non-FDB and vice versa: # ip nexthop add id 5 via 192.0.2.2 dev dummy1 # ip nexthop add id 6 group 5 # ip nexthop replace id 5 via 192.0.2.2 fdb # echo $? 0 This configuration is invalid and can result in a NPD [1] since FDB nexthops are not associated with a nexthop device: # ip route add 198.51.100.1/32 nhid 6 # ping 198.51.100.1 Fix by preventing nexthop FDB status change while the nexthop is in a group: # ip nexthop add id 7 via 192.0.2.2 dev dummy1 # ip nexthop add id 8 group 7 # ip nexthop replace id 7 via 192.0.2.2 fdb Error: Cannot change nexthop FDB status while in a group. [1] BUG: kernel NULL pointer dereference, address: 00000000000003c0 [...] Oops: Oops: 0000 [#1] SMP CPU: 6 UID: 0 PID: 367 Comm: ping Not tainted 6.17.0-rc6-virtme-gb65678cacc03 #1 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-4.fc41 04/01/2014 RIP: 0010:fib_lookup_good_nhc+0x1e/0x80 [...] Call Trace: <TASK> fib_table_lookup+0x541/0x650 ip_route_output_key_hash_rcu+0x2ea/0x970 ip_route_output_key_hash+0x55/0x80 __ip4_datagram_connect+0x250/0x330 udp_connect+0x2b/0x60 __sys_connect+0x9c/0xd0 __x64_sys_connect+0x18/0x20 do_syscall_64+0xa4/0x2a0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Fixes: 38428d6 ("nexthop: support for fdb ecmp nexthops") Reported-by: [email protected] Closes: https://lore.kernel.org/netdev/[email protected]/ Tested-by: [email protected] Signed-off-by: Ido Schimmel <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>

Ido Schimmel says: ==================== nexthop: Various fixes Patch #1 fixes a NPD that was recently reported by syzbot. Patch #2 fixes an issue in the existing FIB nexthop selftest. Patch #3 extends the selftest with test cases for the bug that was fixed in the first patch. ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>

Running sha224_kunit on a KMSAN-enabled kernel results in a crash in kmsan_internal_set_shadow_origin(): BUG: unable to handle page fault for address: ffffbc3840291000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 1810067 P4D 1810067 PUD 192d067 PMD 3c17067 PTE 0 Oops: 0000 [#1] SMP NOPTI CPU: 0 UID: 0 PID: 81 Comm: kunit_try_catch Tainted: G N 6.17.0-rc3 #10 PREEMPT(voluntary) Tainted: [N]=TEST Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 RIP: 0010:kmsan_internal_set_shadow_origin+0x91/0x100 [...] Call Trace: <TASK> __msan_memset+0xee/0x1a0 sha224_final+0x9e/0x350 test_hash_buffer_overruns+0x46f/0x5f0 ? kmsan_get_shadow_origin_ptr+0x46/0xa0 ? __pfx_test_hash_buffer_overruns+0x10/0x10 kunit_try_run_case+0x198/0xa00 This occurs when memset() is called on a buffer that is not 4-byte aligned and extends to the end of a guard page, i.e. the next page is unmapped. The bug is that the loop at the end of kmsan_internal_set_shadow_origin() accesses the wrong shadow memory bytes when the address is not 4-byte aligned. Since each 4 bytes are associated with an origin, it rounds the address and size so that it can access all the origins that contain the buffer. However, when it checks the corresponding shadow bytes for a particular origin, it incorrectly uses the original unrounded shadow address. This results in reads from shadow memory beyond the end of the buffer's shadow memory, which crashes when that memory is not mapped. To fix this, correctly align the shadow address before accessing the 4 shadow bytes corresponding to each origin. Link: https://lkml.kernel.org/r/[email protected] Fixes: 2ef3cec ("kmsan: do not wipe out origin when doing partial unpoisoning") Signed-off-by: Eric Biggers <[email protected]> Tested-by: Alexander Potapenko <[email protected]> Reviewed-by: Alexander Potapenko <[email protected]> Cc: Dmitriy Vyukov <[email protected]> Cc: Marco Elver <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

When the PAGEMAP_SCAN ioctl is invoked with vec_len = 0 reaches pagemap_scan_backout_range(), kernel panics with null-ptr-deref: [ 44.936808] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI [ 44.937797] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] [ 44.938391] CPU: 1 UID: 0 PID: 2480 Comm: reproducer Not tainted 6.17.0-rc6 #22 PREEMPT(none) [ 44.939062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 [ 44.939935] RIP: 0010:pagemap_scan_thp_entry.isra.0+0x741/0xa80 <snip registers, unreliable trace> [ 44.946828] Call Trace: [ 44.947030] <TASK> [ 44.949219] pagemap_scan_pmd_entry+0xec/0xfa0 [ 44.952593] walk_pmd_range.isra.0+0x302/0x910 [ 44.954069] walk_pud_range.isra.0+0x419/0x790 [ 44.954427] walk_p4d_range+0x41e/0x620 [ 44.954743] walk_pgd_range+0x31e/0x630 [ 44.955057] __walk_page_range+0x160/0x670 [ 44.956883] walk_page_range_mm+0x408/0x980 [ 44.958677] walk_page_range+0x66/0x90 [ 44.958984] do_pagemap_scan+0x28d/0x9c0 [ 44.961833] do_pagemap_cmd+0x59/0x80 [ 44.962484] __x64_sys_ioctl+0x18d/0x210 [ 44.962804] do_syscall_64+0x5b/0x290 [ 44.963111] entry_SYSCALL_64_after_hwframe+0x76/0x7e vec_len = 0 in pagemap_scan_init_bounce_buffer() means no buffers are allocated and p->vec_buf remains set to NULL. This breaks an assumption made later in pagemap_scan_backout_range(), that page_region is always allocated for p->vec_buf_index. Fix it by explicitly checking p->vec_buf for NULL before dereferencing. Other sites that might run into same deref-issue are already (directly or transitively) protected by checking p->vec_buf. Note: From PAGEMAP_SCAN man page, it seems vec_len = 0 is valid when no output is requested and it's only the side effects caller is interested in, hence it passes check in pagemap_scan_get_args(). This issue was found by syzkaller. Link: https://lkml.kernel.org/r/[email protected] Fixes: 52526ca ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs") Signed-off-by: Jakub Acs <[email protected]> Reviewed-by: Muhammad Usama Anjum <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Jinjiang Tu <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Penglei Jiang <[email protected]> Cc: Mark Brown <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Andrei Vagin <[email protected]> Cc: "Michał Mirosław" <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

Dave Chinner and others added 30 commits February 1, 2022 14:14

Merge tag 'kvm-riscv-fixes-5.17-1' of https://github.com/kvm-riscv/linux

cb4f084

into HEAD KVM/riscv fixes for 5.17, take #1 - Rework guest entry logic - Make CY, TM, and IR counters accessible in VU mode - Fix SBI implementation version

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[test] master_test #1

[test] master_test #1

Uh oh!

kernel-patches-bot commented Feb 15, 2022

Uh oh!

Uh oh!

[test] master_test #1

[test] master_test #1

Uh oh!

Conversation

kernel-patches-bot commented Feb 15, 2022

Uh oh!

Uh oh!