-
Notifications
You must be signed in to change notification settings - Fork 141
Set advertized speed to 10gbps #7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Please consider revisiting this issue in 5 years to reevaluate the advertized speed. Additionally, please make a pull request to torvalds/linux, as only collaborators are allowed to do that. |
The Arch Linux kernel follows upstream, so if upstream applies that patch it will end up in the Arch Linux kernel. We don't do any kernel development or apply patches which are not critical bugs. Note that one does not make pull requests to |
We shouldn't be calling runtime PM APIs from within the genpd enable/disable path for a couple reasons. First, this causes an AA lockdep splat[1] because genpd can call into genpd code again while holding the genpd lock. WARNING: possible recursive locking detected 5.19.0-rc2-lockdep+ #7 Not tainted -------------------------------------------- kworker/2:1/49 is trying to acquire lock: ffffffeea0370788 (&genpd->mlock){+.+.}-{3:3}, at: genpd_lock_mtx+0x24/0x30 but task is already holding lock: ffffffeea03710a8 (&genpd->mlock){+.+.}-{3:3}, at: genpd_lock_mtx+0x24/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&genpd->mlock); lock(&genpd->mlock); *** DEADLOCK *** May be due to missing lock nesting notation 3 locks held by kworker/2:1/49: #0: 74ffff80811a5748 ((wq_completion)pm){+.+.}-{0:0}, at: process_one_work+0x320/0x5fc #1: ffffffc008537cf8 ((work_completion)(&genpd->power_off_work)){+.+.}-{0:0}, at: process_one_work+0x354/0x5fc #2: ffffffeea03710a8 (&genpd->mlock){+.+.}-{3:3}, at: genpd_lock_mtx+0x24/0x30 stack backtrace: CPU: 2 PID: 49 Comm: kworker/2:1 Not tainted 5.19.0-rc2-lockdep+ #7 Hardware name: Google Lazor (rev3 - 8) with KB Backlight (DT) Workqueue: pm genpd_power_off_work_fn Call trace: dump_backtrace+0x1a0/0x200 show_stack+0x24/0x30 dump_stack_lvl+0x7c/0xa0 dump_stack+0x18/0x44 __lock_acquire+0xb38/0x3634 lock_acquire+0x180/0x2d4 __mutex_lock_common+0x118/0xe30 mutex_lock_nested+0x70/0x7c genpd_lock_mtx+0x24/0x30 genpd_runtime_suspend+0x2f0/0x414 __rpm_callback+0xdc/0x1b8 rpm_callback+0x4c/0xcc rpm_suspend+0x21c/0x5f0 rpm_idle+0x17c/0x1e0 __pm_runtime_idle+0x78/0xcc gdsc_disable+0x24c/0x26c _genpd_power_off+0xd4/0x1c4 genpd_power_off+0x2d8/0x41c genpd_power_off_work_fn+0x60/0x94 process_one_work+0x398/0x5fc worker_thread+0x42c/0x6c4 kthread+0x194/0x1b4 ret_from_fork+0x10/0x20 Second, this confuses runtime PM on CoachZ for the camera devices by causing the camera clock controller's runtime PM usage_count to go negative after resuming from suspend. This is because runtime PM is being used on the clock controller while runtime PM is disabled for the device. The reason for the negative count is because a GDSC is represented as a genpd and each genpd that is attached to a device is resumed during the noirq phase of system wide suspend/resume (see the noirq suspend ops assignment in pm_genpd_init() for more details). The camera GDSCs are attached to camera devices with the 'power-domains' property in DT. Every device has runtime PM disabled in the late system suspend phase via __device_suspend_late(). Runtime PM is not usable until runtime PM is enabled in device_resume_early(). The noirq phases run after the 'late' and before the 'early' phase of suspend/resume. When the genpds are resumed in genpd_resume_noirq(), we call down into gdsc_enable() that calls pm_runtime_resume_and_get() and that returns -EACCES to indicate failure to resume because runtime PM is disabled for all devices. Upon closer inspection, calling runtime PM APIs like this in the GDSC driver doesn't make sense. It was intended to make sure the GDSC for the clock controller providing other GDSCs was enabled, specifically the MMCX GDSC for the display clk controller on SM8250 (sm8250-dispcc), so that GDSC register accesses succeeded. That will already happen because we make the 'dev->pm_domain' a parent domain of each GDSC we register in gdsc_register() via pm_genpd_add_subdomain(). When any of these GDSCs are accessed, we'll enable the parent domain (in this specific case MMCX). We also remove any getting of runtime PM during registration, because when a genpd is registered it increments the count on the parent if the genpd itself is already enabled. Cc: Dmitry Baryshkov <[email protected]> Cc: Johan Hovold <[email protected]> Cc: Ulf Hansson <[email protected]> Cc: Taniya Das <[email protected]> Cc: Satya Priya <[email protected]> Reviewed-by: Douglas Anderson <[email protected]> Tested-by: Douglas Anderson <[email protected]> Cc: Matthias Kaehlcke <[email protected]> Reported-by: Stephen Boyd <[email protected]> Link: https://lore.kernel.org/r/CAE-0n52xbZeJ66RaKwggeRB57fUAwjvxGxfFMKOKJMKVyFTe+w@mail.gmail.com [1] Fixes: 1b77183 ("clk: qcom: gdsc: enable optional power domain support") Signed-off-by: Stephen Boyd <[email protected]> Link: https://lore.kernel.org/r/[email protected] Tested-by: Johan Hovold <[email protected]> Reviewed-by: Johan Hovold <[email protected]> Signed-off-by: Stephen Boyd <[email protected]>
…ata and not linked with libtraceevent When we have a perf.data file with tracepoints, such as: # perf evlist -f probe_perf:lzma_decompress_to_file # Tip: use 'perf evlist --trace-fields' to show fields for tracepoint events # We end up segfaulting when using perf built with NO_LIBTRACEEVENT=1 by trying to find an evsel with a NULL 'event_name' variable: (gdb) run report --stdio -f Starting program: /root/bin/perf report --stdio -f Program received signal SIGSEGV, Segmentation fault. 0x000000000055219d in find_evsel (evlist=0xfda7b0, event_name=0x0) at util/sort.c:2830 warning: Source file is more recent than executable. 2830 if (event_name[0] == '%') { Missing separate debuginfos, use: dnf debuginfo-install bzip2-libs-1.0.8-11.fc36.x86_64 cyrus-sasl-lib-2.1.27-18.fc36.x86_64 elfutils-debuginfod-client-0.188-3.fc36.x86_64 elfutils-libelf-0.188-3.fc36.x86_64 elfutils-libs-0.188-3.fc36.x86_64 glibc-2.35-20.fc36.x86_64 keyutils-libs-1.6.1-4.fc36.x86_64 krb5-libs-1.19.2-12.fc36.x86_64 libbrotli-1.0.9-7.fc36.x86_64 libcap-2.48-4.fc36.x86_64 libcom_err-1.46.5-2.fc36.x86_64 libcurl-7.82.0-12.fc36.x86_64 libevent-2.1.12-6.fc36.x86_64 libgcc-12.2.1-4.fc36.x86_64 libidn2-2.3.4-1.fc36.x86_64 libnghttp2-1.51.0-1.fc36.x86_64 libpsl-0.21.1-5.fc36.x86_64 libselinux-3.3-4.fc36.x86_64 libssh-0.9.6-4.fc36.x86_64 libstdc++-12.2.1-4.fc36.x86_64 libunistring-1.0-1.fc36.x86_64 libunwind-1.6.2-2.fc36.x86_64 libxcrypt-4.4.33-4.fc36.x86_64 libzstd-1.5.2-2.fc36.x86_64 numactl-libs-2.0.14-5.fc36.x86_64 opencsd-1.2.0-1.fc36.x86_64 openldap-2.6.3-1.fc36.x86_64 openssl-libs-3.0.5-2.fc36.x86_64 slang-2.3.2-11.fc36.x86_64 xz-libs-5.2.5-9.fc36.x86_64 zlib-1.2.11-33.fc36.x86_64 (gdb) bt #0 0x000000000055219d in find_evsel (evlist=0xfda7b0, event_name=0x0) at util/sort.c:2830 #1 0x0000000000552416 in add_dynamic_entry (evlist=0xfda7b0, tok=0xffb6eb "trace", level=2) at util/sort.c:2976 #2 0x0000000000552d26 in sort_dimension__add (list=0xf93e00 <perf_hpp_list>, tok=0xffb6eb "trace", evlist=0xfda7b0, level=2) at util/sort.c:3193 #3 0x0000000000552e1c in setup_sort_list (list=0xf93e00 <perf_hpp_list>, str=0xffb6eb "trace", evlist=0xfda7b0) at util/sort.c:3227 #4 0x00000000005532fa in __setup_sorting (evlist=0xfda7b0) at util/sort.c:3381 #5 0x0000000000553cdc in setup_sorting (evlist=0xfda7b0) at util/sort.c:3608 #6 0x000000000042eb9f in cmd_report (argc=0, argv=0x7fffffffe470) at builtin-report.c:1596 #7 0x00000000004aee7e in run_builtin (p=0xf64ca0 <commands+288>, argc=3, argv=0x7fffffffe470) at perf.c:330 #8 0x00000000004af0f2 in handle_internal_command (argc=3, argv=0x7fffffffe470) at perf.c:384 #9 0x00000000004af241 in run_argv (argcp=0x7fffffffe29c, argv=0x7fffffffe290) at perf.c:428 torvalds#10 0x00000000004af5fc in main (argc=3, argv=0x7fffffffe470) at perf.c:562 (gdb) So check if we have tracepoint events in add_dynamic_entry() and bail out instead: # perf report --stdio -f This perf binary isn't linked with libtraceevent, can't process probe_perf:lzma_decompress_to_file Error: Unknown --sort key: `trace' # Fixes: 378ef0f ("perf build: Use libtraceevent from the system") Acked-by: Ian Rogers <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Namhyung Kim <[email protected]> Link: http://lore.kernel.org/lkml/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Commit 9130b8d ("SUNRPC: allow for upcalls for the same uid but different gss service") introduced `auth` argument to __gss_find_upcall(), but in gss_pipe_downcall() it was left as NULL since it (and auth->service) was not (yet) determined. When multiple upcalls with the same uid and different service are ongoing, it could happen that __gss_find_upcall(), which returns the first match found in the pipe->in_downcall list, could not find the correct gss_msg corresponding to the downcall we are looking for. Moreover, it might return a msg which is not sent to rpc.gssd yet. We could see mount.nfs process hung in D state with multiple mount.nfs are executed in parallel. The call trace below is of CentOS 7.9 kernel-3.10.0-1160.24.1.el7.x86_64 but we observed the same hang w/ elrepo kernel-ml-6.0.7-1.el7. PID: 71258 TASK: ffff91ebd4be0000 CPU: 36 COMMAND: "mount.nfs" #0 [ffff9203ca3234f8] __schedule at ffffffffa3b8899f #1 [ffff9203ca323580] schedule at ffffffffa3b88eb9 #2 [ffff9203ca323590] gss_cred_init at ffffffffc0355818 [auth_rpcgss] #3 [ffff9203ca323658] rpcauth_lookup_credcache at ffffffffc0421ebc [sunrpc] #4 [ffff9203ca3236d8] gss_lookup_cred at ffffffffc0353633 [auth_rpcgss] #5 [ffff9203ca3236e8] rpcauth_lookupcred at ffffffffc0421581 [sunrpc] #6 [ffff9203ca323740] rpcauth_refreshcred at ffffffffc04223d3 [sunrpc] #7 [ffff9203ca3237a0] call_refresh at ffffffffc04103dc [sunrpc] #8 [ffff9203ca3237b8] __rpc_execute at ffffffffc041e1c9 [sunrpc] #9 [ffff9203ca323820] rpc_execute at ffffffffc0420a48 [sunrpc] The scenario is like this. Let's say there are two upcalls for services A and B, A -> B in pipe->in_downcall, B -> A in pipe->pipe. When rpc.gssd reads pipe to get the upcall msg corresponding to service B from pipe->pipe and then writes the response, in gss_pipe_downcall the msg corresponding to service A will be picked because only uid is used to find the msg and it is before the one for B in pipe->in_downcall. And the process waiting for the msg corresponding to service A will be woken up. Actual scheduing of that process might be after rpc.gssd processes the next msg. In rpc_pipe_generic_upcall it clears msg->errno (for A). The process is scheduled to see gss_msg->ctx == NULL and gss_msg->msg.errno == 0, therefore it cannot break the loop in gss_create_upcall and is never woken up after that. This patch adds a simple check to ensure that a msg which is not sent to rpc.gssd yet is not chosen as the matching upcall upon receiving a downcall. Signed-off-by: minoura makoto <[email protected]> Signed-off-by: Hiroshi Shimamoto <[email protected]> Tested-by: Hiroshi Shimamoto <[email protected]> Cc: Trond Myklebust <[email protected]> Fixes: 9130b8d ("SUNRPC: allow for upcalls for same uid but different gss service") Signed-off-by: Trond Myklebust <[email protected]>
Using GFP_KERNEL in preemption disable context, causing below warning when CONFIG_DEBUG_ATOMIC_SLEEP is enabled. [ 32.542271] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274 [ 32.550883] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0 [ 32.558707] preempt_count: 1, expected: 0 [ 32.562710] RCU nest depth: 0, expected: 0 [ 32.566800] CPU: 3 PID: 1 Comm: swapper/0 Tainted: G W 6.2.0-rc2-00269-gae9dcb91c606 #7 [ 32.576188] Hardware name: Marvell CN106XX board (DT) [ 32.581232] Call trace: [ 32.583670] dump_backtrace.part.0+0xe0/0xf0 [ 32.587937] show_stack+0x18/0x30 [ 32.591245] dump_stack_lvl+0x68/0x84 [ 32.594900] dump_stack+0x18/0x34 [ 32.598206] __might_resched+0x12c/0x160 [ 32.602122] __might_sleep+0x48/0xa0 [ 32.605689] __kmem_cache_alloc_node+0x2b8/0x2e0 [ 32.610301] __kmalloc+0x58/0x190 [ 32.613610] otx2_sq_aura_pool_init+0x1a8/0x314 [ 32.618134] otx2_open+0x1d4/0x9d0 To avoid use of GFP_ATOMIC for memory allocation, disable preemption after all memory allocation is done. Fixes: 4af1b64 ("octeontx2-pf: Fix lmtst ID used in aura free") Signed-off-by: Geetha sowjanya <[email protected]> Signed-off-by: Sunil Kovvuri Goutham <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Signed-off-by: David S. Miller <[email protected]>
During EEH error injection testing, a deadlock was encountered in the tg3 driver when tg3_io_error_detected() was attempting to cancel outstanding reset tasks: crash> foreach UN bt ... PID: 159 TASK: c0000000067c6000 CPU: 8 COMMAND: "eehd" ... #5 [c00000000681f990] __cancel_work_timer at c00000000019fd18 #6 [c00000000681fa30] tg3_io_error_detected at c00800000295f098 [tg3] #7 [c00000000681faf0] eeh_report_error at c00000000004e25c ... PID: 290 TASK: c000000036e5f800 CPU: 6 COMMAND: "kworker/6:1" ... #4 [c00000003721fbc0] rtnl_lock at c000000000c940d8 #5 [c00000003721fbe0] tg3_reset_task at c008000002969358 [tg3] #6 [c00000003721fc60] process_one_work at c00000000019e5c4 ... PID: 296 TASK: c000000037a65800 CPU: 21 COMMAND: "kworker/21:1" ... #4 [c000000037247bc0] rtnl_lock at c000000000c940d8 #5 [c000000037247be0] tg3_reset_task at c008000002969358 [tg3] #6 [c000000037247c60] process_one_work at c00000000019e5c4 ... PID: 655 TASK: c000000036f49000 CPU: 16 COMMAND: "kworker/16:2" ...:1 #4 [c0000000373ebbc0] rtnl_lock at c000000000c940d8 #5 [c0000000373ebbe0] tg3_reset_task at c008000002969358 [tg3] #6 [c0000000373ebc60] process_one_work at c00000000019e5c4 ... Code inspection shows that both tg3_io_error_detected() and tg3_reset_task() attempt to acquire the RTNL lock at the beginning of their code blocks. If tg3_reset_task() should happen to execute between the times when tg3_io_error_deteced() acquires the RTNL lock and tg3_reset_task_cancel() is called, a deadlock will occur. Moving tg3_reset_task_cancel() call earlier within the code block, prior to acquiring RTNL, prevents this from happening, but also exposes another deadlock issue where tg3_reset_task() may execute AFTER tg3_io_error_detected() has executed: crash> foreach UN bt PID: 159 TASK: c0000000067d2000 CPU: 9 COMMAND: "eehd" ... #4 [c000000006867a60] rtnl_lock at c000000000c940d8 #5 [c000000006867a80] tg3_io_slot_reset at c0080000026c2ea8 [tg3] #6 [c000000006867b00] eeh_report_reset at c00000000004de88 ... PID: 363 TASK: c000000037564000 CPU: 6 COMMAND: "kworker/6:1" ... #3 [c000000036c1bb70] msleep at c000000000259e6c #4 [c000000036c1bba0] napi_disable at c000000000c6b848 #5 [c000000036c1bbe0] tg3_reset_task at c0080000026d942c [tg3] #6 [c000000036c1bc60] process_one_work at c00000000019e5c4 ... This issue can be avoided by aborting tg3_reset_task() if EEH error recovery is already in progress. Fixes: db84bf4 ("tg3: tg3_reset_task() needs to use rtnl_lock to synchronize") Signed-off-by: David Christensen <[email protected]> Reviewed-by: Pavan Chebbi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
GCC warns about the pattern sizeof(void*)/sizeof(void), as it looks like the abuse of a pattern to calculate the array size. This pattern appears in the unevaluated part of the ternary operator in _INTC_ARRAY if the parameter is NULL. The replacement uses an alternate approach to return 0 in case of NULL which does not generate the pattern sizeof(void*)/sizeof(void), but still emits the warning if _INTC_ARRAY is called with a nonarray parameter. This patch is required for successful compilation with -Werror enabled. The idea to use _Generic for type distinction is taken from Comment #7 in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108483 by Jakub Jelinek Signed-off-by: Michael Karcher <[email protected]> Acked-by: Randy Dunlap <[email protected]> # build-tested Link: https://lore.kernel.org/r/619fa552-c988-35e5-b1d7-fe256c46a272@mkarcher.dialup.fu-berlin.de Signed-off-by: John Paul Adrian Glaubitz <[email protected]>
cti_enable_hw() and cti_disable_hw() are called from an atomic context so shouldn't use runtime PM because it can result in a sleep when communicating with firmware. Since commit 3c66563 ("Revert "firmware: arm_scmi: Add clock management to the SCMI power domain""), this causes a hang on Juno when running the Perf Coresight tests or running this command: perf record -e cs_etm//u -- ls This was also missed until the revert commit because pm_runtime_put() was called with the wrong device until commit 692c9a4 ("coresight: cti: Correct the parameter for pm_runtime_put") With lock and scheduler debugging enabled the following is output: coresight cti_sys0: cti_enable_hw -- dev:cti_sys0 parent: 20020000.cti BUG: sleeping function called from invalid context at drivers/base/power/runtime.c:1151 in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 330, name: perf-exec preempt_count: 2, expected: 0 RCU nest depth: 0, expected: 0 INFO: lockdep is turned off. irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffff80000822b394>] copy_process+0xa0c/0x1948 softirqs last enabled at (0): [<ffff80000822b394>] copy_process+0xa0c/0x1948 softirqs last disabled at (0): [<0000000000000000>] 0x0 CPU: 3 PID: 330 Comm: perf-exec Not tainted 6.0.0-00053-g042116d99298 archlinux#7 Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Sep 13 2022 Call trace: dump_backtrace+0x134/0x140 show_stack+0x20/0x58 dump_stack_lvl+0x8c/0xb8 dump_stack+0x18/0x34 __might_resched+0x180/0x228 __might_sleep+0x50/0x88 __pm_runtime_resume+0xac/0xb0 cti_enable+0x44/0x120 coresight_control_assoc_ectdev+0xc0/0x150 coresight_enable_path+0xb4/0x288 etm_event_start+0x138/0x170 etm_event_add+0x48/0x70 event_sched_in.isra.122+0xb4/0x280 merge_sched_in+0x1fc/0x3d0 visit_groups_merge.constprop.137+0x16c/0x4b0 ctx_sched_in+0x114/0x1f0 perf_event_sched_in+0x60/0x90 ctx_resched+0x68/0xb0 perf_event_exec+0x138/0x508 begin_new_exec+0x52c/0xd40 load_elf_binary+0x6b8/0x17d0 bprm_execve+0x360/0x7f8 do_execveat_common.isra.47+0x218/0x238 __arm64_sys_execve+0x48/0x60 invoke_syscall+0x4c/0x110 el0_svc_common.constprop.4+0xfc/0x120 do_el0_svc+0x34/0xc0 el0_svc+0x40/0x98 el0t_64_sync_handler+0x98/0xc0 el0t_64_sync+0x170/0x174 Fix the issue by removing the runtime PM calls completely. They are not needed here because it must have already been done when building the path for a trace. Fixes: 835d722 ("coresight: cti: Initial CoreSight CTI Driver") Cc: stable <[email protected]> Reported-by: Aishwarya TCV <[email protected]> Reported-by: Cristian Marussi <[email protected]> Suggested-by: Suzuki K Poulose <[email protected]> Signed-off-by: James Clark <[email protected]> Reviewed-by: Mike Leach <[email protected]> Tested-by: Mike Leach <[email protected]> [ Fix build warnings ] Signed-off-by: Suzuki K Poulose <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
KASAN reported a use-after-free with ftrace ops [1]. It was found from vmcore that perf had registered two ops with the same content successively, both dynamic. After unregistering the second ops, a use-after-free occurred. In ftrace_shutdown(), when the second ops is unregistered, the FTRACE_UPDATE_CALLS command is not set because there is another enabled ops with the same content. Also, both ops are dynamic and the ftrace callback function is ftrace_ops_list_func, so the FTRACE_UPDATE_TRACE_FUNC command will not be set. Eventually the value of 'command' will be 0 and ftrace_shutdown() will skip the rcu synchronization. However, ftrace may be activated. When the ops is released, another CPU may be accessing the ops. Add the missing synchronization to fix this problem. [1] BUG: KASAN: use-after-free in __ftrace_ops_list_func kernel/trace/ftrace.c:7020 [inline] BUG: KASAN: use-after-free in ftrace_ops_list_func+0x2b0/0x31c kernel/trace/ftrace.c:7049 Read of size 8 at addr ffff56551965bbc8 by task syz-executor.2/14468 CPU: 1 PID: 14468 Comm: syz-executor.2 Not tainted 5.10.0 archlinux#7 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x40c arch/arm64/kernel/stacktrace.c:132 show_stack+0x30/0x40 arch/arm64/kernel/stacktrace.c:196 __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1b4/0x248 lib/dump_stack.c:118 print_address_description.constprop.0+0x28/0x48c mm/kasan/report.c:387 __kasan_report mm/kasan/report.c:547 [inline] kasan_report+0x118/0x210 mm/kasan/report.c:564 check_memory_region_inline mm/kasan/generic.c:187 [inline] __asan_load8+0x98/0xc0 mm/kasan/generic.c:253 __ftrace_ops_list_func kernel/trace/ftrace.c:7020 [inline] ftrace_ops_list_func+0x2b0/0x31c kernel/trace/ftrace.c:7049 ftrace_graph_call+0x0/0x4 __might_sleep+0x8/0x100 include/linux/perf_event.h:1170 __might_fault mm/memory.c:5183 [inline] __might_fault+0x58/0x70 mm/memory.c:5171 do_strncpy_from_user lib/strncpy_from_user.c:41 [inline] strncpy_from_user+0x1f4/0x4b0 lib/strncpy_from_user.c:139 getname_flags+0xb0/0x31c fs/namei.c:149 getname+0x2c/0x40 fs/namei.c:209 [...] Allocated by task 14445: kasan_save_stack+0x24/0x50 mm/kasan/common.c:48 kasan_set_track mm/kasan/common.c:56 [inline] __kasan_kmalloc mm/kasan/common.c:479 [inline] __kasan_kmalloc.constprop.0+0x110/0x13c mm/kasan/common.c:449 kasan_kmalloc+0xc/0x14 mm/kasan/common.c:493 kmem_cache_alloc_trace+0x440/0x924 mm/slub.c:2950 kmalloc include/linux/slab.h:563 [inline] kzalloc include/linux/slab.h:675 [inline] perf_event_alloc.part.0+0xb4/0x1350 kernel/events/core.c:11230 perf_event_alloc kernel/events/core.c:11733 [inline] __do_sys_perf_event_open kernel/events/core.c:11831 [inline] __se_sys_perf_event_open+0x550/0x15f4 kernel/events/core.c:11723 __arm64_sys_perf_event_open+0x6c/0x80 kernel/events/core.c:11723 [...] Freed by task 14445: kasan_save_stack+0x24/0x50 mm/kasan/common.c:48 kasan_set_track+0x24/0x34 mm/kasan/common.c:56 kasan_set_free_info+0x20/0x40 mm/kasan/generic.c:358 __kasan_slab_free.part.0+0x11c/0x1b0 mm/kasan/common.c:437 __kasan_slab_free mm/kasan/common.c:445 [inline] kasan_slab_free+0x2c/0x40 mm/kasan/common.c:446 slab_free_hook mm/slub.c:1569 [inline] slab_free_freelist_hook mm/slub.c:1608 [inline] slab_free mm/slub.c:3179 [inline] kfree+0x12c/0xc10 mm/slub.c:4176 perf_event_alloc.part.0+0xa0c/0x1350 kernel/events/core.c:11434 perf_event_alloc kernel/events/core.c:11733 [inline] __do_sys_perf_event_open kernel/events/core.c:11831 [inline] __se_sys_perf_event_open+0x550/0x15f4 kernel/events/core.c:11723 [...] Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Fixes: edb096e ("ftrace: Fix memleak when unregistering dynamic ops when tracing disabled") Cc: [email protected] Suggested-by: Steven Rostedt <[email protected]> Signed-off-by: Li Huafei <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
code path: ocfs2_ioctl_move_extents ocfs2_move_extents ocfs2_defrag_extent __ocfs2_move_extent + ocfs2_journal_access_di + ocfs2_split_extent //sub-paths call jbd2_journal_restart + ocfs2_journal_dirty //crash by jbs2 ASSERT crash stacks: PID: 11297 TASK: ffff974a676dcd00 CPU: 67 COMMAND: "defragfs.ocfs2" #0 [ffffb25d8dad3900] machine_kexec at ffffffff8386fe01 #1 [ffffb25d8dad3958] __crash_kexec at ffffffff8395959d #2 [ffffb25d8dad3a20] crash_kexec at ffffffff8395a45d #3 [ffffb25d8dad3a38] oops_end at ffffffff83836d3f #4 [ffffb25d8dad3a58] do_trap at ffffffff83833205 #5 [ffffb25d8dad3aa0] do_invalid_op at ffffffff83833aa6 #6 [ffffb25d8dad3ac0] invalid_op at ffffffff84200d18 [exception RIP: jbd2_journal_dirty_metadata+0x2ba] RIP: ffffffffc09ca54a RSP: ffffb25d8dad3b70 RFLAGS: 00010207 RAX: 0000000000000000 RBX: ffff9706eedc5248 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffff97337029ea28 RDI: ffff9706eedc5250 RBP: ffff9703c3520200 R8: 000000000f46b0b2 R9: 0000000000000000 R10: 0000000000000001 R11: 00000001000000fe R12: ffff97337029ea28 R13: 0000000000000000 R14: ffff9703de59bf60 R15: ffff9706eedc5250 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffffb25d8dad3ba8] ocfs2_journal_dirty at ffffffffc137fb95 [ocfs2] #8 [ffffb25d8dad3be8] __ocfs2_move_extent at ffffffffc139a950 [ocfs2] #9 [ffffb25d8dad3c80] ocfs2_defrag_extent at ffffffffc139b2d2 [ocfs2] Analysis This bug has the same root cause of 'commit 7f27ec9 ("ocfs2: call ocfs2_journal_access_di() before ocfs2_journal_dirty() in ocfs2_write_end_nolock()")'. For this bug, jbd2_journal_restart() is called by ocfs2_split_extent() during defragmenting. How to fix For ocfs2_split_extent() can handle journal operations totally by itself. Caller doesn't need to call journal access/dirty pair, and caller only needs to call journal start/stop pair. The fix method is to remove journal access/dirty from __ocfs2_move_extent(). The discussion for this patch: https://oss.oracle.com/pipermail/ocfs2-devel/2023-February/000647.html Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Heming Zhao <[email protected]> Reviewed-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Gang He <[email protected]> Cc: Jun Piao <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
When a system with E810 with existing VFs gets rebooted the following hang may be observed. Pid 1 is hung in iavf_remove(), part of a network driver: PID: 1 TASK: ffff965400e5a340 CPU: 24 COMMAND: "systemd-shutdow" #0 [ffffaad04005fa50] __schedule at ffffffff8b3239cb #1 [ffffaad04005fae8] schedule at ffffffff8b323e2d #2 [ffffaad04005fb00] schedule_hrtimeout_range_clock at ffffffff8b32cebc #3 [ffffaad04005fb80] usleep_range_state at ffffffff8b32c930 #4 [ffffaad04005fbb0] iavf_remove at ffffffffc12b9b4c [iavf] #5 [ffffaad04005fbf0] pci_device_remove at ffffffff8add7513 #6 [ffffaad04005fc10] device_release_driver_internal at ffffffff8af08baa #7 [ffffaad04005fc40] pci_stop_bus_device at ffffffff8adcc5fc #8 [ffffaad04005fc60] pci_stop_and_remove_bus_device at ffffffff8adcc81e #9 [ffffaad04005fc70] pci_iov_remove_virtfn at ffffffff8adf9429 torvalds#10 [ffffaad04005fca8] sriov_disable at ffffffff8adf98e4 torvalds#11 [ffffaad04005fcc8] ice_free_vfs at ffffffffc04bb2c8 [ice] torvalds#12 [ffffaad04005fd10] ice_remove at ffffffffc04778fe [ice] torvalds#13 [ffffaad04005fd38] ice_shutdown at ffffffffc0477946 [ice] torvalds#14 [ffffaad04005fd50] pci_device_shutdown at ffffffff8add58f1 torvalds#15 [ffffaad04005fd70] device_shutdown at ffffffff8af05386 torvalds#16 [ffffaad04005fd98] kernel_restart at ffffffff8a92a870 torvalds#17 [ffffaad04005fda8] __do_sys_reboot at ffffffff8a92abd6 torvalds#18 [ffffaad04005fee0] do_syscall_64 at ffffffff8b317159 torvalds#19 [ffffaad04005ff08] __context_tracking_enter at ffffffff8b31b6fc torvalds#20 [ffffaad04005ff18] syscall_exit_to_user_mode at ffffffff8b31b50d torvalds#21 [ffffaad04005ff28] do_syscall_64 at ffffffff8b317169 torvalds#22 [ffffaad04005ff50] entry_SYSCALL_64_after_hwframe at ffffffff8b40009b RIP: 00007f1baa5c13d7 RSP: 00007fffbcc55a98 RFLAGS: 00000202 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1baa5c13d7 RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead RBP: 00007fffbcc55ca0 R8: 0000000000000000 R9: 00007fffbcc54e90 R10: 00007fffbcc55050 R11: 0000000000000202 R12: 0000000000000005 R13: 0000000000000000 R14: 00007fffbcc55af0 R15: 0000000000000000 ORIG_RAX: 00000000000000a9 CS: 0033 SS: 002b During reboot all drivers PM shutdown callbacks are invoked. In iavf_shutdown() the adapter state is changed to __IAVF_REMOVE. In ice_shutdown() the call chain above is executed, which at some point calls iavf_remove(). However iavf_remove() expects the VF to be in one of the states __IAVF_RUNNING, __IAVF_DOWN or __IAVF_INIT_FAILED. If that's not the case it sleeps forever. So if iavf_shutdown() gets invoked before iavf_remove() the system will hang indefinitely because the adapter is already in state __IAVF_REMOVE. Fix this by returning from iavf_remove() if the state is __IAVF_REMOVE, as we already went through iavf_shutdown(). Fixes: 9745780 ("iavf: Add waiting so the port is initialized in remove") Fixes: a841733 ("iavf: Fix race condition between iavf_shutdown and iavf_remove") Reported-by: Marius Cornea <[email protected]> Signed-off-by: Stefan Assmann <[email protected]> Reviewed-by: Michal Kubiak <[email protected]> Tested-by: Rafal Romanowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
Dae R. Jeong reported a NULL deref in raw_get_next() [0]. It seems that the repro was running these sequences in parallel so that one thread was iterating on a socket that was being freed in another netns. unshare(0x40060200) r0 = syz_open_procfs(0x0, &(0x7f0000002080)='net/raw\x00') socket$inet_icmp_raw(0x2, 0x3, 0x1) pread64(r0, &(0x7f0000000000)=""/10, 0xa, 0x10000000007f) After commit 0daf07e ("raw: convert raw sockets to RCU"), we use RCU and hlist_nulls_for_each_entry() to iterate over SOCK_RAW sockets. However, we should use spinlock for slow paths to avoid the NULL deref. Also, SOCK_RAW does not use SLAB_TYPESAFE_BY_RCU, and the slab object is not reused during iteration in the grace period. In fact, the lockless readers do not check the nulls marker with get_nulls_value(). So, SOCK_RAW should use hlist instead of hlist_nulls. Instead of adding an unnecessary barrier by sk_nulls_for_each_rcu(), let's convert hlist_nulls to hlist and use sk_for_each_rcu() for fast paths and sk_for_each() and spinlock for /proc/net/raw. [0]: general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f] CPU: 2 PID: 20952 Comm: syz-executor.0 Not tainted 6.2.0-g048ec869bafd-dirty #7 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 RIP: 0010:read_pnet include/net/net_namespace.h:383 [inline] RIP: 0010:sock_net include/net/sock.h:649 [inline] RIP: 0010:raw_get_next net/ipv4/raw.c:974 [inline] RIP: 0010:raw_get_idx net/ipv4/raw.c:986 [inline] RIP: 0010:raw_seq_start+0x431/0x800 net/ipv4/raw.c:995 Code: ef e8 33 3d 94 f7 49 8b 6d 00 4c 89 ef e8 b7 65 5f f7 49 89 ed 49 83 c5 98 0f 84 9a 00 00 00 48 83 c5 c8 48 89 e8 48 c1 e8 03 <42> 80 3c 30 00 74 08 48 89 ef e8 00 3d 94 f7 4c 8b 7d 00 48 89 ef RSP: 0018:ffffc9001154f9b0 EFLAGS: 00010206 RAX: 0000000000000005 RBX: 1ffff1100302c8fd RCX: 0000000000000000 RDX: 0000000000000028 RSI: ffffc9001154f988 RDI: ffffc9000f77a338 RBP: 0000000000000029 R08: ffffffff8a50ffb4 R09: fffffbfff24b6bd9 R10: fffffbfff24b6bd9 R11: 0000000000000000 R12: ffff88801db73b78 R13: fffffffffffffff9 R14: dffffc0000000000 R15: 0000000000000030 FS: 00007f843ae8e700(0000) GS:ffff888063700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055bb9614b35f CR3: 000000003c672000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> seq_read_iter+0x4c6/0x10f0 fs/seq_file.c:225 seq_read+0x224/0x320 fs/seq_file.c:162 pde_read fs/proc/inode.c:316 [inline] proc_reg_read+0x23f/0x330 fs/proc/inode.c:328 vfs_read+0x31e/0xd30 fs/read_write.c:468 ksys_pread64 fs/read_write.c:665 [inline] __do_sys_pread64 fs/read_write.c:675 [inline] __se_sys_pread64 fs/read_write.c:672 [inline] __x64_sys_pread64+0x1e9/0x280 fs/read_write.c:672 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x4e/0xa0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x478d29 Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f843ae8dbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011 RAX: ffffffffffffffda RBX: 0000000000791408 RCX: 0000000000478d29 RDX: 000000000000000a RSI: 0000000020000000 RDI: 0000000000000003 RBP: 00000000f477909a R08: 0000000000000000 R09: 0000000000000000 R10: 000010000000007f R11: 0000000000000246 R12: 0000000000791740 R13: 0000000000791414 R14: 0000000000791408 R15: 00007ffc2eb48a50 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:read_pnet include/net/net_namespace.h:383 [inline] RIP: 0010:sock_net include/net/sock.h:649 [inline] RIP: 0010:raw_get_next net/ipv4/raw.c:974 [inline] RIP: 0010:raw_get_idx net/ipv4/raw.c:986 [inline] RIP: 0010:raw_seq_start+0x431/0x800 net/ipv4/raw.c:995 Code: ef e8 33 3d 94 f7 49 8b 6d 00 4c 89 ef e8 b7 65 5f f7 49 89 ed 49 83 c5 98 0f 84 9a 00 00 00 48 83 c5 c8 48 89 e8 48 c1 e8 03 <42> 80 3c 30 00 74 08 48 89 ef e8 00 3d 94 f7 4c 8b 7d 00 48 89 ef RSP: 0018:ffffc9001154f9b0 EFLAGS: 00010206 RAX: 0000000000000005 RBX: 1ffff1100302c8fd RCX: 0000000000000000 RDX: 0000000000000028 RSI: ffffc9001154f988 RDI: ffffc9000f77a338 RBP: 0000000000000029 R08: ffffffff8a50ffb4 R09: fffffbfff24b6bd9 R10: fffffbfff24b6bd9 R11: 0000000000000000 R12: ffff88801db73b78 R13: fffffffffffffff9 R14: dffffc0000000000 R15: 0000000000000030 FS: 00007f843ae8e700(0000) GS:ffff888063700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f92ff166000 CR3: 000000003c672000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Fixes: 0daf07e ("raw: convert raw sockets to RCU") Reported-by: syzbot <[email protected]> Reported-by: Dae R. Jeong <[email protected]> Link: https://lore.kernel.org/netdev/ZCA2mGV_cmq7lIfV@dragonet/ Signed-off-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
[ Upstream commit 2508708 ] GCC warns about the pattern sizeof(void*)/sizeof(void), as it looks like the abuse of a pattern to calculate the array size. This pattern appears in the unevaluated part of the ternary operator in _INTC_ARRAY if the parameter is NULL. The replacement uses an alternate approach to return 0 in case of NULL which does not generate the pattern sizeof(void*)/sizeof(void), but still emits the warning if _INTC_ARRAY is called with a nonarray parameter. This patch is required for successful compilation with -Werror enabled. The idea to use _Generic for type distinction is taken from Comment archlinux#7 in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108483 by Jakub Jelinek Signed-off-by: Michael Karcher <[email protected]> Acked-by: Randy Dunlap <[email protected]> # build-tested Link: https://lore.kernel.org/r/619fa552-c988-35e5-b1d7-fe256c46a272@mkarcher.dialup.fu-berlin.de Signed-off-by: John Paul Adrian Glaubitz <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 4e264be ] When a system with E810 with existing VFs gets rebooted the following hang may be observed. Pid 1 is hung in iavf_remove(), part of a network driver: PID: 1 TASK: ffff965400e5a340 CPU: 24 COMMAND: "systemd-shutdow" #0 [ffffaad04005fa50] __schedule at ffffffff8b3239cb archlinux#1 [ffffaad04005fae8] schedule at ffffffff8b323e2d archlinux#2 [ffffaad04005fb00] schedule_hrtimeout_range_clock at ffffffff8b32cebc archlinux#3 [ffffaad04005fb80] usleep_range_state at ffffffff8b32c930 archlinux#4 [ffffaad04005fbb0] iavf_remove at ffffffffc12b9b4c [iavf] archlinux#5 [ffffaad04005fbf0] pci_device_remove at ffffffff8add7513 archlinux#6 [ffffaad04005fc10] device_release_driver_internal at ffffffff8af08baa archlinux#7 [ffffaad04005fc40] pci_stop_bus_device at ffffffff8adcc5fc archlinux#8 [ffffaad04005fc60] pci_stop_and_remove_bus_device at ffffffff8adcc81e archlinux#9 [ffffaad04005fc70] pci_iov_remove_virtfn at ffffffff8adf9429 torvalds#10 [ffffaad04005fca8] sriov_disable at ffffffff8adf98e4 torvalds#11 [ffffaad04005fcc8] ice_free_vfs at ffffffffc04bb2c8 [ice] torvalds#12 [ffffaad04005fd10] ice_remove at ffffffffc04778fe [ice] torvalds#13 [ffffaad04005fd38] ice_shutdown at ffffffffc0477946 [ice] torvalds#14 [ffffaad04005fd50] pci_device_shutdown at ffffffff8add58f1 torvalds#15 [ffffaad04005fd70] device_shutdown at ffffffff8af05386 torvalds#16 [ffffaad04005fd98] kernel_restart at ffffffff8a92a870 torvalds#17 [ffffaad04005fda8] __do_sys_reboot at ffffffff8a92abd6 torvalds#18 [ffffaad04005fee0] do_syscall_64 at ffffffff8b317159 torvalds#19 [ffffaad04005ff08] __context_tracking_enter at ffffffff8b31b6fc torvalds#20 [ffffaad04005ff18] syscall_exit_to_user_mode at ffffffff8b31b50d torvalds#21 [ffffaad04005ff28] do_syscall_64 at ffffffff8b317169 torvalds#22 [ffffaad04005ff50] entry_SYSCALL_64_after_hwframe at ffffffff8b40009b RIP: 00007f1baa5c13d7 RSP: 00007fffbcc55a98 RFLAGS: 00000202 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1baa5c13d7 RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead RBP: 00007fffbcc55ca0 R8: 0000000000000000 R9: 00007fffbcc54e90 R10: 00007fffbcc55050 R11: 0000000000000202 R12: 0000000000000005 R13: 0000000000000000 R14: 00007fffbcc55af0 R15: 0000000000000000 ORIG_RAX: 00000000000000a9 CS: 0033 SS: 002b During reboot all drivers PM shutdown callbacks are invoked. In iavf_shutdown() the adapter state is changed to __IAVF_REMOVE. In ice_shutdown() the call chain above is executed, which at some point calls iavf_remove(). However iavf_remove() expects the VF to be in one of the states __IAVF_RUNNING, __IAVF_DOWN or __IAVF_INIT_FAILED. If that's not the case it sleeps forever. So if iavf_shutdown() gets invoked before iavf_remove() the system will hang indefinitely because the adapter is already in state __IAVF_REMOVE. Fix this by returning from iavf_remove() if the state is __IAVF_REMOVE, as we already went through iavf_shutdown(). Fixes: 9745780 ("iavf: Add waiting so the port is initialized in remove") Fixes: a841733 ("iavf: Fix race condition between iavf_shutdown and iavf_remove") Reported-by: Marius Cornea <[email protected]> Signed-off-by: Stefan Assmann <[email protected]> Reviewed-by: Michal Kubiak <[email protected]> Tested-by: Rafal Romanowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
With Eric's ref tracker, syzbot finally found a repro for use-after-free in tcp_write_timer_handler() by kernel TCP sockets. [0] If SMC creates a kernel socket in __smc_create(), the kernel socket is supposed to be freed in smc_clcsock_release() by calling sock_release() when we close() the parent SMC socket. However, at the end of smc_clcsock_release(), the kernel socket's sk_state might not be TCP_CLOSE. This means that we have not called inet_csk_destroy_sock() in __tcp_close() and have not stopped the TCP timers. The kernel socket's TCP timers can be fired later, so we need to hold a refcnt for net as we do for MPTCP subflows in mptcp_subflow_create_socket(). [0]: leaked reference. sk_alloc (./include/net/net_namespace.h:335 net/core/sock.c:2108) inet_create (net/ipv4/af_inet.c:319 net/ipv4/af_inet.c:244) __sock_create (net/socket.c:1546) smc_create (net/smc/af_smc.c:3269 net/smc/af_smc.c:3284) __sock_create (net/socket.c:1546) __sys_socket (net/socket.c:1634 net/socket.c:1618 net/socket.c:1661) __x64_sys_socket (net/socket.c:1672) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) ================================================================== BUG: KASAN: slab-use-after-free in tcp_write_timer_handler (net/ipv4/tcp_timer.c:378 net/ipv4/tcp_timer.c:624 net/ipv4/tcp_timer.c:594) Read of size 1 at addr ffff888052b65e0d by task syzrepro/18091 CPU: 0 PID: 18091 Comm: syzrepro Tainted: G W 6.3.0-rc4-01174-gb5d54eb5899a #7 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.amzn2022.0.1 04/01/2014 Call Trace: <IRQ> dump_stack_lvl (lib/dump_stack.c:107) print_report (mm/kasan/report.c:320 mm/kasan/report.c:430) kasan_report (mm/kasan/report.c:538) tcp_write_timer_handler (net/ipv4/tcp_timer.c:378 net/ipv4/tcp_timer.c:624 net/ipv4/tcp_timer.c:594) tcp_write_timer (./include/linux/spinlock.h:390 net/ipv4/tcp_timer.c:643) call_timer_fn (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/timer.h:127 kernel/time/timer.c:1701) __run_timers.part.0 (kernel/time/timer.c:1752 kernel/time/timer.c:2022) run_timer_softirq (kernel/time/timer.c:2037) __do_softirq (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/irq.h:142 kernel/softirq.c:572) __irq_exit_rcu (kernel/softirq.c:445 kernel/softirq.c:650) irq_exit_rcu (kernel/softirq.c:664) sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1107 (discriminator 14)) </IRQ> Fixes: ac71387 ("smc: establish new socket family") Reported-by: [email protected] Link: https://lore.kernel.org/netdev/[email protected]/ Signed-off-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Tony Lu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Balance as exclusive state is compatible with paused balance and device add, which makes some things more complicated. The assertion of valid states when starting from paused balance needs to take into account two more states, the combinations can be hit when there are several threads racing to start balance and device add. This won't typically happen when the commands are started from command line. Scenario 1: With exclusive_operation state == BTRFS_EXCLOP_NONE. Concurrently adding multiple devices to the same mount point and btrfs_exclop_finish executed finishes before assertion in btrfs_exclop_balance, exclusive_operation will changed to BTRFS_EXCLOP_NONE state which lead to assertion failed: fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE || fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD, in fs/btrfs/ioctl.c:456 Call Trace: <TASK> btrfs_exclop_balance+0x13c/0x310 ? memdup_user+0xab/0xc0 ? PTR_ERR+0x17/0x20 btrfs_ioctl_add_dev+0x2ee/0x320 btrfs_ioctl+0x9d5/0x10d0 ? btrfs_ioctl_encoded_write+0xb80/0xb80 __x64_sys_ioctl+0x197/0x210 do_syscall_64+0x3c/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd Scenario 2: With exclusive_operation state == BTRFS_EXCLOP_BALANCE_PAUSED. Concurrently adding multiple devices to the same mount point and btrfs_exclop_balance executed finish before the latter thread execute assertion in btrfs_exclop_balance, exclusive_operation will changed to BTRFS_EXCLOP_BALANCE_PAUSED state which lead to assertion failed: fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE || fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD || fs_info->exclusive_operation == BTRFS_EXCLOP_NONE, fs/btrfs/ioctl.c:458 Call Trace: <TASK> btrfs_exclop_balance+0x240/0x410 ? memdup_user+0xab/0xc0 ? PTR_ERR+0x17/0x20 btrfs_ioctl_add_dev+0x2ee/0x320 btrfs_ioctl+0x9d5/0x10d0 ? btrfs_ioctl_encoded_write+0xb80/0xb80 __x64_sys_ioctl+0x197/0x210 do_syscall_64+0x3c/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd An example of the failed assertion is below, which shows that the paused balance is also needed to be checked. root@syzkaller:/home/xsk# ./repro Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 [ 416.611428][ T7970] BTRFS info (device loop0): fs_info exclusive_operation: 0 Failed to add device /dev/vda, errno 14 [ 416.613973][ T7971] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.615456][ T7972] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.617528][ T7973] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.618359][ T7974] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.622589][ T7975] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.624034][ T7976] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.626420][ T7977] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.627643][ T7978] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.629006][ T7979] BTRFS info (device loop0): fs_info exclusive_operation: 3 [ 416.630298][ T7980] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 [ 416.632787][ T7981] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.634282][ T7982] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.636202][ T7983] BTRFS info (device loop0): fs_info exclusive_operation: 3 [ 416.637012][ T7984] BTRFS info (device loop0): fs_info exclusive_operation: 1 Failed to add device /dev/vda, errno 14 [ 416.637759][ T7984] assertion failed: fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE || fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD || fs_info->exclusive_operation == BTRFS_EXCLOP_NONE, in fs/btrfs/ioctl.c:458 [ 416.639845][ T7984] invalid opcode: 0000 [#1] PREEMPT SMP KASAN [ 416.640485][ T7984] CPU: 0 PID: 7984 Comm: repro Not tainted 6.2.0 #7 [ 416.641172][ T7984] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 [ 416.642090][ T7984] RIP: 0010:btrfs_assertfail+0x2c/0x2e [ 416.644423][ T7984] RSP: 0018:ffffc90003ea7e28 EFLAGS: 00010282 [ 416.645018][ T7984] RAX: 00000000000000cc RBX: 0000000000000000 RCX: 0000000000000000 [ 416.645763][ T7984] RDX: ffff88801d030000 RSI: ffffffff81637e7c RDI: fffff520007d4fb7 [ 416.646554][ T7984] RBP: ffffffff8a533de0 R08: 00000000000000cc R09: 0000000000000000 [ 416.647299][ T7984] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a533da0 [ 416.648041][ T7984] R13: 00000000000001ca R14: 000000005000940a R15: 0000000000000000 [ 416.648785][ T7984] FS: 00007fa2985d4640(0000) GS:ffff88802cc00000(0000) knlGS:0000000000000000 [ 416.649616][ T7984] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 416.650238][ T7984] CR2: 0000000000000000 CR3: 0000000018e5e000 CR4: 0000000000750ef0 [ 416.650980][ T7984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 416.651725][ T7984] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 416.652502][ T7984] PKRU: 55555554 [ 416.652888][ T7984] Call Trace: [ 416.653241][ T7984] <TASK> [ 416.653527][ T7984] btrfs_exclop_balance+0x240/0x410 [ 416.654036][ T7984] ? memdup_user+0xab/0xc0 [ 416.654465][ T7984] ? PTR_ERR+0x17/0x20 [ 416.654874][ T7984] btrfs_ioctl_add_dev+0x2ee/0x320 [ 416.655380][ T7984] btrfs_ioctl+0x9d5/0x10d0 [ 416.655822][ T7984] ? btrfs_ioctl_encoded_write+0xb80/0xb80 [ 416.656400][ T7984] __x64_sys_ioctl+0x197/0x210 [ 416.656874][ T7984] do_syscall_64+0x3c/0xb0 [ 416.657346][ T7984] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 416.657922][ T7984] RIP: 0033:0x4546af [ 416.660170][ T7984] RSP: 002b:00007fa2985d4150 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 416.660972][ T7984] RAX: ffffffffffffffda RBX: 00007fa2985d4640 RCX: 00000000004546af [ 416.661714][ T7984] RDX: 0000000000000000 RSI: 000000005000940a RDI: 0000000000000003 [ 416.662449][ T7984] RBP: 00007fa2985d41d0 R08: 0000000000000000 R09: 00007ffee37a4c4f [ 416.663195][ T7984] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fa2985d4640 [ 416.663951][ T7984] R13: 0000000000000009 R14: 000000000041b320 R15: 00007fa297dd4000 [ 416.664703][ T7984] </TASK> [ 416.665040][ T7984] Modules linked in: [ 416.665590][ T7984] ---[ end trace 0000000000000000 ]--- [ 416.666176][ T7984] RIP: 0010:btrfs_assertfail+0x2c/0x2e [ 416.668775][ T7984] RSP: 0018:ffffc90003ea7e28 EFLAGS: 00010282 [ 416.669425][ T7984] RAX: 00000000000000cc RBX: 0000000000000000 RCX: 0000000000000000 [ 416.670235][ T7984] RDX: ffff88801d030000 RSI: ffffffff81637e7c RDI: fffff520007d4fb7 [ 416.671050][ T7984] RBP: ffffffff8a533de0 R08: 00000000000000cc R09: 0000000000000000 [ 416.671867][ T7984] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a533da0 [ 416.672685][ T7984] R13: 00000000000001ca R14: 000000005000940a R15: 0000000000000000 [ 416.673501][ T7984] FS: 00007fa2985d4640(0000) GS:ffff88802cc00000(0000) knlGS:0000000000000000 [ 416.674425][ T7984] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 416.675114][ T7984] CR2: 0000000000000000 CR3: 0000000018e5e000 CR4: 0000000000750ef0 [ 416.675933][ T7984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 416.676760][ T7984] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Link: https://lore.kernel.org/linux-btrfs/[email protected]/ CC: [email protected] # 6.1+ Signed-off-by: xiaoshoukui <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
Sai Krishna says: ==================== octeontx2: Miscellaneous fixes This patchset includes following fixes. Patch #1 Fix for the race condition while updating APR table Patch #2 Fix end bit position in NPC scan config Patch #3 Fix depth of CAM, MEM table entries Patch #4 Fix in increase the size of DMAC filter flows Patch #5 Fix driver crash resulting from invalid interface type information retrieved from firmware Patch #6 Fix incorrect mask used while installing filters involving fragmented packets Patch #7 Fixes for NPC field hash extract w.r.t IPV6 hash reduction, IPV6 filed hash configuration. Patch #8 Fix for NPC hardware parser configuration destination address hash, IPV6 endianness issues. Patch #9 Fix for skipping mbox initialization for PFs disabled by firmware. Patch torvalds#10 Fix disabling packet I/O in case of mailbox timeout. Patch torvalds#11 Fix detaching LF resources in case of VF probe fail. ==================== Signed-off-by: David S. Miller <[email protected]>
In the function ieee80211_tx_dequeue() there is a particular locking sequence: begin: spin_lock(&local->queue_stop_reason_lock); q_stopped = local->queue_stop_reasons[q]; spin_unlock(&local->queue_stop_reason_lock); However small the chance (increased by ftracetest), an asynchronous interrupt can occur in between of spin_lock() and spin_unlock(), and the interrupt routine will attempt to lock the same &local->queue_stop_reason_lock again. This will cause a costly reset of the CPU and the wifi device or an altogether hang in the single CPU and single core scenario. The only remaining spin_lock(&local->queue_stop_reason_lock) that did not disable interrupts was patched, which should prevent any deadlocks on the same CPU/core and the same wifi device. This is the probable trace of the deadlock: kernel: ================================ kernel: WARNING: inconsistent lock state kernel: 6.3.0-rc6-mt-20230401-00001-gf86822a1170f #4 Tainted: G W kernel: -------------------------------- kernel: inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. kernel: kworker/5:0/25656 [HC0[0]:SC0[0]:HE1:SE1] takes: kernel: ffff9d6190779478 (&local->queue_stop_reason_lock){+.?.}-{2:2}, at: return_to_handler+0x0/0x40 kernel: {IN-SOFTIRQ-W} state was registered at: kernel: lock_acquire+0xc7/0x2d0 kernel: _raw_spin_lock+0x36/0x50 kernel: ieee80211_tx_dequeue+0xb4/0x1330 [mac80211] kernel: iwl_mvm_mac_itxq_xmit+0xae/0x210 [iwlmvm] kernel: iwl_mvm_mac_wake_tx_queue+0x2d/0xd0 [iwlmvm] kernel: ieee80211_queue_skb+0x450/0x730 [mac80211] kernel: __ieee80211_xmit_fast.constprop.66+0x834/0xa50 [mac80211] kernel: __ieee80211_subif_start_xmit+0x217/0x530 [mac80211] kernel: ieee80211_subif_start_xmit+0x60/0x580 [mac80211] kernel: dev_hard_start_xmit+0xb5/0x260 kernel: __dev_queue_xmit+0xdbe/0x1200 kernel: neigh_resolve_output+0x166/0x260 kernel: ip_finish_output2+0x216/0xb80 kernel: __ip_finish_output+0x2a4/0x4d0 kernel: ip_finish_output+0x2d/0xd0 kernel: ip_output+0x82/0x2b0 kernel: ip_local_out+0xec/0x110 kernel: igmpv3_sendpack+0x5c/0x90 kernel: igmp_ifc_timer_expire+0x26e/0x4e0 kernel: call_timer_fn+0xa5/0x230 kernel: run_timer_softirq+0x27f/0x550 kernel: __do_softirq+0xb4/0x3a4 kernel: irq_exit_rcu+0x9b/0xc0 kernel: sysvec_apic_timer_interrupt+0x80/0xa0 kernel: asm_sysvec_apic_timer_interrupt+0x1f/0x30 kernel: _raw_spin_unlock_irqrestore+0x3f/0x70 kernel: free_to_partial_list+0x3d6/0x590 kernel: __slab_free+0x1b7/0x310 kernel: kmem_cache_free+0x52d/0x550 kernel: putname+0x5d/0x70 kernel: do_sys_openat2+0x1d7/0x310 kernel: do_sys_open+0x51/0x80 kernel: __x64_sys_openat+0x24/0x30 kernel: do_syscall_64+0x5c/0x90 kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc kernel: irq event stamp: 5120729 kernel: hardirqs last enabled at (5120729): [<ffffffff9d149936>] trace_graph_return+0xd6/0x120 kernel: hardirqs last disabled at (5120728): [<ffffffff9d149950>] trace_graph_return+0xf0/0x120 kernel: softirqs last enabled at (5069900): [<ffffffff9cf65b60>] return_to_handler+0x0/0x40 kernel: softirqs last disabled at (5067555): [<ffffffff9cf65b60>] return_to_handler+0x0/0x40 kernel: other info that might help us debug this: kernel: Possible unsafe locking scenario: kernel: CPU0 kernel: ---- kernel: lock(&local->queue_stop_reason_lock); kernel: <Interrupt> kernel: lock(&local->queue_stop_reason_lock); kernel: *** DEADLOCK *** kernel: 8 locks held by kworker/5:0/25656: kernel: #0: ffff9d618009d138 ((wq_completion)events_freezable){+.+.}-{0:0}, at: process_one_work+0x1ca/0x530 kernel: #1: ffffb1ef4637fe68 ((work_completion)(&local->restart_work)){+.+.}-{0:0}, at: process_one_work+0x1ce/0x530 kernel: #2: ffffffff9f166548 (rtnl_mutex){+.+.}-{3:3}, at: return_to_handler+0x0/0x40 kernel: #3: ffff9d6190778728 (&rdev->wiphy.mtx){+.+.}-{3:3}, at: return_to_handler+0x0/0x40 kernel: #4: ffff9d619077b480 (&mvm->mutex){+.+.}-{3:3}, at: return_to_handler+0x0/0x40 kernel: #5: ffff9d61907bacd8 (&trans_pcie->mutex){+.+.}-{3:3}, at: return_to_handler+0x0/0x40 kernel: #6: ffffffff9ef9cda0 (rcu_read_lock){....}-{1:2}, at: iwl_mvm_queue_state_change+0x59/0x3a0 [iwlmvm] kernel: #7: ffffffff9ef9cda0 (rcu_read_lock){....}-{1:2}, at: iwl_mvm_mac_itxq_xmit+0x42/0x210 [iwlmvm] kernel: stack backtrace: kernel: CPU: 5 PID: 25656 Comm: kworker/5:0 Tainted: G W 6.3.0-rc6-mt-20230401-00001-gf86822a1170f #4 kernel: Hardware name: LENOVO 82H8/LNVNB161216, BIOS GGCN51WW 11/16/2022 kernel: Workqueue: events_freezable ieee80211_restart_work [mac80211] kernel: Call Trace: kernel: <TASK> kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: dump_stack_lvl+0x5f/0xa0 kernel: dump_stack+0x14/0x20 kernel: print_usage_bug.part.46+0x208/0x2a0 kernel: mark_lock.part.47+0x605/0x630 kernel: ? sched_clock+0xd/0x20 kernel: ? trace_clock_local+0x14/0x30 kernel: ? __rb_reserve_next+0x5f/0x490 kernel: ? _raw_spin_lock+0x1b/0x50 kernel: __lock_acquire+0x464/0x1990 kernel: ? mark_held_locks+0x4e/0x80 kernel: lock_acquire+0xc7/0x2d0 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: ? ftrace_return_to_handler+0x8b/0x100 kernel: ? preempt_count_add+0x4/0x70 kernel: _raw_spin_lock+0x36/0x50 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: ieee80211_tx_dequeue+0xb4/0x1330 [mac80211] kernel: ? prepare_ftrace_return+0xc5/0x190 kernel: ? ftrace_graph_func+0x16/0x20 kernel: ? 0xffffffffc02ab0b1 kernel: ? lock_acquire+0xc7/0x2d0 kernel: ? iwl_mvm_mac_itxq_xmit+0x42/0x210 [iwlmvm] kernel: ? ieee80211_tx_dequeue+0x9/0x1330 [mac80211] kernel: ? __rcu_read_lock+0x4/0x40 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_mvm_mac_itxq_xmit+0xae/0x210 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_mvm_queue_state_change+0x311/0x3a0 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_mvm_wake_sw_queue+0x17/0x20 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_txq_gen2_unmap+0x1c9/0x1f0 [iwlwifi] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_txq_gen2_free+0x55/0x130 [iwlwifi] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_txq_gen2_tx_free+0x63/0x80 [iwlwifi] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: _iwl_trans_pcie_gen2_stop_device+0x3f3/0x5b0 [iwlwifi] kernel: ? _iwl_trans_pcie_gen2_stop_device+0x9/0x5b0 [iwlwifi] kernel: ? mutex_lock_nested+0x4/0x30 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_trans_pcie_gen2_stop_device+0x5f/0x90 [iwlwifi] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_mvm_stop_device+0x78/0xd0 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: __iwl_mvm_mac_start+0x114/0x210 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_mvm_mac_start+0x76/0x150 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: drv_start+0x79/0x180 [mac80211] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: ieee80211_reconfig+0x1523/0x1ce0 [mac80211] kernel: ? synchronize_net+0x4/0x50 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: ieee80211_restart_work+0x108/0x170 [mac80211] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: process_one_work+0x250/0x530 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: worker_thread+0x48/0x3a0 kernel: ? __pfx_worker_thread+0x10/0x10 kernel: kthread+0x10f/0x140 kernel: ? __pfx_kthread+0x10/0x10 kernel: ret_from_fork+0x29/0x50 kernel: </TASK> Fixes: 4444bc2 ("wifi: mac80211: Proper mark iTXQs for resumption") Link: https://lore.kernel.org/all/[email protected]/ Reported-by: Mirsad Goran Todorovac <[email protected]> Cc: Gregory Greenman <[email protected]> Cc: Johannes Berg <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Cc: David S. Miller <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: Paolo Abeni <[email protected]> Cc: Leon Romanovsky <[email protected]> Cc: Alexander Wetzel <[email protected]> Signed-off-by: Mirsad Goran Todorovac <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Reviewed-by: tag, or it goes automatically? Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
The cited commit adds a compeletion to remove dependency on rtnl lock. But it causes a deadlock for multiple encapsulations: crash> bt ffff8aece8a64000 PID: 1514557 TASK: ffff8aece8a64000 CPU: 3 COMMAND: "tc" #0 [ffffa6d14183f368] __schedule at ffffffffb8ba7f45 #1 [ffffa6d14183f3f8] schedule at ffffffffb8ba8418 #2 [ffffa6d14183f418] schedule_preempt_disabled at ffffffffb8ba8898 #3 [ffffa6d14183f428] __mutex_lock at ffffffffb8baa7f8 #4 [ffffa6d14183f4d0] mutex_lock_nested at ffffffffb8baabeb #5 [ffffa6d14183f4e0] mlx5e_attach_encap at ffffffffc0f48c17 [mlx5_core] #6 [ffffa6d14183f628] mlx5e_tc_add_fdb_flow at ffffffffc0f39680 [mlx5_core] #7 [ffffa6d14183f688] __mlx5e_add_fdb_flow at ffffffffc0f3b636 [mlx5_core] #8 [ffffa6d14183f6f0] mlx5e_tc_add_flow at ffffffffc0f3bcdf [mlx5_core] #9 [ffffa6d14183f728] mlx5e_configure_flower at ffffffffc0f3c1d1 [mlx5_core] torvalds#10 [ffffa6d14183f790] mlx5e_rep_setup_tc_cls_flower at ffffffffc0f3d529 [mlx5_core] torvalds#11 [ffffa6d14183f7a0] mlx5e_rep_setup_tc_cb at ffffffffc0f3d714 [mlx5_core] torvalds#12 [ffffa6d14183f7b0] tc_setup_cb_add at ffffffffb8931bb8 torvalds#13 [ffffa6d14183f810] fl_hw_replace_filter at ffffffffc0dae901 [cls_flower] torvalds#14 [ffffa6d14183f8d8] fl_change at ffffffffc0db5c57 [cls_flower] torvalds#15 [ffffa6d14183f970] tc_new_tfilter at ffffffffb8936047 torvalds#16 [ffffa6d14183fac8] rtnetlink_rcv_msg at ffffffffb88c7c31 torvalds#17 [ffffa6d14183fb50] netlink_rcv_skb at ffffffffb8942853 torvalds#18 [ffffa6d14183fbc0] rtnetlink_rcv at ffffffffb88c1835 torvalds#19 [ffffa6d14183fbd0] netlink_unicast at ffffffffb8941f27 torvalds#20 [ffffa6d14183fc18] netlink_sendmsg at ffffffffb8942245 torvalds#21 [ffffa6d14183fc98] sock_sendmsg at ffffffffb887d482 torvalds#22 [ffffa6d14183fcb8] ____sys_sendmsg at ffffffffb887d81a torvalds#23 [ffffa6d14183fd38] ___sys_sendmsg at ffffffffb88806e2 torvalds#24 [ffffa6d14183fe90] __sys_sendmsg at ffffffffb88807a2 torvalds#25 [ffffa6d14183ff28] __x64_sys_sendmsg at ffffffffb888080f torvalds#26 [ffffa6d14183ff38] do_syscall_64 at ffffffffb8b9b6a8 torvalds#27 [ffffa6d14183ff50] entry_SYSCALL_64_after_hwframe at ffffffffb8c0007c crash> bt 0xffff8aeb07544000 PID: 1110766 TASK: ffff8aeb07544000 CPU: 0 COMMAND: "kworker/u20:9" #0 [ffffa6d14e6b7bd8] __schedule at ffffffffb8ba7f45 #1 [ffffa6d14e6b7c68] schedule at ffffffffb8ba8418 #2 [ffffa6d14e6b7c88] schedule_timeout at ffffffffb8baef88 #3 [ffffa6d14e6b7d10] wait_for_completion at ffffffffb8ba968b #4 [ffffa6d14e6b7d60] mlx5e_take_all_encap_flows at ffffffffc0f47ec4 [mlx5_core] #5 [ffffa6d14e6b7da0] mlx5e_rep_update_flows at ffffffffc0f3e734 [mlx5_core] #6 [ffffa6d14e6b7df8] mlx5e_rep_neigh_update at ffffffffc0f400bb [mlx5_core] #7 [ffffa6d14e6b7e50] process_one_work at ffffffffb80acc9c #8 [ffffa6d14e6b7ed0] worker_thread at ffffffffb80ad012 #9 [ffffa6d14e6b7f10] kthread at ffffffffb80b615d torvalds#10 [ffffa6d14e6b7f50] ret_from_fork at ffffffffb8001b2f After the first encap is attached, flow will be added to encap entry's flows list. If neigh update is running at this time, the following encaps of the flow can't hold the encap_tbl_lock and sleep. If neigh update thread is waiting for that flow's init_done, deadlock happens. Fix it by holding lock outside of the for loop. If neigh update is running, prevent encap flows from offloading. Since the lock is held outside of the for loop, concurrent creation of encap entries is not allowed. So remove unnecessary wait_for_completion call for res_ready. Fixes: 95435ad ("net/mlx5e: Only access fully initialized flows in neigh update") Signed-off-by: Chris Mi <[email protected]> Reviewed-by: Roi Dayan <[email protected]> Reviewed-by: Vlad Buslov <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
A remote DoS vulnerability of RPL Source Routing is assigned CVE-2023-2156. The Source Routing Header (SRH) has the following format: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Next Header | Hdr Ext Len | Routing Type | Segments Left | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | CmprI | CmprE | Pad | Reserved | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | . . . Addresses[1..n] . . . | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ The originator of an SRH places the first hop's IPv6 address in the IPv6 header's IPv6 Destination Address and the second hop's IPv6 address as the first address in Addresses[1..n]. The CmprI and CmprE fields indicate the number of prefix octets that are shared with the IPv6 Destination Address. When CmprI or CmprE is not 0, Addresses[1..n] are compressed as follows: 1..n-1 : (16 - CmprI) bytes n : (16 - CmprE) bytes Segments Left indicates the number of route segments remaining. When the value is not zero, the SRH is forwarded to the next hop. Its address is extracted from Addresses[n - Segment Left + 1] and swapped with IPv6 Destination Address. When Segment Left is greater than or equal to 2, the size of SRH is not changed because Addresses[1..n-1] are decompressed and recompressed with CmprI. OTOH, when Segment Left changes from 1 to 0, the new SRH could have a different size because Addresses[1..n-1] are decompressed with CmprI and recompressed with CmprE. Let's say CmprI is 15 and CmprE is 0. When we receive SRH with Segment Left >= 2, Addresses[1..n-1] have 1 byte for each, and Addresses[n] has 16 bytes. When Segment Left is 1, Addresses[1..n-1] is decompressed to 16 bytes and not recompressed. Finally, the new SRH will need more room in the header, and the size is (16 - 1) * (n - 1) bytes. Here the max value of n is 255 as Segment Left is u8, so in the worst case, we have to allocate 3825 bytes in the skb headroom. However, now we only allocate a small fixed buffer that is IPV6_RPL_SRH_WORST_SWAP_SIZE (16 + 7 bytes). If the decompressed size overflows the room, skb_push() hits BUG() below [0]. Instead of allocating the fixed buffer for every packet, let's allocate enough headroom only when we receive SRH with Segment Left 1. [0]: skbuff: skb_under_panic: text:ffffffff81c9f6e2 len:576 put:576 head:ffff8880070b5180 data:ffff8880070b4fb0 tail:0x70 end:0x140 dev:lo kernel BUG at net/core/skbuff.c:200! invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 154 Comm: python3 Not tainted 6.4.0-rc4-00190-gc308e9ec0047 #7 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:skb_panic (net/core/skbuff.c:200) Code: 4f 70 50 8b 87 bc 00 00 00 50 8b 87 b8 00 00 00 50 ff b7 c8 00 00 00 4c 8b 8f c0 00 00 00 48 c7 c7 80 6e 77 82 e8 ad 8b 60 ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 RSP: 0018:ffffc90000003da0 EFLAGS: 00000246 RAX: 0000000000000085 RBX: ffff8880058a6600 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff88807dc1c540 RDI: ffff88807dc1c540 RBP: ffffc90000003e48 R08: ffffffff82b392c8 R09: 00000000ffffdfff R10: ffffffff82a592e0 R11: ffffffff82b092e0 R12: ffff888005b1c800 R13: ffff8880070b51b8 R14: ffff888005b1ca18 R15: ffff8880070b5190 FS: 00007f4539f0b740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055670baf3000 CR3: 0000000005b0e000 CR4: 00000000007506f0 PKRU: 55555554 Call Trace: <IRQ> skb_push (net/core/skbuff.c:210) ipv6_rthdr_rcv (./include/linux/skbuff.h:2880 net/ipv6/exthdrs.c:634 net/ipv6/exthdrs.c:718) ip6_protocol_deliver_rcu (net/ipv6/ip6_input.c:437 (discriminator 5)) ip6_input_finish (./include/linux/rcupdate.h:805 net/ipv6/ip6_input.c:483) __netif_receive_skb_one_core (net/core/dev.c:5494) process_backlog (./include/linux/rcupdate.h:805 net/core/dev.c:5934) __napi_poll (net/core/dev.c:6496) net_rx_action (net/core/dev.c:6565 net/core/dev.c:6696) __do_softirq (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/irq.h:142 kernel/softirq.c:572) do_softirq (kernel/softirq.c:472 kernel/softirq.c:459) </IRQ> <TASK> __local_bh_enable_ip (kernel/softirq.c:396) __dev_queue_xmit (net/core/dev.c:4272) ip6_finish_output2 (./include/net/neighbour.h:544 net/ipv6/ip6_output.c:134) rawv6_sendmsg (./include/net/dst.h:458 ./include/linux/netfilter.h:303 net/ipv6/raw.c:656 net/ipv6/raw.c:914) sock_sendmsg (net/socket.c:724 net/socket.c:747) __sys_sendto (net/socket.c:2144) __x64_sys_sendto (net/socket.c:2156 net/socket.c:2152 net/socket.c:2152) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) RIP: 0033:0x7f453a138aea Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89 RSP: 002b:00007ffcc212a1c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00007ffcc212a288 RCX: 00007f453a138aea RDX: 0000000000000060 RSI: 00007f4539084c20 RDI: 0000000000000003 RBP: 00007f4538308e80 R08: 00007ffcc212a300 R09: 000000000000001c R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007f4539712d1b </TASK> Modules linked in: Fixes: 8610c7c ("net: ipv6: add support for rpl sr exthdr") Reported-by: Max VA Closes: https://www.interruptlabs.co.uk/articles/linux-ipv6-route-of-death Signed-off-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
Currently, the per cpu upcall counters are allocated after the vport is created and inserted into the system. This could lead to the datapath accessing the counters before they are allocated resulting in a kernel Oops. Here is an example: PID: 59693 TASK: ffff0005f4f51500 CPU: 0 COMMAND: "ovs-vswitchd" #0 [ffff80000a39b5b0] __switch_to at ffffb70f0629f2f4 #1 [ffff80000a39b5d0] __schedule at ffffb70f0629f5cc #2 [ffff80000a39b650] preempt_schedule_common at ffffb70f0629fa60 #3 [ffff80000a39b670] dynamic_might_resched at ffffb70f0629fb58 #4 [ffff80000a39b680] mutex_lock_killable at ffffb70f062a1388 #5 [ffff80000a39b6a0] pcpu_alloc at ffffb70f0594460c #6 [ffff80000a39b750] __alloc_percpu_gfp at ffffb70f05944e68 #7 [ffff80000a39b760] ovs_vport_cmd_new at ffffb70ee6961b90 [openvswitch] ... PID: 58682 TASK: ffff0005b2f0bf00 CPU: 0 COMMAND: "kworker/0:3" #0 [ffff80000a5d2f40] machine_kexec at ffffb70f056a0758 #1 [ffff80000a5d2f70] __crash_kexec at ffffb70f057e2994 #2 [ffff80000a5d3100] crash_kexec at ffffb70f057e2ad8 #3 [ffff80000a5d3120] die at ffffb70f0628234c #4 [ffff80000a5d31e0] die_kernel_fault at ffffb70f062828a8 #5 [ffff80000a5d3210] __do_kernel_fault at ffffb70f056a31f4 #6 [ffff80000a5d3240] do_bad_area at ffffb70f056a32a4 #7 [ffff80000a5d3260] do_translation_fault at ffffb70f062a9710 #8 [ffff80000a5d3270] do_mem_abort at ffffb70f056a2f74 #9 [ffff80000a5d32a0] el1_abort at ffffb70f06297dac torvalds#10 [ffff80000a5d32d0] el1h_64_sync_handler at ffffb70f06299b24 torvalds#11 [ffff80000a5d3410] el1h_64_sync at ffffb70f056812dc torvalds#12 [ffff80000a5d3430] ovs_dp_upcall at ffffb70ee6963c84 [openvswitch] torvalds#13 [ffff80000a5d3470] ovs_dp_process_packet at ffffb70ee6963fdc [openvswitch] torvalds#14 [ffff80000a5d34f0] ovs_vport_receive at ffffb70ee6972c78 [openvswitch] torvalds#15 [ffff80000a5d36f0] netdev_port_receive at ffffb70ee6973948 [openvswitch] torvalds#16 [ffff80000a5d3720] netdev_frame_hook at ffffb70ee6973a28 [openvswitch] torvalds#17 [ffff80000a5d3730] __netif_receive_skb_core.constprop.0 at ffffb70f06079f90 We moved the per cpu upcall counter allocation to the existing vport alloc and free functions to solve this. Fixes: 95637d9 ("net: openvswitch: release vport resources on failure") Fixes: 1933ea3 ("net: openvswitch: Add support to count upcall packets") Signed-off-by: Eelco Chaudron <[email protected]> Reviewed-by: Simon Horman <[email protected]> Acked-by: Aaron Conole <[email protected]> Signed-off-by: David S. Miller <[email protected]>
commit 675751b upstream. If something was written to the buffer just before destruction, it may be possible (maybe not in a real system, but it did happen in ARCH=um with time-travel) to destroy the ringbuffer before the IRQ work ran, leading this KASAN report (or a crash without KASAN): BUG: KASAN: slab-use-after-free in irq_work_run_list+0x11a/0x13a Read of size 8 at addr 000000006d640a48 by task swapper/0 CPU: 0 PID: 0 Comm: swapper Tainted: G W O 6.3.0-rc1 archlinux#7 Stack: 60c4f20 0c203d48 41b58ab3 60f224fc 600477fa 60f35687 60c4f20 601273dd 00000008 6101eb00 6101eab0 615be548 Call Trace: [<60047a58>] show_stack+0x25e/0x282 [<60c609e0>] dump_stack_lvl+0x96/0xfd [<60c50d4c>] print_report+0x1a7/0x5a8 [<603078d3>] kasan_report+0xc1/0xe9 [<60308950>] __asan_report_load8_noabort+0x1b/0x1d [<60232844>] irq_work_run_list+0x11a/0x13a [<602328b4>] irq_work_tick+0x24/0x34 [<6017f9dc>] update_process_times+0x162/0x196 [<6019f335>] tick_sched_handle+0x1a4/0x1c3 [<6019fd9e>] tick_sched_timer+0x79/0x10c [<601812b9>] __hrtimer_run_queues.constprop.0+0x425/0x695 [<60182913>] hrtimer_interrupt+0x16c/0x2c4 [<600486a3>] um_timer+0x164/0x183 [...] Allocated by task 411: save_stack_trace+0x99/0xb5 stack_trace_save+0x81/0x9b kasan_save_stack+0x2d/0x54 kasan_set_track+0x34/0x3e kasan_save_alloc_info+0x25/0x28 ____kasan_kmalloc+0x8b/0x97 __kasan_kmalloc+0x10/0x12 __kmalloc+0xb2/0xe8 load_elf_phdrs+0xee/0x182 [...] The buggy address belongs to the object at 000000006d640800 which belongs to the cache kmalloc-1k of size 1024 The buggy address is located 584 bytes inside of freed 1024-byte region [000000006d640800, 000000006d640c00) Add the appropriate irq_work_sync() so the work finishes before the buffers are destroyed. Prior to the commit in the Fixes tag below, there was only a single global IRQ work, so this issue didn't exist. Link: https://lore.kernel.org/linux-trace-kernel/20230427175920.a76159263122.I8295e405c44362a86c995e9c2c37e3e03810aa56@changeid Cc: [email protected] Cc: Masami Hiramatsu <[email protected]> Fixes: 1569345 ("tracing/ring-buffer: Move poll wake ups into ring buffer code") Signed-off-by: Johannes Berg <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
commit ac868bc upstream. Balance as exclusive state is compatible with paused balance and device add, which makes some things more complicated. The assertion of valid states when starting from paused balance needs to take into account two more states, the combinations can be hit when there are several threads racing to start balance and device add. This won't typically happen when the commands are started from command line. Scenario 1: With exclusive_operation state == BTRFS_EXCLOP_NONE. Concurrently adding multiple devices to the same mount point and btrfs_exclop_finish executed finishes before assertion in btrfs_exclop_balance, exclusive_operation will changed to BTRFS_EXCLOP_NONE state which lead to assertion failed: fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE || fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD, in fs/btrfs/ioctl.c:456 Call Trace: <TASK> btrfs_exclop_balance+0x13c/0x310 ? memdup_user+0xab/0xc0 ? PTR_ERR+0x17/0x20 btrfs_ioctl_add_dev+0x2ee/0x320 btrfs_ioctl+0x9d5/0x10d0 ? btrfs_ioctl_encoded_write+0xb80/0xb80 __x64_sys_ioctl+0x197/0x210 do_syscall_64+0x3c/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd Scenario 2: With exclusive_operation state == BTRFS_EXCLOP_BALANCE_PAUSED. Concurrently adding multiple devices to the same mount point and btrfs_exclop_balance executed finish before the latter thread execute assertion in btrfs_exclop_balance, exclusive_operation will changed to BTRFS_EXCLOP_BALANCE_PAUSED state which lead to assertion failed: fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE || fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD || fs_info->exclusive_operation == BTRFS_EXCLOP_NONE, fs/btrfs/ioctl.c:458 Call Trace: <TASK> btrfs_exclop_balance+0x240/0x410 ? memdup_user+0xab/0xc0 ? PTR_ERR+0x17/0x20 btrfs_ioctl_add_dev+0x2ee/0x320 btrfs_ioctl+0x9d5/0x10d0 ? btrfs_ioctl_encoded_write+0xb80/0xb80 __x64_sys_ioctl+0x197/0x210 do_syscall_64+0x3c/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd An example of the failed assertion is below, which shows that the paused balance is also needed to be checked. root@syzkaller:/home/xsk# ./repro Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 [ 416.611428][ T7970] BTRFS info (device loop0): fs_info exclusive_operation: 0 Failed to add device /dev/vda, errno 14 [ 416.613973][ T7971] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.615456][ T7972] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.617528][ T7973] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.618359][ T7974] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.622589][ T7975] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.624034][ T7976] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.626420][ T7977] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.627643][ T7978] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.629006][ T7979] BTRFS info (device loop0): fs_info exclusive_operation: 3 [ 416.630298][ T7980] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 Failed to add device /dev/vda, errno 14 [ 416.632787][ T7981] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.634282][ T7982] BTRFS info (device loop0): fs_info exclusive_operation: 3 Failed to add device /dev/vda, errno 14 [ 416.636202][ T7983] BTRFS info (device loop0): fs_info exclusive_operation: 3 [ 416.637012][ T7984] BTRFS info (device loop0): fs_info exclusive_operation: 1 Failed to add device /dev/vda, errno 14 [ 416.637759][ T7984] assertion failed: fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE || fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD || fs_info->exclusive_operation == BTRFS_EXCLOP_NONE, in fs/btrfs/ioctl.c:458 [ 416.639845][ T7984] invalid opcode: 0000 [archlinux#1] PREEMPT SMP KASAN [ 416.640485][ T7984] CPU: 0 PID: 7984 Comm: repro Not tainted 6.2.0 archlinux#7 [ 416.641172][ T7984] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 [ 416.642090][ T7984] RIP: 0010:btrfs_assertfail+0x2c/0x2e [ 416.644423][ T7984] RSP: 0018:ffffc90003ea7e28 EFLAGS: 00010282 [ 416.645018][ T7984] RAX: 00000000000000cc RBX: 0000000000000000 RCX: 0000000000000000 [ 416.645763][ T7984] RDX: ffff88801d030000 RSI: ffffffff81637e7c RDI: fffff520007d4fb7 [ 416.646554][ T7984] RBP: ffffffff8a533de0 R08: 00000000000000cc R09: 0000000000000000 [ 416.647299][ T7984] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a533da0 [ 416.648041][ T7984] R13: 00000000000001ca R14: 000000005000940a R15: 0000000000000000 [ 416.648785][ T7984] FS: 00007fa2985d4640(0000) GS:ffff88802cc00000(0000) knlGS:0000000000000000 [ 416.649616][ T7984] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 416.650238][ T7984] CR2: 0000000000000000 CR3: 0000000018e5e000 CR4: 0000000000750ef0 [ 416.650980][ T7984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 416.651725][ T7984] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 416.652502][ T7984] PKRU: 55555554 [ 416.652888][ T7984] Call Trace: [ 416.653241][ T7984] <TASK> [ 416.653527][ T7984] btrfs_exclop_balance+0x240/0x410 [ 416.654036][ T7984] ? memdup_user+0xab/0xc0 [ 416.654465][ T7984] ? PTR_ERR+0x17/0x20 [ 416.654874][ T7984] btrfs_ioctl_add_dev+0x2ee/0x320 [ 416.655380][ T7984] btrfs_ioctl+0x9d5/0x10d0 [ 416.655822][ T7984] ? btrfs_ioctl_encoded_write+0xb80/0xb80 [ 416.656400][ T7984] __x64_sys_ioctl+0x197/0x210 [ 416.656874][ T7984] do_syscall_64+0x3c/0xb0 [ 416.657346][ T7984] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 416.657922][ T7984] RIP: 0033:0x4546af [ 416.660170][ T7984] RSP: 002b:00007fa2985d4150 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 416.660972][ T7984] RAX: ffffffffffffffda RBX: 00007fa2985d4640 RCX: 00000000004546af [ 416.661714][ T7984] RDX: 0000000000000000 RSI: 000000005000940a RDI: 0000000000000003 [ 416.662449][ T7984] RBP: 00007fa2985d41d0 R08: 0000000000000000 R09: 00007ffee37a4c4f [ 416.663195][ T7984] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fa2985d4640 [ 416.663951][ T7984] R13: 0000000000000009 R14: 000000000041b320 R15: 00007fa297dd4000 [ 416.664703][ T7984] </TASK> [ 416.665040][ T7984] Modules linked in: [ 416.665590][ T7984] ---[ end trace 0000000000000000 ]--- [ 416.666176][ T7984] RIP: 0010:btrfs_assertfail+0x2c/0x2e [ 416.668775][ T7984] RSP: 0018:ffffc90003ea7e28 EFLAGS: 00010282 [ 416.669425][ T7984] RAX: 00000000000000cc RBX: 0000000000000000 RCX: 0000000000000000 [ 416.670235][ T7984] RDX: ffff88801d030000 RSI: ffffffff81637e7c RDI: fffff520007d4fb7 [ 416.671050][ T7984] RBP: ffffffff8a533de0 R08: 00000000000000cc R09: 0000000000000000 [ 416.671867][ T7984] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a533da0 [ 416.672685][ T7984] R13: 00000000000001ca R14: 000000005000940a R15: 0000000000000000 [ 416.673501][ T7984] FS: 00007fa2985d4640(0000) GS:ffff88802cc00000(0000) knlGS:0000000000000000 [ 416.674425][ T7984] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 416.675114][ T7984] CR2: 0000000000000000 CR3: 0000000018e5e000 CR4: 0000000000750ef0 [ 416.675933][ T7984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 416.676760][ T7984] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Link: https://lore.kernel.org/linux-btrfs/[email protected]/ CC: [email protected] # 6.1+ Signed-off-by: xiaoshoukui <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
[ Upstream commit 05bb016 ] ACPICA commit 770653e3ba67c30a629ca7d12e352d83c2541b1e Before this change we see the following UBSAN stack trace in Fuchsia: #0 0x000021e4213b3302 in acpi_ds_init_aml_walk(struct acpi_walk_state*, union acpi_parse_object*, struct acpi_namespace_node*, u8*, u32, struct acpi_evaluate_info*, u8) ../../third_party/acpica/source/components/dispatcher/dswstate.c:682 <platform-bus-x86.so>+0x233302 archlinux#1.2 0x000020d0f660777f in ubsan_get_stack_trace() compiler-rt/lib/ubsan/ubsan_diag.cpp:41 <libclang_rt.asan.so>+0x3d77f archlinux#1.1 0x000020d0f660777f in maybe_print_stack_trace() compiler-rt/lib/ubsan/ubsan_diag.cpp:51 <libclang_rt.asan.so>+0x3d77f archlinux#1 0x000020d0f660777f in ~scoped_report() compiler-rt/lib/ubsan/ubsan_diag.cpp:387 <libclang_rt.asan.so>+0x3d77f archlinux#2 0x000020d0f660b96d in handlepointer_overflow_impl() compiler-rt/lib/ubsan/ubsan_handlers.cpp:809 <libclang_rt.asan.so>+0x4196d archlinux#3 0x000020d0f660b50d in compiler-rt/lib/ubsan/ubsan_handlers.cpp:815 <libclang_rt.asan.so>+0x4150d archlinux#4 0x000021e4213b3302 in acpi_ds_init_aml_walk(struct acpi_walk_state*, union acpi_parse_object*, struct acpi_namespace_node*, u8*, u32, struct acpi_evaluate_info*, u8) ../../third_party/acpica/source/components/dispatcher/dswstate.c:682 <platform-bus-x86.so>+0x233302 archlinux#5 0x000021e4213e2369 in acpi_ds_call_control_method(struct acpi_thread_state*, struct acpi_walk_state*, union acpi_parse_object*) ../../third_party/acpica/source/components/dispatcher/dsmethod.c:605 <platform-bus-x86.so>+0x262369 archlinux#6 0x000021e421437fac in acpi_ps_parse_aml(struct acpi_walk_state*) ../../third_party/acpica/source/components/parser/psparse.c:550 <platform-bus-x86.so>+0x2b7fac archlinux#7 0x000021e4214464d2 in acpi_ps_execute_method(struct acpi_evaluate_info*) ../../third_party/acpica/source/components/parser/psxface.c:244 <platform-bus-x86.so>+0x2c64d2 archlinux#8 0x000021e4213aa052 in acpi_ns_evaluate(struct acpi_evaluate_info*) ../../third_party/acpica/source/components/namespace/nseval.c:250 <platform-bus-x86.so>+0x22a052 archlinux#9 0x000021e421413dd8 in acpi_ns_init_one_device(acpi_handle, u32, void*, void**) ../../third_party/acpica/source/components/namespace/nsinit.c:735 <platform-bus-x86.so>+0x293dd8 torvalds#10 0x000021e421429e98 in acpi_ns_walk_namespace(acpi_object_type, acpi_handle, u32, u32, acpi_walk_callback, acpi_walk_callback, void*, void**) ../../third_party/acpica/source/components/namespace/nswalk.c:298 <platform-bus-x86.so>+0x2a9e98 torvalds#11 0x000021e4214131ac in acpi_ns_initialize_devices(u32) ../../third_party/acpica/source/components/namespace/nsinit.c:268 <platform-bus-x86.so>+0x2931ac torvalds#12 0x000021e42147c40d in acpi_initialize_objects(u32) ../../third_party/acpica/source/components/utilities/utxfinit.c:304 <platform-bus-x86.so>+0x2fc40d torvalds#13 0x000021e42126d603 in acpi::acpi_impl::initialize_acpi(acpi::acpi_impl*) ../../src/devices/board/lib/acpi/acpi-impl.cc:224 <platform-bus-x86.so>+0xed603 Add a simple check that avoids incrementing a pointer by zero, but otherwise behaves as before. Note that our findings are against ACPICA 20221020, but the same code exists on master. Link: acpica/acpica@770653e3 Signed-off-by: Bob Moore <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit ef6e199 ] In the function ieee80211_tx_dequeue() there is a particular locking sequence: begin: spin_lock(&local->queue_stop_reason_lock); q_stopped = local->queue_stop_reasons[q]; spin_unlock(&local->queue_stop_reason_lock); However small the chance (increased by ftracetest), an asynchronous interrupt can occur in between of spin_lock() and spin_unlock(), and the interrupt routine will attempt to lock the same &local->queue_stop_reason_lock again. This will cause a costly reset of the CPU and the wifi device or an altogether hang in the single CPU and single core scenario. The only remaining spin_lock(&local->queue_stop_reason_lock) that did not disable interrupts was patched, which should prevent any deadlocks on the same CPU/core and the same wifi device. This is the probable trace of the deadlock: kernel: ================================ kernel: WARNING: inconsistent lock state kernel: 6.3.0-rc6-mt-20230401-00001-gf86822a1170f archlinux#4 Tainted: G W kernel: -------------------------------- kernel: inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. kernel: kworker/5:0/25656 [HC0[0]:SC0[0]:HE1:SE1] takes: kernel: ffff9d6190779478 (&local->queue_stop_reason_lock){+.?.}-{2:2}, at: return_to_handler+0x0/0x40 kernel: {IN-SOFTIRQ-W} state was registered at: kernel: lock_acquire+0xc7/0x2d0 kernel: _raw_spin_lock+0x36/0x50 kernel: ieee80211_tx_dequeue+0xb4/0x1330 [mac80211] kernel: iwl_mvm_mac_itxq_xmit+0xae/0x210 [iwlmvm] kernel: iwl_mvm_mac_wake_tx_queue+0x2d/0xd0 [iwlmvm] kernel: ieee80211_queue_skb+0x450/0x730 [mac80211] kernel: __ieee80211_xmit_fast.constprop.66+0x834/0xa50 [mac80211] kernel: __ieee80211_subif_start_xmit+0x217/0x530 [mac80211] kernel: ieee80211_subif_start_xmit+0x60/0x580 [mac80211] kernel: dev_hard_start_xmit+0xb5/0x260 kernel: __dev_queue_xmit+0xdbe/0x1200 kernel: neigh_resolve_output+0x166/0x260 kernel: ip_finish_output2+0x216/0xb80 kernel: __ip_finish_output+0x2a4/0x4d0 kernel: ip_finish_output+0x2d/0xd0 kernel: ip_output+0x82/0x2b0 kernel: ip_local_out+0xec/0x110 kernel: igmpv3_sendpack+0x5c/0x90 kernel: igmp_ifc_timer_expire+0x26e/0x4e0 kernel: call_timer_fn+0xa5/0x230 kernel: run_timer_softirq+0x27f/0x550 kernel: __do_softirq+0xb4/0x3a4 kernel: irq_exit_rcu+0x9b/0xc0 kernel: sysvec_apic_timer_interrupt+0x80/0xa0 kernel: asm_sysvec_apic_timer_interrupt+0x1f/0x30 kernel: _raw_spin_unlock_irqrestore+0x3f/0x70 kernel: free_to_partial_list+0x3d6/0x590 kernel: __slab_free+0x1b7/0x310 kernel: kmem_cache_free+0x52d/0x550 kernel: putname+0x5d/0x70 kernel: do_sys_openat2+0x1d7/0x310 kernel: do_sys_open+0x51/0x80 kernel: __x64_sys_openat+0x24/0x30 kernel: do_syscall_64+0x5c/0x90 kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc kernel: irq event stamp: 5120729 kernel: hardirqs last enabled at (5120729): [<ffffffff9d149936>] trace_graph_return+0xd6/0x120 kernel: hardirqs last disabled at (5120728): [<ffffffff9d149950>] trace_graph_return+0xf0/0x120 kernel: softirqs last enabled at (5069900): [<ffffffff9cf65b60>] return_to_handler+0x0/0x40 kernel: softirqs last disabled at (5067555): [<ffffffff9cf65b60>] return_to_handler+0x0/0x40 kernel: other info that might help us debug this: kernel: Possible unsafe locking scenario: kernel: CPU0 kernel: ---- kernel: lock(&local->queue_stop_reason_lock); kernel: <Interrupt> kernel: lock(&local->queue_stop_reason_lock); kernel: *** DEADLOCK *** kernel: 8 locks held by kworker/5:0/25656: kernel: #0: ffff9d618009d138 ((wq_completion)events_freezable){+.+.}-{0:0}, at: process_one_work+0x1ca/0x530 kernel: archlinux#1: ffffb1ef4637fe68 ((work_completion)(&local->restart_work)){+.+.}-{0:0}, at: process_one_work+0x1ce/0x530 kernel: archlinux#2: ffffffff9f166548 (rtnl_mutex){+.+.}-{3:3}, at: return_to_handler+0x0/0x40 kernel: archlinux#3: ffff9d6190778728 (&rdev->wiphy.mtx){+.+.}-{3:3}, at: return_to_handler+0x0/0x40 kernel: archlinux#4: ffff9d619077b480 (&mvm->mutex){+.+.}-{3:3}, at: return_to_handler+0x0/0x40 kernel: archlinux#5: ffff9d61907bacd8 (&trans_pcie->mutex){+.+.}-{3:3}, at: return_to_handler+0x0/0x40 kernel: archlinux#6: ffffffff9ef9cda0 (rcu_read_lock){....}-{1:2}, at: iwl_mvm_queue_state_change+0x59/0x3a0 [iwlmvm] kernel: archlinux#7: ffffffff9ef9cda0 (rcu_read_lock){....}-{1:2}, at: iwl_mvm_mac_itxq_xmit+0x42/0x210 [iwlmvm] kernel: stack backtrace: kernel: CPU: 5 PID: 25656 Comm: kworker/5:0 Tainted: G W 6.3.0-rc6-mt-20230401-00001-gf86822a1170f archlinux#4 kernel: Hardware name: LENOVO 82H8/LNVNB161216, BIOS GGCN51WW 11/16/2022 kernel: Workqueue: events_freezable ieee80211_restart_work [mac80211] kernel: Call Trace: kernel: <TASK> kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: dump_stack_lvl+0x5f/0xa0 kernel: dump_stack+0x14/0x20 kernel: print_usage_bug.part.46+0x208/0x2a0 kernel: mark_lock.part.47+0x605/0x630 kernel: ? sched_clock+0xd/0x20 kernel: ? trace_clock_local+0x14/0x30 kernel: ? __rb_reserve_next+0x5f/0x490 kernel: ? _raw_spin_lock+0x1b/0x50 kernel: __lock_acquire+0x464/0x1990 kernel: ? mark_held_locks+0x4e/0x80 kernel: lock_acquire+0xc7/0x2d0 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: ? ftrace_return_to_handler+0x8b/0x100 kernel: ? preempt_count_add+0x4/0x70 kernel: _raw_spin_lock+0x36/0x50 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: ieee80211_tx_dequeue+0xb4/0x1330 [mac80211] kernel: ? prepare_ftrace_return+0xc5/0x190 kernel: ? ftrace_graph_func+0x16/0x20 kernel: ? 0xffffffffc02ab0b1 kernel: ? lock_acquire+0xc7/0x2d0 kernel: ? iwl_mvm_mac_itxq_xmit+0x42/0x210 [iwlmvm] kernel: ? ieee80211_tx_dequeue+0x9/0x1330 [mac80211] kernel: ? __rcu_read_lock+0x4/0x40 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_mvm_mac_itxq_xmit+0xae/0x210 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_mvm_queue_state_change+0x311/0x3a0 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_mvm_wake_sw_queue+0x17/0x20 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_txq_gen2_unmap+0x1c9/0x1f0 [iwlwifi] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_txq_gen2_free+0x55/0x130 [iwlwifi] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_txq_gen2_tx_free+0x63/0x80 [iwlwifi] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: _iwl_trans_pcie_gen2_stop_device+0x3f3/0x5b0 [iwlwifi] kernel: ? _iwl_trans_pcie_gen2_stop_device+0x9/0x5b0 [iwlwifi] kernel: ? mutex_lock_nested+0x4/0x30 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_trans_pcie_gen2_stop_device+0x5f/0x90 [iwlwifi] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_mvm_stop_device+0x78/0xd0 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: __iwl_mvm_mac_start+0x114/0x210 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: iwl_mvm_mac_start+0x76/0x150 [iwlmvm] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: drv_start+0x79/0x180 [mac80211] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: ieee80211_reconfig+0x1523/0x1ce0 [mac80211] kernel: ? synchronize_net+0x4/0x50 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: ieee80211_restart_work+0x108/0x170 [mac80211] kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: process_one_work+0x250/0x530 kernel: ? ftrace_regs_caller_end+0x66/0x66 kernel: worker_thread+0x48/0x3a0 kernel: ? __pfx_worker_thread+0x10/0x10 kernel: kthread+0x10f/0x140 kernel: ? __pfx_kthread+0x10/0x10 kernel: ret_from_fork+0x29/0x50 kernel: </TASK> Fixes: 4444bc2 ("wifi: mac80211: Proper mark iTXQs for resumption") Link: https://lore.kernel.org/all/[email protected]/ Reported-by: Mirsad Goran Todorovac <[email protected]> Cc: Gregory Greenman <[email protected]> Cc: Johannes Berg <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Cc: David S. Miller <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: Paolo Abeni <[email protected]> Cc: Leon Romanovsky <[email protected]> Cc: Alexander Wetzel <[email protected]> Signed-off-by: Mirsad Goran Todorovac <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Reviewed-by: tag, or it goes automatically? Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 37c3b9f ] The cited commit adds a compeletion to remove dependency on rtnl lock. But it causes a deadlock for multiple encapsulations: crash> bt ffff8aece8a64000 PID: 1514557 TASK: ffff8aece8a64000 CPU: 3 COMMAND: "tc" #0 [ffffa6d14183f368] __schedule at ffffffffb8ba7f45 archlinux#1 [ffffa6d14183f3f8] schedule at ffffffffb8ba8418 archlinux#2 [ffffa6d14183f418] schedule_preempt_disabled at ffffffffb8ba8898 archlinux#3 [ffffa6d14183f428] __mutex_lock at ffffffffb8baa7f8 archlinux#4 [ffffa6d14183f4d0] mutex_lock_nested at ffffffffb8baabeb archlinux#5 [ffffa6d14183f4e0] mlx5e_attach_encap at ffffffffc0f48c17 [mlx5_core] archlinux#6 [ffffa6d14183f628] mlx5e_tc_add_fdb_flow at ffffffffc0f39680 [mlx5_core] archlinux#7 [ffffa6d14183f688] __mlx5e_add_fdb_flow at ffffffffc0f3b636 [mlx5_core] archlinux#8 [ffffa6d14183f6f0] mlx5e_tc_add_flow at ffffffffc0f3bcdf [mlx5_core] archlinux#9 [ffffa6d14183f728] mlx5e_configure_flower at ffffffffc0f3c1d1 [mlx5_core] torvalds#10 [ffffa6d14183f790] mlx5e_rep_setup_tc_cls_flower at ffffffffc0f3d529 [mlx5_core] torvalds#11 [ffffa6d14183f7a0] mlx5e_rep_setup_tc_cb at ffffffffc0f3d714 [mlx5_core] torvalds#12 [ffffa6d14183f7b0] tc_setup_cb_add at ffffffffb8931bb8 torvalds#13 [ffffa6d14183f810] fl_hw_replace_filter at ffffffffc0dae901 [cls_flower] torvalds#14 [ffffa6d14183f8d8] fl_change at ffffffffc0db5c57 [cls_flower] torvalds#15 [ffffa6d14183f970] tc_new_tfilter at ffffffffb8936047 torvalds#16 [ffffa6d14183fac8] rtnetlink_rcv_msg at ffffffffb88c7c31 torvalds#17 [ffffa6d14183fb50] netlink_rcv_skb at ffffffffb8942853 torvalds#18 [ffffa6d14183fbc0] rtnetlink_rcv at ffffffffb88c1835 torvalds#19 [ffffa6d14183fbd0] netlink_unicast at ffffffffb8941f27 torvalds#20 [ffffa6d14183fc18] netlink_sendmsg at ffffffffb8942245 torvalds#21 [ffffa6d14183fc98] sock_sendmsg at ffffffffb887d482 torvalds#22 [ffffa6d14183fcb8] ____sys_sendmsg at ffffffffb887d81a torvalds#23 [ffffa6d14183fd38] ___sys_sendmsg at ffffffffb88806e2 torvalds#24 [ffffa6d14183fe90] __sys_sendmsg at ffffffffb88807a2 torvalds#25 [ffffa6d14183ff28] __x64_sys_sendmsg at ffffffffb888080f torvalds#26 [ffffa6d14183ff38] do_syscall_64 at ffffffffb8b9b6a8 torvalds#27 [ffffa6d14183ff50] entry_SYSCALL_64_after_hwframe at ffffffffb8c0007c crash> bt 0xffff8aeb07544000 PID: 1110766 TASK: ffff8aeb07544000 CPU: 0 COMMAND: "kworker/u20:9" #0 [ffffa6d14e6b7bd8] __schedule at ffffffffb8ba7f45 archlinux#1 [ffffa6d14e6b7c68] schedule at ffffffffb8ba8418 archlinux#2 [ffffa6d14e6b7c88] schedule_timeout at ffffffffb8baef88 archlinux#3 [ffffa6d14e6b7d10] wait_for_completion at ffffffffb8ba968b archlinux#4 [ffffa6d14e6b7d60] mlx5e_take_all_encap_flows at ffffffffc0f47ec4 [mlx5_core] archlinux#5 [ffffa6d14e6b7da0] mlx5e_rep_update_flows at ffffffffc0f3e734 [mlx5_core] archlinux#6 [ffffa6d14e6b7df8] mlx5e_rep_neigh_update at ffffffffc0f400bb [mlx5_core] archlinux#7 [ffffa6d14e6b7e50] process_one_work at ffffffffb80acc9c archlinux#8 [ffffa6d14e6b7ed0] worker_thread at ffffffffb80ad012 archlinux#9 [ffffa6d14e6b7f10] kthread at ffffffffb80b615d torvalds#10 [ffffa6d14e6b7f50] ret_from_fork at ffffffffb8001b2f After the first encap is attached, flow will be added to encap entry's flows list. If neigh update is running at this time, the following encaps of the flow can't hold the encap_tbl_lock and sleep. If neigh update thread is waiting for that flow's init_done, deadlock happens. Fix it by holding lock outside of the for loop. If neigh update is running, prevent encap flows from offloading. Since the lock is held outside of the for loop, concurrent creation of encap entries is not allowed. So remove unnecessary wait_for_completion call for res_ready. Fixes: 95435ad ("net/mlx5e: Only access fully initialized flows in neigh update") Signed-off-by: Chris Mi <[email protected]> Reviewed-by: Roi Dayan <[email protected]> Reviewed-by: Vlad Buslov <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit a2f4c14 ] A remote DoS vulnerability of RPL Source Routing is assigned CVE-2023-2156. The Source Routing Header (SRH) has the following format: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Next Header | Hdr Ext Len | Routing Type | Segments Left | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | CmprI | CmprE | Pad | Reserved | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | . . . Addresses[1..n] . . . | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ The originator of an SRH places the first hop's IPv6 address in the IPv6 header's IPv6 Destination Address and the second hop's IPv6 address as the first address in Addresses[1..n]. The CmprI and CmprE fields indicate the number of prefix octets that are shared with the IPv6 Destination Address. When CmprI or CmprE is not 0, Addresses[1..n] are compressed as follows: 1..n-1 : (16 - CmprI) bytes n : (16 - CmprE) bytes Segments Left indicates the number of route segments remaining. When the value is not zero, the SRH is forwarded to the next hop. Its address is extracted from Addresses[n - Segment Left + 1] and swapped with IPv6 Destination Address. When Segment Left is greater than or equal to 2, the size of SRH is not changed because Addresses[1..n-1] are decompressed and recompressed with CmprI. OTOH, when Segment Left changes from 1 to 0, the new SRH could have a different size because Addresses[1..n-1] are decompressed with CmprI and recompressed with CmprE. Let's say CmprI is 15 and CmprE is 0. When we receive SRH with Segment Left >= 2, Addresses[1..n-1] have 1 byte for each, and Addresses[n] has 16 bytes. When Segment Left is 1, Addresses[1..n-1] is decompressed to 16 bytes and not recompressed. Finally, the new SRH will need more room in the header, and the size is (16 - 1) * (n - 1) bytes. Here the max value of n is 255 as Segment Left is u8, so in the worst case, we have to allocate 3825 bytes in the skb headroom. However, now we only allocate a small fixed buffer that is IPV6_RPL_SRH_WORST_SWAP_SIZE (16 + 7 bytes). If the decompressed size overflows the room, skb_push() hits BUG() below [0]. Instead of allocating the fixed buffer for every packet, let's allocate enough headroom only when we receive SRH with Segment Left 1. [0]: skbuff: skb_under_panic: text:ffffffff81c9f6e2 len:576 put:576 head:ffff8880070b5180 data:ffff8880070b4fb0 tail:0x70 end:0x140 dev:lo kernel BUG at net/core/skbuff.c:200! invalid opcode: 0000 [archlinux#1] PREEMPT SMP PTI CPU: 0 PID: 154 Comm: python3 Not tainted 6.4.0-rc4-00190-gc308e9ec0047 archlinux#7 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:skb_panic (net/core/skbuff.c:200) Code: 4f 70 50 8b 87 bc 00 00 00 50 8b 87 b8 00 00 00 50 ff b7 c8 00 00 00 4c 8b 8f c0 00 00 00 48 c7 c7 80 6e 77 82 e8 ad 8b 60 ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 RSP: 0018:ffffc90000003da0 EFLAGS: 00000246 RAX: 0000000000000085 RBX: ffff8880058a6600 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff88807dc1c540 RDI: ffff88807dc1c540 RBP: ffffc90000003e48 R08: ffffffff82b392c8 R09: 00000000ffffdfff R10: ffffffff82a592e0 R11: ffffffff82b092e0 R12: ffff888005b1c800 R13: ffff8880070b51b8 R14: ffff888005b1ca18 R15: ffff8880070b5190 FS: 00007f4539f0b740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055670baf3000 CR3: 0000000005b0e000 CR4: 00000000007506f0 PKRU: 55555554 Call Trace: <IRQ> skb_push (net/core/skbuff.c:210) ipv6_rthdr_rcv (./include/linux/skbuff.h:2880 net/ipv6/exthdrs.c:634 net/ipv6/exthdrs.c:718) ip6_protocol_deliver_rcu (net/ipv6/ip6_input.c:437 (discriminator 5)) ip6_input_finish (./include/linux/rcupdate.h:805 net/ipv6/ip6_input.c:483) __netif_receive_skb_one_core (net/core/dev.c:5494) process_backlog (./include/linux/rcupdate.h:805 net/core/dev.c:5934) __napi_poll (net/core/dev.c:6496) net_rx_action (net/core/dev.c:6565 net/core/dev.c:6696) __do_softirq (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/irq.h:142 kernel/softirq.c:572) do_softirq (kernel/softirq.c:472 kernel/softirq.c:459) </IRQ> <TASK> __local_bh_enable_ip (kernel/softirq.c:396) __dev_queue_xmit (net/core/dev.c:4272) ip6_finish_output2 (./include/net/neighbour.h:544 net/ipv6/ip6_output.c:134) rawv6_sendmsg (./include/net/dst.h:458 ./include/linux/netfilter.h:303 net/ipv6/raw.c:656 net/ipv6/raw.c:914) sock_sendmsg (net/socket.c:724 net/socket.c:747) __sys_sendto (net/socket.c:2144) __x64_sys_sendto (net/socket.c:2156 net/socket.c:2152 net/socket.c:2152) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) RIP: 0033:0x7f453a138aea Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89 RSP: 002b:00007ffcc212a1c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00007ffcc212a288 RCX: 00007f453a138aea RDX: 0000000000000060 RSI: 00007f4539084c20 RDI: 0000000000000003 RBP: 00007f4538308e80 R08: 00007ffcc212a300 R09: 000000000000001c R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007f4539712d1b </TASK> Modules linked in: Fixes: 8610c7c ("net: ipv6: add support for rpl sr exthdr") Reported-by: Max VA Closes: https://www.interruptlabs.co.uk/articles/linux-ipv6-route-of-death Signed-off-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 6a2fa13 ] syzkaller reported a use-after-free of UDP kernel socket in cleanup_bearer() without repro. [0][1] When bearer_disable() calls tipc_udp_disable(), cleanup of the UDP kernel socket is deferred by work calling cleanup_bearer(). tipc_net_stop() waits for such works to finish by checking tipc_net(net)->wq_count. However, the work decrements the count too early before releasing the kernel socket, unblocking cleanup_net() and resulting in use-after-free. Let's move the decrement after releasing the socket in cleanup_bearer(). [0]: ref_tracker: net notrefcnt@000000009b3d1faf has 1/1 users at sk_alloc+0x438/0x608 inet_create+0x4c8/0xcb0 __sock_create+0x350/0x6b8 sock_create_kern+0x58/0x78 udp_sock_create4+0x68/0x398 udp_sock_create+0x88/0xc8 tipc_udp_enable+0x5e8/0x848 __tipc_nl_bearer_enable+0x84c/0xed8 tipc_nl_bearer_enable+0x38/0x60 genl_family_rcv_msg_doit+0x170/0x248 genl_rcv_msg+0x400/0x5b0 netlink_rcv_skb+0x1dc/0x398 genl_rcv+0x44/0x68 netlink_unicast+0x678/0x8b0 netlink_sendmsg+0x5e4/0x898 ____sys_sendmsg+0x500/0x830 [1]: BUG: KMSAN: use-after-free in udp_hashslot include/net/udp.h:85 [inline] BUG: KMSAN: use-after-free in udp_lib_unhash+0x3b8/0x930 net/ipv4/udp.c:1979 udp_hashslot include/net/udp.h:85 [inline] udp_lib_unhash+0x3b8/0x930 net/ipv4/udp.c:1979 sk_common_release+0xaf/0x3f0 net/core/sock.c:3820 inet_release+0x1e0/0x260 net/ipv4/af_inet.c:437 inet6_release+0x6f/0xd0 net/ipv6/af_inet6.c:489 __sock_release net/socket.c:658 [inline] sock_release+0xa0/0x210 net/socket.c:686 cleanup_bearer+0x42d/0x4c0 net/tipc/udp_media.c:819 process_one_work kernel/workqueue.c:3229 [inline] process_scheduled_works+0xcaf/0x1c90 kernel/workqueue.c:3310 worker_thread+0xf6c/0x1510 kernel/workqueue.c:3391 kthread+0x531/0x6b0 kernel/kthread.c:389 ret_from_fork+0x60/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:244 Uninit was created at: slab_free_hook mm/slub.c:2269 [inline] slab_free mm/slub.c:4580 [inline] kmem_cache_free+0x207/0xc40 mm/slub.c:4682 net_free net/core/net_namespace.c:454 [inline] cleanup_net+0x16f2/0x19d0 net/core/net_namespace.c:647 process_one_work kernel/workqueue.c:3229 [inline] process_scheduled_works+0xcaf/0x1c90 kernel/workqueue.c:3310 worker_thread+0xf6c/0x1510 kernel/workqueue.c:3391 kthread+0x531/0x6b0 kernel/kthread.c:389 ret_from_fork+0x60/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:244 CPU: 0 UID: 0 PID: 54 Comm: kworker/0:2 Not tainted 6.12.0-rc1-00131-gf66ebf37d69c #7 91723d6f74857f70725e1583cba3cf4adc716cfa Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 Workqueue: events cleanup_bearer Fixes: 26abe14 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") Reported-by: syzkaller <[email protected]> Signed-off-by: Kuniyuki Iwashima <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 5858b68 ] Kernel will hang on destroy admin_q while we create ctrl failed, such as following calltrace: PID: 23644 TASK: ff2d52b40f439fc0 CPU: 2 COMMAND: "nvme" #0 [ff61d23de260fb78] __schedule at ffffffff8323bc15 #1 [ff61d23de260fc08] schedule at ffffffff8323c014 #2 [ff61d23de260fc28] blk_mq_freeze_queue_wait at ffffffff82a3dba1 #3 [ff61d23de260fc78] blk_freeze_queue at ffffffff82a4113a #4 [ff61d23de260fc90] blk_cleanup_queue at ffffffff82a33006 #5 [ff61d23de260fcb0] nvme_rdma_destroy_admin_queue at ffffffffc12686ce #6 [ff61d23de260fcc8] nvme_rdma_setup_ctrl at ffffffffc1268ced #7 [ff61d23de260fd28] nvme_rdma_create_ctrl at ffffffffc126919b #8 [ff61d23de260fd68] nvmf_dev_write at ffffffffc024f362 #9 [ff61d23de260fe38] vfs_write at ffffffff827d5f25 RIP: 00007fda7891d574 RSP: 00007ffe2ef06958 RFLAGS: 00000202 RAX: ffffffffffffffda RBX: 000055e8122a4d90 RCX: 00007fda7891d574 RDX: 000000000000012b RSI: 000055e8122a4d90 RDI: 0000000000000004 RBP: 00007ffe2ef079c0 R8: 000000000000012b R9: 000055e8122a4d90 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000004 R13: 000055e8122923c0 R14: 000000000000012b R15: 00007fda78a54500 ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b This due to we have quiesced admi_q before cancel requests, but forgot to unquiesce before destroy it, as a result we fail to drain the pending requests, and hang on blk_mq_freeze_queue_wait() forever. Here try to reuse nvme_rdma_teardown_admin_queue() to fix this issue and simplify the code. Fixes: 958dc1d ("nvme-rdma: add clean action for failed reconnection") Reported-by: Yingfu.zhou <[email protected]> Signed-off-by: Chunguang.xu <[email protected]> Signed-off-by: Yue.zhao <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Keith Busch <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Its used from trace__run(), for the 'perf trace' live mode, i.e. its strace-like, non-perf.data file processing mode, the most common one. The trace__run() function will set trace->host using machine__new_host() that is supposed to give a machine instance representing the running machine, and since we'll use perf_env__arch_strerrno() to get the right errno -> string table, we need to use machine->env, so initialize it in machine__new_host(). Before the patch: (gdb) run trace --errno-summary -a sleep 1 <SNIP> Summary of events: gvfs-afc-volume (3187), 2 events, 0.0% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ pselect6 1 0 0.000 0.000 0.000 0.000 0.00% GUsbEventThread (3519), 2 events, 0.0% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ poll 1 0 0.000 0.000 0.000 0.000 0.00% <SNIP> Program received signal SIGSEGV, Segmentation fault. 0x00000000005caba0 in perf_env__arch_strerrno (env=0x0, err=110) at util/env.c:478 478 if (env->arch_strerrno == NULL) (gdb) bt #0 0x00000000005caba0 in perf_env__arch_strerrno (env=0x0, err=110) at util/env.c:478 #1 0x00000000004b75d2 in thread__dump_stats (ttrace=0x14f58f0, trace=0x7fffffffa5b0, fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>) at builtin-trace.c:4673 #2 0x00000000004b78bf in trace__fprintf_thread (fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>, thread=0x10fa0b0, trace=0x7fffffffa5b0) at builtin-trace.c:4708 #3 0x00000000004b7ad9 in trace__fprintf_thread_summary (trace=0x7fffffffa5b0, fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>) at builtin-trace.c:4747 #4 0x00000000004b656e in trace__run (trace=0x7fffffffa5b0, argc=2, argv=0x7fffffffde60) at builtin-trace.c:4456 #5 0x00000000004ba43e in cmd_trace (argc=2, argv=0x7fffffffde60) at builtin-trace.c:5487 #6 0x00000000004c0414 in run_builtin (p=0xec3068 <commands+648>, argc=5, argv=0x7fffffffde60) at perf.c:351 #7 0x00000000004c06bb in handle_internal_command (argc=5, argv=0x7fffffffde60) at perf.c:404 #8 0x00000000004c0814 in run_argv (argcp=0x7fffffffdc4c, argv=0x7fffffffdc40) at perf.c:448 #9 0x00000000004c0b5d in main (argc=5, argv=0x7fffffffde60) at perf.c:560 (gdb) After: root@number:~# perf trace -a --errno-summary sleep 1 <SNIP> pw-data-loop (2685), 1410 events, 16.0% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ epoll_wait 188 0 983.428 0.000 5.231 15.595 8.68% ioctl 94 0 0.811 0.004 0.009 0.016 2.82% read 188 0 0.322 0.001 0.002 0.006 5.15% write 141 0 0.280 0.001 0.002 0.018 8.39% timerfd_settime 94 0 0.138 0.001 0.001 0.007 6.47% gnome-control-c (179406), 1848 events, 20.9% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ poll 222 0 959.577 0.000 4.322 21.414 11.40% recvmsg 150 0 0.539 0.001 0.004 0.013 5.12% write 300 0 0.442 0.001 0.001 0.007 3.29% read 150 0 0.183 0.001 0.001 0.009 5.53% getpid 102 0 0.101 0.000 0.001 0.008 7.82% root@number:~# Fixes: 54373b5 ("perf env: Introduce perf_env__arch_strerrno()") Reported-by: Veronika Molnarova <[email protected]> Signed-off-by: Arnaldo Carvalho de Melo <[email protected]> Acked-by: Veronika Molnarova <[email protected]> Acked-by: Michael Petlan <[email protected]> Tested-by: Michael Petlan <[email protected]> Link: https://lore.kernel.org/r/Z0XffUgNSv_9OjOi@x1 Signed-off-by: Namhyung Kim <[email protected]>
The 5760X (P7) chip's HW GRO/LRO interface is very similar to that of the previous generation (5750X or P5). However, the aggregation ID fields in the completion structures on P7 have been redefined from 16 bits to 12 bits. The freed up 4 bits are redefined for part of the metadata such as the VLAN ID. The aggregation ID mask was not modified when adding support for P7 chips. Including the extra 4 bits for the aggregation ID can potentially cause the driver to store or fetch the packet header of GRO/LRO packets in the wrong TPA buffer. It may hit the BUG() condition in __skb_pull() because the SKB contains no valid packet header: kernel BUG at include/linux/skbuff.h:2766! Oops: invalid opcode: 0000 1 PREEMPT SMP NOPTI CPU: 4 UID: 0 PID: 0 Comm: swapper/4 Kdump: loaded Tainted: G OE 6.12.0-rc2+ #7 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: Dell Inc. PowerEdge R760/0VRV9X, BIOS 1.0.1 12/27/2022 RIP: 0010:eth_type_trans+0xda/0x140 Code: 80 00 00 00 eb c1 8b 47 70 2b 47 74 48 8b 97 d0 00 00 00 83 f8 01 7e 1b 48 85 d2 74 06 66 83 3a ff 74 09 b8 00 04 00 00 eb a5 <0f> 0b b8 00 01 00 00 eb 9c 48 85 ff 74 eb 31 f6 b9 02 00 00 00 48 RSP: 0018:ff615003803fcc28 EFLAGS: 00010283 RAX: 00000000000022d2 RBX: 0000000000000003 RCX: ff2e8c25da334040 RDX: 0000000000000040 RSI: ff2e8c25c1ce8000 RDI: ff2e8c25869f9000 RBP: ff2e8c258c31c000 R08: ff2e8c25da334000 R09: 0000000000000001 R10: ff2e8c25da3342c0 R11: ff2e8c25c1ce89c0 R12: ff2e8c258e0990b0 R13: ff2e8c25bb120000 R14: ff2e8c25c1ce89c0 R15: ff2e8c25869f9000 FS: 0000000000000000(0000) GS:ff2e8c34be300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055f05317e4c8 CR3: 000000108bac6006 CR4: 0000000000773ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> ? die+0x33/0x90 ? do_trap+0xd9/0x100 ? eth_type_trans+0xda/0x140 ? do_error_trap+0x65/0x80 ? eth_type_trans+0xda/0x140 ? exc_invalid_op+0x4e/0x70 ? eth_type_trans+0xda/0x140 ? asm_exc_invalid_op+0x16/0x20 ? eth_type_trans+0xda/0x140 bnxt_tpa_end+0x10b/0x6b0 [bnxt_en] ? bnxt_tpa_start+0x195/0x320 [bnxt_en] bnxt_rx_pkt+0x902/0xd90 [bnxt_en] ? __bnxt_tx_int.constprop.0+0x89/0x300 [bnxt_en] ? kmem_cache_free+0x343/0x440 ? __bnxt_tx_int.constprop.0+0x24f/0x300 [bnxt_en] __bnxt_poll_work+0x193/0x370 [bnxt_en] bnxt_poll_p5+0x9a/0x300 [bnxt_en] ? try_to_wake_up+0x209/0x670 __napi_poll+0x29/0x1b0 Fix it by redefining the aggregation ID mask for P5_PLUS chips to be 12 bits. This will work because the maximum aggregation ID is less than 4096 on all P5_PLUS chips. Fixes: 13d2d3d ("bnxt_en: Add new P7 hardware interface definitions") Reviewed-by: Damodharam Ammepalli <[email protected]> Reviewed-by: Kalesh AP <[email protected]> Reviewed-by: Andy Gospodarek <[email protected]> Signed-off-by: Michael Chan <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
[ Upstream commit 88a6e2f ] Its used from trace__run(), for the 'perf trace' live mode, i.e. its strace-like, non-perf.data file processing mode, the most common one. The trace__run() function will set trace->host using machine__new_host() that is supposed to give a machine instance representing the running machine, and since we'll use perf_env__arch_strerrno() to get the right errno -> string table, we need to use machine->env, so initialize it in machine__new_host(). Before the patch: (gdb) run trace --errno-summary -a sleep 1 <SNIP> Summary of events: gvfs-afc-volume (3187), 2 events, 0.0% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ pselect6 1 0 0.000 0.000 0.000 0.000 0.00% GUsbEventThread (3519), 2 events, 0.0% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ poll 1 0 0.000 0.000 0.000 0.000 0.00% <SNIP> Program received signal SIGSEGV, Segmentation fault. 0x00000000005caba0 in perf_env__arch_strerrno (env=0x0, err=110) at util/env.c:478 478 if (env->arch_strerrno == NULL) (gdb) bt #0 0x00000000005caba0 in perf_env__arch_strerrno (env=0x0, err=110) at util/env.c:478 #1 0x00000000004b75d2 in thread__dump_stats (ttrace=0x14f58f0, trace=0x7fffffffa5b0, fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>) at builtin-trace.c:4673 #2 0x00000000004b78bf in trace__fprintf_thread (fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>, thread=0x10fa0b0, trace=0x7fffffffa5b0) at builtin-trace.c:4708 #3 0x00000000004b7ad9 in trace__fprintf_thread_summary (trace=0x7fffffffa5b0, fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>) at builtin-trace.c:4747 #4 0x00000000004b656e in trace__run (trace=0x7fffffffa5b0, argc=2, argv=0x7fffffffde60) at builtin-trace.c:4456 #5 0x00000000004ba43e in cmd_trace (argc=2, argv=0x7fffffffde60) at builtin-trace.c:5487 #6 0x00000000004c0414 in run_builtin (p=0xec3068 <commands+648>, argc=5, argv=0x7fffffffde60) at perf.c:351 #7 0x00000000004c06bb in handle_internal_command (argc=5, argv=0x7fffffffde60) at perf.c:404 #8 0x00000000004c0814 in run_argv (argcp=0x7fffffffdc4c, argv=0x7fffffffdc40) at perf.c:448 #9 0x00000000004c0b5d in main (argc=5, argv=0x7fffffffde60) at perf.c:560 (gdb) After: root@number:~# perf trace -a --errno-summary sleep 1 <SNIP> pw-data-loop (2685), 1410 events, 16.0% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ epoll_wait 188 0 983.428 0.000 5.231 15.595 8.68% ioctl 94 0 0.811 0.004 0.009 0.016 2.82% read 188 0 0.322 0.001 0.002 0.006 5.15% write 141 0 0.280 0.001 0.002 0.018 8.39% timerfd_settime 94 0 0.138 0.001 0.001 0.007 6.47% gnome-control-c (179406), 1848 events, 20.9% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ poll 222 0 959.577 0.000 4.322 21.414 11.40% recvmsg 150 0 0.539 0.001 0.004 0.013 5.12% write 300 0 0.442 0.001 0.001 0.007 3.29% read 150 0 0.183 0.001 0.001 0.009 5.53% getpid 102 0 0.101 0.000 0.001 0.008 7.82% root@number:~# Fixes: 54373b5 ("perf env: Introduce perf_env__arch_strerrno()") Reported-by: Veronika Molnarova <[email protected]> Signed-off-by: Arnaldo Carvalho de Melo <[email protected]> Acked-by: Veronika Molnarova <[email protected]> Acked-by: Michael Petlan <[email protected]> Tested-by: Michael Petlan <[email protected]> Link: https://lore.kernel.org/r/Z0XffUgNSv_9OjOi@x1 Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 24c6843 ] The 5760X (P7) chip's HW GRO/LRO interface is very similar to that of the previous generation (5750X or P5). However, the aggregation ID fields in the completion structures on P7 have been redefined from 16 bits to 12 bits. The freed up 4 bits are redefined for part of the metadata such as the VLAN ID. The aggregation ID mask was not modified when adding support for P7 chips. Including the extra 4 bits for the aggregation ID can potentially cause the driver to store or fetch the packet header of GRO/LRO packets in the wrong TPA buffer. It may hit the BUG() condition in __skb_pull() because the SKB contains no valid packet header: kernel BUG at include/linux/skbuff.h:2766! Oops: invalid opcode: 0000 1 PREEMPT SMP NOPTI CPU: 4 UID: 0 PID: 0 Comm: swapper/4 Kdump: loaded Tainted: G OE 6.12.0-rc2+ #7 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: Dell Inc. PowerEdge R760/0VRV9X, BIOS 1.0.1 12/27/2022 RIP: 0010:eth_type_trans+0xda/0x140 Code: 80 00 00 00 eb c1 8b 47 70 2b 47 74 48 8b 97 d0 00 00 00 83 f8 01 7e 1b 48 85 d2 74 06 66 83 3a ff 74 09 b8 00 04 00 00 eb a5 <0f> 0b b8 00 01 00 00 eb 9c 48 85 ff 74 eb 31 f6 b9 02 00 00 00 48 RSP: 0018:ff615003803fcc28 EFLAGS: 00010283 RAX: 00000000000022d2 RBX: 0000000000000003 RCX: ff2e8c25da334040 RDX: 0000000000000040 RSI: ff2e8c25c1ce8000 RDI: ff2e8c25869f9000 RBP: ff2e8c258c31c000 R08: ff2e8c25da334000 R09: 0000000000000001 R10: ff2e8c25da3342c0 R11: ff2e8c25c1ce89c0 R12: ff2e8c258e0990b0 R13: ff2e8c25bb120000 R14: ff2e8c25c1ce89c0 R15: ff2e8c25869f9000 FS: 0000000000000000(0000) GS:ff2e8c34be300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055f05317e4c8 CR3: 000000108bac6006 CR4: 0000000000773ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> ? die+0x33/0x90 ? do_trap+0xd9/0x100 ? eth_type_trans+0xda/0x140 ? do_error_trap+0x65/0x80 ? eth_type_trans+0xda/0x140 ? exc_invalid_op+0x4e/0x70 ? eth_type_trans+0xda/0x140 ? asm_exc_invalid_op+0x16/0x20 ? eth_type_trans+0xda/0x140 bnxt_tpa_end+0x10b/0x6b0 [bnxt_en] ? bnxt_tpa_start+0x195/0x320 [bnxt_en] bnxt_rx_pkt+0x902/0xd90 [bnxt_en] ? __bnxt_tx_int.constprop.0+0x89/0x300 [bnxt_en] ? kmem_cache_free+0x343/0x440 ? __bnxt_tx_int.constprop.0+0x24f/0x300 [bnxt_en] __bnxt_poll_work+0x193/0x370 [bnxt_en] bnxt_poll_p5+0x9a/0x300 [bnxt_en] ? try_to_wake_up+0x209/0x670 __napi_poll+0x29/0x1b0 Fix it by redefining the aggregation ID mask for P5_PLUS chips to be 12 bits. This will work because the maximum aggregation ID is less than 4096 on all P5_PLUS chips. Fixes: 13d2d3d ("bnxt_en: Add new P7 hardware interface definitions") Reviewed-by: Damodharam Ammepalli <[email protected]> Reviewed-by: Kalesh AP <[email protected]> Reviewed-by: Andy Gospodarek <[email protected]> Signed-off-by: Michael Chan <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
…le_direct_reclaim() The task sometimes continues looping in throttle_direct_reclaim() because allow_direct_reclaim(pgdat) keeps returning false. #0 [ffff80002cb6f8d0] __switch_to at ffff8000080095ac #1 [ffff80002cb6f900] __schedule at ffff800008abbd1c #2 [ffff80002cb6f990] schedule at ffff800008abc50c #3 [ffff80002cb6f9b0] throttle_direct_reclaim at ffff800008273550 #4 [ffff80002cb6fa20] try_to_free_pages at ffff800008277b68 #5 [ffff80002cb6fae0] __alloc_pages_nodemask at ffff8000082c4660 #6 [ffff80002cb6fc50] alloc_pages_vma at ffff8000082e4a98 #7 [ffff80002cb6fca0] do_anonymous_page at ffff80000829f5a8 #8 [ffff80002cb6fce0] __handle_mm_fault at ffff8000082a5974 #9 [ffff80002cb6fd90] handle_mm_fault at ffff8000082a5bd4 At this point, the pgdat contains the following two zones: NODE: 4 ZONE: 0 ADDR: ffff00817fffe540 NAME: "DMA32" SIZE: 20480 MIN/LOW/HIGH: 11/28/45 VM_STAT: NR_FREE_PAGES: 359 NR_ZONE_INACTIVE_ANON: 18813 NR_ZONE_ACTIVE_ANON: 0 NR_ZONE_INACTIVE_FILE: 50 NR_ZONE_ACTIVE_FILE: 0 NR_ZONE_UNEVICTABLE: 0 NR_ZONE_WRITE_PENDING: 0 NR_MLOCK: 0 NR_BOUNCE: 0 NR_ZSPAGES: 0 NR_FREE_CMA_PAGES: 0 NODE: 4 ZONE: 1 ADDR: ffff00817fffec00 NAME: "Normal" SIZE: 8454144 PRESENT: 98304 MIN/LOW/HIGH: 68/166/264 VM_STAT: NR_FREE_PAGES: 146 NR_ZONE_INACTIVE_ANON: 94668 NR_ZONE_ACTIVE_ANON: 3 NR_ZONE_INACTIVE_FILE: 735 NR_ZONE_ACTIVE_FILE: 78 NR_ZONE_UNEVICTABLE: 0 NR_ZONE_WRITE_PENDING: 0 NR_MLOCK: 0 NR_BOUNCE: 0 NR_ZSPAGES: 0 NR_FREE_CMA_PAGES: 0 In allow_direct_reclaim(), while processing ZONE_DMA32, the sum of inactive/active file-backed pages calculated in zone_reclaimable_pages() based on the result of zone_page_state_snapshot() is zero. Additionally, since this system lacks swap, the calculation of inactive/ active anonymous pages is skipped. crash> p nr_swap_pages nr_swap_pages = $1937 = { counter = 0 } As a result, ZONE_DMA32 is deemed unreclaimable and skipped, moving on to the processing of the next zone, ZONE_NORMAL, despite ZONE_DMA32 having free pages significantly exceeding the high watermark. The problem is that the pgdat->kswapd_failures hasn't been incremented. crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_failures $1935 = 0x0 This is because the node deemed balanced. The node balancing logic in balance_pgdat() evaluates all zones collectively. If one or more zones (e.g., ZONE_DMA32) have enough free pages to meet their watermarks, the entire node is deemed balanced. This causes balance_pgdat() to exit early before incrementing the kswapd_failures, as it considers the overall memory state acceptable, even though some zones (like ZONE_NORMAL) remain under significant pressure. The patch ensures that zone_reclaimable_pages() includes free pages (NR_FREE_PAGES) in its calculation when no other reclaimable pages are available (e.g., file-backed or anonymous pages). This change prevents zones like ZONE_DMA32, which have sufficient free pages, from being mistakenly deemed unreclaimable. By doing so, the patch ensures proper node balancing, avoids masking pressure on other zones like ZONE_NORMAL, and prevents infinite loops in throttle_direct_reclaim() caused by allow_direct_reclaim(pgdat) repeatedly returning false. The kernel hangs due to a task stuck in throttle_direct_reclaim(), caused by a node being incorrectly deemed balanced despite pressure in certain zones, such as ZONE_NORMAL. This issue arises from zone_reclaimable_pages() returning 0 for zones without reclaimable file- backed or anonymous pages, causing zones like ZONE_DMA32 with sufficient free pages to be skipped. The lack of swap or reclaimable pages results in ZONE_DMA32 being ignored during reclaim, masking pressure in other zones. Consequently, pgdat->kswapd_failures remains 0 in balance_pgdat(), preventing fallback mechanisms in allow_direct_reclaim() from being triggered, leading to an infinite loop in throttle_direct_reclaim(). This patch modifies zone_reclaimable_pages() to account for free pages (NR_FREE_PAGES) when no other reclaimable pages exist. This ensures zones with sufficient free pages are not skipped, enabling proper balancing and reclaim behavior. [[email protected]: coding-style cleanups] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 5a1c84b ("mm: remove reclaim and compaction retry approximations") Signed-off-by: Seiji Nishikawa <[email protected]> Cc: Mel Gorman <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
The folio refcount may be increased unexpectly through try_get_folio() by caller such as split_huge_pages. In huge_pmd_unshare(), we use refcount to check whether a pmd page table is shared. The check is incorrect if the refcount is increased by the above caller, and this can cause the page table leaked: BUG: Bad page state in process sh pfn:109324 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x66 pfn:0x109324 flags: 0x17ffff800000000(node=0|zone=2|lastcpupid=0xfffff) page_type: f2(table) raw: 017ffff800000000 0000000000000000 0000000000000000 0000000000000000 raw: 0000000000000066 0000000000000000 00000000f2000000 0000000000000000 page dumped because: nonzero mapcount ... CPU: 31 UID: 0 PID: 7515 Comm: sh Kdump: loaded Tainted: G B 6.13.0-rc2master+ #7 Tainted: [B]=BAD_PAGE Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 Call trace: show_stack+0x20/0x38 (C) dump_stack_lvl+0x80/0xf8 dump_stack+0x18/0x28 bad_page+0x8c/0x130 free_page_is_bad_report+0xa4/0xb0 free_unref_page+0x3cc/0x620 __folio_put+0xf4/0x158 split_huge_pages_all+0x1e0/0x3e8 split_huge_pages_write+0x25c/0x2d8 full_proxy_write+0x64/0xd8 vfs_write+0xcc/0x280 ksys_write+0x70/0x110 __arm64_sys_write+0x24/0x38 invoke_syscall+0x50/0x120 el0_svc_common.constprop.0+0xc8/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x34/0x128 el0t_64_sync_handler+0xc8/0xd0 el0t_64_sync+0x190/0x198 The issue may be triggered by damon, offline_page, page_idle, etc, which will increase the refcount of page table. 1. The page table itself will be discarded after reporting the "nonzero mapcount". 2. The HugeTLB page mapped by the page table miss freeing since we treat the page table as shared and a shared page table will not be unmapped. Fix it by introducing independent PMD page table shared count. As described by comment, pt_index/pt_mm/pt_frag_refcount are used for s390 gmap, x86 pgds and powerpc, pt_share_count is used for x86/arm64/riscv pmds, so we can reuse the field as pt_share_count. Link: https://lkml.kernel.org/r/[email protected] Fixes: 39dde65 ("[PATCH] shared page table for hugetlb page") Signed-off-by: Liu Shixin <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Ken Chen <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nanyong Sun <[email protected]> Cc: Jane Chu <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
…le_direct_reclaim() commit 6aaced5 upstream. The task sometimes continues looping in throttle_direct_reclaim() because allow_direct_reclaim(pgdat) keeps returning false. #0 [ffff80002cb6f8d0] __switch_to at ffff8000080095ac #1 [ffff80002cb6f900] __schedule at ffff800008abbd1c #2 [ffff80002cb6f990] schedule at ffff800008abc50c #3 [ffff80002cb6f9b0] throttle_direct_reclaim at ffff800008273550 #4 [ffff80002cb6fa20] try_to_free_pages at ffff800008277b68 #5 [ffff80002cb6fae0] __alloc_pages_nodemask at ffff8000082c4660 #6 [ffff80002cb6fc50] alloc_pages_vma at ffff8000082e4a98 #7 [ffff80002cb6fca0] do_anonymous_page at ffff80000829f5a8 #8 [ffff80002cb6fce0] __handle_mm_fault at ffff8000082a5974 #9 [ffff80002cb6fd90] handle_mm_fault at ffff8000082a5bd4 At this point, the pgdat contains the following two zones: NODE: 4 ZONE: 0 ADDR: ffff00817fffe540 NAME: "DMA32" SIZE: 20480 MIN/LOW/HIGH: 11/28/45 VM_STAT: NR_FREE_PAGES: 359 NR_ZONE_INACTIVE_ANON: 18813 NR_ZONE_ACTIVE_ANON: 0 NR_ZONE_INACTIVE_FILE: 50 NR_ZONE_ACTIVE_FILE: 0 NR_ZONE_UNEVICTABLE: 0 NR_ZONE_WRITE_PENDING: 0 NR_MLOCK: 0 NR_BOUNCE: 0 NR_ZSPAGES: 0 NR_FREE_CMA_PAGES: 0 NODE: 4 ZONE: 1 ADDR: ffff00817fffec00 NAME: "Normal" SIZE: 8454144 PRESENT: 98304 MIN/LOW/HIGH: 68/166/264 VM_STAT: NR_FREE_PAGES: 146 NR_ZONE_INACTIVE_ANON: 94668 NR_ZONE_ACTIVE_ANON: 3 NR_ZONE_INACTIVE_FILE: 735 NR_ZONE_ACTIVE_FILE: 78 NR_ZONE_UNEVICTABLE: 0 NR_ZONE_WRITE_PENDING: 0 NR_MLOCK: 0 NR_BOUNCE: 0 NR_ZSPAGES: 0 NR_FREE_CMA_PAGES: 0 In allow_direct_reclaim(), while processing ZONE_DMA32, the sum of inactive/active file-backed pages calculated in zone_reclaimable_pages() based on the result of zone_page_state_snapshot() is zero. Additionally, since this system lacks swap, the calculation of inactive/ active anonymous pages is skipped. crash> p nr_swap_pages nr_swap_pages = $1937 = { counter = 0 } As a result, ZONE_DMA32 is deemed unreclaimable and skipped, moving on to the processing of the next zone, ZONE_NORMAL, despite ZONE_DMA32 having free pages significantly exceeding the high watermark. The problem is that the pgdat->kswapd_failures hasn't been incremented. crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_failures $1935 = 0x0 This is because the node deemed balanced. The node balancing logic in balance_pgdat() evaluates all zones collectively. If one or more zones (e.g., ZONE_DMA32) have enough free pages to meet their watermarks, the entire node is deemed balanced. This causes balance_pgdat() to exit early before incrementing the kswapd_failures, as it considers the overall memory state acceptable, even though some zones (like ZONE_NORMAL) remain under significant pressure. The patch ensures that zone_reclaimable_pages() includes free pages (NR_FREE_PAGES) in its calculation when no other reclaimable pages are available (e.g., file-backed or anonymous pages). This change prevents zones like ZONE_DMA32, which have sufficient free pages, from being mistakenly deemed unreclaimable. By doing so, the patch ensures proper node balancing, avoids masking pressure on other zones like ZONE_NORMAL, and prevents infinite loops in throttle_direct_reclaim() caused by allow_direct_reclaim(pgdat) repeatedly returning false. The kernel hangs due to a task stuck in throttle_direct_reclaim(), caused by a node being incorrectly deemed balanced despite pressure in certain zones, such as ZONE_NORMAL. This issue arises from zone_reclaimable_pages() returning 0 for zones without reclaimable file- backed or anonymous pages, causing zones like ZONE_DMA32 with sufficient free pages to be skipped. The lack of swap or reclaimable pages results in ZONE_DMA32 being ignored during reclaim, masking pressure in other zones. Consequently, pgdat->kswapd_failures remains 0 in balance_pgdat(), preventing fallback mechanisms in allow_direct_reclaim() from being triggered, leading to an infinite loop in throttle_direct_reclaim(). This patch modifies zone_reclaimable_pages() to account for free pages (NR_FREE_PAGES) when no other reclaimable pages exist. This ensures zones with sufficient free pages are not skipped, enabling proper balancing and reclaim behavior. [[email protected]: coding-style cleanups] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 5a1c84b ("mm: remove reclaim and compaction retry approximations") Signed-off-by: Seiji Nishikawa <[email protected]> Cc: Mel Gorman <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
commit 59d9094 upstream. The folio refcount may be increased unexpectly through try_get_folio() by caller such as split_huge_pages. In huge_pmd_unshare(), we use refcount to check whether a pmd page table is shared. The check is incorrect if the refcount is increased by the above caller, and this can cause the page table leaked: BUG: Bad page state in process sh pfn:109324 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x66 pfn:0x109324 flags: 0x17ffff800000000(node=0|zone=2|lastcpupid=0xfffff) page_type: f2(table) raw: 017ffff800000000 0000000000000000 0000000000000000 0000000000000000 raw: 0000000000000066 0000000000000000 00000000f2000000 0000000000000000 page dumped because: nonzero mapcount ... CPU: 31 UID: 0 PID: 7515 Comm: sh Kdump: loaded Tainted: G B 6.13.0-rc2master+ #7 Tainted: [B]=BAD_PAGE Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 Call trace: show_stack+0x20/0x38 (C) dump_stack_lvl+0x80/0xf8 dump_stack+0x18/0x28 bad_page+0x8c/0x130 free_page_is_bad_report+0xa4/0xb0 free_unref_page+0x3cc/0x620 __folio_put+0xf4/0x158 split_huge_pages_all+0x1e0/0x3e8 split_huge_pages_write+0x25c/0x2d8 full_proxy_write+0x64/0xd8 vfs_write+0xcc/0x280 ksys_write+0x70/0x110 __arm64_sys_write+0x24/0x38 invoke_syscall+0x50/0x120 el0_svc_common.constprop.0+0xc8/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x34/0x128 el0t_64_sync_handler+0xc8/0xd0 el0t_64_sync+0x190/0x198 The issue may be triggered by damon, offline_page, page_idle, etc, which will increase the refcount of page table. 1. The page table itself will be discarded after reporting the "nonzero mapcount". 2. The HugeTLB page mapped by the page table miss freeing since we treat the page table as shared and a shared page table will not be unmapped. Fix it by introducing independent PMD page table shared count. As described by comment, pt_index/pt_mm/pt_frag_refcount are used for s390 gmap, x86 pgds and powerpc, pt_share_count is used for x86/arm64/riscv pmds, so we can reuse the field as pt_share_count. Link: https://lkml.kernel.org/r/[email protected] Fixes: 39dde65 ("[PATCH] shared page table for hugetlb page") Signed-off-by: Liu Shixin <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Ken Chen <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nanyong Sun <[email protected]> Cc: Jane Chu <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
When COWing a relocation tree path, at relocation.c:replace_path(), we can trigger a lockdep splat while we are in the btrfs_search_slot() call against the relocation root. This happens in that callchain at ctree.c:read_block_for_search() when we happen to find a child extent buffer already loaded through the fs tree with a lockdep class set to the fs tree. So when we attempt to lock that extent buffer through a relocation tree we have to reset the lockdep class to the class for a relocation tree, since a relocation tree has extent buffers that used to belong to a fs tree and may currently be already loaded (we swap extent buffers between the two trees at the end of replace_path()). However we are missing calls to btrfs_maybe_reset_lockdep_class() to reset the lockdep class at ctree.c:read_block_for_search() before we read lock an extent buffer, just like we did for btrfs_search_slot() in commit b40130b ("btrfs: fix lockdep splat with reloc root extent buffers"). So add the missing btrfs_maybe_reset_lockdep_class() calls before the attempts to read lock an extent buffer at ctree.c:read_block_for_search(). The lockdep splat was reported by syzbot and it looks like this: ====================================================== WARNING: possible circular locking dependency detected 6.13.0-rc5-syzkaller-00163-gab75170520d4 #0 Not tainted ------------------------------------------------------ syz.0.0/5335 is trying to acquire lock: ffff8880545dbc38 (btrfs-tree-01){++++}-{4:4}, at: btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146 but task is already holding lock: ffff8880545dba58 (btrfs-treloc-02/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-treloc-02/1){+.+.}-{4:4}: reacquire_held_locks+0x3eb/0x690 kernel/locking/lockdep.c:5374 __lock_release kernel/locking/lockdep.c:5563 [inline] lock_release+0x396/0xa30 kernel/locking/lockdep.c:5870 up_write+0x79/0x590 kernel/locking/rwsem.c:1629 btrfs_force_cow_block+0x14b3/0x1fd0 fs/btrfs/ctree.c:660 btrfs_cow_block+0x371/0x830 fs/btrfs/ctree.c:755 btrfs_search_slot+0xc01/0x3180 fs/btrfs/ctree.c:2153 replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224 merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692 merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942 relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754 btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087 btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494 __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278 btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655 btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #1 (btrfs-tree-01/1){+.+.}-{4:4}: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 down_write_nested+0xa2/0x220 kernel/locking/rwsem.c:1693 btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189 btrfs_init_new_buffer fs/btrfs/extent-tree.c:5052 [inline] btrfs_alloc_tree_block+0x41c/0x1440 fs/btrfs/extent-tree.c:5132 btrfs_force_cow_block+0x526/0x1fd0 fs/btrfs/ctree.c:573 btrfs_cow_block+0x371/0x830 fs/btrfs/ctree.c:755 btrfs_search_slot+0xc01/0x3180 fs/btrfs/ctree.c:2153 btrfs_insert_empty_items+0x9c/0x1a0 fs/btrfs/ctree.c:4351 btrfs_insert_empty_item fs/btrfs/ctree.h:688 [inline] btrfs_insert_inode_ref+0x2bb/0xf80 fs/btrfs/inode-item.c:330 btrfs_rename_exchange fs/btrfs/inode.c:7990 [inline] btrfs_rename2+0xcb7/0x2b90 fs/btrfs/inode.c:8374 vfs_rename+0xbdb/0xf00 fs/namei.c:5067 do_renameat2+0xd94/0x13f0 fs/namei.c:5224 __do_sys_renameat2 fs/namei.c:5258 [inline] __se_sys_renameat2 fs/namei.c:5255 [inline] __x64_sys_renameat2+0xce/0xe0 fs/namei.c:5255 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #0 (btrfs-tree-01){++++}-{4:4}: check_prev_add kernel/locking/lockdep.c:3161 [inline] check_prevs_add kernel/locking/lockdep.c:3280 [inline] validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904 __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 down_read_nested+0xb5/0xa50 kernel/locking/rwsem.c:1649 btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146 btrfs_tree_read_lock fs/btrfs/locking.h:188 [inline] read_block_for_search+0x718/0xbb0 fs/btrfs/ctree.c:1610 btrfs_search_slot+0x1274/0x3180 fs/btrfs/ctree.c:2237 replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224 merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692 merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942 relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754 btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087 btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494 __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278 btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655 btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f other info that might help us debug this: Chain exists of: btrfs-tree-01 --> btrfs-tree-01/1 --> btrfs-treloc-02/1 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-treloc-02/1); lock(btrfs-tree-01/1); lock(btrfs-treloc-02/1); rlock(btrfs-tree-01); *** DEADLOCK *** 8 locks held by syz.0.0/5335: #0: ffff88801e3ae420 (sb_writers#13){.+.+}-{0:0}, at: mnt_want_write_file+0x5e/0x200 fs/namespace.c:559 #1: ffff888052c760d0 (&fs_info->reclaim_bgs_lock){+.+.}-{4:4}, at: __btrfs_balance+0x4c2/0x26b0 fs/btrfs/volumes.c:4183 #2: ffff888052c74850 (&fs_info->cleaner_mutex){+.+.}-{4:4}, at: btrfs_relocate_block_group+0x775/0xd90 fs/btrfs/relocation.c:4086 #3: ffff88801e3ae610 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xf11/0x1ad0 fs/btrfs/relocation.c:1659 #4: ffff888052c76470 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x405/0xda0 fs/btrfs/transaction.c:288 #5: ffff888052c76498 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x405/0xda0 fs/btrfs/transaction.c:288 #6: ffff8880545db878 (btrfs-tree-01/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189 #7: ffff8880545dba58 (btrfs-treloc-02/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189 stack backtrace: CPU: 0 UID: 0 PID: 5335 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller-00163-gab75170520d4 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_circular_bug+0x13a/0x1b0 kernel/locking/lockdep.c:2074 check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2206 check_prev_add kernel/locking/lockdep.c:3161 [inline] check_prevs_add kernel/locking/lockdep.c:3280 [inline] validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904 __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 down_read_nested+0xb5/0xa50 kernel/locking/rwsem.c:1649 btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146 btrfs_tree_read_lock fs/btrfs/locking.h:188 [inline] read_block_for_search+0x718/0xbb0 fs/btrfs/ctree.c:1610 btrfs_search_slot+0x1274/0x3180 fs/btrfs/ctree.c:2237 replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224 merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692 merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942 relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754 btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087 btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494 __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278 btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655 btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f1ac6985d29 Code: ff ff c3 (...) RSP: 002b:00007f1ac63fe038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f1ac6b76160 RCX: 00007f1ac6985d29 RDX: 0000000020000180 RSI: 00000000c4009420 RDI: 0000000000000007 RBP: 00007f1ac6a01b08 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000001 R14: 00007f1ac6b76160 R15: 00007fffda145a88 </TASK> Reported-by: [email protected] Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Fixes: 9978599 ("btrfs: reduce lock contention when eb cache miss for btree search") Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
devm_platform_profile_register() expects a pointer to the private driver data but instead an address of the pointer variable is passed due to a typo. This leads to the crashes later: BUG: unable to handle page fault for address: 00000000fe0d0044 PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 6 UID: 0 PID: 1284 Comm: tuned Tainted: G W 6.13.0+ #7 Tainted: [W]=WARN Hardware name: LENOVO 21D0/LNVNB161216, BIOS J6CN45WW 03/17/2023 RIP: 0010:__mutex_lock.constprop.0+0x6bf/0x7f0 Call Trace: <TASK> dytc_profile_set+0x4a/0x140 [ideapad_laptop] _store_and_notify+0x13/0x40 [platform_profile] class_for_each_device+0x145/0x180 platform_profile_store+0xc0/0x130 [platform_profile] kernfs_fop_write_iter+0x13e/0x1f0 vfs_write+0x290/0x450 ksys_write+0x6c/0xe0 do_syscall_64+0x82/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e Found by Linux Verification Center (linuxtesting.org). Fixes: 249c576 ("ACPI: platform_profile: Let drivers set drvdata to the class device") Signed-off-by: Fedor Pchelkin <[email protected]> Reviewed-by: Kurt Borja <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Ilpo Järvinen <[email protected]> Signed-off-by: Ilpo Järvinen <[email protected]>
[ Upstream commit c7b87ce ] libtraceevent parses and returns an array of argument fields, sometimes larger than RAW_SYSCALL_ARGS_NUM (6) because it includes "__syscall_nr", idx will traverse to index 6 (7th element) whereas sc->fmt->arg holds 6 elements max, creating an out-of-bounds access. This runtime error is found by UBsan. The error message: $ sudo UBSAN_OPTIONS=print_stacktrace=1 ./perf trace -a --max-events=1 builtin-trace.c:1966:35: runtime error: index 6 out of bounds for type 'syscall_arg_fmt [6]' #0 0x5c04956be5fe in syscall__alloc_arg_fmts /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:1966 #1 0x5c04956c0510 in trace__read_syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2110 #2 0x5c04956c372b in trace__syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2436 #3 0x5c04956d2f39 in trace__init_syscalls_bpf_prog_array_maps /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:3897 #4 0x5c04956d6d25 in trace__run /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:4335 #5 0x5c04956e112e in cmd_trace /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:5502 #6 0x5c04956eda7d in run_builtin /home/howard/hw/linux-perf/tools/perf/perf.c:351 #7 0x5c04956ee0a8 in handle_internal_command /home/howard/hw/linux-perf/tools/perf/perf.c:404 #8 0x5c04956ee37f in run_argv /home/howard/hw/linux-perf/tools/perf/perf.c:448 #9 0x5c04956ee8e9 in main /home/howard/hw/linux-perf/tools/perf/perf.c:556 torvalds#10 0x79eb3622a3b7 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 torvalds#11 0x79eb3622a47a in __libc_start_main_impl ../csu/libc-start.c:360 torvalds#12 0x5c04955422d4 in _start (/home/howard/hw/linux-perf/tools/perf/perf+0x4e02d4) (BuildId: 5b6cab2d59e96a4341741765ad6914a4d784dbc6) 0.000 ( 0.014 ms): Chrome_ChildIO/117244 write(fd: 238, buf: !, count: 1) = 1 Fixes: 5e58fcf ("perf trace: Allow allocating sc->arg_fmt even without the syscall tracepoint") Signed-off-by: Howard Chu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 2c2ebb2 ] Fix the suspend/resume path by ensuring the rtnl lock is held where required. Calls to ravb_open, ravb_close and wol operations must be performed under the rtnl lock to prevent conflicts with ongoing ndo operations. Without this fix, the following warning is triggered: [ 39.032969] ============================= [ 39.032983] WARNING: suspicious RCU usage [ 39.033019] ----------------------------- [ 39.033033] drivers/net/phy/phy_device.c:2004 suspicious rcu_dereference_protected() usage! ... [ 39.033597] stack backtrace: [ 39.033613] CPU: 0 UID: 0 PID: 174 Comm: python3 Not tainted 6.13.0-rc7-next-20250116-arm64-renesas-00002-g35245dfdc62c #7 [ 39.033623] Hardware name: Renesas SMARC EVK version 2 based on r9a08g045s33 (DT) [ 39.033628] Call trace: [ 39.033633] show_stack+0x14/0x1c (C) [ 39.033652] dump_stack_lvl+0xb4/0xc4 [ 39.033664] dump_stack+0x14/0x1c [ 39.033671] lockdep_rcu_suspicious+0x16c/0x22c [ 39.033682] phy_detach+0x160/0x190 [ 39.033694] phy_disconnect+0x40/0x54 [ 39.033703] ravb_close+0x6c/0x1cc [ 39.033714] ravb_suspend+0x48/0x120 [ 39.033721] dpm_run_callback+0x4c/0x14c [ 39.033731] device_suspend+0x11c/0x4dc [ 39.033740] dpm_suspend+0xdc/0x214 [ 39.033748] dpm_suspend_start+0x48/0x60 [ 39.033758] suspend_devices_and_enter+0x124/0x574 [ 39.033769] pm_suspend+0x1ac/0x274 [ 39.033778] state_store+0x88/0x124 [ 39.033788] kobj_attr_store+0x14/0x24 [ 39.033798] sysfs_kf_write+0x48/0x6c [ 39.033808] kernfs_fop_write_iter+0x118/0x1a8 [ 39.033817] vfs_write+0x27c/0x378 [ 39.033825] ksys_write+0x64/0xf4 [ 39.033833] __arm64_sys_write+0x18/0x20 [ 39.033841] invoke_syscall+0x44/0x104 [ 39.033852] el0_svc_common.constprop.0+0xb4/0xd4 [ 39.033862] do_el0_svc+0x18/0x20 [ 39.033870] el0_svc+0x3c/0xf0 [ 39.033880] el0t_64_sync_handler+0xc0/0xc4 [ 39.033888] el0t_64_sync+0x154/0x158 [ 39.041274] ravb 11c30000.ethernet eth0: Link is Down Reported-by: Claudiu Beznea <[email protected]> Closes: https://lore.kernel.org/netdev/[email protected]/ Fixes: 0184165 ("ravb: add sleep PM suspend/resume support") Signed-off-by: Kory Maincent <[email protected]> Tested-by: Niklas Söderlund <[email protected]> Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit a216542 ] When COWing a relocation tree path, at relocation.c:replace_path(), we can trigger a lockdep splat while we are in the btrfs_search_slot() call against the relocation root. This happens in that callchain at ctree.c:read_block_for_search() when we happen to find a child extent buffer already loaded through the fs tree with a lockdep class set to the fs tree. So when we attempt to lock that extent buffer through a relocation tree we have to reset the lockdep class to the class for a relocation tree, since a relocation tree has extent buffers that used to belong to a fs tree and may currently be already loaded (we swap extent buffers between the two trees at the end of replace_path()). However we are missing calls to btrfs_maybe_reset_lockdep_class() to reset the lockdep class at ctree.c:read_block_for_search() before we read lock an extent buffer, just like we did for btrfs_search_slot() in commit b40130b ("btrfs: fix lockdep splat with reloc root extent buffers"). So add the missing btrfs_maybe_reset_lockdep_class() calls before the attempts to read lock an extent buffer at ctree.c:read_block_for_search(). The lockdep splat was reported by syzbot and it looks like this: ====================================================== WARNING: possible circular locking dependency detected 6.13.0-rc5-syzkaller-00163-gab75170520d4 #0 Not tainted ------------------------------------------------------ syz.0.0/5335 is trying to acquire lock: ffff8880545dbc38 (btrfs-tree-01){++++}-{4:4}, at: btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146 but task is already holding lock: ffff8880545dba58 (btrfs-treloc-02/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-treloc-02/1){+.+.}-{4:4}: reacquire_held_locks+0x3eb/0x690 kernel/locking/lockdep.c:5374 __lock_release kernel/locking/lockdep.c:5563 [inline] lock_release+0x396/0xa30 kernel/locking/lockdep.c:5870 up_write+0x79/0x590 kernel/locking/rwsem.c:1629 btrfs_force_cow_block+0x14b3/0x1fd0 fs/btrfs/ctree.c:660 btrfs_cow_block+0x371/0x830 fs/btrfs/ctree.c:755 btrfs_search_slot+0xc01/0x3180 fs/btrfs/ctree.c:2153 replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224 merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692 merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942 relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754 btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087 btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494 __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278 btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655 btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #1 (btrfs-tree-01/1){+.+.}-{4:4}: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 down_write_nested+0xa2/0x220 kernel/locking/rwsem.c:1693 btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189 btrfs_init_new_buffer fs/btrfs/extent-tree.c:5052 [inline] btrfs_alloc_tree_block+0x41c/0x1440 fs/btrfs/extent-tree.c:5132 btrfs_force_cow_block+0x526/0x1fd0 fs/btrfs/ctree.c:573 btrfs_cow_block+0x371/0x830 fs/btrfs/ctree.c:755 btrfs_search_slot+0xc01/0x3180 fs/btrfs/ctree.c:2153 btrfs_insert_empty_items+0x9c/0x1a0 fs/btrfs/ctree.c:4351 btrfs_insert_empty_item fs/btrfs/ctree.h:688 [inline] btrfs_insert_inode_ref+0x2bb/0xf80 fs/btrfs/inode-item.c:330 btrfs_rename_exchange fs/btrfs/inode.c:7990 [inline] btrfs_rename2+0xcb7/0x2b90 fs/btrfs/inode.c:8374 vfs_rename+0xbdb/0xf00 fs/namei.c:5067 do_renameat2+0xd94/0x13f0 fs/namei.c:5224 __do_sys_renameat2 fs/namei.c:5258 [inline] __se_sys_renameat2 fs/namei.c:5255 [inline] __x64_sys_renameat2+0xce/0xe0 fs/namei.c:5255 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #0 (btrfs-tree-01){++++}-{4:4}: check_prev_add kernel/locking/lockdep.c:3161 [inline] check_prevs_add kernel/locking/lockdep.c:3280 [inline] validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904 __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 down_read_nested+0xb5/0xa50 kernel/locking/rwsem.c:1649 btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146 btrfs_tree_read_lock fs/btrfs/locking.h:188 [inline] read_block_for_search+0x718/0xbb0 fs/btrfs/ctree.c:1610 btrfs_search_slot+0x1274/0x3180 fs/btrfs/ctree.c:2237 replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224 merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692 merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942 relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754 btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087 btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494 __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278 btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655 btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f other info that might help us debug this: Chain exists of: btrfs-tree-01 --> btrfs-tree-01/1 --> btrfs-treloc-02/1 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-treloc-02/1); lock(btrfs-tree-01/1); lock(btrfs-treloc-02/1); rlock(btrfs-tree-01); *** DEADLOCK *** 8 locks held by syz.0.0/5335: #0: ffff88801e3ae420 (sb_writers#13){.+.+}-{0:0}, at: mnt_want_write_file+0x5e/0x200 fs/namespace.c:559 #1: ffff888052c760d0 (&fs_info->reclaim_bgs_lock){+.+.}-{4:4}, at: __btrfs_balance+0x4c2/0x26b0 fs/btrfs/volumes.c:4183 #2: ffff888052c74850 (&fs_info->cleaner_mutex){+.+.}-{4:4}, at: btrfs_relocate_block_group+0x775/0xd90 fs/btrfs/relocation.c:4086 #3: ffff88801e3ae610 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xf11/0x1ad0 fs/btrfs/relocation.c:1659 #4: ffff888052c76470 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x405/0xda0 fs/btrfs/transaction.c:288 #5: ffff888052c76498 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x405/0xda0 fs/btrfs/transaction.c:288 #6: ffff8880545db878 (btrfs-tree-01/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189 #7: ffff8880545dba58 (btrfs-treloc-02/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189 stack backtrace: CPU: 0 UID: 0 PID: 5335 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller-00163-gab75170520d4 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_circular_bug+0x13a/0x1b0 kernel/locking/lockdep.c:2074 check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2206 check_prev_add kernel/locking/lockdep.c:3161 [inline] check_prevs_add kernel/locking/lockdep.c:3280 [inline] validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904 __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 down_read_nested+0xb5/0xa50 kernel/locking/rwsem.c:1649 btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146 btrfs_tree_read_lock fs/btrfs/locking.h:188 [inline] read_block_for_search+0x718/0xbb0 fs/btrfs/ctree.c:1610 btrfs_search_slot+0x1274/0x3180 fs/btrfs/ctree.c:2237 replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224 merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692 merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942 relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754 btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087 btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494 __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278 btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655 btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f1ac6985d29 Code: ff ff c3 (...) RSP: 002b:00007f1ac63fe038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f1ac6b76160 RCX: 00007f1ac6985d29 RDX: 0000000020000180 RSI: 00000000c4009420 RDI: 0000000000000007 RBP: 00007f1ac6a01b08 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000001 R14: 00007f1ac6b76160 R15: 00007fffda145a88 </TASK> Reported-by: [email protected] Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Fixes: 9978599 ("btrfs: reduce lock contention when eb cache miss for btree search") Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
We have several places across the kernel where we want to access another task's syscall arguments, such as ptrace(2), seccomp(2), etc., by making a call to syscall_get_arguments(). This works for register arguments right away by accessing the task's `regs' member of `struct pt_regs', however for stack arguments seen with 32-bit/o32 kernels things are more complicated. Technically they ought to be obtained from the user stack with calls to an access_remote_vm(), but we have an easier way available already. So as to be able to access syscall stack arguments as regular function arguments following the MIPS calling convention we copy them over from the user stack to the kernel stack in arch/mips/kernel/scall32-o32.S, in handle_sys(), to the current stack frame's outgoing argument space at the top of the stack, which is where the handler called expects to see its incoming arguments. This area is also pointed at by the `pt_regs' pointer obtained by task_pt_regs(). Make the o32 stack argument space a proper member of `struct pt_regs' then, by renaming the existing member from `pad0' to `args' and using generated offsets to access the space. No functional change though. With the change in place the o32 kernel stack frame layout at the entry to a syscall handler invoked by handle_sys() is therefore as follows: $sp + 68 -> | ... | <- pt_regs.regs[9] +---------------------+ $sp + 64 -> | $t0 | <- pt_regs.regs[8] +---------------------+ $sp + 60 -> | $a3/argument #4 | <- pt_regs.regs[7] +---------------------+ $sp + 56 -> | $a2/argument #3 | <- pt_regs.regs[6] +---------------------+ $sp + 52 -> | $a1/argument #2 | <- pt_regs.regs[5] +---------------------+ $sp + 48 -> | $a0/argument #1 | <- pt_regs.regs[4] +---------------------+ $sp + 44 -> | $v1 | <- pt_regs.regs[3] +---------------------+ $sp + 40 -> | $v0 | <- pt_regs.regs[2] +---------------------+ $sp + 36 -> | $at | <- pt_regs.regs[1] +---------------------+ $sp + 32 -> | $zero | <- pt_regs.regs[0] +---------------------+ $sp + 28 -> | stack argument #8 | <- pt_regs.args[7] +---------------------+ $sp + 24 -> | stack argument #7 | <- pt_regs.args[6] +---------------------+ $sp + 20 -> | stack argument #6 | <- pt_regs.args[5] +---------------------+ $sp + 16 -> | stack argument #5 | <- pt_regs.args[4] +---------------------+ $sp + 12 -> | psABI space for $a3 | <- pt_regs.args[3] +---------------------+ $sp + 8 -> | psABI space for $a2 | <- pt_regs.args[2] +---------------------+ $sp + 4 -> | psABI space for $a1 | <- pt_regs.args[1] +---------------------+ $sp + 0 -> | psABI space for $a0 | <- pt_regs.args[0] +---------------------+ holding user data received and with the first 4 frame slots reserved by the psABI for the compiler to spill the incoming arguments from $a0-$a3 registers (which it sometimes does according to its needs) and the next 4 frame slots designated by the psABI for any stack function arguments that follow. This data is also available for other tasks to peek/poke at as reqired and where permitted. Signed-off-by: Maciej W. Rozycki <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
This makes ptrace/get_syscall_info selftest pass on mips o32 and mips64 o32 by fixing the following two test assertions: 1. get_syscall_info test assertion on mips o32: # get_syscall_info.c:218:get_syscall_info:Expected exp_args[5] (3134521044) == info.entry.args[4] (4911432) # get_syscall_info.c:219:get_syscall_info:wait #1: entry stop mismatch 2. get_syscall_info test assertion on mips64 o32: # get_syscall_info.c:209:get_syscall_info:Expected exp_args[2] (3134324433) == info.entry.args[1] (18446744072548908753) # get_syscall_info.c:210:get_syscall_info:wait #1: entry stop mismatch The first assertion happens due to mips_get_syscall_arg() trying to access another task's context but failing to do it properly because get_user() it calls just peeks at the current task's context. It usually does not crash because the default user stack always gets assigned the same VMA, but it is pure luck which mips_get_syscall_arg() wouldn't have if e.g. the stack was switched (via setcontext(3) or however) or a non-default process's thread peeked at, and in any case irrelevant data is obtained just as observed with the test case. mips_get_syscall_arg() ought to be using access_remote_vm() instead to retrieve the other task's stack contents, but given that the data has been already obtained and saved in `struct pt_regs' it would be an overkill. The first assertion is fixed for mips o32 by using struct pt_regs.args instead of get_user() to obtain syscall arguments. This approach works due to this piece in arch/mips/kernel/scall32-o32.S: /* * Ok, copy the args from the luser stack to the kernel stack. */ .set push .set noreorder .set nomacro load_a4: user_lw(t5, 16(t0)) # argument #5 from usp load_a5: user_lw(t6, 20(t0)) # argument #6 from usp load_a6: user_lw(t7, 24(t0)) # argument #7 from usp load_a7: user_lw(t8, 28(t0)) # argument #8 from usp loads_done: sw t5, PT_ARG4(sp) # argument #5 to ksp sw t6, PT_ARG5(sp) # argument #6 to ksp sw t7, PT_ARG6(sp) # argument #7 to ksp sw t8, PT_ARG7(sp) # argument #8 to ksp .set pop .section __ex_table,"a" PTR_WD load_a4, bad_stack_a4 PTR_WD load_a5, bad_stack_a5 PTR_WD load_a6, bad_stack_a6 PTR_WD load_a7, bad_stack_a7 .previous arch/mips/kernel/scall64-o32.S has analogous code for mips64 o32 that allows fixing the issue by obtaining syscall arguments from struct pt_regs.regs[4..11] instead of the erroneous use of get_user(). The second assertion is fixed by truncating 64-bit values to 32-bit syscall arguments. Fixes: c0ff3c5 ("MIPS: Enable HAVE_ARCH_TRACEHOOK.") Signed-off-by: Dmitry V. Levin <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
[ Upstream commit 888751e ] perf test 11 hwmon fails on s390 with this error # ./perf test -Fv 11 --- start --- ---- end ---- 11.1: Basic parsing test : Ok --- start --- Testing 'temp_test_hwmon_event1' Using CPUID IBM,3931,704,A01,3.7,002f temp_test_hwmon_event1 -> hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/ FAILED tests/hwmon_pmu.c:189 Unexpected config for 'temp_test_hwmon_event1', 292470092988416 != 655361 ---- end ---- 11.2: Parsing without PMU name : FAILED! --- start --- Testing 'hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/' FAILED tests/hwmon_pmu.c:189 Unexpected config for 'hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/', 292470092988416 != 655361 ---- end ---- 11.3: Parsing with PMU name : FAILED! # The root cause is in member test_event::config which is initialized to 0xA0001 or 655361. During event parsing a long list event parsing functions are called and end up with this gdb call stack: #0 hwmon_pmu__config_term (hwm=0x168dfd0, attr=0x3ffffff5ee8, term=0x168db60, err=0x3ffffff81c8) at util/hwmon_pmu.c:623 #1 hwmon_pmu__config_terms (pmu=0x168dfd0, attr=0x3ffffff5ee8, terms=0x3ffffff5ea8, err=0x3ffffff81c8) at util/hwmon_pmu.c:662 #2 0x00000000012f870c in perf_pmu__config_terms (pmu=0x168dfd0, attr=0x3ffffff5ee8, terms=0x3ffffff5ea8, zero=false, apply_hardcoded=false, err=0x3ffffff81c8) at util/pmu.c:1519 #3 0x00000000012f88a4 in perf_pmu__config (pmu=0x168dfd0, attr=0x3ffffff5ee8, head_terms=0x3ffffff5ea8, apply_hardcoded=false, err=0x3ffffff81c8) at util/pmu.c:1545 #4 0x00000000012680c4 in parse_events_add_pmu (parse_state=0x3ffffff7fb8, list=0x168dc00, pmu=0x168dfd0, const_parsed_terms=0x3ffffff6090, auto_merge_stats=true, alternate_hw_config=10) at util/parse-events.c:1508 #5 0x00000000012684c6 in parse_events_multi_pmu_add (parse_state=0x3ffffff7fb8, event_name=0x168ec10 "temp_test_hwmon_event1", hw_config=10, const_parsed_terms=0x0, listp=0x3ffffff6230, loc_=0x3ffffff70e0) at util/parse-events.c:1592 #6 0x00000000012f0e4e in parse_events_parse (_parse_state=0x3ffffff7fb8, scanner=0x16878c0) at util/parse-events.y:293 #7 0x00000000012695a0 in parse_events__scanner (str=0x3ffffff81d8 "temp_test_hwmon_event1", input=0x0, parse_state=0x3ffffff7fb8) at util/parse-events.c:1867 #8 0x000000000126a1e8 in __parse_events (evlist=0x168b580, str=0x3ffffff81d8 "temp_test_hwmon_event1", pmu_filter=0x0, err=0x3ffffff81c8, fake_pmu=false, warn_if_reordered=true, fake_tp=false) at util/parse-events.c:2136 #9 0x00000000011e36aa in parse_events (evlist=0x168b580, str=0x3ffffff81d8 "temp_test_hwmon_event1", err=0x3ffffff81c8) at /root/linux/tools/perf/util/parse-events.h:41 torvalds#10 0x00000000011e3e64 in do_test (i=0, with_pmu=false, with_alias=false) at tests/hwmon_pmu.c:164 torvalds#11 0x00000000011e422c in test__hwmon_pmu (with_pmu=false) at tests/hwmon_pmu.c:219 torvalds#12 0x00000000011e431c in test__hwmon_pmu_without_pmu (test=0x1610368 <suite.hwmon_pmu>, subtest=1) at tests/hwmon_pmu.c:23 where the attr::config is set to value 292470092988416 or 0x10a0000000000 in line 625 of file ./util/hwmon_pmu.c: attr->config = key.type_and_num; However member key::type_and_num is defined as union and bit field: union hwmon_pmu_event_key { long type_and_num; struct { int num :16; enum hwmon_type type :8; }; }; s390 is big endian and Intel is little endian architecture. The events for the hwmon dummy pmu have num = 1 or num = 2 and type is set to HWMON_TYPE_TEMP (which is 10). On s390 this assignes member key::type_and_num the value of 0x10a0000000000 (which is 292470092988416) as shown in above trace output. Fix this and export the structure/union hwmon_pmu_event_key so the test shares the same implementation as the event parsing functions for union and bit fields. This should avoid endianess issues on all platforms. Output after: # ./perf test -F 11 11.1: Basic parsing test : Ok 11.2: Parsing without PMU name : Ok 11.3: Parsing with PMU name : Ok # Fixes: 531ee0f ("perf test: Add hwmon "PMU" test") Signed-off-by: Thomas Richter <[email protected]> Reviewed-by: Ian Rogers <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
commit d093c90 upstream. Currently, when no active threads are running, a root user using nfsdctl command can try to remove a particular listener from the list of previously added ones, then start the server by increasing the number of threads, it leads to the following problem: [ 158.835354] refcount_t: addition on 0; use-after-free. [ 158.835603] WARNING: CPU: 2 PID: 9145 at lib/refcount.c:25 refcount_warn_saturate+0x160/0x1a0 [ 158.836017] Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core nfsd auth_rpcgss nfs_acl lockd grace overlay isofs uinput snd_seq_dummy snd_hrtimer nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables qrtr sunrpc vfat fat uvcvideo videobuf2_vmalloc videobuf2_memops uvc videobuf2_v4l2 videodev videobuf2_common snd_hda_codec_generic mc e1000e snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore sg loop dm_multipath dm_mod nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vmw_vmci vsock xfs libcrc32c crct10dif_ce ghash_ce vmwgfx sha2_ce sha256_arm64 sr_mod sha1_ce cdrom nvme drm_client_lib drm_ttm_helper ttm nvme_core drm_kms_helper nvme_auth drm fuse [ 158.840093] CPU: 2 UID: 0 PID: 9145 Comm: nfsd Kdump: loaded Tainted: G B W 6.13.0-rc6+ #7 [ 158.840624] Tainted: [B]=BAD_PAGE, [W]=WARN [ 158.840802] Hardware name: VMware, Inc. VMware20,1/VBSA, BIOS VMW201.00V.24006586.BA64.2406042154 06/04/2024 [ 158.841220] pstate: 6140000 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) [ 158.841563] pc : refcount_warn_saturate+0x160/0x1a0 [ 158.841780] lr : refcount_warn_saturate+0x160/0x1a0 [ 158.842000] sp : ffff800089be7d80 [ 158.842147] x29: ffff800089be7d80 x28: ffff00008e68c148 x27: ffff00008e68c148 [ 158.842492] x26: ffff0002e3b5c000 x25: ffff600011cd1829 x24: ffff00008653c010 [ 158.842832] x23: ffff00008653c000 x22: 1fffe00011cd1829 x21: ffff00008653c028 [ 158.843175] x20: 0000000000000002 x19: ffff00008653c010 x18: 0000000000000000 [ 158.843505] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 [ 158.843836] x14: 0000000000000000 x13: 0000000000000001 x12: ffff600050a26493 [ 158.844143] x11: 1fffe00050a26492 x10: ffff600050a26492 x9 : dfff800000000000 [ 158.844475] x8 : 00009fffaf5d9b6e x7 : ffff000285132493 x6 : 0000000000000001 [ 158.844823] x5 : ffff000285132490 x4 : ffff600050a26493 x3 : ffff8000805e72bc [ 158.845174] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000098588000 [ 158.845528] Call trace: [ 158.845658] refcount_warn_saturate+0x160/0x1a0 (P) [ 158.845894] svc_recv+0x58c/0x680 [sunrpc] [ 158.846183] nfsd+0x1fc/0x348 [nfsd] [ 158.846390] kthread+0x274/0x2f8 [ 158.846546] ret_from_fork+0x10/0x20 [ 158.846714] ---[ end trace 0000000000000000 ]--- nfsd_nl_listener_set_doit() would manipulate the list of transports of server's sv_permsocks and close the specified listener but the other list of transports (server's sp_xprts list) would not be changed leading to the problem above. Instead, determined if the nfsdctl is trying to remove a listener, in which case, delete all the existing listener transports and re-create all-but-the-removed ones. Fixes: 16a4711 ("NFSD: add listener-{set,get} netlink command") Signed-off-by: Olga Kornievskaia <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Cc: [email protected] Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Without CONFIG_DRM_XE_GPUSVM set, GPU SVM is not initialized thus below warning pops. Refine the flush work code to be controlled by the config to avoid below warning: " [ 453.132028] ------------[ cut here ]------------ [ 453.132527] WARNING: CPU: 9 PID: 4491 at kernel/workqueue.c:4205 __flush_work+0x379/0x3a0 [ 453.133355] Modules linked in: xe drm_ttm_helper ttm gpu_sched drm_buddy drm_suballoc_helper drm_gpuvm drm_exec [ 453.134352] CPU: 9 UID: 0 PID: 4491 Comm: xe_exec_mix_mod Tainted: G U W 6.15.0-rc3+ #7 PREEMPT(full) [ 453.135405] Tainted: [U]=USER, [W]=WARN ... [ 453.136921] RIP: 0010:__flush_work+0x379/0x3a0 [ 453.137417] Code: 8b 45 00 48 8b 55 08 89 c7 48 c1 e8 04 83 e7 08 83 e0 0f 83 cf 02 89 c6 48 0f ba 6d 00 03 e9 d5 fe ff ff 0f 0b e9 db fd ff ff <0f> 0b 45 31 e4 e9 d1 fd ff ff 0f 0b e9 03 ff ff ff 0f 0b e9 d6 fe [ 453.139250] RSP: 0018:ffffc90000c67b18 EFLAGS: 00010246 [ 453.139782] RAX: 0000000000000000 RBX: ffff888108a24000 RCX: 0000000000002000 [ 453.140521] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8881016d61c8 [ 453.141253] RBP: ffff8881016d61c8 R08: 0000000000000000 R09: 0000000000000000 [ 453.141985] R10: 0000000000000000 R11: 0000000008a24000 R12: 0000000000000001 [ 453.142709] R13: 0000000000000002 R14: 0000000000000000 R15: ffff888107db8c00 [ 453.143450] FS: 00007f44853d4c80(0000) GS:ffff8882f469b000(0000) knlGS:0000000000000000 [ 453.144276] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 453.144853] CR2: 00007f4487629228 CR3: 00000001016aa000 CR4: 00000000000406f0 [ 453.145594] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 453.146320] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 453.147061] Call Trace: [ 453.147336] <TASK> [ 453.147579] ? tick_nohz_tick_stopped+0xd/0x30 [ 453.148067] ? xas_load+0x9/0xb0 [ 453.148435] ? xa_load+0x6f/0xb0 [ 453.148781] __xe_vm_bind_ioctl+0xbd5/0x1500 [xe] [ 453.149338] ? dev_printk_emit+0x48/0x70 [ 453.149762] ? _dev_printk+0x57/0x80 [ 453.150148] ? drm_ioctl+0x17c/0x440 [ 453.150544] ? __drm_dev_vprintk+0x36/0x90 [ 453.150983] ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe] [ 453.151575] ? drm_ioctl_kernel+0x9f/0xf0 [ 453.151998] ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe] [ 453.152560] drm_ioctl_kernel+0x9f/0xf0 [ 453.152968] drm_ioctl+0x20f/0x440 [ 453.153332] ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe] [ 453.153893] ? ioctl_has_perm.constprop.0.isra.0+0xae/0x100 [ 453.154489] ? memory_bm_test_bit+0x5/0x60 [ 453.154935] xe_drm_ioctl+0x47/0x70 [xe] [ 453.155419] __x64_sys_ioctl+0x8d/0xc0 [ 453.155824] do_syscall_64+0x47/0x110 [ 453.156228] entry_SYSCALL_64_after_hwframe+0x76/0x7e " v2 (Matt): refine commit message to have more details add Fixes tag move the code to xe_svm.h which already have the config remove a blank line per codestyle suggestion Fixes: 63f6e48 ("drm/xe: Add SVM garbage collector") Cc: Matthew Brost <[email protected]> Signed-off-by: Shuicheng Lin <[email protected]> Reviewed-by: Matthew Brost <[email protected]> Signed-off-by: Matthew Brost <[email protected]> Link: https://lore.kernel.org/r/[email protected] (cherry picked from commit 9d80698) Signed-off-by: Lucas De Marchi <[email protected]>
[ Upstream commit ee684de ] As shown in [1], it is possible to corrupt a BPF ELF file such that arbitrary BPF instructions are loaded by libbpf. This can be done by setting a symbol (BPF program) section offset to a large (unsigned) number such that <section start + symbol offset> overflows and points before the section data in the memory. Consider the situation below where: - prog_start = sec_start + symbol_offset <-- size_t overflow here - prog_end = prog_start + prog_size prog_start sec_start prog_end sec_end | | | | v v v v .....................|################################|............ The report in [1] also provides a corrupted BPF ELF which can be used as a reproducer: $ readelf -S crash Section Headers: [Nr] Name Type Address Offset Size EntSize Flags Link Info Align ... [ 2] uretprobe.mu[...] PROGBITS 0000000000000000 00000040 0000000000000068 0000000000000000 AX 0 0 8 $ readelf -s crash Symbol table '.symtab' contains 8 entries: Num: Value Size Type Bind Vis Ndx Name ... 6: ffffffffffffffb8 104 FUNC GLOBAL DEFAULT 2 handle_tp Here, the handle_tp prog has section offset ffffffffffffffb8, i.e. will point before the actual memory where section 2 is allocated. This is also reported by AddressSanitizer: ================================================================= ==1232==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7c7302fe0000 at pc 0x7fc3046e4b77 bp 0x7ffe64677cd0 sp 0x7ffe64677490 READ of size 104 at 0x7c7302fe0000 thread T0 #0 0x7fc3046e4b76 in memcpy (/lib64/libasan.so.8+0xe4b76) #1 0x00000040df3e in bpf_object__init_prog /src/libbpf/src/libbpf.c:856 #2 0x00000040df3e in bpf_object__add_programs /src/libbpf/src/libbpf.c:928 #3 0x00000040df3e in bpf_object__elf_collect /src/libbpf/src/libbpf.c:3930 #4 0x00000040df3e in bpf_object_open /src/libbpf/src/libbpf.c:8067 #5 0x00000040f176 in bpf_object__open_file /src/libbpf/src/libbpf.c:8090 #6 0x000000400c16 in main /poc/poc.c:8 #7 0x7fc3043d25b4 in __libc_start_call_main (/lib64/libc.so.6+0x35b4) #8 0x7fc3043d2667 in __libc_start_main@@GLIBC_2.34 (/lib64/libc.so.6+0x3667) #9 0x000000400b34 in _start (/poc/poc+0x400b34) 0x7c7302fe0000 is located 64 bytes before 104-byte region [0x7c7302fe0040,0x7c7302fe00a8) allocated by thread T0 here: #0 0x7fc3046e716b in malloc (/lib64/libasan.so.8+0xe716b) #1 0x7fc3045ee600 in __libelf_set_rawdata_wrlock (/lib64/libelf.so.1+0xb600) #2 0x7fc3045ef018 in __elf_getdata_rdlock (/lib64/libelf.so.1+0xc018) #3 0x00000040642f in elf_sec_data /src/libbpf/src/libbpf.c:3740 The problem here is that currently, libbpf only checks that the program end is within the section bounds. There used to be a check `while (sec_off < sec_sz)` in bpf_object__add_programs, however, it was removed by commit 6245947 ("libbpf: Allow gaps in BPF program sections to support overriden weak functions"). Add a check for detecting the overflow of `sec_off + prog_sz` to bpf_object__init_prog to fix this issue. [1] https://github.com/lmarch2/poc/blob/main/libbpf/libbpf.md Fixes: 6245947 ("libbpf: Allow gaps in BPF program sections to support overriden weak functions") Reported-by: lmarch2 <[email protected]> Signed-off-by: Viktor Malik <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Reviewed-by: Shung-Hsi Yu <[email protected]> Link: https://github.com/lmarch2/poc/blob/main/libbpf/libbpf.md Link: https://lore.kernel.org/bpf/[email protected] Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit eedf3e3 ] ACPICA commit 1c28da2242783579d59767617121035dafba18c3 This was originally done in NetBSD: NetBSD/src@b69d1ac and is the correct alternative to the smattering of `memcpy`s I previously contributed to this repository. This also sidesteps the newly strict checks added in UBSAN: llvm/llvm-project@7926744 Before this change we see the following UBSAN stack trace in Fuchsia: #0 0x000021afcfdeca5e in acpi_rs_get_address_common(struct acpi_resource*, union aml_resource*) ../../third_party/acpica/source/components/resources/rsaddr.c:329 <platform-bus-x86.so>+0x6aca5e #1.2 0x000021982bc4af3c in ubsan_get_stack_trace() compiler-rt/lib/ubsan/ubsan_diag.cpp:41 <libclang_rt.asan.so>+0x41f3c #1.1 0x000021982bc4af3c in maybe_print_stack_trace() compiler-rt/lib/ubsan/ubsan_diag.cpp:51 <libclang_rt.asan.so>+0x41f3c #1 0x000021982bc4af3c in ~scoped_report() compiler-rt/lib/ubsan/ubsan_diag.cpp:395 <libclang_rt.asan.so>+0x41f3c #2 0x000021982bc4bb6f in handletype_mismatch_impl() compiler-rt/lib/ubsan/ubsan_handlers.cpp:137 <libclang_rt.asan.so>+0x42b6f #3 0x000021982bc4b723 in __ubsan_handle_type_mismatch_v1 compiler-rt/lib/ubsan/ubsan_handlers.cpp:142 <libclang_rt.asan.so>+0x42723 #4 0x000021afcfdeca5e in acpi_rs_get_address_common(struct acpi_resource*, union aml_resource*) ../../third_party/acpica/source/components/resources/rsaddr.c:329 <platform-bus-x86.so>+0x6aca5e #5 0x000021afcfdf2089 in acpi_rs_convert_aml_to_resource(struct acpi_resource*, union aml_resource*, struct acpi_rsconvert_info*) ../../third_party/acpica/source/components/resources/rsmisc.c:355 <platform-bus-x86.so>+0x6b2089 #6 0x000021afcfded169 in acpi_rs_convert_aml_to_resources(u8*, u32, u32, u8, void**) ../../third_party/acpica/source/components/resources/rslist.c:137 <platform-bus-x86.so>+0x6ad169 #7 0x000021afcfe2d24a in acpi_ut_walk_aml_resources(struct acpi_walk_state*, u8*, acpi_size, acpi_walk_aml_callback, void**) ../../third_party/acpica/source/components/utilities/utresrc.c:237 <platform-bus-x86.so>+0x6ed24a #8 0x000021afcfde66b7 in acpi_rs_create_resource_list(union acpi_operand_object*, struct acpi_buffer*) ../../third_party/acpica/source/components/resources/rscreate.c:199 <platform-bus-x86.so>+0x6a66b7 #9 0x000021afcfdf6979 in acpi_rs_get_method_data(acpi_handle, const char*, struct acpi_buffer*) ../../third_party/acpica/source/components/resources/rsutils.c:770 <platform-bus-x86.so>+0x6b6979 torvalds#10 0x000021afcfdf708f in acpi_walk_resources(acpi_handle, char*, acpi_walk_resource_callback, void*) ../../third_party/acpica/source/components/resources/rsxface.c:731 <platform-bus-x86.so>+0x6b708f torvalds#11 0x000021afcfa95dcf in acpi::acpi_impl::walk_resources(acpi::acpi_impl*, acpi_handle, const char*, acpi::Acpi::resources_callable) ../../src/devices/board/lib/acpi/acpi-impl.cc:41 <platform-bus-x86.so>+0x355dcf torvalds#12 0x000021afcfaa8278 in acpi::device_builder::gather_resources(acpi::device_builder*, acpi::Acpi*, fidl::any_arena&, acpi::Manager*, acpi::device_builder::gather_resources_callback) ../../src/devices/board/lib/acpi/device-builder.cc:84 <platform-bus-x86.so>+0x368278 torvalds#13 0x000021afcfbddb87 in acpi::Manager::configure_discovered_devices(acpi::Manager*) ../../src/devices/board/lib/acpi/manager.cc:75 <platform-bus-x86.so>+0x49db87 torvalds#14 0x000021afcf99091d in publish_acpi_devices(acpi::Manager*, zx_device_t*, zx_device_t*) ../../src/devices/board/drivers/x86/acpi-nswalk.cc:95 <platform-bus-x86.so>+0x25091d torvalds#15 0x000021afcf9c1d4e in x86::X86::do_init(x86::X86*) ../../src/devices/board/drivers/x86/x86.cc:60 <platform-bus-x86.so>+0x281d4e torvalds#16 0x000021afcf9e33ad in λ(x86::X86::ddk_init::(anon class)*) ../../src/devices/board/drivers/x86/x86.cc:77 <platform-bus-x86.so>+0x2a33ad torvalds#17 0x000021afcf9e313e in fit::internal::target<(lambda at../../src/devices/board/drivers/x86/x86.cc:76:19), false, false, std::__2::allocator<std::byte>, void>::invoke(void*) ../../sdk/lib/fit/include/lib/fit/internal/function.h:183 <platform-bus-x86.so>+0x2a313e torvalds#18 0x000021afcfbab4c7 in fit::internal::function_base<16UL, false, void(), std::__2::allocator<std::byte>>::invoke(const fit::internal::function_base<16UL, false, void (), std::__2::allocator<std::byte> >*) ../../sdk/lib/fit/include/lib/fit/internal/function.h:522 <platform-bus-x86.so>+0x46b4c7 torvalds#19 0x000021afcfbab342 in fit::function_impl<16UL, false, void(), std::__2::allocator<std::byte>>::operator()(const fit::function_impl<16UL, false, void (), std::__2::allocator<std::byte> >*) ../../sdk/lib/fit/include/lib/fit/function.h:315 <platform-bus-x86.so>+0x46b342 torvalds#20 0x000021afcfcd98c3 in async::internal::retained_task::Handler(async_dispatcher_t*, async_task_t*, zx_status_t) ../../sdk/lib/async/task.cc:24 <platform-bus-x86.so>+0x5998c3 torvalds#21 0x00002290f9924616 in λ(const driver_runtime::Dispatcher::post_task::(anon class)*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, zx_status_t) ../../src/devices/bin/driver_runtime/dispatcher.cc:789 <libdriver_runtime.so>+0x10a616 torvalds#22 0x00002290f9924323 in fit::internal::target<(lambda at../../src/devices/bin/driver_runtime/dispatcher.cc:788:7), true, false, std::__2::allocator<std::byte>, void, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request>>, int>::invoke(void*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, int) ../../sdk/lib/fit/include/lib/fit/internal/function.h:128 <libdriver_runtime.so>+0x10a323 torvalds#23 0x00002290f9904b76 in fit::internal::function_base<24UL, true, void(std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request>>, int), std::__2::allocator<std::byte>>::invoke(const fit::internal::function_base<24UL, true, void (std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, int), std::__2::allocator<std::byte> >*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, int) ../../sdk/lib/fit/include/lib/fit/internal/function.h:522 <libdriver_runtime.so>+0xeab76 torvalds#24 0x00002290f9904831 in fit::callback_impl<24UL, true, void(std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request>>, int), std::__2::allocator<std::byte>>::operator()(fit::callback_impl<24UL, true, void (std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, int), std::__2::allocator<std::byte> >*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, int) ../../sdk/lib/fit/include/lib/fit/function.h:471 <libdriver_runtime.so>+0xea831 torvalds#25 0x00002290f98d5adc in driver_runtime::callback_request::Call(driver_runtime::callback_request*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >, zx_status_t) ../../src/devices/bin/driver_runtime/callback_request.h:74 <libdriver_runtime.so>+0xbbadc torvalds#26 0x00002290f98e1e58 in driver_runtime::Dispatcher::dispatch_callback(driver_runtime::Dispatcher*, std::__2::unique_ptr<driver_runtime::callback_request, std::__2::default_delete<driver_runtime::callback_request> >) ../../src/devices/bin/driver_runtime/dispatcher.cc:1248 <libdriver_runtime.so>+0xc7e58 torvalds#27 0x00002290f98e4159 in driver_runtime::Dispatcher::dispatch_callbacks(driver_runtime::Dispatcher*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../src/devices/bin/driver_runtime/dispatcher.cc:1308 <libdriver_runtime.so>+0xca159 torvalds#28 0x00002290f9918414 in λ(const driver_runtime::Dispatcher::create_with_adder::(anon class)*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../src/devices/bin/driver_runtime/dispatcher.cc:353 <libdriver_runtime.so>+0xfe414 torvalds#29 0x00002290f991812d in fit::internal::target<(lambda at../../src/devices/bin/driver_runtime/dispatcher.cc:351:7), true, false, std::__2::allocator<std::byte>, void, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter>>, fbl::ref_ptr<driver_runtime::Dispatcher>>::invoke(void*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../sdk/lib/fit/include/lib/fit/internal/function.h:128 <libdriver_runtime.so>+0xfe12d torvalds#30 0x00002290f9906fc7 in fit::internal::function_base<8UL, true, void(std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter>>, fbl::ref_ptr<driver_runtime::Dispatcher>), std::__2::allocator<std::byte>>::invoke(const fit::internal::function_base<8UL, true, void (std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>), std::__2::allocator<std::byte> >*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../sdk/lib/fit/include/lib/fit/internal/function.h:522 <libdriver_runtime.so>+0xecfc7 torvalds#31 0x00002290f9906c66 in fit::function_impl<8UL, true, void(std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter>>, fbl::ref_ptr<driver_runtime::Dispatcher>), std::__2::allocator<std::byte>>::operator()(const fit::function_impl<8UL, true, void (std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>), std::__2::allocator<std::byte> >*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../sdk/lib/fit/include/lib/fit/function.h:315 <libdriver_runtime.so>+0xecc66 torvalds#32 0x00002290f98e73d9 in driver_runtime::Dispatcher::event_waiter::invoke_callback(driver_runtime::Dispatcher::event_waiter*, std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, fbl::ref_ptr<driver_runtime::Dispatcher>) ../../src/devices/bin/driver_runtime/dispatcher.h:543 <libdriver_runtime.so>+0xcd3d9 torvalds#33 0x00002290f98e700d in driver_runtime::Dispatcher::event_waiter::handle_event(std::__2::unique_ptr<driver_runtime::Dispatcher::event_waiter, std::__2::default_delete<driver_runtime::Dispatcher::event_waiter> >, async_dispatcher_t*, async::wait_base*, zx_status_t, zx_packet_signal_t const*) ../../src/devices/bin/driver_runtime/dispatcher.cc:1442 <libdriver_runtime.so>+0xcd00d torvalds#34 0x00002290f9918983 in async_loop_owned_event_handler<driver_runtime::Dispatcher::event_waiter>::handle_event(async_loop_owned_event_handler<driver_runtime::Dispatcher::event_waiter>*, async_dispatcher_t*, async::wait_base*, zx_status_t, zx_packet_signal_t const*) ../../src/devices/bin/driver_runtime/async_loop_owned_event_handler.h:59 <libdriver_runtime.so>+0xfe983 torvalds#35 0x00002290f9918b9e in async::wait_method<async_loop_owned_event_handler<driver_runtime::Dispatcher::event_waiter>, &async_loop_owned_event_handler<driver_runtime::Dispatcher::event_waiter>::handle_event>::call_handler(async_dispatcher_t*, async_wait_t*, zx_status_t, zx_packet_signal_t const*) ../../sdk/lib/async/include/lib/async/cpp/wait.h:201 <libdriver_runtime.so>+0xfeb9e torvalds#36 0x00002290f99bf509 in async_loop_dispatch_wait(async_loop_t*, async_wait_t*, zx_status_t, zx_packet_signal_t const*) ../../sdk/lib/async-loop/loop.c:394 <libdriver_runtime.so>+0x1a5509 torvalds#37 0x00002290f99b9958 in async_loop_run_once(async_loop_t*, zx_time_t) ../../sdk/lib/async-loop/loop.c:343 <libdriver_runtime.so>+0x19f958 torvalds#38 0x00002290f99b9247 in async_loop_run(async_loop_t*, zx_time_t, _Bool) ../../sdk/lib/async-loop/loop.c:301 <libdriver_runtime.so>+0x19f247 torvalds#39 0x00002290f99ba962 in async_loop_run_thread(void*) ../../sdk/lib/async-loop/loop.c:860 <libdriver_runtime.so>+0x1a0962 torvalds#40 0x000041afd176ef30 in start_c11(void*) ../../zircon/third_party/ulib/musl/pthread/pthread_create.c:63 <libc.so>+0x84f30 torvalds#41 0x000041afd18a448d in thread_trampoline(uintptr_t, uintptr_t) ../../zircon/system/ulib/runtime/thread.cc:100 <libc.so>+0x1ba48d Link: acpica/acpica@1c28da22 Signed-off-by: Rafael J. Wysocki <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Tamir Duberstein <[email protected]> [ rjw: Pick up the tag from Tamir ] Signed-off-by: Rafael J. Wysocki <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
A crash in conntrack was reported while trying to unlink the conntrack entry from the hash bucket list: [exception RIP: __nf_ct_delete_from_lists+172] [..] #7 [ff539b5a2b043aa0] nf_ct_delete at ffffffffc124d421 [nf_conntrack] #8 [ff539b5a2b043ad0] nf_ct_gc_expired at ffffffffc124d999 [nf_conntrack] #9 [ff539b5a2b043ae0] __nf_conntrack_find_get at ffffffffc124efbc [nf_conntrack] [..] The nf_conn struct is marked as allocated from slab but appears to be in a partially initialised state: ct hlist pointer is garbage; looks like the ct hash value (hence crash). ct->status is equal to IPS_CONFIRMED|IPS_DYING, which is expected ct->timeout is 30000 (=30s), which is unexpected. Everything else looks like normal udp conntrack entry. If we ignore ct->status and pretend its 0, the entry matches those that are newly allocated but not yet inserted into the hash: - ct hlist pointers are overloaded and store/cache the raw tuple hash - ct->timeout matches the relative time expected for a new udp flow rather than the absolute 'jiffies' value. If it were not for the presence of IPS_CONFIRMED, __nf_conntrack_find_get() would have skipped the entry. Theory is that we did hit following race: cpu x cpu y cpu z found entry E found entry E E is expired <preemption> nf_ct_delete() return E to rcu slab init_conntrack E is re-inited, ct->status set to 0 reply tuplehash hnnode.pprev stores hash value. cpu y found E right before it was deleted on cpu x. E is now re-inited on cpu z. cpu y was preempted before checking for expiry and/or confirm bit. ->refcnt set to 1 E now owned by skb ->timeout set to 30000 If cpu y were to resume now, it would observe E as expired but would skip E due to missing CONFIRMED bit. nf_conntrack_confirm gets called sets: ct->status |= CONFIRMED This is wrong: E is not yet added to hashtable. cpu y resumes, it observes E as expired but CONFIRMED: <resumes> nf_ct_expired() -> yes (ct->timeout is 30s) confirmed bit set. cpu y will try to delete E from the hashtable: nf_ct_delete() -> set DYING bit __nf_ct_delete_from_lists Even this scenario doesn't guarantee a crash: cpu z still holds the table bucket lock(s) so y blocks: wait for spinlock held by z CONFIRMED is set but there is no guarantee ct will be added to hash: "chaintoolong" or "clash resolution" logic both skip the insert step. reply hnnode.pprev still stores the hash value. unlocks spinlock return NF_DROP <unblocks, then crashes on hlist_nulls_del_rcu pprev> In case CPU z does insert the entry into the hashtable, cpu y will unlink E again right away but no crash occurs. Without 'cpu y' race, 'garbage' hlist is of no consequence: ct refcnt remains at 1, eventually skb will be free'd and E gets destroyed via: nf_conntrack_put -> nf_conntrack_destroy -> nf_ct_destroy. To resolve this, move the IPS_CONFIRMED assignment after the table insertion but before the unlock. Pablo points out that the confirm-bit-store could be reordered to happen before hlist add resp. the timeout fixup, so switch to set_bit and before_atomic memory barrier to prevent this. It doesn't matter if other CPUs can observe a newly inserted entry right before the CONFIRMED bit was set: Such event cannot be distinguished from above "E is the old incarnation" case: the entry will be skipped. Also change nf_ct_should_gc() to first check the confirmed bit. The gc sequence is: 1. Check if entry has expired, if not skip to next entry 2. Obtain a reference to the expired entry. 3. Call nf_ct_should_gc() to double-check step 1. nf_ct_should_gc() is thus called only for entries that already failed an expiry check. After this patch, once the confirmed bit check passes ct->timeout has been altered to reflect the absolute 'best before' date instead of a relative time. Step 3 will therefore not remove the entry. Without this change to nf_ct_should_gc() we could still get this sequence: 1. Check if entry has expired. 2. Obtain a reference. 3. Call nf_ct_should_gc() to double-check step 1: 4 - entry is still observed as expired 5 - meanwhile, ct->timeout is corrected to absolute value on other CPU and confirm bit gets set 6 - confirm bit is seen 7 - valid entry is removed again First do check 6), then 4) so the gc expiry check always picks up either confirmed bit unset (entry gets skipped) or expiry re-check failure for re-inited conntrack objects. This change cannot be backported to releases before 5.19. Without commit 8a75a2c ("netfilter: conntrack: remove unconfirmed list") |= IPS_CONFIRMED line cannot be moved without further changes. Cc: Razvan Cojocaru <[email protected]> Link: https://lore.kernel.org/netfilter-devel/[email protected]/ Link: https://lore.kernel.org/netfilter-devel/[email protected]/ Fixes: 1397af5 ("netfilter: conntrack: remove the percpu dying list") Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
pert script tests fails with segmentation fault as below: 92: perf script tests: --- start --- test child forked, pid 103769 DB test [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.012 MB /tmp/perf-test-script.7rbftEpOzX/perf.data (9 samples) ] /usr/libexec/perf-core/tests/shell/script.sh: line 35: 103780 Segmentation fault (core dumped) perf script -i "${perfdatafile}" -s "${db_test}" --- Cleaning up --- ---- end(-1) ---- 92: perf script tests : FAILED! Backtrace pointed to : #0 0x0000000010247dd0 in maps.machine () #1 0x00000000101d178c in db_export.sample () #2 0x00000000103412c8 in python_process_event () #3 0x000000001004eb28 in process_sample_event () #4 0x000000001024fcd0 in machines.deliver_event () #5 0x000000001025005c in perf_session.deliver_event () #6 0x00000000102568b0 in __ordered_events__flush.part.0 () #7 0x0000000010251618 in perf_session.process_events () #8 0x0000000010053620 in cmd_script () #9 0x00000000100b5a28 in run_builtin () torvalds#10 0x00000000100b5f94 in handle_internal_command () torvalds#11 0x0000000010011114 in main () Further investigation reveals that this occurs in the `perf script tests`, because it uses `db_test.py` script. This script sets `perf_db_export_mode = True`. With `perf_db_export_mode` enabled, if a sample originates from a hypervisor, perf doesn't set maps for "[H]" sample in the code. Consequently, `al->maps` remains NULL when `maps__machine(al->maps)` is called from `db_export__sample`. As al->maps can be NULL in case of Hypervisor samples , use thread->maps because even for Hypervisor sample, machine should exist. If we don't have machine for some reason, return -1 to avoid segmentation fault. Reported-by: Disha Goel <[email protected]> Signed-off-by: Aditya Bodkhe <[email protected]> Reviewed-by: Adrian Hunter <[email protected]> Tested-by: Disha Goel <[email protected]> Link: https://lore.kernel.org/r/[email protected] Suggested-by: Adrian Hunter <[email protected]> Signed-off-by: Namhyung Kim <[email protected]>
Without the change `perf `hangs up on charaster devices. On my system it's enough to run system-wide sampler for a few seconds to get the hangup: $ perf record -a -g --call-graph=dwarf $ perf report # hung `strace` shows that hangup happens on reading on a character device `/dev/dri/renderD128` $ strace -y -f -p 2780484 strace: Process 2780484 attached pread64(101</dev/dri/renderD128>, strace: Process 2780484 detached It's call trace descends into `elfutils`: $ gdb -p 2780484 (gdb) bt #0 0x00007f5e508f04b7 in __libc_pread64 (fd=101, buf=0x7fff9df7edb0, count=0, offset=0) at ../sysdeps/unix/sysv/linux/pread64.c:25 #1 0x00007f5e52b79515 in read_file () from /<<NIX>>/elfutils-0.192/lib/libelf.so.1 #2 0x00007f5e52b25666 in libdw_open_elf () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 #3 0x00007f5e52b25907 in __libdw_open_file () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 #4 0x00007f5e52b120a9 in dwfl_report_elf@@ELFUTILS_0.156 () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 #5 0x000000000068bf20 in __report_module (al=al@entry=0x7fff9df80010, ip=ip@entry=139803237033216, ui=ui@entry=0x5369b5e0) at util/dso.h:537 #6 0x000000000068c3d1 in report_module (ip=139803237033216, ui=0x5369b5e0) at util/unwind-libdw.c:114 #7 frame_callback (state=0x535aef10, arg=0x5369b5e0) at util/unwind-libdw.c:242 #8 0x00007f5e52b261d3 in dwfl_thread_getframes () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 #9 0x00007f5e52b25bdb in get_one_thread_cb () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 torvalds#10 0x00007f5e52b25faa in dwfl_getthreads () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 torvalds#11 0x00007f5e52b26514 in dwfl_getthread_frames () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 torvalds#12 0x000000000068c6ce in unwind__get_entries (cb=cb@entry=0x5d4620 <unwind_entry>, arg=arg@entry=0x10cd5fa0, thread=thread@entry=0x1076a290, data=data@entry=0x7fff9df80540, max_stack=max_stack@entry=127, best_effort=best_effort@entry=false) at util/thread.h:152 torvalds#13 0x00000000005dae95 in thread__resolve_callchain_unwind (evsel=0x106006d0, thread=0x1076a290, cursor=0x10cd5fa0, sample=0x7fff9df80540, max_stack=127, symbols=true) at util/machine.c:2939 torvalds#14 thread__resolve_callchain_unwind (thread=0x1076a290, cursor=0x10cd5fa0, evsel=0x106006d0, sample=0x7fff9df80540, max_stack=127, symbols=true) at util/machine.c:2920 torvalds#15 __thread__resolve_callchain (thread=0x1076a290, cursor=0x10cd5fa0, evsel=0x106006d0, evsel@entry=0x7fff9df80440, sample=0x7fff9df80540, parent=parent@entry=0x7fff9df804a0, root_al=root_al@entry=0x7fff9df80440, max_stack=127, symbols=true) at util/machine.c:2970 torvalds#16 0x00000000005d0cb2 in thread__resolve_callchain (thread=<optimized out>, cursor=<optimized out>, evsel=0x7fff9df80440, sample=<optimized out>, parent=0x7fff9df804a0, root_al=0x7fff9df80440, max_stack=127) at util/machine.h:198 torvalds#17 sample__resolve_callchain (sample=<optimized out>, cursor=<optimized out>, parent=parent@entry=0x7fff9df804a0, evsel=evsel@entry=0x106006d0, al=al@entry=0x7fff9df80440, max_stack=max_stack@entry=127) at util/callchain.c:1127 torvalds#18 0x0000000000617e08 in hist_entry_iter__add (iter=iter@entry=0x7fff9df80480, al=al@entry=0x7fff9df80440, max_stack_depth=127, arg=arg@entry=0x7fff9df81ae0) at util/hist.c:1255 torvalds#19 0x000000000045d2d0 in process_sample_event (tool=0x7fff9df81ae0, event=<optimized out>, sample=0x7fff9df80540, evsel=0x106006d0, machine=<optimized out>) at builtin-report.c:334 torvalds#20 0x00000000005e3bb1 in perf_session__deliver_event (session=0x105ff2c0, event=0x7f5c7d735ca0, tool=0x7fff9df81ae0, file_offset=2914716832, file_path=0x105ffbf0 "perf.data") at util/session.c:1367 torvalds#21 0x00000000005e8d93 in do_flush (oe=0x105ffa50, show_progress=false) at util/ordered-events.c:245 torvalds#22 __ordered_events__flush (oe=0x105ffa50, how=OE_FLUSH__ROUND, timestamp=<optimized out>) at util/ordered-events.c:324 torvalds#23 0x00000000005e1f64 in perf_session__process_user_event (session=0x105ff2c0, event=0x7f5c7d752b18, file_offset=2914835224, file_path=0x105ffbf0 "perf.data") at util/session.c:1419 torvalds#24 0x00000000005e47c7 in reader__read_event (rd=rd@entry=0x7fff9df81260, session=session@entry=0x105ff2c0, --Type <RET> for more, q to quit, c to continue without paging-- quit prog=prog@entry=0x7fff9df81220) at util/session.c:2132 torvalds#25 0x00000000005e4b37 in reader__process_events (rd=0x7fff9df81260, session=0x105ff2c0, prog=0x7fff9df81220) at util/session.c:2181 torvalds#26 __perf_session__process_events (session=0x105ff2c0) at util/session.c:2226 torvalds#27 perf_session__process_events (session=session@entry=0x105ff2c0) at util/session.c:2390 torvalds#28 0x0000000000460add in __cmd_report (rep=0x7fff9df81ae0) at builtin-report.c:1076 torvalds#29 cmd_report (argc=<optimized out>, argv=<optimized out>) at builtin-report.c:1827 torvalds#30 0x00000000004c5a40 in run_builtin (p=p@entry=0xd8f7f8 <commands+312>, argc=argc@entry=1, argv=argv@entry=0x7fff9df844b0) at perf.c:351 torvalds#31 0x00000000004c5d63 in handle_internal_command (argc=argc@entry=1, argv=argv@entry=0x7fff9df844b0) at perf.c:404 torvalds#32 0x0000000000442de3 in run_argv (argcp=<synthetic pointer>, argv=<synthetic pointer>) at perf.c:448 torvalds#33 main (argc=<optimized out>, argv=0x7fff9df844b0) at perf.c:556 The hangup happens because nothing in` perf` or `elfutils` checks if a mapped file is easily readable. The change conservatively skips all non-regular files. Signed-off-by: Sergei Trofimovich <[email protected]> Acked-by: Namhyung Kim <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Namhyung Kim <[email protected]>
Symbolize stack traces by creating a live machine. Add this functionality to dump_stack and switch dump_stack users to use it. Switch TUI to use it. Add stack traces to the child test function which can be useful to diagnose blocked code. Example output: ``` $ perf test -vv PERF_RECORD_ ... 7: PERF_RECORD_* events & perf_sample fields: 7: PERF_RECORD_* events & perf_sample fields : Running (1 active) ^C Signal (2) while running tests. Terminating tests with the same signal Internal test harness failure. Completing any started tests: : 7: PERF_RECORD_* events & perf_sample fields: ---- unexpected signal (2) ---- #0 0x55788c6210a3 in child_test_sig_handler builtin-test.c:0 #1 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0 #2 0x7fc12fe99687 in __internal_syscall_cancel cancellation.c:64 #3 0x7fc12fee5f7a in clock_nanosleep@GLIBC_2.2.5 clock_nanosleep.c:72 #4 0x7fc12fef1393 in __nanosleep nanosleep.c:26 #5 0x7fc12ff02d68 in __sleep sleep.c:55 #6 0x55788c63196b in test__PERF_RECORD perf-record.c:0 #7 0x55788c620fb0 in run_test_child builtin-test.c:0 #8 0x55788c5bd18d in start_command run-command.c:127 #9 0x55788c621ef3 in __cmd_test builtin-test.c:0 torvalds#10 0x55788c6225bf in cmd_test ??:0 torvalds#11 0x55788c5afbd0 in run_builtin perf.c:0 torvalds#12 0x55788c5afeeb in handle_internal_command perf.c:0 torvalds#13 0x55788c52b383 in main ??:0 torvalds#14 0x7fc12fe33ca8 in __libc_start_call_main libc_start_call_main.h:74 torvalds#15 0x7fc12fe33d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128 torvalds#16 0x55788c52b9d1 in _start ??:0 ---- unexpected signal (2) ---- #0 0x55788c6210a3 in child_test_sig_handler builtin-test.c:0 #1 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0 #2 0x7fc12fea3a14 in pthread_sigmask@GLIBC_2.2.5 pthread_sigmask.c:45 #3 0x7fc12fe49fd9 in __GI___sigprocmask sigprocmask.c:26 #4 0x7fc12ff2601b in __longjmp_chk longjmp.c:36 #5 0x55788c6210c0 in print_test_result.isra.0 builtin-test.c:0 #6 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0 #7 0x7fc12fe99687 in __internal_syscall_cancel cancellation.c:64 #8 0x7fc12fee5f7a in clock_nanosleep@GLIBC_2.2.5 clock_nanosleep.c:72 #9 0x7fc12fef1393 in __nanosleep nanosleep.c:26 torvalds#10 0x7fc12ff02d68 in __sleep sleep.c:55 torvalds#11 0x55788c63196b in test__PERF_RECORD perf-record.c:0 torvalds#12 0x55788c620fb0 in run_test_child builtin-test.c:0 torvalds#13 0x55788c5bd18d in start_command run-command.c:127 torvalds#14 0x55788c621ef3 in __cmd_test builtin-test.c:0 torvalds#15 0x55788c6225bf in cmd_test ??:0 torvalds#16 0x55788c5afbd0 in run_builtin perf.c:0 torvalds#17 0x55788c5afeeb in handle_internal_command perf.c:0 torvalds#18 0x55788c52b383 in main ??:0 torvalds#19 0x7fc12fe33ca8 in __libc_start_call_main libc_start_call_main.h:74 torvalds#20 0x7fc12fe33d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128 torvalds#21 0x55788c52b9d1 in _start ??:0 7: PERF_RECORD_* events & perf_sample fields : Skip (permissions) ``` Signed-off-by: Ian Rogers <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Namhyung Kim <[email protected]>
Calling perf top with branch filters enabled on Intel CPU's with branch counters logging (A.K.A LBR event logging [1]) support results in a segfault. $ perf top -e '{cpu_core/cpu-cycles/,cpu_core/event=0xc6,umask=0x3,frontend=0x11,name=frontend_retired_dsb_miss/}' -j any,counter ... Thread 27 "perf" received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fffafff76c0 (LWP 949003)] perf_env__find_br_cntr_info (env=0xf66dc0 <perf_env>, nr=0x0, width=0x7fffafff62c0) at util/env.c:653 653 *width = env->cpu_pmu_caps ? env->br_cntr_width : (gdb) bt #0 perf_env__find_br_cntr_info (env=0xf66dc0 <perf_env>, nr=0x0, width=0x7fffafff62c0) at util/env.c:653 #1 0x00000000005b1599 in symbol__account_br_cntr (branch=0x7fffcc3db580, evsel=0xfea2d0, offset=12, br_cntr=8) at util/annotate.c:345 #2 0x00000000005b17fb in symbol__account_cycles (addr=5658172, start=5658160, sym=0x7fffcc0ee420, cycles=539, evsel=0xfea2d0, br_cntr=8) at util/annotate.c:389 #3 0x00000000005b1976 in addr_map_symbol__account_cycles (ams=0x7fffcd7b01d0, start=0x7fffcd7b02b0, cycles=539, evsel=0xfea2d0, br_cntr=8) at util/annotate.c:422 #4 0x000000000068d57f in hist__account_cycles (bs=0x110d288, al=0x7fffafff6540, sample=0x7fffafff6760, nonany_branch_mode=false, total_cycles=0x0, evsel=0xfea2d0) at util/hist.c:2850 #5 0x0000000000446216 in hist_iter__top_callback (iter=0x7fffafff6590, al=0x7fffafff6540, single=true, arg=0x7fffffff9e00) at builtin-top.c:737 #6 0x0000000000689787 in hist_entry_iter__add (iter=0x7fffafff6590, al=0x7fffafff6540, max_stack_depth=127, arg=0x7fffffff9e00) at util/hist.c:1359 #7 0x0000000000446710 in perf_event__process_sample (tool=0x7fffffff9e00, event=0x110d250, evsel=0xfea2d0, sample=0x7fffafff6760, machine=0x108c968) at builtin-top.c:845 #8 0x0000000000447735 in deliver_event (qe=0x7fffffffa120, qevent=0x10fc200) at builtin-top.c:1211 #9 0x000000000064ccae in do_flush (oe=0x7fffffffa120, show_progress=false) at util/ordered-events.c:245 torvalds#10 0x000000000064d005 in __ordered_events__flush (oe=0x7fffffffa120, how=OE_FLUSH__TOP, timestamp=0) at util/ordered-events.c:324 torvalds#11 0x000000000064d0ef in ordered_events__flush (oe=0x7fffffffa120, how=OE_FLUSH__TOP) at util/ordered-events.c:342 torvalds#12 0x00000000004472a9 in process_thread (arg=0x7fffffff9e00) at builtin-top.c:1120 torvalds#13 0x00007ffff6e7dba8 in start_thread (arg=<optimized out>) at pthread_create.c:448 torvalds#14 0x00007ffff6f01b8c in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78 The cause is that perf_env__find_br_cntr_info tries to access a null pointer pmu_caps in the perf_env struct. A similar issue exists for homogeneous core systems which use the cpu_pmu_caps structure. Fix this by populating cpu_pmu_caps and pmu_caps structures with values from sysfs when calling perf top with branch stack sampling enabled. [1], LBR event logging introduced here: https://lore.kernel.org/all/[email protected]/ Reviewed-by: Ian Rogers <[email protected]> Signed-off-by: Thomas Falcon <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Namhyung Kim <[email protected]>
As syzbot [1] reported as below: R10: 0000000000000100 R11: 0000000000000206 R12: 00007ffe17473450 R13: 00007f28b1c10854 R14: 000000000000dae5 R15: 00007ffe17474520 </TASK> ---[ end trace 0000000000000000 ]--- ================================================================== BUG: KASAN: use-after-free in __list_del_entry_valid+0xa6/0x130 lib/list_debug.c:62 Read of size 8 at addr ffff88812d962278 by task syz-executor/564 CPU: 1 PID: 564 Comm: syz-executor Tainted: G W 6.1.129-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025 Call Trace: <TASK> __dump_stack+0x21/0x24 lib/dump_stack.c:88 dump_stack_lvl+0xee/0x158 lib/dump_stack.c:106 print_address_description+0x71/0x210 mm/kasan/report.c:316 print_report+0x4a/0x60 mm/kasan/report.c:427 kasan_report+0x122/0x150 mm/kasan/report.c:531 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report_generic.c:351 __list_del_entry_valid+0xa6/0x130 lib/list_debug.c:62 __list_del_entry include/linux/list.h:134 [inline] list_del_init include/linux/list.h:206 [inline] f2fs_inode_synced+0xf7/0x2e0 fs/f2fs/super.c:1531 f2fs_update_inode+0x74/0x1c40 fs/f2fs/inode.c:585 f2fs_update_inode_page+0x137/0x170 fs/f2fs/inode.c:703 f2fs_write_inode+0x4ec/0x770 fs/f2fs/inode.c:731 write_inode fs/fs-writeback.c:1460 [inline] __writeback_single_inode+0x4a0/0xab0 fs/fs-writeback.c:1677 writeback_single_inode+0x221/0x8b0 fs/fs-writeback.c:1733 sync_inode_metadata+0xb6/0x110 fs/fs-writeback.c:2789 f2fs_sync_inode_meta+0x16d/0x2a0 fs/f2fs/checkpoint.c:1159 block_operations fs/f2fs/checkpoint.c:1269 [inline] f2fs_write_checkpoint+0xca3/0x2100 fs/f2fs/checkpoint.c:1658 kill_f2fs_super+0x231/0x390 fs/f2fs/super.c:4668 deactivate_locked_super+0x98/0x100 fs/super.c:332 deactivate_super+0xaf/0xe0 fs/super.c:363 cleanup_mnt+0x45f/0x4e0 fs/namespace.c:1186 __cleanup_mnt+0x19/0x20 fs/namespace.c:1193 task_work_run+0x1c6/0x230 kernel/task_work.c:203 exit_task_work include/linux/task_work.h:39 [inline] do_exit+0x9fb/0x2410 kernel/exit.c:871 do_group_exit+0x210/0x2d0 kernel/exit.c:1021 __do_sys_exit_group kernel/exit.c:1032 [inline] __se_sys_exit_group kernel/exit.c:1030 [inline] __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1030 x64_sys_call+0x7b4/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:232 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x4c/0xa0 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x68/0xd2 RIP: 0033:0x7f28b1b8e169 Code: Unable to access opcode bytes at 0x7f28b1b8e13f. RSP: 002b:00007ffe174710a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 00007f28b1c10879 RCX: 00007f28b1b8e169 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001 RBP: 0000000000000002 R08: 00007ffe1746ee47 R09: 00007ffe17472360 R10: 0000000000000009 R11: 0000000000000246 R12: 00007ffe17472360 R13: 00007f28b1c10854 R14: 000000000000dae5 R15: 00007ffe17474520 </TASK> Allocated by task 569: kasan_save_stack mm/kasan/common.c:45 [inline] kasan_set_track+0x4b/0x70 mm/kasan/common.c:52 kasan_save_alloc_info+0x25/0x30 mm/kasan/generic.c:505 __kasan_slab_alloc+0x72/0x80 mm/kasan/common.c:328 kasan_slab_alloc include/linux/kasan.h:201 [inline] slab_post_alloc_hook+0x4f/0x2c0 mm/slab.h:737 slab_alloc_node mm/slub.c:3398 [inline] slab_alloc mm/slub.c:3406 [inline] __kmem_cache_alloc_lru mm/slub.c:3413 [inline] kmem_cache_alloc_lru+0x104/0x220 mm/slub.c:3429 alloc_inode_sb include/linux/fs.h:3245 [inline] f2fs_alloc_inode+0x2d/0x340 fs/f2fs/super.c:1419 alloc_inode fs/inode.c:261 [inline] iget_locked+0x186/0x880 fs/inode.c:1373 f2fs_iget+0x55/0x4c60 fs/f2fs/inode.c:483 f2fs_lookup+0x366/0xab0 fs/f2fs/namei.c:487 __lookup_slow+0x2a3/0x3d0 fs/namei.c:1690 lookup_slow+0x57/0x70 fs/namei.c:1707 walk_component+0x2e6/0x410 fs/namei.c:1998 lookup_last fs/namei.c:2455 [inline] path_lookupat+0x180/0x490 fs/namei.c:2479 filename_lookup+0x1f0/0x500 fs/namei.c:2508 vfs_statx+0x10b/0x660 fs/stat.c:229 vfs_fstatat fs/stat.c:267 [inline] vfs_lstat include/linux/fs.h:3424 [inline] __do_sys_newlstat fs/stat.c:423 [inline] __se_sys_newlstat+0xd5/0x350 fs/stat.c:417 __x64_sys_newlstat+0x5b/0x70 fs/stat.c:417 x64_sys_call+0x393/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:7 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x4c/0xa0 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x68/0xd2 Freed by task 13: kasan_save_stack mm/kasan/common.c:45 [inline] kasan_set_track+0x4b/0x70 mm/kasan/common.c:52 kasan_save_free_info+0x31/0x50 mm/kasan/generic.c:516 ____kasan_slab_free+0x132/0x180 mm/kasan/common.c:236 __kasan_slab_free+0x11/0x20 mm/kasan/common.c:244 kasan_slab_free include/linux/kasan.h:177 [inline] slab_free_hook mm/slub.c:1724 [inline] slab_free_freelist_hook+0xc2/0x190 mm/slub.c:1750 slab_free mm/slub.c:3661 [inline] kmem_cache_free+0x12d/0x2a0 mm/slub.c:3683 f2fs_free_inode+0x24/0x30 fs/f2fs/super.c:1562 i_callback+0x4c/0x70 fs/inode.c:250 rcu_do_batch+0x503/0xb80 kernel/rcu/tree.c:2297 rcu_core+0x5a2/0xe70 kernel/rcu/tree.c:2557 rcu_core_si+0x9/0x10 kernel/rcu/tree.c:2574 handle_softirqs+0x178/0x500 kernel/softirq.c:578 run_ksoftirqd+0x28/0x30 kernel/softirq.c:945 smpboot_thread_fn+0x45a/0x8c0 kernel/smpboot.c:164 kthread+0x270/0x310 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 Last potentially related work creation: kasan_save_stack+0x3a/0x60 mm/kasan/common.c:45 __kasan_record_aux_stack+0xb6/0xc0 mm/kasan/generic.c:486 kasan_record_aux_stack_noalloc+0xb/0x10 mm/kasan/generic.c:496 call_rcu+0xd4/0xf70 kernel/rcu/tree.c:2845 destroy_inode fs/inode.c:316 [inline] evict+0x7da/0x870 fs/inode.c:720 iput_final fs/inode.c:1834 [inline] iput+0x62b/0x830 fs/inode.c:1860 do_unlinkat+0x356/0x540 fs/namei.c:4397 __do_sys_unlink fs/namei.c:4438 [inline] __se_sys_unlink fs/namei.c:4436 [inline] __x64_sys_unlink+0x49/0x50 fs/namei.c:4436 x64_sys_call+0x958/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:88 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x4c/0xa0 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x68/0xd2 The buggy address belongs to the object at ffff88812d961f20 which belongs to the cache f2fs_inode_cache of size 1200 The buggy address is located 856 bytes inside of 1200-byte region [ffff88812d961f20, ffff88812d9623d0) The buggy address belongs to the physical page: page:ffffea0004b65800 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12d960 head:ffffea0004b65800 order:2 compound_mapcount:0 compound_pincount:0 flags: 0x4000000000010200(slab|head|zone=1) raw: 4000000000010200 0000000000000000 dead000000000122 ffff88810a94c500 raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 2, migratetype Reclaimable, gfp_mask 0x1d2050(__GFP_IO|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_RECLAIMABLE), pid 569, tgid 568 (syz.2.16), ts 55943246141, free_ts 0 set_page_owner include/linux/page_owner.h:31 [inline] post_alloc_hook+0x1d0/0x1f0 mm/page_alloc.c:2532 prep_new_page mm/page_alloc.c:2539 [inline] get_page_from_freelist+0x2e63/0x2ef0 mm/page_alloc.c:4328 __alloc_pages+0x235/0x4b0 mm/page_alloc.c:5605 alloc_slab_page include/linux/gfp.h:-1 [inline] allocate_slab mm/slub.c:1939 [inline] new_slab+0xec/0x4b0 mm/slub.c:1992 ___slab_alloc+0x6f6/0xb50 mm/slub.c:3180 __slab_alloc+0x5e/0xa0 mm/slub.c:3279 slab_alloc_node mm/slub.c:3364 [inline] slab_alloc mm/slub.c:3406 [inline] __kmem_cache_alloc_lru mm/slub.c:3413 [inline] kmem_cache_alloc_lru+0x13f/0x220 mm/slub.c:3429 alloc_inode_sb include/linux/fs.h:3245 [inline] f2fs_alloc_inode+0x2d/0x340 fs/f2fs/super.c:1419 alloc_inode fs/inode.c:261 [inline] iget_locked+0x186/0x880 fs/inode.c:1373 f2fs_iget+0x55/0x4c60 fs/f2fs/inode.c:483 f2fs_fill_super+0x3ad7/0x6bb0 fs/f2fs/super.c:4293 mount_bdev+0x2ae/0x3e0 fs/super.c:1443 f2fs_mount+0x34/0x40 fs/f2fs/super.c:4642 legacy_get_tree+0xea/0x190 fs/fs_context.c:632 vfs_get_tree+0x89/0x260 fs/super.c:1573 do_new_mount+0x25a/0xa20 fs/namespace.c:3056 page_owner free stack trace missing Memory state around the buggy address: ffff88812d962100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88812d962180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff88812d962200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88812d962280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88812d962300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== [1] https://syzkaller.appspot.com/x/report.txt?x=13448368580000 This bug can be reproduced w/ the reproducer [2], once we enable CONFIG_F2FS_CHECK_FS config, the reproducer will trigger panic as below, so the direct reason of this bug is the same as the one below patch [3] fixed. kernel BUG at fs/f2fs/inode.c:857! RIP: 0010:f2fs_evict_inode+0x1204/0x1a20 Call Trace: <TASK> evict+0x32a/0x7a0 do_unlinkat+0x37b/0x5b0 __x64_sys_unlink+0xad/0x100 do_syscall_64+0x5a/0xb0 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 RIP: 0010:f2fs_evict_inode+0x1204/0x1a20 [2] https://syzkaller.appspot.com/x/repro.c?x=17495ccc580000 [3] https://lore.kernel.org/linux-f2fs-devel/[email protected] Tracepoints before panic: f2fs_unlink_enter: dev = (7,0), dir ino = 3, i_size = 4096, i_blocks = 8, name = file1 f2fs_unlink_exit: dev = (7,0), ino = 7, ret = 0 f2fs_evict_inode: dev = (7,0), ino = 7, pino = 3, i_mode = 0x81ed, i_size = 10, i_nlink = 0, i_blocks = 0, i_advise = 0x0 f2fs_truncate_node: dev = (7,0), ino = 7, nid = 8, block_address = 0x3c05 f2fs_unlink_enter: dev = (7,0), dir ino = 3, i_size = 4096, i_blocks = 8, name = file3 f2fs_unlink_exit: dev = (7,0), ino = 8, ret = 0 f2fs_evict_inode: dev = (7,0), ino = 8, pino = 3, i_mode = 0x81ed, i_size = 9000, i_nlink = 0, i_blocks = 24, i_advise = 0x4 f2fs_truncate: dev = (7,0), ino = 8, pino = 3, i_mode = 0x81ed, i_size = 0, i_nlink = 0, i_blocks = 24, i_advise = 0x4 f2fs_truncate_blocks_enter: dev = (7,0), ino = 8, i_size = 0, i_blocks = 24, start file offset = 0 f2fs_truncate_blocks_exit: dev = (7,0), ino = 8, ret = -2 The root cause is: in the fuzzed image, dnode #8 belongs to inode #7, after inode #7 eviction, dnode #8 was dropped. However there is dirent that has ino #8, so, once we unlink file3, in f2fs_evict_inode(), both f2fs_truncate() and f2fs_update_inode_page() will fail due to we can not load node #8, result in we missed to call f2fs_inode_synced() to clear inode dirty status. Let's fix this by calling f2fs_inode_synced() in error path of f2fs_evict_inode(). PS: As I verified, the reproducer [2] can trigger this bug in v6.1.129, but it failed in v6.16-rc4, this is because the testcase will stop due to other corruption has been detected by f2fs: F2FS-fs (loop0): inconsistent node block, node_type:2, nid:8, node_footer[nid:8,ino:8,ofs:0,cpver:5013063228981249506,blkaddr:15366] F2FS-fs (loop0): f2fs_lookup: inode (ino=9) has zero i_nlink Fixes: 0f18b46 ("f2fs: flush inode metadata when checkpoint is doing") Closes: https://syzkaller.appspot.com/x/report.txt?x=13448368580000 Signed-off-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
[ Upstream commit a509a55 ] As syzbot [1] reported as below: R10: 0000000000000100 R11: 0000000000000206 R12: 00007ffe17473450 R13: 00007f28b1c10854 R14: 000000000000dae5 R15: 00007ffe17474520 </TASK> ---[ end trace 0000000000000000 ]--- ================================================================== BUG: KASAN: use-after-free in __list_del_entry_valid+0xa6/0x130 lib/list_debug.c:62 Read of size 8 at addr ffff88812d962278 by task syz-executor/564 CPU: 1 PID: 564 Comm: syz-executor Tainted: G W 6.1.129-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025 Call Trace: <TASK> __dump_stack+0x21/0x24 lib/dump_stack.c:88 dump_stack_lvl+0xee/0x158 lib/dump_stack.c:106 print_address_description+0x71/0x210 mm/kasan/report.c:316 print_report+0x4a/0x60 mm/kasan/report.c:427 kasan_report+0x122/0x150 mm/kasan/report.c:531 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report_generic.c:351 __list_del_entry_valid+0xa6/0x130 lib/list_debug.c:62 __list_del_entry include/linux/list.h:134 [inline] list_del_init include/linux/list.h:206 [inline] f2fs_inode_synced+0xf7/0x2e0 fs/f2fs/super.c:1531 f2fs_update_inode+0x74/0x1c40 fs/f2fs/inode.c:585 f2fs_update_inode_page+0x137/0x170 fs/f2fs/inode.c:703 f2fs_write_inode+0x4ec/0x770 fs/f2fs/inode.c:731 write_inode fs/fs-writeback.c:1460 [inline] __writeback_single_inode+0x4a0/0xab0 fs/fs-writeback.c:1677 writeback_single_inode+0x221/0x8b0 fs/fs-writeback.c:1733 sync_inode_metadata+0xb6/0x110 fs/fs-writeback.c:2789 f2fs_sync_inode_meta+0x16d/0x2a0 fs/f2fs/checkpoint.c:1159 block_operations fs/f2fs/checkpoint.c:1269 [inline] f2fs_write_checkpoint+0xca3/0x2100 fs/f2fs/checkpoint.c:1658 kill_f2fs_super+0x231/0x390 fs/f2fs/super.c:4668 deactivate_locked_super+0x98/0x100 fs/super.c:332 deactivate_super+0xaf/0xe0 fs/super.c:363 cleanup_mnt+0x45f/0x4e0 fs/namespace.c:1186 __cleanup_mnt+0x19/0x20 fs/namespace.c:1193 task_work_run+0x1c6/0x230 kernel/task_work.c:203 exit_task_work include/linux/task_work.h:39 [inline] do_exit+0x9fb/0x2410 kernel/exit.c:871 do_group_exit+0x210/0x2d0 kernel/exit.c:1021 __do_sys_exit_group kernel/exit.c:1032 [inline] __se_sys_exit_group kernel/exit.c:1030 [inline] __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1030 x64_sys_call+0x7b4/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:232 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x4c/0xa0 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x68/0xd2 RIP: 0033:0x7f28b1b8e169 Code: Unable to access opcode bytes at 0x7f28b1b8e13f. RSP: 002b:00007ffe174710a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 00007f28b1c10879 RCX: 00007f28b1b8e169 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001 RBP: 0000000000000002 R08: 00007ffe1746ee47 R09: 00007ffe17472360 R10: 0000000000000009 R11: 0000000000000246 R12: 00007ffe17472360 R13: 00007f28b1c10854 R14: 000000000000dae5 R15: 00007ffe17474520 </TASK> Allocated by task 569: kasan_save_stack mm/kasan/common.c:45 [inline] kasan_set_track+0x4b/0x70 mm/kasan/common.c:52 kasan_save_alloc_info+0x25/0x30 mm/kasan/generic.c:505 __kasan_slab_alloc+0x72/0x80 mm/kasan/common.c:328 kasan_slab_alloc include/linux/kasan.h:201 [inline] slab_post_alloc_hook+0x4f/0x2c0 mm/slab.h:737 slab_alloc_node mm/slub.c:3398 [inline] slab_alloc mm/slub.c:3406 [inline] __kmem_cache_alloc_lru mm/slub.c:3413 [inline] kmem_cache_alloc_lru+0x104/0x220 mm/slub.c:3429 alloc_inode_sb include/linux/fs.h:3245 [inline] f2fs_alloc_inode+0x2d/0x340 fs/f2fs/super.c:1419 alloc_inode fs/inode.c:261 [inline] iget_locked+0x186/0x880 fs/inode.c:1373 f2fs_iget+0x55/0x4c60 fs/f2fs/inode.c:483 f2fs_lookup+0x366/0xab0 fs/f2fs/namei.c:487 __lookup_slow+0x2a3/0x3d0 fs/namei.c:1690 lookup_slow+0x57/0x70 fs/namei.c:1707 walk_component+0x2e6/0x410 fs/namei.c:1998 lookup_last fs/namei.c:2455 [inline] path_lookupat+0x180/0x490 fs/namei.c:2479 filename_lookup+0x1f0/0x500 fs/namei.c:2508 vfs_statx+0x10b/0x660 fs/stat.c:229 vfs_fstatat fs/stat.c:267 [inline] vfs_lstat include/linux/fs.h:3424 [inline] __do_sys_newlstat fs/stat.c:423 [inline] __se_sys_newlstat+0xd5/0x350 fs/stat.c:417 __x64_sys_newlstat+0x5b/0x70 fs/stat.c:417 x64_sys_call+0x393/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:7 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x4c/0xa0 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x68/0xd2 Freed by task 13: kasan_save_stack mm/kasan/common.c:45 [inline] kasan_set_track+0x4b/0x70 mm/kasan/common.c:52 kasan_save_free_info+0x31/0x50 mm/kasan/generic.c:516 ____kasan_slab_free+0x132/0x180 mm/kasan/common.c:236 __kasan_slab_free+0x11/0x20 mm/kasan/common.c:244 kasan_slab_free include/linux/kasan.h:177 [inline] slab_free_hook mm/slub.c:1724 [inline] slab_free_freelist_hook+0xc2/0x190 mm/slub.c:1750 slab_free mm/slub.c:3661 [inline] kmem_cache_free+0x12d/0x2a0 mm/slub.c:3683 f2fs_free_inode+0x24/0x30 fs/f2fs/super.c:1562 i_callback+0x4c/0x70 fs/inode.c:250 rcu_do_batch+0x503/0xb80 kernel/rcu/tree.c:2297 rcu_core+0x5a2/0xe70 kernel/rcu/tree.c:2557 rcu_core_si+0x9/0x10 kernel/rcu/tree.c:2574 handle_softirqs+0x178/0x500 kernel/softirq.c:578 run_ksoftirqd+0x28/0x30 kernel/softirq.c:945 smpboot_thread_fn+0x45a/0x8c0 kernel/smpboot.c:164 kthread+0x270/0x310 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 Last potentially related work creation: kasan_save_stack+0x3a/0x60 mm/kasan/common.c:45 __kasan_record_aux_stack+0xb6/0xc0 mm/kasan/generic.c:486 kasan_record_aux_stack_noalloc+0xb/0x10 mm/kasan/generic.c:496 call_rcu+0xd4/0xf70 kernel/rcu/tree.c:2845 destroy_inode fs/inode.c:316 [inline] evict+0x7da/0x870 fs/inode.c:720 iput_final fs/inode.c:1834 [inline] iput+0x62b/0x830 fs/inode.c:1860 do_unlinkat+0x356/0x540 fs/namei.c:4397 __do_sys_unlink fs/namei.c:4438 [inline] __se_sys_unlink fs/namei.c:4436 [inline] __x64_sys_unlink+0x49/0x50 fs/namei.c:4436 x64_sys_call+0x958/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:88 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x4c/0xa0 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x68/0xd2 The buggy address belongs to the object at ffff88812d961f20 which belongs to the cache f2fs_inode_cache of size 1200 The buggy address is located 856 bytes inside of 1200-byte region [ffff88812d961f20, ffff88812d9623d0) The buggy address belongs to the physical page: page:ffffea0004b65800 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12d960 head:ffffea0004b65800 order:2 compound_mapcount:0 compound_pincount:0 flags: 0x4000000000010200(slab|head|zone=1) raw: 4000000000010200 0000000000000000 dead000000000122 ffff88810a94c500 raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 2, migratetype Reclaimable, gfp_mask 0x1d2050(__GFP_IO|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_RECLAIMABLE), pid 569, tgid 568 (syz.2.16), ts 55943246141, free_ts 0 set_page_owner include/linux/page_owner.h:31 [inline] post_alloc_hook+0x1d0/0x1f0 mm/page_alloc.c:2532 prep_new_page mm/page_alloc.c:2539 [inline] get_page_from_freelist+0x2e63/0x2ef0 mm/page_alloc.c:4328 __alloc_pages+0x235/0x4b0 mm/page_alloc.c:5605 alloc_slab_page include/linux/gfp.h:-1 [inline] allocate_slab mm/slub.c:1939 [inline] new_slab+0xec/0x4b0 mm/slub.c:1992 ___slab_alloc+0x6f6/0xb50 mm/slub.c:3180 __slab_alloc+0x5e/0xa0 mm/slub.c:3279 slab_alloc_node mm/slub.c:3364 [inline] slab_alloc mm/slub.c:3406 [inline] __kmem_cache_alloc_lru mm/slub.c:3413 [inline] kmem_cache_alloc_lru+0x13f/0x220 mm/slub.c:3429 alloc_inode_sb include/linux/fs.h:3245 [inline] f2fs_alloc_inode+0x2d/0x340 fs/f2fs/super.c:1419 alloc_inode fs/inode.c:261 [inline] iget_locked+0x186/0x880 fs/inode.c:1373 f2fs_iget+0x55/0x4c60 fs/f2fs/inode.c:483 f2fs_fill_super+0x3ad7/0x6bb0 fs/f2fs/super.c:4293 mount_bdev+0x2ae/0x3e0 fs/super.c:1443 f2fs_mount+0x34/0x40 fs/f2fs/super.c:4642 legacy_get_tree+0xea/0x190 fs/fs_context.c:632 vfs_get_tree+0x89/0x260 fs/super.c:1573 do_new_mount+0x25a/0xa20 fs/namespace.c:3056 page_owner free stack trace missing Memory state around the buggy address: ffff88812d962100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88812d962180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff88812d962200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88812d962280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88812d962300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== [1] https://syzkaller.appspot.com/x/report.txt?x=13448368580000 This bug can be reproduced w/ the reproducer [2], once we enable CONFIG_F2FS_CHECK_FS config, the reproducer will trigger panic as below, so the direct reason of this bug is the same as the one below patch [3] fixed. kernel BUG at fs/f2fs/inode.c:857! RIP: 0010:f2fs_evict_inode+0x1204/0x1a20 Call Trace: <TASK> evict+0x32a/0x7a0 do_unlinkat+0x37b/0x5b0 __x64_sys_unlink+0xad/0x100 do_syscall_64+0x5a/0xb0 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 RIP: 0010:f2fs_evict_inode+0x1204/0x1a20 [2] https://syzkaller.appspot.com/x/repro.c?x=17495ccc580000 [3] https://lore.kernel.org/linux-f2fs-devel/[email protected] Tracepoints before panic: f2fs_unlink_enter: dev = (7,0), dir ino = 3, i_size = 4096, i_blocks = 8, name = file1 f2fs_unlink_exit: dev = (7,0), ino = 7, ret = 0 f2fs_evict_inode: dev = (7,0), ino = 7, pino = 3, i_mode = 0x81ed, i_size = 10, i_nlink = 0, i_blocks = 0, i_advise = 0x0 f2fs_truncate_node: dev = (7,0), ino = 7, nid = 8, block_address = 0x3c05 f2fs_unlink_enter: dev = (7,0), dir ino = 3, i_size = 4096, i_blocks = 8, name = file3 f2fs_unlink_exit: dev = (7,0), ino = 8, ret = 0 f2fs_evict_inode: dev = (7,0), ino = 8, pino = 3, i_mode = 0x81ed, i_size = 9000, i_nlink = 0, i_blocks = 24, i_advise = 0x4 f2fs_truncate: dev = (7,0), ino = 8, pino = 3, i_mode = 0x81ed, i_size = 0, i_nlink = 0, i_blocks = 24, i_advise = 0x4 f2fs_truncate_blocks_enter: dev = (7,0), ino = 8, i_size = 0, i_blocks = 24, start file offset = 0 f2fs_truncate_blocks_exit: dev = (7,0), ino = 8, ret = -2 The root cause is: in the fuzzed image, dnode #8 belongs to inode #7, after inode #7 eviction, dnode #8 was dropped. However there is dirent that has ino #8, so, once we unlink file3, in f2fs_evict_inode(), both f2fs_truncate() and f2fs_update_inode_page() will fail due to we can not load node #8, result in we missed to call f2fs_inode_synced() to clear inode dirty status. Let's fix this by calling f2fs_inode_synced() in error path of f2fs_evict_inode(). PS: As I verified, the reproducer [2] can trigger this bug in v6.1.129, but it failed in v6.16-rc4, this is because the testcase will stop due to other corruption has been detected by f2fs: F2FS-fs (loop0): inconsistent node block, node_type:2, nid:8, node_footer[nid:8,ino:8,ofs:0,cpver:5013063228981249506,blkaddr:15366] F2FS-fs (loop0): f2fs_lookup: inode (ino=9) has zero i_nlink Fixes: 0f18b46 ("f2fs: flush inode metadata when checkpoint is doing") Closes: https://syzkaller.appspot.com/x/report.txt?x=13448368580000 Signed-off-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit c80aa2a ] The hfsplus_bnode_read() method can trigger the issue: [ 174.852007][ T9784] ================================================================== [ 174.852709][ T9784] BUG: KASAN: slab-out-of-bounds in hfsplus_bnode_read+0x2f4/0x360 [ 174.853412][ T9784] Read of size 8 at addr ffff88810b5fc6c0 by task repro/9784 [ 174.854059][ T9784] [ 174.854272][ T9784] CPU: 1 UID: 0 PID: 9784 Comm: repro Not tainted 6.16.0-rc3 #7 PREEMPT(full) [ 174.854281][ T9784] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 174.854286][ T9784] Call Trace: [ 174.854289][ T9784] <TASK> [ 174.854292][ T9784] dump_stack_lvl+0x10e/0x1f0 [ 174.854305][ T9784] print_report+0xd0/0x660 [ 174.854315][ T9784] ? __virt_addr_valid+0x81/0x610 [ 174.854323][ T9784] ? __phys_addr+0xe8/0x180 [ 174.854330][ T9784] ? hfsplus_bnode_read+0x2f4/0x360 [ 174.854337][ T9784] kasan_report+0xc6/0x100 [ 174.854346][ T9784] ? hfsplus_bnode_read+0x2f4/0x360 [ 174.854354][ T9784] hfsplus_bnode_read+0x2f4/0x360 [ 174.854362][ T9784] hfsplus_bnode_dump+0x2ec/0x380 [ 174.854370][ T9784] ? __pfx_hfsplus_bnode_dump+0x10/0x10 [ 174.854377][ T9784] ? hfsplus_bnode_write_u16+0x83/0xb0 [ 174.854385][ T9784] ? srcu_gp_start+0xd0/0x310 [ 174.854393][ T9784] ? __mark_inode_dirty+0x29e/0xe40 [ 174.854402][ T9784] hfsplus_brec_remove+0x3d2/0x4e0 [ 174.854411][ T9784] __hfsplus_delete_attr+0x290/0x3a0 [ 174.854419][ T9784] ? __pfx_hfs_find_1st_rec_by_cnid+0x10/0x10 [ 174.854427][ T9784] ? __pfx___hfsplus_delete_attr+0x10/0x10 [ 174.854436][ T9784] ? __asan_memset+0x23/0x50 [ 174.854450][ T9784] hfsplus_delete_all_attrs+0x262/0x320 [ 174.854459][ T9784] ? __pfx_hfsplus_delete_all_attrs+0x10/0x10 [ 174.854469][ T9784] ? rcu_is_watching+0x12/0xc0 [ 174.854476][ T9784] ? __mark_inode_dirty+0x29e/0xe40 [ 174.854483][ T9784] hfsplus_delete_cat+0x845/0xde0 [ 174.854493][ T9784] ? __pfx_hfsplus_delete_cat+0x10/0x10 [ 174.854507][ T9784] hfsplus_unlink+0x1ca/0x7c0 [ 174.854516][ T9784] ? __pfx_hfsplus_unlink+0x10/0x10 [ 174.854525][ T9784] ? down_write+0x148/0x200 [ 174.854532][ T9784] ? __pfx_down_write+0x10/0x10 [ 174.854540][ T9784] vfs_unlink+0x2fe/0x9b0 [ 174.854549][ T9784] do_unlinkat+0x490/0x670 [ 174.854557][ T9784] ? __pfx_do_unlinkat+0x10/0x10 [ 174.854565][ T9784] ? __might_fault+0xbc/0x130 [ 174.854576][ T9784] ? getname_flags.part.0+0x1c5/0x550 [ 174.854584][ T9784] __x64_sys_unlink+0xc5/0x110 [ 174.854592][ T9784] do_syscall_64+0xc9/0x480 [ 174.854600][ T9784] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 174.854608][ T9784] RIP: 0033:0x7f6fdf4c3167 [ 174.854614][ T9784] Code: f0 ff ff 73 01 c3 48 8b 0d 26 0d 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 08 [ 174.854622][ T9784] RSP: 002b:00007ffcb948bca8 EFLAGS: 00000206 ORIG_RAX: 0000000000000057 [ 174.854630][ T9784] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f6fdf4c3167 [ 174.854636][ T9784] RDX: 00007ffcb948bcc0 RSI: 00007ffcb948bcc0 RDI: 00007ffcb948bd50 [ 174.854641][ T9784] RBP: 00007ffcb948cd90 R08: 0000000000000001 R09: 00007ffcb948bb40 [ 174.854645][ T9784] R10: 00007f6fdf564fc0 R11: 0000000000000206 R12: 0000561e1bc9c2d0 [ 174.854650][ T9784] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 174.854658][ T9784] </TASK> [ 174.854661][ T9784] [ 174.879281][ T9784] Allocated by task 9784: [ 174.879664][ T9784] kasan_save_stack+0x20/0x40 [ 174.880082][ T9784] kasan_save_track+0x14/0x30 [ 174.880500][ T9784] __kasan_kmalloc+0xaa/0xb0 [ 174.880908][ T9784] __kmalloc_noprof+0x205/0x550 [ 174.881337][ T9784] __hfs_bnode_create+0x107/0x890 [ 174.881779][ T9784] hfsplus_bnode_find+0x2d0/0xd10 [ 174.882222][ T9784] hfsplus_brec_find+0x2b0/0x520 [ 174.882659][ T9784] hfsplus_delete_all_attrs+0x23b/0x320 [ 174.883144][ T9784] hfsplus_delete_cat+0x845/0xde0 [ 174.883595][ T9784] hfsplus_rmdir+0x106/0x1b0 [ 174.884004][ T9784] vfs_rmdir+0x206/0x690 [ 174.884379][ T9784] do_rmdir+0x2b7/0x390 [ 174.884751][ T9784] __x64_sys_rmdir+0xc5/0x110 [ 174.885167][ T9784] do_syscall_64+0xc9/0x480 [ 174.885568][ T9784] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 174.886083][ T9784] [ 174.886293][ T9784] The buggy address belongs to the object at ffff88810b5fc600 [ 174.886293][ T9784] which belongs to the cache kmalloc-192 of size 192 [ 174.887507][ T9784] The buggy address is located 40 bytes to the right of [ 174.887507][ T9784] allocated 152-byte region [ffff88810b5fc600, ffff88810b5fc698) [ 174.888766][ T9784] [ 174.888976][ T9784] The buggy address belongs to the physical page: [ 174.889533][ T9784] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10b5fc [ 174.890295][ T9784] flags: 0x57ff00000000000(node=1|zone=2|lastcpupid=0x7ff) [ 174.890927][ T9784] page_type: f5(slab) [ 174.891284][ T9784] raw: 057ff00000000000 ffff88801b4423c0 ffffea000426dc80 dead000000000002 [ 174.892032][ T9784] raw: 0000000000000000 0000000080100010 00000000f5000000 0000000000000000 [ 174.892774][ T9784] page dumped because: kasan: bad access detected [ 174.893327][ T9784] page_owner tracks the page as allocated [ 174.893825][ T9784] page last allocated via order 0, migratetype Unmovable, gfp_mask 0x52c00(GFP_NOIO|__GFP_NOWARN|__GFP_NO1 [ 174.895373][ T9784] post_alloc_hook+0x1c0/0x230 [ 174.895801][ T9784] get_page_from_freelist+0xdeb/0x3b30 [ 174.896284][ T9784] __alloc_frozen_pages_noprof+0x25c/0x2460 [ 174.896810][ T9784] alloc_pages_mpol+0x1fb/0x550 [ 174.897242][ T9784] new_slab+0x23b/0x340 [ 174.897614][ T9784] ___slab_alloc+0xd81/0x1960 [ 174.898028][ T9784] __slab_alloc.isra.0+0x56/0xb0 [ 174.898468][ T9784] __kmalloc_noprof+0x2b0/0x550 [ 174.898896][ T9784] usb_alloc_urb+0x73/0xa0 [ 174.899289][ T9784] usb_control_msg+0x1cb/0x4a0 [ 174.899718][ T9784] usb_get_string+0xab/0x1a0 [ 174.900133][ T9784] usb_string_sub+0x107/0x3c0 [ 174.900549][ T9784] usb_string+0x307/0x670 [ 174.900933][ T9784] usb_cache_string+0x80/0x150 [ 174.901355][ T9784] usb_new_device+0x1d0/0x19d0 [ 174.901786][ T9784] register_root_hub+0x299/0x730 [ 174.902231][ T9784] page last free pid 10 tgid 10 stack trace: [ 174.902757][ T9784] __free_frozen_pages+0x80c/0x1250 [ 174.903217][ T9784] vfree.part.0+0x12b/0xab0 [ 174.903645][ T9784] delayed_vfree_work+0x93/0xd0 [ 174.904073][ T9784] process_one_work+0x9b5/0x1b80 [ 174.904519][ T9784] worker_thread+0x630/0xe60 [ 174.904927][ T9784] kthread+0x3a8/0x770 [ 174.905291][ T9784] ret_from_fork+0x517/0x6e0 [ 174.905709][ T9784] ret_from_fork_asm+0x1a/0x30 [ 174.906128][ T9784] [ 174.906338][ T9784] Memory state around the buggy address: [ 174.906828][ T9784] ffff88810b5fc580: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc [ 174.907528][ T9784] ffff88810b5fc600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 174.908222][ T9784] >ffff88810b5fc680: 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc [ 174.908917][ T9784] ^ [ 174.909481][ T9784] ffff88810b5fc700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 174.910432][ T9784] ffff88810b5fc780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc [ 174.911401][ T9784] ================================================================== The reason of the issue that code doesn't check the correctness of the requested offset and length. As a result, incorrect value of offset or/and length could result in access out of allocated memory. This patch introduces is_bnode_offset_valid() method that checks the requested offset value. Also, it introduces check_and_correct_requested_length() method that checks and correct the requested length (if it is necessary). These methods are used in hfsplus_bnode_read(), hfsplus_bnode_write(), hfsplus_bnode_clear(), hfsplus_bnode_copy(), and hfsplus_bnode_move() with the goal to prevent the access out of allocated memory and triggering the crash. Reported-by: Kun Hu <[email protected]> Reported-by: Jiaji Qin <[email protected]> Reported-by: Shuoran Bai <[email protected]> Signed-off-by: Viacheslav Dubeyko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Viacheslav Dubeyko <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
https://lore.kernel.org/lkml/[email protected]/
Why hasn't this been considered yet?
This should fix a few fringe issues, and reduce bug reports.