Usdt nop ci #10299
Draft: olsajiri wants to merge 134 commits into kernel-patches:bpf_base from olsajiri:usdt_nop_ci
Conversation
Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in bpf_sk_storage.c to obtain better performance when PREEMPT_RCU is not enabled. Signed-off-by: Fushuai Wang <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://patch.msgid.link/[email protected]
Clean up unused function arguments in the check_deadlock* functions. Fixes: 31158ad ("rqspinlock: Add deadlock detection and recovery") Signed-off-by: Siddharth Chintamaneni <[email protected]> Reviewed-by: Eduard Zingerman <[email protected]> Acked-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
Fix the BPF verifier to correctly determine the sleepable context of async callbacks based on the async primitive type rather than the arming program's context. The bug is in in_sleepable() which uses OR logic to check if the current execution context is sleepable. When a sleepable program arms a timer callback, the callback's state correctly has in_sleepable=false, but in_sleepable() would still return true due to env->prog->sleepable being true. This incorrectly allows sleepable helpers like bpf_copy_from_user() inside timer callbacks when armed from sleepable programs, even though timer callbacks always execute in non-sleepable context. Fix in_sleepable() to rely solely on env->cur_state->in_sleepable, and initialize state->in_sleepable to env->prog->sleepable in do_check_common() for the main program entry. This ensures the sleepable context is properly tracked per verification state rather than being overridden by the program's sleepability. The env->cur_state NULL check in in_sleepable() was only needed for do_misc_fixups() which runs after verification when env->cur_state is set to NULL. Update do_misc_fixups() to use env->prog->sleepable directly for the storage_get_function check, and remove the redundant NULL check from in_sleepable(). Introduce is_async_cb_sleepable() helper to explicitly determine async callback sleepability based on the primitive type: - bpf_timer callbacks are never sleepable - bpf_wq and bpf_task_work callbacks are always sleepable Add verifier_bug() check to catch unhandled async callback types, ensuring future additions cannot be silently mishandled. Move the is_task_work_add_kfunc() forward declaration to the top alongside other callback-related helpers. We update push_async_cb() to adjust to the new changes. At the same time, while simplifying in_sleepable(), we notice a problem in do_misc_fixups. Fix storage_get helpers to use GFP_ATOMIC when called from non-sleepable contexts within sleepable programs, such as bpf_timer callbacks. Currently, the check in do_misc_fixups assumes that env->prog->sleepable, previously in_sleepable(env) which only resolved to this check before last commit, holds across the program's execution, but that is not true. Instead, the func_atomic bit must be set whenever we see the function being called in an atomic context. Previously, this is being done when the helper is invoked in atomic contexts in sleepable programs, we can simply just set the value to true without doing an in_sleepable() check. We must also do a standalone in_sleepable() check to handle cases where the async callback itself is armed from a sleepable program, but is itself non-sleepable (e.g., timer callback) and invokes such a helper, thus needing the func_atomic bit to be true for the said call. Adjust do_misc_fixups() to drop any checks regarding sleepable nature of the program, and just depend on the func_atomic bit to decide which GFP flag to pass. Fixes: 81f1d7a ("bpf: wq: add bpf_wq_set_callback_impl") Fixes: b00fa38 ("bpf: Enable non-atomic allocations in local storage") Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
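To make the shape of the new helper concrete, here is a minimal sketch of what is_async_cb_sleepable() could look like. The two is_bpf_*_call() predicates are illustrative stand-ins; only is_task_work_add_kfunc() and verifier_bug() are named by the patch itself.

    static bool is_async_cb_sleepable(struct bpf_verifier_env *env,
				      struct bpf_insn *insn)
    {
	    /* bpf_timer callbacks always run in non-sleepable (softirq) context */
	    if (is_bpf_timer_set_callback_call(insn))
		    return false;

	    /* bpf_wq and bpf_task_work callbacks always run in sleepable context */
	    if (is_bpf_wq_set_callback_impl_call(insn) ||
		is_task_work_add_kfunc(insn->imm))
		    return true;

	    /* any new async primitive must be handled here explicitly */
	    verifier_bug(env, "unhandled async callback type");
	    return false;
    }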
Rename the storage_get_func_atomic flag to a more generic non_sleepable flag that tracks whether a helper or kfunc may be called from a non-sleepable context. This makes the flag more broadly applicable beyond just storage_get helpers. See [0] for more context. The flag is now set unconditionally for all helpers and kfuncs when: - RCU critical section is active. - Preemption is disabled. - IRQs are disabled. - In a non-sleepable context within a sleepable program (e.g., timer callbacks), which is indicated by !in_sleepable(). Previously, the flag was only set for storage_get helpers in these contexts. With this change, it can be used by any code that needs to differentiate between sleepable and non-sleepable contexts at the per-instruction level. The existing usage in do_misc_fixups() for storage_get helpers is preserved by checking is_storage_get_function() before using the flag. [0]: https://lore.kernel.org/bpf/CAP01T76cbaNi4p-y8E0sjE2NXSra2S=Uja8G4hSQDu_SbXxREQ@mail.gmail.com Cc: Mykyta Yatsenko <[email protected]> Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Acked-by: Mykyta Yatsenko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
Add tests to verify that async callback's sleepable attribute is correctly determined by the callback type, not the arming program's context, reflecting its true execution context. Introduce verifier_async_cb_context.c with tests for all three async callback primitives: bpf_timer, bpf_wq, and bpf_task_work. Each primitive is tested when armed from both sleepable (lsm.s/file_open) and non-sleepable (fentry) programs. Test coverage: - bpf_timer callbacks: Verify they are never sleepable, even when armed from sleepable programs. Both tests should fail when attempting to use sleepable helper bpf_copy_from_user() in the callback. - bpf_wq callbacks: Verify they are always sleepable, even when armed from non-sleepable programs. Both tests should succeed when using sleepable helpers in the callback. - bpf_task_work callbacks: Verify they are always sleepable, even when armed from non-sleepable programs. Both tests should succeed when using sleepable helpers in the callback. Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
Kumar Kartikeya Dwivedi says: ==================== Fix sleepable context tracking for async callbacks Currently, asynchronous execution primitives set up their callback execution simulation using push_async_cb, which will end up inheriting the sleepable or non-sleepable bit from the program triggering the simulation of the callback. This is incorrect, as the actual execution context of the asynchronous callback has nothing to do with the program arming its execution. This set fixes this oversight, and supplies a few test cases ensuring the correct behavior is tested across different types of primitives (i.e. timer, wq, task_work). While looking at this bug, it was noticed that the GFP flag setting logic for storage_get helpers is also broken, hence fix it while we are at it. PSA: These fixes and unit tests were primarily produced by prompting an AI assistant (Claude), and then modified in minor ways, in an exercise to understand how useful it can be at general kernel development tasks. Changelog: ---------- v1 -> v2 v1: https://lore.kernel.org/bpf/[email protected] * Squash fix for GFP flags into 1st commit. (Eduard) * Add a commit refactoring func_atomic to non_sleepable, make it generic, also set for kfuncs in addition to helpers. (Eduard) * Leave selftest as-is, coverage for global subprogs calling sleepable kfuncs or helpers is provided in rcu_read_lock.c. ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
The bpf_strcasestr() and bpf_strncasestr() functions behave the same as bpf_strstr() and bpf_strnstr(), except that they ignore the case of the characters. Signed-off-by: Rong Tao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
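A hedged usage sketch from a BPF program follows; the kfunc signature is assumed to mirror bpf_strstr() (returning the offset of the first match, or a negative error), and the program/section name is purely illustrative.

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    /* assumed to mirror the bpf_strstr() kfunc declaration */
    extern int bpf_strcasestr(const char *s1__ign, const char *s2__ign) __ksym __weak;

    char _license[] SEC("license") = "GPL";

    SEC("syscall")
    int find_needle(void *ctx)
    {
	    static const char haystack[] = "Hello BPF";

	    /* matches at offset 6 despite the case difference */
	    return bpf_strcasestr(haystack, "bpf");
    }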
Add tests for new kfuncs bpf_strcasestr() and bpf_strncasestr(). Signed-off-by: Rong Tao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
Rong Tao says: ==================== Add kfuncs bpf_strcasestr and bpf_strncasestr From: Rong Tao <[email protected]> Add kfuncs bpf_strcasestr and bpf_strncasestr, which are extensions of bpf_strstr and bpf_strnstr, suitable for more scenarios. v4: Fix wrong comment. v3: keep __bpf_strnstr() static and compress some tests. https://lore.kernel.org/lkml/[email protected]/ v2: remove extra __bpf_kfunc and fix comment of bpf_strncasestr(). https://lore.kernel.org/all/[email protected]/ v1: https://lore.kernel.org/all/[email protected]/ ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
This bit of refactoring aims to simplify how we free memory in bpf_prog_test_run_skb to avoid code duplication. Signed-off-by: Paul Chaignon <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Tested-by: [email protected] Link: https://patch.msgid.link/8971e01ae87b84f5af6b8b40defd3c310faf1c0f.1760037899.git.paul.chaignon@gmail.com
This patch reorders the initialization of bpf_prog_test_run_skb to simplify the subsequent patch. Program types are checked first, followed by the ctx init, and finally the data init. With the subsequent patch, program types and the ctx init provide information that is used in the data init. Thus, we need the data init to happen last. Signed-off-by: Paul Chaignon <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://patch.msgid.link/063475176f15828a882c07846017394baf72f682.1760037899.git.paul.chaignon@gmail.com
This patch adds support for crafting non-linear skbs in BPF test runs for tc programs. The size of the linear area is given by ctx->data_end, with a minimum of ETH_HLEN always pulled in the linear area. If ctx or ctx->data_end are null, a linear skb is used. This is particularly useful to test support for non-linear skbs in large codebases such as Cilium. We've had multiple bugs in the past few years where we were missing calls to bpf_skb_pull_data(). This support in BPF_PROG_TEST_RUN would allow us to automatically cover this case in our BPF tests. LWT program types are currently excluded in this patch. Allowing non-linear skbs for these programs would require a bit more care because they are able to call helpers (ex., bpf_clone_redirect, bpf_redirect) that themselves call eth_type_trans(). eth_type_trans() assumes there are at least ETH_HLEN bytes in the linear area. That may not be true for LWT programs as we already pulled the L2 header via the eth_type_trans() call in bpf_prog_test_run_skb(). In addition to the selftests introduced later in the series, this patch was tested by enabling non-linear skbs for all tc selftests programs and checking test failures were expected. Suggested-by: Daniel Borkmann <[email protected]> Signed-off-by: Paul Chaignon <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Tested-by: [email protected] Link: https://patch.msgid.link/5694d4d1af31bddf974afcb1bbb1e28b8998dcd0.1760037899.git.paul.chaignon@gmail.com
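For context, a hedged user-space sketch of how a non-linear test skb would be requested under this scheme: only the ctx->data_end convention comes from the patch, the rest is ordinary libbpf test-run usage, and pkt, prog_fd and err are assumed to be set up elsewhere.

    struct __sk_buff ctx = {
	    .data_end = 64,			/* linear area: 64 bytes (>= ETH_HLEN) */
    };
    LIBBPF_OPTS(bpf_test_run_opts, topts,
	    .data_in = pkt,
	    .data_size_in = sizeof(pkt),	/* bytes beyond data_end become paged data */
	    .ctx_in = &ctx,
	    .ctx_size_in = sizeof(ctx),
    );
    err = bpf_prog_test_run_opts(prog_fd, &topts);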
This patch adds support for a new tag __linear_size in the test loader, to specify the size of the linear area in case of non-linear skbs. If the tag is absent or null, a linear skb is crafted. Signed-off-by: Paul Chaignon <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://patch.msgid.link/7ad928ec7591daef4f1b84032aeb86c918b3e5a7.1760037899.git.paul.chaignon@gmail.com
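A hedged illustration of how the tag might be used alongside the existing test_loader annotations; apart from __linear_size() itself, the surrounding macros and the program body are assumptions for the sake of the example.

    SEC("tc")
    __description("direct packet access: non-linear skb")
    __success __retval(0)
    __linear_size(64)	/* craft a non-linear skb with a 64-byte linear area */
    int nonlinear_access(struct __sk_buff *skb)
    {
	    /* ... packet access under test ... */
	    return 0;
    }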
This patch adds new selftests in the direct packet access suite, to cover the non-linear case. The first six tests cover the behavior of the bounds check with a non-linear skb. The last test adds a call to bpf_skb_pull_data() to be able to access the packet. Note that the size of the linear area includes the L2 header, but for some program types like cgroup_skb, ctx->data points to the L3 header. Therefore, a linear area of 22 bytes will have only 8 bytes accessible to the BPF program (22 - ETH_HLEN). For that reason, the cgroup_skb test cases access the packet at an offset of 8 bytes. Signed-off-by: Paul Chaignon <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://patch.msgid.link/ceedbfd719e58f0d49dcceb8592f5e6bd38ce5fe.1760037899.git.paul.chaignon@gmail.com
Paul Chaignon says:
====================
Support non-linear skbs for BPF_PROG_TEST_RUN
This patchset adds support for non-linear skbs when running tc programs
with BPF_PROG_TEST_RUN.
We've had multiple bugs in the past few years in Cilium caused by
missing calls to bpf_skb_pull_data(). Daniel suggested supporting
non-linear skbs in BPF_PROG_TEST_RUN to uncover these bugs in
our BPF tests.
Changes in v8:
- Fix uninitialized data pointer spotted by Martin.
- Error out in test_loader if __linear_size tag is used on unsupported
program types.
Changes in v7:
- Refactor use of 'size' variable as suggested by Martin.
- Support copying back the non-linear area to data_out.
- Minor code changes for readability, suggested by Martin.
Changes in v6:
- Disallow non-linear skb in prog_run_skb only for LWT programs
instead of all non-L2 program types, on suggestion from Martin.
- Reject non-null ctx->data and ctx->data_meta, as suggested by Amery.
- Bound linear_size to 'PAGE_SIZE - headroom - tailroom' to be
consistent with prog_run_xdp, as suggested by Martin.
- Allocate exactly linear_size bytes in bpf_test_init, spotted by
Martin.
- Fix wrong conflict resolution on double-free fix, spotted by Amery.
- Rebased.
Changes in v5:
- Fix double free on data in first patch.
Changes in v4:
- Per Martin's suggestion, follow the XDP code pattern and use
bpf_test_init only to initialize the linear area. That way data is
directly copied to the right areas and we avoid the call to
__pskb_pull_tail.
- Fixed outdated commit descriptions.
- Rebased.
Changes in v3:
- Dropped BPF_F_TEST_SKB_NON_LINEAR and used the ctx->data_end to
determine if the user wants non-linear skb, as suggested by Amery.
- Introduced a second commit with a bit of refactoring to allow for
the above requested change.
- Fix bug found by syzkaller on third commit.
- Rebased.
Changes in v2:
- Made the linear size configurable via ctx->data_end, as suggested
by Amery.
- Reworked the selftests to allow testing the configurable linear
size.
- Fix warnings reported by kernel test robot on first commit.
- Rebased.
====================
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
…pf_wq Fix handling maps with no BTF and non-constant offsets for the bpf_wq. This de-duplicates logic with other internal structs (task_work, timer), keeps error reporting consistent, and makes future changes to the layout handling centralized. Fixes: d940c9b ("bpf: add support for KF_ARG_PTR_TO_WORKQUEUE") Signed-off-by: Mykyta Yatsenko <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
Add bpf_wq selftests to verify: * BPF program using non-constant offset of struct bpf_wq is rejected * BPF program using map with no BTF for storing struct bpf_wq is rejected Signed-off-by: Mykyta Yatsenko <[email protected]> Tested-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
The arraymap and hashtab duplicate the logic that checks for and frees
internal structs (timer, workqueue, task_work) based on
BTF record flags. Centralize this by introducing two helpers:
* bpf_map_has_internal_structs(map)
Returns true if the map value contains any of the internal structs:
BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK.
* bpf_map_free_internal_structs(map, obj)
Frees the internal structs for a single value object.
Convert arraymap and both the prealloc/malloc hashtab paths to use the
new generic functions. This keeps the functionality for when/how to free
these special fields in one place and makes it easier to add support for
new internal structs in the future without touching every map
implementation.
Signed-off-by: Mykyta Yatsenko <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
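A minimal sketch of the two helpers described above, assuming the usual btf_record accessors; bpf_obj_free_timer() and bpf_obj_free_workqueue() are existing kernel helpers, and bpf_obj_free_task_work() is assumed by analogy.

    static bool bpf_map_has_internal_structs(struct bpf_map *map)
    {
	    return btf_record_has_field(map->record,
					BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK);
    }

    static void bpf_map_free_internal_structs(struct bpf_map *map, void *obj)
    {
	    if (btf_record_has_field(map->record, BPF_TIMER))
		    bpf_obj_free_timer(map->record, obj);
	    if (btf_record_has_field(map->record, BPF_WORKQUEUE))
		    bpf_obj_free_workqueue(map->record, obj);
	    if (btf_record_has_field(map->record, BPF_TASK_WORK))
		    bpf_obj_free_task_work(map->record, obj);
    }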
…18-rc1 Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <[email protected]>
We have many places that open-code what is now the bpf_rcu_lock_held() macro, so replace all those places with a clean and short macro invocation. For that, move the bpf_rcu_lock_held() macro into include/linux/bpf.h. Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
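For reference, the macro in question reads roughly as follows (a sketch from memory; the moved definition in include/linux/bpf.h is authoritative):

    #define bpf_rcu_lock_held()						\
	    (rcu_read_lock_held() || rcu_read_lock_trace_held() ||	\
	     rcu_read_lock_bh_held())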
Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <[email protected]>
There are some set-but-not-used build errors when compiling bpf selftests with the latest upstream mainline GCC. The first approach was to add the __maybe_unused attribute to the affected variables, but it is better to just add the option -Wno-unused-but-set-variable to CFLAGS in the Makefile to disable the errors instead of hacking the tests.
tools/testing/selftests/bpf/map_tests/lpm_trie_map_basic_ops.c:229:36: error: variable ‘n_matches_after_delete’ set but not used [-Werror=unused-but-set-variable=]
tools/testing/selftests/bpf/map_tests/lpm_trie_map_basic_ops.c:229:25: error: variable ‘n_matches’ set but not used [-Werror=unused-but-set-variable=]
tools/testing/selftests/bpf/prog_tests/bpf_cookie.c:426:22: error: variable ‘j’ set but not used [-Werror=unused-but-set-variable=]
tools/testing/selftests/bpf/prog_tests/find_vma.c:52:22: error: variable ‘j’ set but not used [-Werror=unused-but-set-variable=]
tools/testing/selftests/bpf/prog_tests/perf_branches.c:67:22: error: variable ‘j’ set but not used [-Werror=unused-but-set-variable=]
tools/testing/selftests/bpf/prog_tests/perf_link.c:15:22: error: variable ‘j’ set but not used [-Werror=unused-but-set-variable=]
Signed-off-by: Tiezhu Yang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
When CONFIG_MEMCG is enabled, we can access mm->owner under RCU. The owner can be NULL. With this change, BPF helpers can safely access mm->owner to retrieve the associated task from the mm. We can then make policy decision based on the task attribute. The typical use case is as follows, bpf_rcu_read_lock(); // rcu lock must be held for rcu trusted field @owner = @mm->owner; // mm_struct::owner is rcu trusted or null if (!@owner) goto out; /* Do something based on the task attribute */ out: bpf_rcu_read_unlock(); Suggested-by: Andrii Nakryiko <[email protected]> Signed-off-by: Yafang Shao <[email protected]> Acked-by: Lorenzo Stoakes <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
The vma->vm_mm might be NULL and it can be accessed outside of RCU. Thus, we can mark it as trusted_or_null. With this change, BPF helpers can safely access vma->vm_mm to retrieve the associated mm_struct from the VMA. Then we can make policy decision from the VMA. The "trusted" annotation enables direct access to vma->vm_mm within kfuncs marked with KF_TRUSTED_ARGS or KF_RCU, such as bpf_task_get_cgroup1() and bpf_task_under_cgroup(). Conversely, "null" enforcement requires all callsites using vma->vm_mm to perform NULL checks. The lsm selftest must be modified because it directly accesses vma->vm_mm without a NULL pointer check; otherwise it will break due to this change. For the VMA based THP policy, the use case is as follows, @mm = @vma->vm_mm; // vm_area_struct::vm_mm is trusted or null if (!@mm) return; bpf_rcu_read_lock(); // rcu lock must be held to dereference the owner @owner = @mm->owner; // mm_struct::owner is rcu trusted or null if (!@owner) goto out; @Cgroup1 = bpf_task_get_cgroup1(@owner, MEMCG_HIERARCHY_ID); /* make the decision based on the @Cgroup1 attribute */ bpf_cgroup_release(@Cgroup1); // release the associated cgroup out: bpf_rcu_read_unlock(); PSI memory information can be obtained from the associated cgroup to inform policy decisions. Since upstream PSI support is currently limited to cgroup v2, the following example demonstrates cgroup v2 implementation: @owner = @mm->owner; if (@owner) { // @ancestor_cgid is user-configured @ancestor = bpf_cgroup_from_id(@ancestor_cgid); if (bpf_task_under_cgroup(@owner, @ancestor)) { @psi_group = @ancestor->psi; /* Extract PSI metrics from @psi_group and * implement policy logic based on the values */ } } The vma::vm_file can also be marked with __safe_trusted_or_null. No additional selftests are required since vma->vm_file and vma->vm_mm are already validated in the existing selftest suite. Signed-off-by: Yafang Shao <[email protected]> Acked-by: Lorenzo Stoakes <[email protected]> Cc: "Liam R. Howlett" <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
Yafang Shao says: ==================== These two minor patches were developed during the implementation of BPF-THP: https://lwn.net/Articles/1042138/ As suggested by Andrii, they are being submitted separately. ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
With latest llvm22, I hit the verif_scale_strobemeta selftest failure below: $ ./test_progs -n 618 libbpf: prog 'on_event': BPF program load failed: -E2BIG libbpf: prog 'on_event': -- BEGIN PROG LOAD LOG -- BPF program is too large. Processed 1000001 insn verification time 7019091 usec stack depth 488 processed 1000001 insns (limit 1000000) max_states_per_insn 28 total_states 33927 peak_states 12813 mark_read 0 -- END PROG LOAD LOG -- libbpf: prog 'on_event': failed to load: -E2BIG libbpf: failed to load object 'strobemeta.bpf.o' scale_test:FAIL:expect_success unexpected error: -7 (errno 7) kernel-patches#618 verif_scale_strobemeta:FAIL But if I increase the verificaiton insn limit from 1M to 10M, the above test_progs run actually will succeed. The below is the result from veristat: $ ./veristat strobemeta.bpf.o Processing 'strobemeta.bpf.o'... File Program Verdict Duration (us) Insns States Program size Jited size ---------------- -------- ------- ------------- ------- ------ ------------ ---------- strobemeta.bpf.o on_event success 90250893 9777685 358230 15954 80794 ---------------- -------- ------- ------------- ------- ------ ------------ ---------- Done. Processed 1 files, 0 programs. Skipped 1 files, 0 programs. Further debugging shows the llvm commit [1] is responsible for the verificaiton failure as it tries to convert certain switch statement to if-condition. Such change may cause different transformation compared to original switch statement. In bpf program strobemeta.c case, the initial llvm ir for read_int_var() function is define internal void @read_int_var(ptr noundef %0, i64 noundef %1, ptr noundef %2, ptr noundef %3, ptr noundef %4) kernel-patches#2 !dbg !535 { %6 = alloca ptr, align 8 %7 = alloca i64, align 8 %8 = alloca ptr, align 8 %9 = alloca ptr, align 8 %10 = alloca ptr, align 8 %11 = alloca ptr, align 8 %12 = alloca i32, align 4 ... %20 = icmp ne ptr %19, null, !dbg !561 br i1 %20, label %22, label %21, !dbg !562 21: ; preds = %5 store i32 1, ptr %12, align 4 br label %48, !dbg !563 22: %23 = load ptr, ptr %9, align 8, !dbg !564 ... 47: ; preds = %38, %22 store i32 0, ptr %12, align 4, !dbg !588 br label %48, !dbg !588 48: ; preds = %47, %21 call void @llvm.lifetime.end.p0(ptr %11) kernel-patches#4, !dbg !588 %49 = load i32, ptr %12, align 4 switch i32 %49, label %51 [ i32 0, label %50 i32 1, label %50 ] 50: ; preds = %48, %48 ret void, !dbg !589 51: ; preds = %48 unreachable } Note that the above 'switch' statement is added by clang frontend. Without [1], the switch statement will survive until SelectionDag, so the switch statement acts like a 'barrier' and prevents some transformation involved with both 'before' and 'after' the switch statement. But with [1], the switch statement will be removed during middle end optimization and later middle end passes (esp. after inlining) have more freedom to reorder the code. The following is the related source code: static void *calc_location(struct strobe_value_loc *loc, void *tls_base): bpf_probe_read_user(&tls_ptr, sizeof(void *), dtv); /* if pointer has (void *)-1 value, then TLS wasn't initialized yet */ return tls_ptr && tls_ptr != (void *)-1 ? tls_ptr + tls_index.offset : NULL; In read_int_var() func, we have: void *location = calc_location(&cfg->int_locs[idx], tls_base); if (!location) return; bpf_probe_read_user(value, sizeof(struct strobe_value_generic), location); ... The static func calc_location() is called inside read_int_var(). 
The asm code without [1]: 77: .123....89 (85) call bpf_probe_read_user#112 78: ........89 (79) r1 = *(u64 *)(r10 -368) 79: .1......89 (79) r2 = *(u64 *)(r10 -8) 80: .12.....89 (bf) r3 = r2 81: .123....89 (0f) r3 += r1 82: ..23....89 (07) r2 += 1 83: ..23....89 (79) r4 = *(u64 *)(r10 -464) 84: ..234...89 (a5) if r2 < 0x2 goto pc+13 85: ...34...89 (15) if r3 == 0x0 goto pc+12 86: ...3....89 (bf) r1 = r10 87: .1.3....89 (07) r1 += -400 88: .1.3....89 (b4) w2 = 16 In this case, 'r2 < 0x2' and 'r3 == 0x0' go to null 'locaiton' place, so the verifier actually prefers to do verification first at 'r1 = r10' etc. The asm code with [1]: 119: .123....89 (85) call bpf_probe_read_user#112 120: ........89 (79) r1 = *(u64 *)(r10 -368) 121: .1......89 (79) r2 = *(u64 *)(r10 -8) 122: .12.....89 (bf) r3 = r2 123: .123....89 (0f) r3 += r1 124: ..23....89 (07) r2 += -1 125: ..23....89 (a5) if r2 < 0xfffffffe goto pc+6 126: ........89 (05) goto pc+17 ... 144: ........89 (b4) w1 = 0 145: .1......89 (6b) *(u16 *)(r8 +80) = r1 In this case, if 'r2 < 0xfffffffe' is true, the control will go to non-null 'location' branch, so 'goto pc+17' will actually go to null 'location' branch. This seems causing tremendous amount of verificaiton state. To fix the issue, rewrite the following code return tls_ptr && tls_ptr != (void *)-1 ? tls_ptr + tls_index.offset : NULL; to if/then statement and hopefully these explicit if/then statements are sticky during middle-end optimizations. Test with llvm20 and llvm21 as well and all strobemeta related selftests are passed. [1] llvm/llvm-project#161000 Signed-off-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
do_hbm_test.sh: The comment incorrectly used "upcomming" instead of "upcoming". hbm.c The comment incorrectly used "Managment" instead of "Management". The comment incorrectly used "Currrently" instead of "Currently". tcp_cong_kern.c The comment incorrectly used "deteremined" instead of "determined". tracex1.bpf.c The comment incorrectly used "loobpack" instead of "loopback". Signed-off-by: Chu Guangqing <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
The __list_del function doesn't set the previous node's next pointer to the next node of the node to be deleted. It just updates a local variable and not the actual pointer in the previous node. The test was passing until now because the bpf code does bpf_free() after list_del, and therefore reading head->first from userspace will read all zeroes. But after arena_list_del() is finished, head->first should point to NULL. Signed-off-by: Puranjay Mohan <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
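A hedged sketch of the fix, assuming an hlist-style node with a pprev back-pointer as used by the arena list; the names and __arena annotations are assumptions, and the selftest's literal code may differ. The essential point is that the previous link is written through, not just copied into a local variable.

    static inline void __list_del(arena_list_node_t __arena *n)
    {
	    arena_list_node_t __arena *next = n->next;
	    arena_list_node_t __arena * __arena *pprev = n->pprev;

	    *pprev = next;		/* write through the previous node's (or head's) link */
	    if (next)
		    next->pprev = pprev;
    }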
Yinhao et al. reported that their fuzzer tool was able to trigger a skb_warn_bad_offload() from netif_skb_features() -> gso_features_check(). When a BPF program - triggered via BPF test infra - pushes the packet to the loopback device via bpf_clone_redirect() then mentioned offload warning can be seen. GSO-related features are then rightfully disabled. We get into this situation due to convert___skb_to_skb() setting gso_segs and gso_size but not gso_type. Technically, it makes sense that this warning triggers since the GSO properties are malformed due to the gso_type. Potentially, the gso_type could be marked non-trustworthy through setting it at least to SKB_GSO_DODGY without any other specific assumptions, but that also feels wrong given we should not go further into the GSO engine in the first place. The checks were added in 121d57a ("gso: validate gso_type in GSO handlers") because there were malicious (syzbot) senders that combine a protocol with a non-matching gso_type. If we would want to drop such packets, gso_features_check() currently only returns feature flags via netif_skb_features(), so one location for potentially dropping such skbs could be validate_xmit_unreadable_skb(), but then otoh it would be an additional check in the fast-path for a very corner case. Given bpf_clone_redirect() is the only place where BPF test infra could emit such packets, lets reject them right there. Fixes: 850a88c ("bpf: Expose __sk_buff wire_len/gso_segs to BPF_PROG_TEST_RUN") Fixes: cf62089 ("bpf: Add gso_size to __sk_buff") Reported-by: Yinhao Hu <[email protected]> Reported-by: Kaiyan Mei <[email protected]> Reported-by: Dongliang Mu <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://patch.msgid.link/[email protected]
…ance() propagate_to_outer_instance() calls get_outer_instance() and uses the returned pointer to reset and commit stack write marks. Under normal conditions, update_instance() guarantees that an outer instance exists, so get_outer_instance() cannot return an ERR_PTR. However, explicitly checking for IS_ERR(outer_instance) makes this code more robust and self-documenting. It reduces cognitive load when reading the control flow and silences potential false-positive reports from static analysis or automated tooling. No functional change intended. Signed-off-by: Shardul Bankar <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
Some tests have to stop/start a server multiple times with the same listening address. Doing so without SO_REUSEADDR leads to failures due to the socket still being in TIME_WAIT right after the first instance stops/before the second instance starts. Instead of letting each test manually set SO_REUSEADDR on its servers, it can be done automatically by start_server_addr for all tests (and without any major downside). Enforce SO_REUSEADDR in start_server_addr for all tests. Signed-off-by: Alexis Lothoré (eBPF Foundation) <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://patch.msgid.link/[email protected]
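The change amounts to something like the following inside the common server-startup helper, before bind() (a sketch; the error-handling label and log_err() usage are assumptions about the helper's surroundings):

    int on = 1;

    /* Allow immediate reuse of the address even if a previous test server's
     * socket is still in TIME_WAIT. */
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0) {
	    log_err("Failed to set SO_REUSEADDR");
	    goto error_close;
    }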
…r in tc_tunnel Now that start_server_str enforces SO_REUSEADDR, there's no need to keep using start_reusport_server in tc_tunnel, especially since it only uses one server at a time. Replace start_reuseport_server with start_server_str in tc_tunnel test. Signed-off-by: Alexis Lothoré (eBPF Foundation) <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://patch.msgid.link/[email protected]
Alexis Lothoré says: ==================== This small series is another follow-up to [1], in which I misunderstood Martin's initial feedback (see [2]). I proposed to make tc-tunnel apply SO_REUSEPORT once server is brought up. This series updates start_server_addr to really apply Martin's proposal after his clarification [3] [1] https://lore.kernel.org/bpf/[email protected]/ [2] https://lore.kernel.org/bpf/[email protected]/ [3] https://lore.kernel.org/bpf/[email protected]/ ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
The range tree uses bpf_mem_alloc() that is safe to be called from all contexts and uses a pre-allocated pool of memory to serve these allocations. Replace bpf_mem_alloc() with kmalloc_nolock() as it can be called safely from all contexts and is more scalable than bpf_mem_alloc(). Remove the migrate_disable/enable pairs as they were only needed for bpf_mem_alloc() as it does per-cpu operations, kmalloc_nolock() doesn't need this. Signed-off-by: Puranjay Mohan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
Return -ETIMEDOUT whenever non-head waiters are signalled by head, and fix oversight in commit 7bd6e5c ("rqspinlock: Disable queue destruction for deadlocks"). We no longer signal on deadlocks. Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Reviewed-by: Amery Hung <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
As [lru_,]percpu_hash maps support BPF_KPTR_{REF,PERCPU}, missing
calls to 'bpf_obj_free_fields()' in 'pcpu_copy_value()' could cause the
memory referenced by BPF_KPTR_{REF,PERCPU} fields to be held until the
map gets freed.
Fix this by calling 'bpf_obj_free_fields()' after
'copy_map_value[,_long]()' in 'pcpu_copy_value()'.
Fixes: 65334e6 ("bpf: Support kptrs in percpu hashmap and percpu LRU hashmap")
Signed-off-by: Leon Hwang <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
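A simplified sketch of the fixed path, assuming the kernel's pcpu_copy_value() shape; the onallcpus branch with copy_map_value_long() is elided.

    static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
				void *value, bool onallcpus)
    {
	    void *ptr;

	    if (!onallcpus) {
		    ptr = this_cpu_ptr(pptr);
		    copy_map_value(&htab->map, ptr, value);
		    /* drop references held by the old value's kptr fields */
		    bpf_obj_free_fields(htab->map.record, ptr);
	    } else {
		    /* per-CPU loop: copy_map_value_long() + bpf_obj_free_fields(),
		     * elided for brevity */
	    }
    }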
…maps Add test to verify that updating [lru_,]percpu_hash maps decrements refcount when BPF_KPTR_REF objects are involved. The tests perform the following steps: . Call update_elem() to insert an initial value. . Use bpf_refcount_acquire() to increment the refcount. . Store the node pointer in the map value. . Add the node to a linked list. . Probe-read the refcount and verify it is *2*. . Call update_elem() again to trigger refcount decrement. . Probe-read the refcount and verify it is *1*. Signed-off-by: Leon Hwang <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
Leon Hwang says: ==================== In the discussion thread "[PATCH bpf-next v9 0/7] bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags for percpu maps"[1], it was pointed out that missing calls to bpf_obj_free_fields() could lead to memory leaks. A selftest was added to confirm that this is indeed a real issue - the refcount of BPF_KPTR_REF field is not decremented when bpf_obj_free_fields() is missing after copy_map_value[,_long](). Further inspection of copy_map_value[,_long]() call sites revealed two locations affected by this issue: 1. pcpu_copy_value() 2. htab_map_update_elem() when used with BPF_F_LOCK Similar case happens when update local storage maps with BPF_F_LOCK. This series fixes the cases where BPF_F_LOCK is not involved by properly calling bpf_obj_free_fields() after copy_map_value[,_long](), and adds a selftest to verify the fix. The remaining cases involving BPF_F_LOCK will be addressed in a separate patch set after the series "bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags for percpu maps" is applied. Changes: v5 -> v6: * Update the test name to include "refcounted_kptr". * Update some local variables' name in the test (per Alexei). * v5: https://lore.kernel.org/bpf/[email protected]/ v4 -> v5: * Use a local variable to store the this_cpu_ptr()/per_cpu_ptr() result, and reuse it between copy_map_value[,_long]() and bpf_obj_free_fields() in patch kernel-patches#1 (per Andrii). * Drop patch kernel-patches#2 and kernel-patches#3, because the combination of BPF_F_LOCK with other special fields (except for BPF_SPIN_LOCK) will be disallowed on the UAPI side in the future (per Alexei). * v4: https://lore.kernel.org/bpf/[email protected]/ v3 -> v4: * Target bpf-next tree. * Address comments from Amery: * Drop 'bpf_obj_free_fields()' in the path of updating local storage maps without BPF_F_LOCK. * Drop the corresponding self test. * Respin the other test of local storage maps using syscall BPF programs. * v3: https://lore.kernel.org/bpf/[email protected]/ v2 -> v3: * Free special fields when update local storage maps without BPF_F_LOCK. * Add test to verify decrementing refcount when update cgroup local storage maps without BPF_F_LOCK. * Address review from AI bot: * Slow path with BPF_F_LOCK (around line 642-646) in 'bpf_local_storage.c'. * v2: https://lore.kernel.org/bpf/[email protected]/ v1 -> v2: * Add test to verify decrementing refcount when update cgroup local storage maps with BPF_F_LOCK. * Address review from AI bot: * Fast path without bucket lock (around line 610) in 'bpf_local_storage.c'. * v1: https://lore.kernel.org/bpf/[email protected]/ ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
Executing the test_maps binary on platforms with extremely high core counts may cause intermittent assertion failures in test_update_delete() (called via test_map_parallel()). This can occur because bpf_map_update_elem() under some circumstances (specifically in this case while performing bpf_map_update_elem() with BPF_NOEXIST on a BPF_MAP_TYPE_HASH with its map_flags set to BPF_F_NO_PREALLOC) can return an E2BIG error code i.e. error -7 7 tools/testing/selftests/bpf/test_maps.c:#: void test_update_delete(unsigned int, void *): Assertion `err == 0' failed. tools/testing/selftests/bpf/test_maps.c:#: void __run_parallel(unsigned int, void (*)(unsigned int, void *), void *): Assertion `status == 0' failed. As it turns out, is_map_full() which is called from alloc_htab_elem() can take on a conservative approach when htab->use_percpu_counter is true (which is the case here because the percpu_counter is used when a BPF_MAP_TYPE_HASH is created with its map_flags set to BPF_F_NO_PREALLOC). This conservative approach prioritizes preventing over-allocation and potential issues that could arise from possibly exceeding htab->map.max_entries in highly concurrent environments, even if it means slightly under-utilizing the htab map's capacity. Given that bpf_map_update_elem() from test_update_delete() can return E2BIG, update can_retry() such that it also accounts for the E2BIG error code (specifically only when running with map_flags being set to BPF_F_NO_PREALLOC). The retry loop will allow the global count belonging to the percpu_counter to become synchronized and better reflect the current htab map's capacity. Signed-off-by: Matt Bobrowski <[email protected]> Acked-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
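A hedged sketch of the selftest-side change; the exact helper shape in test_maps.c and the map_flags global are assumptions.

    static bool can_retry(int err)
    {
	    /* EAGAIN has always been retryable for parallel updates */
	    if (err == EAGAIN)
		    return true;

	    /* With BPF_F_NO_PREALLOC, is_map_full() is conservative and may
	     * transiently report E2BIG on high core counts; retry until the
	     * per-CPU counter settles. */
	    if (err == E2BIG && (map_flags & BPF_F_NO_PREALLOC))
		    return true;

	    return false;
    }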
bpf_try_get_buffers() returns one of multiple per-CPU buffers based on a per-CPU nesting counter. This mechanism expects that buffers are not endlessly acquired before being returned. migrate_disable() ensures that a task remains on the same CPU, but it does not prevent the task from being preempted by another task on that CPU. Without disabled preemption, a task may be preempted while holding a buffer, allowing another task to run on same CPU and acquire an additional buffer. Several such preemptions can cause the per-CPU nest counter to exceed MAX_BPRINTF_NEST_LEVEL and trigger the warning in bpf_try_get_buffers(). Adding preempt_disable()/preempt_enable() around buffer acquisition and release prevents this task preemption and preserves the intended bounded nesting behavior. Reported-by: [email protected] Closes: https://lore.kernel.org/all/[email protected]/ Fixes: 4223bf8 ("bpf: Remove preempt_disable in bpf_try_get_buffers") Suggested-by: Yonghong Song <[email protected]> Reviewed-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Sahil Chandna <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
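A hedged sketch of the acquire side; the per-CPU variable names and buffer layout are assumptions, and the release side is assumed to do the mirror this_cpu_dec() followed by preempt_enable().

    static int bpf_try_get_buffers(struct bpf_bprintf_buffers **bufs)
    {
	    int nest_level;

	    preempt_disable();
	    nest_level = this_cpu_inc_return(bpf_bprintf_nest_level);
	    if (WARN_ON_ONCE(nest_level > MAX_BPRINTF_NEST_LEVEL)) {
		    this_cpu_dec(bpf_bprintf_nest_level);
		    preempt_enable();
		    return -EBUSY;
	    }
	    *bufs = this_cpu_ptr(&bpf_bprintf_bufs[nest_level - 1]);

	    return 0;	/* preemption stays disabled until the matching put */
    }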
There are a few places where log level is not checked before calling "verbose()". This forces programs working only at BPF_LOG_LEVEL_STATS (e.g. veristat) to allocate unnecessarily large log buffers. Add missing checks. Reported-by: Emil Tsalapatis <[email protected]> Signed-off-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
The error that returned by ftrace_set_filter_ip() in register_fentry() is not handled properly. Just fix it. Fixes: 00963a2 ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)") Signed-off-by: Menglong Dong <[email protected]> Acked-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
Increase arena test coverage. Convert glob_match() to bpf arena in two steps: 1. Copy paste lib/glob.c into bpf_arena_strsearch.h Copy paste lib/globtests.c into progs/arena_strsearch.c 2. Add __arena to pointers Add __arg_arena to global functions that accept arena pointers Add cond_break to loops The test also serves as a good example of what's possible with bpf arena and how existing algorithms can be converted. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
When test_send_signal_kern__open_and_load() fails, the parent closes the pipe, which causes ASSERT_EQ(read(pipe_p2c...)) to fail, but the child continues and enters an infinite loop while the parent is stuck in wait(NULL). Other error paths have a similar issue, so kill the child before waiting on it. The bug was discovered while compiling all of the selftests with -O1 instead of -O2, which caused progs/test_send_signal_kern.c to fail to load. Fixes: ab8b7f0 ("tools/bpf: Add self tests for bpf_send_signal_thread()") Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
Handle recursive typedefs in BTF deduplication Pahole fails to encode BTF for some Go projects (e.g. Kubernetes and Podman) due to recursive type definitions that create reference loops not representable in C. These recursive typedefs trigger a failure in the BTF deduplication algorithm. This patch extends btf_dedup_ref_type() to properly handle potential recursion for BTF_KIND_TYPEDEF, similar to how recursion is already handled for BTF_KIND_STRUCT. This allows pahole to successfully generate BTF for Go binaries using recursive types without impacting existing C-based workflows. Suggested-by: Tristan d'Audibert <[email protected]> Co-developed-by: Martin Horth <[email protected]> Co-developed-by: Ouail Derghal <[email protected]> Co-developed-by: Guilhem Jazeron <[email protected]> Co-developed-by: Ludovic Paillat <[email protected]> Co-developed-by: Robin Theveniaut <[email protected]> Signed-off-by: Martin Horth <[email protected]> Signed-off-by: Ouail Derghal <[email protected]> Signed-off-by: Guilhem Jazeron <[email protected]> Signed-off-by: Ludovic Paillat <[email protected]> Signed-off-by: Robin Theveniaut <[email protected]> Signed-off-by: Paul Houssel <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/bpf/bf00857b1e06f282aac12f6834de7396a7547ba6.1763037045.git.paul.houssel@orange.com
Add several ./test_progs tests:
1. btf/dedup:recursive typedef ensures that deduplication no
longer fails on recursive typedefs.
2. btf/dedup:typedef ensures that typedefs are deduplicated correctly
just as they were before this patch.
Signed-off-by: Paul Houssel <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/bpf/9fac2f744089f6090257d4c881914b79f6cd6c6a.1763037045.git.paul.houssel@orange.com
Paul Houssel says: ==================== libbpf: fix BTF dedup to support recursive typedef Pahole fails to encode BTF for some Go projects (e.g. Kubernetes and Podman) due to recursive type definitions that create reference loops not representable in C. These recursive typedefs trigger a failure in the BTF deduplication algorithm. This patch extends btf_dedup_struct_types() to properly handle potential recursion for BTF_KIND_TYPEDEF, similar to how recursion is already handled for BTF_KIND_STRUCT. This allows pahole to successfully generate BTF for Go binaries using recursive types without impacting existing C-based workflows. Changes in v4: fix typo found by Claude-based CI Changes in v3: 1. Patch 1: Adjusted the comment of btf_dedup_ref_type() to refer to typedef as well. 2. Patch 2: Update of the "dedup: recursive typedef" test to include a duplicated version of the types to make sure deduplication still happens in this case. Changes in v2: 1. Patch 1: Refactored code to prevent copying existing logic. Instead of adding a new function we modify the existing btf_dedup_struct_type() function to handle the BTF_KIND_TYPEDEF case. Calls to btf_hash_struct() and btf_shallow_equal_struct() are replaced with calls to functions that select btf_hash_struct() / btf_hash_typedef() based on the type. 2. Patch 2: Added tests v3: https://lore.kernel.org/lkml/[email protected]/ v2: https://lore.kernel.org/lkml/[email protected]/ v1: https://lore.kernel.org/lkml/[email protected]/ ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Andrii Nakryiko <[email protected]>
…8-rc5+ Cross-merge BPF and other fixes after downstream PR. Minor conflict in kernel/bpf/helpers.c Signed-off-by: Alexei Starovoitov <[email protected]>
bpf_task_work_schedule_resume() and bpf_task_work_schedule_signal() have been renamed in the bpf tree to bpf_task_work_schedule_resume_impl() and bpf_task_work_schedule_signal_impl() accordingly. There are a few uses of these kfuncs in selftests that are not in the bpf tree, so when we port [1] into bpf-next, those BPF programs will not compile. This patch aligns those remaining callsites with the kfunc renaming. It should go on top of [1] when applying on bpf-next. 1: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Mykyta Yatsenko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
Metadata about a kfunc call is added to the kfunc_tab in add_kfunc_call() but the call instruction itself could get removed by opt_remove_dead_code() later if it is not reachable. If the call instruction is removed, specialize_kfunc() is never called for it and the desc->imm in the kfunc_tab is never initialized for this kfunc call. In this case, sort_kfunc_descs_by_imm_off(env->prog); in do_misc_fixups() doesn't sort the table correctly. This is a problem for s390 as its JIT uses this table to find the addresses for kfuncs, and if this table is not sorted properly, JIT may fail to find addresses for valid kfunc calls. This was exposed by: commit d869d56 ("bpf: verifier: refactor kfunc specialization") as before this commit, desc->imm was initialised in add_kfunc_call() which happens before dead code elimination. Move desc->imm setup down to sort_kfunc_descs_by_imm_off(), this fixes the problem and also saves us from having the same logic in add_kfunc_call() and specialize_kfunc(). Suggested-by: Eduard Zingerman <[email protected]> Signed-off-by: Puranjay Mohan <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
The bpf_skb_check_mtu helper needs to use skb->transport_header when the BPF_MTU_CHK_SEGS flag is used: bpf_skb_check_mtu(skb, ifindex, &mtu_len, 0, BPF_MTU_CHK_SEGS) The transport_header is not always set. There is a WARN_ON_ONCE report when CONFIG_DEBUG_NET is enabled + skb->gso_size is set + bpf_prog_test_run is used: WARNING: CPU: 1 PID: 2216 at ./include/linux/skbuff.h:3071 skb_gso_validate_network_len bpf_skb_check_mtu bpf_prog_3920e25740a41171_tc_chk_segs_flag # A test in the next patch bpf_test_run bpf_prog_test_run_skb For a normal ingress skb (not test_run), skb_reset_transport_header is performed but there is plan to avoid setting it as described in commit 2170a1f ("net: no longer reset transport_header in __netif_receive_skb_core()"). This patch fixes the bpf helper by checking skb_transport_header_was_set(). The check is done just before skb->transport_header is used, to avoid breaking the existing bpf prog. The WARN_ON_ONCE is limited to bpf_prog_test_run, so targeting bpf-next. Fixes: 34b2021 ("bpf: Add BPF-helper for MTU checking") Cc: Jesper Dangaard Brouer <[email protected]> Reported-by: Kaiyan Mei <[email protected]> Reported-by: Yinhao Hu <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
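The fix boils down to a guard of this shape inside the helper's BPF_MTU_CHK_SEGS path (a hedged sketch; the surrounding logic of the real helper is abbreviated):

    if (flags & BPF_MTU_CHK_SEGS) {
	    /* The GSO path below reads skb->transport_header; reject the call
	     * if it was never set (e.g. for some BPF_PROG_TEST_RUN skbs). */
	    if (!skb_transport_header_was_set(skb))
		    return -EINVAL;

	    if (skb_is_gso(skb) &&
		!skb_gso_validate_network_len(skb, mtu))
		    ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;
    }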
…t_header is not set Add a test to check that bpf_skb_check_mtu(BPF_MTU_CHK_SEGS) is rejected (-EINVAL) if skb->transport_header is not set. The test needs to lower the MTU of the loopback device. Thus, take this opportunity to run the test in a netns by adding "ns_" to the test name. The "serial_" prefix can then be removed. Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
If xlated_prog_insns should not be exposed, other information (such as func_info) still can and should be filled in. Therefore, instead of directly terminating in this case, continue with the normal flow. Signed-off-by: Max Altgelt <[email protected]> Link: https://lore.kernel.org/r/efd00fcec5e3e247af551632726e2a90c105fbd8.camel@nextron-systems.com Signed-off-by: Alexei Starovoitov <[email protected]>
Syzkaller triggers an invalid memory access issue following fault
injection in update_effective_progs. The issue can be described as
follows:
__cgroup_bpf_detach
update_effective_progs
compute_effective_progs
bpf_prog_array_alloc <-- fault inject
purge_effective_progs
/* change to dummy_bpf_prog */
array->items[index] = &dummy_bpf_prog.prog
---softirq start---
__do_softirq
...
__cgroup_bpf_run_filter_skb
__bpf_prog_run_save_cb
bpf_prog_run
stats = this_cpu_ptr(prog->stats)
/* invalid memory access */
flags = u64_stats_update_begin_irqsave(&stats->syncp)
---softirq end---
static_branch_dec(&cgroup_bpf_enabled_key[atype])
The reason is that fault injection caused update_effective_progs to fail
and then changed the original prog into dummy_bpf_prog.prog in
purge_effective_progs. Then a softirq came, and accessing the members of
dummy_bpf_prog.prog in the softirq triggers invalid mem access.
To fix it, skip updating stats when stats is NULL.
Fixes: 492ecee ("bpf: enable program stats")
Signed-off-by: Pu Lehui <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
We can currently optimize uprobes on top of nop5 instructions, so an application can define USDT_NOP to nop5 and use the USDT macro to define optimized usdt probes. This works fine on new kernels, but could incur a performance penalty on older kernels that do not have the support to optimize and emulate the nop5 instruction.
Execution of the usdt probe on top of nop:
- nop -> trigger usdt -> emulate nop -> continue
Execution of the usdt probe on top of nop5:
- nop5 -> trigger usdt -> single step nop5 -> continue
Note the 'single step nop5' as the source of the performance regression. To work around that we change the USDT macro to emit nop,nop5 for the probe (instead of the default nop) and make a record of that in the USDT record (more on that below). This can be detected by the application (libbpf), which can place the uprobe either on the nop or the nop5 based on the optimization support in the kernel.
We make a record of using the nop,nop5 instructions in the USDT ELF note data. The current ELF note format is as follows:
namesz (4B) | descsz (4B) | type (4B) | name | desc
And the current usdt record (with "stapsdt" name) placed in the note's desc data looks like:
loc_addr | 8 bytes
base_addr | 8 bytes
sema_addr | 8 bytes
provider | zero terminated string
name | zero terminated string
args | zero terminated string
None of the tested parsers (bpftrace-bcc, libbpf) checked that the args zero termination byte is the actual end of the 'desc' data. As Andrii suggested, we can use this and place an extra zero byte right there as an indication for the parser that we use the nop,nop5 instructions. It's a bit tricky, but the other way would be to introduce a new ELF note type or note name and change all existing parsers to recognize it. With the change above, existing parsers still recognize such usdt probes. Note we do not emit this extra byte if the app defined its own nop through the USDT_NOP macro.
Suggested-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
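A hedged parser-side sketch of the detection described above: after the args string's own NUL terminator, one spare zero byte before the end of the note's desc marks a nop,nop5 probe. The function name and exact parameters are illustrative, not libbpf's actual code.

    /* desc points at the start of the note's descriptor, desc_sz is its size,
     * and args points at the (already located) args string inside desc. */
    static bool usdt_note_has_nop5(const char *desc, size_t desc_sz,
				   const char *args)
    {
	    size_t args_end = (size_t)(args - desc) + strlen(args) + 1;

	    return args_end < desc_sz && desc[args_end] == '\0';
    }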
Adding uprobe syscall feature detection that will be used in the following changes. Signed-off-by: Jiri Olsa <[email protected]>
Adding support to parse extra info in the usdt note record that indicates there's a nop,nop5 combo emitted for the probe. We detect this by checking for an extra zero byte placed between the args zero termination byte and the end of the desc data. Please see [1] for more details. Together with the uprobe syscall feature detection we can decide if we want to place the probe on top of the nop or the nop5. [1] https://github.com/libbpf/usdt Signed-off-by: Jiri Olsa <[email protected]>
Adding a test that attaches a bpf program to a usdt probe in two scenarios:
- attach the program on top of usdt_1, which is a standard nop probe incidentally followed by a nop5. The usdt probe does not have extra data in the ELF note record, so we expect the probe to land on the first nop without being optimized.
- attach the program on top of usdt_2, which is a probe defined on top of the nop,nop5 combo. The extra data in the ELF note record and the presence of the uprobe syscall ensure that the probe is placed on top of the nop5 and optimized.
Signed-off-by: Jiri Olsa <[email protected]>