
Fushuai Wang and others added 30 commits October 7, 2025 11:53
Use rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate() in
bpf_sk_storage.c to obtain better performance when PREEMPT_RCU is
not enabled.

Signed-off-by: Fushuai Wang <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Link: https://patch.msgid.link/[email protected]
Clean up unused function args in check_deadlock* functions.

Fixes: 31158ad ("rqspinlock: Add deadlock detection and recovery")
Signed-off-by: Siddharth Chintamaneni <[email protected]>
Reviewed-by: Eduard Zingerman <[email protected]>
Acked-by: Kumar Kartikeya Dwivedi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Fix the BPF verifier to correctly determine the sleepable context of
async callbacks based on the async primitive type rather than the arming
program's context.

The bug is in in_sleepable() which uses OR logic to check if the current
execution context is sleepable. When a sleepable program arms a timer
callback, the callback's state correctly has in_sleepable=false, but
in_sleepable() would still return true due to env->prog->sleepable being
true. This incorrectly allows sleepable helpers like
bpf_copy_from_user() inside timer callbacks when armed from sleepable
programs, even though timer callbacks always execute in non-sleepable
context.

Fix in_sleepable() to rely solely on env->cur_state->in_sleepable, and
initialize state->in_sleepable to env->prog->sleepable in
do_check_common() for the main program entry. This ensures the sleepable
context is properly tracked per verification state rather than being
overridden by the program's sleepability.

The env->cur_state NULL check in in_sleepable() was only needed for
do_misc_fixups() which runs after verification when env->cur_state is
set to NULL. Update do_misc_fixups() to use env->prog->sleepable
directly for the storage_get_function check, and remove the redundant
NULL check from in_sleepable().
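For reference, a before/after sketch of in_sleepable() (the "before"
matches the OR logic described above; the real code may differ in
minor details):

  /* before: program sleepability leaks into callback states */
  static bool in_sleepable(const struct bpf_verifier_env *env)
  {
          return env->prog->sleepable ||
                 (env->cur_state && env->cur_state->in_sleepable);
  }

  /* after: rely solely on per-state tracking; state->in_sleepable is
   * seeded from env->prog->sleepable in do_check_common()
   */
  static bool in_sleepable(const struct bpf_verifier_env *env)
  {
          return env->cur_state->in_sleepable;
  }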

Introduce is_async_cb_sleepable() helper to explicitly determine async
callback sleepability based on the primitive type:
  - bpf_timer callbacks are never sleepable
  - bpf_wq and bpf_task_work callbacks are always sleepable

Add verifier_bug() check to catch unhandled async callback types,
ensuring future additions cannot be silently mishandled. Move the
is_task_work_add_kfunc() forward declaration to the top alongside other
callback-related helpers. push_async_cb() is updated to match the new
logic (see the sketch below).
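A hedged sketch of the new helper (the timer and wq predicate names
below are assumptions; only is_task_work_add_kfunc() and
verifier_bug() are named in this change):

  static bool is_async_cb_sleepable(struct bpf_verifier_env *env,
                                    struct bpf_insn *insn)
  {
          /* bpf_timer callbacks always run in non-sleepable context */
          if (is_bpf_timer_set_callback_helper(insn))
                  return false;
          /* bpf_wq and bpf_task_work callbacks are always sleepable */
          if (is_bpf_wq_set_callback_impl_kfunc(insn->imm) ||
              is_task_work_add_kfunc(insn->imm))
                  return true;
          verifier_bug(env, "unhandled async callback type");
          return false;
  }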

At the same time, while simplifying in_sleepable(), we notice a problem
in do_misc_fixups. Fix storage_get helpers to use GFP_ATOMIC when called
from non-sleepable contexts within sleepable programs, such as bpf_timer
callbacks.

Currently, the check in do_misc_fixups() assumes that env->prog->sleepable
(previously in_sleepable(env), which resolved to the same check before the
last commit) holds across the program's execution, but that is not true.
Instead, the func_atomic bit must be set whenever we see the function
being called in an atomic context. Previously, this was done when the
helper was invoked in atomic contexts in sleepable programs; we can now
simply set the value to true without doing an in_sleepable() check.

We must also do a standalone in_sleepable() check to handle cases where
the async callback is armed from a sleepable program but is itself
non-sleepable (e.g., a timer callback) and invokes such a helper, thus
needing the func_atomic bit to be true for that call.

Adjust do_misc_fixups() to drop any checks regarding sleepable nature of
the program, and just depend on the func_atomic bit to decide which GFP
flag to pass.

Fixes: 81f1d7a ("bpf: wq: add bpf_wq_set_callback_impl")
Fixes: b00fa38 ("bpf: Enable non-atomic allocations in local storage")
Acked-by: Eduard Zingerman <[email protected]>
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Rename the storage_get_func_atomic flag to a more generic non_sleepable
flag that tracks whether a helper or kfunc may be called from a
non-sleepable context. This makes the flag more broadly applicable
beyond just storage_get helpers. See [0] for more context.

The flag is now set unconditionally for all helpers and kfuncs when:
- RCU critical section is active.
- Preemption is disabled.
- IRQs are disabled.
- In a non-sleepable context within a sleepable program (e.g., timer
  callbacks), which is indicated by !in_sleepable().

Previously, the flag was only set for storage_get helpers in these
contexts. With this change, it can be used by any code that needs to
differentiate between sleepable and non-sleepable contexts at the
per-instruction level.

The existing usage in do_misc_fixups() for storage_get helpers is
preserved by checking is_storage_get_function() before using the flag.

  [0]: https://lore.kernel.org/bpf/CAP01T76cbaNi4p-y8E0sjE2NXSra2S=Uja8G4hSQDu_SbXxREQ@mail.gmail.com

Cc: Mykyta Yatsenko <[email protected]>
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Acked-by: Mykyta Yatsenko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Add tests to verify that async callback's sleepable attribute is
correctly determined by the callback type, not the arming program's
context, reflecting its true execution context.

Introduce verifier_async_cb_context.c with tests for all three async
callback primitives: bpf_timer, bpf_wq, and bpf_task_work. Each
primitive is tested when armed from both sleepable (lsm.s/file_open) and
non-sleepable (fentry) programs.

Test coverage:
- bpf_timer callbacks: Verify they are never sleepable, even when armed
  from sleepable programs. Both tests should fail when attempting to use
  sleepable helper bpf_copy_from_user() in the callback.

- bpf_wq callbacks: Verify they are always sleepable, even when armed
  from non-sleepable programs. Both tests should succeed when using
  sleepable helpers in the callback.

- bpf_task_work callbacks: Verify they are always sleepable, even when
  armed from non-sleepable programs. Both tests should succeed when
  using sleepable helpers in the callback.

Acked-by: Eduard Zingerman <[email protected]>
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Kumar Kartikeya Dwivedi says:

====================
Fix sleepable context tracking for async callbacks

Currently, asynchronous execution primitives set up their callback
execution simulation using push_async_cb, which will end up inheriting
the sleepable or non-sleepable bit from the program triggering the
simulation of the callback. This is incorrect, as the actual execution
context of the asynchronous callback has nothing to do with the program
arming its execution.

This set fixes this oversight, and supplies a few test cases ensuring
the correct behavior is tested across different types of primitives
(i.e. timer, wq, task_work).

While looking at this bug, it was noticed that the GFP flag setting
logic for storage_get helpers is also broken, hence fix it while we
are at it.

PSA: These fixes and unit tests were primarily produced by prompting an
AI assistant (Claude), and then modified in minor ways, in an exercise
to understand how useful it can be at general kernel development tasks.

Changelog:
----------
v1 -> v2
v1: https://lore.kernel.org/bpf/[email protected]

 * Squash fix for GFP flags into 1st commit. (Eduard)
 * Add a commit refactoring func_atomic to non_sleepable, make it
   generic, also set for kfuncs in addition to helpers. (Eduard)
 * Leave selftest as-is, coverage for global subprogs calling sleepable
   kfuncs or helpers is provided in rcu_read_lock.c.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
The bpf_strcasestr() and bpf_strncasestr() functions behave the same as
bpf_strstr() and bpf_strnstr(), except that they ignore the case of the
characters.

Signed-off-by: Rong Tao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
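A hypothetical usage sketch (the kfunc declaration, section, and
program skeleton here are illustrative, not taken from the patch;
usual includes elided):

  extern int bpf_strcasestr(const char *s1__ign, const char *s2__ign) __ksym;

  SEC("tracepoint/syscalls/sys_enter_openat")
  int match_comm(void *ctx)
  {
          char comm[16];

          bpf_get_current_comm(comm, sizeof(comm));
          /* case-insensitive: matches "bash", "Bash", "BASH", ... */
          if (bpf_strcasestr(comm, "bash") >= 0)
                  bpf_printk("case-insensitive match");
          return 0;
  }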
Add tests for new kfuncs bpf_strcasestr() and bpf_strncasestr().

Signed-off-by: Rong Tao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Rong Tao says:

====================
Add kfuncs bpf_strcasestr and bpf_strncasestr

From: Rong Tao <[email protected]>

Add kfuncs bpf_strcasestr and bpf_strncasestr, which are extensions of
bpf_strstr and bpf_strnstr, suitable for more scenarios.

v4: Fix wrong comment.
v3: keep __bpf_strnstr() static and compress some tests.
    https://lore.kernel.org/lkml/[email protected]/
v2: remove extra __bpf_kfunc and fix comment of bpf_strncasestr().
    https://lore.kernel.org/all/[email protected]/
v1: https://lore.kernel.org/all/[email protected]/
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
This bit of refactoring aims to simplify how we free memory in
bpf_prog_test_run_skb to avoid code duplication.

Signed-off-by: Paul Chaignon <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Tested-by: [email protected]
Link: https://patch.msgid.link/8971e01ae87b84f5af6b8b40defd3c310faf1c0f.1760037899.git.paul.chaignon@gmail.com
This patch reorders the initialization of bpf_prog_test_run_skb to
simplify the subsequent patch. Program types are checked first, followed
by the ctx init, and finally the data init. With the subsequent patch,
program types and the ctx init provide information that is used in the
data init. Thus, we need the data init to happen last.

Signed-off-by: Paul Chaignon <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Link: https://patch.msgid.link/063475176f15828a882c07846017394baf72f682.1760037899.git.paul.chaignon@gmail.com
This patch adds support for crafting non-linear skbs in BPF test runs
for tc programs. The size of the linear area is given by ctx->data_end,
with a minimum of ETH_HLEN always pulled in the linear area. If ctx or
ctx->data_end are null, a linear skb is used.

This is particularly useful to test support for non-linear skbs in large
codebases such as Cilium. We've had multiple bugs in the past few years
where we were missing calls to bpf_skb_pull_data(). This support in
BPF_PROG_TEST_RUN would allow us to automatically cover this case in our
BPF tests.

LWT program types are currently excluded in this patch. Allowing
non-linear skbs for these programs would require a bit more care because
they are able to call helpers (e.g., bpf_clone_redirect, bpf_redirect)
that themselves call eth_type_trans(). eth_type_trans() assumes there
are at least ETH_HLEN bytes in the linear area. That may not be true
for LWT programs as we already pulled the L2 header via the
eth_type_trans() call in bpf_prog_test_run_skb().

In addition to the selftests introduced later in the series, this patch
was tested by enabling non-linear skbs for all tc selftest programs
and checking that test failures were expected.

Suggested-by: Daniel Borkmann <[email protected]>
Signed-off-by: Paul Chaignon <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Tested-by: [email protected]
Link: https://patch.msgid.link/5694d4d1af31bddf974afcb1bbb1e28b8998dcd0.1760037899.git.paul.chaignon@gmail.com
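A hedged userspace sketch of requesting a non-linear skb (the field
semantics follow this patch; pkt and prog_fd are illustrative):

  char pkt[256] = {};             /* test packet, zero-filled for the sketch */
  struct __sk_buff ctx = {
          .data_end = 64,         /* linear area size; 0 or no ctx => linear skb */
  };
  LIBBPF_OPTS(bpf_test_run_opts, topts,
          .data_in = pkt,
          .data_size_in = sizeof(pkt), /* bytes past the linear area end up in frags */
          .ctx_in = &ctx,
          .ctx_size_in = sizeof(ctx),
  );
  int err = bpf_prog_test_run_opts(prog_fd, &topts);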
This patch adds support for a new tag __linear_size in the test loader,
to specify the size of the linear area in case of non-linear skbs. If
the tag is absent or null, a linear skb is crafted.

Signed-off-by: Paul Chaignon <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Link: https://patch.msgid.link/7ad928ec7591daef4f1b84032aeb86c918b3e5a7.1760037899.git.paul.chaignon@gmail.com
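A hedged sketch of the tag in a test_loader-style selftest (the
program body is illustrative):

  SEC("tc")
  __description("direct packet access on non-linear skb")
  __success __retval(1)
  __linear_size(64)  /* 64-byte linear area; the rest lands in frags */
  int nonlinear_pkt_access(struct __sk_buff *skb)
  {
          void *data = (void *)(long)skb->data;
          void *data_end = (void *)(long)skb->data_end;

          if (data + ETH_HLEN > data_end)
                  return 0;
          return 1;
  }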
This patch adds new selftests in the direct packet access suite, to
cover the non-linear case. The first six tests cover the behavior of
the bounds check with a non-linear skb. The last test adds a call to
bpf_skb_pull_data() to be able to access the packet.

Note that the size of the linear area includes the L2 header, but for
some program types like cgroup_skb, ctx->data points to the L3 header.
Therefore, a linear area of 22 bytes will have only 8 bytes accessible
to the BPF program (22 - ETH_HLEN). For that reason, the cgroup_skb test
cases access the packet at an offset of 8 bytes.

Signed-off-by: Paul Chaignon <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Link: https://patch.msgid.link/ceedbfd719e58f0d49dcceb8592f5e6bd38ce5fe.1760037899.git.paul.chaignon@gmail.com
Paul Chaignon says:

====================
Support non-linear skbs for BPF_PROG_TEST_RUN

This patchset adds support for non-linear skbs when running tc programs
with BPF_PROG_TEST_RUN.

We've had multiple bugs in the past few years in Cilium caused by
missing calls to bpf_skb_pull_data(). Daniel suggested supporting
non-linear skbs in BPF_PROG_TEST_RUN to uncover these bugs in our
BPF tests.

Changes in v8:
  - Fix uninitialized data pointer spotted by Martin.
  - Error out in test_loader if __linear_size tag is used on unsupported
    program types.
Changes in v7:
  - Refactor use of 'size' variable as suggested by Martin.
  - Support copying back the non-linear area to data_out.
  - Minor code changes for readability, suggested by Martin.
Changes in v6:
  - Disallow non-linear skb in prog_run_skb only for LWT programs
    instead of all non-L2 program types, on suggestion from Martin.
  - Reject non-null ctx->data and ctx->data_meta, as suggested by Amery.
  - Bound linear_size to 'PAGE_SIZE - headroom - tailroom' to be
    consistent with prog_run_xdp, as suggested by Martin.
  - Allocate exactly linear_size bytes in bpf_test_init, spotted by
    Martin.
  - Fix wrong conflict resolution on double-free fix, spotted by Amery.
  - Rebased.
Changes in v5:
  - Fix double free on data in first patch.
Changes in v4:
  - Per Martin's suggestion, follow the XDP code pattern and use
    bpf_test_init only to initialize the linear area. That way data is
    directly copied to the right areas and we avoid the call to
    __pskb_pull_tail.
  - Fixed outdated commit descriptions.
  - Rebased.
Changes in v3:
  - Dropped BPF_F_TEST_SKB_NON_LINEAR and used the ctx->data_end to
    determine if the user wants non-linear skb, as suggested by Amery.
  - Introduced a second commit with a bit of refactoring to allow for
    the above requested change.
  - Fix bug found by syzkaller on third commit.
  - Rebased.
Changes in v2:
  - Made the linear size configurable via ctx->data_end, as suggested
    by Amery.
  - Reworked the selftests to allow testing the configurable linear
    size.
  - Fix warnings reported by kernel test robot on first commit.
  - Rebased.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
…pf_wq

Fix handling maps with no BTF and non-constant offsets for the bpf_wq.

This de-duplicates logic with other internal structs (task_work, timer),
keeps error reporting consistent, and makes future changes to the layout
handling centralized.

Fixes: d940c9b ("bpf: add support for KF_ARG_PTR_TO_WORKQUEUE")
Signed-off-by: Mykyta Yatsenko <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Add bpf_wq selftests to verify:
 * BPF program using non-constant offset of struct bpf_wq is rejected
 * BPF program using map with no BTF for storing struct bpf_wq is rejected

Signed-off-by: Mykyta Yatsenko <[email protected]>
Tested-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
The arraymap and hashtab duplicate the logic that checks for and frees
internal structs (timer, workqueue, task_work) based on
BTF record flags. Centralize this by introducing two helpers:

  * bpf_map_has_internal_structs(map)
Returns true if the map value contains any of the internal structs:
    BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK.

  * bpf_map_free_internal_structs(map, obj)
    Frees the internal structs for a single value object.

Convert arraymap and both the prealloc/malloc hashtab paths to use the
new generic functions. This keeps the functionality for when/how to free
these special fields in one place and makes it easier to add support for
new internal structs in the future without touching every map
implementation.

Signed-off-by: Mykyta Yatsenko <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
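A hedged sketch of the two helpers (the timer and workqueue free-side
kernel helpers exist today; the task_work one is assumed here, and the
real bodies may differ):

  static inline bool bpf_map_has_internal_structs(struct bpf_map *map)
  {
          return btf_record_has_field(map->record,
                          BPF_TIMER | BPF_WORKQUEUE | BPF_TASK_WORK);
  }

  static void bpf_map_free_internal_structs(struct bpf_map *map, void *obj)
  {
          if (btf_record_has_field(map->record, BPF_TIMER))
                  bpf_obj_free_timer(map->record, obj);
          if (btf_record_has_field(map->record, BPF_WORKQUEUE))
                  bpf_obj_free_workqueue(map->record, obj);
          if (btf_record_has_field(map->record, BPF_TASK_WORK))
                  bpf_obj_free_task_work(map->record, obj);
  }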
…18-rc1

Cross-merge BPF and other fixes after downstream PR.

No conflicts.

Signed-off-by: Alexei Starovoitov <[email protected]>
We have many places that open-code what is now the bpf_rcu_lock_held()
macro, so replace all of them with a clean and short macro invocation.
For that, move the bpf_rcu_lock_held() macro into include/linux/bpf.h.

Signed-off-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Cross-merge BPF and other fixes after downstream PR.

No conflicts.

Signed-off-by: Alexei Starovoitov <[email protected]>
There are some "set but not used" build errors when compiling bpf
selftests with the latest upstream mainline GCC. Initially, the
__maybe_unused attribute was added to the variables, but it is better
to just add the option -Wno-unused-but-set-variable to CFLAGS in the
Makefile to disable the errors instead of hacking the tests.

  tools/testing/selftests/bpf/map_tests/lpm_trie_map_basic_ops.c:229:36:
  error: variable ‘n_matches_after_delete’ set but not used [-Werror=unused-but-set-variable=]

  tools/testing/selftests/bpf/map_tests/lpm_trie_map_basic_ops.c:229:25:
  error: variable ‘n_matches’ set but not used [-Werror=unused-but-set-variable=]

  tools/testing/selftests/bpf/prog_tests/bpf_cookie.c:426:22:
  error: variable ‘j’ set but not used [-Werror=unused-but-set-variable=]

  tools/testing/selftests/bpf/prog_tests/find_vma.c:52:22:
  error: variable ‘j’ set but not used [-Werror=unused-but-set-variable=]

  tools/testing/selftests/bpf/prog_tests/perf_branches.c:67:22:
  error: variable ‘j’ set but not used [-Werror=unused-but-set-variable=]

  tools/testing/selftests/bpf/prog_tests/perf_link.c:15:22:
  error: variable ‘j’ set but not used [-Werror=unused-but-set-variable=]

Signed-off-by: Tiezhu Yang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
When CONFIG_MEMCG is enabled, we can access mm->owner under RCU. The
owner can be NULL. With this change, BPF helpers can safely access
mm->owner to retrieve the associated task from the mm. We can then make
policy decisions based on task attributes.

The typical use case is as follows,

  bpf_rcu_read_lock(); // rcu lock must be held for rcu trusted field
  @owner = @mm->owner; // mm_struct::owner is rcu trusted or null
  if (!@owner)
      goto out;

  /* Do something based on the task attribute */

out:
  bpf_rcu_read_unlock();

Suggested-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Yafang Shao <[email protected]>
Acked-by: Lorenzo Stoakes <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
The vma->vm_mm might be NULL and it can be accessed outside of RCU. Thus,
we can mark it as trusted_or_null. With this change, BPF helpers can safely
access vma->vm_mm to retrieve the associated mm_struct from the VMA.
Then we can make policy decisions from the VMA.

The "trusted" annotation enables direct access to vma->vm_mm within kfuncs
marked with KF_TRUSTED_ARGS or KF_RCU, such as bpf_task_get_cgroup1() and
bpf_task_under_cgroup(). Conversely, "null" enforcement requires all
callsites using vma->vm_mm to perform NULL checks.

The lsm selftest must be modified because it directly accesses vma->vm_mm
without a NULL pointer check; otherwise it will break due to this
change.

For the VMA based THP policy, the use case is as follows,

  @mm = @vma->vm_mm; // vm_area_struct::vm_mm is trusted or null
  if (!@mm)
      return;
  bpf_rcu_read_lock(); // rcu lock must be held to dereference the owner
  @owner = @mm->owner; // mm_struct::owner is rcu trusted or null
  if (!@owner)
    goto out;
  @cgroup1 = bpf_task_get_cgroup1(@owner, MEMCG_HIERARCHY_ID);

  /* make the decision based on the @cgroup1 attribute */

  bpf_cgroup_release(@cgroup1); // release the associated cgroup
out:
  bpf_rcu_read_unlock();

PSI memory information can be obtained from the associated cgroup to inform
policy decisions. Since upstream PSI support is currently limited to cgroup
v2, the following example demonstrates cgroup v2 implementation:

  @owner = @mm->owner;
  if (@owner) {
      // @ancestor_cgid is user-configured
      @ancestor = bpf_cgroup_from_id(@ancestor_cgid);
      if (bpf_task_under_cgroup(@owner, @ancestor)) {
          @psi_group = @ancestor->psi;

          /* Extract PSI metrics from @psi_group and
           * implement policy logic based on the values
           */

      }
  }

The vma::vm_file can also be marked with __safe_trusted_or_null.

No additional selftests are required since vma->vm_file and vma->vm_mm are
already validated in the existing selftest suite.

Signed-off-by: Yafang Shao <[email protected]>
Acked-by: Lorenzo Stoakes <[email protected]>
Cc: "Liam R. Howlett" <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Yafang Shao says:
====================
These two minor patches were developed during the implementation of
BPF-THP:
  https://lwn.net/Articles/1042138/

As suggested by Andrii, they are being submitted separately.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
With latest llvm22, I hit the verif_scale_strobemeta selftest failure
below:
  $ ./test_progs -n 618
  libbpf: prog 'on_event': BPF program load failed: -E2BIG
  libbpf: prog 'on_event': -- BEGIN PROG LOAD LOG --
  BPF program is too large. Processed 1000001 insn
  verification time 7019091 usec
  stack depth 488
  processed 1000001 insns (limit 1000000) max_states_per_insn 28 total_states 33927 peak_states 12813 mark_read 0
  -- END PROG LOAD LOG --
  libbpf: prog 'on_event': failed to load: -E2BIG
  libbpf: failed to load object 'strobemeta.bpf.o'
  scale_test:FAIL:expect_success unexpected error: -7 (errno 7)
  #618     verif_scale_strobemeta:FAIL

But if I increase the verification insn limit from 1M to 10M, the above
test_progs run will actually succeed. Below is the result from veristat:
  $ ./veristat strobemeta.bpf.o
  Processing 'strobemeta.bpf.o'...
  File              Program   Verdict  Duration (us)    Insns  States  Program size  Jited size
  ----------------  --------  -------  -------------  -------  ------  ------------  ----------
  strobemeta.bpf.o  on_event  success       90250893  9777685  358230         15954       80794
  ----------------  --------  -------  -------------  -------  ------  ------------  ----------
  Done. Processed 1 files, 0 programs. Skipped 1 files, 0 programs.

Further debugging shows the llvm commit [1] is responsible for the
verification failure as it tries to convert certain switch statements to
if-conditions. Such a change may cause different transformations compared
to the original switch statement.

In the bpf program strobemeta.c case, the initial llvm ir for the read_int_var() function is
  define internal void @read_int_var(ptr noundef %0, i64 noundef %1, ptr noundef %2,
      ptr noundef %3, ptr noundef %4) #2 !dbg !535 {
    %6 = alloca ptr, align 8
    %7 = alloca i64, align 8
    %8 = alloca ptr, align 8
    %9 = alloca ptr, align 8
    %10 = alloca ptr, align 8
    %11 = alloca ptr, align 8
    %12 = alloca i32, align 4
    ...
    %20 = icmp ne ptr %19, null, !dbg !561
    br i1 %20, label %22, label %21, !dbg !562

  21:                                               ; preds = %5
    store i32 1, ptr %12, align 4
    br label %48, !dbg !563

  22:
    %23 = load ptr, ptr %9, align 8, !dbg !564
    ...

  47:                                               ; preds = %38, %22
    store i32 0, ptr %12, align 4, !dbg !588
    br label %48, !dbg !588

  48:                                               ; preds = %47, %21
    call void @llvm.lifetime.end.p0(ptr %11) #4, !dbg !588
    %49 = load i32, ptr %12, align 4
    switch i32 %49, label %51 [
      i32 0, label %50
      i32 1, label %50
    ]

  50:                                               ; preds = %48, %48
    ret void, !dbg !589

  51:                                               ; preds = %48
    unreachable
  }

Note that the above 'switch' statement is added by the clang frontend.
Without [1], the switch statement will survive until SelectionDag, so
the switch statement acts like a 'barrier' and prevents some
transformations involving code both before and after the switch
statement.

But with [1], the switch statement will be removed during middle end
optimization and later middle end passes (esp. after inlining) have more
freedom to reorder the code.

The following is the related source code:

  static void *calc_location(struct strobe_value_loc *loc, void *tls_base):
        bpf_probe_read_user(&tls_ptr, sizeof(void *), dtv);
        /* if pointer has (void *)-1 value, then TLS wasn't initialized yet */
        return tls_ptr && tls_ptr != (void *)-1
                ? tls_ptr + tls_index.offset
                : NULL;

  In read_int_var() func, we have:
        void *location = calc_location(&cfg->int_locs[idx], tls_base);
        if (!location)
                return;

        bpf_probe_read_user(value, sizeof(struct strobe_value_generic), location);
        ...

The static func calc_location() is called inside read_int_var(). The asm code
without [1]:
     77: .123....89 (85) call bpf_probe_read_user#112
     78: ........89 (79) r1 = *(u64 *)(r10 -368)
     79: .1......89 (79) r2 = *(u64 *)(r10 -8)
     80: .12.....89 (bf) r3 = r2
     81: .123....89 (0f) r3 += r1
     82: ..23....89 (07) r2 += 1
     83: ..23....89 (79) r4 = *(u64 *)(r10 -464)
     84: ..234...89 (a5) if r2 < 0x2 goto pc+13
     85: ...34...89 (15) if r3 == 0x0 goto pc+12
     86: ...3....89 (bf) r1 = r10
     87: .1.3....89 (07) r1 += -400
     88: .1.3....89 (b4) w2 = 16
In this case, 'r2 < 0x2' and 'r3 == 0x0' go to the null 'location' place,
so the verifier actually prefers to do verification first at 'r1 = r10' etc.

The asm code with [1]:
    119: .123....89 (85) call bpf_probe_read_user#112
    120: ........89 (79) r1 = *(u64 *)(r10 -368)
    121: .1......89 (79) r2 = *(u64 *)(r10 -8)
    122: .12.....89 (bf) r3 = r2
    123: .123....89 (0f) r3 += r1
    124: ..23....89 (07) r2 += -1
    125: ..23....89 (a5) if r2 < 0xfffffffe goto pc+6
    126: ........89 (05) goto pc+17
    ...
    144: ........89 (b4) w1 = 0
    145: .1......89 (6b) *(u16 *)(r8 +80) = r1
In this case, if 'r2 < 0xfffffffe' is true, the control will go to the
non-null 'location' branch, so 'goto pc+17' will actually go to the
null 'location' branch. This seems to cause a tremendous amount of
verification states.

To fix the issue, rewrite the following code
  return tls_ptr && tls_ptr != (void *)-1
                ? tls_ptr + tls_index.offset
                : NULL;
to an if/then statement, in the hope that these explicit if/then
statements are sticky during middle-end optimizations.
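The resulting shape is roughly (the actual selftest change may differ):

  void *ret;

  if (tls_ptr && tls_ptr != (void *)-1)
          ret = tls_ptr + tls_index.offset;
  else
          ret = NULL;
  return ret;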

Tested with llvm20 and llvm21 as well; all strobemeta-related selftests
pass.

  [1] llvm/llvm-project#161000

Signed-off-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
do_hbm_test.sh:
The comment incorrectly used "upcomming" instead of "upcoming".

hbm.c
The comment incorrectly used "Managment" instead of "Management".
The comment incorrectly used "Currrently" instead of "Currently".

tcp_cong_kern.c
The comment incorrectly used "deteremined" instead of "determined".

tracex1.bpf.c
The comment incorrectly used "loobpack" instead of "loopback".

Signed-off-by: Chu Guangqing <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
The __list_del function doesn't set the previous node's next pointer to
the next node of the node to be deleted. It just updates the local
variable and not the actual pointer in the previous node.

The test was passing up till now because the bpf code does bpf_free()
after list_del, and therefore reading head->first from userspace will
read all zeroes. But after arena_list_del() is finished, head->first
should point to NULL.

Signed-off-by: Puranjay Mohan <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
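A hedged sketch of the hlist-style unlink the fix restores (names are
illustrative; the arena version also needs its address-space casts):

  static inline void __list_del(struct node *n)
  {
          struct node *next = n->next;
          struct node **pprev = n->pprev;

          /* the fix: actually store through the previous node's link
           * instead of only updating a local variable
           */
          *pprev = next;
          if (next)
                  next->pprev = pprev;
  }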
Yinhao et al. reported that their fuzzer tool was able to trigger a
skb_warn_bad_offload() from netif_skb_features() -> gso_features_check().
When a BPF program - triggered via BPF test infra - pushes the packet
to the loopback device via bpf_clone_redirect() then mentioned offload
warning can be seen. GSO-related features are then rightfully disabled.

We get into this situation due to convert___skb_to_skb() setting
gso_segs and gso_size but not gso_type. Technically, it makes sense
that this warning triggers since the GSO properties are malformed due
to the gso_type. Potentially, the gso_type could be marked non-trustworthy
through setting it at least to SKB_GSO_DODGY without any other specific
assumptions, but that also feels wrong given we should not go further
into the GSO engine in the first place.

The checks were added in 121d57a ("gso: validate gso_type in GSO
handlers") because there were malicious (syzbot) senders that combine
a protocol with a non-matching gso_type. If we would want to drop such
packets, gso_features_check() currently only returns feature flags via
netif_skb_features(), so one location for potentially dropping such skbs
could be validate_xmit_unreadable_skb(), but then otoh it would be
an additional check in the fast-path for a very corner case. Given
bpf_clone_redirect() is the only place where BPF test infra could emit
such packets, let's reject them right there.

Fixes: 850a88c ("bpf: Expose __sk_buff wire_len/gso_segs to BPF_PROG_TEST_RUN")
Fixes: cf62089 ("bpf: Add gso_size to __sk_buff")
Reported-by: Yinhao Hu <[email protected]>
Reported-by: Kaiyan Mei <[email protected]>
Reported-by: Dongliang Mu <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Link: https://patch.msgid.link/[email protected]
…ance()

propagate_to_outer_instance() calls get_outer_instance() and uses the
returned pointer to reset and commit stack write marks. Under normal
conditions, update_instance() guarantees that an outer instance exists,
so get_outer_instance() cannot return an ERR_PTR.

However, explicitly checking for IS_ERR(outer_instance) makes this code
more robust and self-documenting. It reduces cognitive load when reading
the control flow and silences potential false-positive reports from
static analysis or automated tooling.

No functional change intended.

Signed-off-by: Shardul Bankar <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Tropicao and others added 29 commits November 6, 2025 15:23
Some tests have to stop/start a server multiple times with the same
listening address. Doing so without SO_REUSEADDR leads to failures due to
the socket still being in TIME_WAIT right after the first instance
stops/before the second instance starts. Instead of letting each test
manually set SO_REUSEADDR on their servers, it can be done automatically
by start_server_addr for all tests (and without any major downside).

Enforce SO_REUSEADDR in start_server_addr for all tests.

Signed-off-by: Alexis Lothoré (eBPF Foundation) <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Link: https://patch.msgid.link/[email protected]
…r in tc_tunnel

Now that start_server_str enforces SO_REUSEADDR, there's no need to keep
using start_reuseport_server in tc_tunnel, especially since it only uses
one server at a time.

Replace start_reuseport_server with start_server_str in tc_tunnel test.

Signed-off-by: Alexis Lothoré (eBPF Foundation) <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Link: https://patch.msgid.link/[email protected]
Alexis Lothoré says:

====================
This small series is another follow-up to [1], in which I misunderstood
Martin's initial feedback (see [2]). I proposed to make tc-tunnel apply
SO_REUSEPORT once server is brought up. This series updates
start_server_addr to really apply Martin's proposal after his
clarification [3]

[1] https://lore.kernel.org/bpf/[email protected]/
[2] https://lore.kernel.org/bpf/[email protected]/
[3] https://lore.kernel.org/bpf/[email protected]/
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
The range tree uses bpf_mem_alloc() that is safe to be called from all
contexts and uses a pre-allocated pool of memory to serve these
allocations.

Replace bpf_mem_alloc() with kmalloc_nolock() as it can be called safely
from all contexts and is more scalable than bpf_mem_alloc().

Remove the migrate_disable/enable pairs as they were only needed for
bpf_mem_alloc(), which does per-cpu operations; kmalloc_nolock() doesn't
need this.

Signed-off-by: Puranjay Mohan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Return -ETIMEDOUT whenever non-head waiters are signalled by the head,
and fix an oversight in commit 7bd6e5c ("rqspinlock: Disable queue
destruction for deadlocks"). We no longer signal on deadlocks.

Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Reviewed-by: Amery Hung <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
As [lru_,]percpu_hash maps support BPF_KPTR_{REF,PERCPU}, missing
calls to 'bpf_obj_free_fields()' in 'pcpu_copy_value()' could cause the
memory referenced by BPF_KPTR_{REF,PERCPU} fields to be held until the
map gets freed.

Fix this by calling 'bpf_obj_free_fields()' after
'copy_map_value[,_long]()' in 'pcpu_copy_value()'.

Fixes: 65334e6 ("bpf: Support kptrs in percpu hashmap and percpu LRU hashmap")
Signed-off-by: Leon Hwang <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
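A hedged sketch of the fixed function (kernel/bpf/hashtab.c; the real
code may differ in detail):

  static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
                              void *value, bool onallcpus)
  {
          u32 size = round_up(htab->map.value_size, 8);
          int cpu, off = 0;

          if (!onallcpus) {
                  void *ptr = this_cpu_ptr(pptr);

                  copy_map_value(&htab->map, ptr, value);
                  /* the fix: drop references held by the old value's
                   * kptr fields */
                  bpf_obj_free_fields(htab->map.record, ptr);
          } else {
                  for_each_possible_cpu(cpu) {
                          void *ptr = per_cpu_ptr(pptr, cpu);

                          copy_map_value_long(&htab->map, ptr, value + off);
                          bpf_obj_free_fields(htab->map.record, ptr);
                          off += size;
                  }
          }
  }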
…maps

Add test to verify that updating [lru_,]percpu_hash maps decrements
refcount when BPF_KPTR_REF objects are involved.

The tests perform the following steps:
. Call update_elem() to insert an initial value.
. Use bpf_refcount_acquire() to increment the refcount.
. Store the node pointer in the map value.
. Add the node to a linked list.
. Probe-read the refcount and verify it is *2*.
. Call update_elem() again to trigger refcount decrement.
. Probe-read the refcount and verify it is *1*.

Signed-off-by: Leon Hwang <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Leon Hwang says:

====================
In the discussion thread
"[PATCH bpf-next v9 0/7] bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags for percpu maps"[1],
it was pointed out that missing calls to bpf_obj_free_fields() could
lead to memory leaks.

A selftest was added to confirm that this is indeed a real issue - the
refcount of BPF_KPTR_REF field is not decremented when
bpf_obj_free_fields() is missing after copy_map_value[,_long]().

Further inspection of copy_map_value[,_long]() call sites revealed two
locations affected by this issue:

1. pcpu_copy_value()
2. htab_map_update_elem() when used with BPF_F_LOCK

A similar case happens when updating local storage maps with BPF_F_LOCK.

This series fixes the cases where BPF_F_LOCK is not involved by
properly calling bpf_obj_free_fields() after copy_map_value[,_long](),
and adds a selftest to verify the fix.

The remaining cases involving BPF_F_LOCK will be addressed in a
separate patch set after the series
"bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags for percpu maps"
is applied.

Changes:
v5 -> v6:
* Update the test name to include "refcounted_kptr".
* Update some local variables' name in the test (per Alexei).
* v5: https://lore.kernel.org/bpf/[email protected]/

v4 -> v5:
* Use a local variable to store the this_cpu_ptr()/per_cpu_ptr() result,
  and reuse it between copy_map_value[,_long]() and
  bpf_obj_free_fields() in patch #1 (per Andrii).
* Drop patches #2 and #3, because the combination of BPF_F_LOCK with other
  special fields (except for BPF_SPIN_LOCK) will be disallowed on the
  UAPI side in the future (per Alexei).
* v4: https://lore.kernel.org/bpf/[email protected]/

v3 -> v4:
* Target bpf-next tree.
* Address comments from Amery:
  * Drop 'bpf_obj_free_fields()' in the path of updating local storage
    maps without BPF_F_LOCK.
  * Drop the corresponding self test.
  * Respin the other test of local storage maps using syscall BPF
    programs.
* v3: https://lore.kernel.org/bpf/[email protected]/

v2 -> v3:
* Free special fields when updating local storage maps without BPF_F_LOCK.
* Add test to verify decrementing refcount when updating cgroup local
  storage maps without BPF_F_LOCK.
* Address review from AI bot:
  * Slow path with BPF_F_LOCK (around line 642-646) in
    'bpf_local_storage.c'.
* v2: https://lore.kernel.org/bpf/[email protected]/

v1 -> v2:
* Add test to verify decrementing refcount when updating cgroup local
  storage maps with BPF_F_LOCK.
* Address review from AI bot:
  * Fast path without bucket lock (around line 610) in
    'bpf_local_storage.c'.
* v1: https://lore.kernel.org/bpf/[email protected]/
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Executing the test_maps binary on platforms with extremely high core
counts may cause intermittent assertion failures in
test_update_delete() (called via test_map_parallel()). This can occur
because bpf_map_update_elem() under some circumstances (specifically
in this case while performing bpf_map_update_elem() with BPF_NOEXIST
on a BPF_MAP_TYPE_HASH with its map_flags set to BPF_F_NO_PREALLOC)
can return an E2BIG error code, i.e.

error -7 7 tools/testing/selftests/bpf/test_maps.c:#: void
test_update_delete(unsigned int, void *): Assertion `err == 0' failed.
tools/testing/selftests/bpf/test_maps.c:#: void
__run_parallel(unsigned int, void (*)(unsigned int, void *), void *):
Assertion `status == 0' failed.

As it turns out, is_map_full(), which is called from alloc_htab_elem(),
can take a conservative approach when htab->use_percpu_counter is
true (which is the case here because the percpu_counter is used when a
BPF_MAP_TYPE_HASH is created with its map_flags set to
BPF_F_NO_PREALLOC). This conservative approach prioritizes preventing
over-allocation and potential issues that could arise from possibly
exceeding htab->map.max_entries in highly concurrent environments,
even if it means slightly under-utilizing the htab map's capacity.

Given that bpf_map_update_elem() from test_update_delete() can return
E2BIG, update can_retry() such that it also accounts for the E2BIG
error code (specifically only when running with map_flags being set to
BPF_F_NO_PREALLOC). The retry loop will allow the global count
belonging to the percpu_counter to become synchronized and better
reflect the current htab map's capacity.

Signed-off-by: Matt Bobrowski <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
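A hedged sketch of the widened retry condition (test_maps.c; the
helper's real shape and its existing conditions are elided and
illustrative):

  static bool can_retry(int err, int map_flags)
  {
          /* ... existing retry conditions ... */

          /* the per-cpu counter used by BPF_F_NO_PREALLOC htabs can
           * conservatively report the map as full under heavy
           * parallel updates */
          if (err == E2BIG && (map_flags & BPF_F_NO_PREALLOC))
                  return true;
          return false;
  }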
bpf_try_get_buffers() returns one of multiple per-CPU buffers based on a
per-CPU nesting counter. This mechanism expects that buffers are not
endlessly acquired before being returned. migrate_disable() ensures that a
task remains on the same CPU, but it does not prevent the task from being
preempted by another task on that CPU.

Without disabled preemption, a task may be preempted while holding a
buffer, allowing another task to run on same CPU and acquire an
additional buffer. Several such preemptions can cause the per-CPU
nest counter to exceed MAX_BPRINTF_NEST_LEVEL and trigger the warning in
bpf_try_get_buffers(). Adding preempt_disable()/preempt_enable() around
buffer acquisition and release prevents this task preemption and
preserves the intended bounded nesting behavior.

Reported-by: [email protected]
Closes: https://lore.kernel.org/all/[email protected]/
Fixes: 4223bf8 ("bpf: Remove preempt_disable in bpf_try_get_buffers")
Suggested-by: Yonghong Song <[email protected]>
Reviewed-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Sahil Chandna <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
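A hedged sketch of the fix (kernel/bpf/helpers.c; details may differ):

  static int bpf_try_get_buffers(struct bpf_bprintf_buffers **bufs)
  {
          int nest_level;

          preempt_disable();
          nest_level = this_cpu_inc_return(bpf_bprintf_nest_level);
          if (WARN_ON_ONCE(nest_level > MAX_BPRINTF_NEST_LEVEL)) {
                  this_cpu_dec(bpf_bprintf_nest_level);
                  preempt_enable();
                  return -EBUSY;
          }
          *bufs = this_cpu_ptr(&bpf_bprintf_bufs[nest_level - 1]);
          /* preemption stays disabled until the matching put path
           * re-enables it */
          return 0;
  }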
There are a few places where log level is not checked before calling
"verbose()". This forces programs working only at
BPF_LOG_LEVEL_STATS (e.g. veristat) to allocate unnecessarily large
log buffers. Add missing checks.

Reported-by: Emil Tsalapatis <[email protected]>
Signed-off-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
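The pattern added at the affected call sites looks roughly like this
(illustrative):

  /* before: unconditionally appends to the log */
  verbose(env, "...");

  /* after: skip formatting entirely unless a log level that wants
   * this output was requested */
  if (env->log.level & BPF_LOG_LEVEL2)
          verbose(env, "...");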
The error returned by ftrace_set_filter_ip() in register_fentry() is
not handled properly. Fix it.

Fixes: 00963a2 ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)")
Signed-off-by: Menglong Dong <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Increase arena test coverage.
Convert glob_match() to bpf arena in two steps:
1.
Copy paste lib/glob.c into bpf_arena_strsearch.h
Copy paste lib/globtests.c into progs/arena_strsearch.c

2.
Add __arena to pointers
Add __arg_arena to global functions that accept arena pointers
Add cond_break to loops

The test also serves as a good example of what's possible
with bpf arena and how existing algorithms can be converted.

Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
When test_send_signal_kern__open_and_load() fails, the parent closes the
pipe, which causes ASSERT_EQ(read(pipe_p2c...)) to fail, but the child
continues and enters an infinite loop while the parent is stuck in
wait(NULL). Other error paths have a similar issue, so kill the child
before waiting on it.

The bug was discovered while compiling all of selftests with -O1 instead of -O2
which caused progs/test_send_signal_kern.c to fail to load.

Fixes: ab8b7f0 ("tools/bpf: Add self tests for bpf_send_signal_thread()")
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Handle recursive typedefs in BTF deduplication

Pahole fails to encode BTF for some Go projects (e.g. Kubernetes and
Podman) due to recursive type definitions that create reference loops
not representable in C. These recursive typedefs trigger a failure in
the BTF deduplication algorithm.

This patch extends btf_dedup_ref_type() to properly handle potential
recursion for BTF_KIND_TYPEDEF, similar to how recursion is already
handled for BTF_KIND_STRUCT. This allows pahole to successfully
generate BTF for Go binaries using recursive types without impacting
existing C-based workflows.

Suggested-by: Tristan d'Audibert <[email protected]>
Co-developed-by: Martin Horth <[email protected]>
Co-developed-by: Ouail Derghal <[email protected]>
Co-developed-by: Guilhem Jazeron <[email protected]>
Co-developed-by: Ludovic Paillat <[email protected]>
Co-developed-by: Robin Theveniaut <[email protected]>
Signed-off-by: Martin Horth <[email protected]>
Signed-off-by: Ouail Derghal <[email protected]>
Signed-off-by: Guilhem Jazeron <[email protected]>
Signed-off-by: Ludovic Paillat <[email protected]>
Signed-off-by: Robin Theveniaut <[email protected]>
Signed-off-by: Paul Houssel <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/bpf/bf00857b1e06f282aac12f6834de7396a7547ba6.1763037045.git.paul.houssel@orange.com
Add several ./test_progs tests:
    1.  btf/dedup:recursive typedef ensures that deduplication no
	longer fails on recursive typedefs.
    2.  btf/dedup:typedef ensures that typedefs are deduplicated correctly
	just as they were before this patch.

Signed-off-by: Paul Houssel <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/bpf/9fac2f744089f6090257d4c881914b79f6cd6c6a.1763037045.git.paul.houssel@orange.com
Paul Houssel says:

====================
libbpf: fix BTF dedup to support recursive typedef

Pahole fails to encode BTF for some Go projects (e.g. Kubernetes and
Podman) due to recursive type definitions that create reference loops
not representable in C. These recursive typedefs trigger a failure in
the BTF deduplication algorithm.

This patch extends btf_dedup_struct_types() to properly handle potential
recursion for BTF_KIND_TYPEDEF, similar to how recursion is already
handled for BTF_KIND_STRUCT. This allows pahole to successfully
generate BTF for Go binaries using recursive types without impacting
existing C-based workflows.

Changes in v4: fix typo found by Claude-based CI

Changes in v3:
  1. Patch 1: Adjusted the comment of btf_dedup_ref_type() to refer to
  typedef as well.
  2. Patch 2: Update of the "dedup: recursive typedef" test to include a
  duplicated version of the types to make sure deduplication still happens
  in this case.

Changes in v2:
  1. Patch 1: Refactored code to prevent copying existing logic. Instead of
  adding a new function we modify the existing btf_dedup_struct_type()
  function to handle the BTF_KIND_TYPEDEF case. Calls to btf_hash_struct()
  and btf_shallow_equal_struct() are replaced with calls to functions that
  select btf_hash_struct() / btf_hash_typedef() based on the type.
  2. Patch 2: Added tests

v3: https://lore.kernel.org/lkml/[email protected]/

v2: https://lore.kernel.org/lkml/[email protected]/

v1: https://lore.kernel.org/lkml/[email protected]/
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Andrii Nakryiko <[email protected]>
…8-rc5+

Cross-merge BPF and other fixes after downstream PR.

Minor conflict in kernel/bpf/helpers.c

Signed-off-by: Alexei Starovoitov <[email protected]>
bpf_task_work_schedule_resume() and bpf_task_work_schedule_signal() have
been renamed in bpf tree to bpf_task_work_schedule_resume_impl() and
bpf_task_work_schedule_signal_impl() accordingly.
There are a few uses of these kfuncs in selftests that are not in the
bpf tree, so when we port [1] into bpf-next, those BPF programs will
not compile.
This patch aligns those remaining callsites with the kfunc renaming.
It should go on top of [1] when applying on bpf-next.

1: https://lore.kernel.org/all/[email protected]/

Signed-off-by: Mykyta Yatsenko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Metadata about a kfunc call is added to the kfunc_tab in
add_kfunc_call() but the call instruction itself could get removed by
opt_remove_dead_code() later if it is not reachable.

If the call instruction is removed, specialize_kfunc() is never called
for it and the desc->imm in the kfunc_tab is never initialized for this
kfunc call. In this case, sort_kfunc_descs_by_imm_off(env->prog) in
do_misc_fixups() doesn't sort the table correctly.
This is a problem for s390 as its JIT uses this table to find the
addresses for kfuncs, and if this table is not sorted properly, JIT may
fail to find addresses for valid kfunc calls.

This was exposed by:

commit d869d56 ("bpf: verifier: refactor kfunc specialization")

as before this commit, desc->imm was initialised in add_kfunc_call()
which happens before dead code elimination.

Move the desc->imm setup down to sort_kfunc_descs_by_imm_off(); this
fixes the problem and also saves us from having the same logic in
add_kfunc_call() and specialize_kfunc().

Suggested-by: Eduard Zingerman <[email protected]>
Signed-off-by: Puranjay Mohan <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
The bpf_skb_check_mtu helper needs to use skb->transport_header when
the BPF_MTU_CHK_SEGS flag is used:

	bpf_skb_check_mtu(skb, ifindex, &mtu_len, 0, BPF_MTU_CHK_SEGS)

The transport_header is not always set. There is a WARN_ON_ONCE
report when CONFIG_DEBUG_NET is enabled + skb->gso_size is set +
bpf_prog_test_run is used:

WARNING: CPU: 1 PID: 2216 at ./include/linux/skbuff.h:3071
 skb_gso_validate_network_len
 bpf_skb_check_mtu
 bpf_prog_3920e25740a41171_tc_chk_segs_flag # A test in the next patch
 bpf_test_run
 bpf_prog_test_run_skb

For a normal ingress skb (not test_run), skb_reset_transport_header
is performed, but there is a plan to avoid setting it as described in
commit 2170a1f ("net: no longer reset transport_header in __netif_receive_skb_core()").

This patch fixes the bpf helper by checking
skb_transport_header_was_set(). The check is done just before
skb->transport_header is used, to avoid breaking the existing bpf prog.
The WARN_ON_ONCE is limited to bpf_prog_test_run, so targeting bpf-next.

Fixes: 34b2021 ("bpf: Add BPF-helper for MTU checking")
Cc: Jesper Dangaard Brouer <[email protected]>
Reported-by: Kaiyan Mei <[email protected]>
Reported-by: Yinhao Hu <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
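A hedged sketch of the guard inside bpf_skb_check_mtu()
(net/core/filter.c; placed just before the transport header would be
consumed, with surrounding code elided):

  if (flags & BPF_MTU_CHK_SEGS) {
          if (!skb_transport_header_was_set(skb))
                  return -EINVAL;
          if (!skb_gso_validate_network_len(skb, mtu))
                  ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;
  }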
…t_header is not set

Add a test to check that bpf_skb_check_mtu(BPF_MTU_CHK_SEGS) is
rejected (-EINVAL) if skb->transport_header is not set. The test
needs to lower the MTU of the loopback device. Thus, take this
opportunity to run the test in a netns by adding "ns_" to the test
name. The "serial_" prefix can then be removed.

Signed-off-by: Martin KaFai Lau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
If xlated_prog_insns should not be exposed, other information
(such as func_info) still can and should be filled in.
Therefore, instead of directly terminating in this case,
continue with the normal flow.

Signed-off-by: Max Altgelt <[email protected]>
Link: https://lore.kernel.org/r/efd00fcec5e3e247af551632726e2a90c105fbd8.camel@nextron-systems.com
Signed-off-by: Alexei Starovoitov <[email protected]>
Syzkaller triggers an invalid memory access issue following fault
injection in update_effective_progs. The issue can be described as
follows:

__cgroup_bpf_detach
  update_effective_progs
    compute_effective_progs
      bpf_prog_array_alloc <-- fault inject
  purge_effective_progs
    /* change to dummy_bpf_prog */
    array->items[index] = &dummy_bpf_prog.prog

---softirq start---
__do_softirq
  ...
    __cgroup_bpf_run_filter_skb
      __bpf_prog_run_save_cb
        bpf_prog_run
          stats = this_cpu_ptr(prog->stats)
          /* invalid memory access */
          flags = u64_stats_update_begin_irqsave(&stats->syncp)
---softirq end---

  static_branch_dec(&cgroup_bpf_enabled_key[atype])

The reason is that fault injection caused update_effective_progs to fail
and then changed the original prog into dummy_bpf_prog.prog in
purge_effective_progs. Then a softirq came, and accessing the members of
dummy_bpf_prog.prog in the softirq triggers invalid mem access.

To fix it, skip updating stats when stats is NULL.

Fixes: 492ecee ("bpf: enable program stats")
Signed-off-by: Pu Lehui <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
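A hedged sketch of the guard in the stats path
(include/linux/filter.h; surrounding code elided and simplified):

  u64 start = sched_clock();
  struct bpf_prog_stats *stats;
  unsigned long flags;
  u64 ret;

  ret = dfunc(ctx, prog->insnsi, prog->bpf_func);
  if (unlikely(!prog->stats))
          return ret;     /* dummy_bpf_prog carries no stats */
  stats = this_cpu_ptr(prog->stats);
  flags = u64_stats_update_begin_irqsave(&stats->syncp);
  u64_stats_inc(&stats->cnt);
  u64_stats_add(&stats->nsecs, sched_clock() - start);
  u64_stats_update_end_irqrestore(&stats->syncp, flags);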
We can currently optimize uprobes on top of nop5 instructions, so an
application can define USDT_NOP to nop5 and use the USDT macro to
define optimized usdt probes.

This works fine on new kernels, but could have a performance penalty on
older kernels that do not have the support to optimize and emulate the
nop5 instruction.

  execution of the usdt probe on top of nop:
  - nop -> trigger usdt -> emulate nop -> continue

  execution of the usdt probe on top of nop5:
  - nop5 -> trigger usdt -> single step nop5 -> continue

Note the 'single step nop5' as the source of performance regression.

To work around that, we change the USDT macro to emit nop,nop5 for the
probe (instead of the default nop) and record that in the USDT record
(more on that below).

This can be detected by the application (libbpf), which can place the
uprobe either on the nop or the nop5 based on the optimization support
in the kernel.

We record the use of the nop,nop5 instructions in the USDT ELF note
data.

Current elf note format is as follows:

  namesz (4B) | descsz (4B) | type (4B) | name | desc

And the current usdt record (with the "stapsdt" name) placed in the
note's desc data looks like:

  loc_addr  | 8 bytes
  base_addr | 8 bytes
  sema_addr | 8 bytes
  provider  | zero terminated string
  name      | zero terminated string
  args      | zero terminated string

None of the tested parsers (bpftrace-bcc, libbpf) checked that the args
zero-termination byte is the actual end of the 'desc' data. As Andrii
suggested, we can use this and place an extra zero byte right there as
an indication to the parser that we use the nop,nop5 instructions.

It's a bit tricky, but the alternative would be to introduce a new elf
note type or note name and change all existing parsers to recognize it.
With the change above, existing parsers still recognize such usdt
probes.

Note we do not emit this extra byte if the app defined its own nop
through the USDT_NOP macro.

Suggested-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
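On x86, the 5-byte nop is the canonical nopl encoding; a hedged sketch
of what the macro now drops at the probe site (the real usdt.h
plumbing is considerably more involved):

  /* nop: attach point for old kernels; the following 5-byte nop is
   * the attach point when the kernel can optimize uprobes */
  asm volatile ("nop\n"
                ".byte 0x0f, 0x1f, 0x44, 0x00, 0x00\n");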
Adding uprobe syscall feature detection that will be used
in following changes.

Signed-off-by: Jiri Olsa <[email protected]>
Adding support to parse the extra info in the usdt note record that
indicates nop,nop5 was emitted for the probe.

We detect this by checking for an extra zero byte placed between the
args zero-termination byte and the end of the desc data. Please see [1]
for more details.

Together with the uprobe syscall feature detection we can decide
whether to place the probe on top of the nop or the nop5.

[1] https://github.com/libbpf/usdt
Signed-off-by: Jiri Olsa <[email protected]>
Adding a test that attaches a bpf program to a usdt probe in 2
scenarios:

- attach program on top of usdt_1, which is a standard nop probe
  incidentally followed by nop5. The usdt probe does not have extra
  data in the elf note record, so we expect the probe to land on the
  first nop without being optimized.

- attach program on top of usdt_2, which is a probe defined on top of
  the nop,nop5 combo. The extra data in the elf note record and the
  presence of the uprobe syscall ensure that the probe is placed on
  top of the nop5 and optimized.

Signed-off-by: Jiri Olsa <[email protected]>