-
Notifications
You must be signed in to change notification settings - Fork 149
New netdev feature flags for XDP #350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Master branch: 024cd2c |
Master branch: b93ef08 |
b4a98ad
to
afe5756
Compare
Implement support for checking if a netdev has XDP and AF_XDP zero copy support. Previously, there was no way to do this other than to try to create an AF_XDP socket on the interface or load an XDP program and see if it worked. This commit changes this by extending existing netdev_features in the following way: * xdp - full XDP support (XDP_{TX, PASS, DROP, ABORT, REDIRECT}) * af-xdp-zc - AF_XDP zero copy support By default these new flags are disabled for all drivers. $ ethtool -k enp1s0f0 Features for enp1s0f0: .. xdp: off [fixed] af-xdp-zc: off [fixed] Signed-off-by: Marek Majtyka <[email protected]>
Turn 'xdp' and 'af-xdp-zc' feature flags on for: - i40e - ice - ixgbe - mlx5. Turn 'xdp' feature flag on for: - igb - tun - veth - dpaa2 - mvneta - mvpp2 - qede - sfc - netsec - cpsw - xen - virtio_net. The first group of NICs is currently visible with ethtool as: $ ethtool -k enp1s0f0 Features for enp1s0f0: .. xdp: on [fixed] af-xdp-zc: on [fixed] whereas for the second group output is as: $ ethtool -k enp1s0f0 Features for enp1s0f0: .. xdp: on [fixed] af-xdp-zc: off [fixed] Signed-off-by: Marek Majtyka <[email protected]>
Change necessary condition check for XSK from ndo functions to netdev_features flags. Signed-off-by: Marek Majtyka <[email protected]>
Add check for full support of XDP in AF_XDP socket bind. To be able to use an AF_XDP socket with zero-copy, there needs to be support for both XDP_REDIRECT in the driver (XDP native mode) and the driver needs to support zero-copy. The problem is that there are drivers out there that only support XDP partially, so it is possible to successfully load the XDP program in native mode, but it will still not be able to support zero-copy as it does not have XDP_REDIRECT support. We can now alleviate this problem by using the new XDP netdev capability that signifies if full XDP support is indeed present. This check can be triggered by a new bind flag called XDP_CHECK_NATIVE_MODE. To simplify usage, this check is triggered automatically from inside libbpf library via turning on the new XDP_CHECK_NATIVE_MODE flag if and only if the driver mode is selected for the socket. As a result, the xsk_bind function decides if the native mode for a given interface makes sense or not using xdp netdev feature flags. Eventually the xsk socket is bound or an error is returned. Apart from this change and to catch all invalid inputs in a single place, an additional check is set to forbid sbk mode and zero copy settings at the same time as that combination makes no sense. Signed-off-by: Marek Majtyka <[email protected]>
Extend netlink attribute API to put a different attribute into the netlink message (nest{start, end}, string, u32, flag, etc). Add new API to parse attribute array. Signed-off-by: Marek Majtyka <[email protected]>
Add functions to get XDP/XSK modes from netdev feature flags over netlink ethtool family interface. These functions provide functionalities that are going to be used in upcoming changes together constituting new libbpf public API function which informs about key xsk capabilities of given network interface. Signed-off-by: Marek Majtyka <[email protected]>
Add public xsk API to read supported XDP functions of a netdev: - XDP driver mode (SKB, DRV), - XSK bind mode (COPY, ZEROCOPY). Signed-off-by: Marek Majtyka <[email protected]>
Update xdpsock sample so that it utilizes netlink ethtool interface to get available XDP/XSK modes. This allows to automatically choose the best available mode of operation, if these are not provided explicitly. Signed-off-by: Marek Majtyka <[email protected]>
Master branch: de91e63 |
afe5756
to
07ae865
Compare
At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=384857 expired. Closing PR. |
Add a test case to assert that the skb->pkt_type which was set from the BPF program is retained from the netkit xmit side to the peer's device at tcx ingress location. # ./vmtest.sh -- ./test_progs -t netkit [...] ./test_progs -t netkit [ 1.140780] bpf_testmod: loading out-of-tree module taints kernel. [ 1.141127] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel [ 1.284601] tsc: Refined TSC clocksource calibration: 3408.006 MHz [ 1.286672] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fd9b189d, max_idle_ns: 440795225691 ns [ 1.290384] clocksource: Switched to clocksource tsc #345 tc_netkit_basic:OK #346 tc_netkit_device:OK #347 tc_netkit_multi_links:OK #348 tc_netkit_multi_opts:OK #349 tc_netkit_neigh_links:OK #350 tc_netkit_pkt_type:OK Summary: 6/0 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Daniel Borkmann <[email protected]>
Add a test case to assert that the skb->pkt_type which was set from the BPF program is retained from the netkit xmit side to the peer's device at tcx ingress location. # ./vmtest.sh -- ./test_progs -t netkit [...] ./test_progs -t netkit [ 1.140780] bpf_testmod: loading out-of-tree module taints kernel. [ 1.141127] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel [ 1.284601] tsc: Refined TSC clocksource calibration: 3408.006 MHz [ 1.286672] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fd9b189d, max_idle_ns: 440795225691 ns [ 1.290384] clocksource: Switched to clocksource tsc #345 tc_netkit_basic:OK #346 tc_netkit_device:OK #347 tc_netkit_multi_links:OK #348 tc_netkit_multi_opts:OK #349 tc_netkit_neigh_links:OK #350 tc_netkit_pkt_type:OK Summary: 6/0 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Daniel Borkmann <[email protected]>
Add a test case to assert that the skb->pkt_type which was set from the BPF program is retained from the netkit xmit side to the peer's device at tcx ingress location. # ./vmtest.sh -- ./test_progs -t netkit [...] ./test_progs -t netkit [ 1.140780] bpf_testmod: loading out-of-tree module taints kernel. [ 1.141127] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel [ 1.284601] tsc: Refined TSC clocksource calibration: 3408.006 MHz [ 1.286672] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fd9b189d, max_idle_ns: 440795225691 ns [ 1.290384] clocksource: Switched to clocksource tsc #345 tc_netkit_basic:OK #346 tc_netkit_device:OK #347 tc_netkit_multi_links:OK #348 tc_netkit_multi_opts:OK #349 tc_netkit_neigh_links:OK #350 tc_netkit_pkt_type:OK Summary: 6/0 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
While working on the lazy MMU mode enablement for s390 I hit pretty curious issues in the kasan code. The first is related to a custom kasan-based sanitizer aimed at catching invalid accesses to PTEs and is inspired by [1] conversation. The kasan complains on valid PTE accesses, while the shadow memory is reported as unpoisoned: [ 102.783993] ================================================================== [ 102.784008] BUG: KASAN: out-of-bounds in set_pte_range+0x36c/0x390 [ 102.784016] Read of size 8 at addr 0000780084cf9608 by task vmalloc_test/0/5542 [ 102.784019] [ 102.784040] CPU: 1 UID: 0 PID: 5542 Comm: vmalloc_test/0 Kdump: loaded Tainted: G OE 6.16.0-gcc-ipte-kasan-11657-gb2d930c4950e kernel-patches#340 PREEMPT [ 102.784047] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE [ 102.784049] Hardware name: IBM 8561 T01 703 (LPAR) [ 102.784052] Call Trace: [ 102.784054] [<00007fffe0147ac0>] dump_stack_lvl+0xe8/0x140 [ 102.784059] [<00007fffe0112484>] print_address_description.constprop.0+0x34/0x2d0 [ 102.784066] [<00007fffe011282c>] print_report+0x10c/0x1f8 [ 102.784071] [<00007fffe090785a>] kasan_report+0xfa/0x220 [ 102.784078] [<00007fffe01d3dec>] set_pte_range+0x36c/0x390 [ 102.784083] [<00007fffe01d41c2>] leave_ipte_batch+0x3b2/0xb10 [ 102.784088] [<00007fffe07d3650>] apply_to_pte_range+0x2f0/0x4e0 [ 102.784094] [<00007fffe07e62e4>] apply_to_pmd_range+0x194/0x3e0 [ 102.784099] [<00007fffe07e820e>] __apply_to_page_range+0x2fe/0x7a0 [ 102.784104] [<00007fffe07e86d8>] apply_to_page_range+0x28/0x40 [ 102.784109] [<00007fffe090a3ec>] __kasan_populate_vmalloc+0xec/0x310 [ 102.784114] [<00007fffe090aa36>] kasan_populate_vmalloc+0x96/0x130 [ 102.784118] [<00007fffe0833a04>] alloc_vmap_area+0x3d4/0xf30 [ 102.784123] [<00007fffe083a8ba>] __get_vm_area_node+0x1aa/0x4c0 [ 102.784127] [<00007fffe083c4f6>] __vmalloc_node_range_noprof+0x126/0x4e0 [ 102.784131] [<00007fffe083c980>] __vmalloc_node_noprof+0xd0/0x110 [ 102.784135] [<00007fffe083ca32>] vmalloc_noprof+0x32/0x40 [ 102.784139] [<00007fff608aa336>] fix_size_alloc_test+0x66/0x150 [test_vmalloc] [ 102.784147] [<00007fff608aa710>] test_func+0x2f0/0x430 [test_vmalloc] [ 102.784153] [<00007fffe02841f8>] kthread+0x3f8/0x7a0 [ 102.784159] [<00007fffe014d8b4>] __ret_from_fork+0xd4/0x7d0 [ 102.784164] [<00007fffe299c00a>] ret_from_fork+0xa/0x30 [ 102.784173] no locks held by vmalloc_test/0/5542. [ 102.784176] [ 102.784178] The buggy address belongs to the physical page: [ 102.784186] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x84cf9 [ 102.784198] flags: 0x3ffff00000000000(node=0|zone=1|lastcpupid=0x1ffff) [ 102.784212] page_type: f2(table) [ 102.784225] raw: 3ffff00000000000 0000000000000000 0000000000000122 0000000000000000 [ 102.784234] raw: 0000000000000000 0000000000000000 f200000000000001 0000000000000000 [ 102.784248] page dumped because: kasan: bad access detected [ 102.784250] [ 102.784252] Memory state around the buggy address: [ 102.784260] 0000780084cf9500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 102.784274] 0000780084cf9580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 102.784277] >0000780084cf9600: fd 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 102.784290] ^ [ 102.784293] 0000780084cf9680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 102.784303] 0000780084cf9700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 102.784306] ================================================================== The second issue hits when the custom sanitizer above is not implemented, but the kasan itself is still active: [ 1554.438028] Unable to handle kernel pointer dereference in virtual kernel address space [ 1554.438065] Failing address: 001c0ff0066f0000 TEID: 001c0ff0066f0403 [ 1554.438076] Fault in home space mode while using kernel ASCE. [ 1554.438103] AS:00000000059d400b R2:0000000ffec5c00b R3:00000000c6c9c007 S:0000000314470001 P:00000000d0ab413d [ 1554.438158] Oops: 0011 ilc:2 [kernel-patches#1]SMP [ 1554.438175] Modules linked in: test_vmalloc(E+) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nf_tables(E) sunrpc(E) pkey_pckmo(E) uvdevice(E) s390_trng(E) rng_core(E) eadm_sch(E) vfio_ccw(E) mdev(E) vfio_iommu_type1(E) vfio(E) sch_fq_codel(E) drm(E) loop(E) i2c_core(E) drm_panel_orientation_quirks(E) nfnetlink(E) ctcm(E) fsm(E) zfcp(E) scsi_transport_fc(E) diag288_wdt(E) watchdog(E) ghash_s390(E) prng(E) aes_s390(E) des_s390(E) libdes(E) sha3_512_s390(E) sha3_256_s390(E) sha512_s390(E) sha1_s390(E) sha_common(E) pkey(E) autofs4(E) [ 1554.438319] Unloaded tainted modules: pkey_uv(E):1 hmac_s390(E):2 [ 1554.438354] CPU: 1 UID: 0 PID: 1715 Comm: vmalloc_test/0 Kdump: loaded Tainted: G E 6.16.0-gcc-ipte-kasan-11657-gb2d930c4950e kernel-patches#350 PREEMPT [ 1554.438368] Tainted: [E]=UNSIGNED_MODULE [ 1554.438374] Hardware name: IBM 8561 T01 703 (LPAR) [ 1554.438381] Krnl PSW : 0704e00180000000 00007fffe1d3d6ae (memset+0x5e/0x98) [ 1554.438396] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3 [ 1554.438409] Krnl GPRS: 0000000000000001 001c0ff0066f0000 001c0ff0066f0000 00000000000000f8 [ 1554.438418] 00000000000009fe 0000000000000009 0000000000000000 0000000000000002 [ 1554.438426] 0000000000005000 000078031ae655c8 00000feffdcf9f59 0000780258672a20 [ 1554.438433] 0000780243153500 00007f8033780000 00007fffe083a510 00007f7fee7cfa00 [ 1554.438452] Krnl Code: 00007fffe1d3d6a0: eb540008000c srlg %r5,%r4,8 00007fffe1d3d6a6: b9020055 ltgr %r5,%r5 #00007fffe1d3d6aa: a784000b brc 8,00007fffe1d3d6c0 >00007fffe1d3d6ae: 42301000 stc %r3,0(%r1) 00007fffe1d3d6b2: d2fe10011000 mvc 1(255,%r1),0(%r1) 00007fffe1d3d6b8: 41101100 la %r1,256(%r1) 00007fffe1d3d6bc: a757fff9 brctg %r5,00007fffe1d3d6ae 00007fffe1d3d6c0: 42301000 stc %r3,0(%r1) [ 1554.438539] Call Trace: [ 1554.438545] [<00007fffe1d3d6ae>] memset+0x5e/0x98 [ 1554.438552] ([<00007fffe083a510>] remove_vm_area+0x220/0x400) [ 1554.438562] [<00007fffe083a9d6>] vfree.part.0+0x26/0x810 [ 1554.438569] [<00007fff6073bd50>] fix_align_alloc_test+0x50/0x90 [test_vmalloc] [ 1554.438583] [<00007fff6073c73a>] test_func+0x46a/0x6c0 [test_vmalloc] [ 1554.438593] [<00007fffe0283ac8>] kthread+0x3f8/0x7a0 [ 1554.438603] [<00007fffe014d8b4>] __ret_from_fork+0xd4/0x7d0 [ 1554.438613] [<00007fffe299ac0a>] ret_from_fork+0xa/0x30 [ 1554.438622] INFO: lockdep is turned off. [ 1554.438627] Last Breaking-Event-Address: [ 1554.438632] [<00007fffe1d3d65c>] memset+0xc/0x98 [ 1554.438644] Kernel panic - not syncing: Fatal exception: panic_on_oops This series fixes the above issues and is a pre-requisite for the s390 lazy MMU mode implementation. test_vmalloc was used to stress-test the fixes. This patch (of 2): When vmalloc shadow memory is established the modification of the corresponding page tables is not protected by any locks. Instead, the locking is done per-PTE. This scheme however has defects. kasan_populate_vmalloc_pte() - while ptep_get() read is atomic the sequence pte_none(ptep_get()) is not. Doing that outside of the lock might lead to a concurrent PTE update and what could be seen as a shadow memory corruption as result. kasan_depopulate_vmalloc_pte() - by the time a page whose address was extracted from ptep_get() read and cached in a local variable outside of the lock is attempted to get free, could actually be freed already. To avoid these put ptep_get() itself and the code that manipulates the result of the read under lock. In addition, move freeing of the page out of the atomic context. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/adb258634194593db294c0d1fb35646e894d6ead.1755528662.git.agordeev@linux.ibm.com Link: https://lore.kernel.org/linux-mm/[email protected]/ [1] Fixes: 3c5c3cf ("kasan: support backing vmalloc space with real shadow memory") Signed-off-by: Alexander Gordeev <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Daniel Axtens <[email protected]> Cc: Marc Rutland <[email protected]> Cc: Ryan Roberts <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Pull request for series with
subject: New netdev feature flags for XDP
version: 1
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=384857