
Commit 68822bd

edumazet authored and kuba-moo committed
net: generalize skb freeing deferral to per-cpu lists
Logic added in commit f35f821 ("tcp: defer skb freeing after socket lock
is released") helped bulk TCP flows to move the cost of skb frees outside
of the critical section where the socket lock is held.

But for RPC traffic, or hosts with RFS enabled, the solution is far from
ideal.

For RPC traffic, recvmsg() has to return to user space right after the skb
payload has been consumed, meaning that the BH handler has no chance to
pick the skb before the recvmsg() thread. This issue is more visible with
BIG TCP, as more RPCs fit in one skb.

For RFS, even if the BH handler picks the skbs, they are still picked from
the cpu on which the user thread is running.

Ideally, it is better to free the skbs (and associated page frags) on the
cpu that originally allocated them.

This patch removes the per-socket anchor (sk->defer_list) and instead uses
a per-cpu list, which will hold more skbs per round.

This new per-cpu list is drained at the end of net_rx_action(), after
incoming packets have been processed, to lower latencies.

In normal conditions, skbs are added to the per-cpu list with no further
action. In the (unlikely) case where the cpu does not run the
net_rx_action() handler fast enough, we use an IPI to raise NET_RX_SOFTIRQ
on the remote cpu.

Also, we do not bother draining the per-cpu list from dev_cpu_dead(),
because skbs in this list have no requirement on how fast they should be
freed.

Note that we can add a small per-cpu cache in the future if we see any
contention on sd->defer_lock.

Tested on a pair of hosts with 100Gbit NICs, RFS enabled, and
/proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around the page
recycling strategy used by the NIC driver (its page pool capacity being
too small compared to the number of skbs/pages held in socket receive
queues). Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test; these conditions can
happen in more general production workloads.

10 runs of one TCP_STREAM flow

Before:
Average throughput: 49685 Mbit.

Kernel profiles on the cpu running the user thread recvmsg() show high
cost for skb-freeing related functions (*):

    57.81%  [kernel]  [k] copy_user_enhanced_fast_string
(*) 12.87%  [kernel]  [k] skb_release_data
(*)  4.25%  [kernel]  [k] __free_one_page
(*)  3.57%  [kernel]  [k] __list_del_entry_valid
     1.85%  [kernel]  [k] __netif_receive_skb_core
     1.60%  [kernel]  [k] __skb_datagram_iter
(*)  1.59%  [kernel]  [k] free_unref_page_commit
(*)  1.16%  [kernel]  [k] __slab_free
     1.16%  [kernel]  [k] _copy_to_iter
(*)  1.01%  [kernel]  [k] kfree
(*)  0.88%  [kernel]  [k] free_unref_page
     0.57%  [kernel]  [k] ip6_rcv_core
     0.55%  [kernel]  [k] ip6t_do_table
     0.54%  [kernel]  [k] flush_smp_call_function_queue
(*)  0.54%  [kernel]  [k] free_pcppages_bulk
     0.51%  [kernel]  [k] llist_reverse_order
     0.38%  [kernel]  [k] process_backlog
(*)  0.38%  [kernel]  [k] free_pcp_prepare
     0.37%  [kernel]  [k] tcp_recvmsg_locked
(*)  0.37%  [kernel]  [k] __list_add_valid
     0.34%  [kernel]  [k] sock_rfree
     0.34%  [kernel]  [k] _raw_spin_lock_irq
(*)  0.33%  [kernel]  [k] __page_cache_release
     0.33%  [kernel]  [k] tcp_v6_rcv
(*)  0.33%  [kernel]  [k] __put_page
(*)  0.29%  [kernel]  [k] __mod_zone_page_state
     0.27%  [kernel]  [k] _raw_spin_lock

After patch:
Average throughput: 73076 Mbit.

Kernel profiles on the cpu running the user thread recvmsg() look better:

    81.35%  [kernel]  [k] copy_user_enhanced_fast_string
     1.95%  [kernel]  [k] _copy_to_iter
     1.95%  [kernel]  [k] __skb_datagram_iter
     1.27%  [kernel]  [k] __netif_receive_skb_core
     1.03%  [kernel]  [k] ip6t_do_table
     0.60%  [kernel]  [k] sock_rfree
     0.50%  [kernel]  [k] tcp_v6_rcv
     0.47%  [kernel]  [k] ip6_rcv_core
     0.45%  [kernel]  [k] read_tsc
     0.44%  [kernel]  [k] _raw_spin_lock_irqsave
     0.37%  [kernel]  [k] _raw_spin_lock
     0.37%  [kernel]  [k] native_irq_return_iret
     0.33%  [kernel]  [k] __inet6_lookup_established
     0.31%  [kernel]  [k] ip6_protocol_deliver_rcu
     0.29%  [kernel]  [k] tcp_rcv_established
     0.29%  [kernel]  [k] llist_reverse_order

v2: - kdoc issue (kernel bots)
    - do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
    - replace the sk_buff_head with a single-linked list (Jakub)
    - add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list

Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Paolo Abeni <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
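The mechanism is small enough to model outside the kernel. Below is a minimal userspace C analogue (an editor's illustration, not part of the patch): a consumer thread hands buffers back through the owner's lock-protected, singly linked defer list, and the owner later splices the whole list under the lock and frees outside of it, mirroring skb_attempt_defer_free()/skb_defer_free_flush(). All identifiers here are invented for the demo; the real patch additionally kicks the remote cpu with an IPI once the list reaches 128 entries.

/* Userspace analogue only -- build with: cc -pthread demo.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct defer_node {
        struct defer_node *next;
        char payload[256];
};

struct deferred_pool {
        pthread_mutex_t lock;
        struct defer_node *defer_list;  /* singly linked, newest first */
        int defer_count;
};

static struct deferred_pool pool = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Consumer side: push the buffer back onto the owner's defer list
 * (analogous to skb_attempt_defer_free()).
 */
static void attempt_defer_free(struct defer_node *node)
{
        pthread_mutex_lock(&pool.lock);
        node->next = pool.defer_list;
        pool.defer_list = node;
        pool.defer_count++;
        pthread_mutex_unlock(&pool.lock);
}

/* Owner side: splice the whole list under the lock, free outside of it
 * (analogous to skb_defer_free_flush()).
 */
static void defer_free_flush(void)
{
        struct defer_node *node, *next;

        pthread_mutex_lock(&pool.lock);
        node = pool.defer_list;
        pool.defer_list = NULL;
        pool.defer_count = 0;
        pthread_mutex_unlock(&pool.lock);

        while (node) {
                next = node->next;
                free(node);
                node = next;
        }
}

#define NBUF 1000
static struct defer_node *bufs[NBUF];

static void *consumer(void *arg)
{
        (void)arg;
        /* Consume buffers the owner allocated, then defer their freeing
         * back to the owner instead of calling free() here.
         */
        for (int i = 0; i < NBUF; i++)
                attempt_defer_free(bufs[i]);
        return NULL;
}

int main(void)
{
        pthread_t t;

        for (int i = 0; i < NBUF; i++)          /* owner allocates */
                bufs[i] = calloc(1, sizeof(*bufs[i]));

        pthread_create(&t, NULL, consumer, NULL);
        pthread_join(t, NULL);

        defer_free_flush();                     /* owner frees in one batch */
        printf("owner drained its deferred list\n");
        return 0;
}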
1 parent 5612154 commit 68822bd

File tree: 11 files changed (+90, −46 lines)


include/linux/netdevice.h

Lines changed: 5 additions & 0 deletions
@@ -3081,6 +3081,11 @@ struct softnet_data {
         struct sk_buff_head     input_pkt_queue;
         struct napi_struct      backlog;
 
+        /* Another possibly contended cache line */
+        spinlock_t              defer_lock ____cacheline_aligned_in_smp;
+        int                     defer_count;
+        struct sk_buff          *defer_list;
+        call_single_data_t      defer_csd;
 };
 
 static inline void input_queue_head_incr(struct softnet_data *sd)
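For orientation, the roles of the new softnet_data fields as used by the rest of this patch (the summary comments below are the editor's, not from the patch):

/*
 * defer_lock  - protects defer_list and defer_count; taken by remote cpus
 *               in skb_attempt_defer_free() and locally in skb_defer_free_flush()
 * defer_count - number of queued skbs; an IPI kick is sent when it reaches 128
 * defer_list  - singly linked list of skbs to be freed on this cpu
 * defer_csd   - call_single_data used to raise NET_RX_SOFTIRQ on this cpu remotely
 */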

include/linux/skbuff.h

Lines changed: 3 additions & 0 deletions
@@ -888,6 +888,7 @@ typedef unsigned char *sk_buff_data_t;
  *      delivery_time at egress.
  *      @napi_id: id of the NAPI struct this skb came from
  *      @sender_cpu: (aka @napi_id) source CPU in XPS
+ *      @alloc_cpu: CPU which did the skb allocation.
  *      @secmark: security marking
  *      @mark: Generic packet mark
  *      @reserved_tailroom: (aka @mark) number of bytes of free space available
@@ -1080,6 +1081,7 @@ struct sk_buff {
                 unsigned int    sender_cpu;
         };
 #endif
+        u16                     alloc_cpu;
 #ifdef CONFIG_NETWORK_SECMARK
         __u32           secmark;
 #endif
@@ -1321,6 +1323,7 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size);
 struct sk_buff *build_skb(void *data, unsigned int frag_size);
 struct sk_buff *build_skb_around(struct sk_buff *skb,
                                  void *data, unsigned int frag_size);
+void skb_attempt_defer_free(struct sk_buff *skb);
 
 struct sk_buff *napi_build_skb(void *data, unsigned int frag_size);
 
include/net/sock.h

Lines changed: 0 additions & 2 deletions
@@ -292,7 +292,6 @@ struct sk_filter;
   *     @sk_pacing_shift: scaling factor for TCP Small Queues
   *     @sk_lingertime: %SO_LINGER l_linger setting
   *     @sk_backlog: always used with the per-socket spinlock held
-  *     @defer_list: head of llist storing skbs to be freed
   *     @sk_callback_lock: used with the callbacks in the end of this struct
   *     @sk_error_queue: rarely used
   *     @sk_prot_creator: sk_prot of original sock creator (see ipv6_setsockopt,
@@ -417,7 +416,6 @@ struct sock {
                 struct sk_buff  *head;
                 struct sk_buff  *tail;
         } sk_backlog;
-        struct llist_head       defer_list;
 
 #define sk_rmem_alloc sk_backlog.rmem_alloc
 
include/net/tcp.h

Lines changed: 0 additions & 12 deletions
@@ -1375,18 +1375,6 @@ static inline bool tcp_checksum_complete(struct sk_buff *skb)
 bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
                      enum skb_drop_reason *reason);
 
-#ifdef CONFIG_INET
-void __sk_defer_free_flush(struct sock *sk);
-
-static inline void sk_defer_free_flush(struct sock *sk)
-{
-        if (llist_empty(&sk->defer_list))
-                return;
-        __sk_defer_free_flush(sk);
-}
-#else
-static inline void sk_defer_free_flush(struct sock *sk) {}
-#endif
 
 int tcp_filter(struct sock *sk, struct sk_buff *skb);
 void tcp_set_state(struct sock *sk, int state);

net/core/dev.c

Lines changed: 31 additions & 0 deletions
@@ -4545,6 +4545,12 @@ static void rps_trigger_softirq(void *data)
 
 #endif /* CONFIG_RPS */
 
+/* Called from hardirq (IPI) context */
+static void trigger_rx_softirq(void *data __always_unused)
+{
+        __raise_softirq_irqoff(NET_RX_SOFTIRQ);
+}
+
 /*
  * Check if this softnet_data structure is another cpu one
  * If yes, queue it to our IPI list and return 1
@@ -6571,6 +6577,28 @@ static int napi_threaded_poll(void *data)
         return 0;
 }
 
+static void skb_defer_free_flush(struct softnet_data *sd)
+{
+        struct sk_buff *skb, *next;
+        unsigned long flags;
+
+        /* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
+        if (!READ_ONCE(sd->defer_list))
+                return;
+
+        spin_lock_irqsave(&sd->defer_lock, flags);
+        skb = sd->defer_list;
+        sd->defer_list = NULL;
+        sd->defer_count = 0;
+        spin_unlock_irqrestore(&sd->defer_lock, flags);
+
+        while (skb != NULL) {
+                next = skb->next;
+                __kfree_skb(skb);
+                skb = next;
+        }
+}
+
 static __latent_entropy void net_rx_action(struct softirq_action *h)
 {
         struct softnet_data *sd = this_cpu_ptr(&softnet_data);
@@ -6616,6 +6644,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
                 __raise_softirq_irqoff(NET_RX_SOFTIRQ);
 
         net_rps_action_and_irq_enable(sd);
+        skb_defer_free_flush(sd);
 }
 
 struct netdev_adjacent {
@@ -11326,6 +11355,8 @@ static int __init net_dev_init(void)
                 INIT_CSD(&sd->csd, rps_trigger_softirq, sd);
                 sd->cpu = i;
 #endif
+                INIT_CSD(&sd->defer_csd, trigger_rx_softirq, NULL);
+                spin_lock_init(&sd->defer_lock);
 
                 init_gro_hash(&sd->backlog);
                 sd->backlog.poll = process_backlog;

net/core/skbuff.c

Lines changed: 50 additions & 1 deletion
@@ -204,7 +204,7 @@ static void __build_skb_around(struct sk_buff *skb, void *data,
         skb_set_end_offset(skb, size);
         skb->mac_header = (typeof(skb->mac_header))~0U;
         skb->transport_header = (typeof(skb->transport_header))~0U;
-
+        skb->alloc_cpu = raw_smp_processor_id();
         /* make sure we initialize shinfo sequentially */
         shinfo = skb_shinfo(skb);
         memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
@@ -1037,6 +1037,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 #ifdef CONFIG_NET_RX_BUSY_POLL
         CHECK_SKB_FIELD(napi_id);
 #endif
+        CHECK_SKB_FIELD(alloc_cpu);
 #ifdef CONFIG_XPS
         CHECK_SKB_FIELD(sender_cpu);
 #endif
@@ -6486,3 +6487,51 @@ void __skb_ext_put(struct skb_ext *ext)
 }
 EXPORT_SYMBOL(__skb_ext_put);
 #endif /* CONFIG_SKB_EXTENSIONS */
+
+/**
+ * skb_attempt_defer_free - queue skb for remote freeing
+ * @skb: buffer
+ *
+ * Put @skb in a per-cpu list, using the cpu which
+ * allocated the skb/pages to reduce false sharing
+ * and memory zone spinlock contention.
+ */
+void skb_attempt_defer_free(struct sk_buff *skb)
+{
+        int cpu = skb->alloc_cpu;
+        struct softnet_data *sd;
+        unsigned long flags;
+        bool kick;
+
+        if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
+            !cpu_online(cpu) ||
+            cpu == raw_smp_processor_id()) {
+                __kfree_skb(skb);
+                return;
+        }
+
+        sd = &per_cpu(softnet_data, cpu);
+        /* We do not send an IPI or any signal.
+         * Remote cpu will eventually call skb_defer_free_flush()
+         */
+        spin_lock_irqsave(&sd->defer_lock, flags);
+        skb->next = sd->defer_list;
+        /* Paired with READ_ONCE() in skb_defer_free_flush() */
+        WRITE_ONCE(sd->defer_list, skb);
+        sd->defer_count++;
+
+        /* kick every time queue length reaches 128.
+         * This should avoid blocking in smp_call_function_single_async().
+         * This condition should hardly be bit under normal conditions,
+         * unless cpu suddenly stopped to receive NIC interrupts.
+         */
+        kick = sd->defer_count == 128;
+
+        spin_unlock_irqrestore(&sd->defer_lock, flags);
+
+        /* Make sure to trigger NET_RX_SOFTIRQ on the remote CPU
+         * if we are unlucky enough (this seems very unlikely).
+         */
+        if (unlikely(kick))
+                smp_call_function_single_async(cpu, &sd->defer_csd);
+}
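Callers simply hand a fully consumed receive-queue skb to the new helper instead of freeing it locally. A kernel-context sketch (not buildable on its own; it mirrors the tcp_eat_recv_skb() change in the net/ipv4/tcp.c hunk below, with a hypothetical proto_eat_recv_skb() name):

/* Sketch only: mirrors the tcp_eat_recv_skb() change below. */
static void proto_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
{
        __skb_unlink(skb, &sk->sk_receive_queue);
        if (likely(skb->destructor == sock_rfree)) {
                sock_rfree(skb);                /* uncharge sk_rmem_alloc */
                skb->destructor = NULL;
                skb->sk = NULL;
                skb_attempt_defer_free(skb);    /* freed later on skb->alloc_cpu */
                return;
        }
        __kfree_skb(skb);
}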

net/core/sock.c

Lines changed: 0 additions & 3 deletions
@@ -2082,9 +2082,6 @@ void sk_destruct(struct sock *sk)
 {
         bool use_call_rcu = sock_flag(sk, SOCK_RCU_FREE);
 
-        WARN_ON_ONCE(!llist_empty(&sk->defer_list));
-        sk_defer_free_flush(sk);
-
         if (rcu_access_pointer(sk->sk_reuseport_cb)) {
                 reuseport_detach_sock(sk);
                 use_call_rcu = true;
net/ipv4/tcp.c

Lines changed: 1 addition & 24 deletions
@@ -843,7 +843,6 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
         }
 
         release_sock(sk);
-        sk_defer_free_flush(sk);
 
         if (spliced)
                 return spliced;
@@ -1589,32 +1588,14 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
                 tcp_send_ack(sk);
 }
 
-void __sk_defer_free_flush(struct sock *sk)
-{
-        struct llist_node *head;
-        struct sk_buff *skb, *n;
-
-        head = llist_del_all(&sk->defer_list);
-        llist_for_each_entry_safe(skb, n, head, ll_node) {
-                prefetch(n);
-                skb_mark_not_on_list(skb);
-                __kfree_skb(skb);
-        }
-}
-EXPORT_SYMBOL(__sk_defer_free_flush);
-
 static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
 {
         __skb_unlink(skb, &sk->sk_receive_queue);
         if (likely(skb->destructor == sock_rfree)) {
                 sock_rfree(skb);
                 skb->destructor = NULL;
                 skb->sk = NULL;
-                if (!skb_queue_empty(&sk->sk_receive_queue) ||
-                    !llist_empty(&sk->defer_list)) {
-                        llist_add(&skb->ll_node, &sk->defer_list);
-                        return;
-                }
+                return skb_attempt_defer_free(skb);
         }
         __kfree_skb(skb);
 }
@@ -2453,7 +2434,6 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
                         __sk_flush_backlog(sk);
                 } else {
                         tcp_cleanup_rbuf(sk, copied);
-                        sk_defer_free_flush(sk);
                         sk_wait_data(sk, &timeo, last);
                 }
 
@@ -2571,7 +2551,6 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags,
         lock_sock(sk);
         ret = tcp_recvmsg_locked(sk, msg, len, flags, &tss, &cmsg_flags);
         release_sock(sk);
-        sk_defer_free_flush(sk);
 
         if (cmsg_flags && ret >= 0) {
                 if (cmsg_flags & TCP_CMSG_TS)
@@ -3096,7 +3075,6 @@ int tcp_disconnect(struct sock *sk, int flags)
                 sk->sk_frag.page = NULL;
                 sk->sk_frag.offset = 0;
         }
-        sk_defer_free_flush(sk);
         sk_error_report(sk);
         return 0;
 }
@@ -4225,7 +4203,6 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
                 err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
                                                           &zc, &len, err);
                 release_sock(sk);
-                sk_defer_free_flush(sk);
                 if (len >= offsetofend(struct tcp_zerocopy_receive, msg_flags))
                         goto zerocopy_rcv_cmsg;
                 switch (len) {

net/ipv4/tcp_ipv4.c

Lines changed: 0 additions & 1 deletion
@@ -2065,7 +2065,6 @@ int tcp_v4_rcv(struct sk_buff *skb)
 
         sk_incoming_cpu_update(sk);
 
-        sk_defer_free_flush(sk);
         bh_lock_sock_nested(sk);
         tcp_segs_in(tcp_sk(sk), skb);
         ret = 0;

net/ipv6/tcp_ipv6.c

Lines changed: 0 additions & 1 deletion
@@ -1728,7 +1728,6 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 
         sk_incoming_cpu_update(sk);
 
-        sk_defer_free_flush(sk);
         bh_lock_sock_nested(sk);
         tcp_segs_in(tcp_sk(sk), skb);
         ret = 0;
