
Commit 61552d2

Merge branch 'net-batched-receive-in-GRO-path'
Edward Cree says:

====================
net: batched receive in GRO path

This series listifies part of GRO processing, in a manner which allows those packets which are not GROed (i.e. for which dev_gro_receive returns GRO_NORMAL) to be passed on to the listified regular receive path. dev_gro_receive() itself is not listified, nor is the per-protocol GRO callback, since GRO's need to hold packets on lists under napi->gro_hash makes keeping the packets on other lists awkward, and since the GRO control block state of held skbs can refer only to one 'new' skb at a time. Instead, when napi_frags_finish() handles a GRO_NORMAL result, stash the skb onto a list in the napi struct, which is received at the end of the napi poll or when its length exceeds the (new) sysctl net.core.gro_normal_batch.

Performance figures with this series were collected on a back-to-back pair of Solarflare sfn8522-r2 NICs with 120-second NetPerf tests. In the stats, sample size n for old and new code is 6 runs each; p is from a Welch t-test. Tests were run both with GRO enabled and disabled, the latter simulating uncoalesceable packets (e.g. due to IP or TCP options). The receive side (which was the device under test) had the NetPerf process pinned to one CPU, and the device interrupts pinned to a second CPU. CPU utilisation figures (used in cases of line-rate performance) are summed across all CPUs. net.core.gro_normal_batch was left at its default value of 8.

TCP 4 streams, GRO on: all results line rate (9.415 Gbps)
  net-next: 210.3% cpu
  after #1: 181.5% cpu (-13.7%, p=0.031 vs net-next)
  after #3: 196.7% cpu (- 8.4%, p=0.136 vs net-next)
TCP 4 streams, GRO off:
  net-next: 8.017 Gbps
  after #1: 7.785 Gbps (- 2.9%, p=0.385 vs net-next)
  after #3: 7.604 Gbps (- 5.1%, p=0.282 vs net-next; but note *)
TCP 1 stream, GRO off:
  net-next: 6.553 Gbps
  after #1: 6.444 Gbps (- 1.7%, p=0.302 vs net-next)
  after #3: 6.790 Gbps (+ 3.6%, p=0.169 vs net-next)
TCP 1 stream, GRO on, busy_read = 50: all results line rate
  net-next: 156.0% cpu
  after #1: 174.5% cpu (+11.9%, p=0.015 vs net-next)
  after #3: 165.0% cpu (+ 5.8%, p=0.147 vs net-next)
TCP 1 stream, GRO off, busy_read = 50:
  net-next: 6.488 Gbps
  after #1: 6.625 Gbps (+ 2.1%, p=0.059 vs net-next)
  after #3: 7.351 Gbps (+13.3%, p=0.026 vs net-next)
TCP_RR 100 streams, GRO off, 8000 byte payload:
  net-next: 995.083 us
  after #1: 969.167 us (- 2.6%, p=0.204 vs net-next)
  after #3: 976.433 us (- 1.9%, p=0.254 vs net-next)
TCP_RR 100 streams, GRO off, 8000 byte payload, busy_read = 50:
  net-next: 2.851 ms
  after #1: 2.871 ms (+ 0.7%, p=0.134 vs net-next)
  after #3: 2.937 ms (+ 3.0%, p<0.001 vs net-next)
TCP_RR 100 streams, GRO off, 1 byte payload, busy_read = 50:
  net-next: 867.317 us
  after #1: 865.717 us (- 0.2%, p=0.334 vs net-next)
  after #3: 868.517 us (+ 0.1%, p=0.414 vs net-next)

(*) These tests produced a mixture of line-rate and below-line-rate results, meaning that statistically speaking the results were 'censored' by the upper bound, and were thus not normally distributed, making a Welch t-test mathematically invalid. I therefore also calculated estimators according to [1], which gave the following:
  net-next: 8.133 Gbps
  after #1: 8.130 Gbps (- 0.0%, p=0.499 vs net-next)
  after #3: 7.680 Gbps (- 5.6%, p=0.285 vs net-next)
(though my procedure for determining ν wasn't mathematically well-founded either, so take that p-value with a grain of salt).

A further check came from dividing the bandwidth figure by the CPU usage for each test run, giving:
  net-next: 3.461
  after #1: 3.198 (- 7.6%, p=0.145 vs net-next)
  after #3: 3.641 (+ 5.2%, p=0.280 vs net-next)

The above results are fairly mixed, and in most cases not statistically significant. But I think we can roughly conclude that the series marginally improves non-GROable throughput, without hurting latency (except in the large-payload busy-polling case, which in any case yields horrid performance even on net-next (almost triple the latency without busy-poll)). Also, drivers which, unlike sfc, pass UDP traffic to GRO would expect to see a benefit from gaining access to batching.

Changed in v3:
 * gro_normal_batch sysctl now uses SYSCTL_ONE instead of &one
 * removed RFC tags (no comments after a week means no-one objects, right?)

Changed in v2:
 * During busy poll, call gro_normal_list() to receive batched packets after each cycle of the napi busy loop. See comments in Patch #3 for complications of doing the same in busy_poll_stop().

[1]: Cohen 1959, doi: 10.1080/00401706.1959.10489859
====================

Signed-off-by: David S. Miller <[email protected]>
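For context, here is an illustrative sketch (not code from this series) of how a NAPI driver on the napi_gro_frags() receive path, as sfc is, interacts with the new batching; no driver changes are needed. The struct foo_rx_queue layout and the foo_rx_pending()/foo_rx_next() helpers are invented for illustration, the len/truesize accounting is simplified, and only napi_get_frags(), skb_fill_page_desc(), napi_gro_frags() and napi_complete_done() are real kernel APIs here.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical per-queue state, for illustration only. */
struct foo_rx_queue {
	struct napi_struct napi;
	struct page *page;
	unsigned int offset;
	unsigned int len;
};

/* Imaginary driver helpers, not kernel APIs. */
static bool foo_rx_pending(struct foo_rx_queue *rxq);
static void foo_rx_next(struct foo_rx_queue *rxq);

static int foo_poll(struct napi_struct *napi, int budget)
{
	struct foo_rx_queue *rxq = container_of(napi, struct foo_rx_queue, napi);
	int work_done = 0;

	while (work_done < budget && foo_rx_pending(rxq)) {
		struct sk_buff *skb = napi_get_frags(napi);

		if (unlikely(!skb))
			break;

		/* Attach the received page fragment and account for it. */
		skb_fill_page_desc(skb, 0, rxq->page, rxq->offset, rxq->len);
		skb->len += rxq->len;
		skb->data_len += rxq->len;
		skb->truesize += rxq->len;

		/* If dev_gro_receive() returns GRO_NORMAL, napi_frags_finish()
		 * now stashes the skb on napi->rx_list; the batch is passed up
		 * once it reaches net.core.gro_normal_batch skbs.
		 */
		napi_gro_frags(napi);

		foo_rx_next(rxq);
		work_done++;
	}

	/* Completing the poll also flushes any leftover rx_list entries
	 * via gro_normal_list() in napi_complete_done().
	 */
	if (work_done < budget)
		napi_complete_done(napi, work_done);

	return work_done;
}

In other words, the only behavioural change visible to such a driver is that GRO_NORMAL skbs reach the stack in batches through the listified receive path rather than one at a time.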
2 parents 5e6d9fc + 323ebb6 commit 61552d2

File tree: 5 files changed, +54 -11 lines

drivers/net/ethernet/sfc/falcon/rx.c

Lines changed: 1 addition & 4 deletions
@@ -424,7 +424,6 @@ ef4_rx_packet_gro(struct ef4_channel *channel, struct ef4_rx_buffer *rx_buf,
 		  unsigned int n_frags, u8 *eh)
 {
 	struct napi_struct *napi = &channel->napi_str;
-	gro_result_t gro_result;
 	struct ef4_nic *efx = channel->efx;
 	struct sk_buff *skb;
 
@@ -460,9 +459,7 @@ ef4_rx_packet_gro(struct ef4_channel *channel, struct ef4_rx_buffer *rx_buf,
 
 	skb_record_rx_queue(skb, channel->rx_queue.core_index);
 
-	gro_result = napi_gro_frags(napi);
-	if (gro_result != GRO_DROP)
-		channel->irq_mod_score += 2;
+	napi_gro_frags(napi);
 }
 
 /* Allocate and construct an SKB around page fragments */

drivers/net/ethernet/sfc/rx.c

Lines changed: 1 addition & 4 deletions
@@ -412,7 +412,6 @@ efx_rx_packet_gro(struct efx_channel *channel, struct efx_rx_buffer *rx_buf,
 		  unsigned int n_frags, u8 *eh)
 {
 	struct napi_struct *napi = &channel->napi_str;
-	gro_result_t gro_result;
 	struct efx_nic *efx = channel->efx;
 	struct sk_buff *skb;
 
@@ -449,9 +448,7 @@ efx_rx_packet_gro(struct efx_channel *channel, struct efx_rx_buffer *rx_buf,
 
 	skb_record_rx_queue(skb, channel->rx_queue.core_index);
 
-	gro_result = napi_gro_frags(napi);
-	if (gro_result != GRO_DROP)
-		channel->irq_mod_score += 2;
+	napi_gro_frags(napi);
 }
 
 /* Allocate and construct an SKB around page fragments */

include/linux/netdevice.h

Lines changed: 3 additions & 0 deletions
@@ -332,6 +332,8 @@ struct napi_struct {
 	struct net_device *dev;
 	struct gro_list gro_hash[GRO_HASH_BUCKETS];
 	struct sk_buff *skb;
+	struct list_head rx_list; /* Pending GRO_NORMAL skbs */
+	int rx_count; /* length of rx_list */
 	struct hrtimer timer;
 	struct list_head dev_list;
 	struct hlist_node napi_hash_node;
@@ -4239,6 +4241,7 @@ extern int dev_weight_rx_bias;
 extern int dev_weight_tx_bias;
 extern int dev_rx_weight;
 extern int dev_tx_weight;
+extern int gro_normal_batch;
 
 bool netdev_has_upper_dev(struct net_device *dev, struct net_device *upper_dev);
 struct net_device *netdev_upper_get_next_dev_rcu(struct net_device *dev,

net/core/dev.c

Lines changed: 41 additions & 3 deletions
@@ -3963,6 +3963,8 @@ int dev_weight_rx_bias __read_mostly = 1; /* bias for backlog weight */
 int dev_weight_tx_bias __read_mostly = 1; /* bias for output_queue quota */
 int dev_rx_weight __read_mostly = 64;
 int dev_tx_weight __read_mostly = 64;
+/* Maximum number of GRO_NORMAL skbs to batch up for list-RX */
+int gro_normal_batch __read_mostly = 8;
 
 /* Called with irq disabled */
 static inline void ____napi_schedule(struct softnet_data *sd,
@@ -5747,6 +5749,26 @@ struct sk_buff *napi_get_frags(struct napi_struct *napi)
 }
 EXPORT_SYMBOL(napi_get_frags);
 
+/* Pass the currently batched GRO_NORMAL SKBs up to the stack. */
+static void gro_normal_list(struct napi_struct *napi)
+{
+	if (!napi->rx_count)
+		return;
+	netif_receive_skb_list_internal(&napi->rx_list);
+	INIT_LIST_HEAD(&napi->rx_list);
+	napi->rx_count = 0;
+}
+
+/* Queue one GRO_NORMAL SKB up for list processing. If batch size exceeded,
+ * pass the whole batch up to the stack.
+ */
+static void gro_normal_one(struct napi_struct *napi, struct sk_buff *skb)
+{
+	list_add_tail(&skb->list, &napi->rx_list);
+	if (++napi->rx_count >= gro_normal_batch)
+		gro_normal_list(napi);
+}
+
 static gro_result_t napi_frags_finish(struct napi_struct *napi,
 				      struct sk_buff *skb,
 				      gro_result_t ret)
@@ -5756,8 +5778,8 @@ static gro_result_t napi_frags_finish(struct napi_struct *napi,
 	case GRO_HELD:
 		__skb_push(skb, ETH_HLEN);
 		skb->protocol = eth_type_trans(skb, skb->dev);
-		if (ret == GRO_NORMAL && netif_receive_skb_internal(skb))
-			ret = GRO_DROP;
+		if (ret == GRO_NORMAL)
+			gro_normal_one(napi, skb);
 		break;
 
 	case GRO_DROP:
@@ -6034,6 +6056,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 				 NAPIF_STATE_IN_BUSY_POLL)))
 		return false;
 
+	gro_normal_list(n);
+
 	if (n->gro_bitmask) {
 		unsigned long timeout = 0;
 
@@ -6119,10 +6143,19 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)
 	 * Ideally, a new ndo_busy_poll_stop() could avoid another round.
 	 */
 	rc = napi->poll(napi, BUSY_POLL_BUDGET);
+	/* We can't gro_normal_list() here, because napi->poll() might have
+	 * rearmed the napi (napi_complete_done()) in which case it could
+	 * already be running on another CPU.
+	 */
 	trace_napi_poll(napi, rc, BUSY_POLL_BUDGET);
 	netpoll_poll_unlock(have_poll_lock);
-	if (rc == BUSY_POLL_BUDGET)
+	if (rc == BUSY_POLL_BUDGET) {
+		/* As the whole budget was spent, we still own the napi so can
+		 * safely handle the rx_list.
+		 */
+		gro_normal_list(napi);
 		__napi_schedule(napi);
+	}
 	local_bh_enable();
 }
 
@@ -6167,6 +6200,7 @@ void napi_busy_loop(unsigned int napi_id,
 		}
 		work = napi_poll(napi, BUSY_POLL_BUDGET);
 		trace_napi_poll(napi, work, BUSY_POLL_BUDGET);
+		gro_normal_list(napi);
 count:
 		if (work > 0)
 			__NET_ADD_STATS(dev_net(napi->dev),
@@ -6272,6 +6306,8 @@ void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
 	napi->timer.function = napi_watchdog;
 	init_gro_hash(napi);
 	napi->skb = NULL;
+	INIT_LIST_HEAD(&napi->rx_list);
+	napi->rx_count = 0;
 	napi->poll = poll;
 	if (weight > NAPI_POLL_WEIGHT)
 		netdev_err_once(dev, "%s() called with weight %d\n", __func__,
@@ -6368,6 +6404,8 @@ static int napi_poll(struct napi_struct *n, struct list_head *repoll)
 		goto out_unlock;
 	}
 
+	gro_normal_list(n);
+
 	if (n->gro_bitmask) {
 		/* flush too old packets
 		 * If HZ < 1000, flush all packets.
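Taken together, the net/core/dev.c changes flush the pending rx_list at every point where the NAPI context is still known to be owned by the current CPU: in napi_complete_done() and napi_poll() at the end of a normal poll, after each cycle of napi_busy_loop(), and in busy_poll_stop() only when the poll consumed its whole budget (otherwise, as the added comment notes, the napi may already have been re-armed and be running on another CPU).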

net/core/sysctl_net_core.c

Lines changed: 8 additions & 0 deletions
@@ -567,6 +567,14 @@ static struct ctl_table net_core_table[] = {
 		.mode = 0644,
 		.proc_handler = proc_do_static_key,
 	},
+	{
+		.procname = "gro_normal_batch",
+		.data = &gro_normal_batch,
+		.maxlen = sizeof(unsigned int),
+		.mode = 0644,
+		.proc_handler = proc_dointvec_minmax,
+		.extra1 = SYSCTL_ONE,
+	},
 	{ }
 };
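The new knob appears as net.core.gro_normal_batch (i.e. /proc/sys/net/core/gro_normal_batch), defaulting to 8 as set in net/core/dev.c above. Because the handler is proc_dointvec_minmax with extra1 = SYSCTL_ONE, values below 1 are rejected; setting it to 1 should flush each GRO_NORMAL skb immediately, effectively disabling batching while still routing packets through the listified receive path.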
