
Commit 95d1815

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

1) Incorrect error check in nft_expr_inner_parse(), from Dan Carpenter.

2) Add DATA_SENT state to SCTP connection tracking helper, from
   Sriram Yagnaraman.

3) Consolidate nf_confirm for ipv4 and ipv6, from Florian Westphal.

4) Add bitmask support for ipset, from Vishwanath Pai.

5) Handle icmpv6 redirects as RELATED, from Florian Westphal.

6) Add WARN_ON_ONCE() to impossible case in flowtable datapath, from
   Li Qiong.

7) A large batch of IPVS updates to replace timer-based estimators by
   kthreads to scale up wrt. CPUs and workload (millions of estimators).

   Julian Anastasov says:

   This patchset implements stats estimation in kthread context.
   It replaces the code that runs on single CPU in timer context every
   2 seconds and causing latency splats as shown in reports [1], [2],
   [3]. The solution targets setups with thousands of IPVS services,
   destinations and multi-CPU boxes.

   Spread the estimation on multiple (configured) CPUs and multiple
   time slots (timer ticks) by using multiple chains organized under
   RCU rules. When stats are not needed, it is recommended to use
   run_estimation=0 as already implemented before this change.

   RCU Locking:

   - As stats are now RCU-locked, tot_stats, svc and dest which hold
     estimator structures are now always freed from RCU callback. This
     ensures RCU grace period after the ip_vs_stop_estimator() call.

   Kthread data:

   - every kthread works over its own data structure and all such
     structures are attached to array. For now we limit kthreads
     depending on the number of CPUs.

   - even while there can be a kthread structure, its task may not be
     running, eg. before first service is added or while the sysctl var
     is set to an empty cpulist or when run_estimation is set to 0 to
     disable the estimation.

   - the allocated kthread context may grow from 1 to 50 allocated
     structures for timer ticks which saves memory for setups with
     small number of estimators

   - a task and its structure may be released if all estimators are
     unlinked from its chains, leaving the slot in the array empty

   - every kthread data structure allows limited number of estimators.
     Kthread 0 is also used to initially calculate the max number of
     estimators to allow in every chain considering a sub-100
     microsecond cond_resched rate. This number can be from 1 to
     hundreds.

   - kthread 0 has an additional job of optimizing the adding of
     estimators: they are first added in temp list (est_temp_list) and
     later kthread 0 distributes them to other kthreads. The
     optimization is based on the fact that newly added estimator
     should be estimated after 2 seconds, so we have the time to
     offload the adding to chain from controlling process to kthread 0.

   - to add new estimators we use the last added kthread context
     (est_add_ktid). The new estimators are linked to the chains just
     before the estimated one, based on add_row. This ensures their
     estimation will start after 2 seconds. If estimators are added in
     bursts, common case if all services and dests are initially
     configured, we may spread the estimators to more chains and as
     result, reducing the initial delay below 2 seconds.

   Many thanks to Jiri Wiesner for his valuable comments and for
   spending a lot of time reviewing and testing the changes on
   different platforms with 48-256 CPUs and 1-8 NUMA nodes under
   different cpufreq governors.

   The new IPVS estimators do not use workqueue infrastructure because:

   - The estimation can take long time when using multiple IPVS rules
     (eg. millions estimator structures) and especially when box has
     multiple CPUs due to the for_each_possible_cpu usage that expects
     packets from any CPU. With est_nice sysctl we have more control
     how to prioritize the estimation kthreads compared to other
     processes/kthreads that have latency requirements (such as
     servers). As a benefit, we can see these kthreads in top and
     decide if we will need some further control to limit their CPU
     usage (max number of structure to estimate per kthread).

   - with kthreads we run code that is read-mostly, no write/lock
     operations to process the estimators in 2-second intervals.

   - work items are one-shot: as estimators are processed every
     2 seconds, they need to be re-added every time. This again loads
     the timers (add_timer) if we use delayed works, as there are no
     kthreads to do the timings.

[1] Report from Yunhong Jiang:
    https://lore.kernel.org/netdev/[email protected]/
[2] https://marc.info/?l=linux-virtual-server&m=159679809118027&w=2
[3] Report from Dust:
    https://archive.linuxvirtualserver.org/html/lvs-devel/2020-12/msg00000.html

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  ipvs: run_estimation should control the kthread tasks
  ipvs: add est_cpulist and est_nice sysctl vars
  ipvs: use kthreads for stats estimation
  ipvs: use u64_stats_t for the per-cpu counters
  ipvs: use common functions for stats allocation
  ipvs: add rcu protection to stats
  netfilter: flowtable: add a 'default' case to flowtable datapath
  netfilter: conntrack: set icmpv6 redirects as RELATED
  netfilter: ipset: Add support for new bitmask parameter
  netfilter: conntrack: merge ipv4+ipv6 confirm functions
  netfilter: conntrack: add sctp DATA_SENT state
  netfilter: nft_inner: fix IS_ERR() vs NULL check
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2 parents 15eb162 + 144361c commit 95d1815
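The cover letter's claim that a newly added estimator gets its first estimation after roughly 2 seconds follows from the row arithmetic alone: the 2-second period is split into IPVS_EST_NTICKS (50) rows of 40 ms each, and new estimators are linked to the row just before the one currently being estimated. The toy user-space program below only reproduces that arithmetic; it is not kernel code, and the concrete est_row value is an arbitrary example.

/* Toy model of the estimator rows described above -- NOT kernel code.
 * It only reproduces the arithmetic: a 2 s period split into 50 rows
 * (IPVS_EST_NTICKS) of 40 ms, with new estimators linked to the row
 * just before the one currently being estimated.
 */
#include <stdio.h>

#define EST_PERIOD_MS	2000	/* estimates refreshed every 2 seconds */
#define EST_NTICKS	50	/* rows (ticks) per period */
#define EST_TICK_MS	(EST_PERIOD_MS / EST_NTICKS)	/* 40 ms */

int main(void)
{
	int est_row = 17;	/* row being processed right now (example) */
	int add_row = (est_row + EST_NTICKS - 1) % EST_NTICKS;
	int delay_ticks = (add_row - est_row + EST_NTICKS) % EST_NTICKS;

	printf("tick length: %d ms\n", EST_TICK_MS);
	printf("new estimators go to row %d\n", add_row);
	printf("first estimation after ~%d ms (~2 s)\n",
	       delay_ticks * EST_TICK_MS);
	return 0;
}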

22 files changed: +1731 −360 lines


Documentation/networking/ipvs-sysctl.rst

Lines changed: 22 additions & 2 deletions

@@ -129,6 +129,26 @@ drop_packet - INTEGER
 	threshold. When the mode 3 is set, the always mode drop rate
 	is controlled by the /proc/sys/net/ipv4/vs/am_droprate.
 
+est_cpulist - CPULIST
+	Allowed CPUs for estimation kthreads
+
+	Syntax: standard cpulist format
+	empty list - stop kthread tasks and estimation
+	default - the system's housekeeping CPUs for kthreads
+
+	Example:
+	"all": all possible CPUs
+	"0-N": all possible CPUs, N denotes last CPU number
+	"0,1-N:1/2": first and all CPUs with odd number
+	"": empty list
+
+est_nice - INTEGER
+	default 0
+	Valid range: -20 (more favorable) .. 19 (less favorable)
+
+	Niceness value to use for the estimation kthreads (scheduling
+	priority)
+
 expire_nodest_conn - BOOLEAN
 	- 0 - disabled (default)
 	- not 0 - enabled
@@ -304,8 +324,8 @@ run_estimation - BOOLEAN
 	0 - disabled
 	not 0 - enabled (default)
 
-	If disabled, the estimation will be stop, and you can't see
-	any update on speed estimation data.
+	If disabled, the estimation will be suspended and kthread tasks
+	stopped.
 
 	You can always re-enable estimation by setting this value to 1.
 	But be careful, the first estimation after re-enable is not
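For completeness, a minimal user-space sketch of setting the two new sysctls programmatically; it assumes the /proc/sys/net/ipv4/vs/ location used by the existing IPVS sysctls (run_estimation above) and requires root. In practice a plain shell redirect does the same job.

/* Minimal sketch: pin the estimation kthreads to CPUs 0-3 and lower
 * their scheduling priority slightly via the sysctl files documented
 * above. Paths assume the standard /proc/sys/net/ipv4/vs/ location.
 */
#include <stdio.h>

static int write_sysctl(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fputs(val, f);
	fclose(f);
	return 0;
}

int main(void)
{
	write_sysctl("/proc/sys/net/ipv4/vs/est_cpulist", "0-3");
	write_sysctl("/proc/sys/net/ipv4/vs/est_nice", "5");
	return 0;
}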

include/linux/netfilter/ipset/ip_set.h

Lines changed: 10 additions & 0 deletions

@@ -515,6 +515,16 @@ ip_set_init_skbinfo(struct ip_set_skbinfo *skbinfo,
 	*skbinfo = ext->skbinfo;
 }
 
+static inline void
+nf_inet_addr_mask_inplace(union nf_inet_addr *a1,
+			  const union nf_inet_addr *mask)
+{
+	a1->all[0] &= mask->all[0];
+	a1->all[1] &= mask->all[1];
+	a1->all[2] &= mask->all[2];
+	a1->all[3] &= mask->all[3];
+}
+
 #define IP_SET_INIT_KEXT(skb, opt, set)			\
 	{ .bytes = (skb)->len, .packets = 1, .target = true,\
 	  .timeout = ip_set_adt_opt_timeout(opt, set) }
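The new helper simply ANDs all four 32-bit words of the 128-bit address, covering both IPv4 (word 0) and IPv6. A small user-space sketch of the same arithmetic, using a simplified stand-in for union nf_inet_addr:

/* User-space sketch of the in-place masking done by
 * nf_inet_addr_mask_inplace() above; 'inet_addr' below is a simplified
 * stand-in for union nf_inet_addr (four 32-bit words = 128 bits).
 */
#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

union inet_addr {
	uint32_t all[4];	/* word 0 for IPv4, all words for IPv6 */
};

static void addr_mask_inplace(union inet_addr *a, const union inet_addr *mask)
{
	a->all[0] &= mask->all[0];
	a->all[1] &= mask->all[1];
	a->all[2] &= mask->all[2];
	a->all[3] &= mask->all[3];
}

int main(void)
{
	union inet_addr a = { .all = { htonl(0xc0a80164) } };	/* 192.168.1.100 */
	union inet_addr m = { .all = { htonl(0xffffff00) } };	/* 255.255.255.0 */
	char buf[INET_ADDRSTRLEN];

	addr_mask_inplace(&a, &m);
	printf("%s\n", inet_ntop(AF_INET, &a.all[0], buf, sizeof(buf)));
	/* prints 192.168.1.0 */
	return 0;
}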

include/net/ip_vs.h

Lines changed: 160 additions & 11 deletions

@@ -29,6 +29,7 @@
 #include <net/netfilter/nf_conntrack.h>
 #endif
 #include <net/net_namespace.h>		/* Netw namespace */
+#include <linux/sched/isolation.h>
 
 #define IP_VS_HDR_INVERSE	1
 #define IP_VS_HDR_ICMP		2
@@ -42,6 +43,8 @@ static inline struct netns_ipvs *net_ipvs(struct net* net)
 /* Connections' size value needed by ip_vs_ctl.c */
 extern int ip_vs_conn_tab_size;
 
+extern struct mutex __ip_vs_mutex;
+
 struct ip_vs_iphdr {
 	int hdr_flags;	/* ipvs flags */
 	__u32 off;	/* Where IP or IPv4 header starts */
@@ -351,21 +354,24 @@ struct ip_vs_seq {
 
 /* counters per cpu */
 struct ip_vs_counters {
-	__u64		conns;		/* connections scheduled */
-	__u64		inpkts;		/* incoming packets */
-	__u64		outpkts;	/* outgoing packets */
-	__u64		inbytes;	/* incoming bytes */
-	__u64		outbytes;	/* outgoing bytes */
+	u64_stats_t	conns;		/* connections scheduled */
+	u64_stats_t	inpkts;		/* incoming packets */
+	u64_stats_t	outpkts;	/* outgoing packets */
+	u64_stats_t	inbytes;	/* incoming bytes */
+	u64_stats_t	outbytes;	/* outgoing bytes */
 };
 /* Stats per cpu */
 struct ip_vs_cpu_stats {
 	struct ip_vs_counters	cnt;
 	struct u64_stats_sync	syncp;
 };
 
+/* Default nice for estimator kthreads */
+#define IPVS_EST_NICE		0
+
 /* IPVS statistics objects */
 struct ip_vs_estimator {
-	struct list_head	list;
+	struct hlist_node	list;
 
 	u64			last_inbytes;
 	u64			last_outbytes;
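Switching the per-cpu counters from plain __u64 to u64_stats_t ties them to the u64_stats_sync seqcount already present in struct ip_vs_cpu_stats, so 32-bit readers can obtain tear-free 64-bit values. The kernel-context sketch below shows the generic u64_stats writer/reader pattern for these fields; it is illustrative only and is not code from this merge.

#include <linux/percpu.h>
#include <linux/u64_stats_sync.h>
#include <net/ip_vs.h>

/* Writer side (per-CPU, e.g. from packet-processing context): */
static void example_count_conn(struct ip_vs_cpu_stats __percpu *cpustats)
{
	struct ip_vs_cpu_stats *s = this_cpu_ptr(cpustats);

	u64_stats_update_begin(&s->syncp);
	u64_stats_inc(&s->cnt.conns);
	u64_stats_update_end(&s->syncp);
}

/* Reader side (sampling one CPU's counter consistently): */
static u64 example_read_conns(const struct ip_vs_cpu_stats *s)
{
	unsigned int start;
	u64 conns;

	do {
		start = u64_stats_fetch_begin(&s->syncp);
		conns = u64_stats_read(&s->cnt.conns);
	} while (u64_stats_fetch_retry(&s->syncp, start));

	return conns;
}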
@@ -378,6 +384,10 @@ struct ip_vs_estimator {
 	u64			outpps;
 	u64			inbps;
 	u64			outbps;
+
+	s32			ktid:16,	/* kthread ID, -1=temp list */
+				ktrow:8,	/* row/tick ID for kthread */
+				ktcid:8;	/* chain ID for kthread tick */
 };
 
 /*
@@ -405,6 +415,76 @@ struct ip_vs_stats {
 	struct ip_vs_kstats	kstats0;	/* reset values */
 };
 
+struct ip_vs_stats_rcu {
+	struct ip_vs_stats	s;
+	struct rcu_head		rcu_head;
+};
+
+int ip_vs_stats_init_alloc(struct ip_vs_stats *s);
+struct ip_vs_stats *ip_vs_stats_alloc(void);
+void ip_vs_stats_release(struct ip_vs_stats *stats);
+void ip_vs_stats_free(struct ip_vs_stats *stats);
+
+/* Process estimators in multiple timer ticks (20/50/100, see ktrow) */
+#define IPVS_EST_NTICKS		50
+/* Estimation uses a 2-second period containing ticks (in jiffies) */
+#define IPVS_EST_TICK		((2 * HZ) / IPVS_EST_NTICKS)
+
+/* Limit of CPU load per kthread (8 for 12.5%), ratio of CPU capacity (1/C).
+ * Value of 4 and above ensures kthreads will take work without exceeding
+ * the CPU capacity under different circumstances.
+ */
+#define IPVS_EST_LOAD_DIVISOR	8
+
+/* Kthreads should not have work that exceeds the CPU load above 50% */
+#define IPVS_EST_CPU_KTHREADS	(IPVS_EST_LOAD_DIVISOR / 2)
+
+/* Desired number of chains per timer tick (chain load factor in 100us units),
+ * 48=4.8ms of 40ms tick (12% CPU usage):
+ * 2 sec * 1000 ms in sec * 10 (100us in ms) / 8 (12.5%) / 50
+ */
+#define IPVS_EST_CHAIN_FACTOR	\
+	ALIGN_DOWN(2 * 1000 * 10 / IPVS_EST_LOAD_DIVISOR / IPVS_EST_NTICKS, 8)
+
+/* Compiled number of chains per tick
+ * The defines should match cond_resched_rcu
+ */
+#if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
+#define IPVS_EST_TICK_CHAINS	IPVS_EST_CHAIN_FACTOR
+#else
+#define IPVS_EST_TICK_CHAINS	1
+#endif
+
+#if IPVS_EST_NTICKS > 127
+#error Too many timer ticks for ktrow
+#endif
+
+/* Multiple chains processed in same tick */
+struct ip_vs_est_tick_data {
+	struct hlist_head	chains[IPVS_EST_TICK_CHAINS];
+	DECLARE_BITMAP(present, IPVS_EST_TICK_CHAINS);
+	DECLARE_BITMAP(full, IPVS_EST_TICK_CHAINS);
+	int			chain_len[IPVS_EST_TICK_CHAINS];
+};
+
+/* Context for estimation kthread */
+struct ip_vs_est_kt_data {
+	struct netns_ipvs	*ipvs;
+	struct task_struct	*task;		/* task if running */
+	struct ip_vs_est_tick_data __rcu *ticks[IPVS_EST_NTICKS];
+	DECLARE_BITMAP(avail, IPVS_EST_NTICKS);	/* tick has space for ests */
+	unsigned long		est_timer;	/* estimation timer (jiffies) */
+	struct ip_vs_stats	*calc_stats;	/* Used for calculation */
+	int			tick_len[IPVS_EST_NTICKS];	/* est count */
+	int			id;		/* ktid per netns */
+	int			chain_max;	/* max ests per tick chain */
+	int			tick_max;	/* max ests per tick */
+	int			est_count;	/* attached ests to kthread */
+	int			est_max_count;	/* max ests per kthread */
+	int			add_row;	/* row for new ests */
+	int			est_row;	/* estimated row */
+};
+
 struct dst_entry;
 struct iphdr;
 struct ip_vs_conn;
@@ -688,6 +768,7 @@ struct ip_vs_dest {
 	union nf_inet_addr	vaddr;		/* virtual IP address */
 	__u32			vfwmark;	/* firewall mark of service */
 
+	struct rcu_head		rcu_head;
 	struct list_head	t_list;		/* in dest_trash */
 	unsigned int		in_rs_table:1;	/* we are in rs_table */
 };
@@ -869,7 +950,7 @@ struct netns_ipvs {
 	atomic_t		conn_count;	/* connection counter */
 
 	/* ip_vs_ctl */
-	struct ip_vs_stats	tot_stats;	/* Statistics & est. */
+	struct ip_vs_stats_rcu	*tot_stats;	/* Statistics & est. */
 
 	int			num_services;	/* no of virtual services */
 	int			num_services6;	/* IPv6 virtual services */
@@ -932,6 +1013,12 @@ struct netns_ipvs {
 	int			sysctl_schedule_icmp;
 	int			sysctl_ignore_tunneled;
 	int			sysctl_run_estimation;
+#ifdef CONFIG_SYSCTL
+	cpumask_var_t		sysctl_est_cpulist;	/* kthread cpumask */
+	int			est_cpulist_valid;	/* cpulist set */
+	int			sysctl_est_nice;	/* kthread nice */
+	int			est_stopped;		/* stop tasks */
+#endif
 
 	/* ip_vs_lblc */
 	int			sysctl_lblc_expiration;
@@ -942,9 +1029,17 @@ struct netns_ipvs {
 	struct ctl_table_header	*lblcr_ctl_header;
 	struct ctl_table	*lblcr_ctl_table;
 	/* ip_vs_est */
-	struct list_head	est_list;	/* estimator list */
-	spinlock_t		est_lock;
-	struct timer_list	est_timer;	/* Estimation timer */
+	struct delayed_work	est_reload_work;/* Reload kthread tasks */
+	struct mutex		est_mutex;	/* protect kthread tasks */
+	struct hlist_head	est_temp_list;	/* Ests during calc phase */
+	struct ip_vs_est_kt_data **est_kt_arr;	/* Array of kthread data ptrs */
+	unsigned long		est_max_threads;/* Hard limit of kthreads */
+	int			est_calc_phase;	/* Calculation phase */
+	int			est_chain_max;	/* Calculated chain_max */
+	int			est_kt_count;	/* Allocated ptrs */
+	int			est_add_ktid;	/* ktid where to add ests */
+	atomic_t		est_genid;	/* kthreads reload genid */
+	atomic_t		est_genid_done;	/* applied genid */
 	/* ip_vs_sync */
 	spinlock_t		sync_lock;
 	struct ipvs_master_sync_state *ms;
@@ -1077,6 +1172,19 @@ static inline int sysctl_run_estimation(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_run_estimation;
 }
 
+static inline const struct cpumask *sysctl_est_cpulist(struct netns_ipvs *ipvs)
+{
+	if (ipvs->est_cpulist_valid)
+		return ipvs->sysctl_est_cpulist;
+	else
+		return housekeeping_cpumask(HK_TYPE_KTHREAD);
+}
+
+static inline int sysctl_est_nice(struct netns_ipvs *ipvs)
+{
+	return ipvs->sysctl_est_nice;
+}
+
 #else
 
 static inline int sysctl_sync_threshold(struct netns_ipvs *ipvs)
@@ -1174,6 +1282,16 @@ static inline int sysctl_run_estimation(struct netns_ipvs *ipvs)
 	return 1;
 }
 
+static inline const struct cpumask *sysctl_est_cpulist(struct netns_ipvs *ipvs)
+{
+	return housekeeping_cpumask(HK_TYPE_KTHREAD);
+}
+
+static inline int sysctl_est_nice(struct netns_ipvs *ipvs)
+{
+	return IPVS_EST_NICE;
+}
+
 #endif
 
 /* IPVS core functions
@@ -1475,10 +1593,41 @@ int stop_sync_thread(struct netns_ipvs *ipvs, int state);
 void ip_vs_sync_conn(struct netns_ipvs *ipvs, struct ip_vs_conn *cp, int pkts);
 
 /* IPVS rate estimator prototypes (from ip_vs_est.c) */
-void ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
+int ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
 void ip_vs_stop_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
 void ip_vs_zero_estimator(struct ip_vs_stats *stats);
 void ip_vs_read_estimator(struct ip_vs_kstats *dst, struct ip_vs_stats *stats);
+void ip_vs_est_reload_start(struct netns_ipvs *ipvs);
+int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
+			    struct ip_vs_est_kt_data *kd);
+void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd);
+
+static inline void ip_vs_est_stopped_recalc(struct netns_ipvs *ipvs)
+{
+#ifdef CONFIG_SYSCTL
+	/* Stop tasks while cpulist is empty or if disabled with flag */
+	ipvs->est_stopped = !sysctl_run_estimation(ipvs) ||
+			    (ipvs->est_cpulist_valid &&
+			     cpumask_empty(sysctl_est_cpulist(ipvs)));
+#endif
+}
+
+static inline bool ip_vs_est_stopped(struct netns_ipvs *ipvs)
+{
+#ifdef CONFIG_SYSCTL
	return ipvs->est_stopped;
+#else
+	return false;
+#endif
+}
+
+static inline int ip_vs_est_max_threads(struct netns_ipvs *ipvs)
+{
+	unsigned int limit = IPVS_EST_CPU_KTHREADS *
+			     cpumask_weight(sysctl_est_cpulist(ipvs));
+
+	return max(1U, limit);
+}
 
 /* Various IPVS packet transmitters (from ip_vs_xmit.c) */
 int ip_vs_null_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
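The budget comments in the header above can be re-derived with a few lines of arithmetic: 50 ticks per 2-second period give 40 ms ticks, the chain factor works out to 48 chains of roughly 100 µs each (about 4.8 ms, or 12%, of a tick), and the kthread limit is 4 kthreads per allowed CPU. A user-space re-computation, assuming the kernel's ALIGN_DOWN rounds down to a multiple of the alignment and using 16 CPUs purely as an example:

/* Recompute the estimator budget constants from the comments in
 * include/net/ip_vs.h above (user-space arithmetic only; ALIGN_DOWN is
 * simplified and the 16-CPU figure is just an example).
 */
#include <stdio.h>

#define ALIGN_DOWN(x, a)	((x) / (a) * (a))

#define IPVS_EST_NTICKS		50
#define IPVS_EST_LOAD_DIVISOR	8
#define IPVS_EST_CPU_KTHREADS	(IPVS_EST_LOAD_DIVISOR / 2)
#define IPVS_EST_CHAIN_FACTOR \
	ALIGN_DOWN(2 * 1000 * 10 / IPVS_EST_LOAD_DIVISOR / IPVS_EST_NTICKS, 8)

int main(void)
{
	int tick_ms = 2000 / IPVS_EST_NTICKS;		/* 40 ms per tick */
	int budget_us = IPVS_EST_CHAIN_FACTOR * 100;	/* ~100 us per chain */
	int example_cpus = 16;

	printf("tick length:       %d ms\n", tick_ms);
	printf("chain factor:      %d chains/tick\n", IPVS_EST_CHAIN_FACTOR);
	printf("work per tick:     %d.%d ms (~%d%% of a tick)\n",
	       budget_us / 1000, (budget_us % 1000) / 100,
	       budget_us / 10 / tick_ms);
	printf("kthreads, %d CPUs: %d\n", example_cpus,
	       IPVS_EST_CPU_KTHREADS * example_cpus);
	return 0;
}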

include/net/netfilter/nf_conntrack_core.h

Lines changed: 1 addition & 2 deletions

@@ -71,8 +71,7 @@ static inline int nf_conntrack_confirm(struct sk_buff *skb)
 	return ret;
 }
 
-unsigned int nf_confirm(struct sk_buff *skb, unsigned int protoff,
-			struct nf_conn *ct, enum ip_conntrack_info ctinfo);
+unsigned int nf_confirm(void *priv, struct sk_buff *skb, const struct nf_hook_state *state);
 
 void print_tuple(struct seq_file *s, const struct nf_conntrack_tuple *tuple,
 		 const struct nf_conntrack_l4proto *proto);
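With the conntrack-specific prototype gone, nf_confirm() now has the shape of the generic nf_hookfn type from <linux/netfilter.h>, which is what lets the "merge ipv4+ipv6 confirm functions" patch in this series register one confirm hook for both address families. A kernel-context sketch of that idea (the hook numbers and priorities below are illustrative, not taken from the patch):

#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/netfilter_ipv6.h>
#include <net/netfilter/nf_conntrack_core.h>

/* nf_confirm() now matches nf_hookfn:
 *   unsigned int (*)(void *priv, struct sk_buff *skb,
 *                    const struct nf_hook_state *state);
 * so a single ops table can cover both address families.
 */
static const struct nf_hook_ops example_confirm_ops[] = {
	{
		.hook		= nf_confirm,
		.pf		= NFPROTO_IPV4,
		.hooknum	= NF_INET_POST_ROUTING,
		.priority	= NF_IP_PRI_CONNTRACK_CONFIRM,
	},
	{
		.hook		= nf_confirm,
		.pf		= NFPROTO_IPV6,
		.hooknum	= NF_INET_POST_ROUTING,
		.priority	= NF_IP6_PRI_LAST,
	},
};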

include/uapi/linux/netfilter/ipset/ip_set.h

Lines changed: 2 additions & 0 deletions

@@ -85,6 +85,7 @@ enum {
 	IPSET_ATTR_CADT_LINENO = IPSET_ATTR_LINENO,	/* 9 */
 	IPSET_ATTR_MARK,	/* 10 */
 	IPSET_ATTR_MARKMASK,	/* 11 */
+	IPSET_ATTR_BITMASK,	/* 12 */
 	/* Reserve empty slots */
 	IPSET_ATTR_CADT_MAX = 16,
 	/* Create-only specific attributes */
@@ -153,6 +154,7 @@ enum ipset_errno {
 	IPSET_ERR_COMMENT,
 	IPSET_ERR_INVALID_MARKMASK,
 	IPSET_ERR_SKBINFO,
+	IPSET_ERR_BITMASK_NETMASK_EXCL,
 
 	/* Type specific error codes */
 	IPSET_ERR_TYPE_SPECIFIC = 4352,

include/uapi/linux/netfilter/nf_conntrack_sctp.h

Lines changed: 1 addition & 0 deletions

@@ -16,6 +16,7 @@ enum sctp_conntrack {
 	SCTP_CONNTRACK_SHUTDOWN_ACK_SENT,
 	SCTP_CONNTRACK_HEARTBEAT_SENT,
 	SCTP_CONNTRACK_HEARTBEAT_ACKED,
+	SCTP_CONNTRACK_DATA_SENT,
 	SCTP_CONNTRACK_MAX
 };
 

include/uapi/linux/netfilter/nfnetlink_cttimeout.h

Lines changed: 1 addition & 0 deletions

@@ -95,6 +95,7 @@ enum ctattr_timeout_sctp {
 	CTA_TIMEOUT_SCTP_SHUTDOWN_ACK_SENT,
 	CTA_TIMEOUT_SCTP_HEARTBEAT_SENT,
 	CTA_TIMEOUT_SCTP_HEARTBEAT_ACKED,
+	CTA_TIMEOUT_SCTP_DATA_SENT,
 	__CTA_TIMEOUT_SCTP_MAX
 };
 #define CTA_TIMEOUT_SCTP_MAX	(__CTA_TIMEOUT_SCTP_MAX - 1)
