Commit 42df6e1

l1k authored and ummakynes committed
netfilter: Introduce egress hook
Support classifying packets with netfilter on egress to satisfy user
requirements such as:
* outbound security policies for containers (Laura)
* filtering and mangling intra-node Direct Server Return (DSR) traffic
  on a load balancer (Laura)
* filtering locally generated traffic coming in through AF_PACKET,
  such as local ARP traffic generated for clustering purposes or DHCP
  (Laura; the AF_PACKET plumbing is contained in a follow-up commit)
* L2 filtering from ingress and egress for AVB (Audio Video Bridging)
  and gPTP with nftables (Pablo)
* in the future: in-kernel NAT64/NAT46 (Pablo)

The egress hook introduced herein complements the ingress hook added by
commit e687ad6 ("netfilter: add netfilter ingress hook after
handle_ing() under unique static key"). A patch for nftables to hook up
egress rules from user space has been submitted separately, so users may
immediately take advantage of the feature.

Alternatively or in addition to netfilter, packets can be classified
with traffic control (tc). On ingress, packets are classified first by
tc, then by netfilter. On egress, the order is reversed for symmetry.
Conceptually, tc and netfilter can be thought of as layers, with
netfilter layered above tc.

Traffic control is capable of redirecting packets to another interface
(man 8 tc-mirred). E.g., an ingress packet may be redirected from the
host namespace to a container via a veth connection:

  tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container)

In this case, netfilter egress classifying is not performed when leaving
the host namespace! That's because the packet is still on the tc layer.
If tc redirects the packet to a physical interface in the host namespace
such that it leaves the system, the packet is never subjected to
netfilter egress classifying. That is only logical since it hasn't
passed through netfilter ingress classifying either.

Packets can alternatively be redirected at the netfilter layer using
nft fwd. Such a packet *is* subjected to netfilter egress classifying
since it has reached the netfilter layer.

Internally, the skb->nf_skip_egress flag controls whether netfilter is
invoked on egress by __dev_queue_xmit(). Because __dev_queue_xmit() may
be called recursively by tunnel drivers such as vxlan, the flag is
reverted to false after sch_handle_egress(). This ensures that
netfilter is applied both on the overlay and underlying network.

Interaction between tc and netfilter is possible by setting and querying
skb->mark.

If netfilter egress classifying is not enabled on any interface, it is
patched out of the data path by way of a static_key and doesn't make a
performance difference that is discernible from noise:

Before:             1537 1538 1538 1537 1538 1537 Mb/sec
After:              1536 1534 1539 1539 1539 1540 Mb/sec
Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec
After  + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec
Before + tc drop:   1620 1619 1619 1619 1620 1620 Mb/sec
After  + tc drop:   1616 1624 1625 1624 1622 1619 Mb/sec

When netfilter egress classifying is enabled on at least one interface,
a minimal performance penalty is incurred for every egress packet, even
if the interface it's transmitted over doesn't have any netfilter egress
rules configured. That is caused by checking dev->nf_hooks_egress
against NULL.

Measurements were performed on a Core i7-3615QM. Commands to reproduce:

  ip link add dev foo type dummy
  ip link set dev foo up
  modprobe pktgen
  echo "add_device foo" > /proc/net/pktgen/kpktgend_3
  samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1

Accept all traffic with tc:

  tc qdisc add dev foo clsact
  tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'

Drop all traffic with tc:

  tc qdisc add dev foo clsact
  tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,'

Apply this patch when measuring packet drops to avoid errors in dmesg:
https://lore.kernel.org/netdev/[email protected]/

Signed-off-by: Lukas Wunner <[email protected]>
Cc: Laura García Liébana <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Thomas Graf <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
1 parent 17d2078 commit 42df6e1
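
The per-packet cost described in the commit message (nothing while no interface uses the hook, a NULL check once any interface does) can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `egress_users`, `struct model_dev`, `model_register_egress()`, `model_unregister_egress()` and `model_egress_cost()` are hypothetical names, a plain counter stands in for the kernel's static key, and tc (which can enable the same egress path) is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a net_device; hooks != NULL means rules exist. */
struct model_dev {
	const void *hooks;	/* models dev->nf_hooks_egress */
};

/* Models the static key: number of devices with egress hooks attached. */
static int egress_users;

static void model_register_egress(struct model_dev *dev, const void *hooks)
{
	assert(dev && hooks);
	if (!dev->hooks)
		egress_users++;
	dev->hooks = hooks;
}

static void model_unregister_egress(struct model_dev *dev)
{
	assert(dev);
	if (dev->hooks)
		egress_users--;
	dev->hooks = NULL;
}

/*
 * Returns how much work a transmitted packet pays for on this device:
 *   0 - no device uses the hook, the egress code is skipped entirely
 *   1 - some device uses the hook, this one does not (NULL check only)
 *   2 - this device has hooks, so they are run
 */
static int model_egress_cost(const struct model_dev *dev)
{
	assert(dev);
	if (egress_users == 0)
		return 0;
	if (dev->hooks == NULL)
		return 1;
	return 2;
}
```

In the model, attaching rules to one device raises the cost on every other device from 0 to 1, which corresponds to the "minimal performance penalty" the message mentions.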

10 files changed: 168 additions and 10 deletions

drivers/net/ifb.c

Lines changed: 3 additions & 0 deletions
@@ -31,6 +31,7 @@
 #include <linux/init.h>
 #include <linux/interrupt.h>
 #include <linux/moduleparam.h>
+#include <linux/netfilter_netdev.h>
 #include <net/pkt_sched.h>
 #include <net/net_namespace.h>
 
@@ -75,8 +76,10 @@ static void ifb_ri_tasklet(struct tasklet_struct *t)
 	}
 
 	while ((skb = __skb_dequeue(&txp->tq)) != NULL) {
+		/* Skip tc and netfilter to prevent redirection loop. */
 		skb->redirected = 0;
 		skb->tc_skip_classify = 1;
+		nf_skip_egress(skb, true);
 
 		u64_stats_update_begin(&txp->tsync);
 		txp->tx_packets++;

include/linux/netdevice.h

Lines changed: 4 additions & 0 deletions
@@ -1861,6 +1861,7 @@ enum netdev_ml_priv_type {
  *	@xps_maps: XXX: need comments on this one
  *	@miniq_egress: clsact qdisc specific data for
  *		egress processing
+ *	@nf_hooks_egress: netfilter hooks executed for egress packets
  *	@qdisc_hash: qdisc hash table
  *	@watchdog_timeo: Represents the timeout that is used by
  *		the watchdog (see dev_watchdog())
@@ -2161,6 +2162,9 @@ struct net_device {
 #ifdef CONFIG_NET_CLS_ACT
 	struct mini_Qdisc __rcu *miniq_egress;
 #endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	struct nf_hook_entries __rcu *nf_hooks_egress;
+#endif
 
 #ifdef CONFIG_NET_SCHED
 	DECLARE_HASHTABLE (qdisc_hash, 4);

include/linux/netfilter_netdev.h

Lines changed: 86 additions & 0 deletions
@@ -50,11 +50,97 @@ static inline int nf_hook_ingress(struct sk_buff *skb)
 }
 #endif /* CONFIG_NETFILTER_INGRESS */
 
+#ifdef CONFIG_NETFILTER_EGRESS
+static inline bool nf_hook_egress_active(void)
+{
+#ifdef CONFIG_JUMP_LABEL
+	if (!static_key_false(&nf_hooks_needed[NFPROTO_NETDEV][NF_NETDEV_EGRESS]))
+		return false;
+#endif
+	return true;
+}
+
+/**
+ * nf_hook_egress - classify packets before transmission
+ * @skb: packet to be classified
+ * @rc: result code which shall be returned by __dev_queue_xmit() on failure
+ * @dev: netdev whose egress hooks shall be applied to @skb
+ *
+ * Returns @skb on success or %NULL if the packet was consumed or filtered.
+ * Caller must hold rcu_read_lock.
+ *
+ * On ingress, packets are classified first by tc, then by netfilter.
+ * On egress, the order is reversed for symmetry. Conceptually, tc and
+ * netfilter can be thought of as layers, with netfilter layered above tc:
+ * When tc redirects a packet to another interface, netfilter is not applied
+ * because the packet is on the tc layer.
+ *
+ * The nf_skip_egress flag controls whether netfilter is applied on egress.
+ * It is updated by __netif_receive_skb_core() and __dev_queue_xmit() when the
+ * packet passes through tc and netfilter. Because __dev_queue_xmit() may be
+ * called recursively by tunnel drivers such as vxlan, the flag is reverted to
+ * false after sch_handle_egress(). This ensures that netfilter is applied
+ * both on the overlay and underlying network.
+ */
+static inline struct sk_buff *nf_hook_egress(struct sk_buff *skb, int *rc,
+					     struct net_device *dev)
+{
+	struct nf_hook_entries *e;
+	struct nf_hook_state state;
+	int ret;
+
+#ifdef CONFIG_NETFILTER_SKIP_EGRESS
+	if (skb->nf_skip_egress)
+		return skb;
+#endif
+
+	e = rcu_dereference(dev->nf_hooks_egress);
+	if (!e)
+		return skb;
+
+	nf_hook_state_init(&state, NF_NETDEV_EGRESS,
+			   NFPROTO_NETDEV, dev, NULL, NULL,
+			   dev_net(dev), NULL);
+	ret = nf_hook_slow(skb, &state, e, 0);
+
+	if (ret == 1) {
+		return skb;
+	} else if (ret < 0) {
+		*rc = NET_XMIT_DROP;
+		return NULL;
+	} else { /* ret == 0 */
+		*rc = NET_XMIT_SUCCESS;
+		return NULL;
+	}
+}
+#else /* CONFIG_NETFILTER_EGRESS */
+static inline bool nf_hook_egress_active(void)
+{
+	return false;
+}
+
+static inline struct sk_buff *nf_hook_egress(struct sk_buff *skb, int *rc,
+					     struct net_device *dev)
+{
+	return skb;
+}
+#endif /* CONFIG_NETFILTER_EGRESS */
+
+static inline void nf_skip_egress(struct sk_buff *skb, bool skip)
+{
+#ifdef CONFIG_NETFILTER_SKIP_EGRESS
+	skb->nf_skip_egress = skip;
+#endif
+}
+
 static inline void nf_hook_netdev_init(struct net_device *dev)
 {
 #ifdef CONFIG_NETFILTER_INGRESS
 	RCU_INIT_POINTER(dev->nf_hooks_ingress, NULL);
 #endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	RCU_INIT_POINTER(dev->nf_hooks_egress, NULL);
+#endif
 }
 
 #endif /* _NETFILTER_NETDEV_H_ */
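
The tail of nf_hook_egress() maps the hook result onto what the transmit path reports. A minimal user-space sketch of that mapping follows. `model_egress_verdict()` and the `MODEL_NET_XMIT_*` macros are hypothetical names for illustration; the two macro values are assumed to match the kernel's NET_XMIT_SUCCESS and NET_XMIT_DROP, and the meaning of the three result classes is taken from the branches in the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Redefined for the sketch; assumed to mirror the kernel's NET_XMIT_* codes. */
#define MODEL_NET_XMIT_SUCCESS	0x00
#define MODEL_NET_XMIT_DROP	0x01

/*
 * hook_ret models the value returned by the hook traversal:
 *    1 - every hook accepted, the caller continues with the packet
 *    0 - the packet was consumed, nothing is left to transmit
 *   <0 - the packet was filtered
 * Returns true if the caller still owns the packet. Otherwise *rc is set
 * to the code the transmit path would hand back to its caller.
 */
static bool model_egress_verdict(int hook_ret, int *rc)
{
	assert(rc);
	if (hook_ret == 1)
		return true;
	*rc = hook_ret < 0 ? MODEL_NET_XMIT_DROP : MODEL_NET_XMIT_SUCCESS;
	return false;
}
```

Note that a consumed packet is reported as a success: from the sender's point of view nothing went wrong, even though the packet did not reach the driver on this path.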

include/linux/skbuff.h

Lines changed: 4 additions & 0 deletions
@@ -652,6 +652,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@tc_at_ingress: used within tc_classify to distinguish in/egress
  *	@redirected: packet was redirected by packet classifier
  *	@from_ingress: packet was redirected from the ingress path
+ *	@nf_skip_egress: packet shall skip nf egress - see netfilter_netdev.h
  *	@peeked: this packet has been seen already, so stats have been
  *		done for it, don't do them again
  *	@nf_trace: netfilter packet trace flag
@@ -868,6 +869,9 @@ struct sk_buff {
 #ifdef CONFIG_NET_REDIRECT
 	__u8 from_ingress:1;
 #endif
+#ifdef CONFIG_NETFILTER_SKIP_EGRESS
+	__u8 nf_skip_egress:1;
+#endif
 #ifdef CONFIG_TLS_DEVICE
 	__u8 decrypted:1;
 #endif

include/uapi/linux/netfilter.h

Lines changed: 1 addition & 0 deletions
@@ -51,6 +51,7 @@ enum nf_inet_hooks {
 
 enum nf_dev_hooks {
 	NF_NETDEV_INGRESS,
+	NF_NETDEV_EGRESS,
 	NF_NETDEV_NUMHOOKS
 };

net/core/dev.c

Lines changed: 13 additions & 2 deletions
@@ -3920,6 +3920,7 @@ EXPORT_SYMBOL(dev_loopback_xmit);
 static struct sk_buff *
 sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 {
+#ifdef CONFIG_NET_CLS_ACT
 	struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress);
 	struct tcf_result cl_res;
 
@@ -3955,6 +3956,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 	default:
 		break;
 	}
+#endif /* CONFIG_NET_CLS_ACT */
 
 	return skb;
 }
@@ -4148,13 +4150,20 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 	qdisc_pkt_len_init(skb);
 #ifdef CONFIG_NET_CLS_ACT
 	skb->tc_at_ingress = 0;
-# ifdef CONFIG_NET_EGRESS
+#endif
+#ifdef CONFIG_NET_EGRESS
 	if (static_branch_unlikely(&egress_needed_key)) {
+		if (nf_hook_egress_active()) {
+			skb = nf_hook_egress(skb, &rc, dev);
+			if (!skb)
+				goto out;
+		}
+		nf_skip_egress(skb, true);
 		skb = sch_handle_egress(skb, &rc, dev);
 		if (!skb)
 			goto out;
+		nf_skip_egress(skb, false);
 	}
-# endif
 #endif
 	/* If device/qdisc don't need skb->dst, release it right now while
 	 * its hot in this cpu cache.
@@ -5296,13 +5305,15 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
 	if (static_branch_unlikely(&ingress_needed_key)) {
 		bool another = false;
 
+		nf_skip_egress(skb, true);
 		skb = sch_handle_ingress(skb, &pt_prev, &ret, orig_dev,
 					 &another);
 		if (another)
 			goto another_round;
 		if (!skb)
 			goto out;
 
+		nf_skip_egress(skb, false);
 		if (nf_ingress(skb, &pt_prev, &ret, orig_dev) < 0)
 			goto out;
 	}
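
The way the flag brackets the tc call in __dev_queue_xmit() can be traced with a toy user-space model. This is a sketch, not kernel code: `struct model_skb`, `model_nf_egress()` and `model_xmit()` are hypothetical names, the counters exist only to observe which classifier saw the packet, and a tc redirect is modelled as a nested transmit on another device.

```c
#include <assert.h>
#include <stdbool.h>

struct model_skb {
	bool nf_skip_egress;	/* models skb->nf_skip_egress */
	int nf_seen;		/* times netfilter egress classified the packet */
	int tc_seen;		/* times tc egress classified the packet */
};

/* Models nf_hook_egress(): does nothing while the skip flag is set. */
static void model_nf_egress(struct model_skb *skb)
{
	if (!skb->nf_skip_egress)
		skb->nf_seen++;
}

/*
 * Models the egress section of the transmit path. With tc_redirect set,
 * tc re-injects the packet on another device while the flag is still
 * true, so the nested transmit skips netfilter.
 */
static void model_xmit(struct model_skb *skb, bool tc_redirect)
{
	assert(skb);
	model_nf_egress(skb);		/* netfilter runs first on egress */
	skb->nf_skip_egress = true;	/* packet enters the tc layer */
	skb->tc_seen++;
	if (tc_redirect) {
		model_xmit(skb, false);	/* nested transmit on the target */
		return;			/* tc consumed the packet here */
	}
	skb->nf_skip_egress = false;	/* packet left the tc layer */
}
```

Calling model_xmit() twice in a row without a redirect imitates a tunnel driver transmitting the encapsulated packet: because the flag was reset, netfilter sees the packet on both passes. With a redirect, netfilter sees it once and tc twice.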

net/netfilter/Kconfig

Lines changed: 11 additions & 0 deletions
@@ -10,6 +10,17 @@ config NETFILTER_INGRESS
 	  This allows you to classify packets from ingress using the Netfilter
 	  infrastructure.
 
+config NETFILTER_EGRESS
+	bool "Netfilter egress support"
+	default y
+	select NET_EGRESS
+	help
+	  This allows you to classify packets before transmission using the
+	  Netfilter infrastructure.
+
+config NETFILTER_SKIP_EGRESS
+	def_bool NETFILTER_EGRESS && (NET_CLS_ACT || IFB)
+
 config NETFILTER_NETLINK
 	tristate

net/netfilter/core.c

Lines changed: 31 additions & 3 deletions
@@ -316,6 +316,12 @@ nf_hook_entry_head(struct net *net, int pf, unsigned int hooknum,
 			if (dev && dev_net(dev) == net)
 				return &dev->nf_hooks_ingress;
 		}
+#endif
+#ifdef CONFIG_NETFILTER_EGRESS
+		if (hooknum == NF_NETDEV_EGRESS) {
+			if (dev && dev_net(dev) == net)
+				return &dev->nf_hooks_egress;
+		}
 #endif
 		WARN_ON_ONCE(1);
 		return NULL;
@@ -344,6 +350,11 @@ static inline bool nf_ingress_hook(const struct nf_hook_ops *reg, int pf)
 	return false;
 }
 
+static inline bool nf_egress_hook(const struct nf_hook_ops *reg, int pf)
+{
+	return pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_EGRESS;
+}
+
 static void nf_static_key_inc(const struct nf_hook_ops *reg, int pf)
 {
 #ifdef CONFIG_JUMP_LABEL
@@ -383,9 +394,18 @@ static int __nf_register_net_hook(struct net *net, int pf,
 
 	switch (pf) {
 	case NFPROTO_NETDEV:
-		err = nf_ingress_check(net, reg, NF_NETDEV_INGRESS);
-		if (err < 0)
-			return err;
+#ifndef CONFIG_NETFILTER_INGRESS
+		if (reg->hooknum == NF_NETDEV_INGRESS)
+			return -EOPNOTSUPP;
+#endif
+#ifndef CONFIG_NETFILTER_EGRESS
+		if (reg->hooknum == NF_NETDEV_EGRESS)
+			return -EOPNOTSUPP;
+#endif
+		if ((reg->hooknum != NF_NETDEV_INGRESS &&
+		     reg->hooknum != NF_NETDEV_EGRESS) ||
+		    !reg->dev || dev_net(reg->dev) != net)
+			return -EINVAL;
 		break;
 	case NFPROTO_INET:
 		if (reg->hooknum != NF_INET_INGRESS)
@@ -417,6 +437,10 @@ static int __nf_register_net_hook(struct net *net, int pf,
 #ifdef CONFIG_NETFILTER_INGRESS
 	if (nf_ingress_hook(reg, pf))
 		net_inc_ingress_queue();
+#endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	if (nf_egress_hook(reg, pf))
+		net_inc_egress_queue();
 #endif
 	nf_static_key_inc(reg, pf);
 
@@ -474,6 +498,10 @@ static void __nf_unregister_net_hook(struct net *net, int pf,
 #ifdef CONFIG_NETFILTER_INGRESS
 		if (nf_ingress_hook(reg, pf))
 			net_dec_ingress_queue();
+#endif
+#ifdef CONFIG_NETFILTER_EGRESS
+		if (nf_egress_hook(reg, pf))
+			net_dec_egress_queue();
 #endif
 		nf_static_key_dec(reg, pf);
 	} else {
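
The NFPROTO_NETDEV validation added to __nf_register_net_hook() distinguishes a hook that is compiled out (-EOPNOTSUPP) from a request that is malformed (-EINVAL). A user-space sketch of that decision order is shown below. `struct model_config`, `enum model_hook` and `model_check_netdev_hook()` are hypothetical names; runtime booleans replace the preprocessor conditionals, and the device and namespace checks are reduced to two flags.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum model_hook { MODEL_INGRESS, MODEL_EGRESS, MODEL_NUMHOOKS };

/* Hypothetical build configuration standing in for the Kconfig symbols. */
struct model_config {
	bool ingress;	/* models CONFIG_NETFILTER_INGRESS */
	bool egress;	/* models CONFIG_NETFILTER_EGRESS */
};

/*
 * Returns 0 if a netdev-family hook registration would pass validation,
 * -EOPNOTSUPP if the requested hook is not built in, or -EINVAL if the
 * hook number is unknown, no device is given, or the device lives in a
 * different network namespace.
 */
static int model_check_netdev_hook(const struct model_config *cfg,
				   unsigned int hooknum, bool has_dev,
				   bool same_netns)
{
	assert(cfg);
	if (!cfg->ingress && hooknum == MODEL_INGRESS)
		return -EOPNOTSUPP;
	if (!cfg->egress && hooknum == MODEL_EGRESS)
		return -EOPNOTSUPP;
	if ((hooknum != MODEL_INGRESS && hooknum != MODEL_EGRESS) ||
	    !has_dev || !same_netns)
		return -EINVAL;
	return 0;
}
```

The ordering matters: a known hook that is compiled out reports -EOPNOTSUPP even when the device argument is also missing, because the configuration checks come first.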

net/netfilter/nfnetlink_hook.c

Lines changed: 12 additions & 4 deletions
@@ -185,7 +185,7 @@ static const struct nf_hook_entries *
 nfnl_hook_entries_head(u8 pf, unsigned int hook, struct net *net, const char *dev)
 {
 	const struct nf_hook_entries *hook_head = NULL;
-#ifdef CONFIG_NETFILTER_INGRESS
+#if defined(CONFIG_NETFILTER_INGRESS) || defined(CONFIG_NETFILTER_EGRESS)
 	struct net_device *netdev;
 #endif
 
@@ -221,9 +221,9 @@ nfnl_hook_entries_head(u8 pf, unsigned int hook, struct net *net, const char *de
 		hook_head = rcu_dereference(net->nf.hooks_decnet[hook]);
 		break;
 #endif
-#ifdef CONFIG_NETFILTER_INGRESS
+#if defined(CONFIG_NETFILTER_INGRESS) || defined(CONFIG_NETFILTER_EGRESS)
 	case NFPROTO_NETDEV:
-		if (hook != NF_NETDEV_INGRESS)
+		if (hook >= NF_NETDEV_NUMHOOKS)
 			return ERR_PTR(-EOPNOTSUPP);
 
 		if (!dev)
@@ -233,7 +233,15 @@ nfnl_hook_entries_head(u8 pf, unsigned int hook, struct net *net, const char *de
 		if (!netdev)
 			return ERR_PTR(-ENODEV);
 
-		return rcu_dereference(netdev->nf_hooks_ingress);
+#ifdef CONFIG_NETFILTER_INGRESS
+		if (hook == NF_NETDEV_INGRESS)
+			return rcu_dereference(netdev->nf_hooks_ingress);
+#endif
+#ifdef CONFIG_NETFILTER_EGRESS
+		if (hook == NF_NETDEV_EGRESS)
+			return rcu_dereference(netdev->nf_hooks_egress);
+#endif
+		fallthrough;
 #endif
 	default:
 		return ERR_PTR(-EPROTONOSUPPORT);

net/netfilter/nft_chain_filter.c

Lines changed: 3 additions & 1 deletion
@@ -310,9 +310,11 @@ static const struct nft_chain_type nft_chain_filter_netdev = {
 	.name = "filter",
 	.type = NFT_CHAIN_T_DEFAULT,
 	.family = NFPROTO_NETDEV,
-	.hook_mask = (1 << NF_NETDEV_INGRESS),
+	.hook_mask = (1 << NF_NETDEV_INGRESS) |
+		     (1 << NF_NETDEV_EGRESS),
 	.hooks = {
 		[NF_NETDEV_INGRESS] = nft_do_chain_netdev,
+		[NF_NETDEV_EGRESS] = nft_do_chain_netdev,
 	},
 };
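
The hook_mask change is what lets a netdev filter chain be bound to the new hook. The sketch below shows the bit test such a mask implies. It assumes the consumer of hook_mask checks one bit per hook number; that consumer is not part of this diff, and `model_hook_supported()`, the `MODEL_*` enum and the two mask macros are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the order of enum nf_dev_hooks after this commit. */
enum model_dev_hooks {
	MODEL_NETDEV_INGRESS,
	MODEL_NETDEV_EGRESS,
	MODEL_NETDEV_NUMHOOKS
};

/* hook_mask of the netdev filter chain type before and after the change. */
#define MODEL_MASK_BEFORE	(1u << MODEL_NETDEV_INGRESS)
#define MODEL_MASK_AFTER	((1u << MODEL_NETDEV_INGRESS) | \
				 (1u << MODEL_NETDEV_EGRESS))

/* Returns true if the chain type advertises support for hooknum. */
static bool model_hook_supported(unsigned int hook_mask, unsigned int hooknum)
{
	assert(hooknum < MODEL_NETDEV_NUMHOOKS);
	return (hook_mask & (1u << hooknum)) != 0;
}
```

With the old mask only the ingress bit is set, so a request for the egress hook would be rejected by such a check; with the new mask both hooks pass.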
