
Commit 91d0b78

jsitnicki authored and kuba-moo committed
inet: Add IP_LOCAL_PORT_RANGE socket option
Users who want to share a single public IP address for outgoing connections between several hosts traditionally reach for SNAT. However, SNAT requires state keeping on the node(s) performing the NAT.

A stateless alternative exists, where a single IP address used for egress can be shared between several hosts by partitioning the available ephemeral port range. In such a setup:

1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both the destination IP and the destination port.

An application which wants to open an outgoing connection (connect) from a given port range today can choose between two solutions:

1. Manually pick the source port by bind()'ing to it before connect()'ing the socket.

   This approach has a couple of downsides:

   a) Search for a free port has to be implemented in user space. If the chosen 4-tuple happens to be busy, the application needs to retry from a different local port number.

      Detecting whether a 4-tuple is busy can be either easy (TCP) or hard (UDP). In the TCP case, the application simply has to check if connect() returned an error (EADDRNOTAVAIL). That is assuming that local port sharing was enabled (REUSEADDR) by all the sockets.

          # Assume desired local port range is 60_000-60_511
          s = socket(AF_INET, SOCK_STREAM)
          s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
          s.bind(("192.0.2.1", 60_000))
          s.connect(("1.1.1.1", 53))
          # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
          # Application must retry with another local port

      In the UDP case, the network stack allows binding more than one socket to the same 4-tuple when local port sharing is enabled (REUSEADDR). Hence detecting the conflict is much harder and involves querying sock_diag and toggling the REUSEADDR flag [1].
   b) For TCP, bind()-ing to a port within the ephemeral port range means that no connecting sockets, that is those which leave it to the network stack to find a free local port at connect() time, can use this port.

      IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port will be skipped during the free port search at connect() time.

2. Isolate the app in a dedicated netns and use the per-netns ip_local_port_range sysctl to adjust the ephemeral port range bounds.

   The per-netns setting affects all sockets, so this approach can be used only if:

   - there is just one egress IP address, or
   - the desired egress port range is the same for all egress IP addresses used by the application.

   For TCP, this approach avoids the downsides of (1). Free port search and 4-tuple conflict detection are done by the network stack:

       system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")

       s = socket(AF_INET, SOCK_STREAM)
       s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
       s.bind(("192.0.2.1", 0))
       s.connect(("1.1.1.1", 53))
       # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy

   For UDP this approach has limited applicability. Setting the IP_BIND_ADDRESS_NO_PORT socket option does not result in the local source port being shared with other connected UDP sockets. Hence relying on the network stack to find a free source port limits the number of outgoing UDP flows from a single IP address down to the number of available ephemeral ports.

To put it another way, partitioning the ephemeral port range between hosts using the existing Linux networking API is cumbersome.

To address this use case, add a new socket option at the SOL_IP level, named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the ephemeral port range for each socket individually.

The option can be used only to narrow down the per-netns local port range. If the per-socket range lies outside of the per-netns range, the latter takes precedence.
UAPI-wise, the low and high range bounds are passed to the kernel as a pair of u16 values in host byte order packed into a u32. This avoids pointer passing.

    PORT_LO = 40_000
    PORT_HI = 40_511

    s = socket(AF_INET, SOCK_STREAM)
    v = struct.pack("I", PORT_HI << 16 | PORT_LO)
    s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
    s.bind(("127.0.0.1", 0))
    s.getsockname()
    # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
    # if there is a free port. EADDRINUSE otherwise.

[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116

Reviewed-by: Marek Majkowski <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Signed-off-by: Jakub Sitnicki <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
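
The commit message describes two things that lend themselves to a small illustration: splitting an ephemeral port range into disjoint per-host ranges, and packing a (low, high) pair into the u32 option value. The following is a minimal sketch, not part of the commit. The helper names `partition_port_range`, `pack_port_range` and `unpack_port_range` are hypothetical, and the even split is just one possible partitioning policy. The constant 51 is taken from the `include/uapi/linux/in.h` change in this commit.

```python
import struct

# Option number as defined by this commit in include/uapi/linux/in.h.
IP_LOCAL_PORT_RANGE = 51


def partition_port_range(lo, hi, n_hosts):
    """Split the inclusive range [lo, hi] into n_hosts disjoint ranges.

    Hypothetical helper: an even, contiguous split is assumed here.
    """
    total = hi - lo + 1
    if n_hosts <= 0 or total < n_hosts:
        raise ValueError("not enough ports for the requested number of hosts")
    base, extra = divmod(total, n_hosts)
    ranges = []
    start = lo
    for i in range(n_hosts):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def pack_port_range(lo, hi):
    """Pack two u16 bounds into a host-byte-order u32: high half is hi."""
    if not (0 <= lo <= 0xFFFF and 0 <= hi <= 0xFFFF):
        raise ValueError("port bounds must fit in 16 bits")
    return struct.pack("I", hi << 16 | lo)


def unpack_port_range(optval):
    """Inverse of pack_port_range(); returns (lo, hi)."""
    (val,) = struct.unpack("I", optval)
    return val & 0xFFFF, val >> 16


if __name__ == "__main__":
    for host, (lo, hi) in enumerate(partition_port_range(60_000, 62_047, 4)):
        print(host, lo, hi, pack_port_range(lo, hi).hex())
```

On a kernel that includes this commit, the bytes returned by `pack_port_range()` could be passed as the value to `setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, ...)`, as in the commit message's own example.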
1 parent 6a7a2c1 commit 91d0b78


8 files changed: +51 -6 lines changed


include/net/inet_sock.h

Lines changed: 4 additions & 0 deletions
@@ -249,6 +249,10 @@ struct inet_sock {
 	__be32 mc_addr;
 	struct ip_mc_socklist __rcu *mc_list;
 	struct inet_cork_full cork;
+	struct {
+		__u16 lo;
+		__u16 hi;
+	} local_port_range;
 };
 
 #define IPCORK_OPT 1 /* ip-options has been held in ipcork.opt */

include/net/ip.h

Lines changed: 2 additions & 1 deletion
@@ -340,7 +340,8 @@ static inline u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_o
 	} \
 }
 
-void inet_get_local_port_range(struct net *net, int *low, int *high);
+void inet_get_local_port_range(const struct net *net, int *low, int *high);
+void inet_sk_get_local_port_range(const struct sock *sk, int *low, int *high);
 
 #ifdef CONFIG_SYSCTL
 static inline bool inet_is_local_reserved_port(struct net *net, unsigned short port)

include/uapi/linux/in.h

Lines changed: 1 addition & 0 deletions
@@ -162,6 +162,7 @@ struct in_addr {
 #define MCAST_MSFILTER 48
 #define IP_MULTICAST_ALL 49
 #define IP_UNICAST_IF 50
+#define IP_LOCAL_PORT_RANGE 51
 
 #define MCAST_EXCLUDE 0
 #define MCAST_INCLUDE 1

net/ipv4/inet_connection_sock.c

Lines changed: 23 additions & 2 deletions
@@ -117,7 +117,7 @@ bool inet_rcv_saddr_any(const struct sock *sk)
 	return !sk->sk_rcv_saddr;
 }
 
-void inet_get_local_port_range(struct net *net, int *low, int *high)
+void inet_get_local_port_range(const struct net *net, int *low, int *high)
 {
 	unsigned int seq;
 
@@ -130,6 +130,27 @@ void inet_get_local_port_range(struct net *net, int *low, int *high)
 }
 EXPORT_SYMBOL(inet_get_local_port_range);
 
+void inet_sk_get_local_port_range(const struct sock *sk, int *low, int *high)
+{
+	const struct inet_sock *inet = inet_sk(sk);
+	const struct net *net = sock_net(sk);
+	int lo, hi, sk_lo, sk_hi;
+
+	inet_get_local_port_range(net, &lo, &hi);
+
+	sk_lo = inet->local_port_range.lo;
+	sk_hi = inet->local_port_range.hi;
+
+	if (unlikely(lo <= sk_lo && sk_lo <= hi))
+		lo = sk_lo;
+	if (unlikely(lo <= sk_hi && sk_hi <= hi))
+		hi = sk_hi;
+
+	*low = lo;
+	*high = hi;
+}
+EXPORT_SYMBOL(inet_sk_get_local_port_range);
+
 static bool inet_use_bhash2_on_bind(const struct sock *sk)
 {
 #if IS_ENABLED(CONFIG_IPV6)
@@ -316,7 +337,7 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
 ports_exhausted:
 	attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0;
 other_half_scan:
-	inet_get_local_port_range(net, &low, &high);
+	inet_sk_get_local_port_range(sk, &low, &high);
 	high++; /* [32768, 60999] -> [32768, 61000[ */
 	if (high - low < 4)
 		attempt_half = 0;
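
The clamping rule in `inet_sk_get_local_port_range()` above can be easier to follow with concrete numbers. The following is a user-space model of that logic in Python, offered as a sketch only: the function name `effective_port_range` is hypothetical, and the model assumes an unset per-socket bound is stored as 0, which falls outside any valid per-netns range and is therefore ignored.

```python
def effective_port_range(netns_range, sk_range=(0, 0)):
    """Model of the per-socket clamping done by the new kernel helper.

    netns_range: (lo, hi) from the per-netns ip_local_port_range sysctl.
    sk_range:    (lo, hi) set via IP_LOCAL_PORT_RANGE, 0 meaning "unset".
    Each per-socket bound is applied only if it lies inside the current
    range; otherwise the per-netns bound is kept.
    """
    lo, hi = netns_range
    sk_lo, sk_hi = sk_range

    if lo <= sk_lo <= hi:
        lo = sk_lo
    # As in the C code, this check sees the possibly updated lower bound.
    if lo <= sk_hi <= hi:
        hi = sk_hi
    return lo, hi


if __name__ == "__main__":
    netns = (32768, 60999)
    print(effective_port_range(netns))                  # option never set
    print(effective_port_range(netns, (40000, 40511)))  # narrowed range
    print(effective_port_range(netns, (1000, 40511)))   # low bound ignored
```

Note that the two bounds are evaluated independently, so a per-socket range that only partly overlaps the per-netns range narrows just the bound that falls inside it.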

net/ipv4/inet_hashtables.c

Lines changed: 1 addition & 1 deletion
@@ -1016,7 +1016,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 
 	l3mdev = inet_sk_bound_l3mdev(sk);
 
-	inet_get_local_port_range(net, &low, &high);
+	inet_sk_get_local_port_range(sk, &low, &high);
 	high++; /* [32768, 60999] -> [32768, 61000[ */
 	remaining = high - low;
 	if (likely(remaining > 1))

net/ipv4/ip_sockglue.c

Lines changed: 18 additions & 0 deletions
@@ -923,6 +923,7 @@ int do_ip_setsockopt(struct sock *sk, int level, int optname,
 	case IP_CHECKSUM:
 	case IP_RECVFRAGSIZE:
 	case IP_RECVERR_RFC4884:
+	case IP_LOCAL_PORT_RANGE:
 		if (optlen >= sizeof(int)) {
 			if (copy_from_sockptr(&val, optval, sizeof(val)))
 				return -EFAULT;
@@ -1365,6 +1366,20 @@ int do_ip_setsockopt(struct sock *sk, int level, int optname,
 		WRITE_ONCE(inet->min_ttl, val);
 		break;
 
+	case IP_LOCAL_PORT_RANGE:
+	{
+		const __u16 lo = val;
+		const __u16 hi = val >> 16;
+
+		if (optlen != sizeof(__u32))
+			goto e_inval;
+		if (lo != 0 && hi != 0 && lo > hi)
+			goto e_inval;
+
+		inet->local_port_range.lo = lo;
+		inet->local_port_range.hi = hi;
+		break;
+	}
 	default:
 		err = -ENOPROTOOPT;
 		break;
@@ -1743,6 +1758,9 @@ int do_ip_getsockopt(struct sock *sk, int level, int optname,
 	case IP_MINTTL:
 		val = inet->min_ttl;
 		break;
+	case IP_LOCAL_PORT_RANGE:
+		val = inet->local_port_range.hi << 16 | inet->local_port_range.lo;
+		break;
 	default:
 		sockopt_release_sock(sk);
 		return -ENOPROTOOPT;
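
To make the accepted and rejected option values in the `do_ip_setsockopt()` hunk above concrete, here is a small Python model of the validation, written as a sketch rather than a statement of the kernel's full behavior. The function names are hypothetical, and only the two checks visible in the diff are modeled: the option length must be exactly 4 bytes, and when both bounds are non-zero the low bound must not exceed the high bound.

```python
import errno
import os
import struct


def parse_local_port_range(optval):
    """Model of the IP_LOCAL_PORT_RANGE checks in do_ip_setsockopt().

    Returns (lo, hi) on success and raises OSError(EINVAL) where the
    kernel code would jump to e_inval.
    """
    if len(optval) != 4:
        raise OSError(errno.EINVAL, os.strerror(errno.EINVAL))

    (val,) = struct.unpack("I", optval)
    lo = val & 0xFFFF   # low 16 bits, like `const __u16 lo = val`
    hi = val >> 16      # high 16 bits, like `const __u16 hi = val >> 16`

    # A zero bound means "not set", so the ordering check is skipped for it.
    if lo != 0 and hi != 0 and lo > hi:
        raise OSError(errno.EINVAL, os.strerror(errno.EINVAL))
    return lo, hi


def format_local_port_range(lo, hi):
    """Model of the value reported by do_ip_getsockopt()."""
    return hi << 16 | lo


if __name__ == "__main__":
    print(parse_local_port_range(struct.pack("I", 40_511 << 16 | 40_000)))
    print(hex(format_local_port_range(40_000, 40_511)))
```

Under this model, setting only one bound (leaving the other at 0) is accepted, while a fully specified range with low above high is rejected.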

net/ipv4/udp.c

Lines changed: 1 addition & 1 deletion
@@ -248,7 +248,7 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
 		int low, high, remaining;
 		unsigned int rand;
 
-		inet_get_local_port_range(net, &low, &high);
+		inet_sk_get_local_port_range(sk, &low, &high);
 		remaining = (high - low) + 1;
 
 		rand = get_random_u32();

net/sctp/socket.c

Lines changed: 1 addition & 1 deletion
@@ -8322,7 +8322,7 @@ static int sctp_get_port_local(struct sock *sk, union sctp_addr *addr)
 		int low, high, remaining, index;
 		unsigned int rover;
 
-		inet_get_local_port_range(net, &low, &high);
+		inet_sk_get_local_port_range(sk, &low, &high);
 		remaining = (high - low) + 1;
 		rover = get_random_u32_below(remaining) + low;
