Commit 35c55c9

Jon Paul Maloy authored and davem330 committed
tipc: add neighbor monitoring framework
TIPC-based clusters are by default set up with full-mesh link connectivity between all nodes. Those links are expected to provide a short failure detection time, by default set to 1500 ms. Because of this, the background load for neighbor monitoring in an N-node cluster increases by a factor of N on each node, while the overall monitoring traffic through the network infrastructure increases at a ~(N * (N - 1)) rate. Experience has shown that such clusters don't scale well beyond ~100 nodes unless we significantly increase the failure discovery tolerance.

This commit introduces a framework and an algorithm that drastically reduce this background load, while basically maintaining the original failure detection times across the whole cluster. Using this algorithm, background load will now grow at a rate of ~(2 * sqrt(N)) per node, and at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will now have to actively monitor 38 neighbors in a 400-node cluster, instead of 399 as before.

This "Overlapping Ring Supervision Algorithm" is completely distributed and employs no centralized or coordinated state. It goes as follows:

- Each node makes up a linearly ascending, circular list of all its N known neighbors, based on their TIPC node identity. This algorithm must be the same on all nodes.

- The node then selects the next M = sqrt(N) - 1 nodes downstream from itself in the list, and chooses to actively monitor those. This is called its "local monitoring domain".

- It creates a domain record describing the monitoring domain, and piggy-backs this in the data area of all neighbor monitoring messages (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in the cluster eventually (by default within 400 ms) will learn about its monitoring domain.

- Whenever a node discovers a change in its local domain, e.g., a node has been added or has gone down, it creates and sends out a new version of its domain record to inform all neighbors about the change.

- A node receiving a domain record from anybody outside its local domain matches this against its own list (which may not look the same), and chooses not to actively monitor those members of the received domain record that are also present in its own list. Instead, it relies on indications from the direct monitoring nodes if an indirectly monitored node has gone up or down. If a node is indicated lost, the receiving node temporarily activates its own direct monitoring towards that node in order to confirm, or not, that it is actually gone.

- Since each node is actively monitoring sqrt(N) downstream neighbors, each node is also actively monitored by the same number of upstream neighbors. This means that all non-direct monitoring nodes normally will receive sqrt(N) indications that a node is gone.

- A major drawback with ring monitoring is how it handles failures that cause massive network partitions. If both a lost node and all its direct monitoring neighbors are inside the lost partition, the nodes in the remaining partition will never receive indications about the loss. To overcome this, each node also chooses to actively monitor some nodes outside its local domain. Those nodes are called remote domain "heads", and are selected in such a way that no node in the cluster will be more than two direct monitoring hops away. Because of this, each node, apart from monitoring the members of its local domain, will also typically monitor sqrt(N) remote head nodes.

- As an optimization, local list status, domain status and domain records are marked with a generation number. This saves senders from unnecessarily conveying unaltered domain records, and receivers from performing unneeded re-adaptations of their node monitoring list, such as re-assigning domain heads.

- As a measure of caution we have added the possibility to disable the new algorithm through configuration. We do this by keeping a threshold value for the cluster size; a cluster that grows beyond this value will switch from full-mesh to ring monitoring, and vice versa when it shrinks below the value. This means that if the threshold is set to a value larger than any anticipated cluster size (the default threshold is 32) the new algorithm is effectively disabled. A patch set for altering the threshold value and for listing the table contents will follow shortly.

- This change is fully backwards compatible.

Acked-by: Ying Xue <[email protected]>
Signed-off-by: Jon Maloy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
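
As a rough illustration of the arithmetic and the domain selection described in the commit message, the standalone C sketch below computes M = sqrt(N) - 1 and picks the M nodes downstream of a given node on the sorted, circular id list. It is not code from this commit: the names `isqrt_u32`, `local_domain_size`, `monitored_peers` and `select_local_domain` are hypothetical, the id array is assumed to contain the node's own id, and the number of remote heads is assumed to roughly equal the number of local members.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Integer square root, rounded down (avoids a libm dependency). */
static unsigned int isqrt_u32(uint32_t n)
{
	uint32_t r = 0;

	while ((uint64_t)(r + 1) * (r + 1) <= n)
		r++;
	return r;
}

/* Local domain size M = sqrt(N) - 1, clamped at 0 for tiny clusters. */
static unsigned int local_domain_size(uint32_t n_nodes)
{
	unsigned int root = isqrt_u32(n_nodes);

	return root > 0 ? root - 1 : 0;
}

/* Assumption: about as many remote heads as local domain members. */
static unsigned int monitored_peers(uint32_t n_nodes)
{
	return 2 * local_domain_size(n_nodes);
}

static int cmp_u32(const void *a, const void *b)
{
	uint32_t x = *(const uint32_t *)a;
	uint32_t y = *(const uint32_t *)b;

	return (x > y) - (x < y);
}

/*
 * Sort ids[] ascending in place, then copy the m ids that follow `self`
 * on the circular list into out[]. Returns the number of ids copied,
 * or 0 if `self` is not present in ids[].
 */
static size_t select_local_domain(uint32_t *ids, size_t n, uint32_t self,
				  uint32_t *out, size_t m)
{
	size_t i, pos = n;

	assert(ids && out);
	qsort(ids, n, sizeof(*ids), cmp_u32);
	for (i = 0; i < n; i++) {
		if (ids[i] == self) {
			pos = i;
			break;
		}
	}
	if (pos == n)
		return 0;
	if (m > n - 1)
		m = n - 1;	/* a node never monitors itself */
	for (i = 0; i < m; i++)
		out[i] = ids[(pos + 1 + i) % n];	/* wrap around the ring */
	return m;
}
```

Under these assumptions a 400-node cluster yields 19 local domain members and 38 actively monitored peers per node, which agrees with the figure quoted in the commit message.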
1 parent 7889681 commit 35c55c9

10 files changed, 797 insertions(+), 31 deletions(-)

net/tipc/Makefile

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ obj-$(CONFIG_TIPC) := tipc.o

 tipc-y	+= addr.o bcast.o bearer.o \
 	   core.o link.o discover.o msg.o \
-	   name_distr.o subscr.o name_table.o net.o \
+	   name_distr.o subscr.o monitor.o name_table.o net.o \
 	   netlink.o netlink_compat.o node.o socket.o eth_media.o \
 	   server.o socket.o

net/tipc/addr.h

Lines changed: 1 addition & 0 deletions
@@ -73,4 +73,5 @@ int tipc_addr_node_valid(u32 addr);
 int tipc_in_scope(u32 domain, u32 addr);
 int tipc_addr_scope(u32 domain);
 char *tipc_addr_string_fill(char *string, u32 addr);
+
 #endif

net/tipc/bearer.c

Lines changed: 7 additions & 1 deletion
@@ -1,7 +1,7 @@
 /*
  * net/tipc/bearer.c: TIPC bearer code
  *
- * Copyright (c) 1996-2006, 2013-2014, Ericsson AB
+ * Copyright (c) 1996-2006, 2013-2016, Ericsson AB
  * Copyright (c) 2004-2006, 2010-2013, Wind River Systems
  * All rights reserved.
  *
@@ -39,6 +39,7 @@
 #include "bearer.h"
 #include "link.h"
 #include "discover.h"
+#include "monitor.h"
 #include "bcast.h"
 #include "netlink.h"

@@ -313,6 +314,10 @@ static int tipc_enable_bearer(struct net *net, const char *name,
 	rcu_assign_pointer(tn->bearer_list[bearer_id], b);
 	if (skb)
 		tipc_bearer_xmit_skb(net, bearer_id, skb, &b->bcast_addr);
+
+	if (tipc_mon_create(net, bearer_id))
+		return -ENOMEM;
+
 	pr_info("Enabled bearer <%s>, discovery domain %s, priority %u\n",
 		name,
 		tipc_addr_string_fill(addr_string, disc_domain), priority);
@@ -348,6 +353,7 @@ static void bearer_disable(struct net *net, struct tipc_bearer *b)
 	tipc_disc_delete(b->link_req);
 	RCU_INIT_POINTER(tn->bearer_list[bearer_id], NULL);
 	kfree_rcu(b, rcu);
+	tipc_mon_delete(net, bearer_id);
 }

 int tipc_enable_l2_media(struct net *net, struct tipc_bearer *b,

net/tipc/bearer.h

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 /*
  * net/tipc/bearer.h: Include file for TIPC bearer code
  *
- * Copyright (c) 1996-2006, 2013-2014, Ericsson AB
+ * Copyright (c) 1996-2006, 2013-2016, Ericsson AB
  * Copyright (c) 2005, 2010-2011, Wind River Systems
  * All rights reserved.
  *

net/tipc/core.c

Lines changed: 1 addition & 0 deletions
@@ -57,6 +57,7 @@ static int __net_init tipc_init_net(struct net *net)

 	tn->net_id = 4711;
 	tn->own_addr = 0;
+	tn->mon_threshold = TIPC_DEF_MON_THRESHOLD;
 	get_random_bytes(&tn->random, sizeof(int));
 	INIT_LIST_HEAD(&tn->node_list);
 	spin_lock_init(&tn->node_list_lock);
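
The default stored in `mon_threshold` feeds the cluster-size check described in the commit message. The sketch below shows one way such a check could look. `ring_monitoring_active` and `DEF_MON_THRESHOLD` are hypothetical names used only for illustration, and the commit message does not say what happens when the peer count equals the threshold exactly; this sketch assumes ring monitoring starts only once the count exceeds it.

```c
#include <assert.h>
#include <stdbool.h>

#define DEF_MON_THRESHOLD 32	/* mirrors TIPC_DEF_MON_THRESHOLD in this commit */

/*
 * Hypothetical helper: returns true when the cluster has grown beyond the
 * threshold and ring monitoring should replace full-mesh monitoring.
 */
static bool ring_monitoring_active(int peer_cnt, int mon_threshold)
{
	assert(peer_cnt >= 0);
	return peer_cnt > mon_threshold;
}
```

With this rule, setting the threshold above any expected cluster size keeps every cluster in full-mesh mode, which is how the commit message describes disabling the algorithm.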

net/tipc/core.h

Lines changed: 13 additions & 2 deletions
@@ -66,11 +66,13 @@ struct tipc_bc_base;
 struct tipc_link;
 struct tipc_name_table;
 struct tipc_server;
+struct tipc_monitor;

 #define TIPC_MOD_VER "2.0.0"

-#define NODE_HTABLE_SIZE   512
-#define MAX_BEARERS	   3
+#define NODE_HTABLE_SIZE       512
+#define MAX_BEARERS	         3
+#define TIPC_DEF_MON_THRESHOLD  32

 extern int tipc_net_id __read_mostly;
 extern int sysctl_tipc_rmem[3] __read_mostly;
@@ -88,6 +90,10 @@ struct tipc_net {
 	u32 num_nodes;
 	u32 num_links;

+	/* Neighbor monitoring list */
+	struct tipc_monitor *monitors[MAX_BEARERS];
+	int mon_threshold;
+
 	/* Bearer list */
 	struct tipc_bearer __rcu *bearer_list[MAX_BEARERS + 1];

@@ -126,6 +132,11 @@ static inline struct list_head *tipc_nodes(struct net *net)
 	return &tipc_net(net)->node_list;
 }

+static inline unsigned int tipc_hashfn(u32 addr)
+{
+	return addr & (NODE_HTABLE_SIZE - 1);
+}
+
 static inline u16 mod(u16 x)
 {
 	return x & 0xffffu;
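
The `tipc_hashfn()` helper added above relies on `NODE_HTABLE_SIZE` being a power of two, so that masking with `size - 1` gives the same bucket as a modulo operation. The userspace sketch below illustrates that equivalence; `HTABLE_SIZE`, `hashfn_mask` and `hashfn_mod` are illustrative names, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

#define HTABLE_SIZE 512u	/* same value as NODE_HTABLE_SIZE */

/* Bucket index via bit mask, as tipc_hashfn() does. */
static unsigned int hashfn_mask(uint32_t addr)
{
	/* The mask trick is only valid for power-of-two table sizes. */
	assert((HTABLE_SIZE & (HTABLE_SIZE - 1)) == 0);
	return addr & (HTABLE_SIZE - 1);
}

/* Reference implementation using the modulo operator. */
static unsigned int hashfn_mod(uint32_t addr)
{
	return addr % HTABLE_SIZE;
}
```

Only the low nine bits of the address select the bucket, so addresses that differ only in higher bits land in the same bucket.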

net/tipc/link.c

Lines changed: 37 additions & 12 deletions
@@ -42,6 +42,7 @@
 #include "name_distr.h"
 #include "discover.h"
 #include "netlink.h"
+#include "monitor.h"

 #include <linux/pkt_sched.h>

@@ -95,6 +96,7 @@ struct tipc_stats {
  * @pmsg: convenience pointer to "proto_msg" field
  * @priority: current link priority
  * @net_plane: current link network plane ('A' through 'H')
+ * @mon_state: cookie with information needed by link monitor
  * @backlog_limit: backlog queue congestion thresholds (indexed by importance)
  * @exp_msg_count: # of tunnelled messages expected during link changeover
  * @reset_rcv_checkpt: seq # of last acknowledged message at time of link reset
@@ -138,6 +140,7 @@ struct tipc_link {
 	char if_name[TIPC_MAX_IF_NAME];
 	u32 priority;
 	char net_plane;
+	struct tipc_mon_state mon_state;
 	u16 rst_cnt;

 	/* Failover/synch */
@@ -708,18 +711,25 @@ int tipc_link_timeout(struct tipc_link *l, struct sk_buff_head *xmitq)
 	bool setup = false;
 	u16 bc_snt = l->bc_sndlink->snd_nxt - 1;
 	u16 bc_acked = l->bc_rcvlink->acked;
-
-	link_profile_stats(l);
+	struct tipc_mon_state *mstate = &l->mon_state;

 	switch (l->state) {
 	case LINK_ESTABLISHED:
 	case LINK_SYNCHING:
-		if (l->silent_intv_cnt > l->abort_limit)
-			return tipc_link_fsm_evt(l, LINK_FAILURE_EVT);
 		mtyp = STATE_MSG;
+		link_profile_stats(l);
+		tipc_mon_get_state(l->net, l->addr, mstate, l->bearer_id);
+		if (mstate->reset || (l->silent_intv_cnt > l->abort_limit))
+			return tipc_link_fsm_evt(l, LINK_FAILURE_EVT);
 		state = bc_acked != bc_snt;
-		probe = l->silent_intv_cnt;
-		l->silent_intv_cnt++;
+		state |= l->bc_rcvlink->rcv_unacked;
+		state |= l->rcv_unacked;
+		state |= !skb_queue_empty(&l->transmq);
+		state |= !skb_queue_empty(&l->deferdq);
+		probe = mstate->probing;
+		probe |= l->silent_intv_cnt;
+		if (probe || mstate->monitoring)
+			l->silent_intv_cnt++;
 		break;
 	case LINK_RESET:
 		setup = l->rst_cnt++ <= 4;
@@ -830,6 +840,7 @@ void tipc_link_reset(struct tipc_link *l)
 	l->stats.recv_info = 0;
 	l->stale_count = 0;
 	l->bc_peer_is_up = false;
+	memset(&l->mon_state, 0, sizeof(l->mon_state));
 	tipc_link_reset_stats(l);
 }

@@ -1238,6 +1249,9 @@ static void tipc_link_build_proto_msg(struct tipc_link *l, int mtyp, bool probe,
 	struct tipc_msg *hdr;
 	struct sk_buff_head *dfq = &l->deferdq;
 	bool node_up = link_is_up(l->bc_rcvlink);
+	struct tipc_mon_state *mstate = &l->mon_state;
+	int dlen = 0;
+	void *data;

 	/* Don't send protocol message during reset or link failover */
 	if (tipc_link_is_blocked(l))
@@ -1250,12 +1264,13 @@ static void tipc_link_build_proto_msg(struct tipc_link *l, int mtyp, bool probe,
 		rcvgap = buf_seqno(skb_peek(dfq)) - l->rcv_nxt;

 	skb = tipc_msg_create(LINK_PROTOCOL, mtyp, INT_H_SIZE,
-			      TIPC_MAX_IF_NAME, l->addr,
+			      tipc_max_domain_size, l->addr,
 			      tipc_own_addr(l->net), 0, 0, 0);
 	if (!skb)
 		return;

 	hdr = buf_msg(skb);
+	data = msg_data(hdr);
 	msg_set_session(hdr, l->session);
 	msg_set_bearer_id(hdr, l->bearer_id);
 	msg_set_net_plane(hdr, l->net_plane);
@@ -1271,14 +1286,18 @@ static void tipc_link_build_proto_msg(struct tipc_link *l, int mtyp, bool probe,

 	if (mtyp == STATE_MSG) {
 		msg_set_seq_gap(hdr, rcvgap);
-		msg_set_size(hdr, INT_H_SIZE);
 		msg_set_probe(hdr, probe);
+		tipc_mon_prep(l->net, data, &dlen, mstate, l->bearer_id);
+		msg_set_size(hdr, INT_H_SIZE + dlen);
+		skb_trim(skb, INT_H_SIZE + dlen);
 		l->stats.sent_states++;
 		l->rcv_unacked = 0;
 	} else {
 		/* RESET_MSG or ACTIVATE_MSG */
 		msg_set_max_pkt(hdr, l->advertised_mtu);
-		strcpy(msg_data(hdr), l->if_name);
+		strcpy(data, l->if_name);
+		msg_set_size(hdr, INT_H_SIZE + TIPC_MAX_IF_NAME);
+		skb_trim(skb, INT_H_SIZE + TIPC_MAX_IF_NAME);
 	}
 	if (probe)
 		l->stats.sent_probes++;
@@ -1371,7 +1390,9 @@ static int tipc_link_proto_rcv(struct tipc_link *l, struct sk_buff *skb,
 	u16 peers_tol = msg_link_tolerance(hdr);
 	u16 peers_prio = msg_linkprio(hdr);
 	u16 rcv_nxt = l->rcv_nxt;
+	u16 dlen = msg_data_sz(hdr);
 	int mtyp = msg_type(hdr);
+	void *data;
 	char *if_name;
 	int rc = 0;

@@ -1381,6 +1402,10 @@ static int tipc_link_proto_rcv(struct tipc_link *l, struct sk_buff *skb,
 	if (tipc_own_addr(l->net) > msg_prevnode(hdr))
 		l->net_plane = msg_net_plane(hdr);

+	skb_linearize(skb);
+	hdr = buf_msg(skb);
+	data = msg_data(hdr);
+
 	switch (mtyp) {
 	case RESET_MSG:

@@ -1391,16 +1416,14 @@ static int tipc_link_proto_rcv(struct tipc_link *l, struct sk_buff *skb,
 		/* fall thru' */

 	case ACTIVATE_MSG:
-		skb_linearize(skb);
-		hdr = buf_msg(skb);

 		/* Complete own link name with peer's interface name */
 		if_name = strrchr(l->name, ':') + 1;
 		if (sizeof(l->name) - (if_name - l->name) <= TIPC_MAX_IF_NAME)
 			break;
 		if (msg_data_sz(hdr) < TIPC_MAX_IF_NAME)
 			break;
-		strncpy(if_name, msg_data(hdr), TIPC_MAX_IF_NAME);
+		strncpy(if_name, data, TIPC_MAX_IF_NAME);

 		/* Update own tolerance if peer indicates a non-zero value */
 		if (in_range(peers_tol, TIPC_MIN_LINK_TOL, TIPC_MAX_LINK_TOL))
@@ -1448,6 +1471,8 @@ static int tipc_link_proto_rcv(struct tipc_link *l, struct sk_buff *skb,
 			rc = TIPC_LINK_UP_EVT;
 			break;
 		}
+		tipc_mon_rcv(l->net, data, dlen, l->addr,
+			     &l->mon_state, l->bearer_id);

 		/* Send NACK if peer has sent pkts we haven't received yet */
 		if (more(peers_snd_nxt, rcv_nxt) && !tipc_link_is_synching(l))
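
The reworked `LINK_ESTABLISHED`/`LINK_SYNCHING` branch of `tipc_link_timeout()` can be read as a small decision function over the monitor state. The userspace model below is a simplified sketch of that branch only: the struct and enum names are hypothetical, the `state` flags, statistics and FSM call are left out, and only the `reset`, `probing` and `monitoring` fields that appear in the diff are modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Subset of the monitor state fields referenced in the hunk. */
struct mon_state_model {
	bool probing;
	bool reset;
	bool monitoring;
};

struct link_model {
	unsigned int silent_intv_cnt;
	unsigned int abort_limit;
	struct mon_state_model mon;
};

enum timeout_action { TIMEOUT_NONE, TIMEOUT_PROBE, TIMEOUT_FAILURE };

/* One timer tick for an established link, following the new logic. */
static enum timeout_action link_timeout_model(struct link_model *l)
{
	bool probe;

	assert(l);
	/* The monitor may force a reset, or the peer was silent too long. */
	if (l->mon.reset || l->silent_intv_cnt > l->abort_limit)
		return TIMEOUT_FAILURE;

	/* Probe if the monitor asks for it or silence was already counted. */
	probe = l->mon.probing || l->silent_intv_cnt != 0;

	/* Silence is only counted while probing or actively monitoring. */
	if (probe || l->mon.monitoring)
		l->silent_intv_cnt++;

	return probe ? TIMEOUT_PROBE : TIMEOUT_NONE;
}
```

In this model a link that is neither probed nor actively monitored never accumulates silent intervals, so it can only fail through the `reset` flag. This appears to be what lets indirectly monitored links stay quiet without timing out.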
