src/main/asciidoc/_chapters/ops_mgt.adoc: 65 additions & 5 deletions
@@ -2436,10 +2436,13 @@ states.
2436
2436
HBASE-15867 is only half done, as although we have abstract these two interfaces, we still only
2437
2437
have zookeeper based implementations.
2438
2438
2439
+
HBASE-27110 added a file system based replication peer storage, which stores replication peer state on the file system. The ZooKeeper based replication peer storage remains available.
2440
+
HBASE-27109 changed the replication queue storage from ZooKeeper based to HBase table based. See the <<hbase:replication,hbase:replication>> table section below for more details.
2441
+
2439
2442
Replication State in ZooKeeper::
2440
2443
By default, the state is contained in the base node _/hbase/replication_.
2441
-
Usually this nodes contains two child nodes, the `peers` znode is for storing replication peer
2442
-
state, and the `rs` znodes is for storing replication queue state.
2444
+
Usually this node contains two child znodes: the `peers` znode stores replication peer state, and the `rs` znode stores replication queue state. If you choose the file system based replication peer storage, you will not see the `peers` znode.
2445
+
Starting from 3.0.0, the replication queue state has moved to the <<hbase:replication,hbase:replication>> table, so you will not see the `rs` znode.
2443
2446
2444
2447
The `Peers` Znode::
2445
2448
The `peers` znode is stored in _/hbase/replication/peers_ by default.
@@ -2454,6 +2457,12 @@ The `RS` Znode::
2454
2457
The child znode name is the region server's hostname, client port, and start code.
2455
2458
This list includes both live and dead region servers.
2456
2459
2460
+
[[hbase:replication]]
2461
+
The hbase:replication Table::
2462
+
Starting from 3.0.0, replication queues are stored in the `hbase:replication` table. The row key is `<PeerId>-<ServerName>[/<SourceServerName>]`, the qualifier is the WAL group, and the value is the serialized `ReplicationGroupOffset`.
2463
+
The `ReplicationGroupOffset` holds the WAL file of the corresponding queue (`<PeerId>-<ServerName>[/<SourceServerName>]`) and the offset within that file.
2464
+
Because we track replication offset per queue instead of per file, we only need to store one replication offset per queue.
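The row key layout described above can be illustrated with a minimal Python sketch. This is not HBase code: the `QueueRowKey` type and both helper functions are hypothetical names introduced here, and the sketch assumes that the peer ID contains no `-` and that `<SourceServerName>` appears only on queues claimed from another server.

```python
from typing import NamedTuple, Optional


class QueueRowKey(NamedTuple):
    """Hypothetical model of <PeerId>-<ServerName>[/<SourceServerName>]."""
    peer_id: str
    server_name: str
    source_server_name: Optional[str]  # assumed to be set only for claimed queues


def format_row_key(key: QueueRowKey) -> str:
    """Build the row key string from its parts."""
    row = f"{key.peer_id}-{key.server_name}"
    if key.source_server_name is not None:
        row += f"/{key.source_server_name}"
    return row


def parse_row_key(row: str) -> QueueRowKey:
    """Split a row key back into its parts."""
    # Assumption: the peer ID has no '-', so the first '-' ends it.
    peer_id, _, rest = row.partition("-")
    server_name, sep, source = rest.partition("/")
    return QueueRowKey(peer_id, server_name, source if sep else None)
```

For example, `parse_row_key("2-1.1.1.1,60020,123456780")` yields peer ID `2`, server name `1.1.1.1,60020,123456780`, and no source server name.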
2465
+
2457
2466
Other implementations for `ReplicationPeerStorage`::
2458
2467
Starting from 2.6.0, we introduce a file system based `ReplicationPeerStorage`, which stores
2459
2468
the replication peer state with files on HFile file system, instead of znodes on ZooKeeper.
@@ -2473,7 +2482,7 @@ A ZooKeeper watcher is placed on the _${zookeeper.znode.parent}/rs_ node of the
2473
2482
This watch is used to monitor changes in the composition of the slave cluster.
2474
2483
When nodes are removed from the slave cluster, or if nodes go down or come back up, the master cluster's region servers will respond by selecting a new pool of slave region servers to replicate to.
2475
2484
2476
-
==== Keeping Track of Logs
2485
+
==== Keeping Track of Logs (based on ZooKeeper)
2477
2486
2478
2487
Each master cluster region server has its own znode in the replication znodes hierarchy.
2479
2488
It contains one znode per peer cluster (if 5 slave clusters, 5 znodes are created), and each of these contain a queue of WALs to process.
@@ -2494,6 +2503,18 @@ If the log is in the queue, the path will be updated in memory.
2494
2503
If the log is currently being replicated, the change will be done atomically so that the reader doesn't attempt to open the file when has already been moved.
2495
2504
Because moving a file is a NameNode operation , if the reader is currently reading the log, it won't generate any exception.
2496
2505
2506
+
==== Keeping Track of Logs (based on HBase table)
2507
+
2508
+
Starting from 3.0.0, the table based implementation includes the server name in the row key, so a given peer can have many rows.
2509
+
2510
+
For a normal replication queue, the WAL files belong to a region server that is still alive. All of its WAL files are tracked in memory, so we do not need to read them from the replication queue storage.
2511
+
For a recovered replication queue, we can get the WAL files of the dead region server by listing the old WAL directory on HDFS. So in theory we do not need to store every WAL file in the replication queue storage.
2512
+
In addition, the WAL file name (usually) contains its creation time, so all the WAL files in a WAL group can be sorted (the current replication framework already sorts them). This means we only need to store one replication offset per queue.
2513
+
When starting a recovered replication queue, we skip all the files before this offset and start replicating from it.
2514
+
2515
+
For `ReplicationLogCleaner`, all the files before this offset can be deleted; files at or after it must be kept.
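The way a single offset separates finished WAL files from pending ones can be sketched as follows. This is an illustration, not the HBase implementation: `wal_timestamp` and `split_at_offset` are hypothetical helpers, and the sketch assumes every WAL name ends in `.<timestamp>`, as in the example names used elsewhere in this section.

```python
from typing import List, Tuple


def wal_timestamp(wal_name: str) -> int:
    """Extract the trailing timestamp from a WAL file name."""
    # Assumption: the name ends with ".<timestamp>", e.g. "1.1.1.2,60020.1214".
    return int(wal_name.rsplit(".", 1)[1])


def split_at_offset(wal_names: List[str], offset_wal: str) -> Tuple[List[str], List[str]]:
    """Return (files before the offset WAL, files from the offset WAL onward).

    Files in the first list are already replicated, so they can be skipped by a
    recovered queue and are eligible for deletion. Files in the second list
    still have to be kept and replicated.
    """
    ordered = sorted(wal_names, key=wal_timestamp)
    cutoff = wal_timestamp(offset_wal)
    before = [w for w in ordered if wal_timestamp(w) < cutoff]
    remaining = [w for w in ordered if wal_timestamp(w) >= cutoff]
    return before, remaining
```

With WALs `1.1.1.2,60020.1214`, `1.1.1.2,60020.1248` and `1.1.1.2,60020.1312` and a stored offset inside `1.1.1.2,60020.1248`, only the first file falls before the offset; the other two are kept.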
2516
+
2517
+
2497
2518
==== Reading, Filtering and Sending Edits
2498
2519
2499
2520
By default, a source attempts to read from a WAL and ship log entries to a sink as quickly as possible.
@@ -2523,8 +2544,8 @@ NOTE: WALs are saved when replication is enabled or disabled as long as peers ex
2523
2544
2524
2545
When no region servers are failing, keeping track of the logs in ZooKeeper adds no value.
2525
2546
Unfortunately, region servers do fail, and since ZooKeeper is highly available, it is useful for managing the transfer of the queues in the event of a failure.
2526
-
2527
-
Each of the master cluster region servers keeps a watcher on every other region server, in order to be notified when one dies (just as the master does). When a failure happens, they all race to create a znode called `lock` inside the dead region server's znode that contains its queues.
2547
+
Each of the master cluster region servers keeps a watcher on every other region server, in order to be notified when one dies (just as the master does).
2548
+
When a failure happens, they all race to create a znode called `lock` inside the dead region server's znode that contains its queues.
2528
2549
The region server that creates it successfully then transfers all the queues to its own znode, one at a time since ZooKeeper does not support renaming queues.
2529
2550
After queues are all transferred, they are deleted from the old location.
2530
2551
The znodes that were recovered are renamed with the ID of the slave cluster appended with the name of the dead server.
@@ -2533,6 +2554,11 @@ Next, the master cluster region server creates one new source thread per copied
2533
2554
The main difference is that those queues will never receive new data, since they do not belong to their new region server.
2534
2555
When the reader hits the end of the last log, the queue's znode is deleted and the master cluster region server closes that replication source.
2535
2556
2557
+
Starting from 2.5.0, the failover logic has been moved to SCP (ServerCrashProcedure), which has a `SERVER_CRASH_CLAIM_REPLICATION_QUEUES` step to claim the replication queues of a dead server.
2558
+
Starting from 3.0.0, where the replication queue storage changed from ZooKeeper to a table, updates to the replication queue storage are asynchronous, so an extra step is needed to add the missing replication queues before claiming.
2559
+
2560
+
==== The replication queue claiming (based on ZooKeeper)
2561
+
2536
2562
Given a master cluster with 3 region servers replicating to a single slave with id `2`, the following hierarchy represents what the znodes layout could be at some point in time.
2537
2563
The region servers' znodes all contain a `peers` znode which contains a single queue.
2538
2564
The znode names in the queues represent the actual file names on HDFS in the form `address,port.timestamp`.
@@ -2610,6 +2636,32 @@ The new layout will be:
2610
2636
1.1.1.2,60020.1312 (Contains a position)
2611
2637
----
2612
2638
2639
+
==== The replication queue claiming (based on HBase table)
2640
+
2641
+
Given a master cluster with 3 region servers replicating to a single slave with id `2`, the following shows what the queue layout in the `hbase:replication` table could be at some point in time.
2642
+
The row key is `<PeerId>-<ServerName>[/<SourceServerName>]`, and the value is the WAL and offset.
----
2-1.1.1.1,60020,123456780 1.1.1.1,60020.1234 (Contains a position)
2648
+
2-1.1.1.2,60020,123456790 1.1.1.2,60020.1214 (Contains a position)
2649
+
2-1.1.1.3,60020,123456630 1.1.1.3,60020.1280 (Contains a position)
2650
+
----
2651
+
2652
+
Assume that 1.1.1.2 failed.
2653
+
The survivors will try to claim its queues and, arbitrarily, 1.1.1.3 wins.
2654
+
It will claim all the queues of 1.1.1.2. For each queue, this means removing its row and inserting a new row in which the server name is changed to the region server that claims the queue.
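The claim described above (remove the old row, insert a new one under the claiming server) can be sketched with a plain dictionary standing in for the `hbase:replication` table. This is a hypothetical illustration, not HBase code: `claim_queues` is an invented helper, the peer ID is assumed to contain no `-`, and recording the dead server as `<SourceServerName>` in the new row key is an assumption based on the row key format given earlier.

```python
from typing import Dict


def claim_queues(table: Dict[str, str], dead_server: str, claimer: str) -> None:
    """Move every queue row owned by dead_server to claimer, in place."""
    for row in list(table):  # iterate over a copy of the keys, since we mutate
        peer_id, _, rest = row.partition("-")  # assumes no '-' in the peer ID
        owner, sep, source = rest.partition("/")
        if owner != dead_server:
            continue
        # Assumption: a queue that was already claimed once keeps its
        # original source server name rather than the intermediate owner.
        original = source if sep else dead_server
        # Remove the old row and insert the new one with the same value.
        table[f"{peer_id}-{claimer}/{original}"] = table.pop(row)
```

Applied to the three rows above with 1.1.1.2 as the dead server and 1.1.1.3 as the claimer, the row `2-1.1.1.2,60020,123456790` disappears and a row `2-1.1.1.3,60020,123456630/1.1.1.2,60020,123456790` with the same WAL and offset takes its place.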