From 24b9173c9e1e065545aaa139e4d41c3cee95b855 Mon Sep 17 00:00:00 2001
From: Andi Skrgat
Date: Mon, 22 Sep 2025 08:25:52 +0200
Subject: [PATCH 1/4] docs: Document max_failover_replica_lag

---
 pages/clustering/high-availability.mdx | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/pages/clustering/high-availability.mdx b/pages/clustering/high-availability.mdx
index fa0d39f1f..08a3f3070 100644
--- a/pages/clustering/high-availability.mdx
+++ b/pages/clustering/high-availability.mdx
@@ -395,7 +395,7 @@ server that coordinator instances use for communication. Flag `--instance-health
 check the status of the replication instance to update its status. Flag `--instance-down-timeout-sec` gives the user the ability to control how much time should
 pass before the coordinator starts considering the instance to be down.

-There is also a configuration option for specifying whether reads from the main are enabled. The configuration value is by default false but can be changed in run-time
+There is a configuration option for specifying whether reads from the main are enabled. The configuration value is false by default but can be changed at runtime
 using the following query:

 ```
@@ -408,6 +408,22 @@ Users can also choose whether failover to the async replica is allowed by using
 SET COORDINATOR SETTING 'sync_failover_only' TO 'true'/'false' ;
 ```

+Users can control the maximum transaction lag allowed during failover through configuration. If a replica is behind the main instance by more than the configured threshold,
+that replica becomes ineligible for failover. This prevents data loss beyond the user's acceptable limits.
+
+To implement this functionality, we employ a caching mechanism on the cluster leader that tracks replica lag. The cache updates with each StateCheckRpc response from
+replicas. During the brief failover window, the cluster leader may not have current lag information for all instances. This trade-off is intentional—it avoids flooding Raft
+logs with frequently-changing lag data while maintaining failover safety guarantees.
+
+
+The configuration value can be controlled using the query:
+
+```
+SET COORDINATOR SETTING 'max_failover_replica_lag' TO 'true'/'false' ;
+```
+
+
+
 By default, the value is `true`, which means that only sync replicas are candidates in the election. When the value is set to `false`, the async replica is also considered,
 but there is an additional risk of experiencing data loss. However, failover to an async replica may be
 necessary when other sync replicas are down and you want to

From bc63cbad4e5c01a55e4455cf933b8dc2bbbfbff9 Mon Sep 17 00:00:00 2001
From: Andi Skrgat
Date: Mon, 22 Sep 2025 08:31:57 +0200
Subject: [PATCH 2/4] docs: Describe the failover process

---
 pages/clustering/high-availability.mdx | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/pages/clustering/high-availability.mdx b/pages/clustering/high-availability.mdx
index 08a3f3070..010966f53 100644
--- a/pages/clustering/high-availability.mdx
+++ b/pages/clustering/high-availability.mdx
@@ -523,11 +523,18 @@ about other replicas to which it will replicate data. Once that request succeeds

 ### Choosing new main from available replicas

-When failover is happening, some replicas can also be down. From the list of alive replicas, a new main is chosen. First, the leader coordinator contacts each alive replica
-to get info about each database's last commit timestamp. In the case of enabled multi-tenancy, from each instance coordinator will get info on all databases and their last commit
-timestamp. Currently, the coordinator chooses an instance to become a new main by comparing the latest commit timestamps of all databases. The instance which is newest on most
-databases is considered the best candidate for the new main. If there are multiple instances which have the same number of newest databases, we sum timestamps of all databases
-and consider instance with a larger sum as the better candidate.
+
+During failover, the coordinator must select a new main instance from available replicas, as some may be offline. The leader coordinator queries each live replica to
+retrieve the committed transaction count for every database.
+
+The selection algorithm prioritizes data recency using a two-phase approach:
+
+1. Database majority rule: The coordinator identifies which replica has the highest committed transaction count for each database. The replica that leads in the most
+databases becomes the preferred candidate.
+2. Total transaction tiebreaker: If multiple replicas tie for leading the most databases, the coordinator sums each replica's committed transactions across all databases.
+The replica with the highest total becomes the new main.
+
+This approach ensures the new main instance has the most up-to-date data across the cluster while maintaining consistency guarantees.

 ### Old main rejoining to the cluster

From 615abd1762439f67263d60d57c3bbaaea34cf010 Mon Sep 17 00:00:00 2001
From: Andi Skrgat
Date: Tue, 23 Sep 2025 13:24:06 +0200
Subject: [PATCH 3/4] docs: Clarify and fix the example

---
 pages/clustering/high-availability.mdx | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/pages/clustering/high-availability.mdx b/pages/clustering/high-availability.mdx
index 010966f53..b334db1e2 100644
--- a/pages/clustering/high-availability.mdx
+++ b/pages/clustering/high-availability.mdx
@@ -411,15 +411,16 @@ SET COORDINATOR SETTING 'sync_failover_only' TO 'true'/'false' ;
 Users can control the maximum transaction lag allowed during failover through configuration. If a replica is behind the main instance by more than the configured threshold,
 that replica becomes ineligible for failover. This prevents data loss beyond the user's acceptable limits.

-To implement this functionality, we employ a caching mechanism on the cluster leader that tracks replica lag. The cache updates with each StateCheckRpc response from
-replicas. During the brief failover window, the cluster leader may not have current lag information for all instances. This trade-off is intentional—it avoids flooding Raft
-logs with frequently-changing lag data while maintaining failover safety guarantees.
+To implement this functionality, we employ a caching mechanism on the cluster leader that tracks replicas' lag. The cache is updated with each StateCheckRpc response from
+replicas. During the brief failover window on the coordinators' side, the new cluster leader may not have the current lag information for all data instances; in that case,
+any replica can become main. This trade-off is intentional: it avoids flooding Raft logs with frequently-changing lag data while maintaining failover safety guarantees
+in the large majority of situations.

 The configuration value can be controlled using the query:

 ```
-SET COORDINATOR SETTING 'max_failover_replica_lag' TO 'true'/'false' ;
+SET COORDINATOR SETTING 'max_failover_replica_lag' TO '10' ;
 ```

From 8f2255083d3b37812de1acced33d4ea41ff64757 Mon Sep 17 00:00:00 2001
From: Matea Pesic <80577904+matea16@users.noreply.github.com>
Date: Thu, 25 Sep 2025 10:07:23 +0200
Subject: [PATCH 4/4] Update pages/clustering/high-availability.mdx

---
 pages/clustering/high-availability.mdx | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/pages/clustering/high-availability.mdx b/pages/clustering/high-availability.mdx
index b334db1e2..3eef37158 100644
--- a/pages/clustering/high-availability.mdx
+++ b/pages/clustering/high-availability.mdx
@@ -530,9 +530,9 @@ retrieve the committed transaction count for every database.

 The selection algorithm prioritizes data recency using a two-phase approach:

-1. Database majority rule: The coordinator identifies which replica has the highest committed transaction count for each database. The replica that leads in the most
+1. **Database majority rule**: The coordinator identifies which replica has the highest committed transaction count for each database. The replica that leads in the most
 databases becomes the preferred candidate.
-2. Total transaction tiebreaker: If multiple replicas tie for leading the most databases, the coordinator sums each replica's committed transactions across all databases.
+2. **Total transaction tiebreaker**: If multiple replicas tie for leading the most databases, the coordinator sums each replica's committed transactions across all databases.
 The replica with the highest total becomes the new main.

 This approach ensures the new main instance has the most up-to-date data across the cluster while maintaining consistency guarantees.
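
As a rough illustration of the behaviour these patches document (the `max_failover_replica_lag` eligibility check and the two-phase choice of a new main), here is a minimal sketch in Python. It is not Memgraph's implementation: the types, field names, and helpers (`Replica`, `eligible_for_failover`, `choose_new_main`) are hypothetical, and it assumes the per-database committed transaction counts are already available from the leader's cache.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Replica:
    name: str
    # Committed transaction count per database, as cached by the leader from
    # StateCheckRpc responses (hypothetical representation).
    committed: Dict[str, int] = field(default_factory=dict)


def eligible_for_failover(replica: Replica, main_committed: Dict[str, int],
                          max_failover_replica_lag: int) -> bool:
    """A replica stays a candidate only if it lags the main by at most
    `max_failover_replica_lag` committed transactions on every database."""
    return all(
        main_committed[db] - replica.committed.get(db, 0) <= max_failover_replica_lag
        for db in main_committed
    )


def choose_new_main(candidates: List[Replica]) -> Optional[Replica]:
    """Two-phase selection: database majority rule, then a total-transaction tiebreaker."""
    if not candidates:
        return None
    databases = {db for r in candidates for db in r.committed}

    # Phase 1: count, for each replica, the databases on which it has the
    # highest committed transaction count (ties credit every tied replica).
    wins = {r.name: 0 for r in candidates}
    for db in databases:
        best = max(r.committed.get(db, 0) for r in candidates)
        for r in candidates:
            if r.committed.get(db, 0) == best:
                wins[r.name] += 1

    # Phase 2: among the replicas leading on the most databases, prefer the one
    # with the largest total number of committed transactions across all databases.
    top = max(wins.values())
    leaders = [r for r in candidates if wins[r.name] == top]
    return max(leaders, key=lambda r: sum(r.committed.values()))


# Example: replica_1 leads on db1, replica_2 on db2; the totals break the tie.
main_state = {"db1": 120, "db2": 80}
replicas = [
    Replica("replica_1", {"db1": 120, "db2": 60}),
    Replica("replica_2", {"db1": 110, "db2": 80}),
]
candidates = [r for r in replicas if eligible_for_failover(r, main_state, 25)]
print(choose_new_main(candidates).name if candidates else "no eligible replica")
```

The real coordinator works from Raft-replicated state and RPC responses rather than in-memory dictionaries, but the order of steps matches what the patches describe: filter out replicas that exceed the configured lag, apply the database majority rule, and fall back to the total committed-transaction count as the tiebreaker.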