Introduce cross-cluster replication #30086

@elasticmachine

Original comment by @jasontedor:

The goal of cross-cluster replication is to enable users to replicate operations from one cluster to another cluster via active-passive replication. There are three drivers for providing CCR functionality in X-Pack:

  • resiliency: if the primary cluster fails, a secondary cluster can serve as a hot backup
  • geo-proximity so that reads can be served locally
  • ECE can use the infrastructure for a UI to replicate data from one cluster to another

The purpose of this meta-issue is to serve as a high-level plan for the initial implementation.

Our initial implementation will be a shard-to-shard, pull-based model built on the transport layer, without automatic setup.

In this model, shards on a following index are responsible for pulling from shards on the leader index. We chose this model because:

  • this model has simpler state management than a push-based model (each follower has one leader, but not vice-versa)
  • recovery during failure is simpler as the follower knows how far it is in the sequence of operations
  • operations can be streamed from any copy of the leader shard (the primary shard, or any of its replicas)
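A minimal sketch of the pull model described above, to make the state-management argument concrete. It is not the CCR implementation: `LeaderShardCopy`, `FollowerShard`, `read`, and `pull_once` are hypothetical names, and operations are plain strings whose list position stands in for the sequence number.

```python
from dataclasses import dataclass, field


@dataclass
class LeaderShardCopy:
    """Hypothetical stand-in for one copy (primary or replica) of a leader shard."""
    operations: list  # operations[i] is the operation with sequence number i

    def read(self, from_seq_no, max_batch):
        # Serve a batch of operations starting at the requested sequence number.
        return self.operations[from_seq_no:from_seq_no + max_batch]


@dataclass
class FollowerShard:
    """Hypothetical follower shard; it alone tracks how far it has replicated."""
    applied: list = field(default_factory=list)

    @property
    def next_seq_no(self):
        # The follower's own history tells it where to resume after a failure.
        return len(self.applied)

    def pull_once(self, leader_copy, max_batch=2):
        # Pull the next batch from whichever leader copy is convenient.
        batch = leader_copy.read(self.next_seq_no, max_batch)
        self.applied.extend(batch)
        return len(batch)


ops = ["index a", "index b", "delete a", "index c", "index d"]
primary = LeaderShardCopy(list(ops))
replica = LeaderShardCopy(list(ops))

follower = FollowerShard()
follower.pull_once(primary)         # pulls sequence numbers 0-1 from the primary
follower.pull_once(replica)         # resumes at sequence number 2 from a replica
while follower.pull_once(primary):  # drain whatever is left
    pass
```

The leader copies hold no per-follower state here; the only replication progress lives in the follower, which is what makes resuming after a failure simple in this sketch.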

As for automatic setup, for now users will have to manually set up a following index for a leader index, and will have to bootstrap existing data via snapshot/restore. This does not necessarily mean that we will not have something more sophisticated when the first version ships, only that for the initial implementation we will not look at anything else. We can consider automatic setup (for example, for time-based indices) and remote recovery infrastructure later.

Utilizing the transport layer allows us to reuse existing infrastructure, and we can follow the path blazed by cross-cluster search for reading from remote clusters.

Basic infrastructure:

  • API to serve operations from a given sequence number
    • needs to fetch the sequence number
    • respond with batch of requested size to client
    • only return documents below the global checkpoint
  • persistent task for the following index to pull data from the leader index
  • a following engine implementation that can index operations that already have a sequence number (from the leader)
    • the engine should reject operations that do not have a sequence number
    • the engine will add a primary term
    • this engine will not close gaps in histories upon recoveries; that is expected to be done from the leader
    • we need to carefully consider the tradeoff of having the engine open for pluggability versus a key component of the CCR infrastructure being in open source
  • implement a mechanism to transfer index metadata changes (e.g., mapping) to a follower from the leader
    • could be done inline with the shard operation stream
    • alternatively, the shard operation stream could indicate the minimum index metadata version required and the following index could wait (through an observer) until the local version catches up
  • CCR REST API for users to set up remote replication
    • design API
    • implement API
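Two of the items above, the operations API and the following engine, can be sketched as follows. This is an illustration under stated assumptions, not the planned design: `Operation`, `read_operations`, and `FollowingEngine` are hypothetical names, "below the global checkpoint" is read here as "at or below", and the primary term the engine adds is assumed to be the follower's own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Operation:
    doc: str
    seq_no: Optional[int] = None        # assigned by the leader
    primary_term: Optional[int] = None  # added by the following engine


def read_operations(history, global_checkpoint, from_seq_no, batch_size):
    """Return up to batch_size operations starting at from_seq_no.

    Assumption: operations above the global checkpoint are never returned.
    """
    last = min(from_seq_no + batch_size - 1, global_checkpoint)
    return [op for op in history if from_seq_no <= op.seq_no <= last]


class FollowingEngine:
    """Hypothetical engine that only indexes operations numbered by the leader."""

    def __init__(self, primary_term):
        self.primary_term = primary_term
        self.indexed = {}

    def index(self, op):
        if op.seq_no is None:
            # Reject operations that did not come from the leader's history.
            raise ValueError("following engine requires a leader-assigned sequence number")
        # Keep the leader's sequence number, add the (assumed) local primary term.
        self.indexed[op.seq_no] = Operation(op.doc, op.seq_no, self.primary_term)


history = [Operation(f"doc-{i}", seq_no=i) for i in range(5)]
batch = read_operations(history, global_checkpoint=2, from_seq_no=1, batch_size=10)

engine = FollowingEngine(primary_term=1)
for op in batch:
    engine.index(op)
```

Even though the caller asks for ten operations, the batch stops at the global checkpoint, so the follower never indexes an operation that the leader's replication group has not yet agreed on.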

Things to do and investigate:
