From 5e3d686971508b2fd9b153cfdf4a69a3b56333cd Mon Sep 17 00:00:00 2001
From: Andi Skrgat
Date: Mon, 1 Sep 2025 12:41:18 +0200
Subject: [PATCH 1/3] ISSU docs

---
 pages/clustering/high-availability.mdx | 116 +++++++++++++++++++++++++
 1 file changed, 116 insertions(+)

diff --git a/pages/clustering/high-availability.mdx b/pages/clustering/high-availability.mdx
index fa0d39f1f..a67e48351 100644
--- a/pages/clustering/high-availability.mdx
+++ b/pages/clustering/high-availability.mdx
@@ -685,6 +685,122 @@ distributed in any way you want between data centers. The failover time will be
 We support deploying Memgraph HA as part of the Kubernetes cluster through Helm charts.
 You can see example configurations [here](/getting-started/install-memgraph/kubernetes#memgraph-high-availability-helm-chart).
+
+## In-Service Software Upgrade (ISSU)
+
+Memgraph's high availability supports ISSU. Here will be described steps which are needed to perform the upgrade when using [HA charts](/getting-started/install-memgraph/kubernetes#memgraph-high-availability-helm-chart)
+but steps and the procedure are very similar for native deployment also. Although the upgrade process should always finish successfully, unexpected things can always happen. Therefore, we are strongly recommending doing
+a backup of your `lib` directory on all of your `StatefulSets` or native instances depending on the deployment type.
+
+If you are using HA charts, make sure to set `updateStrategy.type` config parameter to `OnDelete` before actually doing any upgrade. Depending on the infrastructure on which you have your Memgraph cluster, the details
+will differ a bit, but the backbone is the same.
+
+
+First, backup all of your data from all instances so in the case something goes wrong during the upgrade, you can safely downgrade cluster to the last stable version you had. For the native deployment, tools like `cp` or `rsync`
+will suffice. When using K8s, create a `VolumeSnapshotClass` with the yaml file similar to this:
+
+```
+apiVersion: snapshot.storage.k8s.io/v1
+kind: VolumeSnapshotClass
+metadata:
+  name: csi-azure-disk-snapclass
+driver: disk.csi.azure.com
+deletionPolicy: Delete
+```
+
+`kubectl apply -f azure_class.yaml`
+
+
+If you are using Google Kubernetes Engine, the default CSI driver is `pd.csi.storage.gke.io` so make sure to change the field `driver`. If you are using AWS cluster, refer to the documentation [here](https://docs.aws.amazon.com/eks/latest/userguide/csi-snapshot-controller.html)
+to check how to take volume snapshots on your K8s deployment.
+
+Now you can create a `VolumeSnapshot` of the lib directory using the yaml file:
+
+```
+apiVersion: snapshot.storage.k8s.io/v1
+kind: VolumeSnapshot
+metadata:
+  name: coord-3-snap # Use different names for all instances
+  namespace: default
+spec:
+  volumeSnapshotClassName: csi-azure-disk-snapclass
+  source:
+    persistentVolumeClaimName: memgraph-coordinator-3-lib-storage-memgraph-coordinator-3-0 # This is the lib PVC for the coordinator 3. Change the field to take a snapshot for other instances in the cluster.
+```
+
+```
+kubectl apply -f azure_snapshot.yaml
+```
+
+Repeat this step for all instances in the cluster.
+
+
+Next you should update `image.tag` field in the `values.yaml` configuration file to the version to which you want to upgrade your cluster. Run `helm upgrade <release-name> <chart> -f <values-file>`. Since we are using
+`updateStrategy.type=OnDelete`, this step will not restart any pod, rather it will just prepare pods for running the new version. If you are using natively deployed Memgraph HA cluster, just make sure you have your new
+binary ready to be started.
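+
+As a quick sketch, applying the upgrade could look like the following (the
+release name, chart reference and values file are illustrative and depend on
+how you installed the chart):
+
+```bash
+# Point the release at the new image version prepared in values.yaml.
+# Replace the release name and chart reference with the ones used at install time.
+helm upgrade memgraph-db memgraph/memgraph-high-availability -f values.yaml
+```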
+
+Our procedure for achieving zero-downtime upgrades consists of restarting one instance at a time. Since we use primary-secondary type of replication, we should first upgrade replicas then main and then we will upgrade
+coordinator followers, finishing with the coordinator leader. In order to find out on which pod/server the current main and the current cluster leader sits, run `SHOW INSTANCES`.
+
+If you are using K8s, the upgrade can be performed by deleting the pod. Start by deleting the replica pod (in this example replica is running on the pod `memgraph-data-1-0`):
+
+```
+kubectl delete pod memgraph-data-1-0
+```
+
+For the native type of deployment, stop your old binary and start the new one.
+
+Before starting the upgrade of the next pod, it is important to wait until all pods are ready. Otherwise, you may end up with a data loss. On K8s you can easily achieve that by running:
+
+```
+kubectl wait --for=condition=ready pod -all
+```
+
+For the native deployment, check if all your instances are alive manually.
+
+This step should be repeated for all of your replicas in the cluster. After upgrading all of your replicas, you can delete the main pod. Right before upgrading the main pod, run `SHOW REPLICATION LAG` to check whether
+replicas are behind MAIN. In case they are, your upgrade will be prone to a data loss. In order to achieve zero-downtime upgrade without any data loss, your replicas should be running in the `STRICT_SYNC` mode which effectively
+disables writes while upgrading any `STRICT_SYNC` instance. Your read queries should however work without any issues.
+
+```
+kubectl delete pod memgraph-data-0-0
+kubectl wait --for=condition=ready pod --all
+```
+
+The upgrade of coordinators is done in exactly the same way. Start by upgrading followers and finish with deleting the leader pod.
+
+```
+kubectl delete pod memgraph-coordinator-3-0
+kubectl wait --for=condition=ready pod --all
+kubectl delete pod memgraph-coordinator-2-0
+kubectl wait --for=condition=ready pod --all
+kubectl delete pod memgraph-coordinator-1-0
+kubectl wait --for=condition=ready pod --all
+```
+
+
+Your upgrade should be finished now, to check that everything works OK run `SHOW VERSION`, it should show you the new Memgraph version.
+
+
+If during the upgrade, you figured out that an error happened or even after upgrading all of your pods something doesn't work (e.g. write queries don't pass), you can safely downgrade your cluster to the previous version
+using `VolumeSnapshots` you took on K8s or file backups for native deployments. For the K8s deployment, run `helm uninstall <release-name>`. Open `values.yaml` and set `restoreDataFromSnapshot` for all instances to true.
+Make sure to set correct name of the snapshot you will use to recover your instances.
+
+
+
+If you're doing an upgrade on `minikube`, it is important to make sure that the snapshot resides on the same node on which the `StatefulSet` is installed. Otherwise, it won't be able to restore `StatefulSet's` attached
+PersistentVolumeClaim from the `VolumeSnapshot`.
+
+
+
+
+
+
+
+
+
+
 
 ## Docker Compose
 
 The following example shows you how to setup Memgraph cluster using Docker Compose. The cluster will use user-defined bridge network.

From 1fd6f80c303a7c11eb2e68bce695afff0b3fcbc0 Mon Sep 17 00:00:00 2001
From: Andi Skrgat
Date: Tue, 2 Sep 2025 09:10:17 +0200
Subject: [PATCH 2/3] docs: Explain write and read downtime

---
 pages/clustering/high-availability.mdx | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/pages/clustering/high-availability.mdx b/pages/clustering/high-availability.mdx
index a67e48351..992030a56 100644
--- a/pages/clustering/high-availability.mdx
+++ b/pages/clustering/high-availability.mdx
@@ -688,7 +688,7 @@ You can see example configurations [here](/getting-started/install-memgraph/kubernetes#memgraph-high-availability-helm-chart).
 ## In-Service Software Upgrade (ISSU)
 
 Memgraph's high availability supports ISSU. Here will be described steps which are needed to perform the upgrade when using [HA charts](/getting-started/install-memgraph/kubernetes#memgraph-high-availability-helm-chart)
-but steps and the procedure are very similar for native deployment also. Although the upgrade process should always finish successfully, unexpected things can always happen. Therefore, we are strongly recommending doing
+but steps and the procedure are very similar for the native deployment too. Although the upgrade process should always finish successfully, unexpected things can always happen. Therefore, we are strongly recommending doing
 a backup of your `lib` directory on all of your `StatefulSets` or native instances depending on the deployment type.
 
 If you are using HA charts, make sure to set `updateStrategy.type` config parameter to `OnDelete` before actually doing any upgrade. Depending on the infrastructure on which you have your Memgraph cluster, the details
 will differ a bit, but the backbone is the same.
@@ -759,7 +759,8 @@ For the native deployment, check if all your instances are alive manually.
 
 This step should be repeated for all of your replicas in the cluster. After upgrading all of your replicas, you can delete the main pod. Right before upgrading the main pod, run `SHOW REPLICATION LAG` to check whether
 replicas are behind MAIN. In case they are, your upgrade will be prone to a data loss. In order to achieve zero-downtime upgrade without any data loss, your replicas should be running in the `STRICT_SYNC` mode which effectively
-disables writes while upgrading any `STRICT_SYNC` instance. Your read queries should however work without any issues.
+disables writes while upgrading any `STRICT_SYNC` instance. The other option is to wait until replicas are up-to-date, stop writes and then perform the upgrade process. In this way, you can use any replication mode.
+Read queries should however work without any issues independently from the replica type you are using.
 
 ```
 kubectl delete pod memgraph-data-0-0
@@ -793,14 +794,6 @@ PersistentVolumeClaim from the `VolumeSnapshot`.
 
 
 
-
-
-
-
-
-
-
-
 
 ## Docker Compose
 
 The following example shows you how to setup Memgraph cluster using Docker Compose. The cluster will use user-defined bridge network.

From b3fb5c9efee6af93911a0dac45191f3f19e1cef3 Mon Sep 17 00:00:00 2001
From: matea16
Date: Fri, 26 Sep 2025 10:06:55 +0200
Subject: [PATCH 3/3] restructure docs

---
 pages/clustering/high-availability.mdx | 215 ++++++++++++++++++++-----
 1 file changed, 171 insertions(+), 44 deletions(-)

diff --git a/pages/clustering/high-availability.mdx b/pages/clustering/high-availability.mdx
index 992030a56..a02ef1106 100644
--- a/pages/clustering/high-availability.mdx
+++ b/pages/clustering/high-availability.mdx
@@ -687,109 +687,236 @@ You can see example configurations [here](/getting-started/install-memgraph/kubernetes#memgraph-high-availability-helm-chart).
 
 ## In-Service Software Upgrade (ISSU)
 
-Memgraph's high availability supports ISSU. Here will be described steps which are needed to perform the upgrade when using [HA charts](/getting-started/install-memgraph/kubernetes#memgraph-high-availability-helm-chart)
-but steps and the procedure are very similar for the native deployment too. Although the upgrade process should always finish successfully, unexpected things can always happen. Therefore, we are strongly recommending doing
-a backup of your `lib` directory on all of your `StatefulSets` or native instances depending on the deployment type.
+Memgraph's **High Availability** supports in-service software upgrades (ISSU).
+This guide explains the process when using [HA Helm
+charts](/getting-started/install-memgraph/kubernetes#memgraph-high-availability-helm-chart).
+The procedure is very similar for native deployments.
 
-If you are using HA charts, make sure to set `updateStrategy.type` config parameter to `OnDelete` before actually doing any upgrade. Depending on the infrastructure on which you have your Memgraph cluster, the details
-will differ a bit, but the backbone is the same.
+
+**Important**: Although the upgrade process is designed to complete
+successfully, unexpected issues may occur. We strongly recommend backing up
+the `lib` directory on all of your `StatefulSets` or native instances,
+depending on the deployment type.
 
-First, backup all of your data from all instances so in the case something goes wrong during the upgrade, you can safely downgrade cluster to the last stable version you had. For the native deployment, tools like `cp` or `rsync`
-will suffice. When using K8s, create a `VolumeSnapshotClass` with the yaml file similar to this:
-
-```
-apiVersion: snapshot.storage.k8s.io/v1
-kind: VolumeSnapshotClass
-metadata:
-  name: csi-azure-disk-snapclass
-driver: disk.csi.azure.com
-deletionPolicy: Delete
-```
-
-`kubectl apply -f azure_class.yaml`
+
+
+{<h3 className="custom-header">Prerequisites</h3>}
+
+If you are using **HA Helm charts**, set the following configuration before
+doing any upgrade:
+
+  ```yaml
+  updateStrategy:
+    type: OnDelete
+  ```
+
+  Depending on the infrastructure on which you run your Memgraph cluster, the
+details will differ slightly, but the overall procedure is the same.
+
+Prepare a backup of all data from all instances. This ensures you can safely
+downgrade the cluster to the last stable version you had.
+
+  - For **native deployments**, tools like `cp` or `rsync` are sufficient (see
+  the sketch at the end of this step).
+  - For **Kubernetes**, create a `VolumeSnapshotClass` with a YAML file similar
+  to this:
+
+  ```yaml
+  apiVersion: snapshot.storage.k8s.io/v1
+  kind: VolumeSnapshotClass
+  metadata:
+    name: csi-azure-disk-snapclass
+  driver: disk.csi.azure.com
+  deletionPolicy: Delete
+  ```
+
+  Apply it:
+
+  ```bash
+  kubectl apply -f azure_class.yaml
+  ```
+
+  - On **Google Kubernetes Engine**, the default CSI driver is
+  `pd.csi.storage.gke.io`, so make sure to change the `driver` field
+  accordingly.
+  - On **AWS EKS**, refer to the [AWS snapshot controller
+  docs](https://docs.aws.amazon.com/eks/latest/userguide/csi-snapshot-controller.html).
-If you are using Google Kubernetes Engine, the default CSI driver is `pd.csi.storage.gke.io` so make sure to change the field `driver`. If you are using AWS cluster, refer to the documentation [here](https://docs.aws.amazon.com/eks/latest/userguide/csi-snapshot-controller.html)
-to check how to take volume snapshots on your K8s deployment.
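+
+For native deployments, a minimal backup sketch could look like the following
+(the data directory path and backup destination are illustrative; adjust them
+to your installation):
+
+```bash
+# Copy the lib directory of one instance to a timestamped backup location.
+# Repeat on every server in the cluster before starting the upgrade.
+rsync -a /var/lib/memgraph/ /backup/memgraph-lib-$(date +%F)/
+```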
+
+{<h3 className="custom-header">Create snapshots</h3>}
+
 Now you can create a `VolumeSnapshot` of the lib directory using the yaml file:
 
-```
+```yaml
 apiVersion: snapshot.storage.k8s.io/v1
 kind: VolumeSnapshot
 metadata:
-  name: coord-3-snap # Use different names for all instances
+  name: coord-3-snap # Use a unique name for each instance
   namespace: default
 spec:
   volumeSnapshotClassName: csi-azure-disk-snapclass
   source:
-    persistentVolumeClaimName: memgraph-coordinator-3-lib-storage-memgraph-coordinator-3-0 # This is the lib PVC for the coordinator 3. Change the field to take a snapshot for other instances in the cluster.
+    persistentVolumeClaimName: memgraph-coordinator-3-lib-storage-memgraph-coordinator-3-0
 ```
 
-```
+Apply it:
+
+```bash
 kubectl apply -f azure_snapshot.yaml
 ```
 
-Repeat this step for all instances in the cluster.
+Repeat for every instance in the cluster.
+
+{<h3 className="custom-header">Update configuration</h3>}
+
+Next, update the `image.tag` field in the `values.yaml` configuration file to
+the version to which you want to upgrade your cluster.
+
+1. In your `values.yaml`, update the image version:
+
+  ```yaml
+  image:
+    tag: <new-version>
+  ```
+2. Apply the upgrade:
-Next you should update `image.tag` field in the `values.yaml` configuration file to the version to which you want to upgrade your cluster. Run `helm upgrade <release-name> <chart> -f <values-file>`. Since we are using
-`updateStrategy.type=OnDelete`, this step will not restart any pod, rather it will just prepare pods for running the new version. If you are using natively deployed Memgraph HA cluster, just make sure you have your new
-binary ready to be started.
+
+  ```bash
+  helm upgrade <release-name> <chart> -f <values-file>
+  ```
+
+  Since we are using `updateStrategy.type=OnDelete`, this step will not restart
+  any pods; it only prepares them to run the new version.
+  - For **native deployments**, ensure the new binary is available.
-
-Our procedure for achieving zero-downtime upgrades consists of restarting one instance at a time. Since we use primary-secondary type of replication, we should first upgrade replicas then main and then we will upgrade
-coordinator followers, finishing with the coordinator leader. In order to find out on which pod/server the current main and the current cluster leader sits, run `SHOW INSTANCES`.
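+
+Because no pods are restarted at this point, you can sanity-check that the
+running pods still use the old image before you start deleting them one by one
+(a small sketch; adapt the selector and output to your needs):
+
+```bash
+# List each pod together with the image of its first container.
+kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
+```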
+
+{<h3 className="custom-header">Upgrade procedure (zero downtime)</h3>}
+
+Our procedure for achieving zero-downtime upgrades consists of restarting one
+instance at a time. Memgraph uses **primary–secondary replication**. To avoid
+downtime:
+
+1. Upgrade **replicas** first.
+2. Upgrade the **main** instance.
+3. Upgrade **coordinator followers**, then the **leader**.
+
+To find out on which pod/server the current main instance and the current
+cluster leader sit, run:
+
+```cypher
+SHOW INSTANCES;
+```
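+
+Any Bolt client can be used for this. For example, with `mgconsole` you could
+run the query against a coordinator like this (host and port are illustrative;
+use your coordinator's Bolt endpoint):
+
+```bash
+echo "SHOW INSTANCES;" | mgconsole --host 127.0.0.1 --port 7687
+```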
+
+{<h3 className="custom-header">Upgrade replicas</h3>}
+
+If you are using K8s, the upgrade can be performed by deleting the pod. Start by
+deleting the replica pod (in this example, the replica is running on the pod
+`memgraph-data-1-0`):
+
+```bash
 kubectl delete pod memgraph-data-1-0
 ```
 
-For the native type of deployment, stop your old binary and start the new one.
+**Native deployment:** stop the old binary and start the new one.
 
-Before starting the upgrade of the next pod, it is important to wait until all pods are ready. Otherwise, you may end up with a data loss. On K8s you can easily achieve that by running:
+Before starting the upgrade of the next pod, it is important to wait until all
+pods are ready. Otherwise, you may end up with data loss. On K8s you can
+easily achieve that by running:
 
-```
-kubectl wait --for=condition=ready pod -all
+```bash
+kubectl wait --for=condition=ready pod --all
 ```
 
 For the native deployment, check if all your instances are alive manually.
 
-This step should be repeated for all of your replicas in the cluster. After upgrading all of your replicas, you can delete the main pod. Right before upgrading the main pod, run `SHOW REPLICATION LAG` to check whether
-replicas are behind MAIN. In case they are, your upgrade will be prone to a data loss. In order to achieve zero-downtime upgrade without any data loss, your replicas should be running in the `STRICT_SYNC` mode which effectively
-disables writes while upgrading any `STRICT_SYNC` instance. The other option is to wait until replicas are up-to-date, stop writes and then perform the upgrade process. In this way, you can use any replication mode.
-Read queries should however work without any issues independently from the replica type you are using.
+This step should be repeated for all of your replicas in the cluster.
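+
+If you have several replicas, the delete-and-wait cycle can be scripted. A
+minimal sketch, assuming the data pods currently acting as replicas are
+`memgraph-data-1-0` and `memgraph-data-2-0` (confirm with `SHOW INSTANCES`
+first):
+
+```bash
+# Restart one replica at a time and wait for the whole cluster to become ready again.
+for pod in memgraph-data-1-0 memgraph-data-2-0; do
+  kubectl delete pod "$pod"
+  kubectl wait --for=condition=ready pod --all --timeout=10m
+done
+```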
+
+{<h3 className="custom-header">Upgrade the main</h3>}
+
+Before deleting the main pod, check the replication lag to see whether replicas
+are behind MAIN:
+
+```cypher
+SHOW REPLICATION LAG;
+```
+
+If replicas are behind, your upgrade will be prone to data loss. To achieve a
+zero-downtime upgrade without any data loss, either:
+
+  - Use `STRICT_SYNC` mode (writes will be blocked during upgrade), or
+  - Wait until replicas are fully caught up, then pause writes. This way, you
+can use any replication mode. Read queries should work without any issues
+regardless of the replication mode you are using.
+
+Upgrade the main pod:
+
+```bash
 kubectl delete pod memgraph-data-0-0
 kubectl wait --for=condition=ready pod --all
 ```
 
-The upgrade of coordinators is done in exactly the same way. Start by upgrading followers and finish with deleting the leader pod.
+
+{<h3 className="custom-header">Upgrade coordinators</h3>}
+
+The upgrade of coordinators is done in exactly the same way. Start by upgrading
+followers and finish with deleting the leader pod:
+
+```bash
 kubectl delete pod memgraph-coordinator-3-0
 kubectl wait --for=condition=ready pod --all
+
 kubectl delete pod memgraph-coordinator-2-0
 kubectl wait --for=condition=ready pod --all
+
 kubectl delete pod memgraph-coordinator-1-0
 kubectl wait --for=condition=ready pod --all
 ```
+
+
+{<h3 className="custom-header">Verify upgrade</h3>}
+
+Your upgrade should now be finished. To check that everything works, run:
+
+```cypher
+SHOW VERSION;
+```
+
+It should show you the new Memgraph version.
+
+{<h3 className="custom-header">Rollback</h3>}
+
+If an error occurs during the upgrade, or something doesn't work even after all
+pods have been upgraded (e.g. write queries don't pass), you can safely
+downgrade your cluster to the previous version using the `VolumeSnapshots` you
+took on K8s or the file backups for native deployments.
+
+- **Kubernetes:**
+
+  ```bash
+  helm uninstall <release-name>
+  ```
+
+  In `values.yaml`, for all instances set:
-Your upgrade should be finished now, to check that everything works OK run `SHOW VERSION`, it should show you the new Memgraph version.
+
+  ```yaml
+  restoreDataFromSnapshot: true
+  ```
+
+  Make sure to set the correct name of the snapshot that will be used to recover
+  your instances.
-If during the upgrade, you figured out that an error happened or even after upgrading all of your pods something doesn't work (e.g. write queries don't pass), you can safely downgrade your cluster to the previous version
-using `VolumeSnapshots` you took on K8s or file backups for native deployments. For the K8s deployment, run `helm uninstall <release-name>`. Open `values.yaml` and set `restoreDataFromSnapshot` for all instances to true.
-Make sure to set correct name of the snapshot you will use to recover your instances.
+
+- **Native deployments:** restore from your file backups.
+
-If you're doing an upgrade on `minikube`, it is important to make sure that the snapshot resides on the same node on which the `StatefulSet` is installed. Otherwise, it won't be able to restore `StatefulSet's` attached
+If you're doing an upgrade on `minikube`, it is important to make sure that the
+snapshot resides on the same node on which the `StatefulSet` is installed.
+Otherwise, Kubernetes won't be able to restore the `StatefulSet`'s attached
 PersistentVolumeClaim from the `VolumeSnapshot`.
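+
+As a sketch, a full rollback on Kubernetes could then look like the following
+(the release name, chart reference and values file are illustrative; revert
+`image.tag` to the previous version and set `restoreDataFromSnapshot` and the
+snapshot names in `values.yaml` before reinstalling):
+
+```bash
+# Remove the upgraded release, then reinstall so the instances start from the
+# snapshots referenced in values.yaml.
+helm uninstall memgraph-db
+helm install memgraph-db memgraph/memgraph-high-availability -f values.yaml
+```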