
Conversation

@andrershov
Contributor

Currently, we only log that a WriteStateException has occurred in
GatewayMetaState.

The goal of this PR is to re-throw WriteStateException to the upper layers.
If the dirty flag is set to false, we wrap the WriteStateException in an
UncheckedIOException; we prefer an unchecked exception so as not to add
explicit throws clauses to a multitude of layers.
If the dirty flag is set to true, the world is broken and we need to
halt the JVM. Instead of halting explicitly in GatewayMetaState, we
prefer to throw an IOError, which will subsequently be handled by
ElasticsearchUncaughtExceptionHandler, and the JVM will be halted. Sadly,
IOError accepts no message argument, which is why we first wrap the
WriteStateException in an IOException and only then in an IOError.
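A minimal sketch of this re-throw logic, assuming a dirty flag and a rethrow helper on the exception (the method name rethrowAsErrorOrUncheckedException and the constructor shape are illustrative; see the review snippets below for the code as originally proposed):

import java.io.IOError;
import java.io.IOException;
import java.io.UncheckedIOException;

/**
 * Sketch: thrown when writing cluster metadata to disk fails; {@code dirty}
 * records whether the failure may have left the on-disk state half-written.
 */
public class WriteStateException extends IOException {
    private final boolean dirty;

    WriteStateException(boolean dirty, String message, Exception cause) {
        super(message, cause);
        this.dirty = dirty;
    }

    public void rethrowAsErrorOrUncheckedException(String msg) {
        if (dirty) {
            // IOError is caught by ElasticsearchUncaughtExceptionHandler and halts the JVM.
            // IOError has no message-taking constructor, so wrap in an IOException first.
            throw new IOError(new IOException(msg + ", storage is dirty", this));
        } else {
            // Unchecked, so intermediate layers need no explicit throws clauses.
            throw new UncheckedIOException(msg, this);
        }
    }
}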

This PR also adds tests for WriteStateException.

@andrershov andrershov added >enhancement :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Nov 29, 2018
@andrershov andrershov self-assigned this Nov 29, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner DaveCTurner added the :Core/Infra/Resiliency Keep running when everything is ok. Die quickly if things go horribly wrong. label Nov 29, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@DaveCTurner
Contributor

DaveCTurner commented Nov 29, 2018

I think this is how it should work, but I would like opinions from @elastic/es-core-infra too. To that end here is a high-level description of what's going on here:

It is important not to lose any metadata updates. The consequences of losing a metadata update include forgetting that a shard copy was marked out-of-sync, so that on restart it is erroneously promoted to primary, losing acked writes. "Don't lose any metadata updates" is one of the goals of the Zen2 project, and the solution is to consider a metadata update successful only when it has been written to disk (i.e. fsync()ed) on a majority of the master-eligible nodes.
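As a toy illustration of that commit rule (purely illustrative, not the actual Zen2 code):

// An update counts as successful only once a majority of the
// master-eligible nodes have fsync()ed it to disk.
static boolean isCommitted(int nodesThatFsynced, int masterEligibleNodes) {
    return nodesThatFsynced * 2 > masterEligibleNodes;
}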

Zen2 is happiest if its writes to disk actually succeed (of course) but is also tolerant of failures which leave the on-disk metadata untouched. However, if an fsync() fails then we do not know which of these states we are in, and this means we cannot take any further action. The presence of multiple data paths further complicates the situation: we write the metadata to every data path, so some of these writes can succeed while others fail.

The change proposed here is to throw an IOError, shutting down the node, if the on-disk metadata ends up in an indeterminate state. We use the normal rename-a-temp-file-and-fsync dance (sketched after the list below) to make the indeterminate states as rare as possible:

  • An fsync() of (one copy of) the metadata manifest file failed. The metadata manifest file is a small-ish file containing references to the files containing the cluster metadata (both global and per-index) and is written to each data path.

  • An atomic rename of a temporary version of the metadata manifest file into its final resting place failed, we are using multiple data paths, and the failure was not on the first path.
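For reference, here is a generic sketch of the rename-a-temp-file-and-fsync dance (an illustration only, not the actual GatewayMetaState code; AtomicFileWrite and writeAtomically are hypothetical names):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public final class AtomicFileWrite {
    /**
     * Writes {@code content} to {@code target} so that a reader sees either the
     * complete old file or the complete new one, never a partial write.
     */
    static void writeAtomically(Path target, byte[] content) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, content);
        // 1. fsync the temp file so its bytes are durable before the rename.
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            ch.force(true);
        }
        // 2. Atomically move the temp file into its final resting place.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        // 3. fsync the parent directory so the rename itself is durable
        //    (opening a directory like this works on Linux/macOS, not on Windows).
        try (FileChannel dir = FileChannel.open(target.getParent(), StandardOpenOption.READ)) {
            dir.force(true);
        }
    }
}

A failure in step 1 or 3 on one copy of the manifest, or in step 2 on a data path other than the first, corresponds to the indeterminate cases listed above.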

I think the most common situation that'll yield these kinds of errors will be running out of storage, but even then most out-of-storage problems will manifest as earlier failures. It's a bit rude of the filesystem to wait for the fsync() before discovering it doesn't have enough space 🙂

The main alternative we discussed is simply to retry on this kind of failure. The argument against this is that the world is kinda broken if fsync() failed, and retrying alone is unlikely to help and may even make things worse. A node that fails like this will obstruct further cluster-wide progress until it recovers or is removed from the cluster, and failing fast here seems like a better plan.

@andrershov
Contributor Author

run gradle build tests 1

@DaveCTurner
Contributor

We discussed this issue as a team today and concluded that we are ok with the node shutting down with an exception in these situations.


@DaveCTurner DaveCTurner left a comment


Asked for a couple of simplifications.

// IOError is the best thing we have in java library to indicate that serious IO problem has occurred.
// IOError will be caught by ElasticsearchUncaughtExceptionHandler and JVM will be halted.
// Sadly, it has no constructor that accepts error message, so we first wrap WriteStateException with IOException.
throw new IOError(new IOException(msg + ", storage is dirty", this));

I think we do not need the message here: the message indicates the calling method, but that is also in the stack trace (and the previous log line), and we would know that the storage was dirty if we have an IOError caused by a WriteStateException.

// Sadly, it has no constructor that accepts error message, so we first wrap WriteStateException with IOException.
throw new IOError(new IOException(msg + ", storage is dirty", this));
} else {
throw new UncheckedIOException(msg, this);

Similarly I think we don't need to add this message.
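Taken together, these two suggestions reduce the re-throw to something like the following (a sketch of the suggested shape, not necessarily the merged code); with no message to attach, the extra IOException wrapper is no longer needed, since IOError can take the cause directly:

public void rethrowAsErrorOrUncheckedException() {
    if (dirty) {
        // Caught by ElasticsearchUncaughtExceptionHandler, which halts the JVM.
        throw new IOError(this);
    } else {
        // Propagates without forcing throws clauses onto intermediate layers.
        throw new UncheckedIOException(this);
    }
}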

} catch (WriteStateException e) {
logger.warn("Exception occurred when setting current term", e);
//TODO re-throw exception
logger.error("Failed to set current term to {}", currentTerm);

Could we log the exception and stack trace too? logger.error(new ParameterizedMessage("Failed to set current term to {}", currentTerm), e);

} catch (WriteStateException e) {
logger.warn("Exception occurred when setting last accepted state", e);
//TODO re-throw exception
logger.error("Failed to set last accepted state with version {}", clusterState.version());

Could we log the exception and stack trace too?

@ywelsch ywelsch closed this Dec 6, 2018
@ywelsch ywelsch changed the base branch from zen2 to master December 6, 2018 10:19
@ywelsch ywelsch reopened this Dec 6, 2018
@andrershov
Contributor Author

@DaveCTurner I've finished with the changes, could you please take a look?


@DaveCTurner DaveCTurner left a comment


LGTM, thanks @andrershov

@andrershov andrershov merged commit f0340d6 into elastic:master Dec 6, 2018
@andrershov
Contributor Author

@DaveCTurner thanks for your reviewing efforts and for bringing this up at the team meeting!

@tomcallahan tomcallahan removed the :Core/Infra/Resiliency Keep running when everything is ok. Die quickly if things go horribly wrong. label Dec 18, 2018