
Conversation

@andrershov
Contributor

Currently, we only log that a WriteStateException has occurred in
GatewayMetaState.

The goal of this PR is to re-throw WriteStateException to the upper layers.
If the dirty flag is set to false, we wrap the WriteStateException in an
UncheckedIOException; we prefer an unchecked exception so as not to add
explicit throws clauses to a multitude of layers.
If the dirty flag is set to true, the world is broken and we need to
halt the JVM. Instead of halting explicitly in GatewayMetaState, we
prefer to throw an IOError, which will subsequently be handled by
ElasticsearchUncaughtExceptionHandler, and the JVM will be halted. Sadly,
IOError accepts no message argument, which is why we first wrap the
WriteStateException in an IOException and only then in an IOError.
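A minimal sketch of this re-throw logic, assuming a dirty flag and a rethrow helper on the exception (the method name rethrowAsErrorOrUncheckedException and the constructor shape are illustrative; see the review snippets below for the code as originally proposed):

import java.io.IOError;
import java.io.IOException;
import java.io.UncheckedIOException;

/**
 * Sketch: thrown when writing cluster metadata to disk fails; {@code dirty}
 * records whether the failure may have left the on-disk state half-written.
 */
public class WriteStateException extends IOException {
    private final boolean dirty;

    WriteStateException(boolean dirty, String message, Exception cause) {
        super(message, cause);
        this.dirty = dirty;
    }

    public void rethrowAsErrorOrUncheckedException(String msg) {
        if (dirty) {
            // IOError is caught by ElasticsearchUncaughtExceptionHandler and halts the JVM.
            // IOError has no message-taking constructor, so wrap in an IOException first.
            throw new IOError(new IOException(msg + ", storage is dirty", this));
        } else {
            // Unchecked, so intermediate layers need no explicit throws clauses.
            throw new UncheckedIOException(msg, this);
        }
    }
}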

This PR also adds tests for WriteStateException.

@andrershov andrershov added >enhancement :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Nov 29, 2018
@andrershov andrershov self-assigned this Nov 29, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner DaveCTurner added the :Core/Infra/Resiliency Keep running when everything is ok. Die quickly if things go horribly wrong. label Nov 29, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@DaveCTurner
Contributor

DaveCTurner commented Nov 29, 2018

I think this is how it should work, but I would like opinions from @elastic/es-core-infra too. To that end here is a high-level description of what's going on here:

It is important not to lose any metadata updates. The consequences of losing a metadata update include forgetting that a shard copy was marked out-of-sync, so that on restart it is erroneously promoted to primary, losing acked writes. "Don't lose any metadata updates" is one of the goals of the Zen2 project, and the solution is to consider a metadata update successful only when it has been written to disk (i.e. fsync()ed) on a majority of the master-eligible nodes.
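As a toy illustration of that commit rule (purely illustrative, not the actual Zen2 code):

// An update counts as successful only once a majority of the
// master-eligible nodes have fsync()ed it to disk.
static boolean isCommitted(int nodesThatFsynced, int masterEligibleNodes) {
    return nodesThatFsynced * 2 > masterEligibleNodes;
}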

Zen2 is happiest if its writes to disk actually succeed (of course) but is also tolerant of failures which leave the on-disk metadata untouched. However, if an fsync() fails then we do not know which of these states we are in, and this means we cannot take any further action. The presence of multiple data paths further complicates the situation: we write the metadata to every data path, so some of these writes can succeed while others fail.

The change proposed here is to throw an IOError, shutting down the node, if the on-disk metadata ends up in an indeterminate state. We use the normal rename-a-temp-file-and-fsync dance (sketched after the list below) to make the indeterminate states as rare as possible:

  • An fsync() of (one copy of) the metadata manifest file failed. The metadata manifest file is a small-ish file containing references to the files containing the cluster metadata (both global and per-index) and is written to each data path.

  • An atomic rename of a temporary version of the metadata manifest file into its final resting place failed, we are using multiple data paths, and the failure was not on the first path.
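For reference, here is a generic sketch of the rename-a-temp-file-and-fsync dance (an illustration only, not the actual GatewayMetaState code; AtomicFileWrite and writeAtomically are hypothetical names):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public final class AtomicFileWrite {
    /**
     * Writes {@code content} to {@code target} so that a reader sees either the
     * complete old file or the complete new one, never a partial write.
     */
    static void writeAtomically(Path target, byte[] content) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, content);
        // 1. fsync the temp file so its bytes are durable before the rename.
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            ch.force(true);
        }
        // 2. Atomically move the temp file into its final resting place.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        // 3. fsync the parent directory so the rename itself is durable
        //    (opening a directory like this works on Linux/macOS, not on Windows).
        try (FileChannel dir = FileChannel.open(target.getParent(), StandardOpenOption.READ)) {
            dir.force(true);
        }
    }
}

A failure in step 1 or 3 on one copy of the manifest, or in step 2 on a data path other than the first, corresponds to the indeterminate cases listed above.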

I think the most common situation that'll yield these kinds of errors will be running out of storage, but even then most out-of-storage problems will manifest as earlier failures. It's a bit rude of the filesystem to wait for the fsync() before discovering it doesn't have enough space 🙂

The main alternative we discussed is simply to retry on this kind of failure. The argument against this is that the world is kinda broken if fsync() failed, and retrying alone is unlikely to help and may even make things worse. A node that fails like this will obstruct further cluster-wide progress until it recovers or is removed from the cluster, and failing fast here seems like a better plan.

@andrershov
Contributor Author

run gradle build tests 1

@DaveCTurner
Contributor

We discussed this issue as a team today and concluded that we are ok with the node shutting down with an exception in these situations.


@DaveCTurner DaveCTurner left a comment


Asked for a couple of simplifications.

// IOError is the best thing we have in java library to indicate that serious IO problem has occurred.
// IOError will be caught by ElasticsearchUncaughtExceptionHandler and JVM will be halted.
// Sadly, it has no constructor that accepts error message, so we first wrap WriteStateException with IOException.
throw new IOError(new IOException(msg + ", storage is dirty", this));

I think we do not need the message here: the message indicates the calling method, but that is also in the stack trace (and the previous log line), and we would know that the storage was dirty if we have an IOError caused by a WriteStateException.

// Sadly, it has no constructor that accepts error message, so we first wrap WriteStateException with IOException.
throw new IOError(new IOException(msg + ", storage is dirty", this));
} else {
throw new UncheckedIOException(msg, this);

Similarly I think we don't need to add this message.
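Taken together, these two suggestions reduce the re-throw to something like the following (a sketch of the suggested shape, not necessarily the merged code); with no message to attach, the extra IOException wrapper is no longer needed, since IOError can take the cause directly:

public void rethrowAsErrorOrUncheckedException() {
    if (dirty) {
        // Caught by ElasticsearchUncaughtExceptionHandler, which halts the JVM.
        throw new IOError(this);
    } else {
        // Propagates without forcing throws clauses onto intermediate layers.
        throw new UncheckedIOException(this);
    }
}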

} catch (WriteStateException e) {
logger.warn("Exception occurred when setting current term", e);
//TODO re-throw exception
logger.error("Failed to set current term to {}", currentTerm);

Could we log the exception and stack trace too? logger.error(new ParameterizedMessage("Failed to set current term to {}", currentTerm), e);

} catch (WriteStateException e) {
logger.warn("Exception occurred when setting last accepted state", e);
//TODO re-throw exception
logger.error("Failed to set last accepted state with version {}", clusterState.version());

Could we log the exception and stack trace too?

@ywelsch ywelsch closed this Dec 6, 2018
@ywelsch ywelsch changed the base branch from zen2 to master December 6, 2018 10:19
@ywelsch ywelsch reopened this Dec 6, 2018
@andrershov
Contributor Author

@DaveCTurner I've finished with the changes, could you please take a look?


@DaveCTurner DaveCTurner left a comment


LGTM, thanks @andrershov

@andrershov andrershov merged commit f0340d6 into elastic:master Dec 6, 2018
@andrershov
Contributor Author

@DaveCTurner thanks for your reviewing efforts and for bringing this up at the team meeting!

@tomcallahan tomcallahan removed the :Core/Infra/Resiliency Keep running when everything is ok. Die quickly if things go horribly wrong. label Dec 18, 2018