@mushao999
Contributor

@mushao999 mushao999 commented Mar 31, 2022

The RepositoriesService on each node maintains a repository instance for
every repository defined in the cluster state. The master validates each
repository definition before inserting it into the cluster state, but in
some cases this validation is incomplete. For instance, there may be
node-local configuration of which the master is unaware which prevents
the repository from being instantiated on some other node in the
cluster.

Today if a repository cannot be instantiated then the node will log a
warning and continue as if the repository doesn't exist. This results in
a confusing RepositoryMissingException when trying to use the
repository, and various other surprises (e.g. #85550). With this commit
we create a placeholder InvalidRepository which reports a more
accurate exception when it is used.

Relates #82457 which did the same sort of thing for unknown plugins.
Closes #85550 since the repository in question is no longer null.
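
The placeholder approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Repository`, `RepositoryRegistry`, and the exception classes below are simplified, hypothetical stand-ins for the real Elasticsearch types, and the single `verify()` method stands in for any operation that uses the repository.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical, heavily simplified stand-ins; NOT the real Elasticsearch types.
interface Repository {
    String name();

    void verify(); // stands in for any operation that uses the repository
}

class RepositoryException extends RuntimeException {
    RepositoryException(String repository, String message, Throwable cause) {
        super("[" + repository + "] " + message, cause);
    }
}

class RepositoryMissingException extends RepositoryException {
    RepositoryMissingException(String repository) {
        super(repository, "missing", null);
    }
}

// Placeholder stored when the real repository could not be instantiated on this node.
class InvalidRepository implements Repository {
    private final String name;
    private final Exception creationException;

    InvalidRepository(String name, Exception creationException) {
        this.name = name;
        this.creationException = creationException;
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public void verify() {
        // Report the original creation failure instead of pretending the repository is absent.
        throw new RepositoryException(name, "repository could not be created on this node", creationException);
    }
}

class RepositoryRegistry {
    private final Map<String, Repository> repositories = new HashMap<>();

    void register(String name, Supplier<Repository> factory) {
        Repository repository;
        try {
            repository = factory.get();
        } catch (Exception e) {
            // Keep a non-null placeholder so later lookups can explain what went wrong.
            repository = new InvalidRepository(name, e);
        }
        repositories.put(name, repository);
    }

    Repository get(String name) {
        Repository repository = repositories.get(name);
        if (repository == null) {
            throw new RepositoryMissingException(name); // genuinely unknown repository
        }
        return repository;
    }
}
```

With this shape, a lookup distinguishes "never defined" (`RepositoryMissingException`) from "defined but unusable here" (a `RepositoryException` carrying the creation failure as its cause).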

@elasticsearchmachine elasticsearchmachine added the external-contributor and v8.3.0 labels Mar 31, 2022
Comment on lines 61 to 62
if (unstableNodes.contains(clusterService.getNodeName())) {
throw new RepositoryException(metadata.name(), "Failed to create repository: current node is not stable");
Contributor

Can you give a more realistic example of how this might happen? Perhaps a better fix would be to avoid throwing an exception when creating the repository altogether.

Contributor Author

@mushao999 mushao999 Mar 31, 2022

In our case, we use Alibaba OSS (an object storage service) to store snapshots, but sometimes the network is not good enough to connect, which can make repository creation fail.

Contributor Author

More generally, repository creation logic is implemented by repository plugins, which are outside our control, so if a plugin's implementation is not robust enough, creation can fail.

Contributor

This sounds like a bug in your plugin, it should not require a healthy network to create the repository. Compare for instance S3Repository which merely validates some settings in its constructor.
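
As a rough illustration of that point, a repository can validate its settings eagerly and defer any network access until first use. Everything in this sketch (`BlobClient`, `LazyRepositorySketch`, the `bucket` setting) is hypothetical and is not taken from S3Repository or from the OSS plugin.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a remote object-store client.
interface BlobClient {
    boolean ping();
}

final class LazyRepositorySketch {
    private final String bucket;
    private final Supplier<BlobClient> clientFactory;
    private BlobClient client; // created on first use, not in the constructor

    LazyRepositorySketch(String bucket, Supplier<BlobClient> clientFactory) {
        // Constructor only validates settings; it never touches the network.
        if (bucket == null || bucket.trim().isEmpty()) {
            throw new IllegalArgumentException("bucket must be set");
        }
        this.bucket = bucket;
        this.clientFactory = clientFactory;
    }

    String bucket() {
        return bucket;
    }

    private synchronized BlobClient client() {
        if (client == null) {
            client = clientFactory.get(); // may fail on a bad network, but only when used
        }
        return client;
    }

    boolean verify() {
        return client().ping();
    }
}
```

A transient network failure then surfaces when the repository is used or verified, where it can be retried, rather than preventing the repository object from existing at all.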

Contributor Author

> This sounds like a bug in your plugin, it should not require a healthy network to create the repository. Compare for instance S3Repository which merely validates some settings in its constructor.

I've discussed this with the plugin owner. It's true that there are some network actions during repository creation which can cause creation to fail.
However, I still think Elasticsearch core should not depend on plugin implementations for its robustness.

Contributor

@DaveCTurner DaveCTurner Apr 1, 2022

> I still think Elasticsearch core should not depend on plugin implementations for its robustness.

Yes I agree, I just want to be clear that protecting against this NPE will not really address the problem in the plugin. If transient network issues can prevent the plugin repository from even being created on some nodes then you will need to fix this by retrying the put-repository request once the transient issues are resolved. I don't see a great way to detect that a retry is needed here.

Contributor

@DaveCTurner DaveCTurner left a comment

I would rather fix this using an approach closer to the one in #82457 which avoids the null by creating a dummy repository that returns a more specific exception message than the RepositoryMissingException we get today. The repository isn't really missing, it's invalid on some nodes, and we should distinguish these cases.

@DaveCTurner DaveCTurner added the >bug and :Distributed Coordination/Snapshot/Restore labels Apr 1, 2022
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) label Apr 1, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner DaveCTurner self-assigned this Apr 1, 2022
@mushao999
Contributor Author

Thanks, I will update in that way.

@mushao999
Contributor Author

@DaveCTurner Hi, I've updated this PR by adding a new DummyRepository. Please help review it.

@mushao999
Contributor Author

@DaveCTurner By the way, there is a small PR of mine which has been open for a long time. Could you please review it as well: #83706

Contributor

@DaveCTurner DaveCTurner left a comment

I like it, thanks @mushao999. I left a few small comments.

@mushao999
Contributor Author

@DaveCTurner Thanks for your suggestion, I've updated the code.

@mushao999
Contributor Author

mushao999 commented Apr 1, 2022

Should we add fail("xxx"); after code.run();?

public static void assertThatThrows(
    LuceneTestCase.ThrowingRunnable code,
    Class<? extends Exception> exceptionType,
    Matcher<String> messageMatcher
) {
    try {
        code.run();
    } catch (Throwable e) {
        assertThatException(e, exceptionType, messageMatcher);
    }
}

@DaveCTurner
Contributor

> Should we add fail("xxx"); after code.run();?

Yes, I think so.
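
To illustrate why the missing failure matters, here is a self-contained sketch with no Lucene or Hamcrest dependency. `ThrowingRunnable` is a local stand-in for `LuceneTestCase.ThrowingRunnable`, the message matcher is omitted, and both helper names are hypothetical.

```java
// Local stand-in for LuceneTestCase.ThrowingRunnable (assumption for this sketch).
interface ThrowingRunnable {
    void run() throws Throwable;
}

final class ThrowAssertSketch {

    // Same shape as the helper quoted above: passes silently if nothing is thrown.
    static void assertThrowsLenient(ThrowingRunnable code, Class<? extends Throwable> expectedType) {
        try {
            code.run();
        } catch (Throwable e) {
            checkType(e, expectedType);
        }
    }

    // Stricter variant: reaching the end without an exception is itself a failure.
    static void assertThrowsStrict(ThrowingRunnable code, Class<? extends Throwable> expectedType) {
        try {
            code.run();
        } catch (Throwable e) {
            checkType(e, expectedType);
            return; // the expected exception was thrown
        }
        throw new AssertionError("expected " + expectedType.getSimpleName() + " but nothing was thrown");
    }

    private static void checkType(Throwable e, Class<? extends Throwable> expectedType) {
        if (!expectedType.isInstance(e)) {
            throw new AssertionError("unexpected exception type: " + e.getClass().getName(), e);
        }
    }

    public static void main(String[] args) {
        assertThrowsLenient(() -> {}, IllegalStateException.class); // passes, although nothing was thrown
        try {
            assertThrowsStrict(() -> {}, IllegalStateException.class);
        } catch (AssertionError e) {
            System.out.println("strict helper failed as intended: " + e.getMessage());
        }
    }
}
```

In this sketch the failure is raised after the try/catch rather than inside the `try` block, because an `AssertionError` thrown inside `try` would itself be caught by `catch (Throwable e)` and reported as an unexpected exception type.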

@mushao999
Contributor Author

> > Should we add fail("xxx"); after code.run();?
>
> Yes, I think so.

So should we add it in this PR, or open a new one for it?

@DaveCTurner
Contributor

Here would be fine with me.

@DaveCTurner
Contributor

@elasticmachine ok to test

@mushao999
Contributor Author

> Here would be fine with me.

Code added.

@DaveCTurner
Contributor

Looks like your branch is based off an old commit, you will need to merge latest master before CI will pass.

@DaveCTurner DaveCTurner changed the title from "Fix NPE in RepositoriesService" to "Distinguish missing and invalid repositories" Apr 2, 2022
Contributor

@DaveCTurner DaveCTurner left a comment

Looks good, I left a handful of comments and suggested changes but structurally this is good.

);
// verification should fail with some node has InvalidRepository
try {
client().admin().cluster().prepareVerifyRepository(repositoryName).get();
Contributor

Suggest using expectThrows here:

Suggested change
- client().admin().cluster().prepareVerifyRepository(repositoryName).get();
+ final var verificationException = expectThrows(RepositoryVerificationException.class,
+     () -> client().admin().cluster().prepareVerifyRepository(repositoryName).get());

Contributor Author

We cannot use expectThrows if we want to assert on the inner exception.

Contributor

@DaveCTurner DaveCTurner left a comment

I think you missed that expectThrows() returns the exception so you can assert more things about it. I think we can use expectThrows rather than the utilities in ThrowableAssertions here, and I'll open a follow-up to remove ThrowableAssertions entirely.

Edit: see #85671

// verification should fail with some node has InvalidRepository
boolean verifyPass = false;
try {
client().admin().cluster().prepareVerifyRepository(repositoryName).get();
Contributor

Mentioned before, but this should be final var repositoryVerificationException = expectThrows(RepositoryVerificationException.class, () -> client().admin().cluster().prepareVerifyRepository(repositoryName).get());

Note that expectThrows returns the exception it caught so you can assert things about it such as its causes and suppressed exceptions.
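
A minimal sketch of that pattern follows. The `expectThrows` below is a simplified local reimplementation for illustration only, not the `LuceneTestCase` method, and the exception types and messages are made up.

```java
// Local stand-in for a runnable that may throw (assumption for this sketch).
interface ThrowingBlock {
    void run() throws Throwable;
}

final class ExpectThrowsSketch {

    // Runs the block, fails if nothing (or the wrong type) is thrown,
    // and otherwise returns the caught exception for further assertions.
    static <T extends Throwable> T expectThrows(Class<T> expectedType, ThrowingBlock code) {
        try {
            code.run();
        } catch (Throwable e) {
            if (expectedType.isInstance(e)) {
                return expectedType.cast(e);
            }
            throw new AssertionError("unexpected exception type: " + e.getClass().getName(), e);
        }
        throw new AssertionError("expected " + expectedType.getSimpleName() + " but nothing was thrown");
    }

    public static void main(String[] args) {
        // Illustrative failure: an outer exception wrapping a node-level cause.
        IllegalArgumentException outer = expectThrows(IllegalArgumentException.class, () -> {
            throw new IllegalArgumentException(
                "verification failed",
                new IllegalStateException("repository is invalid on this node")
            );
        });

        // Because the exception is returned, its cause can be inspected directly.
        Throwable cause = outer.getCause();
        if (!(cause instanceof IllegalStateException)) {
            throw new AssertionError("unexpected cause: " + cause);
        }
        System.out.println(cause.getMessage());
    }
}
```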

@DaveCTurner
Contributor

See #85671 for the PR to remove ThrowableAssertions in favour of using expectThrows.

@mushao999
Contributor Author

@DaveCTurner Thanks for your suggestion, I've replaced all the ThrowableAssertions with expectThrows.

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM, thanks for the extra iterations @mushao999

@DaveCTurner DaveCTurner merged commit 133e34d into elastic:master Apr 4, 2022
Labels

>bug
:Distributed Coordination/Snapshot/Restore - Anything directly related to the `_snapshot/*` APIs
external-contributor - Pull request authored by a developer outside the Elasticsearch team
Team:Distributed (Obsolete) - Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
v8.3.0

Development

Successfully merging this pull request may close these issues.

NPE in RepositoriesService
