Skip to content

Conversation

@original-brownbear
Copy link
Contributor

It is not realistic to drop messages without eventually failing.
To retain the coverage of long pauses this PR adjusts the blockholed
behavior to fail a send after 24h (which is assumed to be longer than any
timeout in the system) instead of never.

Closes #61034

It is not realistic to drop messages without eventually failing.
To retain the coverage of long pauses this PR adjusts the blockholed
behavior to fail a send after 24h (which is assumed to be longer than any
timeout in the system) instead of never.

Closes elastic#61034
@original-brownbear original-brownbear added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.10.0 labels Aug 19, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Aug 19, 2020
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also need this for responses sent in the BLACK_HOLE and DISCONNECTED states.

Does this appreciably change the running time of the tests?

@original-brownbear
Copy link
Contributor Author

I think we also need this for responses sent in the BLACK_HOLE and DISCONNECTED states.

I wasn't sure about adding that. It did seemed unnecessary and we currently don't do anything on disconnected either and just handle it as we handle BLACK_HOLE. I assumed that was because it made no functional difference because we don't have any listener or so on response sending so all it changes is add a WARN log in a bunch of places?

Does this appreciably change the running time of the tests?

Not really in my testing, I bet there's some degenerate case where it does :P but over 1k+ iterations it looks irrelevant so far.

@original-brownbear
Copy link
Contributor Author

@DaveCTurner thanks for taking a look, see here https://github.com/elastic/elasticsearch/blob/master/test/framework/src/main/java/org/elasticsearch/test/disruption/DisruptableMockTransport.java#L200 for the response handling, currently we do the same for black hole and disconnect there.

@DaveCTurner
Copy link
Contributor

I wasn't sure about adding that. It did seemed unnecessary and we currently don't do anything on disconnected either and just handle it as we handle BLACK_HOLE. I assumed that was because it made no functional difference because we don't have any listener or so on response sending so all it changes is add a WARN log in a bunch of places?

I think the removal of the join timeout will expose the same bug there, given enough iterations on CI. I.e. the join request gets through but then the connection is blackholed/disconnected before the response comes back, so it's never delivered. In reality the requester would drop the connection eventually thanks to keepalives.

@original-brownbear
Copy link
Contributor Author

original-brownbear commented Aug 19, 2020

but then the connection is blackholed/disconnected before the response comes back, so it's never delivered

I think the reason this wasn't and isn't an issue already is that we never black-hole while we have a subset of all runnable tasks at a given timestamp when we blackhole/disconnect a connection. So the send and respond cycle will always happen in one go and it's impossible that we blackhole between the send and respond right?

Also, I'm not sure it would change any behavior for the join (or any other part of the code if we were to throw on the response sending) because it's always code like this for the response:

    private JoinCallback transportJoinCallback(TransportRequest request, TransportChannel channel) {
        return new JoinCallback() {

            @Override
            public void onSuccess() {
                try {
                    channel.sendResponse(Empty.INSTANCE);
                } catch (IOException e) {
                    onFailure(e);
                }
            }

            @Override
            public void onFailure(Exception e) {
                try {
                    channel.sendResponse(e);
                } catch (Exception inner) {
                    inner.addSuppressed(e);
                    logger.warn("failed to send back failure on join request", inner);
                }
            }

where it's just logging as a result of a failed response send.
Plus, it's kind of tricky to throw exceptions when sending responses with the way things are currently coded up because we do this:

            @Override
            public void sendResponse(final TransportResponse response) {
                execute(new Runnable() {
                    @Override
                    public void run() {

to simulate some differences in timing when sending responses so we can't really throw to whatever code invoked sendResponse without losing that randomization (which makes a lot of sense since actual transport implementations will mostly fork off to an IO thread as well and won't be able to throw to the caller either) => I think we're good here in practice even though it doesn't look+feel great maybe?

@DaveCTurner
Copy link
Contributor

Yeah delivering an exception to the responder is kinda pointless, there's nothing it can do about it, but we should still deliver an exception response to the requester in those cases.

@original-brownbear
Copy link
Contributor Author

but we should still deliver an exception response to the requester in those cases.

Well currently this situation isn't a thing to begin with since we always scheduleNow the response handling as I mentioned so we don't have to worry about this? It would just be dead code to add this handling right now wouldn't it?

@DaveCTurner
Copy link
Contributor

Discussed this sync: scheduleNow doesn't really mean "now", we can still break the network before delivering the response.

@original-brownbear
Copy link
Contributor Author

@DaveCTurner alright, I think d0b3d1f should do it here right? (ran ~20k iterations of the coordinator tests with it without issues or excessive slowness)

@original-brownbear
Copy link
Contributor Author

urgh nervermind this needs a test adjustment now :) on it

@original-brownbear
Copy link
Contributor Author

@DaveCTurner sorry for the noise, should be good to review now :)

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One further request about blackholed-response behaviour. I may be persuaded to keep things as they are now tho.

case DISCONNECTED:
logger.trace("dropping response to {}: channel is {}", requestDescription, connectionStatus);
logger.trace("disconnected during response to {}: channel is {}", requestDescription, connectionStatus);
onDisconnectedDuringSend(requestId, action, destinationTransport);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I think I'd prefer a long delay on the response here using onBlackholedDuringSend too. We're using DISCONNECTED to indicate that the connection actively rejects the message, e.g. sends a RST, but if it rejects the response then the original requester is none the wiser and may wait for a long time before discovering the disconnect.

In practice it's almost never going to be that bad but I'd rather err on the pathological side if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ adjusted accordingly in b181920 for both spots

logger.trace("dropping exception response to {}: channel is {}", requestDescription, connectionStatus);
logger.trace("disconnected during exception response to {}: channel is {}",
requestDescription, connectionStatus);
onDisconnectedDuringSend(requestId, action, destinationTransport);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, we should delay notifying the sender here too.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@original-brownbear
Copy link
Contributor Author

Thanks David!

@original-brownbear original-brownbear merged commit 9dc0ca0 into elastic:master Aug 20, 2020
@original-brownbear original-brownbear deleted the 61034-fix branch August 20, 2020 17:11
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Aug 20, 2020
…ic#61310)

It is not realistic to drop messages without eventually failing.
To retain the coverage of long pauses this PR adjusts the blackholed
behavior to fail a send after 24h (which is assumed to be longer than any
timeout in the system) instead of never.

Closes elastic#61034
original-brownbear added a commit that referenced this pull request Aug 21, 2020
… (#61381)

It is not realistic to drop messages without eventually failing.
To retain the coverage of long pauses this PR adjusts the blackholed
behavior to fail a send after 24h (which is assumed to be longer than any
timeout in the system) instead of never.

Closes #61034
@original-brownbear original-brownbear restored the 61034-fix branch December 6, 2020 19:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test Issues or PRs that are addressing/adding tests v7.10.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] Reproducible Failure in CoordinatorTests.testUnresponsiveFollowerDetectedEventually

4 participants