Fix Blackholed Connection Behavior in DisruptableMockTransport #61310

original-brownbear · 2020-08-19T06:26:07Z

It is not realistic to drop messages without eventually failing.
To retain the coverage of long pauses this PR adjusts the blackholed
behavior to fail a send after 24h (which is assumed to be longer than any
timeout in the system) instead of never.

Closes #61034
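A minimal sketch of the idea described above, assuming a virtual clock that the test drives explicitly. Every name here (`BlackholeSketch`, `VirtualClock`, `sendOverBlackhole`, `BLACKHOLE_FAILURE_DELAY_MS`) is hypothetical and not part of `DisruptableMockTransport`; only the 24h figure comes from the description.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not the DisruptableMockTransport API.
class BlackholeSketch {
    // 24h, assumed to be longer than any timeout in the system under test.
    static final long BLACKHOLE_FAILURE_DELAY_MS = TimeUnit.HOURS.toMillis(24);

    // Minimal virtual clock: tasks run in order of their scheduled time.
    static final class VirtualClock {
        private static final class Task {
            final long time;
            final long seq;
            final Runnable body;
            Task(long time, long seq, Runnable body) {
                this.time = time;
                this.seq = seq;
                this.body = body;
            }
        }

        private final PriorityQueue<Task> tasks = new PriorityQueue<>(
            Comparator.<Task>comparingLong(t -> t.time).thenComparingLong(t -> t.seq));
        private long now = 0;
        private long nextSeq = 0;

        long now() { return now; }

        void scheduleAt(long time, Runnable body) {
            tasks.add(new Task(time, nextSeq++, body));
        }

        // Runs every task due at or before the deadline, advancing virtual time.
        void runUntil(long deadline) {
            while (!tasks.isEmpty() && tasks.peek().time <= deadline) {
                Task task = tasks.poll();
                now = task.time;
                task.body.run();
            }
            now = Math.max(now, deadline);
        }
    }

    // Sending over a blackholed link delivers nothing, but the sender is
    // eventually told about the failure instead of waiting forever.
    static void sendOverBlackhole(VirtualClock clock, List<String> events) {
        events.add("sent@" + clock.now());
        clock.scheduleAt(clock.now() + BLACKHOLE_FAILURE_DELAY_MS,
            () -> events.add("failed@" + clock.now()));
    }

    public static void main(String[] args) {
        VirtualClock clock = new VirtualClock();
        List<String> events = new ArrayList<>();
        sendOverBlackhole(clock, events);
        clock.runUntil(TimeUnit.MINUTES.toMillis(30)); // ordinary timeouts would fire in this window
        System.out.println(events);                    // only the send has happened so far
        clock.runUntil(BLACKHOLE_FAILURE_DELAY_MS);
        System.out.println(events);                    // the delayed failure has now been delivered
    }
}
```

Because the failure is scheduled in virtual time, a long pause costs the test nothing in wall-clock time, while any timeout shorter than 24h still gets a chance to fire first.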

elasticmachine · 2020-08-19T06:26:10Z

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

DaveCTurner

I think we also need this for responses sent in the BLACK_HOLE and DISCONNECTED states.

Does this appreciably change the running time of the tests?

original-brownbear · 2020-08-19T07:01:30Z

> I think we also need this for responses sent in the BLACK_HOLE and DISCONNECTED states.

I wasn't sure about adding that. It seemed unnecessary, and we currently don't do anything on disconnected either; we just handle it the same way we handle BLACK_HOLE. I assumed that was because it makes no functional difference: we don't have any listener or similar on response sending, so all it changes is adding a WARN log in a bunch of places?

> Does this appreciably change the running time of the tests?

Not really in my testing, I bet there's some degenerate case where it does :P but over 1k+ iterations it looks irrelevant so far.

original-brownbear · 2020-08-19T07:03:14Z

@DaveCTurner thanks for taking a look, see here https://github.com/elastic/elasticsearch/blob/master/test/framework/src/main/java/org/elasticsearch/test/disruption/DisruptableMockTransport.java#L200 for the response handling, currently we do the same for black hole and disconnect there.

DaveCTurner · 2020-08-19T07:47:57Z

> I wasn't sure about adding that. It seemed unnecessary, and we currently don't do anything on disconnected either; we just handle it the same way we handle BLACK_HOLE. I assumed that was because it makes no functional difference: we don't have any listener or similar on response sending, so all it changes is adding a WARN log in a bunch of places?

I think the removal of the join timeout will expose the same bug there, given enough iterations on CI. I.e. the join request gets through but then the connection is blackholed/disconnected before the response comes back, so it's never delivered. In reality the requester would drop the connection eventually thanks to keepalives.

original-brownbear · 2020-08-19T09:13:41Z

> but then the connection is blackholed/disconnected before the response comes back, so it's never delivered

I think the reason this wasn't and isn't an issue already is that, when we blackhole/disconnect a connection, we never do so while only a subset of the tasks runnable at that timestamp has run. So the send-and-respond cycle will always happen in one go, and it's impossible for us to blackhole between the send and the response, right?

Also, I'm not sure it would change any behavior for the join (or any other part of the code if we were to throw on the response sending) because it's always code like this for the response:

    private JoinCallback transportJoinCallback(TransportRequest request, TransportChannel channel) {
        return new JoinCallback() {

            @Override
            public void onSuccess() {
                try {
                    channel.sendResponse(Empty.INSTANCE);
                } catch (IOException e) {
                    onFailure(e);
                }
            }

            @Override
            public void onFailure(Exception e) {
                try {
                    channel.sendResponse(e);
                } catch (Exception inner) {
                    inner.addSuppressed(e);
                    logger.warn("failed to send back failure on join request", inner);
                }
            }

where it's just logging as a result of a failed response send.
Plus, it's kind of tricky to throw exceptions when sending responses with the way things are currently coded up because we do this:

            @Override
            public void sendResponse(final TransportResponse response) {
                execute(new Runnable() {
                    @Override
                    public void run() {

to simulate some differences in timing when sending responses, so we can't really throw to whatever code invoked sendResponse without losing that randomization (which makes a lot of sense, since actual transport implementations will mostly fork off to an IO thread as well and won't be able to throw to the caller either). So I think we're good here in practice, even though it maybe doesn't look or feel great?

DaveCTurner · 2020-08-19T09:23:28Z

Yeah delivering an exception to the responder is kinda pointless, there's nothing it can do about it, but we should still deliver an exception response to the requester in those cases.

original-brownbear · 2020-08-19T09:38:15Z

> but we should still deliver an exception response to the requester in those cases.

Well currently this situation isn't a thing to begin with since we always scheduleNow the response handling as I mentioned so we don't have to worry about this? It would just be dead code to add this handling right now wouldn't it?

DaveCTurner · 2020-08-19T09:54:36Z

Discussed this sync: scheduleNow doesn't really mean "now", we can still break the network before delivering the response.
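A small sketch of why a task scheduled "now" can still be overtaken by a disruption, assuming the task queue may pick among the currently runnable tasks in any order. The class and method names (`ScheduleNowSketch`, `RunnableTasks`, `runTask`) are hypothetical, and the explicit index stands in for whatever ordering the real test framework uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the test framework's task queue.
class ScheduleNowSketch {
    enum ConnectionStatus { CONNECTED, BLACK_HOLE }

    // Tasks scheduled "now" are merely queued; the caller decides which runs next.
    static final class RunnableTasks {
        private final List<Runnable> runnable = new ArrayList<>();

        void scheduleNow(Runnable task) {
            runnable.add(task);
        }

        void runTask(int index) {
            runnable.remove(index).run();
        }
    }

    static List<String> simulate(boolean disruptionRunsFirst) {
        List<String> log = new ArrayList<>();
        ConnectionStatus[] status = { ConnectionStatus.CONNECTED };
        RunnableTasks queue = new RunnableTasks();

        // The responder "sends" its response: delivery is only queued, not performed.
        queue.scheduleNow(() -> log.add(
            status[0] == ConnectionStatus.CONNECTED ? "response delivered" : "response dropped"));

        // The test harness queues a disruption at the same virtual timestamp.
        queue.scheduleNow(() -> {
            status[0] = ConnectionStatus.BLACK_HOLE;
            log.add("link blackholed");
        });

        // Either runnable task may go first; the other then runs second.
        queue.runTask(disruptionRunsFirst ? 1 : 0);
        queue.runTask(0);
        return log;
    }

    public static void main(String[] args) {
        System.out.println(simulate(false)); // [response delivered, link blackholed]
        System.out.println(simulate(true));  // [link blackholed, response dropped]
    }
}
```

In the second ordering the request has already been handled but the response never reaches the requester, which is the case the requester needs an eventual failure for.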

original-brownbear · 2020-08-19T11:05:05Z

@DaveCTurner alright, I think d0b3d1f should do it here right? (ran ~20k iterations of the coordinator tests with it without issues or excessive slowness)

original-brownbear · 2020-08-19T11:41:30Z

urgh never mind, this needs a test adjustment now :) on it

original-brownbear · 2020-08-19T12:59:48Z

@DaveCTurner sorry for the noise, should be good to review now :)

DaveCTurner

One further request about blackholed-response behaviour. I may be persuaded to keep things as they are now tho.

DaveCTurner · 2020-08-20T09:48:15Z

test/framework/src/main/java/org/elasticsearch/test/disruption/DisruptableMockTransport.java

                            case DISCONNECTED:
-                                logger.trace("dropping response to {}: channel is {}", requestDescription, connectionStatus);
+                                logger.trace("disconnected during response to {}: channel is {}", requestDescription, connectionStatus);
+                                onDisconnectedDuringSend(requestId, action, destinationTransport);


Hmm I think I'd prefer a long delay on the response here using onBlackholedDuringSend too. We're using DISCONNECTED to indicate that the connection actively rejects the message, e.g. sends a RST, but if it rejects the response then the original requester is none the wiser and may wait for a long time before discovering the disconnect.

In practice it's almost never going to be that bad but I'd rather err on the pathological side if possible.

original-brownbear

++ adjusted accordingly in b181920 for both spots

DaveCTurner · 2020-08-20T09:49:26Z

test/framework/src/main/java/org/elasticsearch/test/disruption/DisruptableMockTransport.java

-                                logger.trace("dropping exception response to {}: channel is {}", requestDescription, connectionStatus);
+                                logger.trace("disconnected during exception response to {}: channel is {}",
+                                        requestDescription, connectionStatus);
+                                onDisconnectedDuringSend(requestId, action, destinationTransport);


Similarly, we should delay notifying the sender here too.
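The two review comments above amount to a small decision table for when the original requester learns about a failed send. The sketch below is one reading of it, not the merged code: all names are hypothetical, and the immediate failure for a rejected request is an assumption based on the description of DISCONNECTED as an active rejection.

```java
import java.util.OptionalLong;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not the DisruptableMockTransport implementation.
class ResponseDisruptionSketch {
    enum ConnectionStatus { CONNECTED, DISCONNECTED, BLACK_HOLE }
    enum MessageKind { REQUEST, RESPONSE }

    static final long LONG_DELAY_MS = TimeUnit.HOURS.toMillis(24);

    // Returns how long the requester waits before being notified of the failure,
    // or empty if the message is delivered normally.
    static OptionalLong requesterFailureDelayMs(ConnectionStatus status, MessageKind kind) {
        switch (status) {
            case CONNECTED:
                return OptionalLong.empty();
            case DISCONNECTED:
                // Assumption: a rejected request fails fast because the requester sees
                // the rejection itself. A rejected response is only seen by the
                // responder, so the requester may wait a long time.
                return OptionalLong.of(kind == MessageKind.REQUEST ? 0L : LONG_DELAY_MS);
            case BLACK_HOLE:
                return OptionalLong.of(LONG_DELAY_MS);
            default:
                throw new AssertionError("unexpected status: " + status);
        }
    }

    public static void main(String[] args) {
        for (ConnectionStatus status : ConnectionStatus.values()) {
            for (MessageKind kind : MessageKind.values()) {
                System.out.println(status + " " + kind + " -> " + requesterFailureDelayMs(status, kind));
            }
        }
    }
}
```

Treating a disconnected response like a blackholed one errs on the pathological side: the test then covers a requester that discovers the disconnect only after a very long wait.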

DaveCTurner

LGTM

original-brownbear · 2020-08-20T17:10:40Z

Thanks David!

original-brownbear added the >test, :Distributed Coordination/Cluster Coordination, v8.0.0, and v7.10.0 labels Aug 19, 2020

elasticmachine added the Team:Distributed (Obsolete) label Aug 19, 2020

DaveCTurner reviewed Aug 19, 2020

Merge remote-tracking branch 'elastic/master' into 61034-fix

0dadb97

original-brownbear requested a review from DaveCTurner August 19, 2020 07:03

original-brownbear added 2 commits August 19, 2020 12:09

Merge remote-tracking branch 'elastic/master' into 61034-fix

ad79758

CR: fail responses as well

d0b3d1f

fix test

6f4526b

DaveCTurner reviewed Aug 20, 2020

original-brownbear added 2 commits August 20, 2020 15:10

Merge remote-tracking branch 'elastic/master' into 61034-fix

f914113

CR: delay it all

b181920

original-brownbear requested a review from DaveCTurner August 20, 2020 13:17

DaveCTurner approved these changes Aug 20, 2020

original-brownbear merged commit 9dc0ca0 into elastic:master Aug 20, 2020

original-brownbear deleted the 61034-fix branch August 20, 2020 17:11

original-brownbear mentioned this pull request Aug 20, 2020

Fix Blackholed Connection Behavior in DisruptableMockTransport (#61310) #61381

Merged

original-brownbear restored the 61034-fix branch December 6, 2020 19:00

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
