Implement transport circuit breaking in aggregator #54610

Tim-Brooks · 2020-04-01T20:35:46Z

This commit moves the action name validation and circuit breaking into
the InboundAggregator. This work is valuable because it lays the
groundwork for incrementally circuit breaking as data is received.

This PR includes the following behavioral change:

Handshakes now contribute to circuit breaking, but cannot be broken.
Currently they neither contribute nor are broken.
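
A minimal sketch of the "contributes but cannot be broken" distinction, assuming a simplified breaker. `SimpleBreaker` and `InboundAccounting` are hypothetical stand-ins for illustration, not the Elasticsearch classes, and the sketch ignores other message kinds that may also be exempt from tripping.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for an in-flight-requests breaker; not the Elasticsearch class.
final class SimpleBreaker {
    private final long limitBytes;
    private final AtomicLong usedBytes = new AtomicLong();

    SimpleBreaker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    // Reserve bytes; roll back and fail if the limit would be exceeded.
    void addAndMaybeBreak(long bytes) {
        long newUsed = usedBytes.addAndGet(bytes);
        if (newUsed > limitBytes) {
            usedBytes.addAndGet(-bytes);
            throw new IllegalStateException("breaker tripped: " + newUsed + " > " + limitBytes);
        }
    }

    // Reserve bytes unconditionally; the reservation is counted but never trips.
    void addWithoutBreaking(long bytes) {
        usedBytes.addAndGet(bytes);
    }

    void release(long bytes) {
        usedBytes.addAndGet(-bytes);
    }

    long used() {
        return usedBytes.get();
    }
}

final class InboundAccounting {
    // Handshakes are always counted but never rejected; other messages may trip the breaker.
    static void reserve(SimpleBreaker breaker, long messageBytes, boolean isHandshake) {
        if (isHandshake) {
            breaker.addWithoutBreaking(messageBytes);
        } else {
            breaker.addAndMaybeBreak(messageBytes);
        }
    }
}
```

With this shape, a handshake can push usage above the limit, and later non-handshake messages will trip until bytes are released.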

elasticmachine · 2020-04-01T20:35:48Z

Pinging @elastic/es-distributed (:Distributed/Network)

ywelsch

Thanks Tim. I've left some comments.

ywelsch · 2020-04-03T08:12:29Z

server/src/main/java/org/elasticsearch/transport/InboundAggregator.java

+        }
+
+        private void incrementReservedBytes(int delta) {
+            bytesToRelease.getAndAdd(delta);


I wonder if we should assert that this method is no longer called after close has been called.

Also, we should assert right now that this method is only called once.

I made a change and renamed the method to set.
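
A sketch of what a set-once reservation could look like, under the assumption that the reserved byte count is stored in a single atomic field. `ReservedBytes` and its `releaser` callback are hypothetical names, and the sketch throws where the review suggests assertions so that it behaves the same with assertions disabled.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

// Hypothetical holder for the bytes reserved against a breaker for one message.
final class ReservedBytes implements AutoCloseable {
    private static final int UNSET = -1;
    private static final int CLOSED = -2;

    private final AtomicInteger state = new AtomicInteger(UNSET);
    private final IntConsumer releaser; // e.g. bytes -> breaker.release(bytes)

    ReservedBytes(IntConsumer releaser) {
        this.releaser = releaser;
    }

    // May be called at most once, and never after close().
    void set(int bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException("bytes must be non-negative");
        }
        if (state.compareAndSet(UNSET, bytes) == false) {
            throw new IllegalStateException("reserved bytes already set or holder closed");
        }
    }

    // Releases whatever was reserved; a second close() finds CLOSED and releases nothing.
    @Override
    public void close() {
        int previous = state.getAndSet(CLOSED);
        if (previous > 0) {
            releaser.accept(previous);
        }
    }
}
```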

ywelsch · 2020-04-03T12:58:41Z

server/src/main/java/org/elasticsearch/transport/InboundHandler.java

-                CircuitBreaker breaker = circuitBreakerService.getBreaker(CircuitBreaker.IN_FLIGHT_REQUESTS);
-                if (reg.canTripCircuitBreaker()) {
-                    breaker.addEstimateBytesAndMaybeBreak(messageLengthBytes, "<transport_request>");
+        messageListener.onRequestReceived(requestId, action);


you've moved this out of the try block. Some implementations of this can throw an exception though. I think we need to handle those.

My answer here is related to:

If a request is received before a node is accepting requests, an exception is logged and the channel is closed. Currently we respond with an exception. But this is dangerous as we cannot negotiate a version.

and #54610 (comment). There is only one usage of this listener and it happens at a place where it is not safe to respond with the exception.

We can theoretically respond with an exception after the version handshake message. So I can call the listeners in different places if you would like.

ywelsch · 2020-04-03T13:00:46Z

server/src/main/java/org/elasticsearch/transport/InboundHandler.java

+    private static void sendErrorResponse(String actionName, TransportChannel transportChannel, Exception e) {
+        try {
+            transportChannel.sendResponse(e);
+        } catch (IOException inner) {


should we catch Exception here? We probably never want to bubble anything up here

I did not make this change. I think even the practice of catching IOException here is bad and will need to be addressed in a follow-up. We DO want exceptions like this to be bubbled up.

If we cannot send a response we need to kill the channel. Bubbled up exceptions will kill channels.

Essentially, quite a bit of this exception handling in InboundHandler I think has issues but is beyond the scope of my current PR.

Unless we can successfully send a handshake response, we should not catch and handle errors during version handshakes. Doing that leads to things like this (Failed transport handshake between 6.8 and 7.6 nodes may throw a fatal AssertionError #54337). Errors during handshakes should be logged at a high level and kill the channel.

Unknown network errors that prevent a response like the one you are referencing here should not be caught and handled. They should be bubbled up, logged, and the channel killed to send some level of notification to the other node.

These issues are beyond the scope of my PR. But I intend to address them in a follow-up.

I actually did make this change, since it looks like we were catching Exception before when this happened at the application layer, but only catching IOException for things before the application layer. I do think this still needs to be ironed out to ensure that a failure to send a response does not leave us hanging.

I did a little more clean-up around here. I still think we need a follow-up dedicated to exception handling. But I tried to maintain consistent behavior while moving in the correct direction.
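
A rough sketch of the "catch Exception when sending the error response" variant discussed above. `ResponseChannel` and `ErrorResponses` are hypothetical names, and the boolean return value is an assumption added here so a caller could close the channel on failure; it is not taken from the PR.

```java
// Hypothetical minimal channel; the real TransportChannel has a richer API.
interface ResponseChannel {
    void sendResponse(Exception error) throws Exception;
}

final class ErrorResponses {
    // Returns true if the error response was sent, false if sending itself failed.
    static boolean sendErrorResponse(String actionName, ResponseChannel channel, Exception e) {
        try {
            channel.sendResponse(e);
            return true;
        } catch (Exception inner) {
            // Keep the original failure attached; a Throwable cannot suppress itself.
            if (inner != e) {
                inner.addSuppressed(e);
            }
            System.err.println("Failed to send error message back to client for action ["
                + actionName + "]: " + inner);
            return false;
        }
    }
}
```

A caller that receives `false` knows no response reached the other node and can decide to close the channel rather than leave the request hanging.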

ywelsch · 2020-04-03T13:05:08Z

server/src/main/java/org/elasticsearch/transport/InboundAggregator.java

+
+        @Override
+        public void close() {
+            final int toRelease = bytesToRelease.getAndSet(0);


I think we should protect against double-closing here, given how important it is to do this correctly.

I made a change.

ywelsch · 2020-04-03T13:11:25Z

server/src/main/java/org/elasticsearch/transport/InboundPipeline.java


    public void handleBytes(TcpChannel channel, ReleasableBytesReference reference) throws IOException {
+        if (uncaughtException != null) {
+            throw new IllegalStateException("Pipeline state corrupted by uncaught exception", uncaughtException);


when do we expect this to happen? should we assert false here?

Theoretically I think this could happen on the HTTP on transport error. But with all of the async handling involved here, I thought it was appropriate to add an IllegalStateException, but not strict enough to add an assertion, as we are very dependent on the different implementations (Mock, Nio, and Netty) for the channel close path.
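
The guard in the diff above can be illustrated with a small sketch: once a step throws, the pipeline remembers the exception and refuses further input. `PoisonablePipeline` and `Step` are hypothetical names, and the sketch is single-threaded and uses plain byte arrays instead of channel types.

```java
// Hypothetical pipeline that refuses further input after an uncaught exception.
final class PoisonablePipeline {
    interface Step {
        void accept(byte[] bytes) throws Exception;
    }

    private final Step step;
    private Exception uncaughtException;

    PoisonablePipeline(Step step) {
        this.step = step;
    }

    void handleBytes(byte[] bytes) throws Exception {
        if (uncaughtException != null) {
            // Internal state may be inconsistent; fail loudly instead of decoding garbage.
            throw new IllegalStateException("Pipeline state corrupted by uncaught exception", uncaughtException);
        }
        try {
            step.accept(bytes);
        } catch (Exception e) {
            // Remember the failure so that any later call is rejected.
            uncaughtException = e;
            throw e;
        }
    }
}
```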

ywelsch

I've left one more comment, looking good o.w.

Would appreciate a second pair of eyes from @original-brownbear, as this is critical infrastructure code.

ywelsch · 2020-04-06T08:07:15Z

server/src/main/java/org/elasticsearch/transport/InboundHandler.java

+            try (Releasable breakerRelease = message.takeBreakerReleaseControl()) {
+                final TransportChannel transportChannel = new TcpTransportChannel(outboundHandler, channel, action, requestId, version,
+                    header.isCompressed(), header.isHandshake(), () -> {});
                handshaker.handleHandshake(transportChannel, requestId, stream);


If TransportHandshaker.handleHandshake throws an exception (e.g. IllegalStateException), that is no longer bubbled up back to the node that initiated the handshake. I'm not sure what this change of behavior entails, but would suggest backing it out of this PR.

Made this change.

original-brownbear

I think the code is ok, but I'm a little uneasy about the plan here:

incrementally circuit breaking as data is received

Are we sure this is what we want in the first place? It seems to me what we really want is to circuit break before reading full messages.
If we circuit break incrementally, that means we're blowing through a bunch of buffer space only to then throw away a message mid-way. How would we ensure liveness here? If a node has a number of large messages come in concurrently, it will never process any of them but waste buffers for all of them for large (smaller than available memory) message sizes given enough concurrent messages?

Wouldn't we instead want to check the circuit breaker in headerReceived? That way we can circuit break a message without wasting all the buffers for it. We can increment the circuit breaker right after reading the header.
If we trip it, we just drop/release all the bytes in the remainder of the message as we aggregate and prevent needlessly holding on to buffered bytes that will never be deserialized.
If we don't trip it, we would've incremented it already before reading the message and would have the guarantee that the buffers we start holding on to for it will not be wasted?

=> to me it seems like that's what we want and could have in this PR at low cost with the changes it introduces.
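
The header-first proposal above could be sketched roughly as follows. This is an illustration only: `HeaderFirstAggregator` is a hypothetical, single-threaded class, and it assumes the header always carries an accurate content length, which the reply further down notes is not always the case (for example with compression).

```java
import java.io.ByteArrayOutputStream;

// Hypothetical aggregator that reserves the declared length as soon as the header arrives.
final class HeaderFirstAggregator {
    private final long limitBytes;
    private long reservedBytes;
    private boolean dropping;
    private ByteArrayOutputStream content;

    HeaderFirstAggregator(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    // Returns false if the breaker trips; the rest of the message is then discarded.
    boolean headerReceived(long declaredLength) {
        if (reservedBytes + declaredLength > limitBytes) {
            dropping = true;
            content = null;
            return false;
        }
        reservedBytes += declaredLength;
        dropping = false;
        content = new ByteArrayOutputStream();
        return true;
    }

    // Chunks of a tripped message are dropped immediately instead of being buffered.
    void contentReceived(byte[] chunk) {
        if (dropping == false) {
            content.write(chunk, 0, chunk.length);
        }
    }

    // Returns the aggregated bytes, or null if the message was dropped.
    byte[] finishMessage() {
        byte[] result = dropping ? null : content.toByteArray();
        content = null;
        dropping = false;
        return result;
    }

    // Called by the consumer once a message has been fully handled.
    void release(long bytes) {
        reservedBytes -= bytes;
    }

    long reservedBytes() {
        return reservedBytes;
    }
}
```

The point of the design is that a rejected message never occupies buffer space, while an accepted one has its full size reserved before any content is held.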

Tim-Brooks · 2020-04-06T16:00:49Z

Are we sure this is what we want in the first place?

This commit does not change any circuit breaking behavior. I imagine the end result of circuit breaking will be a combination of incremental breaking and pre-breaking on bytes that we know we are about to receive.

The header int will inform this work. But it is not always present or accurate (compression), and there is some tension between it and our current usage of the MXBeans heap stats to circuit break. So I think the end result will look something like what you're describing. But there are some complications, and this is only an infrastructure PR.

original-brownbear

LGTM, thanks for the explanations Tim! Definitely more flexible to have the circuit breaking further up stream no matter what we do going forward :)

ywelsch

LGTM

Tim-Brooks added 6 commits March 31, 2020 13:06

Chnages

749c957

WIP

1de985c

WIP

03cd118

WIP

e2d10b9

Chnages

555d5a8

Changes

7dcc09c

Tim-Brooks added >non-issue :Distributed Coordination/Network Http and internode communication implementations v8.0.0 v7.8.0 labels Apr 1, 2020

Tim-Brooks added 3 commits April 1, 2020 16:13

Fix issue

e94d17a

Fix

fcd0ce7

Merge remote-tracking branch 'upstream/master' into circuit_breaking_…

10c44a4

…inside_pipeline

Tim-Brooks requested review from original-brownbear and ywelsch April 2, 2020 00:30

Tim-Brooks added 2 commits April 2, 2020 13:09

Merge remote-tracking branch 'upstream/master' into circuit_breaking_…

fe39a30

…inside_pipeline

Merge remote-tracking branch 'upstream/master' into circuit_breaking_…

901d63e

…inside_pipeline

ywelsch suggested changes Apr 3, 2020

Tim-Brooks added 3 commits April 3, 2020 10:08

Merge remote-tracking branch 'upstream/master' into circuit_breaking_…

299feeb

…inside_pipeline

Delete comment

ac8b0f3

Changes

13f6285

Tim-Brooks requested a review from ywelsch April 3, 2020 18:21

Tim-Brooks added 6 commits April 3, 2020 13:04

Change

f8cbb2e

Cleanup exception handling

6283366

Whitespace

21bbda9

Merge remote-tracking branch 'upstream/master' into circuit_breaking_…

3f2a32c

…inside_pipeline

Change

3c7ae75

Merge remote-tracking branch 'upstream/master' into circuit_breaking_…

17f42a3

…inside_pipeline

ywelsch suggested changes Apr 6, 2020

original-brownbear reviewed Apr 6, 2020

Changes

d972a62

Tim-Brooks requested a review from ywelsch April 6, 2020 15:54

original-brownbear approved these changes Apr 6, 2020

Tim-Brooks added 2 commits April 6, 2020 11:00

Merge remote-tracking branch 'upstream/master' into circuit_breaking_…

5f5fb0f

…inside_pipeline

Merge remote-tracking branch 'upstream/master' into circuit_breaking_…

c300562

…inside_pipeline

ywelsch approved these changes Apr 7, 2020

Tim-Brooks merged commit 4f0ccd3 into elastic:master Apr 7, 2020

Tim-Brooks added the backport pending label Apr 7, 2020

Tim-Brooks removed the backport pending label Apr 7, 2020

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
