Conversation

@tlrx (Member) commented Oct 25, 2019

This PR adjusts the total time that a Google Cloud Storage request can take in the GCS repository tests from 50 seconds to 120 seconds (see https://gradle-enterprise.elastic.co/s/zz5bstxyoqunk/tests/avmwsupzrdux6-dp22mjd7itzv6 for a test failure).

@original-brownbear I still think that, instead of increasing the timeout, we should configure an executor service in the HTTP handler, as GCS requests take more time to process than S3/Azure ones (most of them are multipart encoded).
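
As an illustration of that idea (not code from this PR), here is a minimal sketch of wiring an executor into a JDK `com.sun.net.httpserver.HttpServer`; the class name, pool size, and bind address are made up for the example:

```java
import com.sun.net.httpserver.HttpServer;

import java.net.InetSocketAddress;
import java.util.concurrent.Executors;

public class HttpHandlerExecutorSketch {
    public static void main(String[] args) throws Exception {
        // Bind to an ephemeral port on localhost, as the repository tests do.
        HttpServer server = HttpServer.create(new InetSocketAddress("localhost", 0), 0);
        // Without an executor, the JDK HttpServer runs handlers on a single default thread;
        // a small pool would let slower (e.g. multipart) requests be handled in parallel.
        server.setExecutor(Executors.newFixedThreadPool(4));
        server.start();
        // ... register handlers via server.createContext(...) and run requests against
        // server.getAddress() here ...
        server.stop(0);
    }
}
```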

@tlrx tlrx added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.5.0 v7.6.0 labels Oct 25, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@tlrx (Member, Author) commented Oct 28, 2019

@elasticmachine run elasticsearch-ci/1
(unrelated test failure)

@original-brownbear (Contributor) left a comment

Hmm, it seems to me something other than slowness in the HttpServer must be going on here?
I just profiled this test a little and it doesn't seem like we're spending a lot of time on the http server's IO loop in the failing test here.

What we are doing, though, is allocating a ton of buffers on it during the error simulations; maybe that's where we're burning so much time when under memory pressure?

How about adding this optimization:

--- a/test/framework/src/main/java/org/elasticsearch/repositories/blobstore/ESMockAPIBasedRepositoryIntegTestCase.java
+++ b/test/framework/src/main/java/org/elasticsearch/repositories/blobstore/ESMockAPIBasedRepositoryIntegTestCase.java
@@ -26,7 +26,6 @@ import org.elasticsearch.action.admin.indices.forcemerge.ForceMergeResponse;
 import org.elasticsearch.cluster.metadata.IndexMetaData;
 import org.elasticsearch.common.Strings;
 import org.elasticsearch.common.SuppressForbidden;
-import org.elasticsearch.common.io.Streams;
 import org.elasticsearch.common.network.InetAddresses;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.mocksocket.MockHttpServer;
@@ -37,6 +36,7 @@ import org.junit.Before;
 import org.junit.BeforeClass;
 
 import java.io.IOException;
+import java.io.InputStream;
 import java.net.InetAddress;
 import java.net.InetSocketAddress;
 import java.util.Map;
@@ -165,8 +165,11 @@ public abstract class ESMockAPIBasedRepositoryIntegTestCase extends ESBlobStoreR
             }
         }
 
+        private static final byte[] BUFFER = new byte[1024];
+
         protected void handleAsError(final HttpExchange exchange) throws IOException {
-            Streams.readFully(exchange.getRequestBody());
+            final InputStream inputStream = exchange.getRequestBody();
+            while (inputStream.read(BUFFER) >= 0);
             exchange.sendResponseHeaders(HttpStatus.SC_INTERNAL_SERVER_ERROR, -1);
             exchange.close();
         }

to drain the bytes more efficiently than by actually instantiating a buffer of 1MB or so on every run, and see if that helps?
From profiling here I don't see where we'd need another executor, and I also don't think we should up the timeout. If it's not GC pressure, something else must be off here, unless you were able to find some massive CPU usage on the http server dispatcher thread?
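
For illustration (this is not code from the PR, which patches ESMockAPIBasedRepositoryIntegTestCase directly), a minimal standalone sketch of the same idea: drain the request body with a small reusable buffer before answering with an error. The class name and the hard-coded 500 status are assumptions of the sketch; the actual change uses HttpStatus.SC_INTERNAL_SERVER_ERROR:

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;

import java.io.IOException;
import java.io.InputStream;

public class ErrorSimulatingHandler implements HttpHandler {

    // Reusable scratch buffer: the bytes read into it are discarded, so its contents
    // never matter. Sharing it across requests is only safe if the server dispatches
    // requests to this handler one at a time (an assumption of this sketch).
    private static final byte[] DRAIN_BUFFER = new byte[1024];

    @Override
    public void handle(HttpExchange exchange) throws IOException {
        final InputStream body = exchange.getRequestBody();
        // Drain the whole request body so the client is not left blocked on an
        // unread upload before the error response goes out.
        while (body.read(DRAIN_BUFFER) >= 0) {
            // keep reading until EOF
        }
        // A response length of -1 means "no response body"; 500 simulates a server error.
        exchange.sendResponseHeaders(500, -1);
        exchange.close();
    }
}
```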

@tlrx (Member, Author) commented Oct 28, 2019

Thanks @original-brownbear. I'm not sure the allocations are the cause of the read timeouts, but your suggestion is a good one anyway, so I pushed 71fd095 👍

We'll see if it has any impact on the test failures. I also agree that increasing the timeout is not a good thing to do unless we have exhausted all other sources of trouble.

@original-brownbear (Contributor) left a comment

LGTM, thanks Tanguy :)

Let's see if this helps (at least it should be an improvement; it's even a speedup on my local box, where things are completely quiet otherwise) and otherwise try to find ways of adding debugging that would track down the source of the slowness, maybe?

@tlrx tlrx merged commit 47a1fc1 into elastic:master Oct 29, 2019
@tlrx (Member, Author) commented Oct 29, 2019

Thanks Armin!

@tlrx tlrx deleted the adjust-total-timeout-gcs-tests branch October 29, 2019 08:13
@tlrx tlrx changed the title from "Give more time to GCS requests to complete" to "Reduce allocations when draining HTTP requests bodies in repository tests" Oct 29, 2019
tlrx added a commit that referenced this pull request Oct 29, 2019
Reduce allocations when draining HTTP requests bodies in repository tests (#48541)

In repository integration tests, we drain the HTTP request body before
returning a response. Before this change this operation was done using
Streams.readFully(), which uses an 8kb buffer to read the input stream;
it now uses a 1kb buffer for the same operation. This should reduce the
allocations made during the tests and speed them up a bit on CI.

Co-authored-by: Armin Braun <[email protected]>