Limit the number of concurrent shard requests per search request #25632
Conversation
This is a protection mechanism to prevent a single search request from hitting a large number of shards in the cluster concurrently. If a search is executed against all indices in the cluster, it can easily overload the cluster and cause rejections etc., which is not necessarily desirable. Instead, this PR adds a per-request limit, `max_concurrent_shard_requests`, that throttles the number of concurrent initial-phase requests to `256` by default. This limit can be increased per request and protects single search requests from overloading the cluster. Subsequent PRs can introduce additional improvements, e.g. limiting this at the `_msearch` level, making the default a factor of the number of nodes, or sorting shard iterators such that we gain the best concurrency across nodes.
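The throttling idea described above can be sketched as a bounded dispatcher: start up to the limit, then launch the next pending shard request as each one completes. The following is a minimal, hypothetical sketch (the class and method names are illustrative, not the actual Elasticsearch implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch (not the actual Elasticsearch code) of the throttle:
// at most maxConcurrentShardRequests shard-level requests are in flight at
// once; each completion launches the next pending one.
public class ShardRequestThrottle {
    private final List<Runnable> shardRequests;
    private final AtomicInteger shardExecutionIndex = new AtomicInteger();
    private final int maxConcurrentShardRequests;

    public ShardRequestThrottle(List<Runnable> shardRequests, int maxConcurrentShardRequests) {
        this.shardRequests = shardRequests;
        this.maxConcurrentShardRequests = maxConcurrentShardRequests;
    }

    public void start() {
        // launch the initial batch, bounded by the per-request limit
        int initialBatch = Math.min(maxConcurrentShardRequests, shardRequests.size());
        boolean success = shardExecutionIndex.compareAndSet(0, initialBatch);
        assert success;
        for (int i = 0; i < initialBatch; i++) {
            shardRequests.get(i).run();
        }
    }

    // invoked when a shard request completes: fires the next pending one
    public void onShardResponse() {
        int index = shardExecutionIndex.getAndIncrement();
        if (index < shardRequests.size()) {
            shardRequests.get(index).run();
        }
    }

    public static void main(String[] args) {
        AtomicInteger started = new AtomicInteger();
        List<Runnable> requests = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            requests.add(started::incrementAndGet);
        }
        ShardRequestThrottle throttle = new ShardRequestThrottle(requests, 3);
        throttle.start();
        System.out.println("started after start(): " + started.get()); // 3
        throttle.onShardResponse();
        System.out.println("started after one completion: " + started.get()); // 4
    }
}
```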
jpountz
left a comment
I left some minor comments.
```java
private int batchedReduceSize = 512;

private int maxNunConcurrentShardRequests = 256;
```
nun -> num
```java
 * Sets the number of shard requests that should be executed concurrently. This value should be used as a protection mechanism to
 * reduce the number of shard reqeusts fired per high level search request. Searches that hit the entire cluster can be throttled
 * with this number to reduce the cluster load. The default is <tt>256</tt>
 */
```
s/nun/num/
or maybe rather remove Num from the variable name since the same variable seems to be called maxConcurrentShardRequests elsewhere and in the dsl, so it would be more consistent?
```java
final int batchedReduceSize = request.paramAsInt("batched_reduce_size", searchRequest.getBatchedReduceSize());
searchRequest.setBatchedReduceSize(batchedReduceSize);

final int maxConcurrentShardRequests = request.paramAsInt("max_concurrent_shard_requests", searchRequest.getBatchedReduceSize());
```
the default value looks wrong
```json
},
"max_concurrent_shard_requests" : {
  "type" : "number",
  "description" : "The number of concurrent shard requests this search executes concurrently. This values should be used to limitthe impact of the search on the cluster in order to limit the number of concurrent shard requests",
```
s/values/value/; s/limitthe/limit the/
```java
private final SearchType defaultSearchType;
private final String defaultPreference;
private final int batchedReduceSize;
private final int maxConcurrentRequests;
```
maybe add "shard" to the name of this variable for consistency?
```java
    defaultPreference = null;
}
this.batchedReduceSize = 2 + random.nextInt(10);
this.maxConcurrentRequests = 1 + random.nextInt(100);
```
can we do something that is more likely to trigger interesting behaviours, such as `1 + random.nextInt(1 << random.nextInt(8))`?
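The suggested expression draws a random exponent first, so small limits are sampled far more often than with a uniform draw while the full range `1..128` is still covered. A quick sketch of its behaviour (the class and method names are illustrative):

```java
import java.util.Random;

// Sketch of the suggested expression: drawing a random exponent first skews
// the distribution toward small values, so tests exercise tiny concurrency
// limits (which trigger interesting throttling behaviour) far more often,
// while still occasionally covering large ones up to 128.
public class SkewedRandom {
    static int skewedLimit(Random random) {
        return 1 + random.nextInt(1 << random.nextInt(8));
    }

    public static void main(String[] args) {
        Random random = new Random();
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < 100_000; i++) {
            int value = skewedLimit(random);
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
        // the exponent ranges over [0, 8), so values fall in [1, 128]
        System.out.println("observed range: [" + min + ", " + max + "]");
    }
}
```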
```java
boolean success = shardExecutionIndex.compareAndSet(0, concurrentRunnables);
assert success;
for (int i = 0; i < concurrentRunnables; i++) {
    int index = i;
```
let's just use i directly? I presume a previous iteration of this work used a lambda that needed a final variable?
jimczi
left a comment
I like it because it is simple, but it seems difficult to pick a good value for the limit. The main unknown for the user is the number of nodes that the search will hit, so I thought that we would apply this limit per targeted node. It can be a follow-up, but I think it would make the usage of this much simpler, since the goal is to not overload a single node. By doing that we could lower the default value (say 64, which is already high if it's per node) and still get fast search when you have a lot of nodes?
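The per-node variant described here could be sketched roughly as follows: track in-flight requests per target node and only dispatch when the node's count is below the limit. This is a hypothetical illustration of the proposal, not code from the PR; all names are made up:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the per-node idea: cap in-flight shard requests per
// target node rather than per search request, so the default can be much
// lower (e.g. 64) without slowing searches on large clusters.
public class PerNodeThrottle {
    private final Map<String, Integer> inFlight = new HashMap<>();
    private final int maxPerNode;

    public PerNodeThrottle(int maxPerNode) {
        this.maxPerNode = maxPerNode;
    }

    // returns true if a shard request to this node may be dispatched now
    public synchronized boolean tryAcquire(String nodeId) {
        int current = inFlight.getOrDefault(nodeId, 0);
        if (current >= maxPerNode) {
            return false; // node saturated: caller would queue the shard request
        }
        inFlight.put(nodeId, current + 1);
        return true;
    }

    // called when a shard request to this node completes
    public synchronized void release(String nodeId) {
        inFlight.merge(nodeId, -1, Integer::sum);
    }

    public static void main(String[] args) {
        PerNodeThrottle throttle = new PerNodeThrottle(2);
        System.out.println(throttle.tryAcquire("node1")); // true
        System.out.println(throttle.tryAcquire("node1")); // true
        System.out.println(throttle.tryAcquire("node1")); // false: node1 full
        System.out.println(throttle.tryAcquire("node2")); // true: other node
    }
}
```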
jimczi
left a comment
LGTM
> Subsequent PRs can introduce additional improvements, i.e. limiting this on a `_msearch` level, making defaults a factor of the number of nodes, or sorting shard iters such that we gain the best concurrency across nodes.
The shard iters are sorted by shard number first, then by index, which is a good heuristic to diversify the nodes in the group iteration. We can try to do a better job, but after checking the code I think this is much nicer and simpler as it is ;).

Also, the setting is currently only applied to `query_then_fetch` queries; should we have a follow-up to apply the setting to `dfs_query_then_fetch` as well? As is, this PR throttles the dfs queries but not the query phase?

hehe, good catch, I was curious if reviewers would catch it :) I agree, I will do that too.
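The ordering heuristic described here (shard number first, then index) interleaves the indices, so the first `max_concurrent_shard_requests` slots tend to hit different shards and, likely, different nodes. A hypothetical sketch of that comparator, with a made-up `ShardRef` standing in for a shard iterator:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the ordering heuristic: sorting by shard number first, then index
// name, interleaves indices so consecutive shard requests are spread across
// different shards (and likely different nodes).
public class ShardSortSketch {
    // hypothetical stand-in for a shard iterator
    record ShardRef(String index, int shardId) {}

    static List<ShardRef> sortForConcurrency(List<ShardRef> shards) {
        List<ShardRef> sorted = new ArrayList<>(shards);
        sorted.sort(Comparator.comparingInt(ShardRef::shardId)
                .thenComparing(ShardRef::index));
        return sorted;
    }

    public static void main(String[] args) {
        List<ShardRef> shards = List.of(
                new ShardRef("logs", 0), new ShardRef("logs", 1),
                new ShardRef("metrics", 0), new ShardRef("metrics", 1));
        // interleaved order: logs/0, metrics/0, logs/1, metrics/1
        System.out.println(sortForConcurrency(shards));
    }
}
```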
jimczi
left a comment
I like the new heuristic with the node count and the default number of shards.
+1
* master:
  - Add an underscore to flood stage setting
  - Avoid failing install if system-sysctl is masked
  - Add another parent value option to join documentation (elastic#25609)
  - Ensure we rewrite common queries to `match_none` if possible (elastic#25650)
  - Remove reference to field-stats docs.
  - Optimize the order of bytes in uuids for better compression. (elastic#24615)
  - Fix BytesReferenceStreamInput#skip with offset (elastic#25634)
  - Limit the number of concurrent shard requests per search request (elastic#25632)