Automatically adjust search threadpool queue_size #23884
Conversation
I think there is more testing I can add to this (there are already unit tests), and I need to benchmark the difference between this and normal execution, but I wanted to get comments on the implementation earlier rather than later.

Out of interest, why did you choose 10,000 and 25? My gut feeling would be to adjust more frequently and by a smaller delta, but that's a complete guess on my part.
For the window size (10,000), I wanted a window large enough that a single operation doesn't skew the adjustment too much. Since there are a very large number of search operations happening for each search on the search threadpool (a task for each shard, plus the coordinating task), the number should be reached fairly quickly (the "should" is another reason why it's configurable).

For the adjustment size, I picked it arbitrarily. The current values for the search threadpool are: initial_size: 500, min: 10, max: 1000. I wanted the adjustments to be more gradual, but I don't have any real numbers around it other than 25 seeming like a nice number. I'm happy to change it :)
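To make the mechanism concrete, here is a minimal sketch of the frame-based adjustment as described above. This is illustration only, not the PR's code; the class and field names are made up, and the grow/shrink decision is passed in rather than computed.

```java
// Illustrative sketch, not the PR's implementation.
final class FrameBasedQueueAdjuster {
    private final int tasksPerFrame;  // e.g. 10,000: completed tasks per adjustment window
    private final int adjustBy;       // e.g. 25: how far the queue size moves per adjustment
    private final int minQueueSize;   // e.g. 10
    private final int maxQueueSize;   // e.g. 1000
    private int completedTasks = 0;
    private int queueSize;            // e.g. 500 initially

    FrameBasedQueueAdjuster(int tasksPerFrame, int adjustBy, int initialQueueSize, int minQueueSize, int maxQueueSize) {
        this.tasksPerFrame = tasksPerFrame;
        this.adjustBy = adjustBy;
        this.queueSize = initialQueueSize;
        this.minQueueSize = minQueueSize;
        this.maxQueueSize = maxQueueSize;
    }

    /** Called after each task completes; shouldGrow is whatever signal decides the direction. */
    void afterTask(boolean shouldGrow) {
        if (++completedTasks % tasksPerFrame == 0) {
            int delta = shouldGrow ? adjustBy : -adjustBy;
            queueSize = Math.max(minQueueSize, Math.min(maxQueueSize, queueSize + delta));
        }
    }

    int currentQueueSize() {
        return queueSize;
    }
}
```

A large window smooths out individual slow tasks, while a small delta keeps any single adjustment from swinging the queue size dramatically, which is the trade-off described above.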
Here are some benchmarks run between master and the auto-queue-size branch:

The query performance looks extremely similar, with very slightly higher latency on the auto-queueing branch, ~1ms for term queries and <1ms for phrase queries.
s1monw
left a comment
I did a first pass and I really like it so far. There are some structural issues (nitpicks) and a bunch of questions that I have, but it looks pretty good IMO. I am not so sure about the adjustment size of 25, but I might be missing something. What I'd love to see is a benchmark where a node is under load, to see how we behave with and without the queue impl.
make this a hard check please
this one needs javadocs and you can go back to 140 chars
this file can go back to 140 chars as well...
it's hard to tell what totalRuntime is and whether we'd need any error correction here, i.e., if totalRuntime can grow much bigger than the task count. My intuition says no, but I'd appreciate some docs that explain the scale of these parameters or what they are (totalRuntime in nanos? millis?)
can we use Math.toIntExact here when we cast to long first?
again, can we log this on the consumer level if at all?
should we just fail if this is the case? I mean we should not call it in that case right?
I think with this, TimedRunnable is a better generic wrapper, because it allows checking task time before the task is actually complete (returning -1) rather than blowing up. If you think it should be used only for this calculation though, I can change it to a hard error.
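For illustration, a wrapper along these lines can report task time without blowing up when the task has not finished yet. This is a sketch under assumed names, not the PR's actual TimedRunnable.

```java
// Sketch only: records start/finish times and returns -1 while the task is incomplete.
class TimedRunnable implements Runnable {
    private final Runnable original;
    private volatile long startNanos = -1;
    private volatile long finishNanos = -1;

    TimedRunnable(Runnable original) {
        this.original = original;
    }

    @Override
    public void run() {
        startNanos = System.nanoTime();
        try {
            original.run();
        } finally {
            finishNanos = System.nanoTime();
        }
    }

    /** Execution time in nanoseconds, or -1 if the task has not started or finished yet. */
    long getTotalNanos() {
        if (startNanos == -1 || finishNanos == -1) {
            return -1;
        }
        return finishNanos - startNanos;
    }
}
```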
well then document this? I think it should have a javadoc? Also does it need to be public?
also can this class be final and pkg private? we should not leak it?!
I wonder if a wikipedia link or something would be useful.
can be final too?
Force-pushed from 16534fc to 723b5a0
Thanks for the review @s1monw! I've pushed a lot of changes addressing your feedback.

You're not missing anything at all, I picked it arbitrarily. I was thinking of a better idea though: how about 1/10th of the difference between […]? So with […]

Great idea! I can work on this.

that seems like a good first step. I also wonder what the default is for the […]

Sounds good, I'll work on that!

The default is 10,000, but this is configurable in the settings

I feel we shouldn't change the default of the search threadpool in this PR. Let's keep it at 1k?
s1monw
left a comment
left some comments, it's looking good so far! I think you can remove the WIP label?
if you do this, make sure it has a comment. I am ok with adding a static method that delegates and has an assertion in it with a suppression.
well then document this? I think it should have a javadoc? Also does it need to be public?
do we need this?
should we stick to the size we had or why did you change this?
I wonder if this is too much by default; maybe go with 1k for a start?
Can one of you share why we apply this to the search threadpool first, and also whether we have similar plans for Bulk?

@LucaWintergerst to be honest I am leaning towards not even applying it to the search thread-pool in this PR but in a followup. We are just introducing this feature, and it's not considered good practice to change more than one variable in a system in order to observe behavioral change. Search is also very different from indexing, since on a search request you can retry on a replica; on indexing this is not possible. To make this work well for indexing we also need more benchmarks and infrastructure to test, so I don't think this will be applied to indexing by default before we have production feedback from the real world on this feature for search. I hope this makes sense.
Force-pushed from f8ad4fa to d2c1334
I think 1k is a little too small (only ~166 queries for a single ES node), so I changed the default to 2k; does that work? I also reset the search defaults back to an initial size of 1000 (like it was before) with a min of 10 and max of 2000.

Thanks! I removed the WIP label.
s1monw
left a comment
code wise I think we are ready to go. I still want to see benchmarks / tests on how better we do compared to non-autoscaling if a node is slow as well as how quickly it recovers in terms of queue length when it gets healthy again.
a general thought, I wonder if we should make the window size a function of the queue length since with shorter queues we should check more often if we need to make it longer?
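As a purely hypothetical illustration of that idea (not something in the PR), the frame size could be derived from the current capacity, so a short queue is re-evaluated after fewer tasks than a long one:

```java
// Hypothetical: scale the adjustment window with the queue capacity,
// so shorter queues get checked (and potentially lengthened) more often.
static int frameSizeFor(int currentQueueCapacity) {
    return Math.max(100, currentQueueCapacity * 10);
}
```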
can we have a check that max > min? Make it a hard check and throw IAE.
make it final?
I extend TimedRunnable in QueueResizingEsThreadPoolExecutorTests as SettableTimedRunnable so I can substitute the time taken for a task without Thread.sleep, so I can't make this final
fair enough
I think we should add custom validators / parsers to make sure that min <= max at the settings parsing level. I know they have a cyclic dep, so I think you need to create two instances of each, one for the validation and one for the actual registration; I think it's worth it.
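For illustration, here is a schematic of what validating min <= max at parse time could look like. This is not the Elasticsearch Settings API, and the setting keys and defaults are placeholders.

```java
import java.util.Map;

// Schematic sketch: fail at settings-parsing time when the pair of values is inconsistent,
// instead of discovering the problem later inside the executor.
final class QueueSizeSettings {
    final int minQueueSize;
    final int maxQueueSize;

    QueueSizeSettings(Map<String, String> rawSettings) {
        this.minQueueSize = Integer.parseInt(rawSettings.getOrDefault("thread_pool.search.min_queue_size", "10"));
        this.maxQueueSize = Integer.parseInt(rawSettings.getOrDefault("thread_pool.search.max_queue_size", "1000"));
        if (minQueueSize > maxQueueSize) {
            throw new IllegalArgumentException("min_queue_size [" + minQueueSize
                    + "] must be <= max_queue_size [" + maxQueueSize + "]");
        }
    }
}
```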
there is a public int capacity() in the super class; isn't this enough?
should we also override public int remainingCapacity() with:

    public int remainingCapacity() {
        return Math.max(0, super.remainingCapacity());
    }
jasontedor
left a comment
Okay, I reviewed this! I like it. I left a bunch of comments though. Two things I noticed that I'm not sure where to comment:
- the naming of frame/window is inconsistent, you use both
- the naming of queue resizing executor, auto queue adjusting executor, and EsExecutors#newAutoQueueFixed are inconsistent with each other, can we get them in sync?
add "based on queue throughput."?
These always read clearer to me as <= 0.
Must be positive? Include the illegal value?
It looks like this could fit in 140 columns.
This should be a regular comment and not a Javadoc.
One line.
One line.
I appreciate code comments, but I think this one can be dropped. It doesn't communicate anything that the code doesn't already show and it's at risk of becoming inconsistent with the code over time as it happens so frequently that a change is made to the code but comments aren't updated. What I would support if you think this comment is essential is:
    builders.put(
        Names.SEARCH,
        new AutoQueueAdjustingExecutorBuilder(
            settings,
            Names.SEARCH,
            searchThreadPoolSize(availableProcessors),
            1000 /* initial queue size */,
            10 /* min queue size */,
            2000 /* max queue size */,
            2000 /* frame size */));

but even this I think is unnecessary.
Add the offending value? If this breaks, it's going to be in tests in concurrent code where it's possible it will not reproduce easily, so any hint as to the problem will be welcome.
I think that we can probably drop the second size from the name, make it less verbose.
jasontedor
left a comment
I left two more comments, sorry, still very much thinking about this one.
I'm not convinced of the soundness of the bookkeeping here. I think that totalNanos can include the total task time for more than tasksPerWindow tasks.
How so? I am trying to think of how it will do that, but I wasn't able to come up with something, do you have a specific scenario in mind?
I'm not convinced we should ignore failures. These tasks still occupy queue capacity, and there is no guarantee they failed quickly; a thread can be executing for a while before failing.
That's a good point, I'll remove the null check and use all the runnables as tasks
(Heads up for anyone following this PR, I am currently testing this in a real-world scenario and adjusting/changing things accordingly)
Okay, I ran some benchmarks with this change; there are a number of things I found that were interesting, some of which caused me to rethink the design, so keep reading below for the update.

I tested this on a 6-node cluster with 5 extremely beefy data nodes (32 CPUs, 64gb RAM, JVM heap at 30gb) and a single client node (to which all the requests were sent). The cluster was loaded with 120 million documents on an index with 5 primaries and 4 replicas (so each node could serve a request for any shard). I used Pius' setup for JMeter to spawn 5 threads each running 40000 queries at a target rate of 20 queries a second (so with 5 threads it'd be targeted at 100 queries a second). The query was a fairly complex query with […]. During this test, I consistently stressed a single node by using […]. This does the following: […]

I did this to simulate a situation where one node in a cluster cannot keep up, whether because of an expensive query, or degraded disk, etc.

**The desired behavior**

The behavior I'd like to see in this situation is that the stressed node (es5) would eventually lower its threadpool queue down to a level where it would start rejecting requests, causing other nodes in the cluster to serve them. This would lead to better overall performance, since the load would be routed away from the overloaded node.

**The results with this branch as-is**

First I tried this branch as-is, measuring how the threadpools changed and whether the desired behavior was accomplished. Spoiler alert: it was not. My initial premise was based on the (deceptively simple) Little's Law formula of L = λW, where I was measuring both […]. I discovered a number of issues with this premise. First, by leaving both […]

**The alternative implementation**

"Okay" I thought, I need to make the relationship between the optimal queue size and the actual queue size inverse, so that as the "optimal" queue went up, the actual queue size went down. The second thing I tried was to take the maximum queue size and subtract the optimal queue size, so that as the optimal queue size went up […]. This sort of worked, but it had a number of issues. It felt hacky, and not like an actual good implementation: being able to set a good maximum queue size would be almost impossible without knowing up front what your queue size was already at; set too high of a max and the adjustment would have no effect, too low and you'd be rejecting before you actually had stress on the node. Additionally, it suffered the same problem as the previous solution, both […]

**The alternative-alternative implementation**

Back to the drawing board. I spent some time thinking and realized that we should not be measuring both […]. With this in mind, I implemented an alternative where I […]. I then use […]. I tested my theory; with the same setup, I set up a […]

I am happy to report that with my initial testing, I was able to show that with this setup, the stressed node correctly routed traffic away from itself by lowering and subsequently rejecting search requests, routing them to the unstressed nodes, and improving average response times for the entire run. Here's an overall graph of the response times with "None" and "Auto" queue adjustment: […]

I also captured the handled tasks per second for an unstressed and a stressed node, as well as the queue adjustment for each node, which you can check out in the graphs below: […]

And you can see the actual versus "optimal" queue in these two graphs: […]

For more information as well as the raw data (this explanation is already long enough) you can see the rest at https://writequit.org/org/es/stress-run.html

I have been spending time thinking about this new direction (setting a reasonable […]
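To make the Little's Law framing concrete, here is a minimal sketch of deriving an "optimal" queue size from a measured task rate and a configured target response time. This is my own illustration, not the PR's code, and the parameter names are placeholders.

```java
/**
 * Illustrative sketch of Little's Law, L = λ * W, applied to queue sizing:
 *   λ = task completion rate over the measurement window (tasks per nanosecond)
 *   W = target response time in nanoseconds (configured by the operator, not measured)
 *   L = the resulting "optimal" number of tasks to allow in the queue
 */
static int optimalQueueSize(long totalTaskNanosInWindow, int tasksInWindow, long targetResponseTimeNanos) {
    double lambda = (double) tasksInWindow / totalTaskNanosInWindow;
    double optimal = lambda * targetResponseTimeNanos;
    return (int) Math.min(Integer.MAX_VALUE, Math.max(0, optimal));
}
```

Under this framing, when tasks on a stressed node slow down, λ drops, so L (and with it the queue) shrinks toward the minimum and the node starts rejecting requests, which pushes load to healthier replicas as described above.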
@dakrone thanks for running these tests, I'm happy we are trying things out 👍 Looking at your results my first question would be what is a good default […]

Cool!
Good stuff! It's promising that you found an alt-alt implementation that is showing improvements 🙂 I think the target response rate is a reasonable configuration for those who really want to use it, and since this is intended to address a specific problem for the high-volume search use cases only, we can make it an opt-in thing if we can't find a good default. What would happen if someone sets an overly aggressive target response rate (much lower than their actual avg query processing time)? I am assuming it will just cause a lot more search rejections than the default queue size? We may also want to rename target_response_rate to something else (so we don't have admins set it to a number thinking that it will give them the response rate they are looking for :P).
Response to @s1monw:

That's a good question, I set it to 5 seconds in this case mostly because I knew […]

I agree with this, it's something the user will always have more intimate […]

Yep, the default is the hardest, however, we've already made a decision like […]

Response to @ppf2:

I agree, I do think this can really help people that do have this situation, […]

I ran some numbers and made a graph! For simplicity, let's assume queries arrive at a rate of 1 per second, and take […]. At task #26 (where the first measurement/adjustment happens), 26 seconds have […]. With our auto queuing adjustment and no min or max, the queue would trend […]
really? 5 seconds? I was rather looking into […]
Force-pushed from 3b1a6b6 to bc6a04b
retest this please
Force-pushed from 33021ca to 8cbd20f
@s1monw as we discussed, I have updated this branch to have the third behavior as described in the benchmarking above. I've also changed it to use the same value (1000) for min and max of the queue_size, that way this has no user-facing changes as-is. Sorry for the rebasing, there were a lot of unrelated failures due to master and x-pack breakage, so I needed to rebase it on a working copy of master. It should be ready for another review!
s1monw
left a comment
I left some comments. I think this is a good first iteration and we should get this in so we can gain some real world experience. We can improve as we do and we might end up with auto-adjusting the target response rate in the future.
should we return a simple EsThreadPoolExecutor if min and max queue size is the same here? I think that would be safer from a bwc perspective?
From a BWC perspective it certainly would, though then a user couldn't turn on debug logging to see what kind of calculations we make about the optimal queue size.
I'll make the change though, since it's the most safe thing to do.
can we detect this rather than catching this exception? I'd feel better about it if we could
Let me make sure I understand what you're asking - we switched to using Math.toIntExact, so are you saying we should manually check if the long value is > Integer.MAX_VALUE rather than catching the exception?
fair enough
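For context, the two approaches discussed above look roughly like this (a sketch; the names and the clamping behavior are placeholders, and the real code may prefer to propagate the error instead):

```java
// Option 1: use Math.toIntExact and catch the overflow it signals.
static int viaException(long computedQueueSize) {
    try {
        return Math.toIntExact(computedQueueSize);
    } catch (ArithmeticException e) {
        return Integer.MAX_VALUE; // clamp (or log/rethrow) the out-of-range value
    }
}

// Option 2: detect the out-of-range value up front, no exception involved.
static int viaCheck(long computedQueueSize) {
    if (computedQueueSize > Integer.MAX_VALUE) {
        return Integer.MAX_VALUE;
    }
    return (int) computedQueueSize;
}
```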
so now we use the new FIXED_AUTO_QUEUE_SIZE, but if somebody had some custom settings on their search threadpool, will they still work? I mean, are the settings for FIXED a true subset of the ones in FIXED_AUTO_QUEUE_SIZE?
I mean are the settings for FIXED a true subset of the one in FIXED_AUTO_QUEUE_SIZE?
They are a true subset, the new type makes only additive setting changes
Force-pushed from 6137c85 to eac960c
This PR adds a new thread pool type: `fixed_auto_queue_size`. This thread pool behaves like a regular `fixed` threadpool, except that every `auto_queue_frame_size` operations (default: 10,000) in the thread pool, [Little's Law](https://en.wikipedia.org/wiki/Little's_law) is calculated and used to adjust the pool's `queue_size` either up or down by 50. A minimum and maximum is taken into account also. When the min and max are the same value, a regular fixed executor is used instead. The `SEARCH` threadpool is changed to use this new type of thread pool. However, the min and max are both set to 1000, meaning auto adjustment is opt-in rather than opt-out. Resolves elastic#3890
Force-pushed from eac960c to d09e643
This PR adds a new thread pool type: `fixed_auto_queue_size`. This thread pool behaves like a regular `fixed` threadpool, except that every `auto_queue_frame_size` operations (default: 10,000) in the thread pool, Little's Law is calculated and used to adjust the pool's `queue_size` either up or down by 25. A minimum and maximum is taken into account also.

The `SEARCH` threadpool is changed to use this new type of thread pool.

Relates to #3890