
Conversation

@original-brownbear
Contributor

@original-brownbear original-brownbear commented Apr 4, 2019

  • Bulk requests can contain thousands of items and take more than O(10ms) to handle => we should not handle them on the transport threadpool, so that we don't block the select loops
    • Adjusted some unit tests accordingly; used the direct executor there to keep things deterministic that were deterministic before
    • Since we already have multiple context switches and IO during bulk request handling, I don't see how adding this one additional switch would have any noticeable performance impact in the real world?
  • relates Log a Warning if NioSelector Thread is Blocked for an Extended Period of Time #39128
  • relates Audit Logging on the IO Thread Appears to Cause Instability #39658, since it moves the audit logging that was causing trouble on the transport thread to the generic thread pool

Revisited this while profiling for the slowest executions on the transport thread and finding that bulk request handling takes quite some time (in tests up to 200ms!) for larger requests. I'm not sure if there's a better alternative to the generic pool, but it seemed like a safe bet?
Marking this as discuss since it's just a suggestion and there could be side effects to this that I'm missing.
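
For illustration, a minimal self-contained sketch of the idea (the class and names below are hypothetical, not the actual TransportBulkAction code): instead of walking all the bulk items inline on the network (selector) thread, the handler just enqueues the work onto a worker pool and returns.

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.function.Consumer;

    // Hypothetical sketch: hand bulk handling off to a worker pool instead of
    // running it inline on the network (selector) thread.
    class BulkDispatchSketch {
        private final ExecutorService workerPool = Executors.newFixedThreadPool(4);

        // Called on the network thread when a bulk request arrives.
        void handleBulk(List<String> items, Consumer<String> listener) {
            // Don't iterate over thousands of items here; just enqueue the work.
            workerPool.execute(() -> {
                // Runs on a worker thread: per-item resolution, ingest/index checks, etc.
                int handled = 0;
                for (String item : items) {
                    handled++; // placeholder for the real per-item handling
                }
                listener.accept("handled " + handled + " items off the network thread");
            });
            // The network thread returns immediately and can service other channels.
        }
    }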

* Bulk requests can be thousands of items large and take more than O(10ms) time to handle => we should not handle them on the transport threadpool to not block select loops
* relates elastic#39128
* closes elastic#39658
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Member

@jaymode jaymode left a comment


As a workaround, LGTM, but I would not have this PR close the issue; we are just masking the problem. I think the @elastic/es-security team needs to evaluate that issue, investigate what can be done, and ultimately come to a decision. As it stands, nothing will catch or prevent this type of issue from cropping up in a new action or in the way security audits requests.

I'd definitely like for @ywelsch to chime in prior to merging.

@original-brownbear
Contributor Author

@jaymode sounds good, I updated the description to not close the issue then :)
I think @ywelsch will be off for the next two weeks but I guess this one can wait right?

@original-brownbear
Contributor Author

@DaveCTurner this might interest you btw :) You mentioned running into issues with logging about long compute times on the IO thread. I wonder if this helps (not sure if you have some reproducer for the issues you're seeing)?

Contributor

@DaveCTurner DaveCTurner left a comment


Unfortunately I don't have a reliable reproduction of the issues I occasionally see on my CI machine. But typically I was seeing things blocked for many seconds and not obviously when it would have been processing enormous bulks.

If we are to use a different threadpool, it seems to me that the WRITE threadpool might be more appropriate for this?

I also think that this wants some benchmarking to back up the (IMO reasonable) argument that one more context switch can't possibly hurt here.

@original-brownbear
Contributor Author

Unfortunately I don't have a reliable reproduction of the issues I occasionally see on my CI machine. But typically I was seeing things blocked for many seconds and not obviously when it would have been processing enormous bulks.

Hmm yeah, that seems unlikely, unless something in the bulk processing here can block for a long time (not sure what, and I looked pretty extensively).

If we are to use a different threadpool, it seems to me that the WRITE threadpool might be more appropriate for this?

The reason I didn't go with the WRITE pool was that it is bounded and can result in rejections right away. That may alter behavior in situations where there are bulk rejections already? Maybe that's ok though, but for now this seemed like the more neutral change. Moving to WRITE seems like it would somewhat alter the dynamics under significant load, because we'd be doing more things on the WRITE pool and you'd have to size it larger to retain the same behavior you had before?

I also think that this wants some benchmarking to back up the (IMO reasonable) argument that one more context switch can't possibly hurt here.

This seems like a situation similar to #39286 where the benchmarking team also told us to merge and then look at the nightly performance.
I ran this through the PMC track and I couldn't get any statistically significant change in throughput though. Just theoretically, it seems pretty much impossible to see one in an isolated benchmark.
I guess the interesting thing would rather be if there are situations where we have some more systemic effect for systems under high indexing load whose GENERIC pool is now under more pressure, but I'm not sure how to benchmark that easily.

@jaymode
Member

jaymode commented Apr 8, 2019

If we are to use a different threadpool, it seems to me that the WRITE threadpool might be more appropriate for this?

I'd argue against doing this since this action is not actually writing data; the auditing action is writing logs. Ultimately the data writes will occur on the write threadpool with the TransportShardBulkAction. We're just moving off of a network thread here to avoid blocking those on I/O and I think adding additional heavy I/O to the threads in the write threadpool could lead to unintended consequences.

@tomcallahan
Contributor

@jasontedor can you please summarize the thoughts from our FixIt meeting?

@jasontedor
Member

I am not in favor of shuttling this kind of work over to the generic thread pool. The generic thread pool is large (can scale up to at least 128 threads and in some cases 512), and doesn't have a bounded queue. That means there would be effectively no backpressure here. While we have a goal to remove these limitations, we don't today (it's not easy because we can deadlock ourselves today if the generic thread pool is too small) and so I would not want to see the generic thread pool used for high throughput work like handling bulk requests.
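
To make the backpressure point concrete, a small stand-alone Java sketch (plain JDK executors, not Elasticsearch's actual thread-pool classes): a pool with a bounded queue surfaces saturation as rejections, which pushes back on callers, while a pool with an unbounded queue accepts everything and just lets work pile up.

    import java.util.concurrent.*;

    class BackpressureSketch {
        public static void main(String[] args) {
            // WRITE-style pool: fixed size, bounded queue -> saturation surfaces as rejections.
            ThreadPoolExecutor bounded = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(10),            // bounded queue
                new ThreadPoolExecutor.AbortPolicy());   // reject when full

            // GENERIC-style pool: unbounded queue -> never rejects, no backpressure.
            ThreadPoolExecutor unbounded = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());            // unbounded queue

            Runnable slowTask = () -> {
                try { Thread.sleep(100); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            };

            int rejected = 0;
            for (int i = 0; i < 100; i++) {
                try {
                    bounded.execute(slowTask);
                } catch (RejectedExecutionException e) {
                    rejected++;                          // caller is told to back off
                }
                unbounded.execute(slowTask);             // always accepted, queue just grows
            }
            System.out.println("bounded pool rejected " + rejected + " of 100 tasks");
            System.out.println("unbounded pool queued " + unbounded.getQueue().size() + " tasks");
            bounded.shutdown();
            unbounded.shutdown();
        }
    }

Running this, the bounded pool rejects most of the 100 tasks almost immediately, while the unbounded pool queues them all; that silently growing queue is the missing backpressure.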

@jaymode
Member

jaymode commented Apr 12, 2019

@jasontedor other than not using the generic threadpool, do you have other thoughts on how you'd like to see this addressed? When security and auditing are enabled, blocking I/O is performed on a network thread in the case of a bulk request.

@jasontedor
Member

How about using a different thread pool only if security and auditing are enabled, and otherwise keeping this as-is today? I would even be in favor of a dedicated security- or audit-specific thread pool. This will require adding some hooks.

@original-brownbear
Contributor Author

@jasontedor but wouldn't it make sense, in general, to get the looping through these potentially huge requests off of the IO threads? (with all the ingest and index-creation checks, such large requests can take quite a bit of time to process)

@jasontedor
Member

I don't know, has it proven to be a problem today other than the audit logging case?

@original-brownbear
Contributor Author

@jasontedor

I don't know, has it proven to be a problem today other than the audit logging case?

One data point I could give here is that running e.g. Rally's PMC track with default settings results in ~100ms average runtime for org.elasticsearch.action.bulk.TransportBulkAction#doExecute.

So for large bulk requests, we're definitely introducing significant latency on the Selector loop(s).
At least to me, it seems wrong that a large bulk request can arbitrarily slow down any other network IO by this much.
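
As a rough illustration of why that matters (a toy event loop, not the actual transport code): everything that becomes ready on a selector is handled sequentially on the same thread, so a single 100ms handler adds 100ms of latency to every other channel waiting behind it.

    import java.util.List;

    // Toy event loop: ready events are handled one after another on the same thread,
    // so a single 100ms handler adds 100ms of latency to every event behind it.
    class SelectorLoopSketch {
        static void runOnce(List<Runnable> readyHandlers) {
            for (Runnable handler : readyHandlers) {
                long start = System.nanoTime();
                handler.run(); // if this is a large bulk request, everything below waits
                long tookMs = (System.nanoTime() - start) / 1_000_000;
                if (tookMs > 10) {
                    System.out.println("handler blocked the loop for " + tookMs + " ms");
                }
            }
        }

        public static void main(String[] args) {
            Runnable largeBulk = () -> {
                try { Thread.sleep(100); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            };
            Runnable smallPing = () -> { /* cheap read, would normally finish in microseconds */ };
            // The ping only runs after the bulk handler has finished.
            runOnce(List.of(largeBulk, smallPing));
        }
    }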

@jasontedor
Member

One data point I could give here is that running e.g. Rally's PMC track with default settings results in ~100ms average runtime for org.elasticsearch.action.bulk.TransportBulkAction#doExecute.

Okay, that is significant and I agree worth addressing.

As I think about this some more, I have general concerns about the blocking behavior of audit logging; in theory, this blocking on the network thread could be problematic for any request.

I think we have two problems to solve here then:

  • the fact that we are doing more expensive work than we should be on the network thread -> we need to move to another thread pool
  • the fact that audit logging can block the networking thread; I think we need to find a general solution to this problem -> maybe we need a different model there, like audit logging writes being put into a non-blocking queue (a rough sketch of that idea is below)

So let's focus this pull request on the first problem and try to figure out which thread pool we should use. I continue to have strong reservations about using the generic thread pool for the reasons that I gave previously. In the fix-it meeting yesterday I said I would be fine with the write thread pool. I understand @jaymode's concerns about that, but if I understand them correctly they are related to the concerns about audit logging, which I think we should explore separately.
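
For reference, one possible shape of such a model for the second problem (a purely hypothetical sketch, not the existing audit-logging code): callers only offer the event to a bounded queue and never block, a dedicated drain thread performs the potentially blocking write, and overflow is counted or dropped instead of stalling the network thread.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical non-blocking audit logger: callers never block, a single
    // background thread performs the (possibly slow) writes.
    class AsyncAuditSketch {
        private final BlockingQueue<String> events = new ArrayBlockingQueue<>(10_000);
        private final AtomicLong dropped = new AtomicLong();

        AsyncAuditSketch() {
            Thread drainer = new Thread(() -> {
                try {
                    while (true) {
                        String event = events.take();   // blocks only the drain thread
                        write(event);                   // the slow, blocking part
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "audit-drainer");
            drainer.setDaemon(true);
            drainer.start();
        }

        // Called from the network thread: O(1), never blocks.
        void audit(String event) {
            if (events.offer(event) == false) {
                dropped.incrementAndGet();              // or escalate, but don't block
            }
        }

        private void write(String event) {
            // stand-in for the real audit sink (file, index, ...)
            System.out.println("AUDIT " + event);
        }
    }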

@original-brownbear
Contributor Author

@jasontedor sounds good :)

In the fix-it meeting yesterday I said I would be fine with the write thread pool.

Already moved to that this afternoon in 1b05e50 :)

@original-brownbear
Contributor Author

thanks @jasontedor @jaymode @DaveCTurner !

@original-brownbear original-brownbear merged commit 2931e9c into elastic:master Apr 17, 2019
@original-brownbear original-brownbear deleted the 39658 branch April 17, 2019 18:28
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 17, 2019
* Bulk requests can be thousands of items large and take more than O(10ms) time to handle => we should not handle them on the transport threadpool to not block select loops
* relates elastic#39128
* relates elastic#39658
original-brownbear added a commit that referenced this pull request Apr 18, 2019
* Bulk requests can be thousands of items large and take more than O(10ms) time to handle => we should not handle them on the transport threadpool to not block select loops
* relates #39128
* relates #39658
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 18, 2019
* The test fails for the retry backoff enabled case because the retry handler in the bulk processor hasn't been adjusted to account for elastic#40866 which now might lead to an outright rejection of the request instead of its items individually
   * Fixed by adding retry functionality to the top level request as well
* Also fixed the duplicate test for the HLRC that wasn't handling the non-backoff case yet the same way the non-client IT did
* closes elastic#41324
original-brownbear added a commit that referenced this pull request Apr 18, 2019
* The test fails for the retry backoff enabled case because the retry handler in the bulk processor hasn't been adjusted to account for #40866 which now might lead to an outright rejection of the request instead of its items individually
   * Fixed by adding retry functionality to the top level request as well
* Also fixed the duplicate test for the HLRC that wasn't handling the non-backoff case yet the same way the non-client IT did
* closes #41324
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 23, 2019
* Due to elastic#40866 one of the two parallel bulk requests can randomly be
rejected outright when the write queue is full already, we can catch
this situation and ignore it since we can still have the rejection for
the dynamic mapping update for the other request and it's somewhat rare
to run into this anyway
* Closes elastic#41363
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 24, 2019
* The test fails for the retry backoff enabled case because the retry handler in the bulk processor hasn't been adjusted to account for elastic#40866 which now might lead to an outright rejection of the request instead of its items individually
   * Fixed by adding retry functionality to the top level request as well
* Also fixed the duplicate test for the HLRC that wasn't handling the non-backoff case yet the same way the non-client IT did
* closes elastic#41324
original-brownbear added a commit that referenced this pull request Apr 24, 2019
* The test fails for the retry backoff enabled case because the retry handler in the bulk processor hasn't been adjusted to account for #40866 which now might lead to an outright rejection of the request instead of its items individually
   * Fixed by adding retry functionality to the top level request as well
* Also fixed the duplicate test for the HLRC that wasn't handling the non-backoff case yet the same way the non-client IT did
* closes #41324
original-brownbear added a commit that referenced this pull request Apr 24, 2019
* Due to #40866 one of the two parallel bulk requests can randomly be
rejected outright when the write queue is full already, we can catch
this situation and ignore it since we can still have the rejection for
the dynamic mapping update for the other request and it's somewhat rare
to run into this anyway
* Closes #41363
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 24, 2019
* Due to elastic#40866 one of the two parallel bulk requests can randomly be
rejected outright when the write queue is full already, we can catch
this situation and ignore it since we can still have the rejection for
the dynamic mapping update for the other request and it's somewhat rare
to run into this anyway
* Closes elastic#41363
original-brownbear added a commit that referenced this pull request Apr 24, 2019
* Due to #40866 one of the two parallel bulk requests can randomly be
rejected outright when the write queue is full already, we can catch
this situation and ignore it since we can still have the rejection for
the dynamic mapping update for the other request and it's somewhat rare
to run into this anyway
* Closes #41363
akhil10x5 pushed a commit to akhil10x5/elasticsearch that referenced this pull request May 2, 2019
* Due to elastic#40866 one of the two parallel bulk requests can randomly be
rejected outright when the write queue is full already, we can catch
this situation and ignore it since we can still have the rejection for
the dynamic mapping update for the other request and it's somewhat rare
to run into this anyway
* Closes elastic#41363
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
* Bulk requests can be thousands of items large and take more than O(10ms) time to handle => we should not handle them on the transport threadpool to not block select loops
* relates elastic#39128
* relates elastic#39658
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
* The test fails for the retry backoff enabled case because the retry handler in the bulk processor hasn't been adjusted to account for elastic#40866 which now might lead to an outright rejection of the request instead of its items individually
   * Fixed by adding retry functionality to the top level request as well
* Also fixed the duplicate test for the HLRC that wasn't handling the non-backoff case yet the same way the non-client IT did
* closes elastic#41324
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
* Due to elastic#40866 one of the two parallel bulk requests can randomly be
rejected outright when the write queue is full already, we can catch
this situation and ignore it since we can still have the rejection for
the dynamic mapping update for the other request and it's somewhat rare
to run into this anyway
* Closes elastic#41363
@Bukhtawar
Contributor

Bukhtawar commented Jul 21, 2020

@original-brownbear are there plans to do something similar for TransportSearchAction too, to reject early, given that the search threadpool queue is 1000 and some of the initial query handling is done on network threads? I believe response handling is also on the transport worker (unsure about the latest version)

@original-brownbear
Contributor Author

@Bukhtawar no, there are no plans like that. For the time being, some initial request deserialization and handling as well as request serialization happens on the network thread, and there are no plans to change that at this point (though if that's an issue in some configurations one could reasonably look into doing something like that).

@Bukhtawar
Contributor

Bukhtawar commented Jul 21, 2020

 curl -s "localhost:9200/_tasks?nodes=<node_id>&pretty"| grep action| sort -nr| uniq -c                                                                                         
  30210           "action" : "indices:data/write/update/byquery",
  28168           "action" : "indices:data/read/search",
     22           "action" : "indices:data/read/scroll/clear",
    108           "action" : "indices:admin/seq_no/retention_lease_background_sync[p]",
    108           "action" : "indices:admin/seq_no/retention_lease_background_sync",
      2           "action" : "indices:admin/seq_no/global_checkpoint_sync[p]",
      2           "action" : "indices:admin/seq_no/global_checkpoint_sync",
      1           "action" : "cluster:monitor/tasks/lists[n]",
      1           "action" : "cluster:monitor/tasks/lists",

I have a corresponding heap dump which I can share; roughly 1/3 of the heap was transport requests, if that makes sense @original-brownbear

Edit: Note the intent here might just be to reject early to protect the cluster, though, and may not necessarily help with the network threads


Labels

>bug, :Distributed Coordination/Network (Http and internode communication implementations), v7.2.0, v8.0.0-alpha1
