
Conversation

@original-brownbear
Contributor

@original-brownbear original-brownbear commented Apr 4, 2019

  • Bulk requests can contain thousands of items and take more than O(10ms) to handle => we should not handle them on the transport threadpool, so that we don't block the select loops
    • Adjusted some unit tests accordingly; used the direct executor there to keep things deterministic that were deterministic before
    • Since we already have multiple context switches and IO during bulk request handling, I don't see how adding this one additional switch would have any noticeable performance impact in the real world?
  • relates Log a Warning if NioSelector Thread is Blocked for an Extended Period of Time #39128
  • relates Audit Logging on the IO Thread Appears to Cause Instability #39658, since it moves the audit logging that was causing trouble on the transport thread to the generic thread pool

Revisited this while profiling for the slowest executions on the transport thread and finding that bulk request handling takes quite some time (in tests up to 200ms!) for larger requests. I'm not sure if there's a better alternative to the generic pool, but it seemed like a safe bet?
Marking this as discuss since it's just a suggestion and there could be side effects to this that I'm missing.
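
For illustration, a minimal self-contained sketch of the idea (the class and names below are hypothetical, not the actual TransportBulkAction code): instead of walking all the bulk items inline on the network (selector) thread, the handler just enqueues the work onto a worker pool and returns.

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.function.Consumer;

    // Hypothetical sketch: hand bulk handling off to a worker pool instead of
    // running it inline on the network (selector) thread.
    class BulkDispatchSketch {
        private final ExecutorService workerPool = Executors.newFixedThreadPool(4);

        // Called on the network thread when a bulk request arrives.
        void handleBulk(List<String> items, Consumer<String> listener) {
            // Don't iterate over thousands of items here; just enqueue the work.
            workerPool.execute(() -> {
                // Runs on a worker thread: per-item resolution, ingest/index checks, etc.
                int handled = 0;
                for (String item : items) {
                    handled++; // placeholder for the real per-item handling
                }
                listener.accept("handled " + handled + " items off the network thread");
            });
            // The network thread returns immediately and can service other channels.
        }
    }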

* Bulk requests can be thousands of items large and take more than O(10ms) time to handle => we should not handle them on the transport threadpool to not block select loops
* relates elastic#39128
* closes elastic#39658
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Member

@jaymode jaymode left a comment


As a workaround, LGTM, but I would not have this PR close the issue; we are just masking the problem. I think the @elastic/es-security team needs to evaluate that issue, investigate what can be done, and ultimately come to a decision. As it stands, nothing will catch or prevent this type of issue from cropping up in a new action or in the way security audits requests.

I'd definitely like for @ywelsch to chime in prior to merging.

@original-brownbear
Contributor Author

@jaymode sounds good, I updated the description to not close the issue then :)
I think @ywelsch will be off for the next two weeks but I guess this one can wait right?

@original-brownbear
Contributor Author

@DaveCTurner this might interest you btw :) You mentioned running into issues with logging about long compute times on the IO thread. I wonder if this helps (not sure if you have some reproducer for the issues you're seeing)?

Contributor

@DaveCTurner DaveCTurner left a comment


Unfortunately I don't have a reliable reproduction of the issues I occasionally see on my CI machine. But typically I was seeing things blocked for many seconds and not obviously when it would have been processing enormous bulks.

If we are to use a different threadpool, it seems to me that the WRITE threadpool might be more appropriate for this?

I also think that this wants some benchmarking to back up the (IMO reasonable) argument that one more context switch can't possibly hurt here.

@original-brownbear
Contributor Author

Unfortunately I don't have a reliable reproduction of the issues I occasionally see on my CI machine. But typically I was seeing things blocked for many seconds and not obviously when it would have been processing enormous bulks.

Hmm yeah, that seems unlikely, unless something in the bulk processing here can block for a long time (not sure what, and I looked pretty extensively).

If we are to use a different threadpool, it seems to me that the WRITE threadpool might be more appropriate for this?

The reason I didn't go with the WRITE pool was that it is bounded and can result in rejections right away. That may alter behavior in situations where there are bulk rejections already? Maybe that's ok though, but for now this seemed like the more neutral change. Moving to WRITE seems like it would somewhat alter the dynamics under significant load, because we'd be doing more things on the WRITE pool and you'd have to size it larger to retain the same behavior you had before?

I also think that this wants some benchmarking to back up the (IMO reasonable) argument that one more context switch can't possibly hurt here.

This seems like a situation similar to #39286 where the benchmarking team also told us to merge and then look at the nightly performance.
I ran this through the PMC track and I couldn't get any statistically significant change in throughput though. Just theoretically, it seems pretty much impossible to see one in an isolated benchmark.
I guess the interesting thing would rather be if there are situations where we have some more systemic effect for systems under high indexing load whose GENERIC pool is now under more pressure, but I'm not sure how to benchmark that easily.

@jaymode
Member

jaymode commented Apr 8, 2019

If we are to use a different threadpool, it seems to me that the WRITE threadpool might be more appropriate for this?

I'd argue against doing this since this action is not actually writing data; the auditing action is writing logs. Ultimately the data writes will occur on the write threadpool with the TransportShardBulkAction. We're just moving off of a network thread here to avoid blocking those on I/O and I think adding additional heavy I/O to the threads in the write threadpool could lead to unintended consequences.

@tomcallahan
Contributor

@jasontedor can you please summarize the thoughts from our FixIt meeting?

@jasontedor
Member

I am not in favor of shuttling this kind of work over to the generic thread pool. The generic thread pool is large (can scale up to at least 128 threads and in some cases 512), and doesn't have a bounded queue. That means there would be effectively no backpressure here. While we have a goal to remove these limitations, we don't today (it's not easy because we can deadlock ourselves today if the generic thread pool is too small) and so I would not want to see the generic thread pool used for high throughput work like handling bulk requests.
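
To make the backpressure point concrete, a small stand-alone Java sketch (plain JDK executors, not Elasticsearch's actual thread-pool classes): a pool with a bounded queue surfaces saturation as rejections, which pushes back on callers, while a pool with an unbounded queue accepts everything and just lets work pile up.

    import java.util.concurrent.*;

    class BackpressureSketch {
        public static void main(String[] args) {
            // WRITE-style pool: fixed size, bounded queue -> saturation surfaces as rejections.
            ThreadPoolExecutor bounded = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(10),            // bounded queue
                new ThreadPoolExecutor.AbortPolicy());   // reject when full

            // GENERIC-style pool: unbounded queue -> never rejects, no backpressure.
            ThreadPoolExecutor unbounded = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());            // unbounded queue

            Runnable slowTask = () -> {
                try { Thread.sleep(100); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            };

            int rejected = 0;
            for (int i = 0; i < 100; i++) {
                try {
                    bounded.execute(slowTask);
                } catch (RejectedExecutionException e) {
                    rejected++;                          // caller is told to back off
                }
                unbounded.execute(slowTask);             // always accepted, queue just grows
            }
            System.out.println("bounded pool rejected " + rejected + " of 100 tasks");
            System.out.println("unbounded pool queued " + unbounded.getQueue().size() + " tasks");
            bounded.shutdown();
            unbounded.shutdown();
        }
    }

Running this, the bounded pool rejects most of the 100 tasks almost immediately, while the unbounded pool queues them all; that silently growing queue is the missing backpressure.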

@jaymode
Member

jaymode commented Apr 12, 2019

@jasontedor other than not using the generic threadpool, do you have other thoughts on how you'd like to see this addressed? When security and auditing are enabled, blocking I/O is performed on a network thread in the case of a bulk request.

@jasontedor
Member

How about using a different thread pool only if security and auditing are enabled, and otherwise keeping this as-is today? I would even be in favor of a dedicated security- or audit-specific thread pool. This will require adding some hooks.

@original-brownbear
Contributor Author

@jasontedor but wouldn't it make sense, in general, to get the looping through these potentially huge requests off of the IO threads? (with all the ingest and index-creation checks, such large requests can take quite a bit of time to process)

@jasontedor
Member

I don't know, has it proven to be a problem today other than the audit logging case?

@original-brownbear
Contributor Author

@jasontedor

I don't know, has it proven to be a problem today other than the audit logging case?

One data point I could give here is that running e.g. Rally's PMC track with default settings results in ~100ms average runtime for org.elasticsearch.action.bulk.TransportBulkAction#doExecute.

So for large bulk requests, we're definitely introducing significant latency on the Selector loop(s).
At least to me, it seems wrong that a large bulk request can arbitrarily slow down any other network IO by this much.
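
As a rough illustration of why that matters (a toy event loop, not the actual transport code): everything that becomes ready on a selector is handled sequentially on the same thread, so a single 100ms handler adds 100ms of latency to every other channel waiting behind it.

    import java.util.List;

    // Toy event loop: ready events are handled one after another on the same thread,
    // so a single 100ms handler adds 100ms of latency to every event behind it.
    class SelectorLoopSketch {
        static void runOnce(List<Runnable> readyHandlers) {
            for (Runnable handler : readyHandlers) {
                long start = System.nanoTime();
                handler.run(); // if this is a large bulk request, everything below waits
                long tookMs = (System.nanoTime() - start) / 1_000_000;
                if (tookMs > 10) {
                    System.out.println("handler blocked the loop for " + tookMs + " ms");
                }
            }
        }

        public static void main(String[] args) {
            Runnable largeBulk = () -> {
                try { Thread.sleep(100); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            };
            Runnable smallPing = () -> { /* cheap read, would normally finish in microseconds */ };
            // The ping only runs after the bulk handler has finished.
            runOnce(List.of(largeBulk, smallPing));
        }
    }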

@jasontedor
Member

One data point I could give here is that running e.g. Rally's PMC track with default settings results in ~100ms average runtime for org.elasticsearch.action.bulk.TransportBulkAction#doExecute.

Okay, that is significant and I agree worth addressing.

As I think about this some more, I have general concerns about the blocking behavior of audit logging; in theory, this blocking on the network thread could be problematic for any request.

I think we have two problems to solve here then:

  • the fact that we are doing more expensive work than we should be on the network thread -> we need to move to another thread pool
  • the fact that audit logging can block the networking thread; I think we need to find a general solution to this problem -> maybe we need a different model there, like audit logging writes being put into a non-blocking queue (a rough sketch of that idea is below)

So let's focus this pull request on the first problem and try to figure out which thread pool we should use. I continue to have strong reservations about using the generic thread pool for the reasons that I gave previously. In the fix-it meeting yesterday I said I would be fine with the write thread pool. I understand @jaymode's concerns about that, but if I understand them correctly they are related to the concerns about audit logging, which I think we should explore separately.
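
For reference, one possible shape of such a model for the second problem (a purely hypothetical sketch, not the existing audit-logging code): callers only offer the event to a bounded queue and never block, a dedicated drain thread performs the potentially blocking write, and overflow is counted or dropped instead of stalling the network thread.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical non-blocking audit logger: callers never block, a single
    // background thread performs the (possibly slow) writes.
    class AsyncAuditSketch {
        private final BlockingQueue<String> events = new ArrayBlockingQueue<>(10_000);
        private final AtomicLong dropped = new AtomicLong();

        AsyncAuditSketch() {
            Thread drainer = new Thread(() -> {
                try {
                    while (true) {
                        String event = events.take();   // blocks only the drain thread
                        write(event);                   // the slow, blocking part
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "audit-drainer");
            drainer.setDaemon(true);
            drainer.start();
        }

        // Called from the network thread: O(1), never blocks.
        void audit(String event) {
            if (events.offer(event) == false) {
                dropped.incrementAndGet();              // or escalate, but don't block
            }
        }

        private void write(String event) {
            // stand-in for the real audit sink (file, index, ...)
            System.out.println("AUDIT " + event);
        }
    }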

@original-brownbear
Contributor Author

@jasontedor sounds good :)

In the fix-it meeting yesterday I said I would be fine with the write thread pool.

Already moved to that this afternoon in 1b05e50 :)

@original-brownbear
Contributor Author

thanks @jasontedor @jaymode @DaveCTurner !

@original-brownbear original-brownbear merged commit 2931e9c into elastic:master Apr 17, 2019
@original-brownbear original-brownbear deleted the 39658 branch April 17, 2019 18:28
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 17, 2019
* Bulk requests can be thousands of items large and take more than O(10ms) time to handle => we should not handle them on the transport threadpool to not block select loops
* relates elastic#39128
* relates elastic#39658
original-brownbear added a commit that referenced this pull request Apr 18, 2019
* Bulk requests can be thousands of items large and take more than O(10ms) time to handle => we should not handle them on the transport threadpool to not block select loops
* relates #39128
* relates #39658
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 18, 2019
* The test fails for the retry backoff enabled case because the retry handler in the bulk processor hasn't been adjusted to account for elastic#40866 which now might lead to an outright rejection of the request instead of its items individually
   * Fixed by adding retry functionality to the top level request as well
* Also fixed the duplicate test for the HLRC that wasn't handling the non-backoff case yet the same way the non-client IT did
* closes elastic#41324
original-brownbear added a commit that referenced this pull request Apr 18, 2019
* The test fails for the retry backoff enabled case because the retry handler in the bulk processor hasn't been adjusted to account for #40866 which now might lead to an outright rejection of the request instead of its items individually
   * Fixed by adding retry functionality to the top level request as well
* Also fixed the duplicate test for the HLRC that wasn't handling the non-backoff case yet the same way the non-client IT did
* closes #41324
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 23, 2019
* Due to elastic#40866 one of the two parallel bulk requests can randomly be
rejected outright when the write queue is full already, we can catch
this situation and ignore it since we can still have the rejection for
the dynamic mapping update for the other request and it's somewhat rare
to run into this anyway
* Closes elastic#41363
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 24, 2019
* The test fails for the retry backoff enabled case because the retry handler in the bulk processor hasn't been adjusted to account for elastic#40866 which now might lead to an outright rejection of the request instead of its items individually
   * Fixed by adding retry functionality to the top level request as well
* Also fixed the duplicate test for the HLRC that wasn't handling the non-backoff case yet the same way the non-client IT did
* closes elastic#41324
original-brownbear added a commit that referenced this pull request Apr 24, 2019
* The test fails for the retry backoff enabled case because the retry handler in the bulk processor hasn't been adjusted to account for #40866 which now might lead to an outright rejection of the request instead of its items individually
   * Fixed by adding retry functionality to the top level request as well
* Also fixed the duplicate test for the HLRC that wasn't handling the non-backoff case yet the same way the non-client IT did
* closes #41324
original-brownbear added a commit that referenced this pull request Apr 24, 2019
* Due to #40866 one of the two parallel bulk requests can randomly be
rejected outright when the write queue is full already, we can catch
this situation and ignore it since we can still have the rejection for
the dynamic mapping update for the other request and it's somewhat rare
to run into this anyway
* Closes #41363
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 24, 2019
* Due to elastic#40866 one of the two parallel bulk requests can randomly be
rejected outright when the write queue is full already, we can catch
this situation and ignore it since we can still have the rejection for
the dynamic mapping update for the other request and it's somewhat rare
to run into this anyway
* Closes elastic#41363
original-brownbear added a commit that referenced this pull request Apr 24, 2019
* Due to #40866 one of the two parallel bulk requests can randomly be
rejected outright when the write queue is full already, we can catch
this situation and ignore it since we can still have the rejection for
the dynamic mapping update for the other request and it's somewhat rare
to run into this anyway
* Closes #41363
akhil10x5 pushed a commit to akhil10x5/elasticsearch that referenced this pull request May 2, 2019
* Due to elastic#40866 one of the two parallel bulk requests can randomly be
rejected outright when the write queue is full already, we can catch
this situation and ignore it since we can still have the rejection for
the dynamic mapping update for the other request and it's somewhat rare
to run into this anyway
* Closes elastic#41363
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
* Bulk requests can be thousands of items large and take more than O(10ms) time to handle => we should not handle them on the transport threadpool to not block select loops
* relates elastic#39128
* relates elastic#39658
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
* The test fails for the retry backoff enabled case because the retry handler in the bulk processor hasn't been adjusted to account for elastic#40866 which now might lead to an outright rejection of the request instead of its items individually
   * Fixed by adding retry functionality to the top level request as well
* Also fixed the duplicate test for the HLRC that wasn't handling the non-backoff case yet the same way the non-client IT did
* closes elastic#41324
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
* Due to elastic#40866 one of the two parallel bulk requests can randomly be
rejected outright when the write queue is full already, we can catch
this situation and ignore it since we can still have the rejection for
the dynamic mapping update for the other request and it's somewhat rare
to run into this anyway
* Closes elastic#41363
@Bukhtawar
Contributor

Bukhtawar commented Jul 21, 2020

@original-brownbear are there plans to do something similar for TransportSearchAction too, to reject early, given that the search threadpool queue is 1000 and some of the initial query handling is done on network threads? I believe response handling is also on the transport worker (unsure about the latest version)

@original-brownbear
Contributor Author

@Bukhtawar no, there are no plans like that. For the time being, some initial request deserialization and handling as well as request serialization happens on the network thread, and there are no plans to change that at this point (though if that's an issue in some configurations one could reasonably look into doing something like that).

@Bukhtawar
Contributor

Bukhtawar commented Jul 21, 2020

 curl -s "localhost:9200/_tasks?nodes=<node_id>&pretty"| grep action| sort -nr| uniq -c                                                                                         
  30210           "action" : "indices:data/write/update/byquery",
  28168           "action" : "indices:data/read/search",
     22           "action" : "indices:data/read/scroll/clear",
    108           "action" : "indices:admin/seq_no/retention_lease_background_sync[p]",
    108           "action" : "indices:admin/seq_no/retention_lease_background_sync",
      2           "action" : "indices:admin/seq_no/global_checkpoint_sync[p]",
      2           "action" : "indices:admin/seq_no/global_checkpoint_sync",
      1           "action" : "cluster:monitor/tasks/lists[n]",
      1           "action" : "cluster:monitor/tasks/lists",

I have a corresponding heap dump which I can share; roughly 1/3 of the heap was transport requests, if that makes sense @original-brownbear

Edit: Note the intent here might just be to reject early to protect the cluster, though, and may not necessarily help with the network threads


Labels

>bug, :Distributed Coordination/Network (Http and internode communication implementations), v7.2.0, v8.0.0-alpha1
