
Conversation

@spinscale (Contributor) commented Jul 31, 2018

Currently a watch execution results in only one bulk request: the one that writes the
triggered watches that need to be executed into the triggered watches index. However,
the update of the watch status, the creation of the watch history entry, and the
deletion of the entry from the triggered watches index are all single-document
operations.

This can have quite a negative impact once you are executing a lot of watches, as
each execution results in four document writes, three of them being single-document
actions.

This commit switches to a bulk processor instead of single-document actions for
writing watch history entries and deleting triggered watch entries. The defaults,
however, still behave synchronously as before, because the number of concurrent
requests is set to 0. This also fixes a bug where the deletion of the triggered
watch entry was done asynchronously.
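
For illustration, here is a minimal sketch of how such a processor can be wired up with Elasticsearch's BulkProcessor API (6.x signatures). The class name is a placeholder and not the actual Watcher code; the values mirror the defaults of the settings listed further below.

import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;

class WatcherBulkProcessorSketch {

    static BulkProcessor build(Client client) {
        return BulkProcessor.builder(client, new BulkProcessor.Listener() {
                @Override
                public void beforeBulk(long executionId, BulkRequest request) {
                    // nothing to do before the bulk request is sent
                }

                @Override
                public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                    // check response.hasFailures() and log failed history writes or deletes
                }

                @Override
                public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                    // the whole bulk request failed, e.g. while the node was shutting down
                }
            })
            .setBulkActions(1)                                  // xpack.watcher.bulk.actions
            .setConcurrentRequests(0)                           // xpack.watcher.bulk.concurrent_requests
            .setFlushInterval(TimeValue.timeValueSeconds(1))    // xpack.watcher.bulk.flush_interval
            .setBulkSize(new ByteSizeValue(1, ByteSizeUnit.MB)) // xpack.watcher.bulk.size
            .build();
    }
}

With zero concurrent requests, BulkProcessor executes a flushed bulk on the thread that added the last request, which is what keeps the default behaviour synchronous.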

However, if you have a high number of watches being executed, you can configure
watcher to delete the triggered watch entries and write the watch history entries
via batched bulk requests.

The triggered watch deletions should still happen in a timely manner, whereas the
history entries are more likely to be bound by the size threshold, as a single
entry can easily be 20kb.

The following settings have been added:

  • xpack.watcher.bulk.actions (default 1)
  • xpack.watcher.bulk.concurrent_requests (default 0)
  • xpack.watcher.bulk.flush_interval (default 1s)
  • xpack.watcher.bulk.size (default 1mb)
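
As an example only (the values below are illustrative assumptions, not recommendations), a deployment that executes many watches could raise these thresholds, shown here via the node Settings builder:

import org.elasticsearch.common.settings.Settings;

class WatcherBulkTuningSketch {
    static Settings highThroughputSettings() {
        return Settings.builder()
            .put("xpack.watcher.bulk.actions", 100)            // flush once 100 actions have been queued
            .put("xpack.watcher.bulk.concurrent_requests", 1)  // allow one asynchronous bulk request in flight
            .put("xpack.watcher.bulk.flush_interval", "5s")    // or flush at the latest every five seconds
            .put("xpack.watcher.bulk.size", "5mb")             // or once roughly 5mb of requests have accumulated
            .build();
    }
}

The same four keys can equally be set in elasticsearch.yml. With history entries of roughly 20kb each, a 5mb size threshold corresponds to about 250 buffered entries before a size-based flush, while the flush interval bounds how long a triggered watch deletion can stay buffered.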

The drawback of this is, of course, that on a node outage you might end up with
watch history entries not being written, or with watches being executed again
because they have not been deleted from the triggered watches index. The window
for these two cases grows when the bulk processor is configured to wait until
certain thresholds are reached before flushing.

@elasticmachine (Collaborator)

Pinging @elastic/es-core-infra

@spinscale (Contributor, Author)

@elasticmachine retest this please

@hub-cap (Contributor) left a comment

wow this actually cleaned things up a good bit. Great job. I had one question about sync vs async deletes, and once you answer me on that, this is good to go. I'll pre-approve your loan tho :)

client.delete(request); // FIXME shouldn't we wait before saying the delete was successful
}
logger.trace("successfully deleted triggered watch with id [{}]", wid);
bulkProcessor.add(request);

@hub-cap (Contributor)

Does executing watches take into consideration any deleted requests here? What happens if I delete a watch and then wait to submit the bulk, might that watch trigger again? Should we make deletes the only sync thing? I think that's a fair tradeoff given my very little knowledge here.

@spinscale (Contributor, Author)

Yes, it does. If the watch is deleted and then a triggered watch is picked up, there will be a watch history entry telling you that the referenced watch could not be found.

@spinscale (Contributor, Author)

I thought about this as I wanted to merge and made a change that warrants some discussion. I have added a new commit that uses only one bulk processor and one set of settings instead. Do you think that's better, or should it be reverted so we go with two?

My initial thought was that splitting those makes sense because the documents are extremely different in size, so one could configure different settings for each - but in the end you configure flushing based on an interval or on size, so all of this can be done within one bulk processor. One bulk processor with correctly tuned settings will be as good as two.
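
For illustration, a minimal sketch of why a single processor suffices (assuming 6.x request APIs; the index names, document type, and identifiers are made up, not taken from this PR): index and delete operations can travel in the same bulk request.

import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.delete.DeleteRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.common.xcontent.XContentType;

class SharedBulkSketch {
    // Both operation types go through the same processor and end up in the same
    // bulk request once a flush threshold is reached.
    static void recordExecution(BulkProcessor bulkProcessor, String historyEntryJson, String triggeredWatchId) {
        // write the watch history entry (illustrative index and type names)
        bulkProcessor.add(new IndexRequest(".watcher-history", "doc").source(historyEntryJson, XContentType.JSON));
        // delete the processed entry from the triggered watches index
        bulkProcessor.add(new DeleteRequest(".triggered_watches", "doc", triggeredWatchId));
    }
}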

@hub-cap (Contributor) commented Aug 21, 2018

I think this change makes sense. It's less junk in the services, making them easier to test, and it removes a bunch of nested close logic that we might forget about if we change or add something. As long as we don't start adding this to the components list for Guice, I'm all for it ;)

I also don't see the value in two different sets of settings, given that we can flush on size and time. I agree with your tuning statement above. :shipit:

@hub-cap (Contributor) commented Aug 21, 2018

Be sure to modify your commit message and remove the extra set of settings ;)

@hub-cap (Contributor) commented Aug 31, 2018

Changes look good, good luck with the test failure :)

@spinscale (Contributor, Author)

@elasticmachine retest this please

@spinscale merged commit 1391288 into elastic:master Sep 18, 2018
spinscale added a commit that referenced this pull request Sep 18, 2018