Conversation

@benwtrent
Member

This PR enables stats on inference to be gathered and stored in the .ml-stats-* indices.

Each node + model_id will have its own running stats document and these will later be summed together when returning _stats to the user.
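
A minimal sketch of that summing step (the class, record, and field names here are hypothetical stand-ins, not the PR's actual code): each (node, model_id) pair owns one running stats doc, and the `_stats` response merges them per model.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration of merging per-node stats docs per model_id.
public class StatsSummer {

    // Stand-in for one per-node running stats document.
    public record NodeStats(String nodeId, String modelId, long inferenceCount) {}

    // Group the per-node docs by model and sum their counters.
    public static Map<String, Long> sumByModel(List<NodeStats> docs) {
        return docs.stream().collect(Collectors.groupingBy(
            NodeStats::modelId,
            Collectors.summingLong(NodeStats::inferenceCount)));
    }

    public static void main(String[] args) {
        List<NodeStats> docs = List.of(
            new NodeStats("node-1", "model-a", 10),
            new NodeStats("node-2", "model-a", 5),
            new NodeStats("node-1", "model-b", 7));
        System.out.println(sumByModel(docs).get("model-a")); // 15
    }
}
```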

.ml-stats-* is ILM managed (when possible), so the underlying index can change at any point. A stats document that is read in and later updated may therefore end up as a new doc in a new index. This complicates matters: keeping a running knowledge of seq_no and primary_term is close to impossible, because we don't know the latest index name.

We should also strive for throughput, as this code sits in the middle of an ingest pipeline (or even a query).
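
Since this code sits on the ingest hot path, one throughput-friendly shape (a hypothetical sketch, not the PR's actual classes) is to bump lock-free counters per document and let a background task snapshot them into the stats doc on an interval, so no index round-trip happens per inference.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: lock-free per-node counters that ingest threads can
// bump cheaply; a scheduled persister reads the totals to build the stats doc.
public class InferenceStatsAccumulator {
    private final LongAdder inferenceCount = new LongAdder();
    private final LongAdder failureCount = new LongAdder();

    // Called from the ingest pipeline / query path: cheap, low contention.
    public void incInference() { inferenceCount.increment(); }
    public void incFailure()   { failureCount.increment(); }

    // Called from the periodic persister to snapshot current totals.
    public long inferences() { return inferenceCount.sum(); }
    public long failures()   { return failureCount.sum(); }

    public static void main(String[] args) {
        InferenceStatsAccumulator stats = new InferenceStatsAccumulator();
        for (int i = 0; i < 1000; i++) stats.incInference();
        stats.incFailure();
        System.out.println(stats.inferences() + " " + stats.failures()); // 1000 1
    }
}
```

`LongAdder` is preferred over `AtomicLong` here because it spreads contended updates across cells, which matters when many ingest threads hit the same counter.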

@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

ActionListener.wrap(
    r -> this.loadModel(modelId, r),
    e -> {
        logger.error("[{}] failed to get previous model stats", modelId);
Member Author


We might actually want to fail completely if the unwrapped error is anything other than a ResourceNotFound. If .ml-stats-* exists but has unallocated primary shards, we may want to bail.
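
A hedged sketch of that idea (the exception class here is a stand-in for the real org.elasticsearch one, and the method names are hypothetical): unwrap the failure and only treat a missing stats doc as "start fresh"; anything else propagates.

```java
import java.util.function.Consumer;

// Hypothetical sketch of failure handling for the previous-stats lookup.
public class ErrorHandling {

    // Stand-in for Elasticsearch's resource-not-found exception type.
    public static class ResourceNotFoundException extends RuntimeException {}

    public static void handleStatsFailure(Exception e,
                                          Runnable startFresh,
                                          Consumer<Exception> fail) {
        Throwable cause = e.getCause() != null ? e.getCause() : e;
        if (cause instanceof ResourceNotFoundException) {
            startFresh.run();   // no previous stats doc yet: begin from zero
        } else {
            fail.accept(e);     // e.g. unallocated primary shards: bail
        }
    }

    public static void main(String[] args) {
        handleStatsFailure(new ResourceNotFoundException(),
            () -> System.out.println("fresh"),
            ex -> System.out.println("bail")); // prints "fresh"
    }
}
```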

Member

@davidkyle left a comment


Everything looks good, but I'd like you to reconsider how valuable the time spent metric is.

If only a few documents are inferred, the time spent in nanos will be rounded down to 0 millis.
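
The rounding loss is easy to demonstrate (a standalone sketch, not code from the PR): sub-millisecond per-call timings truncate to zero when converted to millis.

```java
import java.util.concurrent.TimeUnit;

// Demonstrates the truncation concern: fast inferences take well under a
// millisecond each, so per-call nanosecond timings round down to 0 ms.
public class TimingLoss {
    public static void main(String[] args) {
        long perCallNanos = 250_000L;            // 0.25 ms per inference
        long fiveCallsNanos = 5 * perCallNanos;  // 1.25 ms total
        System.out.println(TimeUnit.NANOSECONDS.toMillis(perCallNanos));   // 0
        System.out.println(TimeUnit.NANOSECONDS.toMillis(fiveCallsNanos)); // 1
    }
}
```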

The overhead of measuring the time is tiny compared to the cost of the modelling function, but there is still overhead in calling System.nanoTime().

My main concern is that we give a false impression of accuracy when we know there is rounding error and loss. How valuable will the user find the metric?

Let's open up the discussion. My feeling is that it is not necessary: it is easy to put the timing code back in a later version if required, but difficult to remove.

@bpintea bpintea added v7.8.0 and removed v7.7.0 labels Mar 25, 2020
@benwtrent
Member Author

I'd like you to reconsider how valuable the time spent metric is.

I included it as it is a statistic gathered by ingest. But I do agree: the overhead and loss of accuracy here might not make it worth it.

All the other use cases of inference include timing stats themselves (ingest and search). So information on how long a model takes to infer can be, well... inferred from the time used in search/ingest.

I have no problem removing timing stats right now.

@benwtrent
Member Author

@elasticmachine update branch

@benwtrent
Member Author

@elasticmachine update branch

@benwtrent benwtrent requested a review from davidkyle April 1, 2020 12:57
@benwtrent
Member Author

@elasticmachine update branch

Member

@davidkyle left a comment


LGTM

The persist interval can probably be bumped by a couple of seconds.

Member

@davidkyle left a comment


LGTM

@benwtrent benwtrent merged commit c087ee1 into elastic:master Apr 3, 2020
@benwtrent benwtrent deleted the feature/ml-inference-stats-collection branch April 3, 2020 18:09
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Apr 13, 2020
benwtrent added a commit that referenced this pull request Apr 13, 2020
* [ML] Start gathering and storing inference stats (#53429)
