Skip to content

Conversation

@droberts195
Copy link

@droberts195 droberts195 commented Feb 4, 2020

This change adds support for the following new model_size_stats
fields:

  • categorized_doc_count
  • total_category_count
  • frequent_category_count
  • rare_category_count
  • dead_category_count
  • categorization_status

Relates #50749

This change adds support for the following new model_size_stats
fields:

- categorized_doc_count
- total_category_count
- frequent_category_count
- rare_category_count
- categorization_status
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@droberts195
Copy link
Author

This is labelled >non-issue because a corresponding ml-cpp PR will populate the stats and the release note will be added as part of that PR.

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Questions outside this PR:

Is there a plan to add a dead flag to the categories that are superseded? Or are dead categories just not returned via the results API?

Similarly, I wonder if including some simple distribution/ranking data in the categories is also good. Knowing common/rare categories might be useful

.setAliases("mbaf", "modelBucketAllocationFailures")
.build());
table.addCell("model.categorization_status",
TableColumnAttributeBuilder.builder("current categorization status")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is configured to be included by default, you will probably have to update the cat job yaml test

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, since only a small proportion of jobs use categorization I should probably change this one so it's not included by default.

@droberts195
Copy link
Author

Is there a plan to add a dead flag to the categories that are superseded?

This is a possibility. The dead categories could contain a field that says which other category killed them.

Or are dead categories just not returned via the results API?

They have to be returned forever, because we could have generated anomalies that reference them. (It would be unusual, as dead categories tend to occur early on in a job's lifetime, before we start generating anomalies, but nevertheless possible.) Still, I agree it would be useful to have the information.

Similarly, I wonder if including some simple distribution/ranking data in the categories is also good. Knowing common/rare categories might be useful

This has been discussed before, and the problem is how often to update the information. The running C++ process already knows both the match count and rank of every category. However, if the category documents were rewritten every time one of these changed then that would lead to every input document causing a result to be written, and the resulting indexing load would be unacceptable. So the match count and rank would need to be updated just occasionally. We would need to make sure that the way "occasionally" was done didn't invalidate the likely uses of these statistics.

@droberts195 droberts195 merged commit 72346b9 into elastic:master Feb 6, 2020
@droberts195 droberts195 deleted the categorizer_model_stats branch February 6, 2020 17:08
droberts195 pushed a commit to droberts195/elasticsearch that referenced this pull request Feb 7, 2020
The new stats will be available from 7.7.0, but the
BWC tests in master need disabling until the backport
PR is merged.

Relates elastic#51879
Relates elastic#52009
droberts195 pushed a commit that referenced this pull request Feb 10, 2020
The new stats will be available from 7.7.0, but the
BWC tests in master need disabling until the backport
PR is merged.

Relates #51879
Relates #52009
droberts195 pushed a commit that referenced this pull request Feb 10, 2020
This change adds support for the following new model_size_stats
fields:

- categorized_doc_count
- total_category_count
- frequent_category_count
- rare_category_count
- dead_category_count
- categorization_status

Backport of #51879
droberts195 pushed a commit to droberts195/elasticsearch that referenced this pull request Feb 11, 2020
In elastic#51146 a rudimentary check for poor categorization was added to
7.6.

This change replaces that warning based on a Java-side check with
a new one based on the categorization_status field that the ML C++
sets.  categorization_status was added in 7.7 and above by elastic#51879,
so this new warning based on more advanced conditions will also be
in 7.7 and above.

Closes elastic#50749
droberts195 pushed a commit that referenced this pull request Feb 11, 2020
…2195)

In #51146 a rudimentary check for poor categorization was added to
7.6.

This change replaces that warning based on a Java-side check with
a new one based on the categorization_status field that the ML C++
sets.  categorization_status was added in 7.7 and above by #51879,
so this new warning based on more advanced conditions will also be
in 7.7 and above.

Closes #50749
droberts195 pushed a commit that referenced this pull request Feb 11, 2020
…2195)

In #51146 a rudimentary check for poor categorization was added to
7.6.

This change replaces that warning based on a Java-side check with
a new one based on the categorization_status field that the ML C++
sets.  categorization_status was added in 7.7 and above by #51879,
so this new warning based on more advanced conditions will also be
in 7.7 and above.

Closes #50749
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants