
Commit e74b271

szabosteve authored and lcawl committed
[DOCS] Adds text about data types to the categorization docs (#51145)
1 parent e3e2082 commit e74b271

File tree

1 file changed: +37 -14 lines changed


docs/reference/ml/anomaly-detection/categories.asciidoc

Lines changed: 37 additions & 14 deletions
@@ -1,6 +1,28 @@
 [role="xpack"]
 [[ml-configuring-categories]]
-=== Categorizing log messages
+=== Categorizing data
+
+Categorization is a {ml} process that tokenizes a field, clusters similar data
+together, and classifies the clusters into categories. However, categorization
+doesn't work equally well on all data types. It works best on machine-written
+messages and application output, typically data that consists of repeated
+elements, for example log messages collected for system troubleshooting. Log
+categorization groups unstructured log messages into categories; you can then
+use {anomaly-detect} to model and identify rare or unusual counts of log
+message categories.
+
+Categorization is tuned to work best on data like log messages: it takes token
+order into account, does not consider synonyms, and includes stop words in its
+analysis. Complete sentences in human communication or literary text (for
+example emails, wiki pages, prose, or other human-generated content) can be
+extremely diverse in structure. Since categorization is tuned for machine data,
+it gives poor results on such human-generated data; for example, a
+categorization job would create so many categories that they couldn't be
+handled effectively. Categorization is _not_ natural language processing (NLP).
+
+[float]
+[[ml-categorization-log-messages]]
+==== Categorizing log messages
 
 Application log events are often unstructured and contain variable data. For
 example:
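As an aside (not part of this commit's diff), the workflow the new section describes, counting messages by `mlcategory` to surface unusual category counts, can be sketched as a minimal {anomaly-job}. The job name, field names, and bucket span below are illustrative assumptions, not values from the docs page:

[source,console]
--------------------------------------------------
PUT _ml/anomaly_detectors/example_log_categories   <1>
{
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "message",     <2>
    "detectors": [
      {
        "function": "count",
        "by_field_name": "mlcategory",          <3>
        "detector_description": "Unusual message counts"
      }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}
--------------------------------------------------
<1> Hypothetical job name, chosen for this sketch.
<2> The unstructured text field to categorize (an assumed field name).
<3> `mlcategory` makes the detector count events per derived category.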
@@ -65,8 +87,8 @@ defining categories. The categorization filters are applied in the order they
 are listed in the job configuration, which allows you to disregard multiple
 sections of the categorization field value. In this example, we have decided that
 we do not want the detailed SQL to be considered in the message categorization.
-This particular categorization filter removes the SQL statement from the categorization
-algorithm.
+This particular categorization filter removes the SQL statement from the
+categorization algorithm.
 
 If your data is stored in {es}, you can create an advanced {anomaly-job} with
 these same properties:
@@ -79,7 +101,7 @@ NOTE: To add the `categorization_examples_limit` property, you must use the
 
 [float]
 [[ml-configuring-analyzer]]
-==== Customizing the categorization analyzer
+===== Customizing the categorization analyzer
 
 Categorization uses English dictionary words to identify log message categories.
 By default, it also uses English tokenization rules. For this reason, if you use
@@ -135,7 +157,8 @@ here achieves exactly the same as the `categorization_filters` in the first
 example.
 <2> The `ml_classic` tokenizer works like the non-customizable tokenization
 that was used for categorization in older versions of machine learning. If you
-want the same categorization behavior as older versions, use this property value.
+want the same categorization behavior as older versions, use this property
+value.
 <3> By default, English day or month words are filtered from log messages before
 categorization. If your logs are in a different language and contain
 dates, you might get better results by filtering the day or month words in your
@@ -178,9 +201,9 @@ POST _ml/anomaly_detectors/_validate
 If you specify any part of the `categorization_analyzer`, however, any omitted
 sub-properties are _not_ set to default values.
 
-The `ml_classic` tokenizer and the day and month stopword filter are more or less
-equivalent to the following analyzer, which is defined using only built-in {es}
-{ref}/analysis-tokenizers.html[tokenizers] and
+The `ml_classic` tokenizer and the day and month stopword filter are more or
+less equivalent to the following analyzer, which is defined using only built-in
+{es} {ref}/analysis-tokenizers.html[tokenizers] and
 {ref}/analysis-tokenfilters.html[token filters]:
 
 [source,console]
@@ -234,11 +257,11 @@ PUT _ml/anomaly_detectors/it_ops_new_logs3
 <4> Underscores, hyphens, and dots are removed from the beginning of tokens.
 <5> Underscores, hyphens, and dots are also removed from the end of tokens.
 
-The key difference between the default `categorization_analyzer` and this example
-analyzer is that using the `ml_classic` tokenizer is several times faster. The
-difference in behavior is that this custom analyzer does not include accented
-letters in tokens whereas the `ml_classic` tokenizer does, although that could
-be fixed by using more complex regular expressions.
+The key difference between the default `categorization_analyzer` and this
+example analyzer is that using the `ml_classic` tokenizer is several times
+faster. The difference in behavior is that this custom analyzer does not include
+accented letters in tokens whereas the `ml_classic` tokenizer does, although
+that could be fixed by using more complex regular expressions.
 
 If you are categorizing non-English messages in a language where words are
 separated by spaces, you might get better results if you change the day or month
@@ -263,7 +286,7 @@ API examples above.
 
 [float]
 [[ml-viewing-categories]]
-==== Viewing categorization results
+===== Viewing categorization results
 
 After you open the job and start the {dfeed} or supply data to the job, you can
 view the categorization results in {kib}. For example:
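Besides the {kib} view mentioned in this hunk, category definitions can also be retrieved with the get categories API. A hedged sketch against the `it_ops_new_logs3` job shown earlier in this diff (the `page.size` value is an arbitrary choice for illustration):

[source,console]
--------------------------------------------------
GET _ml/anomaly_detectors/it_ops_new_logs3/results/categories
{
  "page": { "size": 5 }   <1>
}
--------------------------------------------------
<1> Returns at most five category definitions, including their regexes and example messages.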
