Commit 0782ae7
[7.x] [ML] Text/Log categorization multi-bucket aggregation (#71752) (#78623)
* [ML] Text/Log categorization multi-bucket aggregation (#71752)

This commit adds a new multi-bucket aggregation: `categorize_text`

The aggregation follows a similar design to significant text in that it reads from `_source` and re-analyzes the text as it is read. The key difference is that it does not use the indexed field's analyzer, but instead relies on the `ml_standard` tokenizer with specialized ML token filters. The tokenizer + filters are the same ones that machine learning categorization anomaly jobs utilize.

The high-level logical flow is as follows:

- At each shard, read in the text field with a custom analyzer using the `ml_standard` tokenizer
- Read the individual tokens from the analyzer
- Feed these tokens to a token tree algorithm (an adaptation of the drain categorization algorithm)
- Gather the individual log categories (the leaf nodes), sort them by doc_count, and ship those buckets to be merged
- Merge all buckets that have the EXACT same key
- Once all buckets are merged, pass those keys + counts to a new token tree for additional merging
- That tree builds the final buckets, which are returned to the user

Algorithm explanation:

- Each log is parsed with the `ml_standard` tokenizer
- Each token is passed into a token tree
- For the first `max_match_token` tokens, each token is stored in the tree; at `max_match_token + 1` (or `len(tokens)`) a log group is created
- If another log group already exists at that leaf, merge into it if the two groups have `similarity_threshold` percent of tokens in common
- Merging simply replaces the tokens that differ between the groups with `*`
- If a layer in the tree already has `max_unique_tokens` children, a `*` child is added and any new tokens are passed through it. The catch is that on the final merge, we first attempt to merge together the subtrees with the smallest number of documents, especially if the new subtree has more documents counted

## Aggregation configuration
Here is an example on some openstack logs:

```js
POST openstack/_search?size=0
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message",          // The field to categorize
        "similarity_threshold": 20,  // Merge log groups if they are this similar
        "max_unique_tokens": 20,     // Max number of children per token position
        "max_match_token": 4,        // Maximum tokens used to build prefix trees
        "size": 1
      }
    }
  }
}
```

This will return buckets like:

```json
"aggregations" : {
  "categories" : {
    "buckets" : [
      {
        "doc_count" : 806,
        "key" : "nova-api.log.1.2017-05-16_13 INFO nova.osapi_compute.wsgi.server * HTTP/1.1 status len time"
      }
    ]
  }
}
```

* fixing for backport
* fixing test after backport
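The merge step described above — wildcarding the tokens that differ once two log groups pass the similarity threshold — can be sketched roughly as follows. This is an illustrative simplification, not the actual Elasticsearch implementation; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the log-group merge step: two token lists are
// merged into one category key when they share at least
// similarity_threshold percent of tokens, with differing tokens
// replaced by "*".
public class TokenMergeSketch {

    // Percentage of token positions (over the shorter list) that match exactly.
    static double similarity(List<String> a, List<String> b) {
        int n = Math.min(a.size(), b.size());
        if (n == 0) {
            return 0.0;
        }
        int same = 0;
        for (int i = 0; i < n; i++) {
            if (a.get(i).equals(b.get(i))) {
                same++;
            }
        }
        return 100.0 * same / n;
    }

    // Merge two token lists if they are similar enough: keep matching
    // tokens, wildcard the rest. Returns null when the groups are too
    // different to merge.
    static List<String> merge(List<String> a, List<String> b, double thresholdPct) {
        if (similarity(a, b) < thresholdPct) {
            return null;
        }
        int n = Math.min(a.size(), b.size());
        List<String> merged = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            merged.add(a.get(i).equals(b.get(i)) ? a.get(i) : "*");
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> g1 = List.of("Node", "3", "shutting", "down");
        List<String> g2 = List.of("Node", "5", "shutting", "down");
        // 3 of 4 tokens match (75% similarity), well above a threshold of 20
        System.out.println(String.join(" ", merge(g1, g2, 20.0)));
        // prints: Node * shutting down
    }
}
```

This also illustrates why the example response above contains `*` inside the bucket key: it marks the token positions (request IDs, byte counts, timings) that varied across the merged log lines.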
1 parent 03905f8 commit 0782ae7

File tree

34 files changed: +3899 −8 lines changed


benchmarks/src/main/java/org/elasticsearch/benchmark/search/aggregations/AggConstructionContentionBenchmark.java

Lines changed: 17 additions & 0 deletions
```diff
@@ -23,6 +23,7 @@
 import org.elasticsearch.core.Releasables;
 import org.elasticsearch.index.Index;
 import org.elasticsearch.index.IndexSettings;
+import org.elasticsearch.index.analysis.NameOrDefinition;
 import org.elasticsearch.index.analysis.NamedAnalyzer;
 import org.elasticsearch.index.cache.bitset.BitsetFilterCache;
 import org.elasticsearch.index.fielddata.IndexFieldData;
@@ -198,6 +199,22 @@ public long nowInMillis() {
         return 0;
     }

+    @Override
+    public Analyzer getNamedAnalyzer(String analyzer) {
+        return null;
+    }
+
+    @Override
+    public Analyzer buildCustomAnalyzer(
+        IndexSettings indexSettings,
+        boolean normalizer,
+        NameOrDefinition tokenizer,
+        List<NameOrDefinition> charFilters,
+        List<NameOrDefinition> tokenFilters
+    ) {
+        return null;
+    }
+
     @Override
     protected IndexFieldData<?> buildFieldData(MappedFieldType ft) {
         IndexFieldDataCache indexFieldDataCache = indicesFieldDataCache.buildIndexFieldDataCache(new IndexFieldDataCache.Listener() {
```

docs/build.gradle

Lines changed: 33 additions & 0 deletions
```diff
@@ -1073,6 +1073,39 @@ buildRestTests.setups['farequote_datafeed'] = buildRestTests.setups['farequote_j
   "indexes":"farequote"
 }
 '''
+buildRestTests.setups['categorize_text'] = '''
+  - do:
+      indices.create:
+        index: log-messages
+        body:
+          settings:
+            number_of_shards: 1
+            number_of_replicas: 0
+          mappings:
+            properties:
+              time:
+                type: date
+              message:
+                type: text
+
+  - do:
+      bulk:
+        index: log-messages
+        refresh: true
+        body: |
+          {"index": {"_id":"1"}}
+          {"time":"2016-02-07T00:01:00+0000", "message": "2016-02-07T00:00:00+0000 Node 3 shutting down"}
+          {"index": {"_id":"2"}}
+          {"time":"2016-02-07T00:02:00+0000", "message": "2016-02-07T00:00:00+0000 Node 5 starting up"}
+          {"index": {"_id":"3"}}
+          {"time":"2016-02-07T00:03:00+0000", "message": "2016-02-07T00:00:00+0000 Node 4 shutting down"}
+          {"index": {"_id":"4"}}
+          {"time":"2016-02-08T00:01:00+0000", "message": "2016-02-08T00:00:00+0000 Node 5 shutting down"}
+          {"index": {"_id":"5"}}
+          {"time":"2016-02-08T00:02:00+0000", "message": "2016-02-08T00:00:00+0000 User foo_325 logging on"}
+          {"index": {"_id":"6"}}
+          {"time":"2016-02-08T00:04:00+0000", "message": "2016-02-08T00:00:00+0000 User foo_864 logged off"}
+'''
 buildRestTests.setups['server_metrics_index'] = '''
 - do:
     indices.create:
```

docs/reference/aggregations/bucket.asciidoc

Lines changed: 2 additions & 0 deletions
```diff
@@ -20,6 +20,8 @@ include::bucket/adjacency-matrix-aggregation.asciidoc[]

 include::bucket/autodatehistogram-aggregation.asciidoc[]

+include::bucket/categorize-text-aggregation.asciidoc[]
+
 include::bucket/children-aggregation.asciidoc[]

 include::bucket/composite-aggregation.asciidoc[]
```
