This repository was archived by the owner on Jan 10, 2025. It is now read-only.

Commit d030944: Add aggregated_issues_in_logs watch (1 parent 3719221; 12 files changed, +1036 −0 lines)
# log issues watch

## Description

A watch which searches for issues in logs and aggregates them.

The idea is to find issues in logs from any source. Generic filters on the "severity" and "msg" fields are used, which are uniform across all index-sets. Additionally, a blacklist is used to exclude certain events.

After filtering, an Elasticsearch aggregation on the fields "host", "severity" and "msg" is used to create one aggregated issue per hour for similar events.

The output is indexed into `log_issues__v4_{now/y{YYYY}}` for further analytics and long-term archiving.
The staging watch output is indexed into `log_issues_env=staging__v4_{now/y{YYYY}}`.
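The trailing part of the index name is Elasticsearch date math: `{now/y{YYYY}}` resolves to the current year, producing one index per year. A minimal Python sketch of that resolution (an illustration, not how Elasticsearch implements it; only the yearly pattern used here is handled):

```python
from datetime import datetime, timezone

def resolve_index(template: str, now: datetime) -> str:
    """Resolve the {now/y{YYYY}} date-math expression in an index name.

    Elasticsearch supports many more date-math expressions; this sketch
    only handles the yearly pattern used by this watch.
    """
    return template.replace("{now/y{YYYY}}", f"{now.year:04d}")

now = datetime(2018, 4, 16, 6, 30, tzinfo=timezone.utc)
print(resolve_index("log_issues__v4_{now/y{YYYY}}", now))  # log_issues__v4_2018
```

Note that in real index requests the whole date-math index name is additionally wrapped in `<...>`; that detail is omitted here.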
## Query overview

This is only an overview and not the full query.
```YAML
## Include events
## * severity: Critical or worse
## * severity: Notice or worse AND msg contains one or more keywords
must:
  - range:
      severity:
        lte: 5
  - bool:
      should:
        - terms:
            msg:
              - attack
              - crash
              - conflict
              - critical
              - denied
              - down
              - dump
              - error
              - exit
              - fail
              - failed
              - fatal
              - fault
              - overflow
              - poison
              - quit
              - restart
              - unable
        - range:
            severity:
              lte: 2
```
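In plain terms: an event is included when its severity is Notice (5) or worse and it either contains one of the keywords or is Critical (2) or worse. A Python paraphrase of that condition (a sketch; the real `terms` query matches analyzed tokens, so splitting on whitespace is only an approximation):

```python
# RFC 5424 severities run from 0 (Emergency) to 7 (Debug); lower = more severe.
KEYWORDS = {
    "attack", "crash", "conflict", "critical", "denied", "down", "dump",
    "error", "exit", "fail", "failed", "fatal", "fault", "overflow",
    "poison", "quit", "restart", "unable",
}

def is_issue(severity: int, msg: str) -> bool:
    """Mirror the bool query: Notice or worse AND (keyword match OR Critical or worse)."""
    keyword_hit = bool(KEYWORDS & set(msg.lower().split()))
    return severity <= 5 and (keyword_hit or severity <= 2)

print(is_issue(2, "kernel panic"))           # True: Critical or worse
print(is_issue(5, "disk error detected"))    # True: Notice or worse plus keyword
print(is_issue(5, "user logged in"))         # False: Notice, but no keyword
print(is_issue(6, "fatal error"))            # False: Informational is excluded
```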

## Schedules and date ranges

The watch is scheduled at minute 32 every hour and selects the complete last hour.
Because the actual time at which the query is executed varies slightly (which determines the value of `now` in the range query), we use date rounding to the full hour.

Because events need some time from when they are emitted at the origin until they have travelled through the log collection pipeline and been indexed and refreshed (`_refresh` API) in Elasticsearch, a watch running at `now` does not look at the range from `now-1h` until `now`; an additional delay of 30 minutes is used.
See: https://discuss.elastic.co/t/ensure-that-watcher-does-not-miss-documents-logs/127780/1

For example, the first timestamp included is 2018-04-16T05:00:00.000Z and the last is 2018-04-16T05:59:59.999Z.

This can be tested with the following query; just index suitable documents beforehand.
2018-04-16T06:29:59.999Z represents `now`:
```
GET test-2018/_search?filter_path=took,hits.total,hits.hits._source
{
  "sort": [
    "@timestamp"
  ],
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2018-04-16T06:29:59.999Z||-1h-30m/h",
        "lt": "2018-04-16T06:29:59.999Z||-30m/h"
      }
    }
  }
}
```
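The two date-math expressions subtract the offset and round down to the full hour (`/h`); on `gte` the rounded value is the inclusive lower bound, and on `lt` it is the exclusive upper bound. A Python sketch of the resulting window (assumed behaviour per the Elasticsearch date-math rounding rules, not the watch's own code):

```python
from datetime import datetime, timedelta, timezone

def hour_floor(dt: datetime) -> datetime:
    """Round a timestamp down to the full hour (the /h rounding on gte/lt)."""
    return dt.replace(minute=0, second=0, microsecond=0)

def watch_range(now: datetime):
    """Window selected by the watch: [floor(now-1h30m), floor(now-30m))."""
    gte = hour_floor(now - timedelta(hours=1, minutes=30))
    lt = hour_floor(now - timedelta(minutes=30))  # exclusive upper bound
    return gte, lt

now = datetime(2018, 4, 16, 6, 30, tzinfo=timezone.utc)
gte, lt = watch_range(now)
print(gte.isoformat())  # 2018-04-16T05:00:00+00:00
print(lt.isoformat())   # 2018-04-16T06:00:00+00:00 (last included ms: 05:59:59.999)
```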

## Fields in the aggregated output

As each generated document originates from an aggregation over one or more source documents, certain metadata fields have been introduced. All metadata fields start with "#" as they are not contained in the original log events. The following fields are defined:

* host: Log source host name contained in the source documents.
* severity: Severity of the log event contained in the source documents.
* msg: Log event message contained in the source documents.
* \#count: Count of source documents for which all of the above fields are the same over the scheduled interval.
* \#source: Source/type of the log event as defined by our Logstash configuration. This field determines the first part of the index name/index-set.
* @first_timestamp: Timestamp of the first occurrence of a document matching the query in the scheduled interval. Technically, this is the minimum @timestamp field in the aggregation.
* @last_timestamp: Timestamp of the last occurrence of a document matching the query in the scheduled interval. Technically, this is the maximum @timestamp field in the aggregation.
* \#timeframe: Duration between @first_timestamp and @last_timestamp in milliseconds.
* \#watch_timestamp: Timestamp at which the watch executed and created this document.
* doc: Nested object containing all keys of one source document not contained elsewhere already. This can be useful for debugging and for excluding events that were wrongly classified as issues.
* \#doc_ref: URL referring to one source document. This can be useful for debugging and for excluding false positives. It is faster than \#doc_context_ref.
* \#doc_context_ref: URL referring to one source document in the context of surrounding documents based on @timestamp. This can be useful for debugging and for excluding false positives. It is slower than \#doc_ref.
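Conceptually, each output document is built from one bucket of the host/severity/msg aggregation. A Python sketch of how some of these fields are derived (illustrative only; the actual watch does this with an Elasticsearch terms aggregation and a transform script, and fields such as \#source and \#doc_ref are omitted here):

```python
from datetime import datetime

def aggregate_bucket(events):
    """Collapse events sharing host/severity/msg into one issue document."""
    assert events, "bucket must be non-empty"
    timestamps = sorted(datetime.fromisoformat(e["@timestamp"]) for e in events)
    first, last = timestamps[0], timestamps[-1]
    return {
        "host": events[0]["host"],
        "severity": events[0]["severity"],
        "msg": events[0]["msg"],
        "#count": len(events),
        "@first_timestamp": first.isoformat(),
        "@last_timestamp": last.isoformat(),
        "#timeframe": int((last - first).total_seconds() * 1000),
    }

bucket = [
    {"@timestamp": "2018-04-16T05:10:00", "host": "web1", "severity": 3, "msg": "disk error"},
    {"@timestamp": "2018-04-16T05:40:00", "host": "web1", "severity": 3, "msg": "disk error"},
]
doc = aggregate_bucket(bucket)
print(doc["#count"], doc["#timeframe"])  # 2 1800000
```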

## Quality assurance

The watch implementation is integration tested using the official integration
testing mechanism used by Elastic to test public watch examples.
Please be sure to add new tests for any changes you make here, to ensure that they
have the desired effect and to avoid regressions. All tests must be run and
pass before deploying to production.
These tests are to be run against a development Elasticsearch instance.
### Date selection considerations

> gte: 2018-04-16T05:00:00.000Z, offset -5400
> lt: 2018-04-16T05:59:59.999Z, offset -1800.001
> now: 2018-04-16T06:30:00.000Z
>
> gte: 2018-04-16T04:00:00.000Z, offset -8999.999
> lt: 2018-04-16T04:59:59.999Z, offset -5400
> now: 2018-04-16T06:29:59.999Z

Offset -5400 is always in the range. Offsets from -8999.999 up to and including
-1800.001 are sometimes in the range, depending on `now`. `now` is not mocked by
the test framework, so we need to use -5400 for all test data, or something
outside of the range, for deterministic tests.
Unfortunately, deterministic tests cannot currently be guaranteed because there is some delay between the offset-to-timestamp calculation and the execution of the watch. If that happens in integration tests, rerun the test. The probability of this happening is very low, at ~0.005 % (~0.2s / 3600s * 100).

This has the negative effect that we cannot (reliably) test with different
timestamps, so the `#timeframe` field cannot be tested anymore.

The offset calculations can be verified with this Python code:

```Python
import datetime

(datetime.datetime.strptime('2018-04-16T12:20:59.999Z', "%Y-%m-%dT%H:%M:%S.%fZ")
 - datetime.datetime.strptime('2018-04-16T12:14:00.000Z', "%Y-%m-%dT%H:%M:%S.%fZ")).total_seconds()
# 419.999
```

## Basic concepts

Be only as restrictive as needed. The input data might change, and if we define the conditions too precisely, the data will never match again and we will not know. It is better to have false positives and make the conditions more restrictive in that case.
## Helpers

```Shell
## Getting example data from production using an Elasticsearch query. After that, you can use this yq/jq one-liner to generate watch test input:

curl --cacert "$(get_es_cacert)" -u "$(get_es_creds)" "$(get_es_url)/${INDEX_PATTERN}/_search?pretty" -H 'Content-Type: application/yaml' --data-binary @log_issues/helpers/get_from_production.yaml > /tmp/es_out.yaml

yq '{events: {"ignore this – needed to get proper indentation": ([ .hits.hits[]._source ] | length as $length | to_entries | map(del(.value["@timestamp"], .value["#logstash_timestamp"]) | {id: (.key + 1), offset: (10 * (.key - $length))} + .value)) }}' /tmp/es_out.yaml -y
yq '[ .hits.hits[] | ._source ] | length as $length | to_entries | map({id: (.key + 1), offset: (10 * (.key - $length))} + .value)' /tmp/es_out.yaml

## Painless debugging can be difficult. curl can be used to get proper syntax error messages:

curl -u 'elastic:changeme' 'http://localhost:9200/_scripts/log_issues-index_transform' -H 'Content-Type: application/yaml' --data-binary @log_issues/scripts/log_issues-index_transform.yaml
```

## Mapping Assumptions

A mapping is provided in `mapping.json`. This watch requires source data producing the following fields:

* @timestamp (date): Authoritative date field for each log message.
* msg (string): Contents of the log message.
* host (string): Log origin.
* severity (byte): Field with the severity as defined in RFC 5424. Ref: https://en.wikipedia.org/wiki/Syslog#Severity_level.
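For illustration, a document satisfying these assumptions might look like this (all values are made up):

```python
# Hypothetical example event; field names follow the mapping assumptions above.
example_event = {
    "@timestamp": "2018-04-16T05:10:00.000Z",          # date
    "msg": "kernel: Out of memory: Kill process 4711",  # string
    "host": "web1.example.net",                          # string
    "severity": 2,                                       # byte; RFC 5424: 2 = Critical
}

# RFC 5424 defines severities 0 (Emergency) through 7 (Debug).
assert 0 <= example_event["severity"] <= 7
```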

## Data Assumptions

The watch assumes each log message is represented by an Elasticsearch document. The watch does not assume any particular index; the data may be indexed in any index.

## Other Assumptions

* None

## Configuration

* The watch is scheduled to find errors every hour. Configurable in the `watch.yaml` configuration file.

## Deployment

The `./watch.yaml` mentions a `Makefile`. An example deployment invocation:

```Shell
python run_test.py --test_file ./aggregated_issues_in_logs/tests_disabled/90_deploy_to_production_empty_test_data.yaml --metadata-git-commit --no-test-index --no-execute-watch --host "$(get_es_url)" --user "$(get_es_user)" --password "$(get_es_pw)" --cacert "$(get_es_cacert)" --modify-watch-by-eval "del watch['actions']['log']; watch['actions']['index_payload']['index']['index'] = '<' + watch['metadata']['index_category'] + '_' + watch['metadata']['index_type'] + watch['metadata']['index_kv'] + '__' + watch['metadata']['index_revision'] + '_{now/y{YYYY}}>';"
```
---

A second file in this commit (38 additions, 0 deletions):
```YAML
---

# yamllint disable rule:line-length rule:comments-indentation

size: 10

_source:
  ## Calculated using the offset property by the test framework.
  # - '@timestamp'

  ## Leave out if possible to avoid the need to censor them if making datasets public.
  - 'host'

  - 'msg'
  - 'severity'
  - '#source'

  # - 'logdesc'

query:
  query_string:
    query: 'msg:"handshake"'


# aggregations:
#   host:
#     sum:
#       script:
#         source: 'doc.data[""]'

# aggregations:
#   severity:
#     terms:
#       field: 'severity'
#     aggregations:
#       msg:
#         terms:
#           field: '_uid'
```
