diff --git a/Alerting/Sample Watches/aggregated_issues_in_logs/README.md b/Alerting/Sample Watches/aggregated_issues_in_logs/README.md new file mode 100644 index 00000000..6dfb74ec --- /dev/null +++ b/Alerting/Sample Watches/aggregated_issues_in_logs/README.md @@ -0,0 +1,181 @@ +# log issues watch + +## Description + +A watch which searches for issues in logs and aggregates them. + +The idea is to find issues in logs of any source. Generic filters on the "severity" and "msg" fields, which are uniform across all index-sets, are used. Additionally, a blacklist is used to exclude certain events. + +After filtering, an Elasticsearch aggregation on the fields "host", "severity" and "msg" is used to create one aggregated issue for similar events per hour. + +The output is indexed into `log_issues__v4_{now/y{YYYY}}` for further analytics and long-term archiving. +The staging watch output is indexed into `log_issues_env=staging__v4_{now/y{YYYY}}`. + +## Query overview + +This is only an overview and not the full query. + +```YAML +## Include events +## * severity: Critical or worse +## * severity: Notice or worse AND msg contains one or more keywords +must: + - range: + severity: + lte: 5 + - bool: + should: + - terms: + msg: + - attack + - crash + - conflict + - critical + - denied + - down + - dump + - error + - exit + - fail + - failed + - fatal + - fault + - overflow + - poison + - quit + - restart + - unable + - range: + severity: + lte: 2 +``` + +## Schedules and date ranges + +The watch is scheduled at minute 32 every hour and selects the complete previous hour. +Because the actual time at which the query is executed varies slightly (which determines the value of `now` in the range query), we use date rounding to the full hour.
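The scheduling and rounding described above can be sanity-checked offline. The following sketch emulates the Elasticsearch date math used by the watch's range filter (`now-1h-30m/h` for `gte`, `now-30m/h` for `lt`, per the 30-minute delay and one-hour window described in this section); it is an illustration only, not code the watch runs:

```python
from datetime import datetime, timedelta, timezone


def query_window(now, window=timedelta(hours=1), offset=timedelta(minutes=30)):
    """Emulate `now-1h-30m/h` (gte, inclusive) and `now-30m/h` (lt, exclusive)."""
    def floor_hour(dt):
        # `/h` date rounding: truncate to the start of the hour.
        return dt.replace(minute=0, second=0, microsecond=0)
    return floor_hour(now - window - offset), floor_hour(now - offset)


# A trigger at minute 32 of the hour:
gte, lt = query_window(datetime(2018, 4, 16, 6, 32, tzinfo=timezone.utc))
# gte is 2018-04-16T05:00:00Z, lt is 2018-04-16T06:00:00Z, i.e. the
# complete previous hour, shifted back by the 30-minute delay.
```

Because `lt` is exclusive, the last millisecond included is 05:59:59.999, which matches the example timestamps given in this section.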
+ +Because events need some time from when they are emitted at the origin until they have traveled through the log collection pipeline and have been indexed and refreshed (`_refresh` API) in Elasticsearch, a watch running at `now` does not look at the range from `now-1h` until `now`; instead, an additional delay of 30 minutes is used. +See: https://discuss.elastic.co/t/ensure-that-watcher-does-not-miss-documents-logs/127780/1 + +The first timestamp included is for example 2018-04-16T05:00:00.000Z and the last 2018-04-16T05:59:59.999Z. + +This can be tested with the following query. Just index suitable documents beforehand. +2018-04-16T06:29:59.999Z represents `now`: + +``` +GET test-2018/_search?filter_path=took,hits.total,hits.hits._source +{ + "sort": [ + "@timestamp" + ], + "query": { + "range": { + "@timestamp": { + "gte": "2018-04-16T06:29:59.999Z||-1h-30m/h", + "lt": "2018-04-16T06:29:59.999Z||-30m/h" + } + } + } +} +``` + +## Fields in the aggregated output + +As each generated document originates from an aggregation over one or multiple source documents, certain metadata fields have been introduced. All metadata fields start with "#" as they are not contained in the original log events. The following fields are defined: + +* host: Log source host name contained in source documents. +* severity: Severity of log event contained in source documents. +* msg: Log event message contained in source documents. +* \#count: Count of source documents where all of the above fields are the same over the scheduled interval. +* \#source: Source/Type of log event as defined by our Logstash configuration. This field determines the first part of the index name/index-set. +* @first_timestamp: Timestamp of first occurrence of a document matching the query in the scheduled interval. Technically, this is the minimum @timestamp field in the aggregation. +* @last_timestamp: Timestamp of last occurrence of a document matching the query in the scheduled interval.
Technically, this is the maximum @timestamp field in the aggregation. +* \#timeframe: Duration between @first_timestamp and @last_timestamp in milliseconds. +* \#watch_timestamp: Timestamp at which the watch executed and created this document. +* doc: Nested object containing all keys of one source document not already contained elsewhere. This can be useful for debugging and for excluding events which were wrongly classified as issues. +* \#doc_ref: URL referring to one source document. This can be useful for debugging and for excluding false positives. It is faster than \#doc_context_ref. +* \#doc_context_ref: URL referring to one source document in the context of surrounding documents based on @timestamp. This can be useful for debugging and for excluding false positives. It is slower than \#doc_ref. + +## Quality assurance + +The watch implementation is integration-tested using the official integration +testing mechanism used by Elastic to test public watch examples. +Please be sure to add new tests for any changes you make here to ensure that they +have the desired effect and to avoid regressions. All tests must be run and +pass before deploying to production. +These tests are to be run against a development Elasticsearch instance. + +### Date selection considerations + +> gte: 2018-04-16T05:00:00.000Z, offset -5400 +> lt: 2018-04-16T05:59:59.999Z, offset -1800.001 +> now: 2018-04-16T06:30:00.000Z +> +> gte: 2018-04-16T04:00:00.000Z, offset -8999.999 +> lt: 2018-04-16T04:59:59.999Z, offset -5400 +> now: 2018-04-16T06:29:59.999Z + +Offset -5400 is always in the range. Offsets from -8999.999 up to and including +-1800.001 are sometimes in the range, depending on `now`. `now` is not mocked by +the test framework, so we need to use -5400 for all test data or something +outside of the range for deterministic tests.
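The offset reasoning above can be double-checked with a short emulation. This sketch assumes test offsets are seconds relative to `now` and re-implements the range filter's date math (`now-1h-30m/h` to `now-30m/h`) in plain Python; it is illustrative only, not part of the test framework:

```python
from datetime import datetime, timedelta, timezone


def in_window(offset_s, now):
    # Emulates the watch's range filter: `now-1h-30m/h` (gte, inclusive)
    # to `now-30m/h` (lt, exclusive). Sketch only, not the watch code.
    def floor_hour(dt):
        return dt.replace(minute=0, second=0, microsecond=0)
    ts = now + timedelta(seconds=offset_s)
    gte = floor_hour(now - timedelta(hours=1, minutes=30))
    lt = floor_hour(now - timedelta(minutes=30))
    return gte <= ts < lt


late = datetime(2018, 4, 16, 6, 30, 0, 0, tzinfo=timezone.utc)
early = late - timedelta(milliseconds=1)  # 2018-04-16T06:29:59.999Z

# Offset -5400 is inside the window for both boundary values of `now`:
assert in_window(-5400, late) and in_window(-5400, early)
# Offset -1800.001 is inside only for the later `now`:
assert in_window(-1800.001, late) and not in_window(-1800.001, early)
```

Checking -8999.999 the same way shows the other boundary: it is inside for the earlier `now` but outside for the later one.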
+Unfortunately, deterministic tests cannot currently be ensured because there is some delay between the offset-to-timestamp calculation and the execution of the watch. If a test fails because of this, rerun it. The probability of this happening is very low at ~0.005 % (~0.2 s / 3600 s * 100). + +This has the negative effect that we cannot reliably test with different +timestamps, so the `#timeframe` field can no longer be tested. + +The offset calculations can be verified with this Python code: + +```Python +import datetime + +(datetime.datetime.strptime('2018-04-16T12:20:59.999Z', "%Y-%m-%dT%H:%M:%S.%fZ") - datetime.datetime.strptime('2018-04-16T12:14:00.000Z', "%Y-%m-%dT%H:%M:%S.%fZ")).total_seconds() +``` + + +## Basic concepts + +Only be as restrictive as needed. The input data might change, and if we define the conditions too precisely, the data will never match again and we will not notice. It is better to have false positives and make the conditions more restrictive in that case. + +## Helpers + +```Shell +## Getting example data from production using an Elasticsearch query. After that, you can use this yq/jq one-liner to generate watch test input: + +curl --cacert "$(get_es_cacert)" -u "$(get_es_creds)" "$(get_es_url)/${INDEX_PATTERN}/_search?pretty" -H 'Content-Type: application/yaml' --data-binary @log_issues/helpers/get_from_production.yaml > /tmp/es_out.yaml + +yq '{events: {"ignore this – needed to get proper indention": ([ .hits.hits[]._source ] | length as $length | to_entries | map(del(.value["@timestamp"], .value["#logstash_timestamp"]) | {id: (.key + 1), offset: (10 * (.key - $length))} + .value)) }}' /tmp/es_out.yaml -y +yq '[ .hits.hits[] | ._source ] | length as $length | to_entries | map({id: (.key + 1), offset: (10 * (.key - $length))} + .value)' /tmp/es_out.yaml + +## Painless debugging can be difficult.
curl can be used to get proper syntax error messages: + +curl -u 'elastic:changeme' 'http://localhost:9200/_scripts/log_issues-index_transform' -H 'Content-Type: application/yaml' --data-binary @log_issues/scripts/log_issues-index_transform.yaml +``` + +## Mapping Assumptions + +A mapping is provided in `mapping.json`. This watch requires source data producing the following fields: + +* @timestamp (date): Authoritative date field for each log message. +* msg (string): Contents of the log message. +* host (string): Log origin. +* severity (byte): Field with severity as defined in RFC 5424. Ref: https://en.wikipedia.org/wiki/Syslog#Severity_level. + +## Data Assumptions + +The watch assumes each log message is represented by an Elasticsearch document. No particular index is assumed; data may be indexed in any index. + +## Other Assumptions + +* None + +## Configuration + +* The watch is scheduled to find errors every hour. This is configurable in the `watch.yaml` configuration file. + +## Deployment + +The `./watch.yaml` mentions a `Makefile`: + +```Shell +python run_test.py --test_file ./aggregated_issues_in_logs/tests_disabled/90_deploy_to_production_empty_test_data.yaml --metadata-git-commit --no-test-index --no-execute-watch --host "$(get_es_url)" --user "$(get_es_user)" --password "$(get_es_pw)" --cacert "$(get_es_cacert)" --modify-watch-by-eval "del watch['actions']['log']; watch['actions']['index_payload']['index']['index'] = '<' + watch['metadata']['index_category'] + '_' + watch['metadata']['index_type'] + watch['metadata']['index_kv'] + '__' + watch['metadata']['index_revision'] + '_{now/y{YYYY}}>';" +``` diff --git a/Alerting/Sample Watches/aggregated_issues_in_logs/helpers/get_from_production.yaml b/Alerting/Sample Watches/aggregated_issues_in_logs/helpers/get_from_production.yaml new file mode 100644 index 00000000..452e221f --- /dev/null +++ b/Alerting/Sample Watches/aggregated_issues_in_logs/helpers/get_from_production.yaml @@ -0,0 +1,38 @@ +--- + +# yamllint disable
rule:line-length rule:comments-indentation + +size: 10 + +_source: + ## Calculated using the offset property by the test framework. + # - '@timestamp' + + ## Leave out if possible to avoid the need to censor them if making datasets public. + - 'host' + + - 'msg' + - 'severity' + - '#source' + + # - 'logdesc' + +query: + query_string: + query: 'msg:"handshake"' + + +# aggregations: +# host: +# sum: +# script: +# source: 'doc.data[""]' + +# aggregations: +# severity: +# terms: +# field: 'severity' +# aggregations: +# msg: +# terms: +# field: '_uid' diff --git a/Alerting/Sample Watches/aggregated_issues_in_logs/mapping.json b/Alerting/Sample Watches/aggregated_issues_in_logs/mapping.json new file mode 100644 index 00000000..3827a469 --- /dev/null +++ b/Alerting/Sample Watches/aggregated_issues_in_logs/mapping.json @@ -0,0 +1,189 @@ +{ + "order" : 100, + "version" : 5000, + "index_patterns" : [ + "*-v*-*", + "*_*__v*_*" + ], + "settings" : { + "index" : { + "refresh_interval" : "5s" + } + }, + "mappings" : { + "doc" : { + "dynamic_templates" : [ + { + "type_long_fields" : { + "match_pattern" : "regex", + "path_match" : "^(?:#?doc\\.)?#?(?:byte|duration|rate|count)$", + "mapping" : { + "norms" : false, + "type" : "long" + } + } + }, + { + "type_integer_fields" : { + "match_pattern" : "regex", + "path_match" : "^(?:#?doc\\.)?#?(?:signal|noise)$", + "mapping" : { + "norms" : false, + "type" : "integer" + } + } + }, + { + "type_short_fields" : { + "match_pattern" : "regex", + "path_match" : "^(?:#?doc\\.)?#?(?:wlan_channel|cpu_usage|disk_usage|mem_usage)$", + "mapping" : { + "norms" : false, + "type" : "integer" + } + } + }, + { + "type_byte_fields" : { + "match_pattern" : "regex", + "path_match" : "^(?:#?doc\\.)?#?(?:random|severity)$", + "mapping" : { + "norms" : false, + "type" : "byte" + } + } + }, + { + "ip_fields" : { + "match_pattern" : "regex", + "path_match" : "^(?:#?doc\\.)?#?(?:(?:tran_)?(?:local|remote|src|dst)_ip|event_data\\.IpAddress)$", + "mapping" : { + 
"type" : "ip" + } + } + }, + { + "tcpip_port_fields" : { + "match_pattern" : "regex", + "path_match" : "^(?:#?doc\\.)?#?(?:(?:tran_)?(?:dst|src|remote)_port|event_data\\.IpPort)$", + "mapping" : { + "norms" : false, + "type" : "integer" + } + } + }, + { + "switch_port_fields" : { + "match_pattern" : "regex", + "path_match" : "^(?:#?doc\\.)?#?port$", + "mapping" : { + "norms" : false, + "type" : "text", + "fields" : { + "keyword" : { + "type" : "keyword" + } + } + } + } + }, + { + "packet_count_fields" : { + "match_pattern" : "regex", + "path_match" : "^(?:#?doc\\.)?#?(?:sent|rcvd|received)_(?:pkt|packet)$", + "mapping" : { + "norms" : false, + "type" : "long" + } + } + }, + { + "information_size_fields" : { + "match_pattern" : "regex", + "path_match" : "^(?:#?doc\\.)?#?(?:sent|rcvd|received)_byte$", + "mapping" : { + "norms" : false, + "type" : "long" + } + } + }, + { + "scaled_float_1000_fields" : { + "match_pattern" : "regex", + "path_match" : "^(?:#?doc\\.)?#?(?:elapsed_time)$", + "mapping" : { + "scaling_factor" : 1000, + "type" : "scaled_float" + } + } + }, + { + "timestamp_post_ms" : { + "match_pattern" : "regex", + "path_match" : "^(?:#?doc\\.)?#?@timestamp_post_ms$", + "mapping" : { + "norms" : false, + "type" : "integer" + } + } + }, + { + "#logstash_timestamp" : { + "match_pattern" : "regex", + "path_match" : "^(?:#?doc\\.)?#?(?:@timestamp|logstash_timestamp)$", + "mapping" : { + "norms" : false, + "type" : "date" + } + } + }, + { + "forced_string_fields" : { + "match_pattern" : "regex", + "path_match" : "^(?:#?doc\\.)?#?(?:version)$", + "mapping" : { + "norms" : false, + "type" : "text", + "fields" : { + "keyword" : { + "type" : "keyword" + } + } + } + } + }, + { + "message_field" : { + "mapping" : { + "norms" : false, + "type" : "text", + "fields" : { + "keyword" : { + "type" : "keyword" + } + } + }, + "match_mapping_type" : "string", + "match" : "msg" + } + }, + { + "string_fields" : { + "mapping" : { + "norms" : false, + "type" : "text", + "fields" :
{ + "keyword" : { + "type" : "keyword" + } + } + }, + "match_mapping_type" : "string", + "match" : "*" + } + } + ] + } + }, + "aliases" : { } +} diff --git a/Alerting/Sample Watches/aggregated_issues_in_logs/scripts/aggregated_issues_in_logs-deterministic_log.yaml b/Alerting/Sample Watches/aggregated_issues_in_logs/scripts/aggregated_issues_in_logs-deterministic_log.yaml new file mode 100644 index 00000000..1208fe15 --- /dev/null +++ b/Alerting/Sample Watches/aggregated_issues_in_logs/scripts/aggregated_issues_in_logs-deterministic_log.yaml @@ -0,0 +1,40 @@ +--- + +# yamllint disable rule:line-length rule:comments-indentation + +script: + lang: 'painless' + source: | + /* Ensure deterministic log output for integration testing. + * Timestamps are converted to offsets. + */ + + def min_timestamp = Instant.parse( + Collections.min(ctx.payload._doc.stream().map( + new_doc -> new_doc['@first_timestamp'] + ).collect(Collectors.toList())) + ); + + for (HashMap new_doc : ctx.payload._doc) { + if (new_doc.containsKey('_index')) { + new_doc['_index'] = 'log_issues_env=test__v4_2023'; + } + + if (new_doc.containsKey('#watch_timestamp')) { + if (new_doc.containsKey('@first_timestamp')) { + new_doc['@first_timestamp'] = Instant.parse(new_doc['@first_timestamp']).getEpochSecond() - min_timestamp.getEpochSecond(); + } + if (new_doc.containsKey('@last_timestamp')) { + new_doc['@last_timestamp'] = Instant.parse(new_doc['@last_timestamp']).getEpochSecond() - min_timestamp.getEpochSecond(); + } + new_doc['#watch_timestamp'] = '2023-05-23T23:23:23.555Z'; + } + + if (new_doc.containsKey('#doc')) { + new_doc['#doc'].remove('@timestamp'); + } + /* TODO: Workaround: Integration testing. Output sometimes contains `_id`, sometimes not. This breaks reliability of the tests. -> Remove it for now. 
*/ + new_doc.remove('_id'); + } + + return ctx.payload; diff --git a/Alerting/Sample Watches/aggregated_issues_in_logs/scripts/aggregated_issues_in_logs-index_transform.yaml b/Alerting/Sample Watches/aggregated_issues_in_logs/scripts/aggregated_issues_in_logs-index_transform.yaml new file mode 100644 index 00000000..b54c485d --- /dev/null +++ b/Alerting/Sample Watches/aggregated_issues_in_logs/scripts/aggregated_issues_in_logs-index_transform.yaml @@ -0,0 +1,111 @@ +--- + +# yamllint disable rule:line-length + +script: + lang: 'painless' + source: | + List new_docs = new ArrayList(); + + /* Error handling {{{ */ + /* TODO: This code might not be run if all shards fail. Implement some other way to catch watch errors. */ + if (false && ctx.payload._shards.failed > 0) { + def failed_on_indexes = ctx.payload._shards.failures.stream().map( + failure -> failure['index'] + ).collect(Collectors.toSet()); + + new_docs.add([ + 'host': 'gxmneh6[1-5]', + /* Warning */ + 'severity': 4, + 'msg': ctx.payload._shards.failed + ' Elasticsearch shards failed to be queried. This affects the following indexes: ' + failed_on_indexes, + '#count': 1, + '#source': 'automation-elastic', + '@first_timestamp': ctx.execution_time, + '@last_timestamp': ctx.execution_time, + '#timeframe': 0, + '#watch_timestamp': ctx.execution_time + ]); + } + + if (ctx.payload.aggregations.severity.containsKey('sum_other_doc_count') && ctx.payload.aggregations.severity.sum_other_doc_count > 0) { + new_docs.add([ + 'host': 'gxmneh6[1-5]', + /* Warning */ + 'severity': 4, + 'msg': 'Elasticsearch returned more aggregations than we assumed.
Number of unprocessed aggregation buckets: ' + ctx.payload.aggregations.severity.sum_other_doc_count, + '#count': 1, + '#source': 'automation-elastic', + '@first_timestamp': ctx.execution_time, + '@last_timestamp': ctx.execution_time, + '#timeframe': 0, + '#watch_timestamp': ctx.execution_time + ]); + } + /* }}} */ + + /* Create new documents based on aggregation {{{ */ + for (HashMap severity : ctx.payload.aggregations.severity.buckets) { + for (HashMap host : severity.host.buckets) { + for (HashMap msg : host.msg.buckets) { + + List sources = msg.source.buckets.stream().map( + source_aggr -> source_aggr.key + ).collect(Collectors.toList()); + + String min_timestamp = msg.min_timestamp.hits.hits[0]._source['@timestamp']; + String max_timestamp = msg.max_timestamp.hits.hits[0]._source['@timestamp']; + + long min_timestamp_ms = msg.min_timestamp.hits.hits[0].sort[0]; + long max_timestamp_ms = msg.max_timestamp.hits.hits[0].sort[0]; + + String example_index = msg.max_timestamp.hits.hits[0]._index; + String example_type = msg.max_timestamp.hits.hits[0]._type; + String example_id = msg.max_timestamp.hits.hits[0]._id; + String example_source = msg.max_timestamp.hits.hits[0]._source['#source']; + + HashMap example_doc_source = msg.max_timestamp.hits.hits[0]._source; + example_doc_source.remove('host'); + example_doc_source.remove('severity'); + example_doc_source.remove('msg'); + example_doc_source.remove('#source'); + + /* Currently the same as @last_timestamp. */ + example_doc_source.remove('@timestamp'); + + /* This field is pointless for us. It serves a purpose in very big indexes though. */ + example_doc_source.remove('@random'); + + if (example_doc_source.containsKey('user') && example_doc_source['user'] instanceof HashMap) { + /* Failsafe: "user" HashMap cannot be indexed in "user" field + * because of conflicting data types across the source indexes.
+ */ + example_doc_source['user_object'] = example_doc_source.remove('user'); + } + + new_docs.add([ + 'host': host.key, + 'severity': severity.key, + 'msg': msg.key, + /* '_index': ctx.metadata.index_category + '_' + ctx.metadata.index_type + '_' + ctx.metadata.index_kv + '__' + ctx.metadata.index_revision + '_' + ZonedDateTime.ofInstant(Instant.parse(min_timestamp), ZoneId.of("UTC")).getYear(), */ + '_id': example_index + '_' + example_id, + // Link to single document: + // https://example.org/app/kibana#/doc/gnulinux-eventlog-*/gnulinux-eventlog-2017.11.14/event?id=AV-4P0nn1ZjIWeK5zvkg&_g=h@44136fa + '#doc_ref': '#/doc/' + example_source + '-*/' + example_index + '/' + example_type + '?id=' + example_id, + // Link to document plus surrounding documents: + // https://example.org/app/kibana#/context/gnulinux-eventlog-*/event/AV-4P0nn1ZjIWeK5zvkg?_g=h@44136fa&_a=h@c57b9b3 + '#doc_context_ref': '#/context/' + example_source + '-*/' + example_type + '/' + example_id, + 'doc': msg.max_timestamp.hits.hits[0]._source, + '#count': msg.doc_count, + '#source': sources, + '@first_timestamp': min_timestamp, + '@last_timestamp': max_timestamp, + '#timeframe': max_timestamp_ms - min_timestamp_ms, + '#watch_timestamp': ctx.execution_time + ]); + } + } + } + /* }}} */ + + return [ '_doc': new_docs ]; diff --git a/Alerting/Sample Watches/aggregated_issues_in_logs/tests/10_deploy_to_staging_empty_test_data.yaml b/Alerting/Sample Watches/aggregated_issues_in_logs/tests/10_deploy_to_staging_empty_test_data.yaml new file mode 100644 index 00000000..919662be --- /dev/null +++ b/Alerting/Sample Watches/aggregated_issues_in_logs/tests/10_deploy_to_staging_empty_test_data.yaml @@ -0,0 +1,28 @@ +--- + +# yamllint disable rule:line-length + +## This test is mainly here to have something which can be run against +## production without polluting anything. 
+ +watch_name: aggregated_issues_in_logs_staging +mapping_file: ./aggregated_issues_in_logs/mapping.json +index: log_network-switch__v1_1984-05-23 +type: doc +match: false +watch_file: ./aggregated_issues_in_logs/watch.yaml +scripts: + - name: 'aggregated_issues_in_logs-index_transform' + path: './aggregated_issues_in_logs/scripts/aggregated_issues_in_logs-index_transform.yaml' + - name: 'aggregated_issues_in_logs-deterministic_log' + path: './aggregated_issues_in_logs/scripts/aggregated_issues_in_logs-deterministic_log.yaml' + +events: + + ## Watcher would fail if it does not find any document/indexes with a @timestamp field: + ## "SearchPhaseExecutionException[all shards failed]; nested: + ## QueryShardException[No mapping found for [@timestamp] in order to sort on]; + - id: 1 + offset: 0 + +expected_response: |- diff --git a/Alerting/Sample Watches/aggregated_issues_in_logs/tests/10_icinga_should_not_trigger.yaml b/Alerting/Sample Watches/aggregated_issues_in_logs/tests/10_icinga_should_not_trigger.yaml new file mode 100644 index 00000000..69d74ee3 --- /dev/null +++ b/Alerting/Sample Watches/aggregated_issues_in_logs/tests/10_icinga_should_not_trigger.yaml @@ -0,0 +1,31 @@ +--- + +# yamllint disable rule:line-length + +watch_name: aggregated_issues_in_logs +mapping_file: ./aggregated_issues_in_logs/mapping.json +index: log_monitoring-notifications__v1_1984-03 +type: doc +match: false +watch_file: ./aggregated_issues_in_logs/watch.yaml +scripts: + - name: 'aggregated_issues_in_logs-index_transform' + path: './aggregated_issues_in_logs/scripts/aggregated_issues_in_logs-index_transform.yaml' + - name: 'aggregated_issues_in_logs-deterministic_log' + path: './aggregated_issues_in_logs/scripts/aggregated_issues_in_logs-deterministic_log.yaml' + +events: + + - id: 1 + offset: -5400 + msg: Harmless message + severity: 1 + '#source': monitoring-alerts + host: gnu.example.org + + - id: 2 + offset: -5400 + msg: fault + severity: 4 + '#source': monitoring-alerts + host: 
mux.example.org diff --git a/Alerting/Sample Watches/aggregated_issues_in_logs/tests/10_ignore_events_outside_of_time_range.yaml b/Alerting/Sample Watches/aggregated_issues_in_logs/tests/10_ignore_events_outside_of_time_range.yaml new file mode 100644 index 00000000..6a8cb458 --- /dev/null +++ b/Alerting/Sample Watches/aggregated_issues_in_logs/tests/10_ignore_events_outside_of_time_range.yaml @@ -0,0 +1,31 @@ +--- + +# yamllint disable rule:line-length rule:comments-indentation + +watch_name: aggregated_issues_in_logs +mapping_file: ./aggregated_issues_in_logs/mapping.json +index: log_network-switch__v1_1984-05-23 +type: doc +match: false +watch_file: ./aggregated_issues_in_logs/watch.yaml +scripts: + - name: 'aggregated_issues_in_logs-index_transform' + path: './aggregated_issues_in_logs/scripts/aggregated_issues_in_logs-index_transform.yaml' + - name: 'aggregated_issues_in_logs-deterministic_log' + path: './aggregated_issues_in_logs/scripts/aggregated_issues_in_logs-deterministic_log.yaml' + +events: + + - id: 1 + offset: -9000 + msg: 'Events in the past should not trigger. PS: Over 9000!' + severity: 2 + '#source': test + host: gnu.example.org + + - id: 2 + offset: -1800 + msg: Events in the future should not trigger. 
+ severity: 2 + '#source': test + host: gnu.example.org diff --git a/Alerting/Sample Watches/aggregated_issues_in_logs/tests/10_severity.yaml b/Alerting/Sample Watches/aggregated_issues_in_logs/tests/10_severity.yaml new file mode 100644 index 00000000..147df3e3 --- /dev/null +++ b/Alerting/Sample Watches/aggregated_issues_in_logs/tests/10_severity.yaml @@ -0,0 +1,101 @@ +--- + +# yamllint disable rule:line-length rule:comments-indentation + +watch_name: aggregated_issues_in_logs +mapping_file: ./aggregated_issues_in_logs/mapping.json +index: log_network-switch__v1_1984-05-23 +type: doc +match: true +watch_file: ./aggregated_issues_in_logs/watch.yaml +scripts: + - name: 'aggregated_issues_in_logs-index_transform' + path: './aggregated_issues_in_logs/scripts/aggregated_issues_in_logs-index_transform.yaml' + - name: 'aggregated_issues_in_logs-deterministic_log' + path: './aggregated_issues_in_logs/scripts/aggregated_issues_in_logs-deterministic_log.yaml' + +events: + + - id: 1 + offset: -5400 + msg: Ports Unrecoverable fault on PoE controller 1. + severity: 5 + '#source': switch-eventlog + host: gnu.example.org + category: chassis + + - id: 2 + offset: -5400 + msg: Ports Unrecoverable fault on PoE controller 2. + severity: 4 + '#source': switch-eventlog + host: gnu.example.org + category: chassis + + - id: 3 + offset: -5400 + msg: Ports Unrecoverable fault on PoE controller 3. + severity: 4 + '#source': different-source + host: gnu.example.org + category: chassis + + - id: 4 + offset: -5400 + msg: Ports Unrecoverable fault on PoE controller 2. + severity: 4 + '#source': switch-eventlog + host: gnu.example.org + category: chassis + + - id: 5 + offset: -5400 + msg: Ports Unrecoverable fault on PoE controller 3. 
+ severity: 4 + '#source': switch-eventlog + host: gnu.example.org + category: chassis + + - id: 6 + offset: -5400 + msg: Harmless message + severity: 0 + '#source': switch-eventlog + host: gnu.example.org + category: chassis + + - id: 7 + offset: -5400 + msg: 'mac denied association on radio ''mux.example.org:R1'' : unsupported + data-rates ' + severity: 5 + '#source': switch-eventlog + host: mux.example.org + category: client + + - id: 8 + offset: -5400 + msg: 'mac denied association on radio ''host2.example.org:R1'' : unsupported + data-rates ' + severity: 5 + '#source': switch-eventlog + host: host2.example.org + category: client + + - id: 9 + offset: -5400 + msg: 'mac denied association on radio ''mux.example.org:R1'' : unsupported + data-rates ' + severity: 5 + '#source': switch-eventlog + host: mux.example.org + category: client + + - id: 13 + offset: -5400 + severity: 5 + msg: failed WPA2-AES handshake on wlan 'example-ssid' radio 'ap01.example.org:R2' + '#source': wireless-eventlog + host: ap01.example.org + +expected_response: "Watch payload to be indexed by Elasticsearch: {_doc=[{severity=4, msg=Ports Unrecoverable fault on PoE controller 2., @last_timestamp=0, #timeframe=0, #doc_ref=#/doc/switch-eventlog-*/log_network-switch__v1_1984-05-23/doc?id=2, @first_timestamp=0, #count=2, #source=[switch-eventlog], #doc_context_ref=#/context/switch-eventlog-*/doc/2, host=gnu.example.org, doc={offset=-5400, id=2, category=chassis}, #watch_timestamp=2023-05-23T23:23:23.555Z}, {severity=4, msg=Ports Unrecoverable fault on PoE controller 3., @last_timestamp=0, #timeframe=0, #doc_ref=#/doc/switch-eventlog-*/log_network-switch__v1_1984-05-23/doc?id=5, @first_timestamp=0, #count=2, #source=[different-source, switch-eventlog], #doc_context_ref=#/context/switch-eventlog-*/doc/5, host=gnu.example.org, doc={offset=-5400, id=5, category=chassis}, #watch_timestamp=2023-05-23T23:23:23.555Z}, {severity=5, msg=mac denied association on radio 'mux.example.org:R1' : unsupported 
data-rates , @last_timestamp=0, #timeframe=0, #doc_ref=#/doc/switch-eventlog-*/log_network-switch__v1_1984-05-23/doc?id=9, @first_timestamp=0, #count=2, #source=[switch-eventlog], #doc_context_ref=#/context/switch-eventlog-*/doc/9, host=mux.example.org, doc={offset=-5400, id=9, category=client}, #watch_timestamp=2023-05-23T23:23:23.555Z}, {severity=5, msg=Ports Unrecoverable fault on PoE controller 1., @last_timestamp=0, #timeframe=0, #doc_ref=#/doc/switch-eventlog-*/log_network-switch__v1_1984-05-23/doc?id=1, @first_timestamp=0, #count=1, #source=[switch-eventlog], #doc_context_ref=#/context/switch-eventlog-*/doc/1, host=gnu.example.org, doc={offset=-5400, id=1, category=chassis}, #watch_timestamp=2023-05-23T23:23:23.555Z}, {severity=5, msg=mac denied association on radio 'host2.example.org:R1' : unsupported data-rates , @last_timestamp=0, #timeframe=0, #doc_ref=#/doc/switch-eventlog-*/log_network-switch__v1_1984-05-23/doc?id=8, @first_timestamp=0, #count=1, #source=[switch-eventlog], #doc_context_ref=#/context/switch-eventlog-*/doc/8, host=host2.example.org, doc={offset=-5400, id=8, category=client}, #watch_timestamp=2023-05-23T23:23:23.555Z}, {severity=0, msg=Harmless message, @last_timestamp=0, #timeframe=0, #doc_ref=#/doc/switch-eventlog-*/log_network-switch__v1_1984-05-23/doc?id=6, @first_timestamp=0, #count=1, #source=[switch-eventlog], #doc_context_ref=#/context/switch-eventlog-*/doc/6, host=gnu.example.org, doc={offset=-5400, id=6, category=chassis}, #watch_timestamp=2023-05-23T23:23:23.555Z}]}" diff --git a/Alerting/Sample Watches/aggregated_issues_in_logs/tests/10_severity_simple.yaml b/Alerting/Sample Watches/aggregated_issues_in_logs/tests/10_severity_simple.yaml new file mode 100644 index 00000000..45ede358 --- /dev/null +++ b/Alerting/Sample Watches/aggregated_issues_in_logs/tests/10_severity_simple.yaml @@ -0,0 +1,39 @@ +--- + +# yamllint disable rule:line-length rule:comments-indentation + +watch_name: aggregated_issues_in_logs +mapping_file: 
./aggregated_issues_in_logs/mapping.json +index: log_network-switch__v1_1984-05-23 +type: doc +match: true +watch_file: ./aggregated_issues_in_logs/watch.yaml +scripts: + - name: 'aggregated_issues_in_logs-index_transform' + path: './aggregated_issues_in_logs/scripts/aggregated_issues_in_logs-index_transform.yaml' + - name: 'aggregated_issues_in_logs-deterministic_log' + path: './aggregated_issues_in_logs/scripts/aggregated_issues_in_logs-deterministic_log.yaml' + +events: + + - id: 1 + offset: -5400 + msg: Harmless message + severity: 2 + '#source': test + host: sw01.example.org + category: chassis + # user: 'test string' + + - id: 2 + offset: -5400 + msg: Ports Unrecoverable fault on PoE controller 2. + severity: 5 + '#source': test2 + host: sw01.example.org + category: chassis + user: + id: 23 + name: 'fnord' + +expected_response: "Watch payload to be indexed by Elasticsearch: {_doc=[{severity=2, msg=Harmless message, @last_timestamp=0, #timeframe=0, #doc_ref=#/doc/test-*/log_network-switch__v1_1984-05-23/doc?id=1, @first_timestamp=0, #count=1, #source=[test], #doc_context_ref=#/context/test-*/doc/1, host=sw01.example.org, doc={offset=-5400, id=1, category=chassis}, #watch_timestamp=2023-05-23T23:23:23.555Z}, {severity=5, msg=Ports Unrecoverable fault on PoE controller 2., @last_timestamp=0, #timeframe=0, #doc_ref=#/doc/test2-*/log_network-switch__v1_1984-05-23/doc?id=2, @first_timestamp=0, #count=1, #source=[test2], #doc_context_ref=#/context/test2-*/doc/2, host=sw01.example.org, doc={offset=-5400, user_object={name=fnord, id=23}, id=2, category=chassis}, #watch_timestamp=2023-05-23T23:23:23.555Z}]}" diff --git a/Alerting/Sample Watches/aggregated_issues_in_logs/tests_disabled/90_deploy_to_production_empty_test_data.yaml b/Alerting/Sample Watches/aggregated_issues_in_logs/tests_disabled/90_deploy_to_production_empty_test_data.yaml new file mode 100644 index 00000000..545e19d9 --- /dev/null +++ b/Alerting/Sample 
Watches/aggregated_issues_in_logs/tests_disabled/90_deploy_to_production_empty_test_data.yaml @@ -0,0 +1,28 @@ +--- + +# yamllint disable rule:line-length + +## This test is mainly here to have something which can be run against +## production without polluting anything. +## Has been disabled because it cannot be tested anymore: the watch +## expects different script ids to protect production in case of errors. + +watch_name: aggregated_issues_in_logs +mapping_file: ./aggregated_issues_in_logs/mapping.json +index: log_network-switch__v1_1984-05-23 +type: doc +match: false +watch_file: ./aggregated_issues_in_logs/watch.yaml +scripts: + - name: 'aggregated_issues_in_logs-index_transform' + path: './aggregated_issues_in_logs/scripts/aggregated_issues_in_logs-index_transform.yaml' + - name: 'aggregated_issues_in_logs-deterministic_log' + path: './aggregated_issues_in_logs/scripts/aggregated_issues_in_logs-deterministic_log.yaml' + +events: + + ## Watcher would fail if it does not find any document/indexes with a @timestamp field: + ## "SearchPhaseExecutionException[all shards failed]; nested: + ## QueryShardException[No mapping found for [@timestamp] in order to sort on]; + - id: 1 + offset: 0 diff --git a/Alerting/Sample Watches/aggregated_issues_in_logs/watch.yaml b/Alerting/Sample Watches/aggregated_issues_in_logs/watch.yaml new file mode 100644 index 00000000..64c75b38 --- /dev/null +++ b/Alerting/Sample Watches/aggregated_issues_in_logs/watch.yaml @@ -0,0 +1,219 @@ +--- + +# yamllint disable rule:line-length rule:comments-indentation + +metadata: + ## Use an offset to allow events to travel through their pipeline. + ## See: https://discuss.elastic.co/t/ensure-that-watcher-does-not-miss-documents-logs/127780/1 + time_offset: '30m' + time_window: '1h' + + index_category: 'log' + index_type: 'issues' + index_kv: '' + index_revision: 'v4' + +## We never want to throttle/drop actions.
throttle_period: '0s'

trigger:
  schedule:
    ## `hourly` is based on ctx.metadata.time_window.
    hourly:
      ## `minute` can be changed freely; the actual time range is
      ## calculated/rounded to full hours in the range filter.
      minute: 32

input:
  search:
    timeout: '5m'
    request:

      indices:
        ## System and hidden indexes should be excluded.
        ##
        ## Also, the output index of this watch should be excluded for obvious
        ## reasons. Even without this it would be ensured because we enforce
        ## the '@timestamp' field to exist in all documents found.

        ## Include all (log) index-sets by default and blacklist later.
        - '*-*'

        ## Exclude system and hidden indexes.
        - '-.*'

        ## Exclude our own output index-set.
        - '-critical-logs-*'
        - '-log_issues_*'

        # ## With index naming convention version 2, we could simply switch to
        # ## a whitelist approach, which would be much better.
        # - 'logs_*'
        # - '-log_issues_*'
        # ## Exclude all non-prod logs.
        # - '-logs_*_env=*'

      body:

        size: 0
        # _source:
        #   - 'msg'
        # sort:
        #   - '@timestamp':
        #       order: 'desc'

        query:
          bool:
            must:
              ## We require these fields to exist.
              ## If they don't, the aggregation might be shortened/incomplete,
              ## which would cause the scripts to fail.
              - exists:
                  field: '@timestamp'
              - exists:
                  field: '#source.keyword'
              - exists:
                  field: 'severity'
              - exists:
                  field: 'host.keyword'
              - exists:
                  field: 'msg.keyword'

              - range:
                  '@timestamp':
                    gte: 'now-{{ctx.metadata.time_window}}-{{ctx.metadata.time_offset}}/h'
                    lt: 'now-{{ctx.metadata.time_offset}}/h'

              - bool:
                  should:

                    - bool:
                        ## Include events
                        ## * severity: Critical or worse
                        ## * severity: Notice or worse AND msg contains one or more keywords
                        must:
                          - range:
                              severity:
                                lte: 5
                          - bool:
                              should:
                                - terms:
                                    msg:
                                      - attack
                                      - crash
                                      - conflict
                                      - critical
                                      - denied
                                      - down
                                      - dump
                                      - error
                                      - exit
                                      - fail
                                      - failed
                                      - fatal
                                      - fault
                                      - overflow
                                      - poison
                                      - quit
                                      - restart
                                      - unable
                                - range:
                                    severity:
                                      lte: 2

                        must_not:
                          - terms:
                              '#source.keyword':
                                - 'monitoring-alerts'

                    - bool:
                        must:
                          - term:
                              '#source.keyword': 'wireless-eventlog'
                          - bool:
                              should:

                                - match_phrase:
                                    msg:
                                      query: 'denied association on radio'

                                - regexp:
                                    'msg.keyword': '.*failed .* on wlan.*'


        ## Aggregations are ordered from low bucket size to high bucket size
        ## where possible to keep the total bucket count down.
        aggregations:
          severity:
            terms:
              field: 'severity'
              size: 100
            aggregations:
              host:
                terms:
                  field: 'host.keyword'
                  size: 100000
                aggregations:
                  msg:
                    terms:
                      field: 'msg.keyword'
                      size: 100000000
                    aggregations:
                      source:
                        terms:
                          field: '#source.keyword'
                          size: 100000000
                      min_timestamp:
                        top_hits:
                          size: 1
                          sort:
                            '@timestamp': 'asc'
                          _source:
                            includes:
                              - '@timestamp'
                      max_timestamp:
                        top_hits:
                          size: 1
                          sort:
                            '@timestamp': 'desc'
                          # _source:
                          #   includes:
                          #     - '@timestamp'
                          #     - '#source'

                      ## We would like to do the following, but currently can not.
                      ## Refer to aggregated_issues_in_logs-index_transform for details.
                      # example_doc:
                      #   top_hits:
                      #     size: 1
                      #     sort:
                      #       '_uid': 'asc'

condition:
  compare:
    ctx.payload.hits.total:
      gt: 0

transform:
  script:
    id: 'aggregated_issues_in_logs-index_transform'

actions:
  index_payload:
    index:
      ## Mustache templates are not supported here as of Elasticsearch 6.2,
      ## so we rewrite them via the Makefile.
      # index: '{{ctx.metadata.index_category}}_{{ctx.metadata.index_type}}_{{ctx.metadata.index_kv}}__{{ctx.metadata.index_revision}}_<{now/y{YYYY}}>'
      index: ''
      doc_type: 'doc'

  ## Needs to be removed when deploying to production. This is done in the Makefile.
  log:

    ## Bug: This transform also influences the `index_payload` action?
    transform:
      script:
        id: 'aggregated_issues_in_logs-deterministic_log'

    logging:
      level: info
      text: 'Watch payload to be indexed by Elasticsearch: {{ctx.payload}}'
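The range filter above relies on Elasticsearch date math: both `gte` and `lt` round down to the full hour with `/h`, after subtracting `time_window` and/or `time_offset`. As a minimal sketch (not part of the watch; `watch_window` is a made-up helper name), the resolved window can be reproduced outside Elasticsearch like this:

```python
from datetime import datetime, timedelta


def watch_window(now, offset=timedelta(minutes=30), window=timedelta(hours=1)):
    """Mimic the watch's range filter date math:
    gte: now - window - offset, rounded down to the hour ("/h")
    lt:  now - offset, rounded down to the hour ("/h")
    """
    def floor_hour(ts):
        # Elasticsearch rounds both `gte` and `lt` down for "/h".
        return ts.replace(minute=0, second=0, microsecond=0)

    gte = floor_hour(now - window - offset)
    lt = floor_hour(now - offset)
    return gte, lt


# A run triggered at minute 32 selects the complete previous-but-one hour:
gte, lt = watch_window(datetime(2018, 4, 16, 6, 32))
print(gte, lt)  # 2018-04-16 05:00:00 2018-04-16 06:00:00
```

This matches the README: for a trigger around 06:32, the first included timestamp is 05:00:00.000 and, because `lt` excludes the 06:00:00.000 boundary, the last is 05:59:59.999.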