Description
To facilitate machine readability, log entries should be structured in a dictionary-like format. Structured logging is also known as semantic or typed logging. Examples of such structured formats are key-value pairs, JSON and XML. In general, flat text entries are more concise, trading some ambiguity for human readability; machine-readable entries are more verbose and explicit, shunning ambiguity.
The status quo for audit logging is a mixture of the templated and the structured entry types. In general, the templated entry type is a good tradeoff between human readability and the dictionary-like format. It is not ambiguous and is also more concise than the structured type. Here is an example entry from the current implementation:
[2018-05-30T07:11:39,605] [transport] [access_granted] origin_type=[local_node], origin_address=[127.0.0.1], principal=[_xpack_security], roles=[superuser], action=[indices:data/read/search[free_context/scroll]]
Unfortunately, in the current implementation, entries are ambiguous because the set of allowed characters in user and role names includes '=', ' ', '[' and ']'. Fixing this would allow for unambiguous templated log entries.
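As an illustration of that ambiguity, the following sketch (user, role and action names are hypothetical) renders two different events to the same templated line, because role names may legally contain the delimiter characters:

```python
# Two different events render to the same templated line, because role names
# may legally contain '=', ' ', '[' and ']'. Names here are hypothetical.
event_a = {"principal": "jdoe", "roles": ["superuser"], "action": "login"}
event_b = {"principal": "jdoe", "roles": ["superuser], action=[login"]}

def render(event):
    # Naive templated rendering in the style of the current format.
    parts = []
    for key, value in event.items():
        if isinstance(value, list):
            value = ", ".join(value)
        parts.append(f"{key}=[{value}]")
    return ", ".join(parts)

# Both print: principal=[jdoe], roles=[superuser], action=[login]
print(render(event_a))
print(render(event_b))
```

No parser, however clever, can distinguish the two events from the rendered line alone; escaping or restricting the delimiter characters is required.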
Unambiguous means machine readable, i.e. you can write a regex to parse a specific field, so why bother with the verbose structured format? The answer is that the next step after parsing is indexing, and indexing requires that event fields have a name and a type. The parser of a templated entry infers the field name and type from the token's position. But there may be several entry types, each requiring different positional parsing, so one has to create several patterns. What if patterns overlap, so that several can match the same line? Or, more commonly, what if parsers assign the same field name to fields with different data types? This is quite possible because the code which generates entries and the code which parses them and assigns type names are in different projects. For that matter, it is also relatively easy for the generating and the parsing pieces to go out of sync with each other, requiring integration tests...
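A minimal sketch of such positional parsing, with hypothetical entry types and patterns, shows the maintenance burden: every entry type needs its own pattern, and field names and types exist only in the parser, not in the entry itself:

```python
import re

# One regex per entry type; field names are inferred from token positions.
# Entry types and patterns are hypothetical, for illustration only.
PATTERNS = {
    "access_granted": re.compile(
        r"\[access_granted\]\s+principal=\[(?P<principal>.*?)\],\s+action=\[(?P<action>.*?)\]"
    ),
    "connection_denied": re.compile(
        r"\[connection_denied\]\s+origin_address=\[(?P<origin_address>.*?)\]"
    ),
}

def parse(line):
    # Try each pattern in turn; the first match wins, so overlapping
    # patterns silently depend on dictionary order.
    for name, pattern in PATTERNS.items():
        match = pattern.search(line)
        if match:
            return {"event.type": name, **match.groupdict()}
    return None

print(parse("[transport] [access_granted] principal=[_xpack_security], action=[cluster:monitor/health]"))
```

Nothing in this scheme prevents two patterns from binding the same field name to differently-typed values, and the patterns must be updated in lockstep with the code that renders the entries.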
Lecture over, here is the proposal:
The format of the entries should be JSON objects, UTF-8 encoded. This format is dictionary-like and unambiguous: each field has a name, and special characters in values are escaped. The new format will coexist alongside the present one, in a separate file trail.
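As a sketch, the example entry from the current implementation above might be re-rendered as a single-line JSON object (the field names here are illustrative, not the final schema):

```python
import json

# The access_granted example, rendered as a flat JSON object.
# Field names are illustrative placeholders, not a committed schema.
entry = {
    "@timestamp": "2018-05-30T07:11:39,605",
    "layer": "transport",
    "event.type": "access_granted",
    "origin.type": "local_node",
    "origin.address": "127.0.0.1",
    "principal": "_xpack_security",
    "roles": ["superuser"],
    "action": "indices:data/read/search[free_context/scroll]",
}

# One JSON object per line; json.dumps escapes any special characters,
# so values containing '=', '[' or ']' can no longer confuse a parser.
line = json.dumps(entry, ensure_ascii=False)
print(line)
```

A consumer can round-trip the line with `json.loads` and gets every field back by name, with no positional patterns to maintain.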
Field names should align to the Elastic Common Schema (ECS). ECS encourages dot notation for field names, i.e. related fields share a common prefix. This proposal will follow that suggestion, but note that entries will not be nested: values are not dictionaries.
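To make the flat-versus-nested distinction concrete, here is a sketch contrasting the two shapes (field names illustrative):

```python
import json

# Flat, dot-notation field names versus a nested object.
# The proposal uses the flat form: related fields share a prefix,
# but no value is itself a dictionary.
flat = {"origin.type": "local_node", "origin.address": "127.0.0.1"}
nested = {"origin": {"type": "local_node", "address": "127.0.0.1"}}

# In the flat form every value is a scalar (or a list of scalars).
assert not any(isinstance(v, dict) for v in flat.values())
print(json.dumps(flat))
```

Keeping entries flat means consumers never need to walk nested structures, while the shared `origin.` prefix still groups related fields.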
The stretch goal is to have the format configurable, so that even the SYSLOG message format from RFC 5424 might be supported. One possible avenue for this is the message formats in Log4j 2.
This proposal is a narrowing of #8786 to the scope of the audit log.
Relates to #29881 .