[ML] Transform audit message improvements 

Feedback on 8.0.0, build `{
  "build" : {
    "hash" : "bbea61a232db278db2941e726bbfec16bea8d7f6",
    "date" : "2019-08-21T14:20:38.772853Z"
  },`

// missing Insufficient memory message

Continuous transform audit messages no longer include the message below. It was useful because it shows the transform taking corrective action, and if it appears frequently it indicates that resources are constrained. The message is still logged, but no longer audited. Where did it go? Can we have it back?

`Data frame transform [transform-01]:Insufficient memory for search, reducing number of buckets per search from [500] to [365]`
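To make the request concrete, here is a minimal sketch of a page size reduction that is audited as well as logged. Everything in it is an assumption for illustration: the class `PageSizeReducer`, the in-memory `audits` list standing in for a real auditor, the floor of 10, and the logarithmic reduction factor. The factor was chosen only because it reproduces the `[500]` to `[365]` step in the message above; the real transform code may compute it differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the actual transform implementation. */
final class PageSizeReducer {
    /** Assumed lower bound for the number of buckets per search. */
    static final int MIN_PAGE_SIZE = 10;

    /** Collected audit messages; stands in for a real audit sink. */
    final List<String> audits = new ArrayList<>();

    /** Assumed heuristic: larger pages shrink proportionally more. */
    static int reducedPageSize(int pageSize) {
        double factor = 1.0 - Math.log10(pageSize) * 0.1;
        return (int) Math.round(factor * pageSize);
    }

    /**
     * Called when a search fails for lack of memory.
     * Returns the new page size, or -1 if no further reduction is possible.
     */
    int onInsufficientMemory(String transformId, int pageSize) {
        int next = reducedPageSize(pageSize);
        if (next < MIN_PAGE_SIZE) {
            // Terminal case: audit the root cause before giving up.
            audits.add("[" + transformId + "] Insufficient memory for search after repeated "
                + "page size reductions to [" + pageSize + "], unable to continue");
            return -1;
        }
        // Corrective action: audit it so frequent reductions are visible to the user.
        audits.add("[" + transformId + "] Insufficient memory for search, reducing number of "
            + "buckets per search from [" + pageSize + "] to [" + next + "]");
        return next;
    }
}
```

The point of the sketch is only that both branches write to the audit sink, so a user can see repeated reductions without reading the node logs.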

// frequency of messages

In a system with long-running continuous data frames and index creation / deletion, a `Failed to retrieve checkpoint` message appears periodically.

![image](https://user-images.githubusercontent.com/4185750/63514200-f9d8d900-c4df-11e9-8ead-cc756d1f3d87.png)

Looking in the logs, this comes from a failure to retrieve checkpoint info, which is in turn caused by a global mismatch. We retry.

1. Do we need to show this to the user? If the retries fail, then we should; otherwise I'm not sure it is worth auditing. If we think it is, the message presented to the user should indicate that a retry will occur.

2. We only show this audit message the first time it occurs. Subsequent similar messages throughout the night were not audited (but they were in the logs). What is the logic for the frequency of auditing this? (I saw 9 log entries for one transform but only 1 audit.)
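For discussion, one possible frequency policy is sketched below: audit the first occurrence, suppress repeats inside a time window, and report how many were suppressed the next time the message is audited. This is not a description of the current behavior, which is exactly what the question above asks about. The class `AuditThrottle`, the window length, and the suffix wording are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a per-key audit throttle; not existing transform code. */
final class AuditThrottle {
    private final long windowMillis;
    private final Map<String, Long> lastAudited = new HashMap<>();
    private final Map<String, Integer> suppressed = new HashMap<>();

    AuditThrottle(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /**
     * Returns the text that should be audited, or null if this occurrence
     * falls inside the window and is suppressed.
     */
    String offer(String key, String message, long nowMillis) {
        Long last = lastAudited.get(key);
        if (last != null && nowMillis - last < windowMillis) {
            // Repeat inside the window: count it, but do not audit.
            suppressed.merge(key, 1, Integer::sum);
            return null;
        }
        lastAudited.put(key, nowMillis);
        Integer skipped = suppressed.remove(key);
        return skipped == null
            ? message
            : message + " (" + skipped + " similar messages suppressed)";
    }
}
```

With a policy like this, 9 log entries would not collapse silently into 1 audit: a later audit entry would carry the suppressed count.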

// Failed transform

In the case below, the number of buckets per search was set to 10. A circuit breaker exception (CBE) occurred due to load in the cluster from other processes. The 10 subsequent retries were fired in quick succession and also caused CBEs [1]. The final audit message indicated that the cluster state could not be updated. The more useful audit message would have been the one logged just before it: insufficient memory, unable to continue.

(Note that this may have been due to the unfortunate coincidence of two failures. Both could be audited, with the insufficient memory message being the more informative.)

![image](https://user-images.githubusercontent.com/4185750/63514570-c64a7e80-c4e0-11e9-91e9-7f9df6827c4a.png)

```
[2019-08-22T05:17:53,886][ERROR][o.e.x.d.t.DataFrameTransformTask] [node2] Data frame transform [transform-05]: Insufficient memory for search after repeated page size reductions to [10], unable to continue pivot, please simplify job or increase heap size on data nodes.
[2019-08-22T05:17:54,066][ERROR][o.e.x.d.t.DataFrameTransformTask] [node2] Failed to update state for data frame transform [transform-05]
org.elasticsearch.transport.RemoteTransportException: [node1][127.0.0.1:9351][cluster:admin/persistent/update_status]
```
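A minimal sketch of how both failures could reach the audit log, assuming the root cause is audited before the state update is attempted. The names `FailureAuditSketch` and `StatePersister` and the in-memory `audits` list are hypothetical and do not correspond to the real task classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the actual DataFrameTransformTask logic. */
final class FailureAuditSketch {
    /** Stand-in for whatever persists the failed state to the cluster. */
    interface StatePersister {
        void persistFailed(String reason) throws Exception;
    }

    /** Collected audit messages; stands in for a real audit sink. */
    final List<String> audits = new ArrayList<>();

    void markFailed(String reason, StatePersister persister) {
        // Audit the root cause first so it survives a failed state update.
        audits.add(reason);
        try {
            persister.persistFailed(reason);
        } catch (Exception e) {
            // The secondary failure is audited too, after the root cause.
            audits.add("Failed to update state: " + e.getMessage());
        }
    }
}
```

Ordering the calls this way means the insufficient memory message appears in the audit trail even when the subsequent state update fails.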
cc @benwtrent 

[1] A topic for a different discussion on throttling and back off.
 


[ML] Transform audit message improvements #45834
