-
Notifications
You must be signed in to change notification settings - Fork 25.6k
[ML][Data Frame] treat bulk index failures as an indexing failure #44351
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML][Data Frame] treat bulk index failures as an indexing failure #44351
Conversation
|
Pinging @elastic/ml-core |
|
|
||
| // Force stop the transform as bulk indexing caused it to go into a failed state | ||
| stopDataFrameTransform(transformId, true); | ||
| deleteIndex(dataFrameIndex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I delete the index here as these tests keep created indices by default. I delete it so that it can be ran with repeatedly with the -Dtests.iters flag
| + " \"field\": \"stars\"" | ||
| + " } } } }" | ||
| + " } } } }," | ||
| + "\"frequency\":\"1s\"" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default frequency is low (meaning more time between triggered actions) by default. Making it 1s for our tests so that things can progress more quickly.
| "] bulk index failures. See the logs of the node running the transform for details. " + | ||
| bulkResponse.buildFailureMessage()); | ||
| auditBulkFailures = false; | ||
| if (auditBulkFailures) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should only write this large audit once per page. Now that we fail on bulk failures, it could occur more than once per page (possibly causing the task to fail).
| new BulkIndexFailure("Bulk index experienced failures. " + | ||
| "See the logs of the node running the transform for details.")); | ||
| } else { | ||
| auditBulkFailures = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can audit the bulk failures again on the next page
| // This calls AsyncTwoPhaseIndexer#finishWithIndexingFailure | ||
| // It increments the indexing failure, and then calls the `onFailure` logic | ||
| nextPhase.onFailure( | ||
| new BulkIndexFailure("Bulk index experienced failures. " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I manually created a logging message here so that it is reliably the same thing each time. There could be different bulk failures on each attempt and thus gets logged/audit too many times as we continue to retry the indexing request.
| } | ||
|
|
||
| // Considered a recoverable indexing failure | ||
| private static class BulkIndexFailure extends Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These exception classes really only need to be seen by the isIrrecoverableFailure method. That method currently does not know about them, but it eventually will.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I saw this I thought it was a bit dangerous passing it to the onFailure method of a listener. If this needed to be transported to a different node then it would cause an error because the exception transport wouldn't know how to serialise it.
I think it's OK as the code stands now because it's consumed on the same node where it's thrown.
I also think it's best practice that exception class names end in Exception. Certainly all the classes that extend ElasticsearchException follow this naming pattern.
hendrikmuhs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| } | ||
|
|
||
| // Considered a recoverable indexing failure | ||
| private static class BulkIndexFailure extends Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I saw this I thought it was a bit dangerous passing it to the onFailure method of a listener. If this needed to be transported to a different node then it would cause an error because the exception transport wouldn't know how to serialise it.
I think it's OK as the code stands now because it's consumed on the same node where it's thrown.
I also think it's best practice that exception class names end in Exception. Certainly all the classes that extend ElasticsearchException follow this naming pattern.
| // It increments the indexing failure, and then calls the `onFailure` logic | ||
| nextPhase.onFailure( | ||
| new BulkIndexFailure("Bulk index experienced failures. " + | ||
| "See the logs of the node running the transform for details.")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I saw this message my immediate question would be "which node is that?" How hard would it be to add this?
Adding the node name will not cause the message to be logged more frequently, because if the transform switches nodes then the local variables that remember the message has already been logged will be different.
If it's really hard or messy to add the node name then just add a TODO for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@droberts195 when we write the audit message there is a field node_name: <nodeName> in the recorded audit message. Additionally, if they see the log, then they are already on the node.
droberts195
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…astic#44351) * [ML][Data Frame] treat bulk index failures as an indexing failure * removing redundant public modifier * changing to an ElasticsearchException * fixing redundant public modifier
…astic#44351) * [ML][Data Frame] treat bulk index failures as an indexing failure * removing redundant public modifier * changing to an ElasticsearchException * fixing redundant public modifier
Since 7.2 Bulk index failures have been skipped over and cause a completely silent failure. This is a very bad user experience as their transform could be writing NO data, but scrolling through all index without stopping.
In 7.3, with continuously reading from source index, this bad experience is exacerbated.
The only concern here is that our frequency is STATIC. So, we will not attempt this bulk action again until the task is triggered again. Which is
60sby default.closes #44101