Skip to content

Conversation

@benwtrent
Copy link
Member

@benwtrent benwtrent commented Jul 15, 2019

Since 7.2 Bulk index failures have been skipped over and cause a completely silent failure. This is a very bad user experience as their transform could be writing NO data, but scrolling through all index without stopping.

In 7.3, with continuously reading from source index, this bad experience is exacerbated.

The only concern here is that our frequency is STATIC. So, we will not attempt this bulk action again until the task is triggered again. Which is 60s by default.

closes #44101

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core


// Force stop the transform as bulk indexing caused it to go into a failed state
stopDataFrameTransform(transformId, true);
deleteIndex(dataFrameIndex);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I delete the index here as these tests keep created indices by default. I delete it so that it can be ran with repeatedly with the -Dtests.iters flag

+ " \"field\": \"stars\""
+ " } } } }"
+ " } } } },"
+ "\"frequency\":\"1s\""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default frequency is low (meaning more time between triggered actions) by default. Making it 1s for our tests so that things can progress more quickly.

"] bulk index failures. See the logs of the node running the transform for details. " +
bulkResponse.buildFailureMessage());
auditBulkFailures = false;
if (auditBulkFailures) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should only write this large audit once per page. Now that we fail on bulk failures, it could occur more than once per page (possibly causing the task to fail).

new BulkIndexFailure("Bulk index experienced failures. " +
"See the logs of the node running the transform for details."));
} else {
auditBulkFailures = true;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can audit the bulk failures again on the next page

// This calls AsyncTwoPhaseIndexer#finishWithIndexingFailure
// It increments the indexing failure, and then calls the `onFailure` logic
nextPhase.onFailure(
new BulkIndexFailure("Bulk index experienced failures. " +
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I manually created a logging message here so that it is reliably the same thing each time. There could be different bulk failures on each attempt and thus gets logged/audit too many times as we continue to retry the indexing request.

}

// Considered a recoverable indexing failure
private static class BulkIndexFailure extends Exception {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These exception classes really only need to be seen by the isIrrecoverableFailure method. That method currently does not know about them, but it eventually will.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I saw this I thought it was a bit dangerous passing it to the onFailure method of a listener. If this needed to be transported to a different node then it would cause an error because the exception transport wouldn't know how to serialise it.

I think it's OK as the code stands now because it's consumed on the same node where it's thrown.

I also think it's best practice that exception class names end in Exception. Certainly all the classes that extend ElasticsearchException follow this naming pattern.

Copy link

@hendrikmuhs hendrikmuhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

}

// Considered a recoverable indexing failure
private static class BulkIndexFailure extends Exception {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I saw this I thought it was a bit dangerous passing it to the onFailure method of a listener. If this needed to be transported to a different node then it would cause an error because the exception transport wouldn't know how to serialise it.

I think it's OK as the code stands now because it's consumed on the same node where it's thrown.

I also think it's best practice that exception class names end in Exception. Certainly all the classes that extend ElasticsearchException follow this naming pattern.

// It increments the indexing failure, and then calls the `onFailure` logic
nextPhase.onFailure(
new BulkIndexFailure("Bulk index experienced failures. " +
"See the logs of the node running the transform for details."));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I saw this message my immediate question would be "which node is that?" How hard would it be to add this?

Adding the node name will not cause the message to be logged more frequently, because if the transform switches nodes then the local variables that remember the message has already been logged will be different.

If it's really hard or messy to add the node name then just add a TODO for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@droberts195 when we write the audit message there is a field node_name: <nodeName> in the recorded audit message. Additionally, if they see the log, then they are already on the node.

Copy link

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jpountz jpountz added v7.3.1 and removed v7.3.0 labels Jul 15, 2019
@benwtrent benwtrent merged commit 92709f5 into elastic:master Jul 16, 2019
@benwtrent benwtrent deleted the bug/ml-df-fail-on-index-failures branch July 16, 2019 12:47
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Jul 16, 2019
…astic#44351)

* [ML][Data Frame] treat bulk index failures as an indexing failure

* removing redundant public modifier

* changing to an ElasticsearchException

* fixing redundant public modifier
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Jul 16, 2019
…astic#44351)

* [ML][Data Frame] treat bulk index failures as an indexing failure

* removing redundant public modifier

* changing to an ElasticsearchException

* fixing redundant public modifier
benwtrent added a commit that referenced this pull request Jul 16, 2019
…4351) (#44427)

* [ML][Data Frame] treat bulk index failures as an indexing failure

* removing redundant public modifier

* changing to an ElasticsearchException

* fixing redundant public modifier
benwtrent added a commit that referenced this pull request Jul 16, 2019
…4351) (#44428)

* [ML][Data Frame] treat bulk index failures as an indexing failure

* removing redundant public modifier

* changing to an ElasticsearchException

* fixing redundant public modifier
@jpountz jpountz added v7.3.0 and removed v7.3.1 labels Jul 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[ML] Individual bulk index failures in data frame should be treated as index failures

6 participants