[ML][Data Frame] treat bulk index failures as an indexing failure #44351

benwtrent · 2019-07-15T14:38:38Z

Since 7.2, bulk index failures have been skipped over, resulting in a completely silent failure. This is a very bad user experience, as a transform could be writing NO data while still scrolling through the whole source index without stopping.

In 7.3, with continuous reading from the source index, this bad experience is exacerbated.

The only concern here is that our frequency is STATIC, so we will not attempt this bulk action again until the task is next triggered, which is every 60s by default.

closes #44101
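A minimal sketch of the behaviour change described above, using hypothetical stand-in types (`ItemResult`, `Listener`, `handleSilently`, `handleStrictly`) rather than the real Elasticsearch classes. It only illustrates the idea: a bulk response containing failed items goes down the failure path instead of being reported as a success.

```java
import java.util.List;

// Hypothetical stand-ins for illustration; these are not Elasticsearch classes.
final class BulkFailureSketch {

    /** One item of a bulk response; failureMessage is null when the item was indexed. */
    static final class ItemResult {
        final String docId;
        final String failureMessage;

        ItemResult(String docId, String failureMessage) {
            this.docId = docId;
            this.failureMessage = failureMessage;
        }

        boolean failed() {
            return failureMessage != null;
        }
    }

    /** Minimal listener with a success path and a failure path. */
    interface Listener {
        void onResponse(int indexedDocs);

        void onFailure(Exception e);
    }

    /** Old behaviour: failed items are skipped and the page still counts as processed. */
    static void handleSilently(List<ItemResult> items, Listener listener) {
        int indexed = (int) items.stream().filter(item -> !item.failed()).count();
        listener.onResponse(indexed);
    }

    /** New behaviour: any failed item fails the whole page so that it can be retried. */
    static void handleStrictly(List<ItemResult> items, Listener listener) {
        long failures = items.stream().filter(ItemResult::failed).count();
        if (failures > 0) {
            listener.onFailure(new Exception(
                "Bulk index experienced failures. See the logs of the node running the transform for details."));
            return;
        }
        listener.onResponse(items.size());
    }
}
```

With the strict variant, the caller decides what to do with the failure (count it, retry on the next trigger, or eventually fail the task); the sketch deliberately leaves that policy out.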

elasticmachine · 2019-07-15T14:38:42Z

Pinging @elastic/ml-core

benwtrent · 2019-07-15T14:39:35Z

...-tests/src/test/java/org/elasticsearch/xpack/dataframe/integration/DataFramePivotRestIT.java

+
+        // Force stop the transform as bulk indexing caused it to go into a failed state
+        stopDataFrameTransform(transformId, true);
+        deleteIndex(dataFrameIndex);


I delete the index here because these tests keep created indices by default. Deleting it allows the test to be run repeatedly with the -Dtests.iters flag.

benwtrent · 2019-07-15T14:40:28Z

...tests/src/test/java/org/elasticsearch/xpack/dataframe/integration/DataFrameRestTestCase.java

                + "         \"field\": \"stars\""
-                + " } } } }"
+                + " } } } },"
+                + "\"frequency\":\"1s\""


The default frequency is low (meaning more time between triggered actions). Setting it to 1s for our tests so that things can progress more quickly.
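For illustration, the string concatenation in the diff amounts to appending a top-level `frequency` field to the transform config body. The helper name `withFrequency` and the index names below are hypothetical and only mirror the shape of the test change.

```java
// Hypothetical helper for illustration; not part of the test case in this PR.
final class FrequencySketch {

    /**
     * Appends a top-level "frequency" field to a JSON config body that is
     * still missing its final closing brace, then closes the object.
     */
    static String withFrequency(String configWithoutClosingBrace, String frequency) {
        return configWithoutClosingBrace + ",\"frequency\":\"" + frequency + "\"}";
    }

    public static void main(String[] args) {
        // Index names are made up for the example.
        String config = "{\"source\":{\"index\":\"reviews\"},\"dest\":{\"index\":\"pivot_reviews\"}";
        System.out.println(withFrequency(config, "1s"));
    }
}
```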

benwtrent · 2019-07-15T14:41:12Z

...frame/src/main/java/org/elasticsearch/xpack/dataframe/transforms/DataFrameTransformTask.java

-                                "] bulk index failures. See the logs of the node running the transform for details. " +
-                                bulkResponse.buildFailureMessage());
-                        auditBulkFailures = false;
+                        if (auditBulkFailures) {


We should only write this large audit once per page. Now that we fail on bulk failures, it could occur more than once per page (possibly causing the task to fail).

benwtrent · 2019-07-15T14:41:29Z

...frame/src/main/java/org/elasticsearch/xpack/dataframe/transforms/DataFrameTransformTask.java

+                            new BulkIndexFailure("Bulk index experienced failures. " +
+                                "See the logs of the node running the transform for details."));
+                    } else {
+                        auditBulkFailures = true;


We can audit the bulk failures again on the next page
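The once-per-page audit described in these two comments can be sketched roughly as follows. Only the `auditBulkFailures` flag name comes from the diff; the class, method, and message text are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; only the auditBulkFailures flag name comes from the diff.
final class AuditOncePerPageSketch {
    private boolean auditBulkFailures = true;
    final List<String> auditMessages = new ArrayList<>();

    /** Returns true when the page was indexed cleanly, false when it must be retried. */
    boolean onBulkResponse(boolean hasFailures, String failureDetails) {
        if (hasFailures) {
            if (auditBulkFailures) {
                auditMessages.add("Experienced bulk index failures: " + failureDetails);
                auditBulkFailures = false; // stay quiet while this page is retried
            }
            return false;
        }
        auditBulkFailures = true; // the next page may write its own audit
        return true;
    }

    public static void main(String[] args) {
        AuditOncePerPageSketch task = new AuditOncePerPageSketch();
        task.onBulkResponse(true, "first attempt");
        task.onBulkResponse(true, "retry of the same page");
        task.onBulkResponse(false, null);
        task.onBulkResponse(true, "failure on a later page");
        System.out.println(task.auditMessages.size() + " audit messages written");
    }
}
```

In this run the retry of the first page is not audited, so two audit messages are written: one for the first page and one for the later page.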

benwtrent · 2019-07-15T14:42:33Z

...frame/src/main/java/org/elasticsearch/xpack/dataframe/transforms/DataFrameTransformTask.java

+                        // This calls AsyncTwoPhaseIndexer#finishWithIndexingFailure
+                        // It increments the indexing failure, and then calls the `onFailure` logic
+                        nextPhase.onFailure(
+                            new BulkIndexFailure("Bulk index experienced failures. " +


I manually created the logging message here so that it is reliably the same each time. Each attempt could hit different bulk failures, which would otherwise get logged/audited too many times as we continue to retry the indexing request.
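A rough sketch of why a fixed message helps, assuming the failure handler suppresses a report whose message is identical to the last one it reported. The class and field names are hypothetical; only the message text is taken from the diff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of de-duplicating failure reports by message text.
final class StableMessageSketch {
    static final String BULK_FAILURE_MESSAGE =
        "Bulk index experienced failures. See the logs of the node running the transform for details.";

    private String lastReportedMessage;
    final List<String> reported = new ArrayList<>();

    /** Reports a failure only if its message differs from the last one reported. */
    void onFailure(Exception e) {
        String message = e.getMessage();
        if (!Objects.equals(message, lastReportedMessage)) {
            reported.add(message);
            lastReportedMessage = message;
        }
    }
}
```

If the message embedded the per-attempt failure details instead, each retry could produce a different string and would be reported again.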

benwtrent · 2019-07-15T14:43:22Z

...frame/src/main/java/org/elasticsearch/xpack/dataframe/transforms/DataFrameTransformTask.java

    }
+
+    // Considered a recoverable indexing failure
+    private static class BulkIndexFailure extends Exception {


These exception classes really only need to be seen by the isIrrecoverableFailure method. That method currently does not know about them, but it eventually will.

hendrikmuhs

LGTM

droberts195 · 2019-07-15T16:35:34Z

...frame/src/main/java/org/elasticsearch/xpack/dataframe/transforms/DataFrameTransformTask.java

    }
+
+    // Considered a recoverable indexing failure
+    private static class BulkIndexFailure extends Exception {


When I saw this I thought it was a bit dangerous passing it to the onFailure method of a listener. If this needed to be transported to a different node then it would cause an error because the exception transport wouldn't know how to serialise it.

I think it's OK as the code stands now because it's consumed on the same node where it's thrown.

I also think it's best practice that exception class names end in Exception. Certainly all the classes that extend ElasticsearchException follow this naming pattern.
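As a sketch of the naming suggestion and of how a recoverability check could eventually recognise the type: the class below uses `RuntimeException` as a stand-in for `ElasticsearchException` so that it compiles on its own, and both the `BulkIndexingException` name and the classification rule are assumptions for illustration, not the code in this PR.

```java
// Hypothetical sketch; RuntimeException stands in for ElasticsearchException
// so that the example compiles without the Elasticsearch jars.
final class RecoverableFailureSketch {

    /** Recoverable: the bulk request can be attempted again on the next trigger. */
    static final class BulkIndexingException extends RuntimeException {
        BulkIndexingException(String message) {
            super(message);
        }
    }

    /** Assumed classification, for illustration only. */
    static boolean isIrrecoverableFailure(Exception e) {
        if (e instanceof BulkIndexingException) {
            return false; // retry on the next trigger
        }
        // Placeholder rule for the sketch: treat bad arguments as fatal.
        return e instanceof IllegalArgumentException;
    }
}
```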

droberts195 · 2019-07-15T16:38:14Z

...frame/src/main/java/org/elasticsearch/xpack/dataframe/transforms/DataFrameTransformTask.java

+                        // It increments the indexing failure, and then calls the `onFailure` logic
+                        nextPhase.onFailure(
+                            new BulkIndexFailure("Bulk index experienced failures. " +
+                                "See the logs of the node running the transform for details."));


If I saw this message my immediate question would be "which node is that?" How hard would it be to add this?

Adding the node name will not cause the message to be logged more frequently, because if the transform switches nodes then the local variables that remember the message has already been logged will be different.

If it's really hard or messy to add the node name then just add a TODO for now.

@droberts195 when we write the audit message, the recorded message includes a node_name: <nodeName> field. Additionally, if they see the log, then they are already on the node.

droberts195

LGTM

[ML][Data Frame] treat bulk index failures as an indexing failure

d5c69d2

benwtrent added >bug v8.0.0 :ml/Transform Transform v7.3.0 v7.4.0 labels Jul 15, 2019

benwtrent commented Jul 15, 2019

removing redundant public modifier

1b85146

hendrikmuhs approved these changes Jul 15, 2019

droberts195 reviewed Jul 15, 2019

benwtrent added 2 commits July 15, 2019 12:33

changing to an ElasticsearchException

bb94e68

fixing redundant public modifier

4177b3c

droberts195 approved these changes Jul 15, 2019

jpountz added v7.3.1 and removed v7.3.0 labels Jul 15, 2019

benwtrent merged commit 92709f5 into elastic:master Jul 16, 2019

benwtrent deleted the bug/ml-df-fail-on-index-failures branch July 16, 2019 12:47

This was referenced Jul 16, 2019

[7.x] [ML][Data Frame] treat bulk index failures as an indexing failure (#44351) #44427

Merged

[7.3] [ML][Data Frame] treat bulk index failures as an indexing failure (#44351) #44428

Merged

jpountz added v7.3.0 and removed v7.3.1 labels Jul 26, 2019

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
