
org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT.testFailOverBasics failed #40546

@benwtrent


Failure not reproducible locally.
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+internalClusterTest/3881/consoleFull
Reproduce Line:

./gradlew :x-pack:plugin:ml:internalClusterTest -Dtests.seed=D9A30C4EA4AC438B -Dtests.class=org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT -Dtests.method="testFailOverBasics" -Dtests.security.manager=true -Dtests.locale=en-AU -Dtests.timezone=Etc/GMT-14 -Dcompiler.java=12 -Druntime.java=8

Digging into the failure, it appears that the test timed out waiting for ensureGreen:

  1> [2019-03-28T07:05:01,572][INFO ][o.e.x.m.i.BasicDistributedJobsIT] [testFailOverBasics] ensureGreen timed out, cluster state:

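The sequence of events appears to be:
• We killed the master (node0), and it abdicated to node3.
• We then waited for green but timed out.
• The timeout happened because .ml-state never finished building its replica: the primary was lost when a node left while the replica was still initializing (see the shard routing below), so the .ml-state index kept the cluster from going green. It seems we killed a node too quickly?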

1> ----shard_id [.ml-state][0]
 1> --------[.ml-state][0], node[null], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2019-03-27T17:04:31.506Z], delayed=false, details[node_left [7teV8OudStufgnfSDcALpw]], allocation_status[no_valid_shard_copy]]
 1> --------[.ml-state][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=PRIMARY_FAILED], at[2019-03-27T17:04:31.506Z], delayed=false, details[primary failed while replica initializing], allocation_status[no_attempt]]
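
If the race described above is the real cause, one possible mitigation is to wait until the .ml-state replica has started before stopping a node. The sketch below only illustrates that wait-then-act pattern in plain Java. It does not use the Elasticsearch test framework, and `ReplicaWait`, `awaitTrue`, and the `replicaStarted` supplier are hypothetical names standing in for a real cluster-health check.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of Elasticsearch: polls a condition with a timeout.
final class ReplicaWait {

    /**
     * Polls the condition until it returns true or timeoutMillis elapses.
     * Returns true if the condition was met in time, false on timeout or interrupt.
     */
    public static boolean awaitTrue(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt status and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for "the .ml-state replica is started": turns true on the third poll.
        int[] polls = {0};
        BooleanSupplier replicaStarted = () -> ++polls[0] >= 3;

        if (awaitTrue(replicaStarted, 1000, 10)) {
            System.out.println("replica started; safe to stop the node");
        } else {
            System.out.println("timed out waiting for the replica");
        }
    }
}
```

In the actual test, the condition would need to be backed by the cluster state or health of the .ml-state index rather than a counter. How best to do that in BasicDistributedJobsIT is left open here.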

Labels: :ml (Machine learning), >test-failure (Triaged test failures from CI)
