Skip to content

[CI] unexpected failure while replicating translog entry #38898

@droberts195

Description

@droberts195

A test failure occurred in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+intake/99/console, however, the root cause was that one of the nodes in the cluster suffered a fatal exception:

[2019-02-14T11:46:31,144][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-1] fatal error in thread [elasticsearch[node-1][generic][T#2]], exiting
java.lang.AssertionError: unexpected failure while replicating translog entry: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
    at org.elasticsearch.indices.recovery.RecoveryTarget.lambda$indexTranslogOperations$2(RecoveryTarget.java:362) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:191) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT] 
    at org.elasticsearch.indices.recovery.RecoveryTarget.indexTranslogOperations(RecoveryTarget.java:333) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$TranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:521) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$TranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:480) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1076) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]

The test that was running when this happened was:

./gradlew :qa:smoke-test-multinode:integTestRunner \
  -Dtests.seed=666DFE2D5892C30E \
  -Dtests.class=org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT \
  -Dtests.method="test {yaml=smoke_test_multinode/10_basic/cluster health basic test, wait for both nodes to join}" \
  -Dtests.security.manager=true \
  -Dtests.locale=mk \
  -Dtests.timezone=Europe/Malta \
  -Dcompiler.java=11 \
  -Druntime.java=8

Since an almost identical test immediately before succeeded I doubt there is anything wrong with that test. (The "stash dump on failure" in the Jenkins log is very confusing too as it contains the result of the previous successful test.)

cluster_logs.zip contains the logs from the nodes in the test cluster that died with the fatal error.

Metadata

Metadata

Assignees

Labels

:Distributed Indexing/EngineAnything around managing Lucene and the Translog in an open shard.>test-failureTriaged test failures from CI

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions