[CI] BasicDistributedJobsIT#testFailOverBasics_withDataFeeder times out #41742

@cbuescher

This is on 6.7: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+multijob-darwin-compatibility/142/console

I have not been able to reproduce this locally so far, using:

./gradlew :x-pack:plugin:ml:internalClusterTest \
  -Dtests.seed=4C44E5E21479B1F6 \
  -Dtests.class=org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT \
  -Dtests.method="testFailOverBasics_withDataFeeder" \
  -Dtests.security.manager=true \
  -Dtests.locale=be-BY \
  -Dtests.timezone=Iceland \
  -Dcompiler.java=12 \
  -Druntime.java=8

The test fails with:

java.lang.AssertionError: timed out waiting for green state
	at __randomizedtesting.SeedInfo.seed([4C44E5E21479B1F6:A3F1B919D37EAF7B]:0)
	at org.junit.Assert.fail(Assert.java:88)
	at org.elasticsearch.test.ESIntegTestCase.ensureColor(ESIntegTestCase.java:975)
	at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:931)
	at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:920)
	at org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT.testFailOverBasics_withDataFeeder(BasicDistributedJobsIT.java:124)

Earlier in the logs, the master leaves the test cluster, and connection errors follow:

xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, local
  1> [2019-05-02T11:02:18,100][INFO ][o.e.t.d.MockZenPing      ] [node_t1] pinging using mock zen ping
  1> [2019-05-02T11:02:18,100][INFO ][o.e.t.d.MockZenPing      ] [node_t2] pinging using mock zen ping
  1> [2019-05-02T11:02:18,105][WARN ][o.e.t.OutboundHandler    ] [node_t0] send message failed [channel: MockChannel{profile='default', isOpen=false, localAddress=/127.0.0.1:59660, isServerSocket=false}]
  1> java.net.SocketException: Socket is closed
  1> 	at java.net.Socket.getOutputStream(Socket.java:943) ~[?:1.8.0_212]
  1> 	at org.elasticsearch.transport.MockTcpTransport$MockChannel.sendMessage(MockTcpTransport.java:437) [framework-6.7.2-SNAPSHOT.jar:6.7.2-SNAPSHOT]
  1> 	at org.elasticsearch.transport.OutboundHandler.internalSendMessage(OutboundHandler.java:80) [elasticsearch-6.7.2-SNAPSHOT.jar:6.7.2-SNAPSHOT]
  1> 	at org.elasticsearch.transport.OutboundHandler.sendMessage(OutboundHandler.java:70) [elasticsearch-6.7.2-SNAPSHOT.jar:6.7.2-SNAPSHOT]
  1> 	at org.elasticsearch.transport.TcpTransport.sendErrorResponse(TcpTransport.java:716) [elasticsearch-6.7.2-SNAPSHOT.jar:6.7.2-SNAPSHOT]
  1> 	at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:73) [elasticsearch-6.7.2-SNAPSHOT.jar:6.7.2-SNAPSHOT]
  1> 	at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:60) [elasticsearch-6.7.2-SNAPSHOT.jar:6.7.2-SNAPSHOT]
  1> 	at org.elasticsearch.action.support.replication.TransportReplicationAction$OperationTransportHandler$1.onFailure(TransportReplicationAction.java:296) [elasticsearch-6.7.2-SNAPSHOT.jar:6.7.2-SNAPSHOT]
  1> 	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:945) [elasticsearch-6.7.2-SNAPSHOT.jar:6.7.2-SNAPSHOT]
  1> 	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$1.handleException(TransportReplicationAction.java:903) [elasticsearch-6.7.2-SNAPSHOT.jar:6.7.2-SNAPSHOT]
  1> 	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1114) [elasticsearch-6.7.2-SNAPSHOT.jar:6.7.2-SNAPSHOT]
  1> 	at org.elasticsearch.transport.TransportService$6.doRun(TransportService.java:665) [elasticsearch-6.7.2-SNAPSHOT.jar:6.7.2-SNAPSHOT]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) [elasticsearch-6.7.2-SNAPSHOT.jar:6.7.2-SNAPSHOT]
  1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.7.2-SNAPSHOT.jar:6.7.2-SNAPSHOT]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
  1> [2019-05-02T11:02:18,119][INFO ][o.e.n.Node               ] [testFailOverBasics_withDataFeeder] stopped
  1> [2019-05-02T11:02:18,119][INFO ][o.e.n.Node               ] [testFailOverBasics_withDataFeeder] closing ...
  1> [2019-05-02T11:02:18,122][INFO ][o.e.n.Node               ] [testFailOverBasics_withDataFeeder] closed
  1> [2019-05-02T11:02:21,253][INFO ][o.e.c.s.MasterService    ] [node_t3] zen-disco-elected-as-master ([1] nodes joined)[{node_t2}{ukuNaOABRY-ROr1VHZoecQ}{IyBV5B2aTt-d--Vz5UISmA}{127.0.0.1}{127.0.0.1:59656}{ml.machine_memory=34359738368, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason: new_master {node_t3}{cnjj9MlKQOmUbruhDHORpA}{1XuyG_aVQhSggAsbR_UR0Q}{127.0.0.1}{127.0.0.1:59657}{ml.machine_memory=34359738368, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
  1> [2019-05-02T11:02:21,254][INFO ][o.e.c.s.ClusterApplierService] [node_t2] detected_master {node_t3}{cnjj9MlKQOmUbruhDHORpA}{1XuyG_aVQhSggAsbR_UR0Q}{127.0.0.1}{127.0.0.1:59657}{ml.machine_memory=34359738368, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, reason: apply cluster state (from master [master {node_t3}{cnjj9MlKQOmUbruhDHORpA}{1XuyG_aVQhSggAsbR_UR0Q}{127.0.0.1}{127.0.0.1:59657}{ml.machine_memory=34359738368, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [31]])
  1> [2019-05-02T11:02:21,254][INFO ][o.e.c.s.ClusterApplierService] [node_t1] detected_master {node_t3}{cnjj9MlKQOmUbruhDHORpA}{1XuyG_aVQhSggAsbR_UR0Q}{127.0.0.1}{127.0.0.1:59657}{ml.machine_memory=34359738368, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, reason: apply cluster state (from master [master {node_t3}{cnjj9MlKQOmUbruhDHORpA}{1XuyG_aVQhSggAsbR_UR0Q}{127.0.0.1}{127.0.0.1:59657}{ml.machine_memory=34359738368, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [31]])
  1> [2019-05-02T11:02:21,255][WARN ][o.e.c.NodeConnectionsService] [node_t2] failed to connect to node {node_t0}{bvTYSYCbRuyN07MlTgZOzA}{V-fLnWAxRUCcHPO3Ih5Tkw}{127.0.0.1}{127.0.0.1:59651}{ml.machine_memory=34359738368, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} (tried [1] times)
  1> org.elasticsearch.transport.ConnectTransportException: [node_t0][127.0.0.1:59651] connect_exception
  1> 	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1309) ~[elasticsearch-6.7.2-SNAPSHOT.jar:6.7.2-SNAPSHOT]
  1> 	at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:100) ~[elasticsearch-6.7.2-SNAPSHOT.jar:6.7.2-SNAPSHOT]
  1> 	at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-6.7.2-SNAPSHOT.jar:6.7.2-SNAPSHOT]
  1> 	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_212]
  1> 	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ~[?:1.8.0_212]
  1> 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_212]
  1> 	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_212]
  1> 	at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-6.7.2-SNAPSHOT.jar:6.7.2-SNAPSHOT]
  1> 	at org.elasticsearch.transport.MockTcpTransport.lambda$initiateChannel$0(MockTcpTransport.java:195) ~[framework-6.7.2-SNAPSHOT.jar:6.7.2-SNAPSHOT]
  1> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_212]
  1> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_212]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
  1> Caused by: java.net.ConnectException: Connection refused (Connection refused)
  1> 	at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_212]
  1> 	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_212]
  1> 	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_212]
  1> 	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_212]
  1> 	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_212]
  1> 	at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_212]
  1> 	at org.elasticsearch.mocksocket.MockSocket.access$101(MockSocket.java:32) ~[mocksocket-1.2.jar:?]
  1> 	at org.elasticsearch.mocksocket.MockSocket.lambda$connect$0(MockSocket.java:66) ~[mocksocket-1.2.jar:?]
  1> 	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_212]
  1> 	at org.elasticsearch.mocksocket.MockSocket.connect(MockSocket.java:65) ~[mocksocket-1.2.jar:?]
  1> 	at org.elasticsearch.mocksocket.MockSocket.connect(MockSocket.java:59) ~[mocksocket-1.2.jar:?]
  1> 	at org.elasticsearch.transport.MockTcpTransport.lambda$initiateChannel$0(MockTcpTransport.java:190) ~[framework-6.7.2-SNAPSHOT.jar:6.7.2-SNAPSHOT]
  1> 	... 5 more
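
To line up the master change with the connection errors, a small script along these lines could pull the timestamped entries out of the console log. This is only a sketch and not part of the Elasticsearch test infrastructure: the regex assumes the `[timestamp][LEVEL][logger] [node] message` layout seen in the excerpt above, `summarize` is a hypothetical helper name, and the sample lines are abridged copies of lines from that excerpt.

```python
import re

# Matches log header lines such as:
#   1> [2019-05-02T11:02:18,105][WARN ][o.e.t.OutboundHandler    ] [node_t0] send message failed ...
LOG_LINE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})\]"
    r"\[(?P<level>[A-Z]+)\s*\]"
    r"\[(?P<logger>[^\]]+?)\s*\]\s+"
    r"\[(?P<node>[^\]]+)\]\s+"
    r"(?P<msg>.*)"
)


def summarize(lines, max_msg=60):
    """Return (timestamp, level, node, message prefix) for each log header line."""
    events = []
    for line in lines:
        m = LOG_LINE.search(line)
        if m:  # stack-trace frames and truncated lines do not match and are skipped
            events.append((m["ts"], m["level"], m["node"], m["msg"][:max_msg]))
    return events


# Abridged copies of lines from the log excerpt above.
sample = [
    "  1> [2019-05-02T11:02:18,105][WARN ][o.e.t.OutboundHandler    ] [node_t0] send message failed",
    "  1> java.net.SocketException: Socket is closed",
    "  1> [2019-05-02T11:02:21,253][INFO ][o.e.c.s.MasterService    ] [node_t3] zen-disco-elected-as-master ([1] nodes joined)",
    "  1> [2019-05-02T11:02:21,255][WARN ][o.e.c.NodeConnectionsService] [node_t2] failed to connect to node {node_t0}",
]

for event in summarize(sample):
    print(event)
```

Filtering the result on `WARN` entries, or on a single node name, should make it easier to see which failures come before and which come after the `zen-disco-elected-as-master` entry.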

Labels: :ml (Machine learning), >test-failure (Triaged test failures from CI)
