This repository was archived by the owner on Jan 9, 2020. It is now read-only.
forked from apache/spark
heartbeat timeout causing all executors exit #405
Closed
Description
I ran a Spark application with 100 executors (each with 40 GB of memory and 4 cores). After running for about 4 hours, all executors exited with code 56 and reported the following logs:
...
Exit as unable to send heartbeats to driver more than 60 times
...
The driver then hangs for several hours with nothing to do.
The relevant driver-related logs are listed below:
2017-07-10 21:11:40 WARN Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(31,[Lscala.Tuple2;@641834e8,BlockManagerId(31, x.x.x.x, 57787, None))]
2017-07-10 21:11:49 WARN Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(52,[Lscala.Tuple2;@2de28748,BlockManagerId(52, x.x.x.x, 20423, None))]
2017-07-10 21:11:42 WARN Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(20,[Lscala.Tuple2;@431947f,BlockManagerId(20, x.x.x.x, 28993, None))]
2017-07-10 21:11:44 WARN Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(99,[Lscala.Tuple2;@6f8e1be0,BlockManagerId(99, x.x.x.x, 47398, None))]
2017-07-10 21:11:46 WARN Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(87,[Lscala.Tuple2;@6ff2c572,BlockManagerId(87, x.x.x.x, 47063, None))]
Additional information:
- Task resources (1 driver + 100 executors): driver 40 GB + 4 cores; each executor 40 GB + 4 cores.
- There are no user-specified Spark configurations for this task; everything uses the defaults (the heartbeat-related settings involved are sketched below).
- It does not look like a network problem, as other tasks in the same cluster run normally.
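For context, a minimal sketch of how the heartbeat-related settings that are currently left at their defaults could be raised explicitly (assuming Spark 2.x and Scala; the application name and the values are illustrative assumptions, not a verified fix):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Relax the timeouts involved in the "unable to send heartbeats to driver"
// failure path. The values below are illustrative, not confirmed remedies.
val conf = new SparkConf()
  .setAppName("heartbeat-timeout-demo")            // hypothetical app name
  .set("spark.executor.instances", "100")          // matches the setup described above
  .set("spark.executor.memory", "40g")
  .set("spark.executor.cores", "4")
  .set("spark.executor.heartbeatInterval", "30s")  // default is 10s
  .set("spark.network.timeout", "600s")            // should stay larger than the heartbeat interval

val spark = SparkSession.builder().config(conf).getOrCreate()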