This repository was archived by the owner on Jan 9, 2020. It is now read-only.
forked from apache/spark
heartbeat timeout causing all executors exit #405
Closed
Description
I ran a Spark application with 100 executors (each with 40 GB of memory and 4 cores). After running for about 4 hours, all executors exited with code 56 and reported the following logs:
...
Exit as unable to send heartbeats to driver more than 60 times
...
The driver then hangs for several hours with nothing to do.
The relevant driver-related logs are listed below:
2017-07-10 21:11:40 WARN Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(31,[Lscala.Tuple2;@641834e8,BlockManagerId(31, x.x.x.x, 57787, None))]
2017-07-10 21:11:49 WARN Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(52,[Lscala.Tuple2;@2de28748,BlockManagerId(52, x.x.x.x, 20423, None))]
2017-07-10 21:11:42 WARN Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(20,[Lscala.Tuple2;@431947f,BlockManagerId(20, x.x.x.x, 28993, None))]
2017-07-10 21:11:44 WARN Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(99,[Lscala.Tuple2;@6f8e1be0,BlockManagerId(99, x.x.x.x, 47398, None))]
2017-07-10 21:11:46 WARN Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(87,[Lscala.Tuple2;@6ff2c572,BlockManagerId(87, x.x.x.x, 47063, None))]
Additional information:
- Task resources (1 driver + 100 executors): driver 40 GB + 4 cores; each executor 40 GB + 4 cores.
- There are no user-specified Spark configurations for this task; everything uses the defaults (the heartbeat-related settings involved are sketched below).
- It does not look like a network problem, as other tasks in the same cluster run normally.
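For context, a minimal sketch of how the heartbeat-related settings that are currently left at their defaults could be raised explicitly (assuming Spark 2.x and Scala; the application name and the values are illustrative assumptions, not a verified fix):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Relax the timeouts involved in the "unable to send heartbeats to driver"
// failure path. The values below are illustrative, not confirmed remedies.
val conf = new SparkConf()
  .setAppName("heartbeat-timeout-demo")            // hypothetical app name
  .set("spark.executor.instances", "100")          // matches the setup described above
  .set("spark.executor.memory", "40g")
  .set("spark.executor.cores", "4")
  .set("spark.executor.heartbeatInterval", "30s")  // default is 10s
  .set("spark.network.timeout", "600s")            // should stay larger than the heartbeat interval

val spark = SparkSession.builder().config(conf).getOrCreate()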