This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Heartbeat timeout causing all executors to exit #405

@duyanghao

I ran a Spark application with 100 executors (each with 40 GB of memory and 4 cores). After running for 4 hours, all executors exited with code 56 and reported the following log:

...
Exit as unable to send heartbeats to driver more than 60 times
...

The driver then hangs for several hours, doing nothing.

The relevant driver logs are listed below:

2017-07-10 21:11:40 WARN  Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(31,[Lscala.Tuple2;@641834e8,BlockManagerId(31, x.x.x.x, 57787, None))]
2017-07-10 21:11:49 WARN  Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(52,[Lscala.Tuple2;@2de28748,BlockManagerId(52, x.x.x.x, 20423, None))]
2017-07-10 21:11:42 WARN  Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(20,[Lscala.Tuple2;@431947f,BlockManagerId(20, x.x.x.x, 28993, None))]
2017-07-10 21:11:44 WARN  Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(99,[Lscala.Tuple2;@6f8e1be0,BlockManagerId(99, x.x.x.x, 47398, None))]
2017-07-10 21:11:46 WARN  Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(87,[Lscala.Tuple2;@6ff2c572,BlockManagerId(87, x.x.x.x, 47063, None))]

Additional information:

  1. Task resources (1 driver + 100 executors)
    driver: 40 GB + 4 cores
    single executor: 40 GB + 4 cores

  2. No Spark configuration was specified by the user for this application; all settings use their defaults.

  3. It does not look like a network problem, since other tasks in the same cluster run normally.
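
For reference, here is a minimal sketch of how the "more than 60 times" exit could come about, written as a simplified Python model rather than Spark's actual code. It assumes, to my understanding of Spark 2.x defaults (none of these values are stated in the logs above), that an executor sends a heartbeat every 10 seconds (`spark.executor.heartbeatInterval`), that it counts consecutive failures against a limit of 60 (`spark.executor.heartbeat.maxFailures`), and that exit code 56 is the heartbeat-failure exit code. The function names are hypothetical.

```python
# Simplified model of an executor's heartbeat failure counter.
# Assumed defaults (not taken from the report): 10 s interval, limit of 60 failures.
HEARTBEAT_INTERVAL_S = 10
HEARTBEAT_MAX_FAILURES = 60
HEARTBEAT_FAILURE_EXIT_CODE = 56  # exit code observed in the report


def simulate_heartbeats(outcomes, max_failures=HEARTBEAT_MAX_FAILURES):
    """Return (exit_code, attempts) for a sequence of heartbeat outcomes.

    outcomes: iterable of booleans, True if the driver acknowledged the heartbeat.
    exit_code is None if the executor survives the whole sequence.
    """
    consecutive_failures = 0
    attempts = 0
    for ok in outcomes:
        attempts += 1
        if ok:
            consecutive_failures = 0  # any success resets the counter
            continue
        consecutive_failures += 1
        if consecutive_failures >= max_failures:
            return HEARTBEAT_FAILURE_EXIT_CODE, attempts
    return None, attempts


def min_seconds_until_exit(interval_s=HEARTBEAT_INTERVAL_S,
                           max_failures=HEARTBEAT_MAX_FAILURES):
    """Lower bound on how long heartbeats must fail back to back before exit."""
    return interval_s * max_failures


if __name__ == "__main__":
    print(simulate_heartbeats([False] * 60))
    print(min_seconds_until_exit())
```

Under these assumptions, a single acknowledged heartbeat resets the counter, so an exit with code 56 would require at least about 600 seconds of uninterrupted heartbeat failures on each executor.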
