This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Spark executors may make the node run out of root disk space #260

Description

@kimoonkim

@foxish

I was running a large-scale Spark TeraSort job against HDFS. The input size was 250 GB and the cluster had 9 nodes, each with 3 x 2 TB data disks in addition to a 20 GB root disk.

The total HDFS capacity was much larger than the TeraSort input. However, notice that the root disk is relatively small.

The executors soon finished reading the input data and started the shuffle. In the middle of the shuffle, many executor and datanode pods were suddenly killed. Here's the kubectl get pods output:

NAMESPACE     NAME                                                    READY     STATUS    RESTARTS   AGE
default       hdfs-datanode-15t9j                                     0/1       Evicted   0          1h
default       hdfs-datanode-2zz6s                                     1/1       Running   0          1h
default       hdfs-datanode-32p6p                                     1/1       Running   0          1h
default       hdfs-datanode-4f19w                                     1/1       Running   0          1h
default       hdfs-datanode-6hwc4                                     1/1       Running   0          1h
default       hdfs-datanode-c12dr                                     1/1       Running   0          1h
default       hdfs-datanode-cdh50                                     0/1       Evicted   0          1h
default       hdfs-datanode-g24wz                                     0/1       Evicted   0          1h
default       hdfs-datanode-gw23n                                     1/1       Running   0          1h
default       hdfs-namenode-0                                         1/1       Running   0          1h
default       spark-terasort-1493767302075                            1/1       Running   0          21m
default       spark-terasort-1493767302075-exec-1                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-2                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-3                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-4                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-5                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-6                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-7                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-8                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-9                     0/1       Evicted   0          21m

Here's the output of kubectl describe pod EXECUTOR-POD:

Events:
  FirstSeen	LastSeen	Count	From					SubObjectPath			Type		Reason		Message
  ---------	--------	-----	----					-------------			--------	------		-------
  19m		19m		1	{default-scheduler }							Normal		Scheduled	Successfully assigned spark-terasort-1493767302075-exec-1 to ip-172-20-39-112.ec2.internal
  19m		19m		1	{kubelet ip-172-20-39-112.ec2.internal}	spec.containers{executor}	Normal		Pulled		Container image "gcr.io/smooth-copilot-163600/spark-executor:baseline-20170420-0a13206df6" already present on machine
  19m		19m		1	{kubelet ip-172-20-39-112.ec2.internal}	spec.containers{executor}	Normal		Created		Created container with docker id 005ba71fd89f; Security:[seccomp=unconfined]
  19m		19m		1	{kubelet ip-172-20-39-112.ec2.internal}	spec.containers{executor}	Normal		Started		Started container with docker id 005ba71fd89f
  12m		12m		1	{kubelet ip-172-20-39-112.ec2.internal}					Warning		Evicted		The node was low on resource: nodefs.
  12m		12m		1	{kubelet ip-172-20-39-112.ec2.internal}	spec.containers{executor}	Normal		Killing		Killing container with docker id 005ba71fd89f: Need to kill pod.

I believe this is because executors get their working space from the root volume, and the root volume is too small for the job's data (250 GB / 9 ≈ 27 GB per node > 20 GB). I wonder if we thought about this before. Maybe an external shuffle service is required for us to process large amounts of data.
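
For what it's worth, this is roughly how one could confirm where the nodefs space went on an affected node. The node name is taken from the events above; the /var/lib/docker and /var/lib/kubelet paths are assumptions that depend on the Docker storage driver and the kubelet root dir:

# Check whether the node reports DiskPressure in its conditions.
kubectl describe node ip-172-20-39-112.ec2.internal | grep -A 6 Conditions

# On the node itself: see how full the root disk (nodefs) is, and which
# directories are consuming it. Paths assume default Docker/kubelet locations.
df -h /
sudo du -sh /var/lib/docker
sudo du -sh /var/lib/kubelet/pods/*/volumes/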

This can also lead to slow shuffle performance, since the executors don't utilize the bandwidth offered by the multiple data disks.
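
If the data disks could be mounted into the executor containers, pointing Spark's scratch space at them would address both problems. A rough sketch (the mount points below are hypothetical, and it assumes the executor pods can actually see those host directories, which I don't think is supported today):

# Hypothetical: spread block-manager/shuffle scratch space over the three
# 2 TB data disks instead of the 20 GB root disk. spark.local.dir takes a
# comma-separated list of directories.
bin/spark-submit \
  --conf spark.local.dir=/mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark \
  ...  # rest of the TeraSort submission unchanged

Alternatively, an external shuffle service backed by those disks would take the shuffle files off the root volume entirely.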
