This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Spark executors may make the node run out of root disk space #260

Description

@kimoonkim

@foxish

I was running a large-scale Spark TeraSort job against HDFS. The input size was 250 GB and the cluster had 9 nodes, each with 3 x 2 TB data disks in addition to a 20 GB root disk.

The total HDFS capacity was much larger than the TeraSort input. However, notice that the root disk is relatively small.

The executors soon finished reading the input data and started the shuffle. In the middle of the shuffle, many executor and datanode pods were suddenly killed. Here's the kubectl get pods output:

NAMESPACE     NAME                                                    READY     STATUS    RESTARTS   AGE
default       hdfs-datanode-15t9j                                     0/1       Evicted   0          1h
default       hdfs-datanode-2zz6s                                     1/1       Running   0          1h
default       hdfs-datanode-32p6p                                     1/1       Running   0          1h
default       hdfs-datanode-4f19w                                     1/1       Running   0          1h
default       hdfs-datanode-6hwc4                                     1/1       Running   0          1h
default       hdfs-datanode-c12dr                                     1/1       Running   0          1h
default       hdfs-datanode-cdh50                                     0/1       Evicted   0          1h
default       hdfs-datanode-g24wz                                     0/1       Evicted   0          1h
default       hdfs-datanode-gw23n                                     1/1       Running   0          1h
default       hdfs-namenode-0                                         1/1       Running   0          1h
default       spark-terasort-1493767302075                            1/1       Running   0          21m
default       spark-terasort-1493767302075-exec-1                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-2                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-3                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-4                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-5                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-6                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-7                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-8                     0/1       Evicted   0          21m
default       spark-terasort-1493767302075-exec-9                     0/1       Evicted   0          21m

Here's the output of kubectl describe pod EXECUTOR-POD:

Events:
  FirstSeen	LastSeen	Count	From					SubObjectPath			Type		Reason		Message
  ---------	--------	-----	----					-------------			--------	------		-------
  19m		19m		1	{default-scheduler }							Normal		Scheduled	Successfully assigned spark-terasort-1493767302075-exec-1 to ip-172-20-39-112.ec2.internal
  19m		19m		1	{kubelet ip-172-20-39-112.ec2.internal}	spec.containers{executor}	Normal		Pulled		Container image "gcr.io/smooth-copilot-163600/spark-executor:baseline-20170420-0a13206df6" already present on machine
  19m		19m		1	{kubelet ip-172-20-39-112.ec2.internal}	spec.containers{executor}	Normal		Created		Created container with docker id 005ba71fd89f; Security:[seccomp=unconfined]
  19m		19m		1	{kubelet ip-172-20-39-112.ec2.internal}	spec.containers{executor}	Normal		Started		Started container with docker id 005ba71fd89f
  12m		12m		1	{kubelet ip-172-20-39-112.ec2.internal}					Warning		Evicted		The node was low on resource: nodefs.
  12m		12m		1	{kubelet ip-172-20-39-112.ec2.internal}	spec.containers{executor}	Normal		Killing		Killing container with docker id 005ba71fd89f: Need to kill pod.

I believe this is because executors get their working space from the root volume, and the root volume is too small for the job's data (250 GB / 9 ≈ 27 GB per node > 20 GB). I wonder if we thought about this before. Maybe an external shuffle service is required for us to process large amounts of data.
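
For what it's worth, this is roughly how one could confirm where the nodefs space went on an affected node. The node name is taken from the events above; the /var/lib/docker and /var/lib/kubelet paths are assumptions that depend on the Docker storage driver and the kubelet root dir:

# Check whether the node reports DiskPressure in its conditions.
kubectl describe node ip-172-20-39-112.ec2.internal | grep -A 6 Conditions

# On the node itself: see how full the root disk (nodefs) is, and which
# directories are consuming it. Paths assume default Docker/kubelet locations.
df -h /
sudo du -sh /var/lib/docker
sudo du -sh /var/lib/kubelet/pods/*/volumes/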

This can also lead to slow shuffle performance, since the executors don't utilize the bandwidth offered by the multiple data disks.
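
If the data disks could be mounted into the executor containers, pointing Spark's scratch space at them would address both problems. A rough sketch (the mount points below are hypothetical, and it assumes the executor pods can actually see those host directories, which I don't think is supported today):

# Hypothetical: spread block-manager/shuffle scratch space over the three
# 2 TB data disks instead of the 20 GB root disk. spark.local.dir takes a
# comma-separated list of directories.
bin/spark-submit \
  --conf spark.local.dir=/mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark \
  ...  # rest of the TeraSort submission unchanged

Alternatively, an external shuffle service backed by those disks would take the shuffle files off the root volume entirely.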
