Spark executors may make the node run out of root disk space #260
Description
I was running a large-scale Spark TeraSort job against HDFS. The input size was 250 GB and the cluster had 9 nodes, each with 3 x 2 TB disks in addition to the 20 GB root disk.
The total HDFS capacity was much larger than the TeraSort input. Note, however, that the root disk is relatively small.
The executors soon finished reading the input data and started the shuffle. In the middle of the shuffle, many executor and datanode pods were suddenly killed. Here's the kubectl get pods output:
NAMESPACE NAME READY STATUS RESTARTS AGE
default hdfs-datanode-15t9j 0/1 Evicted 0 1h
default hdfs-datanode-2zz6s 1/1 Running 0 1h
default hdfs-datanode-32p6p 1/1 Running 0 1h
default hdfs-datanode-4f19w 1/1 Running 0 1h
default hdfs-datanode-6hwc4 1/1 Running 0 1h
default hdfs-datanode-c12dr 1/1 Running 0 1h
default hdfs-datanode-cdh50 0/1 Evicted 0 1h
default hdfs-datanode-g24wz 0/1 Evicted 0 1h
default hdfs-datanode-gw23n 1/1 Running 0 1h
default hdfs-namenode-0 1/1 Running 0 1h
default spark-terasort-1493767302075 1/1 Running 0 21m
default spark-terasort-1493767302075-exec-1 0/1 Evicted 0 21m
default spark-terasort-1493767302075-exec-2 0/1 Evicted 0 21m
default spark-terasort-1493767302075-exec-3 0/1 Evicted 0 21m
default spark-terasort-1493767302075-exec-4 0/1 Evicted 0 21m
default spark-terasort-1493767302075-exec-5 0/1 Evicted 0 21m
default spark-terasort-1493767302075-exec-6 0/1 Evicted 0 21m
default spark-terasort-1493767302075-exec-7 0/1 Evicted 0 21m
default spark-terasort-1493767302075-exec-8 0/1 Evicted 0 21m
default spark-terasort-1493767302075-exec-9 0/1 Evicted 0 21m
Here's the output of kubectl describe EXECUTOR-POD:
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
19m 19m 1 {default-scheduler } Normal Scheduled Successfully assigned spark-terasort-1493767302075-exec-1 to ip-172-20-39-112.ec2.internal
19m 19m 1 {kubelet ip-172-20-39-112.ec2.internal} spec.containers{executor} Normal Pulled Container image "gcr.io/smooth-copilot-163600/spark-executor:baseline-20170420-0a13206df6" already present on machine
19m 19m 1 {kubelet ip-172-20-39-112.ec2.internal} spec.containers{executor} Normal Created Created container with docker id 005ba71fd89f; Security:[seccomp=unconfined]
19m 19m 1 {kubelet ip-172-20-39-112.ec2.internal} spec.containers{executor} Normal Started Started container with docker id 005ba71fd89f
12m 12m 1 {kubelet ip-172-20-39-112.ec2.internal} Warning Evicted The node was low on resource: nodefs.
12m 12m 1 {kubelet ip-172-20-39-112.ec2.internal} spec.containers{executor} Normal Killing Killing container with docker id 005ba71fd89f: Need to kill pod.
I believe this is because executors get their working space from the root volume, and the root volume is too small for the job's data (250 GB / 9 ≈ 28 GB per node > 20 GB root disk). I wonder if we have thought about this before. Maybe an external shuffle service is required for us to process large amounts of data.
This can also lead to slow shuffle performance, since executors don't utilize the bandwidth offered by the multiple data disks.
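For illustration, here is a minimal sketch of what an executor pod could look like if its scratch space were backed by the node's data disks instead of the root volume. The host paths /mnt/disk1..3 are assumptions about where the 2 TB disks are mounted, and the submission client would need some way to inject these mounts into executor pods, which may not be supported today; the point is only that Spark's local directories (SPARK_LOCAL_DIRS / spark.local.dir) would need to resolve to volumes on the large disks:

```yaml
# Hypothetical executor pod fragment (not what the submission client
# generates today). The node's data disks, assumed to be mounted at
# /mnt/disk1..3, are exposed to the container via hostPath volumes,
# and SPARK_LOCAL_DIRS points Spark's shuffle/spill scratch space at
# them instead of the 20 GB root disk.
apiVersion: v1
kind: Pod
metadata:
  name: spark-terasort-exec-example
spec:
  containers:
    - name: executor
      image: gcr.io/smooth-copilot-163600/spark-executor:baseline-20170420-0a13206df6
      env:
        # Spark picks up SPARK_LOCAL_DIRS as the list of local scratch dirs.
        - name: SPARK_LOCAL_DIRS
          value: /mnt/disk1,/mnt/disk2,/mnt/disk3
      volumeMounts:
        - name: disk1
          mountPath: /mnt/disk1
        - name: disk2
          mountPath: /mnt/disk2
        - name: disk3
          mountPath: /mnt/disk3
  volumes:
    - name: disk1
      hostPath:
        path: /mnt/disk1
    - name: disk2
      hostPath:
        path: /mnt/disk2
    - name: disk3
      hostPath:
        path: /mnt/disk3
```

Note that an emptyDir volume would not help much here, since emptyDir is typically backed by the node's root filesystem as well.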