
Conversation

@jongyoul
Member


  • Change executorId from slaveId to taskId.getValue (see the sketch after the directory-structure list below)
  • Directory structure
    • Before
      • /path/to/local/prefix/slaves/{slaveId}/frameworks/{frameworksId}/executors/{slaveId}/runs/{containerId}
    • After
      • /path/to/local/prefix/slaves/{slaveId}/frameworks/{frameworksId}/executors/{taskId}/runs/{containerId}
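For concreteness, here is a minimal Scala sketch of the kind of change the bullet above describes, using the Mesos protobuf builders; the helper name and surrounding wiring are hypothetical, not the actual Spark diff.

```scala
import org.apache.mesos.Protos.{ExecutorID, TaskID}

// Hypothetical helper illustrating the proposal: derive the Mesos ExecutorID
// from the task instead of the slave, so each task gets its own sandbox under
// .../executors/{taskId}/runs/{containerId}.
def proposedExecutorId(slaveIdValue: String, taskId: TaskID): ExecutorID = {
  // Before: ExecutorID.newBuilder().setValue(slaveIdValue).build()
  ExecutorID.newBuilder().setValue(taskId.getValue).build()
}
```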

@SparkQA

SparkQA commented Jan 12, 2015

Test build #25392 has started for PR 3994 at commit 29c7d60.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 12, 2015

Test build #25392 has finished for PR 3994 at commit 29c7d60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25392/

@jongyoul
Member Author

/cc @JoshRosen @andrewor14 Review this, please.

@jongyoul
Member Author

@mateiz Could you please also review this PR? When two different tasks run on the same executor at the same time, they run in the same container, so both tasks' logs end up interleaved in the same file. The most serious effect is that one launcher unarchives the executor .tgz while another launcher is still unarchiving the same file at the same location on the same node. This can cause one launcher to be lost.

@mateiz
Contributor

mateiz commented Jan 17, 2015

This looks like it's going to create a separate executor for each task, which is the opposite of what we want for fine-grained mode. The goal of fine-grained mode is to share each executor between multiple tasks.

What was the problem here? Do you have two Mesos slaves on the same machine and they clash?

@jongyoul
Member Author

@mateiz I have one slave per node, and the problem occurs when two tasks are launched at the same time. The two tasks run in the same container, so they write to the same log file and crash at launch time.

@jongyoul
Member Author

@mateiz In particular, I've found that the Mesos launcher copies the Spark archive (from HDFS in my setup) and extracts it into the Mesos working directory. When tasks are launched at the same time, one task tries to launch while the other is still extracting the file. In that case the former task sometimes fails to launch, so the launcher returns TASK_LOST without any logs in my test. I don't think this is the intended behavior.

@mateiz
Contributor

mateiz commented Jan 17, 2015

This sounds to me like a Mesos bug, since Mesos is what's un-tarring the executor. I'd suggest asking them. We can't make this change in Spark; it would break fine-grained mode.

@mateiz
Contributor

mateiz commented Jan 17, 2015

It might also be that Mesos's behavior changed. It shouldn't send TASK_LOST while the executor is being unzipped; it should queue up the task and hand it to the executor once it's ready. @tnachen, what do you think about this?

@tnachen
Contributor

tnachen commented Jan 17, 2015

@jongyoul when you're talking about the Spark file, you mean the Spark executor URI tarball itself, right?
The TASK_LOST behavior certainly shouldn't happen normally. The only thing I can think of is that the executor takes longer to launch because downloading the file takes longer, and the executor registration timeout caused the task to fail/go lost.
Can you reproduce this behavior with verbose logging enabled (environment variable GLOG_v=1) on the slave, and share the logs from around the time you get the TASK_LOST?
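Assuming it helps, a hedged sketch of enabling the verbose logging asked for above; how the slave process is actually launched (init script, service manager, wrapper) varies by installation, so treat this command line as illustrative only.

```sh
# Assumption: the slave is started directly from a shell; on managed setups,
# put GLOG_v=1 in the slave's environment file instead.
GLOG_v=1 mesos-slave --master=zk://host:2181/mesos --work_dir=/data03/mesos
```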

@jongyoul
Member Author

@tnachen Yes, I use a Spark executor URI like hdfs:///app/spark/test-spark.tgz. I'll upload a test log as soon as possible. One more question: GLOG_v=1 is a Mesos slave setting, isn't it? And where can I find the timeout setting you mentioned?

@jongyoul
Member Author

@mateiz I don't think I fully understand how fine-grained mode is intended to behave. What would help me understand it better? I don't see how using multiple executors would break Spark's intended behaviour.

@jongyoul
Member Author

@tnachen

  • Slave page (screenshot attached)
  • Sandbox page (screenshot attached)
  • stderr (screenshot attached)

@jongyoul
Member Author

@tnachen And here are the slave's logs around tasks 34 and 63. It looks like if any task hits an error while running, the executor running that task is terminated. Please take a look.

I0117 17:21:43.678827 41388 slave.cpp:625] Got assigned task 34 for framework 20150117-171023-3391097354-60030-7325-0004
I0117 17:21:43.679612 41388 slave.cpp:734] Launching task 34 for framework 20150117-171023-3391097354-60030-7325-0004
I0117 17:21:43.721297 41388 slave.cpp:844] Queuing task '34' for executor 20141110-112437-3374320138-60030-57359-44 of framework '20150117-171023-3391097354-60030-7325-0004
I0117 17:21:43.775977 41388 slave.cpp:358] Successfully attached file '/data03/mesos/slaves/20141110-112437-3374320138-60030-57359-44/frameworks/20150117-171023-3391097354-60030-7325-0004/executors/20141110-112437-3374320138-60030-57359-44/runs/3fdbdd09-98cd-4197-954f-d95d9b3b4aee'
I0117 17:21:43.721451 41386 mesos_containerizer.cpp:407] Starting container '3fdbdd09-98cd-4197-954f-d95d9b3b4aee' for executor '20141110-112437-3374320138-60030-57359-44' of framework '20150117-171023-3391097354-60030-7325-0004'
I0117 17:21:43.777179 41386 mesos_containerizer.cpp:528] Fetching URIs for container '3fdbdd09-98cd-4197-954f-d95d9b3b4aee' using command '/usr/bin/env MESOS_EXECUTOR_URIS="hdfs:///app/spark/spark-1.3.0-SNAPSHOT-bin-2.3.0-cdh5.0.1.tgz+0X" MESOS_WORK_DIRECTORY=/data03/mesos/slaves/20141110-112437-3374320138-60030-57359-44/frameworks/20150117-171023-3391097354-60030-7325-0004/executors/20141110-112437-3374320138-60030-57359-44/runs/3fdbdd09-98cd-4197-954f-d95d9b3b4aee HADOOP_HOME=/app/hdfs/ /app/mesos-0.18.1/libexec/mesos/mesos-fetcher'
I0117 17:22:28.863304 41374 slave.cpp:2523] Current usage 44.85%. Max allowed age: 3.160841566048842days
I0117 17:22:38.472086 41384 slave.cpp:625] Got assigned task 63 for framework 20150117-171023-3391097354-60030-7325-0004
I0117 17:22:38.472584 41384 slave.cpp:734] Launching task 63 for framework 20150117-171023-3391097354-60030-7325-0004
I0117 17:22:38.472801 41384 slave.cpp:844] Queuing task '63' for executor 20141110-112437-3374320138-60030-57359-44 of framework '20150117-171023-3391097354-60030-7325-0004
I0117 17:22:43.721726 41370 slave.cpp:2475] Terminating executor 20141110-112437-3374320138-60030-57359-44 of framework 20150117-171023-3391097354-60030-7325-0004 because it did not register within 1mins
I0117 17:22:43.722038 41378 mesos_containerizer.cpp:818] Destroying container '3fdbdd09-98cd-4197-954f-d95d9b3b4aee'
I0117 17:22:43.722295 41378 slave.cpp:2052] Executor '20141110-112437-3374320138-60030-57359-44' of framework 20150117-171023-3391097354-60030-7325-0004 has terminated with unknown status
E0117 17:22:43.722744 41376 slave.cpp:2332] Failed to unmonitor container for executor 20141110-112437-3374320138-60030-57359-44 of framework 20150117-171023-3391097354-60030-7325-0004: Not monitored
I0117 17:22:43.737566 41378 slave.cpp:1669] Handling status update TASK_LOST (UUID: cc571b34-e161-4ec9-bcf1-033bd967209f) for task 34 of framework 20150117-171023-3391097354-60030-7325-0004 from @0.0.0.0:0
I0117 17:22:43.737829 41378 slave.cpp:3142] Terminating task 34
I0117 17:22:43.738701 41372 status_update_manager.cpp:315] Received status update TASK_LOST (UUID: cc571b34-e161-4ec9-bcf1-033bd967209f) for task 34 of framework 20150117-171023-3391097354-60030-7325-0004
I0117 17:22:43.739341 41378 slave.cpp:1669] Handling status update TASK_LOST (UUID: f198e879-a762-4cce-97ff-5261cb4ff820) for task 63 of framework 20150117-171023-3391097354-60030-7325-0004 from @0.0.0.0:0
I0117 17:22:43.739398 41372 status_update_manager.cpp:494] Creating StatusUpdate stream for task 34 of framework 20150117-171023-3391097354-60030-7325-0004
I0117 17:22:43.739542 41378 slave.cpp:3142] Terminating task 63
I0117 17:22:43.739869 41372 status_update_manager.cpp:368] Forwarding status update TASK_LOST (UUID: cc571b34-e161-4ec9-bcf1-033bd967209f) for task 34 of framework 20150117-171023-3391097354-60030-7325-0004 to [email protected]:60030
I0117 17:22:43.740393 41372 status_update_manager.cpp:315] Received status update TASK_LOST (UUID: f198e879-a762-4cce-97ff-5261cb4ff820) for task 63 of framework 20150117-171023-3391097354-60030-7325-0004
I0117 17:22:43.740411 41384 slave.cpp:1789] Status update manager successfully handled status update TASK_LOST (UUID: cc571b34-e161-4ec9-bcf1-033bd967209f) for task 34 of framework 20150117-171023-3391097354-60030-7325-0004
I0117 17:22:43.740573 41372 status_update_manager.cpp:494] Creating StatusUpdate stream for task 63 of framework 20150117-171023-3391097354-60030-7325-0004
I0117 17:22:43.740892 41372 status_update_manager.cpp:368] Forwarding status update TASK_LOST (UUID: f198e879-a762-4cce-97ff-5261cb4ff820) for task 63 of framework 20150117-171023-3391097354-60030-7325-0004 to [email protected]:60030
I0117 17:22:43.741240 41379 slave.cpp:1789] Status update manager successfully handled status update TASK_LOST (UUID: f198e879-a762-4cce-97ff-5261cb4ff820) for task 63 of framework 20150117-171023-3391097354-60030-7325-0004
I0117 17:22:43.762383 41394 process.cpp:1010] Socket closed while receiving
I0117 17:22:43.762490 41370 status_update_manager.cpp:393] Received status update acknowledgement (UUID: cc571b34-e161-4ec9-bcf1-033bd967209f) for task 34 of framework 20150117-171023-3391097354-60030-7325-0004
I0117 17:22:43.762662 41394 process.cpp:1010] Socket closed while receiving
I0117 17:22:43.762800 41370 status_update_manager.cpp:525] Cleaning up status update stream for task 34 of framework 20150117-171023-3391097354-60030-7325-0004
I0117 17:22:43.763167 41370 status_update_manager.cpp:393] Received status update acknowledgement (UUID: f198e879-a762-4cce-97ff-5261cb4ff820) for task 63 of framework 20150117-171023-3391097354-60030-7325-0004
I0117 17:22:43.763293 41388 slave.cpp:1256] Status update manager successfully handled status update acknowledgement (UUID: cc571b34-e161-4ec9-bcf1-033bd967209f) for task 34 of framework 20150117-171023-3391097354-60030-7325-0004
I0117 17:22:43.763332 41370 status_update_manager.cpp:525] Cleaning up status update stream for task 63 of framework 20150117-171023-3391097354-60030-7325-0004
I0117 17:22:43.763442 41388 slave.cpp:3165] Completing task 34
I0117 17:22:43.763651 41388 slave.cpp:1256] Status update manager successfully handled status update acknowledgement (UUID: f198e879-a762-4cce-97ff-5261cb4ff820) for task 63 of framework 20150117-171023-3391097354-60030-7325-0004
I0117 17:22:43.763836 41388 slave.cpp:3165] Completing task 63
I0117 17:22:43.763959 41388 slave.cpp:2198] Cleaning up executor '20141110-112437-3374320138-60030-57359-44' of framework 20150117-171023-3391097354-60030-7325-0004
I0117 17:22:43.764222 41374 gc.cpp:56] Scheduling '/data03/mesos/slaves/20141110-112437-3374320138-60030-57359-44/frameworks/20150117-171023-3391097354-60030-7325-0004/executors/20141110-112437-3374320138-60030-57359-44/runs/3fdbdd09-98cd-4197-954f-d95d9b3b4aee' for gc 6.99999115569481days in the future
I0117 17:22:43.764240 41388 slave.cpp:2273] Cleaning up framework 20150117-171023-3391097354-60030-7325-0004
I0117 17:22:43.764397 41374 gc.cpp:56] Scheduling '/data03/mesos/slaves/20141110-112437-3374320138-60030-57359-44/frameworks/20150117-171023-3391097354-60030-7325-0004/executors/20141110-112437-3374320138-60030-57359-44' for gc 6.99999115478222days in the future
I0117 17:22:43.764551 41371 status_update_manager.cpp:277] Closing status update streams for framework 20150117-171023-3391097354-60030-7325-0004
I0117 17:22:43.764658 41374 gc.cpp:56] Scheduling '/data03/mesos/slaves/20141110-112437-3374320138-60030-57359-44/frameworks/20150117-171023-3391097354-60030-7325-0004' for gc 6.99999115088889days in the future
I0117 17:23:21.208463 41377 launcher.cpp:120] Forked child with pid '51757' for container '3fdbdd09-98cd-4197-954f-d95d9b3b4aee'
W0117 17:23:21.209522 41386 mesos_containerizer.cpp:808] Ignoring destroy of unknown container: 3fdbdd09-98cd-4197-954f-d95d9b3b4aee
E0117 17:23:21.209867 41389 slave.cpp:1956] Container '3fdbdd09-98cd-4197-954f-d95d9b3b4aee' for executor '20141110-112437-3374320138-60030-57359-44' of framework '20150117-171023-3391097354-60030-7325-0004' failed to start: Collect failed: Unknown container: 3fdbdd09-98cd-4197-954f-d95d9b3b4aee
I0117 17:23:28.864325 41384 slave.cpp:2523] Current usage 44.86%. Max allowed age: 3.159526880805023days

@mateiz
Contributor

mateiz commented Jan 17, 2015

@jongyoul the goal of fine-grained mode is to run many Spark tasks in the same executor, which is why we're giving them all the same executor ID. Mesos supports this in its concept of executors, and it has the benefit that Mesos can account for the CPUs used by each task separately and give those CPUs to other frameworks when Spark is not active. In contrast, coarse-grained mode reserves the CPUs on the machine for the whole lifetime of the executor.
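For comparison, a small configuration sketch of the two modes described here; spark.mesos.coarse was the switch between them in Spark at the time, and the app name and master URL are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Fine-grained (the Mesos default then): every Spark task maps to a Mesos task,
// all sharing one executor per slave, so idle CPUs are returned to Mesos.
val fineGrained = new SparkConf()
  .setAppName("mesos-mode-demo")               // hypothetical app name
  .setMaster("mesos://zk://host:2181/mesos")   // placeholder master URL
  .set("spark.mesos.coarse", "false")

// Coarse-grained: one long-running Mesos task holds the executor's CPUs for
// the whole lifetime of the Spark application.
val coarseGrained = new SparkConf()
  .setAppName("mesos-mode-demo")
  .setMaster("mesos://zk://host:2181/mesos")
  .set("spark.mesos.coarse", "true")

// Only one SparkContext per JVM; pick whichever mode you want to test.
val sc = new SparkContext(fineGrained)
```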

@tnachen
Contributor

tnachen commented Jan 17, 2015

From the logs it did indeed hit the executor registration timeout (1 minute), so Mesos terminated the task.

I don't think changing the executor ID fixes this problem, and I don't think it's necessary.

Can you try raising the timeout via the slave flags and run your test again?
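A hedged example of the kind of slave flag being referred to; the flag's default of 1mins lines up with the "did not register within 1mins" log line above, but check your Mesos version's --help, and the value shown is only an illustration.

```sh
# Raise the executor registration timeout from the default 1mins.
mesos-slave --master=zk://host:2181/mesos \
            --executor_registration_timeout=5mins
```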

@jongyoul
Member Author

@tnachen Yes, I've also found the timeout you mentioned. I'll check again after changing that value. But I still think changing the executorId is needed: if two tasks are running under the same executor ID (i.e. on the same node) and one task fails, all of the tasks fail. Don't you think that's a problem?

@tnachen
Contributor

tnachen commented Jan 20, 2015

The executor is responsible for launching and waiting for tasks, but whether one task interferes with another depends entirely on how the executor is implemented. In Spark's case the executor outlives its tasks, so if one task fails it won't interfere with another.
Feel free to try it yourself.

@jongyoul
Member Author

@tnachen In my case (see the logs above), tasks 34 and 63 are assigned to the same executor, and thus the same container, on the same node. Task 34 hits the registration-timeout error and is terminated, and task 63, which is still queued, also exits because the mesos_containerizer destroys the container holding both tasks. I think this is a bug. As you said, an executor is responsible for managing tasks, and it can terminate all tasks that are running or queued, but I'd expect only the failing task to be terminated; one task shouldn't be affected by another. What do you think of this situation?

@tnachen
Contributor

tnachen commented Jan 20, 2015

Yes, when the executor can't be launched, all of the tasks assigned to it are LOST. IMO this is really a configuration problem, not a normal failure that should occur.

@jongyoul
Member Author

@tnachen OK, I see. It only happens when the executor can't be launched, doesn't it? I'll change that setting first.

@jongyoul
Member Author

I'll also close this PR. I've misunderstood Mesos. #4170

@jongyoul jongyoul closed this Jan 28, 2015