[SPARK-5198][Mesos] Change executorId more unique on mesos fine-grained mode #3994
Conversation
Change executorId from slaveId to taskId.getValue
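For context, a minimal sketch of what the proposed change amounts to, assuming the Mesos Java protobuf API; the helper names below are illustrative and are not the actual code in MesosSchedulerBackend:

```scala
import org.apache.mesos.Protos.{ExecutorID, SlaveID, TaskID}

object ExecutorIdSketch {
  // Current fine-grained behaviour: all tasks on a slave share one executor, keyed by the slave ID.
  def currentExecutorId(slaveId: SlaveID): ExecutorID =
    ExecutorID.newBuilder().setValue(slaveId.getValue).build()

  // What this PR proposed instead: a distinct executor per task, keyed by the task ID.
  def proposedExecutorId(taskId: TaskID): ExecutorID =
    ExecutorID.newBuilder().setValue(taskId.getValue).build()
}
```
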
Test build #25392 has started for PR 3994 at commit

Test build #25392 has finished for PR 3994 at commit

Test PASSed.

/cc @JoshRosen @andrewor14 Review this, please.

@mateiz Could you please also review this PR? When two different tasks run on the same executor at the same time, they run in the same container. That results in the two tasks writing different logs into the same file. The most serious effect is that one launcher unarchives executor.tgz while another launcher is unarchiving the same file at the same location on the same node. This can cause one launcher to be lost.

This looks like it's going to create a separate executor for each task, which is the opposite of what we want for fine-grained mode. The goal of fine-grained mode is to share each executor between multiple tasks. What was the problem here? Do you have two Mesos slaves on the same machine and they clash?

@mateiz I have one slave per node, and the problem occurs when two tasks are launched at the same time. The two tasks run in the same container, so they write to the same log file and crash when they launch.

@mateiz Specifically, I've found that the Mesos launcher copies the Spark executor file from its URI and extracts it into the Mesos working directory. When tasks are launched at the same time, one task tries to launch while another task is still extracting the file. In that case the former task sometimes isn't launched, so the launcher returns TASK_LOST.

This sounds to me like a Mesos bug, since Mesos is what's un-tarring the executor. I'd suggest asking them. We can't make this change in Spark; it would break fine-grained mode.

It might also be that Mesos's behavior changed. It shouldn't be sending TASK_LOST while the executor is being unzipped; it should be queueing up the task and giving it to the executor once it's ready. @tnachen, what do you think about this?

@jongyoul When you're talking about the Spark file, you're talking about the Spark executor URI tar itself, right?

@tnachen Yes, I used a Spark executor URI like

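For readers following the thread: this is the `spark.executor.uri` setting used when running Spark on Mesos. The actual URI is not shown above, so the values below are hypothetical placeholders:

```scala
import org.apache.spark.SparkConf

// Hypothetical placeholders; the URI actually used in this thread is not shown.
val conf = new SparkConf()
  .setMaster("mesos://mesos-master:5050")
  .set("spark.executor.uri", "hdfs:///tmp/spark-1.2.0-bin-hadoop2.4.tgz")
```
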
@mateiz I don't think I fully understand how fine-grained mode is intended to behave. What would help me understand it better? I don't see how having multiple executors breaks Spark's intended behaviour.

@tnachen Also see the slave's logs around tasks 34 and 63. It looks like, if any task hits an error while running, the executor running that task is terminated. Please check this.

@jongyoul the goal of fine-grained mode is to run many Spark tasks in the same executor, which is why we're giving them all the same executor ID. Mesos supports this in its concept of executors, and it has the benefit that Mesos can account for the CPUs used by each task separately and give those CPUs to other frameworks when Spark is not active. In contrast, coarse-grained mode reserves the CPUs on the machine for the whole lifetime of the executor.

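To make the two modes concrete, here is a hedged configuration sketch; in Spark releases from this period the `spark.mesos.coarse` flag selected between them, with fine-grained mode as the default:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("mesos://mesos-master:5050") // placeholder master URL
  // false (the default then): fine-grained mode, one shared executor per slave,
  //   with CPUs accounted per task and handed back to Mesos between tasks.
  // true: coarse-grained mode, CPUs held for the executor's whole lifetime.
  .set("spark.mesos.coarse", "false")
```
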
From the logs it did indeed hit the executor registration timeout (1 minute), so Mesos terminated the task. I don't think changing the executor ID fixes this problem, and I don't think it's necessary. Can you try changing the timeout via slave flags to a longer value and try again?

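A possible way to act on this suggestion, assuming it refers to the standard `--executor_registration_timeout` slave flag (whose default is 1mins); the value and master address below are only examples:

```sh
# Assumption: this is the slave flag being referred to; raise it so the executor
# has time to fetch and unpack the Spark tarball before Mesos gives up on it.
mesos-slave --master=zk://zk-host:2181/mesos \
  --executor_registration_timeout=5mins
```
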
@tnachen Yes, I've also found what you mentioned about the timeout. I'll check it again after changing that value. But changing the executorId is still needed: if there are two tasks running under the same executor ID (on the same node) and one task fails, all of the tasks fail. Don't you think this is a problem?

The executor is responsible for launching and waiting for tasks, but whether one task interferes with another depends entirely on how the executor is implemented. In Spark's case the executor lives across tasks, so if one task fails it won't interfere with another one.

@tnachen In my case (the logs above), tasks 34 and 63 are assigned to the same executor, and thus to the same container on the same node. Task 34 hits the registration-timeout error and is terminated, and task 63, which is queued, also exits because the mesos_containerizer destroys the container holding tasks 34 and 63. I think this is a bug. As you said, an executor is responsible for managing tasks, and it can terminate all tasks that are running or queued. I think only the failing task should be terminated; a task must not be affected by any other task. What do you think of this situation?

Yes, when the executor can't get launched, all the tasks assigned to it are LOST. IMO this is really a configuration problem, not a normal failure that should occur.

@tnachen OK, I see. It happened because the executor couldn't get launched, didn't it? I'll change that setting first.

I'll also close this PR. I've misunderstood Mesos; see #4170.


