From 008d2b6159c054d8ef870990937a694edd817670 Mon Sep 17 00:00:00 2001
From: Joshua Hursey
Date: Tue, 31 Jan 2017 16:26:30 -0500
Subject: [PATCH] odls/base: Fix abnormal cleanup when app is wrapped

 * The scenario is that we have a wrapper process placed before the
   MPI application:
   ```shell
   mpirun -np 2 wrapper ./hello_c
   ```
 * If `hello_c` crashes and `wrapper` detects it, then `wrapper` will
   exit with a non-zero exit status. The orted will notice that and
   begin killing all local processes.
   - The orted will send `SIGKILL` to the `wrapper` process, and that
     process will terminate and leave `hello_c` running. `hello_c`
     will continue to run (in this test case it waits in
     `MPI_Finalize`) and the job will seem to hang.
 * This commit does two things, each of which fixes this scenario:
   1. After killing the process, mark it as not alive since we are not
      going to wait on it. This prevents orted_cmd from seeing the
      process as alive and waiting for it to complete (note that the
      pid is set to `0`, so we would not be able to mark it correctly
      later even if we did get a notice).
   2. Instead of sending `SIGKILL` to just the `PID` of `wrapper`,
      send it to `-PID` so that the kernel delivers the signal to
      every process in the process group whose ID is `PID`, that is,
      `wrapper` and the processes under it. This will cause the
      `hello_c` program to terminate as well.

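The process-group behavior that item 2 relies on can be sketched
outside of Open MPI. This is an illustrative sketch only: `kill_group`
and `demo_group_kill` are hypothetical names, not ORTE functions, and
the sketch assumes the wrapper is the leader of its own process group
(this patch does not show where the orted arranges that). The helper
also rejects `pid <= 1`, because `kill(0, ...)` targets the caller's
own group and `kill(-1, ...)` targets every process the caller may
signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: SIGKILL the process group whose ID equals pid. */
int kill_group(pid_t pid)
{
    if (pid <= 1) {          /* avoid kill(0, ...) and kill(-1, ...) */
        errno = EINVAL;
        return -1;
    }
    return kill(-pid, SIGKILL);
}

/* Forks a "wrapper" that leads a new process group and forks an "app".
 * Returns 1 if the app is gone after the group kill, 0 if it survived,
 * and -1 on a setup error. */
int demo_group_kill(void)
{
    int fds[2];
    char c;

    if (pipe(fds) != 0) return -1;

    pid_t wrapper = fork();
    if (wrapper < 0) return -1;

    if (wrapper == 0) {
        /* "wrapper": become a group leader, then start the app. */
        close(fds[0]);
        setpgid(0, 0);
        pid_t app = fork();
        if (app < 0) _exit(1);
        if (app == 0) {
            /* "app": inherits the wrapper's group; report, then hang. */
            if (write(fds[1], "x", 1) != 1) _exit(1);
            for (;;) pause();
        }
        for (;;) pause();
    }

    /* Launcher side: keep only the read end of the pipe. */
    close(fds[1]);
    setpgid(wrapper, wrapper);   /* set here too, to avoid a race */

    /* Wait until the app is running before sending the signal. */
    if (read(fds[0], &c, 1) != 1) {
        kill(wrapper, SIGKILL);
        waitpid(wrapper, NULL, 0);
        close(fds[0]);
        return -1;
    }

    if (kill_group(wrapper) != 0) {
        close(fds[0]);
        return -1;
    }
    waitpid(wrapper, NULL, 0);

    /* The pipe reaches EOF only once wrapper and app are both dead. */
    struct pollfd pfd = { .fd = fds[0], .events = POLLIN };
    int ready = poll(&pfd, 1, 2000);
    int gone = (ready > 0 && read(fds[0], &c, 1) == 0);
    close(fds[0]);
    return gone ? 1 : 0;
}
```

Called from a `main`, `demo_group_kill()` should return 1 on a POSIX
system. Replacing `kill_group(wrapper)` with `kill(wrapper, SIGKILL)`
leaves the app running, so the poll times out and the function
returns 0, which mirrors the hang described above.
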
Signed-off-by: Austen Lauria
(cherry picked from commit 9a849b509febf099c884b8cad0f798fddd873bd7)
---
 orte/mca/odls/base/odls_base_default_fns.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/orte/mca/odls/base/odls_base_default_fns.c b/orte/mca/odls/base/odls_base_default_fns.c
index 8db35a6eb9e..c551f93d4bf 100644
--- a/orte/mca/odls/base/odls_base_default_fns.c
+++ b/orte/mca/odls/base/odls_base_default_fns.c
@@ -1947,7 +1947,12 @@ int orte_odls_base_default_kill_local_procs(opal_pointer_array_t *procs,
                                  "%s SENDING SIGKILL TO %s",
                                  ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                                  ORTE_NAME_PRINT(&cd->child->name)));
-            kill_local(cd->child->pid, SIGKILL);
+            /* Send signal to the negative of the PID to send the signal to all
+             * of the children of that PID - the process group under it.
+             * Otherwise it is delivered to only that PID.
+             */
+            kill_local(cd->child->pid * -1, SIGKILL);
+
             /* indicate the waitpid fired as this is effectively what
              * has happened
              */