@awlauria
Contributor

 * The scenario is that we have a wrapper process placed before the
   MPI application:
```shell
 mpirun -np 2 wrapper ./hello_c
```
 * If `hello_c` crashes and `wrapper` detects it, then `wrapper` will
   exit with a non-zero exit status. The orted will notice that and
   start killing all local processes.
   - The orted will send `SIGKILL` to the `wrapper` process, and that
     process will terminate and leave `hello_c` running. `hello_c`
     will continue to run (in this test case it waits in `MPI_Finalize`)
     and the job will appear to hang.
 * This commit does two things, each of which fixes this scenario.
   1. After killing the process, mark it as not alive, since we are not
      going to wait on it. This prevents orted_cmd from seeing the process
      as alive and waiting for it to complete (note that the pid is set
      to `0`, so we wouldn't be able to mark it correctly later even if
      we did get a notice).
   2. Instead of sending the `SIGKILL` signal to just the `PID` of `wrapper`,
      send it to `-PID` so that the kernel delivers the signal to the
      whole process group under `wrapper` as well. This will cause the
      `hello_c` program to terminate as well.
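
For reference, the process-group signalling described in item 2 can be sketched in isolation. This is a minimal standalone illustration, not Open MPI code: `kill_wrapper_group` is a hypothetical helper, and it assumes the wrapper is the leader of its own process group (made explicit here with `setpgid`), since `kill(-PID, ...)` addresses the group whose ID is `PID`. The pipe exists only so the parent can observe that every descendant has exited.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper for illustration only (not part of Open MPI).
 * Forks a stand-in "wrapper" that leads its own process group and forks a
 * stand-in "hello_c" that blocks forever, then signals the whole group.
 * Returns 1 if every descendant exited after kill(-pgid, SIGKILL), else 0. */
int kill_wrapper_group(void)
{
    int fds[2];
    if (pipe(fds) != 0) return 0;

    pid_t wrapper = fork();
    if (wrapper < 0) return 0;
    if (wrapper == 0) {
        close(fds[0]);
        setpgid(0, 0);                  /* assumption: wrapper leads its own group */
        pid_t app = fork();
        if (app < 0) _exit(1);
        if (app == 0) {
            (void)write(fds[1], "r", 1); /* tell the parent the app is running */
            for (;;) pause();           /* stand-in for hello_c stuck in MPI_Finalize */
        }
        for (;;) pause();               /* wrapper waits until it is killed */
    }

    close(fds[1]);                      /* parent keeps only the read end */
    setpgid(wrapper, wrapper);          /* same call from the parent avoids a startup race */

    char c;
    if (read(fds[0], &c, 1) != 1) {     /* wait until the app stand-in is alive */
        close(fds[0]);
        return 0;
    }

    /* Negative pid: deliver SIGKILL to every process in group `wrapper`. */
    if (kill(-wrapper, SIGKILL) != 0) return 0;
    waitpid(wrapper, NULL, 0);

    /* EOF means both wrapper and app have closed their write ends, i.e. exited. */
    ssize_t n = read(fds[0], &c, 1);
    close(fds[0]);
    return n == 0;
}
```

With `kill(wrapper, SIGKILL)` in place of the negative pid, the final `read` would block, because the `hello_c` stand-in would survive and keep the write end open. That is the hang described above.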

Signed-off-by: Austen Lauria <[email protected]>
(cherry picked from commit 9a849b5)
@awlauria awlauria added this to the v4.1.2 milestone Aug 17, 2021
@awlauria awlauria requested review from jjhursey and rhc54 August 17, 2021 15:42
@jjhursey
Member

Ref comment #9261 (comment)

Member

@gpaulsen gpaulsen left a comment

Ah this one...

@rhc54
Contributor

rhc54 commented Aug 18, 2021

Please see explanation in #9261 - this patch is incorrect and should not be merged.

@awlauria
Contributor Author

Verified this is a dup of: #3773

@awlauria awlauria closed this Aug 18, 2021
@awlauria awlauria deleted the fix_abnormal_cleanup_v4.1.x branch August 18, 2021 13:47