
Conversation

@rhc54
Contributor

@rhc54 rhc54 commented Jun 19, 2017

Shift the signal forwarding code to ess/base so it is available to more than just the hnp component. Extend the slurm component to use it so that any signals given directly to the daemons by their slurmstepd get forwarded to their local clients.

Check for NULL

Resolve differences from OMPI master

Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit 066d5ee)
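For context, the forwarding pattern amounts to the daemon trapping a signal and re-sending it to each local client process it manages. Below is a minimal POSIX sketch of that idea, not the actual ess/base code; the PID table and the chosen signal set are placeholders:

```c
#include <signal.h>
#include <sys/types.h>

/* Hypothetical table of local client PIDs the daemon launched */
static pid_t local_clients[64];
static int num_clients = 0;

/* Relay whatever signal the daemon received to every local client.
 * kill() is async-signal-safe, so calling it from a handler is fine. */
static void forward_signal(int sig)
{
    for (int i = 0; i < num_clients; i++) {
        if (local_clients[i] > 0) {
            kill(local_clients[i], sig);
        }
    }
}

static void register_forwarding(void)
{
    struct sigaction act;
    act.sa_handler = forward_signal;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    /* example set only; the component decides which signals to forward */
    sigaction(SIGURG,  &act, NULL);
    sigaction(SIGUSR1, &act, NULL);
    sigaction(SIGUSR2, &act, NULL);
}
```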

@rhc54 rhc54 added this to the v2.1.2 milestone Jun 19, 2017
@rhc54 rhc54 self-assigned this Jun 19, 2017
@rhc54 rhc54 requested a review from hjelmn June 19, 2017 17:44
@hppritcha
Member

bot:mellanox:retest

@jsquyres
Member

@rhc54 Is this code already in the 3.x branch, perchance?

@rhc54
Contributor Author

rhc54 commented Jun 19, 2017

yes

@jsquyres
Member

@rhc54 👍 Thanks

@rhc54
Contributor Author

rhc54 commented Jun 20, 2017

bot:mellanox:retest

@hjelmn
Member

hjelmn commented Jun 21, 2017

bot:mellanox:retest

@jladd-mlnx
Member

We are aware of the issue currently affecting the Mellanox Jenkins servers. The issue is being addressed and we hope it will be resolved soon. We apologize for the inconvenience and thank you for your patience.

@jladd-mlnx
Member

bot:mellanox:retest

@hppritcha
Member

I'm not sure this PR is actually working. I built it and launched a job using mpirun on my cluster.
In another window, I then ran scancel --signal=KILL jobid; the job terminated, but I also got this message:

salloc: Job allocation 1077065 has been revoked.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[12645,0],0] on node sn358
  Remote daemon: [[12645,0],1] on node sn359

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

@hppritcha
Member

Never mind the last comment. I may be picking up the system's orted rather than my patched one; I forgot to use

--enable-orterun-prefix-by-default

@hppritcha
Member

Well, I double-checked and realized that the only other orted executable on the system is from 1.10.5, so I was in fact picking up the right orted. Unfortunately, the error message about lost communication persists, so I don't think this patch by itself is sufficient to fix the problem at our site.

@rhc54
Contributor Author

rhc54 commented Jun 22, 2017

I'm puzzled. The point of this change is to have the orteds capture signals and forward them on to their procs. If you hit them with SIGKILL, there is nothing they can do, since SIGKILL cannot be trapped; they are just going to die.
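(As a side note: SIGKILL, like SIGSTOP, cannot be caught, blocked, or ignored, so no forwarding handler ever runs for it. A minimal standalone check, just to illustrate the point:)

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct sigaction act;
    act.sa_handler = SIG_IGN;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;

    /* POSIX does not allow SIGKILL to be caught, blocked, or ignored,
     * so sigaction() rejects it with EINVAL */
    if (sigaction(SIGKILL, &act, NULL) != 0) {
        printf("cannot trap SIGKILL: %s\n", strerror(errno));
    }
    return 0;
}
```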

@hppritcha
Member

I'm querying the consulting folks to see which signal was causing the issue. I don't think it was just USR1, USR2, etc.

@hjelmn
Member

hjelmn commented Jun 22, 2017

The signals they want handled are URG, USR1, and USR2.

@hppritcha
Member

@hjelmn the site consultants seem to be dealing with a different ticket, related to what happens when a job exceeds its time limit. Here's what happens with Open MPI 2.1.1 + this PR when an app runs over its time limit:

#----------------------------------------------------------------
# Benchmarking Allgatherv 
# #processes = 16 
# ( 48 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         7.12         7.34         7.20
            1         1000        14.39        15.10        14.62
            2         1000        14.42        15.07        14.67
            4         1000        14.49        15.18        14.75
            8         1000        14.82        15.60        15.05
           16         1000        14.90        15.64        15.22
           32         1000        15.56        16.36        15.78
           64         1000        15.56        16.38        15.88
          128         1000        16.03        16.79        16.32
          256         1000        16.77        17.43        17.03
          512         1000        17.82        18.59        18.16
salloc: Job 1078432 has exceeded its time limit and its allocation has been revoked.
         1024         1000        20.73        21.62        21.08
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[43334,0],0] on node sn007
  Remote daemon: [[43334,0],1] on node sn008

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

@rhc54
Contributor Author

rhc54 commented Jun 26, 2017

The problem described above is an issue with SLURM, not OMPI. I reported it to SchedMD and they fixed it for future releases (I believe the one just released contains the fix), but they didn't backport it. The problem was that they were hitting the job with a SIGKILL right away, instead of using the usual SIGCONT, SIGTERM, SIGKILL sequence. Thus, we have no opportunity to do anything other than just die.
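A rough POSIX sketch of the SIGCONT/SIGTERM/SIGKILL escalation described here, not SLURM's actual code; the process group handling and grace period are placeholders:

```c
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Escalating termination: wake stopped tasks, ask them to exit,
 * then force the issue after a grace period. */
static void terminate_job_step(pid_t pgid, unsigned grace_seconds)
{
    kill(-pgid, SIGCONT);   /* wake any stopped tasks */
    kill(-pgid, SIGTERM);   /* give them a chance to exit cleanly */
    sleep(grace_seconds);   /* wait out the grace period */
    kill(-pgid, SIGKILL);   /* then force termination */
}
```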

@hppritcha
Member

Would this be SLURM 17.02.5?

@rhc54
Contributor Author

rhc54 commented Jun 26, 2017

Looks like it's in 17.11, though the NEWS text doesn't quite describe what it actually does:

-- If a task in a parallel job fails and it was launched with the
   --kill-on-bad-exit option then terminate the remaining tasks using
   the SIGCONT, SIGTERM and SIGKILL signals rather than just sending
   SIGKILL.

As I understand it, they changed the timeout handling as well so jobs can terminate cleanly.

@hppritcha
Member

I checked further with the LANL consultants. This patch plus a workaround in the users' scripts will be sufficient.

@hppritcha
Member

@hjelmn please review

@jsquyres jsquyres merged commit 2116c91 into open-mpi:v2.x Jul 11, 2017
@rhc54 rhc54 deleted the cmr20/signal branch March 25, 2018 17:45