
Conversation

@rhc54
Contributor

@rhc54 rhc54 commented Jun 19, 2017

Shift the signal forwarding code to ess/base so it is available to more than just the hnp component. Extend the slurm component to use it so that any signals given directly to the daemons by their slurmstepd get forwarded to their local clients.

Check for NULL

Resolve differences from OMPI master

Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit 066d5ee)
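For context, the forwarding pattern amounts to the daemon trapping a signal and re-sending it to each local client process it manages. Below is a minimal POSIX sketch of that idea, not the actual ess/base code; the PID table and the chosen signal set are placeholders:

```c
#include <signal.h>
#include <sys/types.h>

/* Hypothetical table of local client PIDs the daemon launched */
static pid_t local_clients[64];
static int num_clients = 0;

/* Relay whatever signal the daemon received to every local client.
 * kill() is async-signal-safe, so calling it from a handler is fine. */
static void forward_signal(int sig)
{
    for (int i = 0; i < num_clients; i++) {
        if (local_clients[i] > 0) {
            kill(local_clients[i], sig);
        }
    }
}

static void register_forwarding(void)
{
    struct sigaction act;
    act.sa_handler = forward_signal;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    /* example set only; the component decides which signals to forward */
    sigaction(SIGURG,  &act, NULL);
    sigaction(SIGUSR1, &act, NULL);
    sigaction(SIGUSR2, &act, NULL);
}
```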

@rhc54 rhc54 added this to the v2.1.2 milestone Jun 19, 2017
@rhc54 rhc54 self-assigned this Jun 19, 2017
@rhc54 rhc54 requested a review from hjelmn June 19, 2017 17:44
@hppritcha
Member

bot:mellanox:retest

@jsquyres
Member

@rhc54 Is this code already in the 3.x branch, perchance?

@rhc54
Contributor Author

rhc54 commented Jun 19, 2017

yes

@jsquyres
Member

@rhc54 👍 Thanks

@rhc54
Contributor Author

rhc54 commented Jun 20, 2017

bot:mellanox:retest

@hjelmn
Member

hjelmn commented Jun 21, 2017

bot:mellanox:retest

@jladd-mlnx
Member

We are aware of the issue currently affecting the Mellanox Jenkins servers. The issue is being addressed and we hope it will be resolved soon. We apologize for the inconvenience and thank you for your patience.

@jladd-mlnx
Member

bot:mellanox:retest

@hppritcha
Member

I'm not sure this PR is actually working. I built it and launched a job using mpirun on my cluster.
In another window, I then ran scancel --signal=KILL jobid; the job terminated, but I also got this message:

salloc: Job allocation 1077065 has been revoked.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[12645,0],0] on node sn358
  Remote daemon: [[12645,0],1] on node sn359

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

@hppritcha
Member

Never mind the last comment. I may be picking up the system's orted rather than my patched one; I forgot to use

--enable-orterun-prefix-by-default

@hppritcha
Member

Well, I double-checked and realized that the only other orted executable on the system is from 1.10.5, so I was in fact picking up the right orted. Unfortunately, the error message about lost communication persists, so I don't think this patch by itself is sufficient to fix the problem at our site.

@rhc54
Contributor Author

rhc54 commented Jun 22, 2017

I'm puzzled. The point of this change is to have the orteds capture signals and forward them on to their procs. If you hit them with SIGKILL, there is nothing they can do, since SIGKILL cannot be trapped; they are just going to die.
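(As a side note: SIGKILL, like SIGSTOP, cannot be caught, blocked, or ignored, so no forwarding handler ever runs for it. A minimal standalone check, just to illustrate the point:)

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct sigaction act;
    act.sa_handler = SIG_IGN;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;

    /* POSIX does not allow SIGKILL to be caught, blocked, or ignored,
     * so sigaction() rejects it with EINVAL */
    if (sigaction(SIGKILL, &act, NULL) != 0) {
        printf("cannot trap SIGKILL: %s\n", strerror(errno));
    }
    return 0;
}
```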

@hppritcha
Member

I'm querying the consulting folks to see which signal was causing the issue. I don't think it was just USR1, USR2, etc.

@hjelmn
Member

hjelmn commented Jun 22, 2017

The signals they want handled are URG, USR1, and USR2.

@hppritcha
Member

@hjelmn the site consultants seem to be dealing with a different ticket, related to what happens when a job exceeds its time limit. Here's what happens with Open MPI 2.1.1 + this PR when an app runs over its time limit:

#----------------------------------------------------------------
# Benchmarking Allgatherv 
# #processes = 16 
# ( 48 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         7.12         7.34         7.20
            1         1000        14.39        15.10        14.62
            2         1000        14.42        15.07        14.67
            4         1000        14.49        15.18        14.75
            8         1000        14.82        15.60        15.05
           16         1000        14.90        15.64        15.22
           32         1000        15.56        16.36        15.78
           64         1000        15.56        16.38        15.88
          128         1000        16.03        16.79        16.32
          256         1000        16.77        17.43        17.03
          512         1000        17.82        18.59        18.16
salloc: Job 1078432 has exceeded its time limit and its allocation has been revoked.
         1024         1000        20.73        21.62        21.08
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[43334,0],0] on node sn007
  Remote daemon: [[43334,0],1] on node sn008

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

@rhc54
Contributor Author

rhc54 commented Jun 26, 2017

The problem described above is an issue with SLURM, not OMPI. I reported it to SchedMD and they fixed it for future releases (I believe the one just released contains the fix), but they didn't backport it. The problem was that they were hitting the job with a SIGKILL right away, instead of using the usual SIGCONT, SIGTERM, SIGKILL sequence. Thus, we have no opportunity to do anything other than just die.
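A rough POSIX sketch of the SIGCONT/SIGTERM/SIGKILL escalation described here, not SLURM's actual code; the process group handling and grace period are placeholders:

```c
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Escalating termination: wake stopped tasks, ask them to exit,
 * then force the issue after a grace period. */
static void terminate_job_step(pid_t pgid, unsigned grace_seconds)
{
    kill(-pgid, SIGCONT);   /* wake any stopped tasks */
    kill(-pgid, SIGTERM);   /* give them a chance to exit cleanly */
    sleep(grace_seconds);   /* wait out the grace period */
    kill(-pgid, SIGKILL);   /* then force termination */
}
```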

@hppritcha
Member

Would this be SLURM 17.02.5?

@rhc54
Contributor Author

rhc54 commented Jun 26, 2017

Looks like it's in 17.11, though the NEWS text doesn't quite describe what it actually does:

-- If a task in a parallel job fails and it was launched with the
   --kill-on-bad-exit option then terminate the remaining tasks using
   the SIGCONT, SIGTERM and SIGKILL signals rather than just sending
   SIGKILL.

As I understand it, they changed the timeout handling as well so jobs can terminate cleanly.

@hppritcha
Member

I checked further with the LANL consultants. This patch plus a workaround in the users' scripts will be sufficient.

@hppritcha
Member

@hjelmn please review

@jsquyres jsquyres merged commit 2116c91 into open-mpi:v2.x Jul 11, 2017
@rhc54 rhc54 deleted the cmr20/signal branch March 25, 2018 17:45