Shift the signal forwarding code to ess/base #3717
Conversation
Shift the signal forwarding code to ess/base so it can be available to more than just the hnp component. Extend the slurm component to use it so that any signals given directly to the daemons by their slurmstepd get forwarded to their local clients.

Check for NULL

Resolve differences from OMPI master

Signed-off-by: Ralph Castain <[email protected]>
(cherry picked from commit 066d5ee)
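For context, a minimal standalone sketch of what "forwarding" means here: the daemon traps selected signals and relays them to the local processes it launched. This is not the actual ess/base code; the child-tracking array, `install_forwarders` helper, and the exact set of trapped signals are illustrative assumptions.

```c
/* Minimal sketch of daemon-side signal forwarding (illustration only,
 * not the ORTE ess/base implementation). */
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_LOCAL_PROCS 64

static pid_t local_children[MAX_LOCAL_PROCS]; /* pids of locally launched ranks */
static int num_children = 0;

/* Relay the received signal to every local child.
 * kill() is async-signal-safe, so calling it in a handler is legal. */
static void forward_signal(int sig)
{
    for (int i = 0; i < num_children; i++) {
        if (local_children[i] > 0) {
            kill(local_children[i], sig);
        }
    }
}

static void install_forwarders(void)
{
    struct sigaction sa;
    sa.sa_handler = forward_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;

    /* Forward the signals of interest; SIGKILL/SIGSTOP cannot be caught. */
    sigaction(SIGUSR1, &sa, NULL);
    sigaction(SIGUSR2, &sa, NULL);
    sigaction(SIGURG,  &sa, NULL);
    sigaction(SIGTERM, &sa, NULL);
}

int main(void)
{
    /* Launch one dummy child so there is something to forward to. */
    pid_t pid = fork();
    if (pid == 0) {
        pause();            /* child: wait until a forwarded signal arrives */
        _exit(0);
    }
    local_children[num_children++] = pid;

    install_forwarders();
    pause();                /* "daemon": wait for a signal from slurmstepd */
    return 0;
}
```

The PR itself moves the equivalent logic out of the hnp ess component into ess/base so the slurm ess component can reuse it.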
bot:mellanox:retest
@rhc54 Is this code already in the 3.x branch, perchance?
yes
@rhc54 👍 Thanks
bot:mellanox:retest
bot:mellanox:retest
We are aware of the issue currently affecting the Mellanox Jenkins servers. The issue is being addressed and we hope it will be resolved soon. We apologize for the inconvenience and thank you for your patience.
bot:mellanox:retest
I'm not sure this PR is actually working. I built it and launched a job using mpirun on my cluster.
Never mind my last comment. I may be picking up the system's orted rather than my patched one; I forgot to use
Well, I double-checked and realized that the only other orted executable on the system is from 1.10.5, so I was in fact picking up the right orted. Unfortunately, the error message about lost communication persists, so I don't think this patch by itself is sufficient to fix the problem here at our site.
I'm puzzled - the problem was to have the orteds capture and forward signals on to their procs. If you hit them with SIGKILL, there is nothing they can do as it cannot be trapped - they are just going to die.
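A quick illustration of that point (a throwaway example, not code from the PR): POSIX simply refuses to install a handler for SIGKILL, so the orted never gets a chance to react.

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = SIG_IGN;

    /* SIGKILL (and SIGSTOP) cannot be caught, blocked, or ignored;
     * sigaction() fails with EINVAL, so no forwarding is possible. */
    if (sigaction(SIGKILL, &sa, NULL) == -1) {
        perror("sigaction(SIGKILL)");
    }
    return 0;
}
```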
I'm querying the consultants to see which signal was causing the issue. I don't think it was just USR1, USR2, etc.
The signals they want handled are URG, USR1, and USR2.
@hjelmn The site consultants seem to be dealing with a different ticket related to what happens when the job exceeds its time limit. Here's what happens with Open MPI 2.1.1 + this PR when an app runs over its time limit:
The problem described above is an issue with SLURM, not OMPI. I reported it to SchedMD and they fixed it for future releases - I believe the one just released contains it - but they didn't backport it. The problem was that they were hitting the job with a SIGKILL right away, instead of using the usual SIGCONT, SIGTERM, SIGKILL sequence. Thus, we have no opportunity to do anything other than just die.
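For reference, that "usual sequence" looks roughly like this from the resource manager's side. This is a sketch only: `terminate_step` is a made-up name and the grace period is an illustrative value, not SLURM's actual KillWait setting.

```c
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch of the usual termination escalation: wake the step, ask it to
 * exit cleanly, allow a grace period, then force-kill if needed. */
static void terminate_step(pid_t step_pid, unsigned grace_seconds)
{
    kill(step_pid, SIGCONT);     /* wake the step in case it is stopped */
    kill(step_pid, SIGTERM);     /* ask it to shut down cleanly; the orteds
                                    can trap this and forward it */
    sleep(grace_seconds);        /* grace period for cleanup */

    if (kill(step_pid, 0) == 0) {
        kill(step_pid, SIGKILL); /* still alive: force termination */
    }
}

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {              /* stand-in for the job step */
        pause();
        _exit(0);
    }
    terminate_step(pid, 5);
    return 0;
}
```

Skipping straight to the SIGKILL step is what removed any chance for the daemons to forward a trappable signal first.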
Would this be SLURM 17.02.5?
Looks like it's in 17.11, though the NEWS text doesn't quite describe what it actually does. As I understand it, they changed the timeout behavior as well so jobs can terminate cleanly.
I checked further with the LANL consultants. This patch plus a workaround in the users' scripts will be sufficient.
@hjelmn please review