Support for Slurm 21.08 #227
Starting from Slurm 21.08, the Slurm version is declared in this dedicated header file.
These templates have been generated using autopxd2, installed with:

  $ pip install autopxd2

Then, *.pxd files have been generated using these commands:

  $ cd /usr/include
  $ autopxd --include-dir . slurm/slurm_errno.h > \
      ~/pyslurm/jinja2/slurm_errno.h.pxd
  $ autopxd --include-dir . slurm/slurm.h > \
      ~/pyslurm/jinja2/slurm.h.pxd
  $ autopxd --include-dir . slurm/slurmdb.h > \
      ~/pyslurm/jinja2/slurmdb.h.pxd

Then, jinja2/slurm.h.pxd and jinja2/slurmdb.h.pxd have been manually modified to:

- Remove libc.stdint import
- Remove duplicated slurm_errno symbols
- Include defines from dedicated subdir

Additionally:

- in jinja2/slurm.h.pxd:
  - symbols SLURM_ERROR, SLURM_SUCCESS and SLURM_VERSION have been restored,
  - slurm_addr_t control_addr and pthread_mutex_t lock are commented out, just like before, to avoid compilation error with these types (they are not used by pyslurm).
- in jinja2/slurmdb.h.pxd, all symbols duplicated in jinja2/slurm.h.pxd have been removed.

This finally produces this commit.
To generate these files, I used j2cli:

  $ pip install j2cli
  $ j2 jinja2/slurm.j2 > pyslurm/slurm.pxd
There are some new parameters, some have vanished, some have been renamed.
In Slurm 21.08, slurm_kill_job2() now expects a fourth char* sibling argument.
The checks initially failed because the GitHub Actions workflow pulled tag 20.11.8 of @giovtorres' docker-centos7-slurm Docker image. I just pushed an additional commit to pull the existing tag 21.08.0 of the same image. Now it fails on another error, it segfaults in Unless somebody has a clear idea of what is going on, I will have a deeper look into it.
I pushed an additional commit to update the expected result in test_slurm_api_version. Regarding the segfault in
For these reasons, I strongly suspect it is a bug in Slurm 21.08.0 that has been fixed (at least) in Slurm 21.08.4. I had a quick look at the Slurm changelog but I couldn't find anything really relevant, except maybe: SchedMD/slurm@e98e23c Maybe it's worth trying to update @giovtorres' docker-centos7-slurm Docker image with the latest Slurm 21.08 version?
Thanks for your work on this. It was quite helpful for us. We did run into a few problems with slurmdb_ functions. Here's a short diff that fixed them:
pyslurm.slurm_init() needs to be invoked prior to running any other slurm library calls (per a comment in "slurm.h").
Use hard-coded NULL for slurm.slurmdb_connection_get() persist_conn_flags consistently.

Persistent connections are not used in PySlurm. In one class, the variable was set to NULL and not used elsewhere; in class slurmdb_jobs, the pointer was not preallocated. I propose that NULL be used consistently in both cases.

Co-authored-by: Nicholas Carriero <[email protected]>
The db_conn attribute of class slurmdb_clusters is declared and used in get(), but it was not initialized with a proper connection. Also close the connection and free the allocation in __dealloc__().

Co-authored-by: Nicholas Carriero <[email protected]>
Starting from Slurm 20.11, slurm_init() must be called prior to any other Slurm library API calls. For the moment, it loads the Slurm configuration structure. For reference: SchedMD/slurm@e35a6e3

Conversely, slurm_fini() cleans up the configuration data structures in memory.

Co-authored-by: Nicholas Carriero <[email protected]>
This way, PySlurm consumers do not have to do it explicitly.
Good catches @njcarriero! Thank you for reporting them. I integrated your patches in the PR. I deliberately chose to hard-code @njcarriero: I set you as co-author of the commits, please tell me if you disagree. I added a call to I was hopeful but unfortunately these new updates do not fix the segfault on Also, I'm not sure if we need to handle the call Please share your thoughts!
I've tried this PR and it builds fine but:
docker-compose-github.yml
@@ -2,7 +2,7 @@ version: "3.8"

 services:
   slurm:
-    image: giovtorres/docker-centos7-slurm:20.11.8
+    image: giovtorres/docker-centos7-slurm:21.08.0
@rezib I pushed a 21.08.6 version you can retry with.
Thank you @giovtorres 👍
I bumped to the latest docker-centos7-slurm:21.08.6 but unfortunately, it fails even quicker, claiming it does not find
My code base (your original PR plus the changes I sent) does not have a problem with slurm_load_slurmd_status() for a node running slurmd:
It does have a problem on a node that is not running slurmd. I am guessing (but haven't tested) that if you do the free only in the errCode success block, it will be happier. [[GUESSED WRONG, moving the free doesn't change the behavior.]] As for

26 Feb Update: It looks like the problem arises from a collision between libslurm's error() function and error(3).
Looks like I missed something when converting to using
Thanks @giovtorres 👍 The patch is applied, we now have real test failures!
At least it looks like something changed regarding node name management!
@rezib I took a look and I think we can ignore those test failures. I've been having a hard time with these containers after Slurm made a change to the I'm ok to ship these changes, given that they have been tested by you and @njcarriero on real hardware. Is there anything else needed before merging?
No, personally I'm fine with the PR in its current state!
One final note about slurm_load_slurmd_status. It looks like one can specify a dlopen flag to alter the function look up:
With this change, no LD_PRELOAD is needed: on a node not running slurmd, the invocation fails with a clean error message instead of aborting. I'm not really sure that this is the "right" way to solve this problem (having to define the constant is a hint that it is not), but it is certainly better than LD_PRELOAD. At any rate, we don't need this function for our purposes, so I'm happy with things as they are.

I will note that my design choice would be to document the need for the user to call slurm_init() rather than do it automatically: since it takes an optional argument, it is really up to the user whether to supply it. This also better aligns with the C API. But either way, your work has been very useful. Thanks again!
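For readers unfamiliar with the mechanism mentioned in this comment, here is a minimal sketch of how dlopen flags can be adjusted from Python before an extension module is loaded. The comment does not name the flag that was used, so `ASSUMED_FLAG` below is purely a guess for illustration (`os.RTLD_DEEPBIND`, which only exists on some platforms), and `extra_dlopen_flags` is a hypothetical helper, not part of PySlurm.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def extra_dlopen_flags(extra):
    """Temporarily OR `extra` into the flags CPython passes to dlopen()."""
    previous = sys.getdlopenflags()
    sys.setdlopenflags(previous | extra)
    try:
        yield
    finally:
        # Restore the interpreter-wide flags for later imports.
        sys.setdlopenflags(previous)


# Assumption: the flag is a guess, not taken from the comment above.
# os.RTLD_DEEPBIND is missing on some platforms, hence the fallback to 0.
ASSUMED_FLAG = getattr(os, "RTLD_DEEPBIND", 0)

# Usage sketch (left commented out because pyslurm may not be installed):
# with extra_dlopen_flags(ASSUMED_FLAG):
#     import pyslurm
```

The flags only affect shared objects loaded while the context is active, so the import of the extension has to happen inside the `with` block.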
Nice work, guys! Now if only I can get my environment to work cleanly
@njcarriero Thank you for doing that research. Let's merge this and if you'd like, submit those changes in a separate PR.
Thanks for merging it!
Hello PySlurm maintainers,
Here is my proposal to add support for Slurm 21.08 in PySlurm.
I made many incremental commits with comments to explain how I proceeded. This way, I hope you can check more easily whether I am on the right path, or whether I missed something.
These patches have been successfully tested with Slurm 21.08.4 on a real HPC cluster.
This fixes #225.
I'm looking forward to your comments!