
Conversation

@rezib rezib (Contributor) commented Feb 22, 2022

Hello PySlurm maintainers,

Here is my proposal to add support for Slurm 21.08 in PySlurm.

I made many incremental commits with comments to explain how I proceeded. This way, I hope you can more easily check whether I followed the right path or missed anything.

These patches have been successfully tested with Slurm 21.08.4 on a real HPC cluster.

This fixes #225.

I'm looking forward to your comments!

rezib and others added 18 commits February 3, 2022 15:54
Starting from Slurm >= 21.08, the Slurm version is declared in a
dedicated header file.
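
For illustration, a minimal Cython sketch of how the version could be exposed in the generated .pxd templates; the header path slurm/slurm_version.h and the SLURM_VERSION_NUMBER macro are assumptions about Slurm 21.08, not code taken from this commit:

  # Hedged sketch: expose the packed version macro from the new header.
  cdef extern from "slurm/slurm_version.h":
      int SLURM_VERSION_NUMBER   # e.g. 0x150800 for 21.08.0 (major<<16 | minor<<8 | micro)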
These templates have been generated using autopxd2, installed with:

  $ pip install autopxd2

Then, *.pxd files have been generated using these commands:

  $ cd /usr/include
  $ autopxd --include-dir . slurm/slurm_errno.h > \
    ~/pyslurm/jinja2/slurm_errno.h.pxd
  $ autopxd --include-dir . slurm/slurm.h > \
    ~/pyslurm/jinja2/slurm.h.pxd
  $ autopxd --include-dir . slurm/slurmdb.h > \
    ~/pyslurm/jinja2/slurmdb.h.pxd

Then, jinja2/slurm.h.pxd and jinja2/slurmdb.h.pxd have been manually
modified to:

- Remove libc.stdint import
- Remove duplicated slurm_errno symbols
- Include defines from dedicated subdir

Additionally:

- in jinja2/slurm.h.pxd:
  - symbols SLURM_ERROR, SLURM_SUCCESS and SLURM_VERSION have been
    restored,
  - slurm_addr_t control_addr and pthread_mutex_t lock are commented
    out, just like before, to avoid compilation error with these types
    (they are not used by pyslurm).
- in jinja2/slurmdb.h.pxd, all symbols duplicated in
  jinja2/slurm.h.pxd have been removed.

This finally produces this commit.
To generate these files, I used j2cli:

  $ pip install j2cli
  $ j2 jinja2/slurm.j2 > pyslurm/slurm.pxd
Some parameters are new, some have vanished, and some have been
renamed.
In Slurm 21.08, slurm_kill_job2() now expects a fourth char* sibling
argument.
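
A hedged sketch of what an adapted wrapper could look like on the Cython side, assuming the .pxd declares slurm_kill_job2() with the const char* sibling parameter; the exact wrapper body is an assumption, but passing NULL for the new argument should preserve the previous single-cluster behaviour:

  def slurm_kill_job2(job_id, signal, flags, sibling=None):
      # Sketch: forward the new fourth "sibling" argument, NULL by default.
      cdef const char *sib = NULL
      job_id_b = job_id.encode("utf-8")
      if sibling is not None:
          sibling = sibling.encode("utf-8")
          sib = sibling
      return slurm.slurm_kill_job2(job_id_b, signal, flags, sib)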
@rezib rezib (Contributor, Author) commented Feb 23, 2022

The checks initially failed because the GitHub Actions workflow pulled tag 20.11.8 of @giovtorres' docker-centos7-slurm Docker image. I just pushed an additional commit to pull the existing tag 21.08.0 of the same image. Now it fails on another error: it segfaults in test_slurm_load_slurmd_status.

Unless somebody has a clear idea of what is going on, I will have a deeper look into it.

@rezib rezib (Contributor, Author) commented Feb 24, 2022

I pushed an additional commit to update the expected result in test_slurm_api_version.

Regarding the segfault in slurm_load_slurmd_status(), @gingergeeks noticed the bug disappeared when using libslurm instead of libslurmfull. The testing environment uses Slurm 21.08.0. On my side, I'm not able to reproduce the bug with PySlurm linked to libslurmfull with Slurm 21.08.4:

[root-node1-pts0] ~ # ldd /usr/lib64/python3.6/site-packages/pyslurm/pyslurm.cpython-36m-x86_64-linux-gnu.so
	linux-vdso.so.1 (0x00007ffd6b3b7000)
	libslurmfull.so => /usr/lib64/slurm/libslurmfull.so (0x000014ba8b359000)
	libpython3.6m.so.1.0 => /lib64/libpython3.6m.so.1.0 (0x000014ba8ae16000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x000014ba8abf6000)
	libc.so.6 => /lib64/libc.so.6 (0x000014ba8a831000)
	libdl.so.2 => /lib64/libdl.so.2 (0x000014ba8a62d000)
	libm.so.6 => /lib64/libm.so.6 (0x000014ba8a2ab000)
	libresolv.so.2 => /lib64/libresolv.so.2 (0x000014ba8a094000)
	libcrypto.so.1.1 => /lib64/libcrypto.so.1.1 (0x000014ba89bae000)
	libutil.so.1 => /lib64/libutil.so.1 (0x000014ba899aa000)
	/lib64/ld-linux-x86-64.so.2 (0x000014ba8ba94000)
	libz.so.1 => /lib64/libz.so.1 (0x000014ba89793000)
[root-node1-pts0] ~ # python3 -c "import pyslurm; print(pyslurm.slurm_load_slurmd_status())"
{'node1': {'actual_boards': 1, 'booted': 1645539434, 'actual_cores': 24, 'actual_cpus': 48, 'actual_real_mem': 772686, 'actual_sockets': 2, 'actual_threads': 1, 'actual_tmp_disk': 900287, 'hostname': 'node1', 'last_slurmctld_msg': 1645694137, 'pid': 679544, 'slurmd_debug': 3, 'slurmd_logfile': None, 'step_list': 'NONE', 'version': '21.08.4'}}

For these reasons, I strongly suspect it is a bug in Slurm 21.08.0 that has been fixed (at least) in Slurm 21.08.4. I took a quick look at the Slurm changelog but couldn't find anything really relevant, except maybe: SchedMD/slurm@e98e23c

Maybe it's worth trying to update @giovtorres' docker-centos7-slurm Docker image to the latest Slurm 21.08 version?

@njcarriero njcarriero (Contributor) commented

Thanks for your work on this. It was quite helpful for us.

We did run into a few problems with slurmdb_ functions.

Here's a short diff that fixed them:

diff --git a/pyslurm/pyslurm.pyx b/pyslurm/pyslurm.pyx
index cadd2e4..18be509 100644
--- a/pyslurm/pyslurm.pyx
+++ b/pyslurm/pyslurm.pyx
@@ -341,6 +341,14 @@ def slurm_api_version():
             SLURM_VERSION_MINOR(version),
             SLURM_VERSION_MICRO(version))
 
+def slurm_init(conf_file=None):
+    if conf_file:
+        slurm.slurm_init(conf_file.encode('utf8'))
+    else:
+        slurm.slurm_init(NULL)
+       
+def slurm_fini():
+    slurm.slurm_fini()
 
 def slurm_load_slurmd_status():
     """Issue RPC to get and load the status of Slurmd daemon.
@@ -5080,11 +5088,11 @@ cdef class slurmdb_jobs:
     cdef:
         void* db_conn
         slurm.slurmdb_job_cond_t *job_cond
-        uint16_t* persist_conn_flags
+        uint16_t persist_conn_flags
 
     def __cinit__(self):
         self.job_cond = <slurm.slurmdb_job_cond_t *>xmalloc(sizeof(slurm.slurmdb_job_cond_t))
-        self.db_conn = slurm.slurmdb_connection_get(self.persist_conn_flags)
+        self.db_conn = slurm.slurmdb_connection_get(&self.persist_conn_flags)
 
     def __dealloc__(self):
         slurm.xfree(self.job_cond)
@@ -5457,10 +5465,12 @@ cdef class slurmdb_clusters:
     cdef:
         void *db_conn
         slurm.slurmdb_cluster_cond_t *cluster_cond
+        uint16_t persist_conn_flags
 
     def __cinit__(self):
         self.cluster_cond = <slurm.slurmdb_cluster_cond_t *>xmalloc(sizeof(slurm.slurmdb_cluster_cond_t))
         slurm.slurmdb_init_cluster_cond(self.cluster_cond, 0)
+        self.db_conn = slurm.slurmdb_connection_get(&self.persist_conn_flags)
 
     def __dealloc__(self):
         slurm.slurmdb_destroy_cluster_cond(self.cluster_cond)

pyslurm.slurm_init() needs to be invoked prior to running any other slurm library calls (per a comment in "slurm.h").
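
For example, with the wrappers above, a consumer script would look roughly like this (a hedged sketch; per the comment in slurm.h, passing no file makes libslurm use the default slurm.conf, and "/path/to/slurm.conf" is only a placeholder):

  import pyslurm

  pyslurm.slurm_init()          # or pyslurm.slurm_init("/path/to/slurm.conf")
  try:
      print(pyslurm.slurm_load_slurmd_status())
  finally:
      pyslurm.slurm_fini()      # free the configuration data structures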

rezib and others added 4 commits February 25, 2022 09:49
Use a hard-coded NULL persist_conn_flags argument to
slurm.slurmdb_connection_get() consistently. Persistent connections are
not used in PySlurm. The variable was declared as NULL and not used
elsewhere, and the pointer was not preallocated in the slurmdb_jobs
class.

I propose using NULL consistently in both cases.

Co-authored-by: Nicholas Carriero <[email protected]>
The slurmdb_clusters class's db_conn attribute is declared and used in
get(), but it was not initialized with a proper connection. Also close
the connection and free the allocation in __dealloc__().

Co-authored-by: Nicholas Carriero <[email protected]>
Starting from Slurm 20.11, slurm_init() must be called prior to any
other Slurm library API call. For the moment, it loads the Slurm
configuration structure. For reference:

SchedMD/slurm@e35a6e3

Conversely, slurm_fini() cleans up the configuration data structures in
memory.

Co-authored-by: Nicholas Carriero <[email protected]>
This way, PySlurm consumers do not have to do it explicitly.
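
With this change, a minimal consumer sketch no longer needs the explicit initialization call:

  import pyslurm   # slurm_init() is now invoked automatically at module load

  # Any API call can follow directly (sketch).
  print(pyslurm.slurm_api_version())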
@rezib rezib (Contributor, Author) commented Feb 25, 2022

Good catches @njcarriero! Thank you for reporting them.

I integrated your patches into the PR. I deliberately chose to hard-code persist_conn_flags to NULL, as it is not really used in PySlurm. Please let me know if you see anything wrong with this choice.

@njcarriero: I set you as co-author of the commits; please tell me if you disagree.

I added a call to slurm_init() at pyslurm module load. This is quite opinionated, so I would be pleased to get feedback on this proposal.

I was hopeful, but unfortunately these new updates do not fix the segfault in slurm_load_slurmd_status() :)

Also, I'm not sure whether we need to handle the call to slurm_fini() at the PySlurm level, as I guess PySlurm consumers need the Slurm configuration structures to remain loaded for their whole lifetime, don't they?

Please share your thoughts!

@wpoely86 wpoely86 (Contributor) commented

I've tried this PR and it builds fine but:

$ python -c 'import pyslurm'
python: Unknown host: Unknown error 1284856832

@@ -2,7 +2,7 @@ version: "3.8"
 
 services:
   slurm:
-    image: giovtorres/docker-centos7-slurm:20.11.8
+    image: giovtorres/docker-centos7-slurm:21.08.0
@giovtorres giovtorres (Member) commented on the diff

@rezib I pushed a 21.08.6 version you can retry with.

@rezib rezib (Contributor, Author) replied

Thank you @giovtorres 👍

@rezib rezib (Contributor, Author) commented Feb 25, 2022

I bumped to the latest docker-centos7-slurm:21.08.6 but unfortunately it fails even earlier, claiming it cannot find the python3.x commands :/

@njcarriero njcarriero (Contributor) commented Feb 25, 2022

My code base (your original PR plus the changes I sent) does not have a problem with slurm_load_slurmd_status() for a node running slurmd:

$ /mnt/home/carriero/projects/slurm/envs/py3_sl21.08/bin/python
Python 3.8.12 (default, Jan 12 2022, 22:35:32) 
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyslurm as S
>>> S.slurm_init()
>>> S.slurm_load_slurmd_status()
{'worker1013': {'actual_boards': 1, 'booted': 1645549416, 'actual_cores': 14, 'actual_cpus': 28, 'actual_real_mem': 515915, 'actual_sockets': 2, 'actual_threads': 1, 'actual_tmp_disk': 1906795, 'hostname': 'worker1013', 'last_slurmctld_msg': 1645821180, 'pid': 191195, 'slurmd_debug': 3, 'slurmd_logfile': '/var/log/slurmd', 'step_list': 'NONE', 'version': '21.08.5'}}
>>> 

It does have a problem on a node that is not running slurmd. I am guessing (but haven't tested) that if you do the free only in the errCode success block, it will be happier. [[GUESSED WRONG, moving the free doesn't change the behavior.]]

As for persist_conn_flags, this parameter is labelled in slurmdb.h as "OUT", so lacking firm documentation to the contrary, it felt safer to hand it an actual address. However, after browsing the Slurm source code a bit more, I see invocations that do use NULL, so that seems like it should be OK.

26 Feb update: It looks like the problem arises from a collision between libslurm's error() function and error(3).
In the case where there is no slurmd running, a function invoked by slurm_load_slurmd_status wants to report an error and ends up using the wrong error() function. If you force the issue with LD_PRELOAD=.../libslurm.so, there is no problem running the test code on a machine that is not running slurmd: you get the error message and the code doesn't abort. LD_PRELOAD, of course, is a diagnostic, not a real fix. I don't know enough about Cython's name resolution to hazard a guess as to what a good fix would be.

@giovtorres giovtorres (Member) commented

> I bumped to the latest docker-centos7-slurm:21.08.6 but unfortunately it fails even earlier, claiming it cannot find the python3.x commands :/

Looks like I missed something when converting to pyenv to manage the Python installations. Could you try applying this change and see if it helps: https://github.com/PySlurm/pyslurm/pull/232/files

@rezib rezib (Contributor, Author) commented Feb 27, 2022

> I bumped to the latest docker-centos7-slurm:21.08.6 but unfortunately it fails even earlier, claiming it cannot find the python3.x commands :/

> Looks like I missed something when converting to pyenv to manage the Python installations. Could you try applying this change and see if it helps: https://github.com/PySlurm/pyslurm/pull/232/files

Thanks @giovtorres 👍 The patch is applied; we now have real test failures!

  • test_slurm_load_slurmd_status() fails with KeyError 'slurmctl'
  • test_node_update() fails with ValueError: ('Invalid node name specified', 2018)
  • test_slurmdb_jobs_get_steps() fails with AssertionError(2==3)

At least it looks like something changed regarding node name management!

@giovtorres giovtorres (Member) commented

@rezib I took a look and I think we can ignore those test failures. I've been having a hard time with these containers after Slurm made a change to the FastSchedule=1 setting when enabling the front end at compile time. Since then, I've had some issues with trying to run all nodes in the same container.

I'm ok to ship these changes, given that they have been tested by you and @njcarriero on real hardware.

Is there anything else needed before merging?

@rezib rezib (Contributor, Author) commented Feb 28, 2022

> Is there anything else needed before merging?

No, personally I'm fine with the PR in its current state!

@njcarriero njcarriero (Contributor) commented

One final note about slurm_load_slurmd_status: it looks like one can specify a dlopen flag to alter the function lookup:

diff --git a/pyslurm/__init__.py b/pyslurm/__init__.py
index 177bf7c..e0c2269 100644
--- a/pyslurm/__init__.py
+++ b/pyslurm/__init__.py
@@ -11,7 +11,10 @@ from __future__ import absolute_import
 import ctypes
 import sys
 
-sys.setdlopenflags(sys.getdlopenflags() | ctypes.RTLD_GLOBAL)
+# From dlfcn.h
+RTLD_DEEPBIND = 8
+
+sys.setdlopenflags(sys.getdlopenflags() | ctypes.RTLD_GLOBAL | RTLD_DEEPBIND)
 
 from .pyslurm import *
 from .__version__ import __version__

With this change, no LD_PRELOAD is needed: on a node not running slurmd, the invocation fails with a clean error message instead of aborting.

I'm not really sure that this is the "right" way to solve this problem (having to define the constant by hand is a hint that it is not), but it is certainly better than LD_PRELOAD.
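
One possible way to avoid hard-coding the value would be to rely on the flag that CPython itself exposes on glibc platforms (a hedged suggestion, untested in the PySlurm CI environment):

  import ctypes
  import os
  import sys

  # os.RTLD_DEEPBIND is only defined where the platform supports it,
  # hence the getattr() fallback to 0.
  sys.setdlopenflags(sys.getdlopenflags()
                     | ctypes.RTLD_GLOBAL
                     | getattr(os, "RTLD_DEEPBIND", 0))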

At any rate, we don't need this function for our purposes, so I'm happy with things as they are.

I will note that my design choice would be to document the need for the user to call slurm_init() rather than doing it automatically; since it takes an optional argument, it is kind of up to the user whether to supply one. This also aligns better with the C API.

But either way, your work has been very useful. Thanks again!

@gingergeeks gingergeeks (Member) commented

Nice work, guys! Now if only I could get my environment to work cleanly.

@giovtorres giovtorres (Member) commented

@njcarriero Thank you for doing that research. Let's merge this and if you'd like, submit those changes in a separate PR.

@giovtorres giovtorres merged commit 8febf21 into PySlurm:main Mar 1, 2022
@rezib rezib (Contributor, Author) commented Mar 1, 2022

Thanks for merging it!

@rezib rezib deleted the 21.08.0 branch March 1, 2022 15:47
Successfully merging this pull request may close these issues: support for slurm 21.8.5