I was confused when I saw two dirs under host_injections with symlinks to the drivers:
```console
$ ls -al /cvmfs/software.eessi.io/host_injections/2023.06/compat/linux/x86_64/lib/
total 23
drwxr-xr-x 2 jenkins jenkins 4096 Aug 27 13:59 .
drwxr-xr-x 3 jenkins jenkins 4096 Aug 27 13:59 ..
-rw-r--r-- 1 jenkins jenkins    4 Aug 27 13:59 cuda_version.txt
-rw-r--r-- 1 jenkins jenkins   10 Aug 27 13:59 driver_version.txt
lrwxrwxrwx 1 jenkins jenkins   16 Aug 27 13:59 libEGL.so -> /lib64/libEGL.so
lrwxrwxrwx 1 jenkins jenkins   18 Aug 27 13:59 libEGL.so.1 -> /lib64/libEGL.so.1
lrwxrwxrwx 1 jenkins jenkins   25 Aug 27 13:59 libEGL_nvidia.so.0 -> /lib64/libEGL_nvidia.so.0
...
$ ls -al /cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/
total 23
drwxr-xr-x 2 jenkins jenkins 4096 Aug 27 13:59 .
drwxr-xr-x 3 jenkins jenkins 4096 Aug 27 13:59 ..
-rw-r--r-- 1 jenkins jenkins    4 Aug 27 13:59 cuda_version.txt
-rw-r--r-- 1 jenkins jenkins   10 Aug 27 13:59 driver_version.txt
lrwxrwxrwx 1 jenkins jenkins   16 Aug 27 13:59 libEGL.so -> /lib64/libEGL.so
lrwxrwxrwx 1 jenkins jenkins   18 Aug 27 13:59 libEGL.so.1 -> /lib64/libEGL.so.1
lrwxrwxrwx 1 jenkins jenkins   25 Aug 27 13:59 libEGL_nvidia.so.0 -> /lib64/libEGL_nvidia.so.0
```
Currently, the runtime linker /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/lib64/ld-linux-x86-64.so.2 has /cvmfs/software.eessi.io/host_injections/2023.06/compat/linux/x86_64/lib/ as an additional search path. However, the Lmod SitePackage.lua searches the other prefix:
```lua
if checkGpu and (overrideGpuCheck == nil) then
    local arch = os.getenv("EESSI_CPU_FAMILY") or ""
    local cvmfs_repo = os.getenv("EESSI_CVMFS_REPO") or ""
    local cudaVersionFile = cvmfs_repo .. "/host_injections/nvidia/" .. arch .. "/latest/cuda_version.txt"
    local cudaDriverFile = cvmfs_repo .. "/host_injections/nvidia/" .. arch .. "/latest/libcuda.so"
    local cudaDriverExists = isFile(cudaDriverFile)
    local singularityCudaExists = isFile("/.singularity.d/libs/libcuda.so")
    if not (cudaDriverExists or singularityCudaExists) then
        local advice = "which relies on the CUDA runtime environment and driver libraries. "
        advice = advice .. "In order to be able to use the module, you will need "
        advice = advice .. "to make sure EESSI can find the GPU driver libraries on your host system. You can "
        advice = advice .. "override this check by setting the environment variable EESSI_OVERRIDE_GPU_CHECK but "
        advice = advice .. "the loaded application will not be able to execute on your system.\n"
        advice = advice .. refer_to_docs
        LmodError("\nYou requested to load ", simpleName, " ", advice)
    else
        -- CUDA driver exists, now we check its version to see if an update is needed
        if cudaDriverExists then
            local cudaVersion = read_file(cudaVersionFile)
            local cudaVersion_req = os.getenv("EESSICUDAVERSION")
            -- driver CUDA versions don't give a patch version for CUDA
            local major, minor = string.match(cudaVersion, "(%d+)%.(%d+)")
            -- ...
```
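For reference, a quick way to see which search path is actually baked into the compat-layer linker (a sketch; glibc's dynamic linker only accepts `--help` in recent versions, which the 2023.06 compat layer should have):

```sh
# Print the library search path compiled into the EESSI compat-layer dynamic
# linker; the host_injections compat dir should show up as a search path.
/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/lib64/ld-linux-x86-64.so.2 --help \
    | grep -A 10 'search path'
```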
So both locations are out there, in the wild, being used, and we thus can't change this for 2023.06.
After discussing with Alan, it seems there were some historical reasons, but we might still have a very small window to reconsider for 2025.06. One of the reasons is that the link within the EESSI-version-specific directory is annoying: it requires admins at sites that make EESSI available to rerun the script that creates these symlinks for every EESSI version we release. I'd be happy if I could get sysadmins to do this once, but it'll be challenging if users have to chase their sysadmins over and over again. Fundamentally, the symlinks aren't specific to the EESSI version. One exception might be if CUDA compatibility libraries are used, but that's the exception, not the rule.
The new proposal is to make `/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/lib_host` a variant symlink. By default, it could point to `/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/latest`.
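For illustration, a minimal sketch of how such a variant symlink could be created on the Stratum 0, assuming a hypothetical client-configurable variable `EESSI_LIB_HOST` and the same default-value syntax CVMFS already uses for the `host_injections` variant symlink:

```sh
# On the Stratum 0 (sketch; EESSI_LIB_HOST is a hypothetical variable name,
# and the per-arch handling is elided):
cvmfs_server transaction software.eessi.io
ln -s '$(EESSI_LIB_HOST:-/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/latest)' \
    /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/lib_host
cvmfs_server publish software.eessi.io
```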
The driver symlink script `link_nvidia_host_libraries.sh` will then be adapted to install into `/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/lib_host`, which resolves to that default dir. If all EESSI versions point their `lib_host` there by default, admins don't need to rerun `link_nvidia_host_libraries.sh` when a new EESSI version is released.
At the same time, it still gives sysadmins the possibility to point the symlinks to different locations, if that's needed on their system (e.g. because they need to provide CUDA compatibility libraries for 2025.06, but not for 2023.06). Or, they could even point it somewhere where it would never pick anything up (/dev/null?) if they want to disable this mechanism altogether.
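Concretely, a site could override the (hypothetical) variable in its CVMFS client configuration, e.g. in `/etc/cvmfs/domain.d/eessi.io.local`:

```sh
# Point lib_host at a site-specific location...
EESSI_LIB_HOST=/shared_fs/eessi/nvidia/x86_64/latest
# ...or at a dead end, to disable the mechanism altogether:
# EESSI_LIB_HOST=/dev/null
```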
Some things still to think about:
- Can we modify the current `link_nvidia_host_libraries.sh` to work both for `2023.06` (old behaviour) and `2025.06` (new behaviour)? (See the sketch after this list.)
- How do we modify `create_sitepackage.py` in a way that accounts for this difference? There's only one such script, so I guess we need to give it EESSI-version-specific behaviour.
- Do we need to consider other host libraries? For example, pointing `/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/lib_host` by default to `/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/latest` makes sense for the NVIDIA stuff, but are there other host libs for which it makes less sense? Maybe we should just make it point to `/cvmfs/software.eessi.io/host_injections/x86_64/lib_host` or something generic like that, and then accept that sysadmins at least have to run `link_nvidia_host_libraries.sh` once more (to install there)? (I think for MPI injection, we use a different method, right? Because `lib_host` will have the lowest search priority...)
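As for the first question, a rough sketch of how `link_nvidia_host_libraries.sh` could branch on the EESSI version (hypothetical; `link_drivers_into` stands in for the existing symlink-creation logic, and the exact paths would need checking against the real script):

```sh
#!/bin/bash
# Hypothetical version-specific behaviour for link_nvidia_host_libraries.sh.
# The version-independent location is always populated; the per-version
# compat-layer location is only populated for pre-2025.06 releases.

link_drivers_into() {
    # stand-in for the existing logic that symlinks the host driver
    # libraries into the given directory
    echo "would create driver symlinks in: $1"
}

# version-independent location (new behaviour, reachable via lib_host)
link_drivers_into "${EESSI_CVMFS_REPO}/host_injections/nvidia/${EESSI_CPU_FAMILY}/host"

case "${EESSI_VERSION}" in
    2023.06)
        # old behaviour: also populate the EESSI-version specific compat dir
        link_drivers_into "${EESSI_CVMFS_REPO}/host_injections/${EESSI_VERSION}/compat/linux/${EESSI_CPU_FAMILY}/lib"
        ;;
    *)
        # 2025.06 and later: nothing extra to do, the lib_host variant
        # symlink already points at the version-independent location
        ;;
esac
```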