
Revise symlink strategy for drivers #226

@casparvl

Description


I was confused when I saw two directories under host_injections containing symlinks to the driver libraries:

$ ls -al /cvmfs/software.eessi.io/host_injections/2023.06/compat/linux/x86_64/lib/
total 23
drwxr-xr-x 2 jenkins jenkins 4096 Aug 27 13:59 .
drwxr-xr-x 3 jenkins jenkins 4096 Aug 27 13:59 ..
-rw-r--r-- 1 jenkins jenkins    4 Aug 27 13:59 cuda_version.txt
-rw-r--r-- 1 jenkins jenkins   10 Aug 27 13:59 driver_version.txt
lrwxrwxrwx 1 jenkins jenkins   16 Aug 27 13:59 libEGL.so -> /lib64/libEGL.so
lrwxrwxrwx 1 jenkins jenkins   18 Aug 27 13:59 libEGL.so.1 -> /lib64/libEGL.so.1
lrwxrwxrwx 1 jenkins jenkins   25 Aug 27 13:59 libEGL_nvidia.so.0 -> /lib64/libEGL_nvidia.so.0
...
$ ls -al /cvmfs/software.eessi.io/host_injections/nvidia/x86_64/host/
total 23
drwxr-xr-x 2 jenkins jenkins 4096 Aug 27 13:59 .
drwxr-xr-x 3 jenkins jenkins 4096 Aug 27 13:59 ..
-rw-r--r-- 1 jenkins jenkins    4 Aug 27 13:59 cuda_version.txt
-rw-r--r-- 1 jenkins jenkins   10 Aug 27 13:59 driver_version.txt
lrwxrwxrwx 1 jenkins jenkins   16 Aug 27 13:59 libEGL.so -> /lib64/libEGL.so
lrwxrwxrwx 1 jenkins jenkins   18 Aug 27 13:59 libEGL.so.1 -> /lib64/libEGL.so.1
lrwxrwxrwx 1 jenkins jenkins   25 Aug 27 13:59 libEGL_nvidia.so.0 -> /lib64/libEGL_nvidia.so.0

Currently, the runtime linker /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/lib64/ld-linux-x86-64.so.2 has /cvmfs/software.eessi.io/host_injections/2023.06/compat/linux/x86_64/lib/ as an additional search path. However, the Lmod SitePackage.lua searches the other prefix:

    if checkGpu and (overrideGpuCheck == nil) then
        local arch = os.getenv("EESSI_CPU_FAMILY") or ""
        local cvmfs_repo = os.getenv("EESSI_CVMFS_REPO") or ""
        local cudaVersionFile = cvmfs_repo .. "/host_injections/nvidia/" .. arch .. "/latest/cuda_version.txt"
        local cudaDriverFile = cvmfs_repo .. "/host_injections/nvidia/" .. arch .. "/latest/libcuda.so"
        local cudaDriverExists = isFile(cudaDriverFile)
        local singularityCudaExists = isFile("/.singularity.d/libs/libcuda.so")
        if not (cudaDriverExists or singularityCudaExists)  then
            local advice = "which relies on the CUDA runtime environment and driver libraries. "
            advice = advice .. "In order to be able to use the module, you will need "
            advice = advice .. "to make sure EESSI can find the GPU driver libraries on your host system. You can "
            advice = advice .. "override this check by setting the environment variable EESSI_OVERRIDE_GPU_CHECK but "
            advice = advice .. "the loaded application will not be able to execute on your system.\n"
            advice = advice .. refer_to_docs
            LmodError("\nYou requested to load ", simpleName, " ", advice)
        else
            -- CUDA driver exists, now we check its version to see if an update is needed
            if cudaDriverExists then
                local cudaVersion = read_file(cudaVersionFile)
                local cudaVersion_req = os.getenv("EESSICUDAVERSION")
                -- driver CUDA versions don't give a patch version for CUDA
                local major, minor = string.match(cudaVersion, "(%d+)%.(%d+)")

So both layouts are out there, in the wild, being used, and we thus can't change this for 2023.06.

After discussing with Alan, it seems there were some historical reasons for this, but we might still have a very small window to reconsider for 2025.06. One reason to do so is that the link within the EESSI-version-specific directory is annoying: it requires admins at sites that make EESSI available to rerun the script that creates these symlinks for every EESSI version we release. I'd be happy if I could get sysadmins to do this once, but it will be challenging if users have to chase their sysadmins over and over again. Fundamentally, the symlinks aren't specific to the EESSI version. One exception might be when CUDA compatibility libraries are used, but that's the exception, not the rule.

The new proposal is to make

/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/lib_host

a variant symlink. By default, it could point to

/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/latest

The driver symlink script link_nvidia_host_libraries.sh will then be adapted to install into /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/lib_host, which resolves to that default directory. If all EESSI versions point their lib_host there by default, admins don't need to rerun link_nvidia_host_libraries.sh when a new EESSI version is released.
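The effect of sharing one target directory across EESSI versions can be illustrated with plain symlinks in a temporary directory (a hypothetical sketch: the real lib_host would be a CVMFS variant symlink, and the driver version shown is made up):

```shell
#!/bin/sh
# Sketch: two EESSI versions whose lib_host links resolve to one shared
# host_injections directory (paths and driver version are illustrative).
tmp=$(mktemp -d)
shared="$tmp/host_injections/nvidia/x86_64/latest"
mkdir -p "$shared"
echo "550.54.14" > "$shared/driver_version.txt"

# Each compat layer gets a lib_host symlink to the same shared directory:
for v in 2023.06 2025.06; do
  mkdir -p "$tmp/versions/$v/compat/linux/x86_64"
  ln -s "$shared" "$tmp/versions/$v/compat/linux/x86_64/lib_host"
done

# Both versions now see the driver files without rerunning the link script:
v1=$(cat "$tmp/versions/2023.06/compat/linux/x86_64/lib_host/driver_version.txt")
v2=$(cat "$tmp/versions/2025.06/compat/linux/x86_64/lib_host/driver_version.txt")
echo "$v1 $v2"
```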

At the same time, it still gives sysadmins the possibility to point the symlink to a different location if that's needed on their system (e.g. because they need to provide CUDA compatibility libraries for 2025.06, but not for 2023.06). Or they could even point it somewhere where it would never pick anything up (/dev/null?) if they want to disable this mechanism altogether.
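For reference, this is the same mechanism that already makes /cvmfs/software.eessi.io/host_injections itself configurable per client. A hedged sketch of what the server-side link and a client-side override could look like (the variable name EESSI_COMPAT_HOST_LIBS is made up for illustration; the `$(VAR:-default)` syntax follows CernVM-FS variant symlinks):

```shell
# On the Stratum 0, inside a transaction (sketch, not the actual publish recipe):
ln -s '$(EESSI_COMPAT_HOST_LIBS:-/cvmfs/software.eessi.io/host_injections/nvidia/x86_64/latest)' \
      /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/lib_host

# On a client that needs a site-specific location, e.g. in
# /etc/cvmfs/domain.d/eessi.io.local:
EESSI_COMPAT_HOST_LIBS=/opt/site/eessi/cuda-compat
```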

Some things to still think about:

  • Can we modify the current link_nvidia_host_libraries.sh to work both for 2023.06 (old behaviour) and 2025.06 (new behaviour)?
  • How do we modify `create_sitepackage.py` in a way that accounts for this difference? There's only one such script; I guess we need to give it EESSI-version-specific behaviour.
  • Do we need to consider other host libraries? Pointing /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/lib_host by default to /cvmfs/software.eessi.io/host_injections/nvidia/x86_64/latest makes sense for the NVIDIA libraries, but are there other host libraries for which it makes less sense? Maybe we should just make it point to something generic like /cvmfs/software.eessi.io/host_injections/x86_64/lib_host, and accept that sysadmins at least have to run link_nvidia_host_libraries.sh once more (to install there)? (I think we use a different method for MPI injection, right? Because lib_host will have the lowest search priority...)
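For the first open question, the script could simply branch on the EESSI version it is serving. A minimal sketch, where the environment variable, the default, and the path layout are assumptions for illustration rather than the script's actual interface:

```shell
#!/bin/sh
# Hypothetical dispatch inside link_nvidia_host_libraries.sh: keep the old
# per-version target for 2023.06, use the shared lib_host target afterwards.
EESSI_VERSION="${EESSI_VERSION:-2025.06}"
repo="/cvmfs/software.eessi.io"
arch="x86_64"

case "$EESSI_VERSION" in
  2023.06)
    # Old behaviour: per-EESSI-version directory searched by the runtime linker
    target="$repo/host_injections/$EESSI_VERSION/compat/linux/$arch/lib"
    ;;
  *)
    # New behaviour: one shared directory that every lib_host points to
    target="$repo/host_injections/nvidia/$arch/latest"
    ;;
esac
echo "$target"
```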
