Bot-specific SitePackage.lua that solves libfabric issues #531

@bedroge

With help from @casparvl, I've added the following to /project/def-users/bot/shared/host-injections/2023.06/.lmod/SitePackage.lua on our AWS build cluster, which will be picked up by the bot for builds relying on libfabric:

require("strict")
local hook = require("Hook")

-- LmodMessage("Load bot-specific SitePackage.lua")

local function eessi_bot_libfabric_set_psm3_devices_hook(t)
    local simpleName = string.match(t.modFullName, "(.-)/")
    -- we may want to be more specific in the future, and only do this for specific versions of libfabric
    if simpleName == 'libfabric' then
        -- set environment variable PSM3_DEVICES as a workaround for MPI applications hanging in libfabric's PSM3 provider
        -- cf. https://github.com/easybuilders/easybuild-easyconfigs/issues/18925
        setenv('PSM3_DEVICES', 'self,shm')
    end
end

-- combine all load hook functions into a single one
function site_specific_load_hook(t)
    eessi_bot_libfabric_set_psm3_devices_hook(t)
end

local function combined_load_hook(t)
    -- Assuming this was called from EESSI's SitePackage.lua, this should be defined and thus run
    if eessi_load_hook ~= nil then
        eessi_load_hook(t)
    end
    site_specific_load_hook(t)
end

hook.register("load", combined_load_hook)

This solves the Haswell OpenMPI issues that we observed in several PRs. I was going to make a PR for it, but I have some doubts about how this should be done:

  • does it have to be restricted to Haswell (we also saw some hangs with other architectures, but it's not entirely clear if they were caused by the same issue)?
  • does it have to be restricted to certain versions of libfabric?
  • do we also need this for the tests? Answer from @casparvl: yes, might be needed.
  • which script should make sure that this SitePackage.lua is picked up / copied to the right location? bot/build.sh, EESSI-install-software.sh, eessi_container.sh, ...?
  • what if a PR wants to update SitePackage.lua: should its own build already pick up the new version? If so, we should probably avoid copying it to the shared directory at that point, otherwise other builds will pick it up before the PR is merged.
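
If we do decide to restrict the workaround by libfabric version and/or CPU target, the check could look roughly like the sketch below. This is only an illustration, not tested on the build cluster: the helper name needs_psm3_workaround is made up, the version "1.18.0" and the target "haswell" are placeholders rather than a confirmed list of affected cases, and it assumes the CPU target can be read from the last path component of a variable such as EESSI_SOFTWARE_SUBDIR (e.g. x86_64/intel/haswell).

```lua
-- Hypothetical helper (not part of the current SitePackage.lua):
-- decide whether the PSM3_DEVICES workaround should apply to a module.
--   modFullName       : e.g. "libfabric/1.18.0-GCCcore-12.3.0"
--   subdir            : assumed to look like "x86_64/intel/haswell", may be nil
--   affected_versions : set of version strings, or nil for "any version"
--   affected_archs    : set of CPU target names, or nil for "any target"
local function needs_psm3_workaround(modFullName, subdir, affected_versions, affected_archs)
    -- split "<name>/<version>-<toolchain>" into name and version
    local name, version = string.match(modFullName, "^(.-)/([^-]+)")
    if name ~= "libfabric" then
        return false
    end
    -- optional restriction on the libfabric version
    if affected_versions ~= nil and not affected_versions[version] then
        return false
    end
    -- optional restriction on the CPU target (last path component of subdir);
    -- if the target is unknown, be conservative and do not apply the workaround
    if affected_archs ~= nil then
        local arch = subdir and string.match(subdir, "([^/]+)$")
        if not (arch and affected_archs[arch]) then
            return false
        end
    end
    return true
end

-- Placeholder sets for illustration only; the real affected cases are still an open question.
local example_versions = { ["1.18.0"] = true }
local example_archs = { haswell = true }
```

Inside eessi_bot_libfabric_set_psm3_devices_hook, the `simpleName == 'libfabric'` test would then become something like `needs_psm3_workaround(t.modFullName, os.getenv("EESSI_SOFTWARE_SUBDIR"), example_versions, example_archs)`, with the setenv call left unchanged. Passing nil for either set keeps the current behaviour of applying the workaround to every libfabric module.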
