Description
With help from @casparvl, I've added the following to /project/def-users/bot/shared/host-injections/2023.06/.lmod/SitePackage.lua on our AWS build cluster, which will be picked up by the bot for builds relying on libfabric:
require("strict")
local hook = require("Hook")

-- LmodMessage("Load bot-specific SitePackage.lua")

local function eessi_bot_libfabric_set_psm3_devices_hook(t)
    local simpleName = string.match(t.modFullName, "(.-)/")
    -- we may want to be more specific in the future, and only do this for specific versions of libfabric
    if simpleName == 'libfabric' then
        -- set environment variable PSM3_DEVICES as workaround for MPI applications hanging in libfabric's PSM3 provider
        -- cf. https://github.com/easybuilders/easybuild-easyconfigs/issues/18925
        setenv('PSM3_DEVICES', 'self,shm')
    end
end

-- combine all load hook functions into a single one
function site_specific_load_hook(t)
    eessi_bot_libfabric_set_psm3_devices_hook(t)
end

local function combined_load_hook(t)
    -- Assuming this was called from EESSI's SitePackage.lua, this should be defined and thus run
    if eessi_load_hook ~= nil then
        eessi_load_hook(t)
    end
    site_specific_load_hook(t)
end

hook.register("load", combined_load_hook)
This solves the Haswell OpenMPI issues that we observed in several PRs. I was going to make a PR for it, but I have some doubts on how this should be done:
- does it have to be restricted to Haswell (we also saw some hangs with other architectures, but it's not entirely clear if they were caused by the same issue)?
- does it have to be restricted to certain versions of libfabric?
- do we also need this for the tests? Answer from @casparvl: yes, might be needed.
- which script should make sure that this SitePackage.lua is picked up / copied to the right location? bot/build.sh, EESSI-install-software.sh, eessi_container.sh, ...?
- what if a PR wants to update SitePackage.lua, should it already pick up the new version? If so, we should probably prevent it from being copied to the shared directory already, otherwise other builds will also pick it up before it's merged.
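If we do decide to restrict the workaround by CPU target and/or libfabric version, the check could be factored out roughly as sketched below. This is only a sketch under assumptions, not a tested change: needs_psm3_workaround, affected_versions and affected_cpu_suffix are hypothetical names, the version filter is deliberately left unset because we don't yet know which versions are affected, and it assumes the CPU target can be read from the last component of a string like x86_64/intel/haswell (for example the value of EESSI_SOFTWARE_SUBDIR, if that is set when the hook runs).

```lua
-- Sketch only: decide whether the PSM3_DEVICES workaround should apply.
-- All names below are hypothetical; they are not part of Lmod or EESSI.

-- Placeholder filters; nil means "do not restrict on this criterion".
local affected_versions = nil           -- e.g. a table like { ["X.Y.Z"] = true }
local affected_cpu_suffix = "haswell"   -- assumption: only Haswell is affected

local function needs_psm3_workaround(modFullName, software_subdir)
    -- split "name/version" at the first slash
    local name, version = string.match(modFullName, "^(.-)/(.+)$")
    if name ~= "libfabric" then
        return false
    end
    -- optional version filter
    if affected_versions ~= nil and not affected_versions[version] then
        return false
    end
    -- optional CPU filter
    if affected_cpu_suffix ~= nil then
        -- assumption: software_subdir looks like "x86_64/intel/haswell"
        local cpu = string.match(software_subdir or "", "([^/]+)$")
        return cpu == affected_cpu_suffix
    end
    return true
end

-- Inside the load hook this would be used roughly as:
--   if needs_psm3_workaround(t.modFullName, os.getenv("EESSI_SOFTWARE_SUBDIR")) then
--       setenv("PSM3_DEVICES", "self,shm")
--   end
```

Setting affected_cpu_suffix to nil would give the current behaviour (all architectures), which may be the safer default as long as it's unclear whether the hangs on other architectures have the same cause.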