Rebuild all CUDA software with EB-5.1.1 #1147
Conversation
…y check, so we can see if anything is 'broken'. Also, there are so many 'holes' in which software is present for which combinations of CPU+GPU that this is a convenient way to fill the gaps
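The gap-filling idea described above amounts to a set difference over (CPU, accelerator) targets. A minimal sketch of that bookkeeping; the target lists and the `existing_builds` set are hypothetical examples, not the actual EESSI build matrix:

```python
from itertools import product

# Hypothetical CPU and GPU target lists (not the full EESSI matrix).
cpu_targets = ["x86_64/intel/icelake", "x86_64/amd/zen4"]
gpu_targets = ["nvidia/cc80", "nvidia/cc90"]

# Hypothetical set of (CPU, GPU) combinations that already have builds.
existing_builds = {("x86_64/intel/icelake", "nvidia/cc80")}

def missing_combinations(cpus, gpus, existing):
    """Return all (CPU, accelerator) pairs that still need a build."""
    return sorted(set(product(cpus, gpus)) - existing)

for cpu, gpu in missing_combinations(cpu_targets, gpu_targets, existing_builds):
    print(f"bot: build ... architecture:{cpu} accelerator:{gpu}")
```

Each missing pair then corresponds to one `bot: build` command like the ones issued below.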
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/intel/icelake accelerator:nvidia/cc80

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

New job on instance

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

New job on instance

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

New job on instance

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

New job on instance
Hmmm, CUDA builds fail with: Those are the files that are symlinked from host-injections, probably (at least
Ah, found the issue: Note that in the host-injections dir, the whole but
https://github.com/EESSI/software-layer-scripts/blob/41f3775bfe214ecc51af2ea88f914d93414ed87b/eb_hooks.py#L1310 this is the line where it happens. Might actually be an issue with the setting of the It seems strange that both are identical, I think the code expected
Apparently, that is totally expected. Here, it's essentially set to the same value by https://github.com/EESSI/software-layer-scripts/blob/41f3775bfe214ecc51af2ea88f914d93414ed87b/init/modules/EESSI/2023.06.lua#L77 and https://github.com/EESSI/software-layer-scripts/blob/41f3775bfe214ecc51af2ea88f914d93414ed87b/init/modules/EESSI/2023.06.lua#L157
I think the bug is here. The
@casparvl you are correct, the bot was previously setting the accelerator override in a way that did not include the
This PR is on hold until EESSI/software-layer-scripts#59 is merged |
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

New job on instance

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

New job on instance

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

New job on instance

New job on instance

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf for:arch=x86_64/amd/zen4,accel=nvidia/cc90

New job on instance
Interesting, in #1147 (comment) I'm seeing: I wonder if that's expected on zen4?
Yes, this is normal for the easyblock that is used in this PR. There is only support for zen4 since 2 Apr 2025, so the easyblock sets zen3 for older versions.
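The zen4-to-zen3 fallback described above is a version-gated mapping. A sketch of the pattern; the cutoff date and function below are illustrative, not the easyblock's actual code:

```python
from datetime import date

# Hypothetical cutoff: releases before this date have no zen4 support.
ZEN4_SUPPORT_SINCE = date(2025, 4, 2)

def effective_arch(requested_arch, release_date):
    """Fall back from zen4 to zen3 for releases that predate zen4 support."""
    if requested_arch == "zen4" and release_date < ZEN4_SUPPORT_SINCE:
        return "zen3"
    return requested_arch

# An old release falls back to zen3; a newer one keeps zen4.
old_release_arch = effective_arch("zen4", date(2024, 1, 1))
new_release_arch = effective_arch("zen4", date(2025, 5, 1))
```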
Looking in the build log, I see I guess it's that
bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf for:arch=x86_64/amd/zen4,accel=nvidia/cc90

New job on instance
I see this stack: And the
I figured before I kill it, let's attach an And the The repeatedly.
Exactly the same output. I guess it shouldn't be a surprise: whether I kill it or the OOM killer does doesn't matter much. Could it be related to trying to read this file from the overlay? I guess the overlay is one of the key differences between my
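Before killing a hung process like this, one quick Linux-only check is to read its state from /proc, to see whether it is blocked in the kernel (e.g. on I/O against the overlay) rather than spinning. A stdlib-only sketch, not something the bot itself does:

```python
import os

def proc_state(pid):
    """Return the 'State:' line from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("State:"):
                return line.split(":", 1)[1].strip()
    return None

# Inspect the current process as a demo; pass the hung PID in practice.
print(proc_state(os.getpid()))
```

A state of `D (disk sleep)` would point at the process being stuck in uninterruptible I/O.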
Let's check on a different system... bot: build repo:eessi.io-2023.06-software instance:eessi-bot-vsc-ugent for:arch=x86_64/intel/cascadelake,accel=nvidia/cc70 |
New job on instance

Check if this also happens for other arch: bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf for:arch=x86_64/intel/icelake,accel=nvidia/cc80

New job on instance
Yeah this is weird, this test run should take less than 10 seconds. |
It seems limited to the zen4-cc90 instance, because on eessi-bot-vsc-ugent the LAMMPS build took a normal amount of time. I'll check the logs of eessi-bot-surf, but since that one is also still running I think the other target at SURF is OK.
The complete sanity check on
Once
For icelake+cc80, I'm getting a hang at the end of the GROMACS test step. I could actually reproduce that interactively. First, I had an error interactively To resolve, I unset all the SLURM environment variables This makes it go further, and actually report the test results: But, it then hangs on completing the And then... nothing. It just hangs, on the same process that the build bot was hanging on:
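The "unset all the SLURM environment variables" workaround mentioned above can be done by launching the step with a filtered environment instead of mutating the shell. A sketch of that approach; the command shown is a placeholder, not the actual test step:

```python
import os
import subprocess
import sys

def run_without_slurm_env(cmd):
    """Run `cmd` with every SLURM_* variable removed from the environment."""
    env = {k: v for k, v in os.environ.items() if not k.startswith("SLURM_")}
    return subprocess.run(cmd, env=env)

# Placeholder command; in practice this would be the interactive test step.
run_without_slurm_env([sys.executable, "-c", "print('SLURM vars filtered')"])
```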
Maybe it's even related to the LAMMPS hang... both seem to use MPI through a Python interface. Maybe the MPI finalize call doesn't return correctly?
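Since a step that should finish in seconds instead hangs indefinitely, wrapping it in a timeout makes the failure visible instead of stalling a job. A generic sketch using a subprocess timeout; this is not part of the actual bot code:

```python
import subprocess
import sys

def run_with_timeout(cmd, timeout_s):
    """Run `cmd`; raises subprocess.TimeoutExpired (and kills it) if it hangs."""
    return subprocess.run(cmd, timeout=timeout_s)

# Demo: a command that "hangs" longer than the timeout.
try:
    run_with_timeout([sys.executable, "-c", "import time; time.sleep(10)"], timeout_s=1)
except subprocess.TimeoutExpired:
    print("step timed out")
```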
This seems very similar to issues that we've seen before with GROMACS and Siesta:
Since you can reproduce it interactively, maybe you can try this?
@bedroge note that I could reproduce the GROMACS hang interactively, not the LAMMPS hang... But it doesn't hurt to try it. I'm running the GROMACS test step now with What we should really do is split up this PR though, it's way too big / time consuming to tackle all of these at once...
Setting that env var seems to work for GROMACS when I run the test step interactively... I'll set it, then try a full rebuild and see if it completes successfully.
Ok, I'm going to close this, as I've split it up in
I've also implemented the suggested Lua hook from #966 (comment) in the
We may want to consider making this hook available as part of the build process. We can be ultra-conservative because we know it is a single-node build.
There are two reasons for this: