
Conversation

@casparvl (Collaborator) commented Aug 5, 2025

There are two reasons for this:

  1. Now that we have a CUDA sanity check, this allows us to see if anything is 'broken'.
  2. The PR that enables CI to check for differences between CUDA stacks (Add CUDA software check to stack comparison CI, #1087) shows there are many differences between the architectures. In fact, there are so many holes that a rebuild PR for all architectures is probably the easiest way to fill all the gaps (much easier than figuring out what's missing for each of the 37 combinations of CPU+GPU).

@casparvl (Collaborator Author) commented Aug 5, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/intel/icelake accelerator:nvidia/cc80

@casparvl (Collaborator Author) commented Aug 5, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

@eessi-bot-surf (bot) commented Aug 5, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_1147/13574555

date job status comment
Aug 05 14:33:01 UTC 2025 submitted job id 13574555 will be eligible to start in about 20 seconds
Aug 05 14:33:15 UTC 2025 received job awaits launch by Slurm scheduler
Aug 05 14:33:28 UTC 2025 running job 13574555 is running
Aug 05 14:35:12 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-13574555.out
✅ no message matching FATAL:
❌ found message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Aug 05 14:35:12 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-13574555.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl (Collaborator Author) commented Aug 5, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

@eessi-bot-surf (bot) commented Aug 5, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_1147/13575145

date job status comment
Aug 05 14:50:14 UTC 2025 submitted job id 13575145 will be eligible to start in about 20 seconds
Aug 05 14:50:19 UTC 2025 received job awaits launch by Slurm scheduler
Aug 05 14:50:43 UTC 2025 running job 13575145 is running
Aug 05 14:52:27 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-13575145.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
Aug 05 14:52:27 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-13575145.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl (Collaborator Author) commented Aug 6, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

@eessi-bot-surf (bot) commented Aug 6, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_1147/13593270

date job status comment
Aug 06 09:07:57 UTC 2025 submitted job id 13593270 will be eligible to start in about 20 seconds
Aug 06 09:08:01 UTC 2025 received job awaits launch by Slurm scheduler
Aug 06 09:08:25 UTC 2025 running job 13593270 is running
Aug 06 09:14:48 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-13593270.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-17544716010.tar.gz
size: 0 MiB (45 bytes)
entries: 0
modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
no software packages in tarball
reprod directories under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
no other files in tarball
Aug 06 09:14:48 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-13593270.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl (Collaborator Author) commented Aug 6, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

@eessi-bot-surf (bot) commented Aug 6, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_1147/13593590

date job status comment
Aug 06 09:19:43 UTC 2025 submitted job id 13593590 will be eligible to start in about 20 seconds
Aug 06 09:19:54 UTC 2025 received job awaits launch by Slurm scheduler
Aug 06 09:20:37 UTC 2025 running job 13593590 is running
Aug 06 09:28:46 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-13593590.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-17544724340.tar.gz
size: 0 MiB (4259 bytes)
entries: 0
modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
no software packages in tarball
reprod directories under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
no other files in tarball
Aug 06 09:28:46 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-13593590.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl (Collaborator Author) commented Aug 6, 2025

Hmmm, CUDA builds fail with:

== sanity checking...
  >> file 'bin/fatbinary' found: FAILED
  >> file 'bin/nvcc' found: FAILED
  >> file 'bin/nvlink' found: FAILED
  >> file 'bin/ptxas' found: FAILED
  >> file 'lib64/libcublas.so' found: OK
  >> file 'lib64/libcudart.so' found: OK
  >> file 'lib64/libcufft.so' found: OK
  >> file 'lib64/libcurand.so' found: OK
  >> file 'lib64/libcusparse.so' found: OK
  >> file 'lib/libcublas.so' found: OK
  >> file 'lib/libcudart.so' found: OK
  >> file 'lib/libcufft.so' found: OK
  >> file 'lib/libcurand.so' found: OK
  >> file 'lib/libcusparse.so' found: OK
  >> file 'extras/CUPTI/lib64/libcupti.so' found: OK
  >> file 'pkgconfig/cublas.pc' found: FAILED
  >> file 'pkgconfig/cudart.pc' found: FAILED
  >> file 'pkgconfig/cuda.pc' found: FAILED
  >> (non-empty) directory 'include' found: OK
  >> (non-empty) directory 'extras/CUPTI/include' found: OK
  >> loading modules: CUDA/12.1.1...

Those are the files that are symlinked from host-injections, probably (at least bin/nvcc is for sure). I guess the symlinks are broken for some reason?
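A quick way to test that hypothesis would be to scan the install prefix for symlinks whose targets no longer resolve. This is a minimal, generic sketch (not part of the EESSI tooling; the function name and usage path are illustrative):

```python
import os

def find_dangling_symlinks(root):
    """Return all symlinks under root whose target does not resolve."""
    dangling = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # os.path.exists() follows the link, so it returns False
            # for a symlink pointing at a missing target
            if os.path.islink(path) and not os.path.exists(path):
                dangling.append(path)
    return dangling

# usage (path illustrative):
# find_dangling_symlinks('/cvmfs/software.eessi.io/versions/2023.06/software/'
#                        'linux/x86_64/amd/zen4/accel/nvidia/cc90/software/CUDA/12.1.1')
```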

@casparvl (Collaborator Author) commented Aug 6, 2025

Ah, found the issue:

== 2025-08-06 11:23:55,548 eb_hooks.py:1301 DEBUG nvcc is not found in allowlist, so replacing it with symlink: /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/CUDA/12.1.1/bin/nvcc
== 2025-08-06 11:23:55,550 filetools.py:358 INFO Symlinked /cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/CUDA/12.1.1/bin/nvcc to /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/CUDA/12.1.1/bin/nvcc

Note that in the host-injections dir, the whole accel/nvidia/cc90 path should be stripped. I.e. it should not symlink

/cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/CUDA/12.1.1/bin/nvcc

but

/cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/amd/zen4/software/CUDA/12.1.1/bin/nvcc
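The intended mapping could be sketched like this (illustrative Python, not the actual eb_hooks.py code; the function name and the string-replace approach are my own):

```python
def host_injections_symlink_source(installdir, accel_subdir):
    """Map a CUDA install path under versions/ to its host_injections
    counterpart, dropping the accelerator subdirectory (per the expected
    layout described above)."""
    # switch trees: versions/ -> host_injections/
    path = installdir.replace('/versions/', '/host_injections/', 1)
    # strip the accelerator component, e.g. 'accel/nvidia/cc90'
    return path.replace('/' + accel_subdir, '', 1)
```

With accel_subdir set to accel/nvidia/cc90, the nvcc path under versions/ then maps to the host_injections path without the accel/nvidia/cc90 component.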

@casparvl (Collaborator Author) commented Aug 6, 2025

https://github.com/EESSI/software-layer-scripts/blob/41f3775bfe214ecc51af2ea88f914d93414ed87b/eb_hooks.py#L1310 this is the line where it happens. Might actually be an issue with the setting of the EESSI_ACCELERATOR_TARGET. I'm not 100% sure what kind of value is expected there, but looking on our GPU nodes:

EESSI_ACCEL_SUBDIR=accel/nvidia/cc80
EESSI_ACCELERATOR_TARGET=accel/nvidia/cc80

It seems strange that both are identical; I think the code expected nvidia/cc80 instead. I'll need to figure out where this gets set, and if it changed recently.

@casparvl (Collaborator Author) commented Aug 6, 2025

@casparvl (Collaborator Author) commented Aug 6, 2025

I think the bug is here. The "/accel/%s" % accel_subdir will essentially create e.g. /accel/accel/nvidia/cc80, since accel_subdir is already something like accel/nvidia/cc80 (i.e. equal to EESSI_ACCELERATOR_TARGET).
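A minimal reproduction of the suspected double prefix (illustrative snippet, not the actual hook code):

```python
# Observed on the GPU nodes: the value already carries the leading
# 'accel/' component.
accel_subdir = "accel/nvidia/cc80"  # i.e. EESSI_ACCELERATOR_TARGET

# Prefixing '/accel/' once more doubles the directory:
buggy = "/accel/%s" % accel_subdir
assert buggy == "/accel/accel/nvidia/cc80"

# Either the variable should hold 'nvidia/cc80', or the format string
# should not re-add the prefix:
fixed = "/%s" % accel_subdir
assert fixed == "/accel/nvidia/cc80"
```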

@ocaisa (Member) commented Aug 6, 2025

@casparvl you are correct: the bot was previously setting the accelerator override in a way that did not include the accel/ prefix (but archdetect does include this top-level directory). It worked because the incorrect value was used consistently. I thought I fixed it everywhere, but I clearly missed this one.

casparvl marked this pull request as draft on August 12, 2025, 15:06
@casparvl (Collaborator Author):

This PR is on hold until EESSI/software-layer-scripts#59 is merged

@casparvl (Collaborator Author):

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

@eessi-bot-surf (bot) commented Aug 20, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_1147/14180522

date job status comment
Aug 20 19:19:12 UTC 2025 submitted job id 14180522 will be eligible to start in about 20 seconds
Aug 20 19:19:22 UTC 2025 received job awaits launch by Slurm scheduler
Aug 20 19:19:45 UTC 2025 running job 14180522 is running
Aug 20 19:20:08 UTC 2025 finished
🤷 UNKNOWN (click triangle for detailed information)
  • Did not find bot/check-result.sh script in job's work directory.
  • Check job manually or ask an admin of the bot instance to assist you.
Aug 20 19:20:08 UTC 2025 test result
🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job14180522.test does not exist in job directory, or parsing it failed.

@casparvl (Collaborator Author):

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

@eessi-bot-surf (bot) commented Aug 21, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_1147/14191183

date job status comment
Aug 21 09:50:40 UTC 2025 submitted job id 14191183 will be eligible to start in about 20 seconds
Aug 21 09:50:53 UTC 2025 received job awaits launch by Slurm scheduler
Aug 21 09:51:06 UTC 2025 running job 14191183 is running
Aug 21 10:09:29 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-14191183.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-accel-nvidia-cc90-17557703140.tar.gz
size: 4373 MiB (4585814055 bytes)
entries: 11839
modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
CUDA/12.1.1.lua
CUDA/12.4.0.lua
software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
CUDA/12.1.1
CUDA/12.4.0
reprod directories under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
2023.06/software/linux/x86_64/amd/zen4/.lmod/SitePackage.lua
Aug 21 10:09:29 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-14191183.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl (Collaborator Author):

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf architecture:x86_64/amd/zen4 accelerator:nvidia/cc90

@eessi-bot-surf (bot) commented Aug 21, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_1147/14192402

date job status comment
Aug 21 11:12:00 UTC 2025 submitted job id 14192402 will be eligible to start in about 20 seconds
Aug 21 11:12:12 UTC 2025 received job awaits launch by Slurm scheduler
Aug 21 11:12:46 UTC 2025 running job 14192402 is running
Aug 21 11:15:27 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-14192402.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-accel-nvidia-cc90-17557748680.tar.gz
size: 0 MiB (4259 bytes)
entries: 1
modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
no software packages in tarball
reprod directories under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
2023.06/software/linux/x86_64/amd/zen4/.lmod/SitePackage.lua
Aug 21 11:15:27 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-14192402.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@eessi-bot-surf (bot) commented Aug 25, 2025

New job on instance eessi-bot-surf for repository eessi.io-2023.06-software
Building on: amd-zen4 and accelerator nvidia/cc90
Building for: x86_64/amd/zen4 and accelerator nvidia/cc90
Job dir: /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_1147/14291236

date job status comment
Aug 25 19:05:52 UTC 2025 submitted job id 14291236 will be eligible to start in about 20 seconds
Aug 25 19:06:01 UTC 2025 received job awaits launch by Slurm scheduler
Aug 25 19:07:05 UTC 2025 running job 14291236 is running
Aug 25 19:09:22 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-14291236.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-accel-nvidia-cc90-17561489050.tar.gz
size: 0 MiB (45 bytes)
entries: 0
modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
no software packages in tarball
reprod directories under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
no other files in tarball
Aug 25 19:09:22 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-14291236.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl (Collaborator Author):

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf for:arch=x86_64/amd/zen4,accel=nvidia/cc90

@eessi-bot-surf (bot) commented Aug 26, 2025

New job on instance eessi-bot-surf for repository eessi.io-2023.06-software
Building on: amd-zen4 and accelerator nvidia/cc90
Building for: x86_64/amd/zen4 and accelerator nvidia/cc90
Job dir: /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_1147/14312163

date job status comment
Aug 26 15:07:32 UTC 2025 submitted job id 14312163 will be eligible to start in about 20 seconds
Aug 26 15:07:44 UTC 2025 received job awaits launch by Slurm scheduler
Aug 26 15:08:08 UTC 2025 running job 14312163 is running
Aug 26 18:39:29 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-14312163.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-accel-nvidia-cc90-17562335080.tar.gz
size: 0 MiB (45 bytes)
entries: 0
modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
no software packages in tarball
reprod directories under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
no other files in tarball
Aug 26 18:39:29 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-14312163.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl (Collaborator Author):

Interesting, in #1147 (comment) I'm seeing:

 Using Kokkos package with arch: CPU - ZEN3, GPU - HOPPER90

I wonder if that's expected on ZEN4?

@laraPPr (Collaborator) commented Aug 26, 2025

Interesting, in #1147 (comment) I'm seeing:


 Using Kokkos package with arch: CPU - ZEN3, GPU - HOPPER90

I wonder if that's expected on ZEN4?

Yes, this is normal for the easyblock that is used in this PR. There is only support for Zen4 since the 2Apr2025 version, so the easyblock sets ZEN3 for older versions.

@casparvl (Collaborator Author):

== Using Kokkos package with arch: CPU - ZEN3, GPU - HOPPER90
/gpfs/work1/1/eessibot/eessi-bot-surf/jobs/2025.08/pr_1147/event_61aa4100-828e-11f0-84ea-d8b59fe75d68/run_000/x86_64/amd/zen4/nvidia/cc90/eessi.io-2023.06-software/software-layer-scripts/EESSI-install-software.sh: line 380:  2047 Killed                  ${EB} --easystack ${easystack_file} --rebuild
...
[2025-08-26T20:39:21.970] error: Detected 1 oom_kill event in StepId=14312163.batch. Some of the step tasks have been OOM Killed.

Looking in the build log, I see

== 2025-08-26 17:40:20,208 run.py:502 INFO Running shell command 'cd /tmp/eb-6sujs4d9/eb-llfc02uv/tmp2292l8zz && mpirun -n 1 python -c 'from lammps import lammps; l=lammps(); l.file("/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1/examples/atm/in.atm"); l.finalize()'' in /tmp/eb-6sujs4d9/eb-llfc02uv/eb-sanity-check-wq2fmk8z

Context from which build log was copied:
- original path of build log: /tmp/eb-6sujs4d9/easybuild-nkxbr1pf.log
- working directory: /eessi_bot_job
- Slurm job ID:
- EasyBuild version: 5.1.1
- easystack file: easystacks/software.eessi.io/2023.06/accel/nvidia/rebuilds/20250805-eb-5.1.1-rebuild-2023a-for-cuda-sanity-check.yml

I guess it's that mpirun command that got killed / ran out of memory? I did notice yesterday that that command took very long to run. I did not notice (or look at) the memory usage...

@casparvl (Collaborator Author):

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf for:arch=x86_64/amd/zen4,accel=nvidia/cc90

@eessi-bot-surf (bot) commented Aug 27, 2025

New job on instance eessi-bot-surf for repository eessi.io-2023.06-software
Building on: amd-zen4 and accelerator nvidia/cc90
Building for: x86_64/amd/zen4 and accelerator nvidia/cc90
Job dir: /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_1147/14324195

date job status comment
Aug 27 10:05:19 UTC 2025 submitted job id 14324195 will be eligible to start in about 20 seconds
Aug 27 10:05:24 UTC 2025 received job awaits launch by Slurm scheduler
Aug 27 10:05:50 UTC 2025 running job 14324195 is running
Aug 27 11:28:15 UTC 2025 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-14324195.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-accel-nvidia-cc90-17562939990.tar.gz
size: 0 MiB (45 bytes)
entries: 0
modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
no software packages in tarball
reprod directories under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
no other files in tarball
Aug 27 11:28:15 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (2/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (3/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (4/8) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] (5/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (6/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (7/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ SKIP ] (8/8) Skipping test : 1 GPU(s) available for this test case, need exactly 2
[ PASSED ] Ran 0/8 test case(s) from 8 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-14324195.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl (Collaborator Author):

I see this stack:

  ├─ /bin/bash /gpfs/work1/1/eessibot/eessi-bot-surf/jobs/2025.08/pr_1147/event_533c67c0-832d-11f0-8c8c-94392419de11/run_000/x86_64/amd/zen4/nvidia/cc90/eessi.io-2023.06-software/software-layer-scripts/install_software_layer.sh --build-logs-dir /projects/eessibot/eessi-bot-surf/buildlo
  │  └─ /bin/bash /gpfs/work1/1/eessibot/eessi-bot-surf/jobs/2025.08/pr_1147/event_533c67c0-832d-11f0-8c8c-94392419de11/run_000/x86_64/amd/zen4/nvidia/cc90/eessi.io-2023.06-software/software-layer-scripts/run_in_compat_layer_env.sh /gpfs/work1/1/eessibot/eessi-bot-surf/jobs/2025.08/pr_
  │     └─ /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/bin/bash /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/startprefix
  │        └─ /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/bin/bash -l
  │           └─ /bin/bash /gpfs/work1/1/eessibot/eessi-bot-surf/jobs/2025.08/pr_1147/event_533c67c0-832d-11f0-8c8c-94392419de11/run_000/x86_64/amd/zen4/nvidia/cc90/eessi.io-2023.06-software/software-layer-scripts/EESSI-install-software.sh --build-logs-dir /projects/eessibot/eessi-bot-
  │              └─ /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/usr/lib/python-exec/python3.11/python -m easybuild.main --easystack easystacks/software.eessi.io/2023.06/accel/nvidia/rebuilds/20250805-eb-5.1.1-rebuild-2023a-for-cuda-sanity-check.yml --rebuild
  │                 ├─ /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/usr/lib/python-exec/python3.11/python -m easybuild.main --easystack easystacks/software.eessi.io/2023.06/accel/nvidia/rebuilds/20250805-eb-5.1.1-rebuild-2023a-for-cuda-sanity-check.yml --rebuild
  │                 └─ mpirun -n 1 python -c from lammps import lammps; l=lammps(); l.file("/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1/examples/atm/in.atm"); l.finalize()
  │                    ├─ mpirun -n 1 python -c from lammps import lammps; l=lammps(); l.file("/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1/examples/atm/in.atm"); l.finalize()
  │                    ├─ mpirun -n 1 python -c from lammps import lammps; l=lammps(); l.file("/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1/examples/atm/in.atm"); l.finalize()
  │                    ├─ mpirun -n 1 python -c from lammps import lammps; l=lammps(); l.file("/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1/examples/atm/in.atm"); l.finalize()
  │                    └─ python -c from lammps import lammps; l=lammps(); l.file("/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1/examples/atm/in.atm"); l.finalize()
  │                       ├─ python -c from lammps import lammps; l=lammps(); l.file("/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1/examples/atm/in.atm"); l.finalize()
  │                       ├─ python -c from lammps import lammps; l=lammps(); l.file("/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1/examples/atm/in.atm"); l.finalize()
  │                       └─ python -c from lammps import lammps; l=lammps(); l.file("/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1/examples/atm/in.atm"); l.finalize()

And the easybuild.main process is using more and more memory; it creeps up by ~100MB every 5 seconds or so, until I hit an OOM, apparently. Let me try to kill the mpirun process, to see if that somehow breaks the hang and we get more useful output.
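To put numbers on that growth, a minimal sketch: sample the resident set size of a process with ps (the PID and 5-second interval in the usage comment are placeholders for the easybuild.main process; rss is reported in KiB).

```shell
# Sample the resident set size (KiB) of one process; loop it to watch
# the ~100MB-per-5s growth. The PID below is a placeholder.
sample_rss() { ps -o rss= -p "$1" | tr -d ' '; }

# Usage sketch: poll every 5 seconds while the process is alive
# while kill -0 12345 2>/dev/null; do sample_rss 12345; sleep 5; done
```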

@casparvl
Copy link
Collaborator Author

I figured that before I kill it, I'd attach strace to the easybuild.main process. I see a million lines like:

read(7, "[1756293491.886531] [gcn145:5312"..., 4122442962) = 400
read(7, "[1756293491.886553] [gcn145:5312"..., 4122442562) = 400
read(7, "[1756293491.886576] [gcn145:5312"..., 4122442162) = 400
read(7, "[1756293491.886600] [gcn145:5312"..., 4122441762) = 400
read(7, "[1756293491.886623] [gcn145:5312"..., 4122441362) = 400
read(7, "[1756293491.886645] [gcn145:5312"..., 4122440962) = 252
read(7, "[1756293491.886653] [gcn145:5312"..., 4122440710) = 148
read(7, "[1756293491.886669] [gcn145:5312"..., 4122440562) = 400
read(7, "[1756293491.886693] [gcn145:5312"..., 4122440162) = 400
read(7, "[1756293491.886718] [gcn145:5312"..., 4122439762) = 400
read(7, "[1756293491.886741] [gcn145:5312"..., 4122439362) = 252
read(7, "[1756293491.886748] [gcn145:5312"..., 4122439110) = 548
read(7, "[1756293491.886787] [gcn145:5312"..., 4122438562) = 400
read(7, "[1756293491.886810] [gcn145:5312"..., 4122438162) = 252
read(7, "[1756293491.886816] [gcn145:5312"..., 4122437910) = 548
read(7, "[1756293491.886855] [gcn145:5312"..., 4122437362) = 252
read(7, "[1756293491.886862] [gcn145:5312"..., 4122437110) = 548
read(7, "[1756293491.886901] [gcn145:5312"..., 4122436562) = 400
read(7, "[1756293491.886925] [gcn145:5312"..., 4122436162) = 400
read(7, "[1756293491.886949] [gcn145:5312"..., 4122435762) = 400
read(7, "[1756293491.886979] [gcn145:5312"..., 4122435362) = 400
read(7, "[1756293491.887003] [gcn145:5312"..., 4122434962) = 400
read(7, "[1756293491.887026] [gcn145:5312"..., 4122434562) = 252
read(7, "[1756293491.887033] [gcn145:5312"..., 4122434310) = 148
read(7, "[1756293491.887049] [gcn145:5312"..., 4122434162) = 400
read(7, "[1756293491.887073] [gcn145:5312"..., 4122433762) = 400
read(7, "[1756293491.887097] [gcn145:5312"..., 4122433362) = 400
read(7, "[1756293491.887119] [gcn145:5312"..., 4122432962) = 400
read(7, "[1756293491.887141] [gcn145:5312"..., 4122432562) = 400
read(7, "[1756293491.887165] [gcn145:5312"..., 4122432162) = 400
read(7, "[1756293491.887187] [gcn145:5312"..., 4122431762) = 400
read(7, "[1756293491.887210] [gcn145:5312"..., 4122431362) = 400
read(7, "[1756293491.887232] [gcn145:5312"..., 4122430962) = 400
read(7, "[1756293491.887255] [gcn145:5312"..., 4122430562) = 252
read(7, "[1756293491.887261] [gcn145:5312"..., 4122430310) = 548
read(7, "[1756293491.887301] [gcn145:5312"..., 4122429762) = 400
read(7, "[1756293491.887324] [gcn145:5312"..., 4122429362) = 400
read(7, "[1756293491.887346] [gcn145:5312"..., 4122428962) = 400
read(7, "[1756293491.887370] [gcn145:5312"..., 4122428562) = 400

And the mpirun process is stuck with:

read(21, "[1756293766.892765] [gcn145:5312"..., 4096) = 948
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=20, events=POLLIN}, {fd=25, events=POLLIN}, {fd=1, events=POLLOUT}, {fd=21, events=POLLIN}], 7, -1) = 2 ([{fd=1, revents=POLLOUT}, {fd=21, revents=POLLIN}])
write(1, "[1756293766.892765] [gcn145:5312"..., 948) = 948
read(21, "[1756293766.892830] [gcn145:5312"..., 4096) = 1200
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=20, events=POLLIN}, {fd=25, events=POLLIN}, {fd=1, events=POLLOUT}, {fd=21, events=POLLIN}], 7, -1) = 2 ([{fd=1, revents=POLLOUT}, {fd=21, revents=POLLIN}])
write(1, "[1756293766.892830] [gcn145:5312"..., 1200) = 1200
read(21, "[1756293766.892898] [gcn145:5312"..., 4096) = 1200
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=20, events=POLLIN}, {fd=25, events=POLLIN}, {fd=1, events=POLLOUT}, {fd=21, events=POLLIN}], 7, -1) = 2 ([{fd=1, revents=POLLOUT}, {fd=21, revents=POLLIN}])
write(1, "[1756293766.892898] [gcn145:5312"..., 1200) = 1200
read(21, "[1756293766.892967] [gcn145:5312"..., 4096) = 1200
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=20, events=POLLIN}, {fd=25, events=POLLIN}, {fd=1, events=POLLOUT}, {fd=21, events=POLLIN}], 7, -1) = 2 ([{fd=1, revents=POLLOUT}, {fd=21, revents=POLLIN}])
read(21, "[1756293766.893041] [gcn145:5312"..., 4096) = 652
write(1, "[1756293766.892967] [gcn145:5312"..., 1200) = 1200
write(1, "[1756293766.893041] [gcn145:5312"..., 652) = 652
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=20, events=POLLIN}, {fd=25, events=POLLIN}, {fd=21, events=POLLIN}], 6, -1) = 1 ([{fd=21, revents=POLLIN}])
read(21, "[1756293766.893069] [gcn145:5312"..., 4096) = 1600
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=20, events=POLLIN}, {fd=25, events=POLLIN}, {fd=1, events=POLLOUT}, {fd=21, events=POLLIN}], 7, -1) = 2 ([{fd=1, revents=POLLOUT}, {fd=21, revents=POLLIN}])
write(1, "[1756293766.893069] [gcn145:5312"..., 1600) = 1600
read(21, "[1756293766.893159] [gcn145:5312"..., 4096) = 948
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=20, events=POLLIN}, {fd=25, events=POLLIN}, {fd=1, events=POLLOUT}, {fd=21, events=POLLIN}], 7, -1) = 2 ([{fd=1, revents=POLLOUT}, {fd=21, revents=POLLIN}])
write(1, "[1756293766.893159] [gcn145:5312"..., 948) = 948
read(21, "[1756293766.893224] [gcn145:5312"..., 4096) = 1200
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=20, events=POLLIN}, {fd=25, events=POLLIN}, {fd=1, events=POLLOUT}, {fd=21, events=POLLIN}], 7, -1) = 2 ([{fd=1, revents=POLLOUT}, {fd=21, revents=POLLIN}])
write(1, "[1756293766.893224] [gcn145:5312"..., 1200) = 1200
read(21, "[1756293766.893293] [gcn145:5312"..., 4096) = 1200
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=20, events=POLLIN}, {fd=25, events=POLLIN}, {fd=1, events=POLLOUT}, {fd=21, events=POLLIN}], 7, -1) = 2 ([{fd=1, revents=POLLOUT}, {fd=21, revents=POLLIN}])
write(1, "[1756293766.893293] [gcn145:5312"..., 1200) = 1200
read(21, "[1756293766.893361] [gcn145:5312"..., 4096) = 1052

The python process launched by mpirun gives:

mmap(NULL, 23072768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa260be5000
madvise(0x7fa260c00000, 20971520, MADV_HUGEPAGE) = 0
ioctl(28, RDMA_VERBS_IOCTL, 0x7ffd9a2871f0) = -1 ENOMEM (Cannot allocate memory)
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=0}) = 0
prlimit64(0, RLIMIT_MEMLOCK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
write(1, "[1756293836.193150] [gcn145:5312"..., 252) = 252
munmap(0x7fa260be5000, 23072768)        = 0
write(1, "[1756293836.193192] [gcn145:5312"..., 148) = 148
shmget(IPC_PRIVATE, 20971520, IPC_CREAT|SHM_HUGETLB|0600) = -1 EPERM (Operation not permitted)
shmctl(0, IPC_INFO, {shmmax=18446744073692774399, shmmin=1, shmmni=4096, shmseg=4096, shmall=18446744073692774399}) = 46
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=0}) = 0
openat(AT_FDCWD, "/sys/kernel/mm/transparent_hugepage/enabled", O_RDONLY) = 63
read(63, "[always] madvise never\n", 254) = 23
close(63)                               = 0
mmap(NULL, 23072768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa260be5000
madvise(0x7fa260c00000, 20971520, MADV_HUGEPAGE) = 0
ioctl(28, RDMA_VERBS_IOCTL, 0x7ffd9a2871f0) = -1 ENOMEM (Cannot allocate memory)
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=0}) = 0
prlimit64(0, RLIMIT_MEMLOCK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
write(1, "[1756293836.193423] [gcn145:5312"..., 252) = 252
munmap(0x7fa260be5000, 23072768)        = 0
write(1, "[1756293836.193464] [gcn145:5312"..., 148) = 148
shmget(IPC_PRIVATE, 20971520, IPC_CREAT|SHM_HUGETLB|0600) = -1 EPERM (Operation not permitted)
shmctl(0, IPC_INFO, {shmmax=18446744073692774399, shmmin=1, shmmni=4096, shmseg=4096, shmall=18446744073692774399}) = 46
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=0}) = 0
openat(AT_FDCWD, "/sys/kernel/mm/transparent_hugepage/enabled", O_RDONLY) = 63
read(63, "[always] madvise never\n", 254) = 23
close(63)                               = 0
mmap(NULL, 23072768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa260be5000
madvise(0x7fa260c00000, 20971520, MADV_HUGEPAGE) = 0
ioctl(28, RDMA_VERBS_IOCTL, 0x7ffd9a2871f0) = -1 ENOMEM (Cannot allocate memory)
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=0}) = 0
prlimit64(0, RLIMIT_MEMLOCK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
write(1, "[1756293836.193692] [gcn145:5312"..., 252) = 252
munmap(0x7fa260be5000, 23072768)        = 0
write(1, "[1756293836.193733] [gcn145:5312"..., 148) = 148
shmget(IPC_PRIVATE, 20971520, IPC_CREAT|SHM_HUGETLB|0600) = -1 EPERM (Operation not permitted)
shmctl(0, IPC_INFO, {shmmax=18446744073692774399, shmmin=1, shmmni=4096, shmseg=4096, shmall=18446744073692774399}) = 46
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=0}) = 0
openat(AT_FDCWD, "/sys/kernel/mm/transparent_hugepage/enabled", O_RDONLY) = 63
read(63, "[always] madvise never\n", 254) = 23
close(63)

repeatedly.

@casparvl
Copy link
Collaborator Author

Exactly the same output. I guess it shouldn't be a surprise: whether I kill it, or the OOM killer does, doesn't matter much.

Could it be related to trying to read this file

/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1/examples/atm/in.atm

from the overlay? I guess the overlay is one of the key differences between my EESSI-extend build (which was successful), and the one done by the bot...
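One cheap way to probe that hypothesis: time a plain sequential read of the same file from the overlay (path copied from the log above; the check degrades gracefully on hosts without the CVMFS mount).

```shell
# Time a plain read of the LAMMPS input file straight from the overlay,
# to see whether serving the file itself is slow.
in_atm=/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software/LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1/examples/atm/in.atm
if [ -r "$in_atm" ]; then
    time cat "$in_atm" > /dev/null
else
    echo "CVMFS path not mounted on this host"
fi
```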

@casparvl
Copy link
Collaborator Author

Let's check on a different system...

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-vsc-ugent for:arch=x86_64/intel/cascadelake,accel=nvidia/cc70

@gpu-bot-ugent
Copy link

gpu-bot-ugent bot commented Aug 27, 2025

New job on instance eessi-bot-vsc-ugent for repository eessi.io-2023.06-software
Building on: intel-cascadelake and accelerator nvidia/cc70
Building for: x86_64/intel/cascadelake and accelerator nvidia/cc70
Job dir: /scratch/gent/vo/002/gvo00211/SHARED/jobs/2025.08/pr_1147/40716864

date job status comment
Aug 27 11:30:51 UTC 2025 submitted job id 40716864 awaits release by job manager
Aug 27 11:32:34 UTC 2025 released job awaits launch by Slurm scheduler
Aug 27 11:34:38 UTC 2025 running job 40716864 is running
Aug 28 11:33:04 UTC 2025 finished
🤷 UNKNOWN (click triangle for detailed information)
  • Job results file _bot_job40716864.result does not exist in job directory, or parsing it failed.
  • No artefacts were found/reported.
Aug 28 11:33:04 UTC 2025 test result
🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job40716864.test does not exist in job directory, or parsing it failed.

@casparvl
Copy link
Collaborator Author

Check if this also happens for other arch:

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-surf for:arch=x86_64/intel/icelake,accel=nvidia/cc80

@eessi-bot-surf
Copy link

eessi-bot-surf bot commented Aug 27, 2025

New job on instance eessi-bot-surf for repository eessi.io-2023.06-software
Building on: intel-icelake and accelerator nvidia/cc80
Building for: x86_64/intel/icelake and accelerator nvidia/cc80
Job dir: /projects/eessibot/eessi-bot-surf/jobs/2025.08/pr_1147/14329026

date job status comment
Aug 27 14:21:34 UTC 2025 submitted job id 14329026 will be eligible to start in about 20 seconds
Aug 27 14:21:42 UTC 2025 received job awaits launch by Slurm scheduler
Aug 27 14:22:07 UTC 2025 running job 14329026 is running

@lorisercole
Copy link

Yeah this is weird, this test run should take less than 10 seconds.

@laraPPr
Copy link
Collaborator

laraPPr commented Aug 28, 2025

It seems limited to the zen4-cc90 instance, because on eessi-bot-vsc-ugent the LAMMPS build took a normal amount of time. I'll check the logs of eessi-bot-surf, but since that one is also still running, I think the other target at SURF is OK.

@laraPPr
Copy link
Collaborator

laraPPr commented Aug 28, 2025

The complete sanity check on intel-icelake and accelerator nvidia/cc80 took 2 mins 33 secs so that is normal.

@laraPPr
Copy link
Collaborator

laraPPr commented Aug 28, 2025

Once eessi-bot-vsc-ugent is done, I'll add the x86_64/amd/zen4 and accelerator nvidia/cc90 at UGent to see if we get the same. @casparvl was not able to reproduce it interactively, so it will be tricky to figure out what is going wrong.

@casparvl
Copy link
Collaborator Author

For icelake+cc80, I'm getting a hang at the end of the GROMACS test step. I could actually reproduce that interactively. First, I hit an error interactively:

make[3]: *** [python_packaging/gmxapi/test/CMakeFiles/gmxapi_pytest.dir/build.make:73: python_packaging/gmxapi/test/CMakeFiles/gmxapi_pytest] Error 1
make[3]: Leaving directory '/tmp/casparl/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0/easybuild_obj'
make[2]: *** [CMakeFiles/Makefile2:7274: python_packaging/gmxapi/test/CMakeFiles/gmxapi_pytest.dir/all] Error 2
make[2]: *** Waiting for unfinished jobs....

To resolve it, I unset all SLURM environment variables:

for i in $(env | grep SLURM); do unset "${i%=*}"; done
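A note of caution with that loop: `env | grep SLURM` also matches lines where only the *value* contains the string. A slightly safer sketch matches variable names starting with SLURM only (the SLURM_TEST_DUMMY export is a hypothetical variable, just to demonstrate):

```shell
# Unset only variables whose name (the part before the first '=')
# starts with SLURM, leaving unrelated variables untouched.
export SLURM_TEST_DUMMY=1   # hypothetical, for demonstration only
for v in $(env | awk -F= '/^SLURM/ {print $1}'); do unset "$v"; done
```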

This makes it go further, and actually report the test results:

100% tests passed, 0 tests failed out of 89

Label Time Summary:
GTest              = 566.41 sec*proc (85 tests)
IntegrationTest    = 324.32 sec*proc (28 tests)
MpiTest            = 383.95 sec*proc (21 tests)
QuickGpuTest       = 123.21 sec*proc (20 tests)
SlowGpuTest        = 427.80 sec*proc (14 tests)
SlowTest           = 232.29 sec*proc (13 tests)
UnitTest           =   9.80 sec*proc (44 tests)

Total Test time (real) = 302.50 sec

But it then hangs on completing the make check:

[100%] Built target missing-tests-notice

And then... nothing. It just hangs, on the same process that the build bot was hanging on:

2594465 casparl     20   0  235M 79320 16704 S   0.0  0.0  0:00.58  32 │                    │     └─ /home/casparl/eessi/versions/2023.06/software/linux/x86_64/intel/icelake/software/Python/3.11.5-GCCcore-13.2.0/bin/python -m pytest --log-cli-level ERROR /tmp/casparl/easybuild/build/GROMACS/2024.4/foss-2023b-CUDA-12.4.0/gromacs-2024.4/python_packaging/sample_restraint/tests
2594482 casparl     20   0 91776 17280 13824 S   0.0  0.0  0:00.09  25 │                    │        ├─ orted --hnp --set-sid --report-uri 14 --singleton-died-pipe 15 -mca state_novm_select 1 -mca ess hnp -mca pmix ^s1,s2,cray,isolated
2594486 casparl     20   0 91776 17280 13824 S   0.0  0.0  0:00.01  27 │                    │        │  ├─ orted --hnp --set-sid --report-uri 14 --singleton-died-pipe 15 -mca state_novm_select 1 -mca ess hnp -mca pmix ^s1,s2,cray,isolated
2594489 casparl     20   0 91776 17280 13824 S   0.0  0.0  0:00.00  30 │                    │        │  └─ orted --hnp --set-sid --report-uri 14 --singleton-died-pipe 15 -mca state_novm_select 1 -mca ess hnp -mca pmix ^s1,s2,cray,isolated

@casparvl
Copy link
Collaborator Author

Maybe it's even related to the LAMMPS hang... both seem to use MPI through a Python interface. Maybe MPI_Finalize doesn't return correctly?

@bedroge
Copy link
Collaborator

bedroge commented Aug 28, 2025

This seems very similar to issues that we've seen before with GROMACS and Siesta:
#531
#966 (comment)

@bedroge
Copy link
Collaborator

bedroge commented Aug 28, 2025

Since you can reproduce it interactively, maybe you can try this?

export FI_PROVIDER="^psm3"
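To confirm the filter is picked up, a quick sketch using libfabric's fi_info utility, which honours FI_PROVIDER (it may not be on PATH on every build node, so the check is guarded):

```shell
export FI_PROVIDER="^psm3"
# fi_info -l lists the providers libfabric will consider; with the
# filter set, psm3 should no longer appear in the list.
if command -v fi_info >/dev/null 2>&1; then
    fi_info -l
else
    echo "fi_info not available on this host"
fi
```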

@casparvl
Copy link
Collaborator Author

@bedroge note that I could reproduce the GROMACS hang interactively, not the LAMMPS hang... But, it doesn't hurt to try it. I'm running the GROMACS test step now with export FI_PROVIDER="^psm3", so far so good.

What we should really do is split up this PR though, it's way too big / time consuming to tackle all of these at once...

@casparvl
Copy link
Collaborator Author

Setting that env var seems to work for GROMACS when I run the test step interactively... I'll set it, then try a full rebuild, to see if it completes successfully.

This was referenced Aug 28, 2025
@casparvl
Copy link
Collaborator Author

Ok, I'm going to close this, as I've split it up in

I've also implemented the suggested Lua hook from #966 (comment) in the host-injections dir of the SURF bot:

require("strict")
local hook = require("Hook")

-- LmodMessage("Load bot-specific SitePackage.lua")

local function eessi_bot_libfabric_set_psm3_devices_hook(t)
    local simpleName = string.match(t.modFullName, "(.-)/")
    -- we may want to be more specific in the future, and only do this for specific versions of libfabric
    if simpleName == 'libfabric' then
        -- exclude libfabric's PSM3 provider, as a workaround for MPI applications hanging in it
        -- cf. https://github.com/easybuilders/easybuild-easyconfigs/issues/18925
        -- setenv('PSM3_DEVICES', 'self,shm')
        setenv('FI_PROVIDER', '^psm3')
    end
end

-- combine all site-specific load hook functions into a single one
function site_specific_load_hook(t)
    eessi_bot_libfabric_set_psm3_devices_hook(t)
end

local function combined_load_hook(t)
    -- Assuming this was called from EESSI's SitePackage.lua, this should be defined and thus run
    if eessi_load_hook ~= nil then
        eessi_load_hook(t)
    end
    site_specific_load_hook(t)
end

hook.register("load", combined_load_hook)

@casparvl casparvl closed this Aug 28, 2025
@ocaisa
Copy link
Member

ocaisa commented Aug 29, 2025

We may want to consider making this hook install/available as part of the build process. We can be ultra conservative because we know it is a single node build.
