trz42 (Collaborator) commented Aug 1, 2024

Companion PR for #655, applying the same changes on zen4.

trz42 added the zen4 label Aug 1, 2024

eessi-bot bot commented Aug 1, 2024

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-compat, eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software

eessi-bot bot commented Aug 1, 2024

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-compat, eessi-hpc.org-2023.06-compat, eessi-hpc.org-2023.06-software, eessi.io-2023.06-software

trz42 (Collaborator, Author) commented Aug 1, 2024

bot: build repo:eessi.io-2023.06-software arch:zen4

eessi-bot bot commented Aug 1, 2024

Updates by the bot instance eessi-bot-mc-aws
  • received bot command build repo:eessi.io-2023.06-software arch:zen4 from trz42

    • expanded format: build repository:eessi.io-2023.06-software architecture:zen4
  • handling command build repository:eessi.io-2023.06-software architecture:zen4 resulted in:

    • no jobs were submitted

eessi-bot bot commented Aug 1, 2024

Updates by the bot instance eessi-bot-mc-azure

eessi-bot bot commented Aug 1, 2024

New job on instance eessi-bot-mc-azure for architecture x86_64-amd-zen4 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_657/168

date | job status | comment
Aug 01 18:59:35 UTC 2024 | submitted | job id 168 awaits release by job manager
Aug 01 18:59:42 UTC 2024 | released | job awaits launch by Slurm scheduler
Aug 01 19:03:45 UTC 2024 | running | job 168 is running
Aug 01 19:37:26 UTC 2024 | finished | 😢 FAILURE
Details
✅ job output file slurm-168.out
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-1722540924.tar.gz
size: 179 MiB (187932168 bytes)
entries: 8429
modules under 2023.06/software/linux/x86_64/amd/zen4/modules/all
FLAC/1.4.2-GCCcore-12.3.0.lua
LLVM/14.0.6-GCCcore-12.3.0-llvmlite.lua
LittleCMS/2.15-GCCcore-12.3.0.lua
OpenJPEG/2.5.0-GCCcore-12.3.0.lua
Pillow/10.0.0-GCCcore-12.3.0.lua
Qhull/2020.2-GCCcore-12.3.0.lua
Tkinter/3.11.3-GCCcore-12.3.0.lua
cppy/1.2.1-GCCcore-12.3.0.lua
libogg/1.3.5-GCCcore-12.3.0.lua
libopus/1.4-GCCcore-12.3.0.lua
libsndfile/1.2.2-GCCcore-12.3.0.lua
libvorbis/1.3.7-GCCcore-12.3.0.lua
libwebp/1.3.1-GCCcore-12.3.0.lua
matplotlib/3.7.2-gfbf-2023a.lua
meson-python/0.13.2-GCCcore-12.3.0.lua
numba/0.58.1-foss-2023a.lua
scikit-learn/1.3.1-gfbf-2023a.lua
software under 2023.06/software/linux/x86_64/amd/zen4/software
FLAC/1.4.2-GCCcore-12.3.0
LLVM/14.0.6-GCCcore-12.3.0-llvmlite
LittleCMS/2.15-GCCcore-12.3.0
OpenJPEG/2.5.0-GCCcore-12.3.0
Pillow/10.0.0-GCCcore-12.3.0
Qhull/2020.2-GCCcore-12.3.0
Tkinter/3.11.3-GCCcore-12.3.0
cppy/1.2.1-GCCcore-12.3.0
libogg/1.3.5-GCCcore-12.3.0
libopus/1.4-GCCcore-12.3.0
libsndfile/1.2.2-GCCcore-12.3.0
libvorbis/1.3.7-GCCcore-12.3.0
libwebp/1.3.1-GCCcore-12.3.0
matplotlib/3.7.2-gfbf-2023a
meson-python/0.13.2-GCCcore-12.3.0
numba/0.58.1-foss-2023a
scikit-learn/1.3.1-gfbf-2023a
other under 2023.06/software/linux/x86_64/amd/zen4
no other files in tarball
Aug 01 19:37:26 UTC 2024 | test result | 😢 FAILURE
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-168.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

trz42 (Collaborator, Author) commented Aug 1, 2024

The build fails while rebuilding Python:

INFO:    Using fakeroot command combined with root-mapped namespace
unknown argument ignored: lazytime
FATAL:   exec /.singularity.d/libs/fakeroot failed: fork/exec /.singularity.d/libs/fakeroot: no such file or directory

@boegel maybe fakeroot is not supported on the Azure cluster?

casparvl (Collaborator) commented Aug 1, 2024

https://docs.sylabs.io/guides/3.6/admin-guide/installation.html#filesystem-support-limitations

Fakeroot / (sub)uid/gid mapping
When Singularity is run using the fakeroot option it creates a user namespace for the container, and UIDs / GIDs in that user namespace are mapped to different host UID / GIDs.

Most local filesystems (ext4/xfs etc.) support this uid/gid mapping in a user namespace.

Most network filesystems (NFS/Lustre/GPFS etc.) do not support this uid/gid mapping in a user namespace. Because the fileserver is not aware of the mappings it will deny many operations, with ‘permission denied’ errors. This is currently a generic problem for rootless container runtimes.

casparvl (Collaborator) commented Aug 1, 2024

Hm, maybe that's not it. It seems that on both clusters (AWS and Azure) Singularity is installed on xfs.
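
A minimal sketch of how the filesystem type behind a given path could be checked, assuming GNU coreutils `df` (for its `--output` option). `fs_type` is a hypothetical helper name, and which path to inspect (the Singularity installation, the job directory, a temporary directory) is left open here:

```shell
# Hypothetical helper: print the filesystem type that backs a path.
# Assumes GNU coreutils df, which supports --output=fstype.
# $1: path to inspect
fs_type() {
    # Line 1 is the "Type" header, line 2 holds the value.
    df --output=fstype "$1" | awk 'NR == 2 { print $1 }'
}
```

For example, `fs_type "$(dirname "$(command -v singularity)")"` would report the filesystem that holds the `singularity` executable found in `PATH`.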

bedroge (Collaborator) commented Aug 1, 2024

We ran into the same issue last week, see https://github.com/EESSI/magic-castle-clusters/issues/28#issuecomment-2242669887. @ocaisa has a temporary workaround for it.

ocaisa (Member) commented Aug 2, 2024

bot: build repo:eessi.io-2023.06-software arch:zen4

eessi-bot bot commented Aug 2, 2024

Updates by the bot instance eessi-bot-mc-aws
  • received bot command build repo:eessi.io-2023.06-software arch:zen4 from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software architecture:zen4
  • handling command build repository:eessi.io-2023.06-software architecture:zen4 resulted in:

    • no jobs were submitted

eessi-bot bot commented Aug 2, 2024

Updates by the bot instance eessi-bot-mc-azure

eessi-bot bot commented Aug 2, 2024

New job on instance eessi-bot-mc-azure for architecture x86_64-amd-zen4 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.08/pr_657/170

date | job status | comment
Aug 02 07:30:28 UTC 2024 | submitted | job id 170 awaits release by job manager
Aug 02 07:30:37 UTC 2024 | released | job awaits launch by Slurm scheduler
Aug 02 07:31:39 UTC 2024 | running | job 170 is running
Aug 02 08:23:46 UTC 2024 | finished | 😁 SUCCESS
Details
✅ job output file slurm-170.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-1722586858.tar.gz
size: 405 MiB (425570751 bytes)
entries: 28875
modules under 2023.06/software/linux/x86_64/amd/zen4/modules/all
FLAC/1.4.2-GCCcore-12.3.0.lua
LLVM/14.0.6-GCCcore-12.3.0-llvmlite.lua
LittleCMS/2.15-GCCcore-12.3.0.lua
OpenJPEG/2.5.0-GCCcore-12.3.0.lua
Pillow/10.0.0-GCCcore-12.3.0.lua
Python/3.11.3-GCCcore-12.3.0.lua
Python/3.11.5-GCCcore-13.2.0.lua
Qhull/2020.2-GCCcore-12.3.0.lua
Tkinter/3.11.3-GCCcore-12.3.0.lua
cppy/1.2.1-GCCcore-12.3.0.lua
libogg/1.3.5-GCCcore-12.3.0.lua
libopus/1.4-GCCcore-12.3.0.lua
librosa/0.10.1-foss-2023a.lua
libsndfile/1.2.2-GCCcore-12.3.0.lua
libvorbis/1.3.7-GCCcore-12.3.0.lua
libwebp/1.3.1-GCCcore-12.3.0.lua
matplotlib/3.7.2-gfbf-2023a.lua
meson-python/0.13.2-GCCcore-12.3.0.lua
numba/0.58.1-foss-2023a.lua
scikit-learn/1.3.1-gfbf-2023a.lua
software under 2023.06/software/linux/x86_64/amd/zen4/software
FLAC/1.4.2-GCCcore-12.3.0
LLVM/14.0.6-GCCcore-12.3.0-llvmlite
LittleCMS/2.15-GCCcore-12.3.0
OpenJPEG/2.5.0-GCCcore-12.3.0
Pillow/10.0.0-GCCcore-12.3.0
Python/3.11.3-GCCcore-12.3.0
Python/3.11.5-GCCcore-13.2.0
Qhull/2020.2-GCCcore-12.3.0
Tkinter/3.11.3-GCCcore-12.3.0
cppy/1.2.1-GCCcore-12.3.0
libogg/1.3.5-GCCcore-12.3.0
libopus/1.4-GCCcore-12.3.0
librosa/0.10.1-foss-2023a
libsndfile/1.2.2-GCCcore-12.3.0
libvorbis/1.3.7-GCCcore-12.3.0
libwebp/1.3.1-GCCcore-12.3.0
matplotlib/3.7.2-gfbf-2023a
meson-python/0.13.2-GCCcore-12.3.0
numba/0.58.1-foss-2023a
scikit-learn/1.3.1-gfbf-2023a
other under 2023.06/software/linux/x86_64/amd/zen4
no other files in tarball
Aug 02 08:23:46 UTC 2024 | test result | 😢 FAILURE
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-170.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
Aug 02 09:00:29 UTC 2024 | uploaded | transfer of eessi-2023.06-software-linux-x86_64-amd-zen4-1722586858.tar.gz to S3 bucket succeeded

trz42 (Collaborator, Author) commented Aug 2, 2024

The last build looks good! Only the Python/3.11.* packages are rebuilt. All other packages are new (with the versions/toolchains listed; for some packages other versions exist). I took a snapshot of the directory contents for the Python packages, so we can go ahead and deploy the built installations.

trz42 added the ready-to-deploy label (Mark a PR as ready to deploy) Aug 2, 2024

casparvl (Collaborator) commented Aug 2, 2024

The test failure is not a big deal (for now, though we should fix it). The test step tries to read the memory limit from the cgroup. Apparently, that fails on the Azure cluster:

cat: /hostsys/fs/cgroup/memory/slurm/uid_60008/job_170/memory.limit_in_bytes: No such file or directory
ERROR: Failed to get the memory limit in bytes from the current cgroup

I guess we should try this interactively and figure out why that path doesn't exist. Did the bind-mount from /sys/fs/cgroup to /hostsys/fs/cgroup fail? Does that specific subdirectory not exist? I currently can't seem to submit jobs, so I can't figure it out right now:

$ salloc -p x86-64-amd-zen4-node
salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified

Maybe Slurm is not configured to set up cgroups for the job? Although I do see TaskPlugin=task/affinity,task/cgroup in slurm.conf...

Maybe we should have a fallback: if reading the limit fails, we just assume a very large amount of memory and hope for the best (and print a big fat warning that we did this). Worst case, a test runs out of memory, but in many cases tests would probably still run fine.
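
A rough sketch of what such a fallback could look like, written as a POSIX shell function. The function name `get_memory_limit_bytes` and the 1 TiB default are assumptions made for illustration; this is not the code used by the EESSI test step:

```shell
# Hypothetical helper, not taken from the EESSI scripts.
# Prints the memory limit in bytes read from a cgroup file, or a large
# fallback value (with a warning on stderr) if the file cannot be read
# or does not contain a plain non-negative integer.
# $1: path to the cgroup memory limit file
# $2: fallback in bytes (default: 1 TiB, an arbitrary "very large" value)
get_memory_limit_bytes() {
    _limit_file="$1"
    _fallback="${2:-1099511627776}"
    _limit=""
    if [ -r "$_limit_file" ]; then
        _limit="$(cat "$_limit_file" 2>/dev/null)"
    fi
    case "$_limit" in
        ''|*[!0-9]*)
            # Empty or non-numeric content: warn loudly and use the fallback.
            echo "WARNING: could not read a memory limit from '$_limit_file';" \
                 "assuming $_fallback bytes, tests may run out of memory" >&2
            echo "$_fallback"
            ;;
        *)
            echo "$_limit"
            ;;
    esac
}
```

With the path from the error message above, `get_memory_limit_bytes /hostsys/fs/cgroup/memory/slurm/uid_60008/job_170/memory.limit_in_bytes` would print the fallback value and a warning instead of aborting the test step.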

@casparvl casparvl added bot:deploy Ask bot to deploy missing software installations to EESSI and removed ready-to-deploy Mark a PR as ready to deploy labels Aug 2, 2024
@casparvl
Copy link
Collaborator

casparvl commented Aug 2, 2024

> The last build looks good! Only the Python/3.11.* packages are rebuilt. All other packages are new (with the versions/toolchains listed; for some packages other versions exist). I took a snapshot of the directory contents for the Python packages, so we can go ahead and deploy the built installations.

Agreed, the test suite failure is not an issue (and is probably there in other zen4 builds too, I just never looked at them apparently).

ocaisa merged commit 753ae69 into EESSI:2023.06-software.eessi.io Aug 2, 2024

Neves-P (Member) commented Aug 16, 2024

Replaced and manually added the necessary tarballs on the Stratum 0, following Bob's procedure. Thanks, @trz42, for the help!

Procedure:

# Deleting only zen4 builds
rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/FLAC/1.4.2-GCCcore-12.3.0
rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/LLVM/14.0.6-GCCcore-12.3.0-llvmlite
rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/LittleCMS/2.15-GCCcore-12.3.0
rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/OpenJPEG/2.5.0-GCCcore-12.3.0
rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/Pillow/10.0.0-GCCcore-12.3.0
rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/Python/3.11.3-GCCcore-12.3.0
rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/Python/3.11.5-GCCcore-13.2.0
rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/Qhull/2020.2-GCCcore-12.3.0
rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/Tkinter/3.11.3-GCCcore-12.3.0
rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/cppy/1.2.1-GCCcore-12.3.0
rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/libogg/1.3.5-GCCcore-12.3.0
rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/libopus/1.4-GCCcore-12.3.0
rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/librosa/0.10.1-foss-2023a
rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/libsndfile/1.2.2-GCCcore-12.3.0
rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/libvorbis/1.3.7-GCCcore-12.3.0
rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/libwebp/1.3.1-GCCcore-12.3.0
rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/matplotlib/3.7.2-gfbf-2023a
rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/meson-python/0.13.2-GCCcore-12.3.0
rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/numba/0.58.1-foss-2023a
rm -rf /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/scikit-learn/1.3.1-gfbf-2023a

# Move to the versions directory (paths in the tarball start with 2023.06/)
cd /cvmfs/software.eessi.io/versions

# Unpack tarball
tar xvzf /srv/tmp/tarballs/eessi-2023.06-software-linux-x86_64-amd-zen4-1722586858.tar.gz
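
As a sketch only, the list of directories to delete could also be derived from the tarball listing instead of being typed by hand. `install_dirs` and `print_rm_commands` are hypothetical helpers, not part of the procedure above, and the second one only prints the `rm` commands so they can be reviewed before anything is removed:

```shell
# Hypothetical helper (not part of the procedure above).
# Reads tarball member paths on stdin and prints the unique
# <prefix>/software/<name>/<version> directories found in them.
# $1: path prefix inside the tarball,
#     e.g. 2023.06/software/linux/x86_64/amd/zen4
install_dirs() {
    awk -v p="$1/software/" '
        index($0, p) == 1 {
            n = split(substr($0, length(p) + 1), parts, "/")
            # Require <name>/<version>/ followed by anything (or nothing),
            # so that the bare <name>/ directory entry is skipped.
            if (n >= 3) print p parts[1] "/" parts[2]
        }' | sort -u
}

# Dry run: print one rm command per installation directory in the tarball.
# $1: tarball, $2: prefix inside the tarball,
# $3: directory in which the tarball will be unpacked
print_rm_commands() {
    tar tzf "$1" | install_dirs "$2" | while read -r dir; do
        echo "rm -rf $3/$dir"
    done
}
```

Called as `print_rm_commands /srv/tmp/tarballs/eessi-2023.06-software-linux-x86_64-amd-zen4-1722586858.tar.gz 2023.06/software/linux/x86_64/amd/zen4 /cvmfs/software.eessi.io/versions`, this should print commands of the same form as the `rm -rf` lines in the procedure, which can then be compared against the bot's artefact listing.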
