@trz42
Collaborator

@trz42 trz42 commented Apr 20, 2025

Adds TensorFlow 2.13.0 (foss-2023a) for A64FX. Limits build parallelism to 8 (via eb_hooks.py) to work around an out-of-memory issue.

@trz42 trz42 added the 2023.06-software.eessi.io and a64fx labels Apr 20, 2025
@eessi-bot

eessi-bot bot commented Apr 20, 2025

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/sapphirerapids, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat

@eessi-bot-deucalion

Instance eessi-bot-deucalion is configured to build for:

  • architectures: aarch64/a64fx
  • repositories: eessi.io-2023.06-software

@eessi-bot

eessi-bot bot commented Apr 20, 2025

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-compat, eessi.io-2023.06-software

@eessi-bot-trz42

Instance trz42-GH200-jr is configured to build for:

  • architectures: aarch64/nvidia/grace
  • repositories: eessi.io-2023.06-software

@eessi-bot-toprichard

Instance rt-Grace-jr is configured to build for:

  • architectures: aarch64/nvidia/grace
  • repositories: eessi.io-2023.06-software

@gpu-bot-ugent

gpu-bot-ugent bot commented Apr 20, 2025

Instance eessi-bot-vsc-ugent is configured to build for:

  • architectures: x86_64/amd/zen3
  • repositories: eessi-hpc.org-2023.06-software, eessi.io-2023.06-compat, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software

@eessi-bot-surf

Instance eessi-bot-surf is configured to build for:

  • architectures: x86_64/amd/zen4, x86_64/amd/zen2
  • repositories: eessi-hpc.org-2023.06-software, eessi.io-2023.06-software, eessi.io-2023.06-compat, eessi-hpc.org-2023.06-compat

@trz42
Collaborator Author

trz42 commented Apr 20, 2025

bot: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx
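
The command above carries three key:value filters (instance, repository, architecture). A minimal sketch of how such a line could be parsed and matched against a bot instance's configuration follows. It is an illustration only, not the EESSI bot's actual implementation: the function names `parse_bot_command` and `instance_matches` are hypothetical, and the assumption that every filter must match is inferred from the replies in this conversation. The configuration values are copied from the instance announcements earlier in the thread.

```python
def parse_bot_command(line):
    """Split 'bot: <action> key:value ...' into (action, filters)."""
    prefix = "bot:"
    if not line.startswith(prefix):
        raise ValueError("not a bot command")
    tokens = line[len(prefix):].split()
    action, filters = tokens[0], {}
    for token in tokens[1:]:
        # Split on the first ':' only, so values may contain '/' or '.'
        key, _, value = token.partition(":")
        filters[key] = value
    return action, filters


def instance_matches(filters, name, architectures, repositories):
    """Assumed rule: an instance acts only if every given filter matches it."""
    if "instance" in filters and filters["instance"] != name:
        return False
    if "architecture" in filters and filters["architecture"] not in architectures:
        return False
    if "repository" in filters and filters["repository"] not in repositories:
        return False
    return True


command = ("bot: build instance:eessi-bot-deucalion "
           "repository:eessi.io-2023.06-software architecture:aarch64/a64fx")
action, filters = parse_bot_command(command)

# Instance configurations as announced earlier in this conversation
deucalion = instance_matches(filters, "eessi-bot-deucalion",
                             ["aarch64/a64fx"], ["eessi.io-2023.06-software"])
azure = instance_matches(filters, "eessi-bot-mc-azure", ["x86_64/amd/zen4"],
                         ["eessi.io-2023.06-compat", "eessi.io-2023.06-software"])
print(action, deucalion, azure)
```

Under this reading only eessi-bot-deucalion matches all three filters, which agrees with the other instances reporting that no jobs were submitted.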

@eessi-bot

eessi-bot bot commented Apr 20, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx from trz42

    • expanded format: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx
  • handling command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx resulted in:

    • no jobs were submitted

@eessi-bot

eessi-bot bot commented Apr 20, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx from trz42

    • expanded format: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx
  • handling command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx resulted in:

    • no jobs were submitted

@eessi-bot-deucalion

eessi-bot-deucalion bot commented Apr 20, 2025

Updates by the bot instance eessi-bot-deucalion (click for details)
  • received bot command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx from trz42

    • expanded format: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx
  • handling command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx resulted in:

@eessi-bot-surf

eessi-bot-surf bot commented Apr 20, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • received bot command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx from trz42

    • expanded format: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx
  • handling command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx resulted in:

    • no jobs were submitted

@gpu-bot-ugent

gpu-bot-ugent bot commented Apr 20, 2025

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx from trz42

    • expanded format: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx
  • handling command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx resulted in:

    • no jobs were submitted

@eessi-bot-trz42

eessi-bot-trz42 bot commented Apr 20, 2025

Updates by the bot instance trz42-GH200-jr (click for details)
  • received bot command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx from trz42

    • expanded format: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx
  • handling command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx resulted in:

    • no jobs were submitted

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account trz42 has NO permission to send commands to the bot

@eessi-bot-deucalion

eessi-bot-deucalion bot commented Apr 20, 2025

New job on instance eessi-bot-deucalion for CPU micro-architecture aarch64-a64fx for repository eessi.io-2023.06-software in job dir /home/eessibot/new-bot/jobs/2025.04/pr_1034/406938

date | job status | comment
Apr 20 06:33:05 UTC 2025 | submitted | job id 406938 awaits release by job manager
Apr 20 06:33:55 UTC 2025 | released | job awaits launch by Slurm scheduler
Apr 20 06:34:57 UTC 2025 | running | job 406938 is running
Apr 21 06:24:11 UTC 2025 | finished | 🤷 UNKNOWN (click triangle for detailed information)
  • Job results file _bot_job406938.result does not exist in job directory, or parsing it failed.
  • No artefacts were found/reported.
Apr 21 06:24:11 UTC 2025 | test result | 🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job406938.test does not exist in job directory, or parsing it failed.

@trz42
Collaborator Author

trz42 commented Apr 20, 2025

First job seems a bit slow. Launch one with parallel = 8...
bot: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx

@eessi-bot

eessi-bot bot commented Apr 20, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx from trz42

    • expanded format: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx
  • handling command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx resulted in:

    • no jobs were submitted

@eessi-bot

eessi-bot bot commented Apr 20, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx from trz42

    • expanded format: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx
  • handling command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx resulted in:

    • no jobs were submitted

@gpu-bot-ugent

gpu-bot-ugent bot commented Apr 20, 2025

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx from trz42

    • expanded format: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx
  • handling command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx resulted in:

    • no jobs were submitted

@eessi-bot-surf

eessi-bot-surf bot commented Apr 20, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • received bot command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx from trz42

    • expanded format: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx
  • handling command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx resulted in:

    • no jobs were submitted

@eessi-bot-trz42

eessi-bot-trz42 bot commented Apr 20, 2025

Updates by the bot instance trz42-GH200-jr (click for details)
  • received bot command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx from trz42

    • expanded format: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx
  • handling command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx resulted in:

    • no jobs were submitted

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account trz42 has NO permission to send commands to the bot

@eessi-bot-deucalion

eessi-bot-deucalion bot commented Apr 20, 2025

Updates by the bot instance eessi-bot-deucalion (click for details)
  • received bot command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx from trz42

    • expanded format: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx
  • handling command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx resulted in:

@eessi-bot-deucalion

eessi-bot-deucalion bot commented Apr 20, 2025

New job on instance eessi-bot-deucalion for CPU micro-architecture aarch64-a64fx for repository eessi.io-2023.06-software in job dir /home/eessibot/new-bot/jobs/2025.04/pr_1034/406966

date | job status | comment
Apr 20 17:16:07 UTC 2025 | submitted | job id 406966 awaits release by job manager
Apr 20 17:17:06 UTC 2025 | released | job awaits launch by Slurm scheduler
Apr 20 17:18:12 UTC 2025 | running | job 406966 is running
Apr 21 06:13:53 UTC 2025 | finished | 😁 SUCCESS (click triangle for details)
Details
  ✅ job output file slurm-406966.out
  ✅ no message matching FATAL:
  ✅ no message matching ERROR:
  ✅ no message matching FAILED:
  ✅ no message matching required modules missing:
  ✅ found message(s) matching No missing installations
  ✅ found message matching .tar.gz created!
Artefacts
  eessi-2023.06-software-linux-aarch64-a64fx-1745215393.tar.gz
  size: 293 MiB (307684455 bytes)
  entries: 17257
  modules under 2023.06/software/linux/aarch64/a64fx/modules/all
    TensorFlow/2.13.0-foss-2023a.lua
  software under 2023.06/software/linux/aarch64/a64fx/software
    TensorFlow/2.13.0-foss-2023a
  other under 2023.06/software/linux/aarch64/a64fx
    2023.06/init/easybuild/eb_hooks.py
Apr 21 06:13:53 UTC 2025 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
  [ OK ] (1/1) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @BotBuildTests:aarch64_a64fx+default
  P: perf: 12.815 timesteps/s (r:0, l:None, u:None)
  [ PASSED ] Ran 1/1 test case(s) from 1 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
  ✅ job output file slurm-406966.out
  ✅ no message matching ERROR:
  ✅ no message matching [\s*FAILED\s*].*Ran .* test case
Apr 22 14:15:06 UTC 2025 | uploaded | transfer of eessi-2023.06-software-linux-aarch64-a64fx-1745215393.tar.gz to S3 bucket succeeded
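
The "Details" list in the report above reads like a set of pattern checks on the Slurm output file. The sketch below shows one way such a verdict could be derived. It is a hedged illustration, not the bot's actual check script: the function name `check_job_output`, the rule combining the patterns, and both sample logs are made up for this example. Only the pattern strings are taken from the report.

```python
import re

# Patterns copied from the "Details" list of the job report.
# Assumed rule: none of MUST_NOT_MATCH may occur, all of MUST_MATCH must occur.
MUST_NOT_MATCH = ["FATAL:", "ERROR:", "FAILED:", "required modules missing:"]
MUST_MATCH = ["No missing installations", r"\.tar\.gz created!"]


def check_job_output(text):
    """Return (status, details); details maps each pattern to whether it was found."""
    details = {p: re.search(p, text) is not None
               for p in MUST_NOT_MATCH + MUST_MATCH}
    ok = (not any(details[p] for p in MUST_NOT_MATCH)
          and all(details[p] for p in MUST_MATCH))
    return ("SUCCESS" if ok else "FAILURE"), details


# Made-up log excerpts, for illustration only
good_log = "No missing installations\n/tmp/example.tar.gz created!\n"
bad_log = ("ERROR: build failed\nrequired modules missing:\n"
           "/tmp/example.tar.gz created!\n")

good_status, _ = check_job_output(good_log)
bad_status, bad_details = check_job_output(bad_log)
print(good_status, bad_status)
```

Note that a tarball can be created even when the build fails, as the failed jobs in this conversation show, so the ".tar.gz created!" check alone would not be enough.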

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account trz42 has NO permission to send commands to the bot

@eessi-bot-surf

eessi-bot-surf bot commented Apr 20, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • received bot command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx from trz42

    • expanded format: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx
  • handling command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx resulted in:

    • no jobs were submitted

@eessi-bot-deucalion

eessi-bot-deucalion bot commented Apr 20, 2025

New job on instance eessi-bot-deucalion for CPU micro-architecture aarch64-a64fx for repository eessi.io-2023.06-software in job dir /home/eessibot/new-bot/jobs/2025.04/pr_1034/406967

date | job status | comment
Apr 20 17:17:49 UTC 2025 | submitted | job id 406967 awaits release by job manager
Apr 20 17:18:10 UTC 2025 | released | job awaits launch by Slurm scheduler
Apr 20 17:19:19 UTC 2025 | running | job 406967 is running
Apr 21 02:00:07 UTC 2025 | finished | 😢 FAILURE (click triangle for details)
Details
  ✅ job output file slurm-406967.out
  ✅ no message matching FATAL:
  ❌ found message matching ERROR:
  ❌ found message matching FAILED:
  ❌ found message matching required modules missing:
  ❌ no message matching No missing installations
  ✅ found message matching .tar.gz created!
Artefacts
  eessi-2023.06-software-linux-aarch64-a64fx-1745199692.tar.gz
  size: 0 MiB (15594 bytes)
  entries: 1
  modules under 2023.06/software/linux/aarch64/a64fx/modules/all
    no module files in tarball
  software under 2023.06/software/linux/aarch64/a64fx/software
    no software packages in tarball
  other under 2023.06/software/linux/aarch64/a64fx
    2023.06/init/easybuild/eb_hooks.py
Apr 21 02:00:07 UTC 2025 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
  [ OK ] (1/1) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @BotBuildTests:aarch64_a64fx+default
  P: perf: 15.758 timesteps/s (r:0, l:None, u:None)
  [ PASSED ] Ran 1/1 test case(s) from 1 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
  ✅ job output file slurm-406967.out
  ❌ found message matching ERROR:
  ✅ no message matching [\s*FAILED\s*].*Ran .* test case

@trz42
Collaborator Author

trz42 commented Apr 20, 2025

One more with parallel = 16...
bot: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx

@eessi-bot

eessi-bot bot commented Apr 20, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx from trz42

    • expanded format: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx
  • handling command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx resulted in:

    • no jobs were submitted

@eessi-bot

eessi-bot bot commented Apr 20, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx from trz42

    • expanded format: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx
  • handling command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx resulted in:

    • no jobs were submitted

@eessi-bot-deucalion

eessi-bot-deucalion bot commented Apr 20, 2025

Updates by the bot instance eessi-bot-deucalion (click for details)
  • received bot command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx from trz42

    • expanded format: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx
  • handling command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx resulted in:

@gpu-bot-ugent

gpu-bot-ugent bot commented Apr 20, 2025

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx from trz42

    • expanded format: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx
  • handling command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx resulted in:

    • no jobs were submitted

@eessi-bot-surf

eessi-bot-surf bot commented Apr 20, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • received bot command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx from trz42

    • expanded format: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx
  • handling command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx resulted in:

    • no jobs were submitted

@eessi-bot-trz42

eessi-bot-trz42 bot commented Apr 20, 2025

Updates by the bot instance trz42-GH200-jr (click for details)
  • received bot command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx from trz42

    • expanded format: build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx
  • handling command build instance:eessi-bot-deucalion repository:eessi.io-2023.06-software architecture:aarch64/a64fx resulted in:

    • no jobs were submitted

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account trz42 has NO permission to send commands to the bot

@eessi-bot-deucalion

eessi-bot-deucalion bot commented Apr 20, 2025

New job on instance eessi-bot-deucalion for CPU micro-architecture aarch64-a64fx for repository eessi.io-2023.06-software in job dir /home/eessibot/new-bot/jobs/2025.04/pr_1034/406968

date | job status | comment
Apr 20 17:18:48 UTC 2025 | submitted | job id 406968 awaits release by job manager
Apr 20 17:19:17 UTC 2025 | released | job awaits launch by Slurm scheduler
Apr 20 17:20:25 UTC 2025 | running | job 406968 is running
Apr 21 00:48:06 UTC 2025 | finished | 😢 FAILURE (click triangle for details)
Details
  ✅ job output file slurm-406968.out
  ✅ no message matching FATAL:
  ❌ found message matching ERROR:
  ❌ found message matching FAILED:
  ❌ found message matching required modules missing:
  ❌ no message matching No missing installations
  ✅ found message matching .tar.gz created!
Artefacts
  eessi-2023.06-software-linux-aarch64-a64fx-1745195417.tar.gz
  size: 0 MiB (15597 bytes)
  entries: 1
  modules under 2023.06/software/linux/aarch64/a64fx/modules/all
    no module files in tarball
  software under 2023.06/software/linux/aarch64/a64fx/software
    no software packages in tarball
  other under 2023.06/software/linux/aarch64/a64fx
    2023.06/init/easybuild/eb_hooks.py
Apr 21 00:48:06 UTC 2025 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
  [ OK ] (1/1) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @BotBuildTests:aarch64_a64fx+default
  P: perf: 13.684 timesteps/s (r:0, l:None, u:None)
  [ PASSED ] Ran 1/1 test case(s) from 1 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
  ✅ job output file slurm-406968.out
  ❌ found message matching ERROR:
  ✅ no message matching [\s*FAILED\s*].*Ran .* test case
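
For comparing the four build attempts, the wall-clock time of each job can be worked out from its "running" and "finished" timestamps. The short sketch below does that arithmetic. The timestamps are copied from the job tables in this conversation; the names `JOBS`, `FMT` and `runtime` are only illustrative, and parsing the month abbreviation assumes an English locale.

```python
from datetime import datetime

# Timestamp layout used in the bot's job tables, e.g. "Apr 20 17:18:12 UTC 2025"
FMT = "%b %d %H:%M:%S UTC %Y"

# (job id, "running" timestamp, "finished" timestamp)
JOBS = [
    ("406938", "Apr 20 06:34:57 UTC 2025", "Apr 21 06:24:11 UTC 2025"),
    ("406966", "Apr 20 17:18:12 UTC 2025", "Apr 21 06:13:53 UTC 2025"),
    ("406967", "Apr 20 17:19:19 UTC 2025", "Apr 21 02:00:07 UTC 2025"),
    ("406968", "Apr 20 17:20:25 UTC 2025", "Apr 21 00:48:06 UTC 2025"),
]


def runtime(start, end):
    """Elapsed time between two timestamps given in the bot's table format."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


runtimes = {job: runtime(start, end) for job, start, end in JOBS}
for job, delta in runtimes.items():
    print(job, delta)
```

By this calculation the first job (406938) ran for just under 24 hours before ending with UNKNOWN status, and the successful job (406966) took just under 13 hours. The two failed jobs stopped after roughly 8 hours 41 minutes and 7 hours 28 minutes.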

@trz42 trz42 added the ready-to-deploy and ready-to-review labels Apr 21, 2025
…-layer into 2023.06-a64fx-2023a-eb482-apps-tf

kept original order to list TensorFlow right after all its dependencies
if cpu_target == CPU_TARGET_A64FX and self.name in ['TensorFlow']:
    # limit parallelism to 8, builds with 12 and 16 failed on Deucalion
    if parallel > 8:
        self.cfg['parallel'] = 8
Contributor

@trz42 Why don't we simply use a factor of 4 when building for A64FX, rather than a factor of 2 like we do below?

In theory, we could have smaller build jobs (say with 4 cores) at some point, so really hardcoding to 8 seems wrong to me...

Contributor

Ah, I overlooked the > 8 condition, sorry
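
The difference between the two approaches discussed in this review thread can be made concrete with a small sketch. It is illustrative only: `capped_parallel` mirrors the `if parallel > 8` guard from the diff, while `scaled_parallel` is an assumed form of the "factor" alternative (integer division with a floor of 1), since the hook's actual scaling code is not shown here. The core counts are hypothetical inputs.

```python
def capped_parallel(parallel, cap=8):
    """Mirror of the hook's guard: lower the value only when it exceeds the cap."""
    return min(parallel, cap)


def scaled_parallel(parallel, factor=4):
    """Assumed form of the alternative: divide by a fixed factor, never below 1."""
    return max(parallel // factor, 1)


# Hypothetical core counts for a build job
for cores in (4, 16, 48):
    print(cores, capped_parallel(cores), scaled_parallel(cores))
```

With the cap, a 4-core job keeps its 4 parallel processes, so small jobs are unaffected. With a factor of 4, a 48-core job would still build with 12 processes, which is one of the values the code comment reports as failing on Deucalion.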

@boegel boegel added the bot:deploy label and removed the ready-to-deploy label Apr 22, 2025
@eessi-bot-toprichard

Label bot:deploy has been set by user boegel, but this person does not have permission to trigger deployments

@boegel
Contributor

boegel commented Apr 22, 2025

staging PR merged, so merging this too...

@boegel boegel merged commit 77651f1 into EESSI:2023.06-software.eessi.io Apr 22, 2025
64 of 66 checks passed
@eessi-bot

eessi-bot bot commented Apr 22, 2025

PR merged! Moved [] to /project/def-users/SHARED/trash_bin/EESSI/software-layer/2025.04.22

@eessi-bot-deucalion

PR merged! Moved ['/home/eessibot/new-bot/jobs/2025.04/pr_1034/406968', '/home/eessibot/new-bot/jobs/2025.04/pr_1034/406938', '/home/eessibot/new-bot/jobs/2025.04/pr_1034/406967', '/home/eessibot/new-bot/jobs/2025.04/pr_1034/406966'] to /home/eessibot/new-bot/trash-bin/EESSI/software-layer/2025.04.22

@gpu-bot-ugent

gpu-bot-ugent bot commented Apr 22, 2025

PR merged! Moved [] to /scratch/gent/vo/002/gvo00211/SHARED/trash_bin/EESSI/software-layer/2025.04.22
