Configuration files for single-node SLURM cluster
Write an executable `run_first_script_from_github.sh` script like

```bash
#!/bin/bash
export NODE_LABEL=15cpu-60ram
export GITHUB_TAG=v0.0.21
curl -s "https://raw.githubusercontent.com/fractal-analytics-platform/slurm-single-node-config/refs/tags/$GITHUB_TAG/first_config_script.sh" -o first_config_script.sh
bash first_config_script.sh
```

and run it with sudo:

```console
$ sudo ./run_first_script_from_github.sh
```
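Optionally, a slightly more defensive variant of the same wrapper (a sketch, reusing the variables above) aborts if the download fails instead of saving an error page as `first_config_script.sh`:

```bash
#!/bin/bash
# Sketch of a stricter wrapper: abort on any error (set -euo pipefail) and
# make curl fail on HTTP errors (-f) while still reporting them (-S).
set -euo pipefail
export NODE_LABEL=15cpu-60ram
export GITHUB_TAG=v0.0.21
curl -sSf "https://raw.githubusercontent.com/fractal-analytics-platform/slurm-single-node-config/refs/tags/$GITHUB_TAG/first_config_script.sh" -o first_config_script.sh
bash first_config_script.sh
```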
Install the NVIDIA GPU drivers:

```bash
sudo apt install ubuntu-drivers-common
# Confirm
sudo ubuntu-drivers install --gpgpu
# Take note of which version it installs (e.g. 535, 580, ...)
# (reboot)
# Replace 580 with the version noted above
sudo apt install nvidia-utils-580-server
# Test nvidia-smi
nvidia-smi
# Expected output starts e.g. with
# | NVIDIA-SMI 580.95.05    Driver Version: 580.95.05    CUDA Version: 13.0 |
# Test that apt updates are not broken
sudo apt update
sudo apt upgrade  # should not fail
```
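For a more scriptable check of the installed driver, `nvidia-smi` can also print selected fields directly (a sketch; the available field names are listed by `nvidia-smi --help-query-gpu`):

```bash
# Print GPU name and driver version as plain CSV, e.g. "Tesla T4, 580.95.05"
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
```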
Actual PyTorch-based test:

```bash
python3.12 -m venv /tmp/venv
source /tmp/venv/bin/activate
python3.12 -m pip install torch numpy
python3.12 -c 'import torch; assert torch.cuda.is_available(); print(torch.cuda.get_device_name(0)); assert torch.cuda.get_device_name(0) == "Tesla T4"; print("ALL OK!");'
deactivate
rm -r /tmp/venv
```
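If this check needs to be repeated (e.g. after a driver upgrade), it can be wrapped into a small helper script. The following is a sketch; the script name, temporary-directory handling, and the default expected GPU name are all chosen here for illustration:

```bash
#!/bin/bash
# check_gpu.sh (hypothetical): repeat the PyTorch-based GPU check above,
# with the expected GPU name passed as an optional argument.
set -euo pipefail
EXPECTED_GPU="${1:-Tesla T4}"
VENV_DIR="$(mktemp -d)/venv"
python3.12 -m venv "$VENV_DIR"
source "$VENV_DIR/bin/activate"
python3.12 -m pip install --quiet torch numpy
python3.12 - "$EXPECTED_GPU" <<'EOF'
import sys
import torch
assert torch.cuda.is_available()
name = torch.cuda.get_device_name(0)
print(name)
assert name == sys.argv[1]
print("ALL OK!")
EOF
deactivate
rm -rf "$(dirname "$VENV_DIR")"
```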
Lock the VM from the OpenStack dashboard.
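The same can also be done from the command line, assuming the OpenStack CLI is installed and the relevant credentials are sourced (the server name below is a placeholder):

```bash
# Lock the VM so it cannot be accidentally deleted or modified.
openstack server lock <VM_NAME>
# Depending on the API version, "server show" reports a locked field.
openstack server show <VM_NAME> | grep -i lock
```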
Create the appropriate file

```bash
sudo -u fractal-worker mkdir /home/fractal-worker/.ssh
sudo -u fractal-worker touch /home/fractal-worker/.ssh/authorized_keys
```

and then populate it with appropriate keys:

- (required) the VM-specific `fractal-worker` key from the main Fractal instance
- (optional) keys of relevant team members
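A minimal sketch of how a key could be added (the public-key string is a placeholder), together with the usual permission and ownership fixes expected by sshd:

```bash
# Append a placeholder public key to the fractal-worker authorized_keys file.
echo "ssh-ed25519 AAAA...placeholder... fractal-worker@main-fractal-instance" \
  | sudo tee -a /home/fractal-worker/.ssh/authorized_keys
# Restrictive permissions; the group name is assumed to match the user name.
sudo chmod 700 /home/fractal-worker/.ssh
sudo chmod 600 /home/fractal-worker/.ssh/authorized_keys
sudo chown -R fractal-worker:fractal-worker /home/fractal-worker/.ssh
```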
First, run `sinfo`.
If it fails, the likely solution is

```bash
sudo systemctl restart slurmctld.service
```
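If `sinfo` works but the node is not usable, a few standard SLURM checks can narrow things down (a sketch; `NODENAME` is a placeholder for the actual node name):

```bash
sinfo                 # the node should be listed, ideally in the "idle" state
scontrol show node    # detailed node state and configured resources
srun -N1 hostname     # a trivial job should run and print the hostname
# If the node is stuck in a "drain"/"down" state, it can usually be resumed with:
# sudo scontrol update NodeName=NODENAME State=RESUME
```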