Slurm ntasks-per-node is ignored #102

@neggert

Describe the bug

When running with DDP, Lightning throws this warning:

UserWarning: 
You requested 2 GPUs but launched 1 slurm tasks.
We will launch 2 processes for you.
We recommend you let slurm manage the processes by setting: --ntasks-per-node=2
If you're not using SLURM, ignore this message!

I made the suggested change, but I still get the warning. Digging into the code a bit, it looks like the warning only goes away when $SLURM_NTASKS matches trainer.nb_requested_gpus. If I'm understanding the code correctly, the check should use $SLURM_NTASKS_PER_NODE instead, since trainer.nb_requested_gpus is the number of GPUs per node, whereas $SLURM_NTASKS is the total number of tasks across all nodes.
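To illustrate the proposed check, here is a minimal sketch. It is not Lightning's actual code: the function name `slurm_manages_tasks` and its `environ` parameter are hypothetical, and it only assumes that Slurm exports `SLURM_NTASKS_PER_NODE` when `--ntasks-per-node` is set.

```python
import os


def slurm_manages_tasks(nb_requested_gpus, environ=None):
    """Return True when Slurm launched one task per requested GPU on each node.

    Hypothetical helper for illustration only; not part of Lightning.
    """
    env = os.environ if environ is None else environ

    # SLURM_NTASKS_PER_NODE is only present when --ntasks-per-node was requested.
    raw = env.get("SLURM_NTASKS_PER_NODE")
    if raw is None:
        return False

    try:
        tasks_per_node = int(raw)
    except ValueError:
        # Unexpected format: fall back to treating the tasks as unmanaged.
        return False

    # Compare per-node tasks with per-node GPUs, not the job-wide SLURM_NTASKS.
    return tasks_per_node == nb_requested_gpus
```

With the reproduction below (2 nodes, 2 GPUs per node, `--ntasks-per-node=2`), `SLURM_NTASKS` would be 4, so comparing it with 2 requested GPUs fails, while comparing `SLURM_NTASKS_PER_NODE` (2) succeeds.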

I'm happy to make the change if you agree that this is the correct fix.

To Reproduce
Submit a job with test_tube.SlurmCluster:

    from test_tube import SlurmCluster

    # `args` and `train` are defined elsewhere in the script.
    cluster = SlurmCluster(
        hyperparam_optimizer=args,
        log_path="./logs"
    )

    cluster.per_experiment_nb_gpus = 2
    cluster.per_experiment_nb_nodes = 2
    cluster.per_experiment_nb_cpus = 16
    # Request one Slurm task per GPU on each node, as the warning suggests.
    cluster.add_slurm_cmd(
        cmd="ntasks-per-node",
        value=str(cluster.per_experiment_nb_gpus),
        comment="1 task per gpu, for ddp",
    )
    cluster.job_time = "1:00:00"
    cluster.gpu_type = "p100"
    cluster.memory_mb_per_node = 300000

    cluster.optimize_parallel_cluster_gpu(train, nb_trials=1, job_name="tml")

Expected behavior
The warning should go away, and Lightning should use the Slurm-created tasks.
