Skip to content

LOCAL_RANK not being set in slurm #6797

@Queuecumber

Description

@Queuecumber

🐛 Bug

A lot of the PTL tooling around multiprocess depends on a specific environment variable: LOCAL_RANK being set correctly, it seems that when running in slurm this isnt set causing it to return the default of 0 for all processes which makes every process do things that should only be done on rank 0, like log stuff.

Also I'm a little unclear about the name of that variable, if I have multiple nodes, only the global rank 0 not the local rank should be logging and saving checkpoints etc.

To Reproduce

Run in slurm (cant really do it w/ colab), a good way to easily see it is to use the Wandb logger, you'll see that each process makes a new run on the Wandb UI which means that @rank_zero_experiment didnt work properly, and you can confirm this by printing LOCAL_RANK which is defaulted to 0 if unset, it will always give back 0.

Expected behavior

LOCAL_RANK is set correctly or the rest of the tooling is aware of the global rank of the process

Environment

Will update if it's really necessary

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions