Description
🐛 Bug
A lot of the PTL tooling around multiprocessing depends on a specific environment variable, LOCAL_RANK, being set correctly. When running under SLURM it does not seem to be set, so the lookup returns the default of 0 for every process. As a result, every process does things that should only be done on rank 0, such as logging.
I'm also a little unclear about the choice of that variable: with multiple nodes, only the process with global rank 0, not every process with local rank 0, should be logging, saving checkpoints, etc.
To Reproduce
Run under SLURM (this can't really be reproduced in Colab). An easy way to see it is to use the Wandb logger: each process creates a new run in the Wandb UI, which means @rank_zero_experiment didn't work properly. You can confirm this by printing LOCAL_RANK in each process: because it defaults to 0 when unset, it always comes back as 0.
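A minimal diagnostic sketch for the check described above, not part of PTL. The helper name `rank_env_report` is made up for illustration. It assumes the job is launched with `srun`, in which case SLURM exports `SLURM_PROCID`, `SLURM_LOCALID`, and `SLURM_NODEID` for each task. Printing these next to LOCAL_RANK should show whether LOCAL_RANK is unset while SLURM's own variables differ per process.

```python
import os


def rank_env_report(env=None):
    """Return the rank-related variables for this process.

    Variables that are not set map to None, so an unset LOCAL_RANK is
    distinguishable from one that was explicitly set to "0".
    """
    env = os.environ if env is None else env
    names = ("LOCAL_RANK", "SLURM_PROCID", "SLURM_LOCALID", "SLURM_NODEID")
    return {name: env.get(name) for name in names}


if __name__ == "__main__":
    # Run this in every task, e.g. at the top of the training script.
    print(rank_env_report())
```

If LOCAL_RANK is reported as None in every task while SLURM_PROCID varies, the tooling is falling back to its default rather than reading a real rank.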
Expected behavior
LOCAL_RANK is set correctly, or the rest of the tooling is aware of the global rank of the process.
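One possible shape for this, as a hedged sketch rather than PTL's actual implementation: resolve both ranks from the environment, falling back to SLURM's per-task variables when the explicit ones are missing, and gate rank-zero-only work on the global rank. The function names `resolve_ranks` and `is_global_zero` are hypothetical. The fallback order is an assumption: `RANK` and `LOCAL_RANK` are taken first if present, then `SLURM_PROCID` and `SLURM_LOCALID`, then 0.

```python
import os


def resolve_ranks(env=None):
    """Return (global_rank, local_rank) for the current process.

    Assumed lookup order (illustrative, not PTL's behavior):
      global rank: RANK, then SLURM_PROCID, then 0
      local rank:  LOCAL_RANK, then SLURM_LOCALID, then 0
    """
    env = os.environ if env is None else env

    def first_int(*names):
        # Use the first variable that is actually set; default to 0.
        for name in names:
            value = env.get(name)
            if value is not None:
                return int(value)
        return 0

    global_rank = first_int("RANK", "SLURM_PROCID")
    local_rank = first_int("LOCAL_RANK", "SLURM_LOCALID")
    return global_rank, local_rank


def is_global_zero(env=None):
    """True only for the single process that should log and checkpoint."""
    return resolve_ranks(env)[0] == 0
```

With two nodes of two tasks each, the first task on the second node has local rank 0 but global rank 2, so `is_global_zero` is false there even though a LOCAL_RANK-based check would pass.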
Environment
Will update if it's really necessary