loss from progress bar appears to be sum of loss across all GPUs in Lightning 1.0.3 #4395

@junwen-austin

Description

🐛 Bug

The loss shown in the progress bar during training appears to be the sum of the losses across all GPUs.

Please reproduce using the BoringModel and post here

To Reproduce

One can pick any model and dataset and vary the number of GPUs used for training. In my example, I see roughly 0.7 loss with 1 GPU, 1.4 with 2 GPUs, and 0.28 with 4 GPUs for the first few training batches, indicating the displayed loss is roughly the sum rather than the mean of the loss across GPUs.
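
A minimal, self-contained reproduction sketch (not the exact BoringModel; the model, dataset, and hyperparameters below are invented for illustration) that follows the setup described above and lets you vary the number of GPUs:

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class ReproModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        # same logging call as in the snippet further down
        self.log('loss/train', loss, on_step=True, on_epoch=False,
                 sync_dist=True, sync_dist_op='mean')
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

if __name__ == '__main__':
    dataset = TensorDataset(torch.randn(512, 32), torch.randn(512, 2))
    loader = DataLoader(dataset, batch_size=32)
    # vary gpus between 1, 2 and 4 and compare the progress-bar loss
    # with the loss/train curve in TensorBoard
    trainer = pl.Trainer(gpus=2, accelerator='ddp', max_epochs=1)
    trainer.fit(ReproModel(), loader)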

I used the standard pattern from the docs:

def training_step(self, batch, batch_idx):
    loss = self.forward(batch)
    self.log('loss/train', loss, on_step=True, on_epoch=False, sync_dist=True, sync_dist_op='mean')
    return loss

Note: the loss logged to TensorBoard appears to be correct (the mean across GPUs).
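
The loss column shown in the progress bar is produced by Lightning's own running-loss display rather than by the self.log call above, which would explain why TensorBoard shows the mean while the progress bar does not. As a possible workaround (an assumption, not a confirmed fix for the underlying bug), one could additionally push the synced mean onto the progress bar explicitly via prog_bar=True:

# workaround sketch: show the synced mean on the progress bar explicitly
def training_step(self, batch, batch_idx):
    loss = self.forward(batch)  # as in the snippet above, forward returns the loss
    self.log('loss/train', loss, on_step=True, on_epoch=False,
             prog_bar=True, sync_dist=True, sync_dist_op='mean')
    return loss

The built-in running-loss column may still show the un-synced value, so compare against the loss/train entry that this call adds to the bar.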

Expected behavior

The loss shown in the progress bar during training should be the mean of the losses across all GPUs.

Environment

Note: Bugs with code are solved faster! Colab Notebook should be made public!

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
  • PyTorch Version (e.g., 1.0): 1.4
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.7
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration:
  • Any other relevant information: PyTorch Lightning 1.0.3

Additional context

Labels

bug (Something isn't working) · help wanted (Open to be worked on) · logger (Related to the Loggers)
