Metrics API when using DDP and multi-GPU freezes on compute() at end of validation phase #5930

@angadkalra

🐛 Bug

I implemented an AUC metric class to calculate train/valid AUC per epoch, but my progress bar freezes at the end of the first epoch with the GPUs at 100%. It works with 1 GPU, but not with more. I basically copied the source code of the ExplainedVariance metric, but it doesn't work for me in DDP with multiple GPUs. The hang happens after the return in compute(), because print statements inside compute() successfully print the preds and targets variables.

I'm training ResNet101 on 2700 3D images stored as .npy files.

import torch
from pytorch_lightning.metrics import Metric
from pytorch_lightning.metrics.functional.classification import multiclass_auroc


class AUC(Metric):
    def __init__(self, dist_sync_on_step=False):
        # compute_on_step=False: forward() only calls update() and returns None
        super().__init__(compute_on_step=False, dist_sync_on_step=dist_sync_on_step)

        # List states with no reduction function; batches are appended in update()
        self.add_state("preds", default=[], dist_reduce_fx=None)
        self.add_state("targets", default=[], dist_reduce_fx=None)

    def update(self, preds: torch.Tensor, targets: torch.Tensor):
        # Accumulate every batch seen during the epoch
        self.preds.append(preds)
        self.targets.append(targets)

    def compute(self):
        # Concatenate the accumulated batches and compute AUC over the whole epoch
        preds = torch.cat(self.preds)
        targets = torch.cat(self.targets)
        return multiclass_auroc(preds, targets)
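
For comparison, here is a minimal single-process sketch of the same accumulate-then-compute pattern, which may be handy for sanity-checking the expected value outside of DDP. It is not the code from this report: `EpochAUC` and `binary_auc` are hypothetical names, plain Python lists stand in for tensors so that it runs without PyTorch, and a simple pairwise binary AUC is swapped in for `multiclass_auroc`.

```python
from typing import List, Sequence


def binary_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # A positive scored above a negative is a win; a tie counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


class EpochAUC:
    """Hypothetical stand-in for the metric: append per batch, compute per epoch."""

    def __init__(self) -> None:
        self.preds: List[List[float]] = []
        self.targets: List[List[int]] = []

    def update(self, preds: Sequence[float], targets: Sequence[int]) -> None:
        # One list entry per batch, mirroring the list states in the report.
        self.preds.append(list(preds))
        self.targets.append(list(targets))

    def compute(self) -> float:
        # Flatten the batches (the analogue of torch.cat) before scoring.
        flat_preds = [p for batch in self.preds for p in batch]
        flat_targets = [t for batch in self.targets for t in batch]
        return binary_auc(flat_preds, flat_targets)


metric = EpochAUC()
metric.update([0.9, 0.2, 0.6], [1, 0, 1])  # first batch
metric.update([0.4, 0.1], [0, 1])          # second, smaller batch
epoch_auc = metric.compute()
```

Here 4 of the 6 positive/negative pairs are ordered correctly, so `epoch_auc` is 4/6. This only illustrates the per-epoch accumulation logic; it does not exercise the cross-process synchronization where the freeze occurs.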

Environment

  • PyTorch Version: 1.7.1+cu101
  • OS: Linux
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.7.6
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: 4 V100 GPUs on a Google Cloud VM
  • Any other relevant information: 32 cores, 128 GB memory
  • PyTorch Lightning Version: 1.1.8

Labels: bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task)