
Error in returning Dict from training_step with multiple GPUs #6193

@kchuang625


🐛 Bug

When using multiple GPUs with 'dp', the error RuntimeError: grad can be implicitly created only for scalar outputs is raised if I use a training_step function like this:

def training_step(self, batch, batch_idx):
    ...
    return {'loss': loss}
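
This matches plain PyTorch behavior: backward() can only be called implicitly on a scalar. Under 'dp' the per-GPU results are gathered into a non-scalar tensor, so if it is not reduced before backward() this exact error appears. A minimal sketch of the underlying PyTorch error (illustrative only, not taken from the issue):

import torch

w = torch.randn(2, requires_grad=True)
per_gpu_losses = w * 3  # shape (2,), e.g. one loss value per GPU replica under 'dp'

try:
    per_gpu_losses.backward()  # non-scalar output: backward() cannot infer the grad
except RuntimeError as err:
    print(err)  # "grad can be implicitly created only for scalar outputs"

per_gpu_losses.mean().backward()  # reducing to a scalar works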

Please reproduce using the BoringModel

https://colab.research.google.com/drive/1hmHqYHPOqDlZUAF7-9zcCvobrvSPt7W5?usp=sharing
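
For reference, a self-contained sketch of the reproduction (my own approximation of the linked BoringModel Colab, assuming 2 GPUs are available; the dataset sizes and class names are placeholders):

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        return {'loss': loss}  # dict return triggers the error under 'dp'

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    trainer = pl.Trainer(gpus=2, accelerator='dp', max_epochs=1)
    trainer.fit(BoringModel(), DataLoader(RandomDataset(32, 64), batch_size=8))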

Expected behavior

Returning a dict with a loss key from training_step is supposed to work.

A quick solution

Return the loss tensor directly from the training_step function:

def training_step(self, batch, batch_idx):
    ...
    return loss
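
Alternatively (a sketch of my own, not from the issue), the dict return can be kept by reducing the gathered per-GPU losses in training_step_end, which Lightning calls with the combined outputs when running under 'dp':

def training_step(self, batch, batch_idx):
    ...
    return {'loss': loss}

def training_step_end(self, outputs):
    # Under 'dp', outputs['loss'] holds one loss per GPU;
    # averaging yields the scalar that backward() needs.
    return {'loss': outputs['loss'].mean()}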

Environment

  • PyTorch Lightning Version: 1.2.0
  • PyTorch Version: 1.7.0
  • OS: Linux
  • Python version: 3.8
  • CUDA/cuDNN version: 10.2

cc @carmocca


Labels

bug · distributed · good first issue · help wanted · priority: 0
