Merged. Changes from all commits (19 commits).
- d061cbc fixed bug where tuner would not tune lr if also tuning batch_size (#4… (Palzer, Mar 9, 2021)
- c395040 fix logger creating directory structure too early in DDP (#6380) (awaelchli, Mar 9, 2021)
- 745d54d Set find unused parameters to True by default to fix breaking compati… (SeanNaren, Mar 10, 2021)
- 6ee1627 [Fix] Ensure we set the default device before initializing deepspeed … (SeanNaren, Mar 10, 2021)
- 344b455 Hotfix for torchvision (#6476) (kaushikb11, Mar 11, 2021)
- dfba137 Disable batch transfer in DP mode (#6098) (rohitgr7, Mar 11, 2021)
- b169fb5 Add missing imports & stage (Mar 16, 2021)
- 0560c97 [bugfix] Perform reduction for dict in training_step and DP (#6324) (tchaton, Mar 4, 2021)
- fc3503b Fix tuner.scale_batch_size not finding batch size attribute when usin… (awaelchli, Mar 14, 2021)
- 50c15dc Handle torch.jit scripted modules in layer summary (#6511) (awaelchli, Mar 15, 2021)
- 2178b53 CI: resume testing with py3.8 (#6516) (Borda, Mar 15, 2021)
- 133fd82 fix attribute access in LightningModule.toggle_optimizer (#6513) (awaelchli, Mar 15, 2021)
- 4e37c57 Custom Plugin is_distributed (#6537) (amogkam, Mar 15, 2021)
- fed7949 refactor reading env defaults (#6510) (Borda, Mar 16, 2021)
- 293ff2a [bug] Update broadcast + reduce decision ModelCheckpoint] (#6410) (tchaton, Mar 14, 2021)
- 7d4db76 v1.2.4 & chnagelog (Mar 16, 2021)
- f35b9b8 Fix: Train loop config validation was run during `trainer.predict` (#… (kaushikb11, Mar 16, 2021)
- c6e1035 Expose DeepSpeed FP16 parameters due to loss instability (#6115) (SeanNaren, Feb 21, 2021)
- e71f916 Fixed reference in test (Mar 16, 2021)
7 changes: 3 additions & 4 deletions .github/workflows/ci_test-base.yml
@@ -51,9 +51,8 @@ jobs:
     - name: Install dependencies
       run: |
         python -m pip install --upgrade --user pip
-        pip install --requirement ./requirements.txt --quiet --find-links https://download.pytorch.org/whl/cpu/torch_stable.html --upgrade
-        pip install --requirement ./requirements/test.txt --quiet --upgrade-strategy only-if-needed
-        # pip install tox coverage
+        pip install --requirement ./requirements.txt --find-links https://download.pytorch.org/whl/cpu/torch_stable.html --upgrade
+        pip install "pytest>6.0" "pytest-cov>2.10" --upgrade-strategy only-if-needed
         python --version
         pip --version
         pip list
@@ -69,7 +68,7 @@ jobs:
     - name: Test Package [only]
       run: |
         # NOTE: running coverage on tests does not propagate failure status for Win, https://github.com/nedbat/coveragepy/issues/1003
-        coverage run --source pytorch_lightning -m pytest pytorch_lightning -v --junitxml=junit/test-results-${{ runner.os }}-${{ matrix.python-version }}-${{ matrix.requires }}.xml
+        python -m pytest pytorch_lightning -v --cov=pytorch_lightning --junitxml=junit/test-results-${{ runner.os }}-${{ matrix.python-version }}-${{ matrix.requires }}.xml

- name: Upload pytest test results
uses: actions/upload-artifact@v2
4 changes: 0 additions & 4 deletions .github/workflows/ci_test-full.yml
@@ -17,10 +17,6 @@ jobs:
         os: [ubuntu-18.04, windows-2019, macOS-10.15]
         python-version: [3.6, 3.7, 3.8]
         requires: ['minimal', 'latest']
-        exclude:
-          # # todo: segmentation fault for minimal and hanging for latest
-          - python-version: 3.8
-            os: ubuntu-18.04

# Timeout: https://stackoverflow.com/a/59076067/4521646
timeout-minutes: 35 # TODO: the macOS is taking too long, probably caching did not work...
144 changes: 143 additions & 1 deletion CHANGELOG.md
@@ -5,9 +5,145 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).


-## [1.2.3] - 2021-03-09
+## [UnReleased] - 2021-MM-DD

### Added

- Added a way to print to terminal without breaking up the progress bar ([#5470](https://github.com/PyTorchLightning/pytorch-lightning/pull/5470))

- Added support to checkpoint after training steps in `ModelCheckpoint` callback ([#6146](https://github.com/PyTorchLightning/pytorch-lightning/pull/6146))

- Added `checkpoint` parameter to callback's `on_save_checkpoint` hook ([#6072](https://github.com/PyTorchLightning/pytorch-lightning/pull/6072))


- Added `RunningStage.SANITY_CHECKING` ([#4945](https://github.com/PyTorchLightning/pytorch-lightning/pull/4945))


- Added `TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}` ([#4945](https://github.com/PyTorchLightning/pytorch-lightning/pull/4945))


- Added `Trainer.validate()` method to perform one evaluation epoch over the validation set ([#4948](https://github.com/PyTorchLightning/pytorch-lightning/pull/4948))


- Added `LightningEnvironment` for Lightning-specific DDP ([#5915](https://github.com/PyTorchLightning/pytorch-lightning/pull/5915))


- Added `auto_insert_metric_name` parameter to `ModelCheckpoint` ([#6277](https://github.com/PyTorchLightning/pytorch-lightning/pull/6277))


- Added arg to `self.log` that enables users to give custom names when dealing with multiple dataloaders ([#6274](https://github.com/PyTorchLightning/pytorch-lightning/pull/6274))


- Added no return warning to predict ([#6139](https://github.com/PyTorchLightning/pytorch-lightning/pull/6139))


### Changed

- Renamed `pytorch_lightning.callbacks.swa` to `pytorch_lightning.callbacks.stochastic_weight_avg` ([#6259](https://github.com/PyTorchLightning/pytorch-lightning/pull/6259))


- Refactor `RunningStage` and `TrainerState` usage ([#4945](https://github.com/PyTorchLightning/pytorch-lightning/pull/4945))


- Changed `trainer.evaluating` to return `True` if validating or testing ([#4945](https://github.com/PyTorchLightning/pytorch-lightning/pull/4945))


- Changed `setup()` and `teardown()` stage argument to take any of `{fit,validate,test,predict}` ([#6386](https://github.com/PyTorchLightning/pytorch-lightning/pull/6386))


### Deprecated

- `period` has been deprecated in favor of `every_n_val_epochs` in the `ModelCheckpoint` callback ([#6146](https://github.com/PyTorchLightning/pytorch-lightning/pull/6146))


- Deprecated `trainer.running_sanity_check` in favor of `trainer.sanity_checking` ([#4945](https://github.com/PyTorchLightning/pytorch-lightning/pull/4945))


### Removed

- Removed support for passing a bool value to `profiler` argument of Trainer ([#6164](https://github.com/PyTorchLightning/pytorch-lightning/pull/6164))


- Removed no return warning from val/test step ([#6139](https://github.com/PyTorchLightning/pytorch-lightning/pull/6139))


- Removed passing a `ModelCheckpoint` instance to `Trainer(checkpoint_callback)` ([#6166](https://github.com/PyTorchLightning/pytorch-lightning/pull/6166))


- Removed deprecated Trainer argument `enable_pl_optimizer` and `automatic_optimization` ([#6163](https://github.com/PyTorchLightning/pytorch-lightning/pull/6163))


- Removed deprecated metrics ([#6161](https://github.com/PyTorchLightning/pytorch-lightning/pull/6161))
* from `pytorch_lightning.metrics.functional.classification` removed `to_onehot`, `to_categorical`, `get_num_classes`, `roc`, `multiclass_roc`, `average_precision`, `precision_recall_curve`, `multiclass_precision_recall_curve`
* from `pytorch_lightning.metrics.functional.reduction` removed `reduce`, `class_reduce`


- Removed deprecated `ModelCheckpoint` arguments `prefix`, `mode="auto"` ([#6162](https://github.com/PyTorchLightning/pytorch-lightning/pull/6162))


- Removed `mode='auto'` from `EarlyStopping` ([#6167](https://github.com/PyTorchLightning/pytorch-lightning/pull/6167))


- Removed deprecated `LightningModule` `hparams` setter ([#6207](https://github.com/PyTorchLightning/pytorch-lightning/pull/6207))


- Removed `optimizer_idx` argument from `training_step` in manual optimization ([#6093](https://github.com/PyTorchLightning/pytorch-lightning/pull/6093))


### Fixed

- Made the `Plugin.reduce` method more consistent across all Plugins to reflect a mean-reduction by default ([#6011](https://github.com/PyTorchLightning/pytorch-lightning/pull/6011))


- Move lightning module to correct device type when using LightningDistributedWrapper ([#6070](https://github.com/PyTorchLightning/pytorch-lightning/pull/6070))


- Do not print top-k verbose log with `ModelCheckpoint(monitor=None)` ([#6109](https://github.com/PyTorchLightning/pytorch-lightning/pull/6109))


- Fixed `ModelCheckpoint(monitor=None, save_last=True)` not saving checkpoints ([#6136](https://github.com/PyTorchLightning/pytorch-lightning/pull/6136))


- Fixed `ModelCheckpoint(save_top_k=0, save_last=True)` not saving the `last` checkpoint ([#6136](https://github.com/PyTorchLightning/pytorch-lightning/pull/6136))


- Fixed duplicate logs appearing in console when using the python logging module ([#5509](https://github.com/PyTorchLightning/pytorch-lightning/pull/5509), [#6275](https://github.com/PyTorchLightning/pytorch-lightning/pull/6275))


- Fixed `.teardown(stage='fit')` getting called during `trainer.test` ([#6386](https://github.com/PyTorchLightning/pytorch-lightning/pull/6386))


- Fixed `.on_fit_{start,end}()` getting called during `trainer.test` ([#6386](https://github.com/PyTorchLightning/pytorch-lightning/pull/6386))


- Fixed LightningModule `all_gather` on cpu tensors ([#6416](https://github.com/PyTorchLightning/pytorch-lightning/pull/6416))


## [1.2.4] - 2021-03-16

### Changed

- Changed the default of `find_unused_parameters` back to `True` in DDP and DDP Spawn ([#6438](https://github.com/PyTorchLightning/pytorch-lightning/pull/6438))

### Fixed

- Expose DeepSpeed loss parameters to allow users to fix loss instability ([#6115](https://github.com/PyTorchLightning/pytorch-lightning/pull/6115))
- Fixed DP reduction with collection ([#6324](https://github.com/PyTorchLightning/pytorch-lightning/pull/6324))
- Fixed an issue where the tuner would not tune the learning rate if also tuning the batch size ([#4688](https://github.com/PyTorchLightning/pytorch-lightning/pull/4688))
- Fixed broadcast to use PyTorch `broadcast_object_list` and add `reduce_decision` ([#6410](https://github.com/PyTorchLightning/pytorch-lightning/pull/6410))
- Fixed logger creating directory structure too early in DDP ([#6380](https://github.com/PyTorchLightning/pytorch-lightning/pull/6380))
- Fixed DeepSpeed additional memory use on rank 0 when default device not set early enough ([#6460](https://github.com/PyTorchLightning/pytorch-lightning/pull/6460))
- Fixed `DummyLogger.log_hyperparams` raising a `TypeError` when running with `fast_dev_run=True` ([#6398](https://github.com/PyTorchLightning/pytorch-lightning/pull/6398))
- Fixed an issue with `Tuner.scale_batch_size` not finding the batch size attribute in the datamodule ([#5968](https://github.com/PyTorchLightning/pytorch-lightning/pull/5968))
- Fixed an exception in the layer summary when the model contains torch.jit scripted submodules ([#6511](https://github.com/PyTorchLightning/pytorch-lightning/pull/6511))
- Fixed train loop config validation being run during `Trainer.predict` ([#6541](https://github.com/PyTorchLightning/pytorch-lightning/pull/6541))


## [1.2.3] - 2021-03-09

### Fixed

- Fixed `ModelPruning(make_pruning_permanent=True)` pruning buffers getting removed when saved during training ([#6073](https://github.com/PyTorchLightning/pytorch-lightning/pull/6073))
@@ -53,6 +189,12 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed error message for AMP + CPU incompatibility ([#6107](https://github.com/PyTorchLightning/pytorch-lightning/pull/6107))


- Disabled batch transfer in DP mode ([#6098](https://github.com/PyTorchLightning/pytorch-lightning/pull/6098))


- Expose DeepSpeed loss parameters to allow users to fix loss instability ([#6115](https://github.com/PyTorchLightning/pytorch-lightning/pull/6115))


## [1.2.0] - 2021-02-18

### Added
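To ground two of the headline entries above, here is a minimal sketch against the post-merge API: `every_n_val_epochs` from #6146 and `Trainer.validate()` from #4948. The toy model and data are illustrative, not part of the PR.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


data = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
loader = DataLoader(data, batch_size=16)

# `period` is deprecated in favor of `every_n_val_epochs` (#6146)
ckpt = ModelCheckpoint(monitor="val_loss", every_n_val_epochs=2)
trainer = pl.Trainer(max_epochs=4, callbacks=[ckpt])
trainer.fit(TinyModel(), loader, loader)

# Added in the unreleased section above (#4948): one evaluation epoch
# over the validation set
trainer.validate(TinyModel(), val_dataloaders=loader)
```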
15 changes: 15 additions & 0 deletions docs/source/benchmarking/performance.rst
@@ -94,6 +94,21 @@ DP performs three GPU transfers for EVERY batch:

Whereas DDP only performs 1 transfer to sync gradients. Because of this, DDP is MUCH faster than DP.

When using DDP, set find_unused_parameters=False
-------------------------------------------------

By default, ``find_unused_parameters`` is set to ``True``. This guards against compatibility issues that have arisen in the past (see the `discussion <https://github.com/PyTorchLightning/pytorch-lightning/discussions/6219>`_ for more information).
It comes with a performance hit, and in most cases it can safely be disabled.

.. code-block:: python

    import pytorch_lightning as pl
    from pytorch_lightning.plugins import DDPPlugin

    trainer = pl.Trainer(
        gpus=2,
        plugins=DDPPlugin(find_unused_parameters=False),
    )

----------

16-bit precision
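For reference, the flag in the docs addition above is forwarded to PyTorch's own DDP wrapper, where `find_unused_parameters` is a standard constructor argument. A standalone sketch; it assumes an initialized process group and at least one GPU:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed.init_process_group(...) has already been called.
model = torch.nn.Linear(8, 2).cuda()
ddp_model = DDP(model, find_unused_parameters=False)  # skips the per-backward unused-parameter scan
```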
5 changes: 3 additions & 2 deletions pl_examples/basic_examples/autoencoder.py
@@ -22,9 +22,10 @@
import pytorch_lightning as pl
from pl_examples import _DATASETS_PATH, _TORCHVISION_AVAILABLE, _TORCHVISION_MNIST_AVAILABLE, cli_lightning_logo

-if _TORCHVISION_AVAILABLE and _TORCHVISION_MNIST_AVAILABLE:
+if _TORCHVISION_AVAILABLE:
     from torchvision import transforms
-    from torchvision.datasets.mnist import MNIST
+if _TORCHVISION_MNIST_AVAILABLE:
+    from torchvision.datasets import MNIST
 else:
     from tests.helpers.datasets import MNIST

5 changes: 3 additions & 2 deletions pl_examples/basic_examples/backbone_image_classifier.py
@@ -21,9 +21,10 @@
import pytorch_lightning as pl
from pl_examples import _DATASETS_PATH, _TORCHVISION_AVAILABLE, _TORCHVISION_MNIST_AVAILABLE, cli_lightning_logo

-if _TORCHVISION_AVAILABLE and _TORCHVISION_MNIST_AVAILABLE:
+if _TORCHVISION_AVAILABLE:
     from torchvision import transforms
-    from torchvision.datasets.mnist import MNIST
+if _TORCHVISION_MNIST_AVAILABLE:
+    from torchvision.datasets import MNIST
 else:
     from tests.helpers.datasets import MNIST

5 changes: 3 additions & 2 deletions pl_examples/basic_examples/dali_image_classifier.py
@@ -31,9 +31,10 @@
cli_lightning_logo,
)

-if _TORCHVISION_AVAILABLE and _TORCHVISION_MNIST_AVAILABLE:
+if _TORCHVISION_AVAILABLE:
     from torchvision import transforms
-    from torchvision.datasets.mnist import MNIST
+if _TORCHVISION_MNIST_AVAILABLE:
+    from torchvision.datasets import MNIST
 else:
     from tests.helpers.datasets import MNIST

3 changes: 2 additions & 1 deletion pl_examples/basic_examples/mnist_datamodule.py
@@ -20,8 +20,9 @@
from pl_examples import _DATASETS_PATH, _TORCHVISION_AVAILABLE, _TORCHVISION_MNIST_AVAILABLE
from pytorch_lightning import LightningDataModule

-if _TORCHVISION_AVAILABLE and _TORCHVISION_MNIST_AVAILABLE:
+if _TORCHVISION_AVAILABLE:
     from torchvision import transforms as transform_lib
+if _TORCHVISION_MNIST_AVAILABLE:
     from torchvision.datasets import MNIST
 else:
     from tests.helpers.datasets import MNIST
5 changes: 3 additions & 2 deletions pl_examples/domain_templates/generative_adversarial_net.py
@@ -32,9 +32,10 @@
from pytorch_lightning.core import LightningDataModule, LightningModule
from pytorch_lightning.trainer import Trainer

-if _TORCHVISION_AVAILABLE and _TORCHVISION_MNIST_AVAILABLE:
+if _TORCHVISION_AVAILABLE:
     import torchvision
-    import torchvision.transforms as transforms
+    from torchvision import transforms
+if _TORCHVISION_MNIST_AVAILABLE:
     from torchvision.datasets import MNIST
 else:
     from tests.helpers.datasets import MNIST
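The same split-guard pattern appears in all five example files above: `transforms` is gated only on torchvision being installed, while the MNIST import gets its own flag and a bundled fallback, so a broken torchvision MNIST (the subject of the "Hotfix for torchvision (#6476)" commit) no longer blocks the unrelated imports. A self-contained sketch of the idea; the real flags come from `pl_examples/__init__.py`, and the checks below are only an approximation:

```python
from importlib.util import find_spec

# Approximate availability checks; the real implementation may differ.
_TORCHVISION_AVAILABLE = find_spec("torchvision") is not None
_TORCHVISION_MNIST_AVAILABLE = _TORCHVISION_AVAILABLE  # in practice, a deeper health check

if _TORCHVISION_AVAILABLE:
    from torchvision import transforms  # noqa: F401
if _TORCHVISION_MNIST_AVAILABLE:
    from torchvision.datasets import MNIST  # noqa: F401
else:
    from tests.helpers.datasets import MNIST  # noqa: F401  bundled stand-in, repo-only
```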
2 changes: 1 addition & 1 deletion pytorch_lightning/__init__.py
@@ -5,7 +5,7 @@
import time

_this_year = time.strftime("%Y")
-__version__ = '1.2.3'
+__version__ = '1.2.4'
__author__ = 'William Falcon et al.'
__author_email__ = '[email protected]'
__license__ = 'Apache-2.0'
3 changes: 1 addition & 2 deletions pytorch_lightning/accelerators/accelerator.py
@@ -21,7 +21,6 @@
from pytorch_lightning.plugins.precision import ApexMixedPrecisionPlugin, NativeMixedPrecisionPlugin, PrecisionPlugin
from pytorch_lightning.plugins.training_type import TrainingTypePlugin
from pytorch_lightning.utilities.apply_func import move_data_to_device
-from pytorch_lightning.utilities.distributed import all_gather_ddp_if_available
from pytorch_lightning.utilities.enums import AMPType, LightningEnum


@@ -396,7 +395,7 @@ def all_gather(self, tensor: Union[torch.Tensor], group: Optional[Any] = None, s
Return:
A tensor of shape (world_size, batch, ...)
"""
-        return all_gather_ddp_if_available(tensor, group=group, sync_grads=sync_grads)
+        return self.training_type_plugin.all_gather(tensor, group=group, sync_grads=sync_grads)

def process_dataloader(self, dataloader: Union[Iterable, DataLoader]) -> Union[Iterable, DataLoader]:
"""Wraps the dataloader if necessary
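With the change above, `all_gather` routes through the active training-type plugin instead of assuming DDP. User-facing calls are unchanged; a sketch of typical usage inside a `LightningModule` (the metric value here is a placeholder):

```python
import torch
import pytorch_lightning as pl


class GatherExample(pl.LightningModule):
    def validation_step(self, batch, batch_idx):
        loss = torch.tensor(0.123, device=self.device)  # placeholder metric
        # Shape (world_size, *loss.shape), per the docstring in the diff above.
        gathered = self.all_gather(loss)
        if self.trainer.is_global_zero:
            self.print("mean across ranks:", gathered.float().mean().item())
```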
27 changes: 23 additions & 4 deletions pytorch_lightning/accelerators/gpu.py
@@ -1,40 +1,59 @@
 import logging
 import os
+from typing import TYPE_CHECKING, Any
 
 import torch
 
 from pytorch_lightning.accelerators.accelerator import Accelerator
+from pytorch_lightning.plugins import DataParallelPlugin
 from pytorch_lightning.utilities.exceptions import MisconfigurationException
 
+if TYPE_CHECKING:
+    from pytorch_lightning.core.lightning import LightningModule
+    from pytorch_lightning.trainer.trainer import Trainer
+
 _log = logging.getLogger(__name__)
 
 
 class GPUAccelerator(Accelerator):
 
-    def setup(self, trainer, model):
+    def setup(self, trainer: 'Trainer', model: 'LightningModule') -> None:
+        """
+        Raises:
+            MisconfigurationException:
+                If the selected device is not GPU.
+        """
         if "cuda" not in str(self.root_device):
             raise MisconfigurationException(f"Device should be GPU, got {self.root_device} instead")
         self.set_nvidia_flags()
         torch.cuda.set_device(self.root_device)
         return super().setup(trainer, model)
 
-    def on_train_start(self):
+    def on_train_start(self) -> None:
         # clear cache before training
         # use context because of:
         # https://discuss.pytorch.org/t/out-of-memory-when-i-use-torch-cuda-empty-cache/57898
         with torch.cuda.device(self.root_device):
             torch.cuda.empty_cache()
 
-    def on_train_end(self):
+    def on_train_end(self) -> None:
         # clean up memory
         self.model.cpu()
         with torch.cuda.device(self.root_device):
             torch.cuda.empty_cache()
 
     @staticmethod
-    def set_nvidia_flags():
+    def set_nvidia_flags() -> None:
         # set the correct cuda visible devices (using pci order)
         os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
         all_gpu_ids = ",".join([str(x) for x in range(torch.cuda.device_count())])
         devices = os.getenv("CUDA_VISIBLE_DEVICES", all_gpu_ids)
         _log.info(f"LOCAL_RANK: {os.getenv('LOCAL_RANK', 0)} - CUDA_VISIBLE_DEVICES: [{devices}]")
 
+    def to_device(self, batch: Any) -> Any:
+        # no need to transfer batch to device in DP mode
+        # TODO: Add support to allow batch transfer to device in Lightning for DP mode.
+        if not isinstance(self.training_type_plugin, DataParallelPlugin):
+            batch = super().to_device(batch)
+
+        return batch
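The new `to_device` override is the accelerator half of "Disable batch transfer in DP mode" (#6098 in the commit list): `torch.nn.DataParallel` scatters each batch across its GPUs inside `forward`, so pre-moving the whole batch to one device is redundant. A standalone illustration of that scatter (requires at least one CUDA device; the toy module and shapes are illustrative):

```python
import torch
import torch.nn as nn

model = nn.DataParallel(nn.Linear(8, 2).to("cuda:0"))
x = torch.randn(16, 8)  # batch deliberately left on the CPU
out = model(x)          # DataParallel splits dim 0 and moves each chunk itself
print(out.shape)        # torch.Size([16, 2]), gathered back on cuda:0
```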
2 changes: 1 addition & 1 deletion pytorch_lightning/callbacks/early_stopping.py
@@ -190,4 +190,4 @@ def _run_early_stopping_check(self, trainer, pl_module):
trainer.should_stop = True

# stop every ddp process if any world process decides to stop
-        trainer.should_stop = trainer.training_type_plugin.reduce_early_stopping_decision(trainer.should_stop)
+        trainer.should_stop = trainer.training_type_plugin.reduce_boolean_decision(trainer.should_stop)
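The rename reflects that the reduction is now a general boolean decision shared across callbacks (the `ModelCheckpoint` broadcast/reduce update in #6410 uses the same machinery) rather than an early-stopping special case. A minimal sketch of such a reduction under DDP, matching the "stop if any world process decides to stop" comment above; it assumes an initialized process group, and the actual plugin method may differ:

```python
import torch
import torch.distributed as dist


def reduce_boolean_decision(decision: bool) -> bool:
    # Sum the per-rank votes; if any rank says True, every rank gets True.
    vote = torch.tensor(int(decision))
    dist.all_reduce(vote, op=dist.ReduceOp.SUM)
    return bool(vote.item() > 0)
```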