Commit 6b31973

Merge branch 'master' into bugfix/batch-device

2 parents: 3998873 + 8193bae

30 files changed: +409 −161 lines

.azure-pipelines/ipu-tests.yml

Lines changed: 2 additions & 1 deletion
@@ -81,7 +81,7 @@ jobs:
   - bash: |
       source ${{ variables.poplar_sdk }}/poplar-ubuntu*/enable.sh
       source ${{ variables.poplar_sdk }}/popart-ubuntu*/enable.sh
-
+      export POPTORCH_WAIT_FOR_IPU=1
       python -m coverage run --source pytorch_lightning -m pytest pytorch_lightning tests -v --junitxml=$(Build.StagingDirectory)/test-results.xml --durations=50
     env:
       MKL_THREADING_LAYER: "GNU"
@@ -90,6 +90,7 @@ jobs:
   - bash: |
       source ${{ variables.poplar_sdk }}/poplar-ubuntu*/enable.sh
       source ${{ variables.poplar_sdk }}/popart-ubuntu*/enable.sh
+      export POPTORCH_WAIT_FOR_IPU=1
       bash tests/special_tests.sh
     env:
       MKL_THREADING_LAYER: "GNU"

CHANGELOG.md

Lines changed: 15 additions & 10 deletions
@@ -127,6 +127,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added `max_depth` parameter in `ModelSummary` ([#8062](https://github.com/PyTorchLightning/pytorch-lightning/pull/8062))


+- Added `restore` function and `restarting` attribute to base `Loop` ([#8247](https://github.com/PyTorchLightning/pytorch-lightning/pull/8247))
+
+
 ### Changed


@@ -167,6 +170,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 * Refactored trainer `_run_*` functions and separate evaluation loops ([#8065](https://github.com/PyTorchLightning/pytorch-lightning/pull/8065))
 * Refactored prediction loop interface; added new classes `PredictionLoop`, `PredictionEpochLoop` ([#7700](https://github.com/PyTorchLightning/pytorch-lightning/pull/7700), [#8077](https://github.com/PyTorchLightning/pytorch-lightning/pull/8077))
 * Removed `pytorch_lightning/trainer/predict_loop.py` ([#8094](https://github.com/PyTorchLightning/pytorch-lightning/pull/8094))
+* Moved result teardown to the loops ([#8245](https://github.com/PyTorchLightning/pytorch-lightning/pull/8245))


 - Refactored logging
@@ -341,6 +345,17 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Fixed a bug where using `precision=64` would cause buffers with complex dtype to be cast to real ([#8208](https://github.com/PyTorchLightning/pytorch-lightning/pull/8208))


+- Fixed a bug where `truncated_bptt_steps` would throw an AttributeError when the target RNN has multiple hidden states ([#8145](https://github.com/PyTorchLightning/pytorch-lightning/pull/8145))
+
+
+- Fixes access to `callback_metrics` in ddp_spawn ([#7916](https://github.com/PyTorchLightning/pytorch-lightning/pull/7916))
+
+
+- Fixed moving batch to device before sending it to the `on_*_batch_start`/`on_*_batch_end` callbacks and model hooks ([#7378](https://github.com/PyTorchLightning/pytorch-lightning/pull/7378))
+
+
+- Fixed passing a custom `DDPPlugin` when choosing `accelerator="ddp_cpu"` for the accelerator ([#6208](https://github.com/PyTorchLightning/pytorch-lightning/pull/6208))
+

 ## [1.3.8] - 2021-07-01

@@ -357,16 +372,6 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Fixed SWA to also work with `IterableDataset` ([#8172](https://github.com/PyTorchLightning/pytorch-lightning/pull/8172))


-
-- Fixed a bug where `truncated_bptt_steps` would throw an AttributeError when the target RNN has multiple hidden states ([#8145](https://github.com/PyTorchLightning/pytorch-lightning/pull/8145))
-
-
-- Fixes access to `callback_metrics` in ddp_spawn ([#7916](https://github.com/PyTorchLightning/pytorch-lightning/pull/7916))
-
-
-- Fixed moving batch to device before sending it to the `on_*_batch_start`/`on_*_batch_end` callbacks and model hooks ([#7378](https://github.com/PyTorchLightning/pytorch-lightning/pull/7378))
-
-
 ## [1.3.7] - 2021-06-22

 ### Fixed

dockers/nvidia/Dockerfile

Lines changed: 2 additions & 2 deletions
@@ -13,7 +13,7 @@
 # limitations under the License.

 # https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes
-FROM nvcr.io/nvidia/pytorch:21.05-py3
+FROM nvcr.io/nvidia/pytorch:21.06-py3

 LABEL maintainer="PyTorchLightning <https://github.com/PyTorchLightning>"

@@ -39,7 +39,7 @@ RUN \

     # Installations
     python -c "fname = './pytorch-lightning/requirements/extra.txt' ; lines = [line for line in open(fname).readlines() if not line.startswith('horovod')] ; open(fname, 'w').writelines(lines)" && \
-    pip install "Pillow>=8.2" "cryptography>=3.4" "py>=1.10" --no-cache-dir --upgrade-strategy only-if-needed && \
+    pip install "Pillow>=8.2, !=8.3.0" "cryptography>=3.4" "py>=1.10" --no-cache-dir --upgrade-strategy only-if-needed && \
     pip install -r ./pytorch-lightning/requirements/extra.txt --no-cache-dir --upgrade-strategy only-if-needed && \
     pip install -r ./pytorch-lightning/requirements/examples.txt --no-cache-dir --upgrade-strategy only-if-needed && \
     pip install ./pytorch-lightning --no-cache-dir && \

pytorch_lightning/callbacks/model_checkpoint.py

Lines changed: 2 additions & 2 deletions
@@ -102,7 +102,7 @@ class ModelCheckpoint(Callback):
             saved (``model.save_weights(filepath)``), else the full model
             is saved (``model.save(filepath)``).
         every_n_train_steps: Number of training steps between checkpoints.
-            If ``every_n_train_steps == None or every_n_train_steps == 0``, we skip saving during training
+            If ``every_n_train_steps == None or every_n_train_steps == 0``, we skip saving during training.
             To disable, set ``every_n_train_steps = 0``. This value must be ``None`` or non-negative.
             This must be mutually exclusive with ``train_time_interval`` and ``every_n_val_epochs``.
         train_time_interval: Checkpoints are monitored at the specified time interval.
@@ -111,7 +111,7 @@ class ModelCheckpoint(Callback):
             guaranteed to execute at the exact time specified, but should be close.
             This must be mutually exclusive with ``every_n_train_steps`` and ``every_n_val_epochs``.
         every_n_val_epochs: Number of validation epochs between checkpoints.
-            If ``every_n_val_epochs == None or every_n_val_epochs == 0``, we skip saving on validation end
+            If ``every_n_val_epochs == None or every_n_val_epochs == 0``, we skip saving on validation end.
             To disable, set ``every_n_val_epochs = 0``. This value must be ``None`` or non-negative.
             This must be mutually exclusive with ``every_n_train_steps`` and ``train_time_interval``.
         Setting both ``ModelCheckpoint(..., every_n_val_epochs=V)`` and
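As a usage illustration of the triggers documented above (a minimal sketch; the `dirpath` value is an arbitrary example, not from this commit):

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# Save every 500 optimizer steps. `every_n_train_steps`,
# `train_time_interval` and `every_n_val_epochs` are mutually exclusive,
# so only one is set here; 0 or None disables step-based checkpointing.
checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",
    every_n_train_steps=500,
)
```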

pytorch_lightning/core/lightning.py

Lines changed: 1 addition & 1 deletion
@@ -543,7 +543,7 @@ def write_prediction(
             ' and will be removed in v1.5.'
         )

-        self.trainer.evaluation_loop.predictions._add_prediction(name, value, filename)
+        self.trainer._evaluation_loop.predictions._add_prediction(name, value, filename)

    def write_prediction_dict(self, predictions_dict: Dict[str, Any], filename: str = 'predictions.pt'):
        """

pytorch_lightning/loops/base.py

Lines changed: 19 additions & 3 deletions
@@ -46,6 +46,15 @@ class Loop(ABC):
     def __init__(self) -> None:
         self.iteration_count: int = 0
         self.trainer: Optional['pl.Trainer'] = None
+        self._restarting = False
+
+    @property
+    def restarting(self) -> bool:
+        return self._restarting
+
+    @restarting.setter
+    def restarting(self, restarting: bool) -> None:
+        self._restarting = restarting

     @property
     @abstractmethod
@@ -87,7 +96,12 @@ def run(self, *args: Any, **kwargs: Any) -> Optional[Any]:
         if self.skip:
             return self.on_skip()

-        self.reset()
+        if self.restarting:
+            self.restore()
+            self.restarting = False
+        else:
+            self.reset()
+
         self.on_run_start(*args, **kwargs)

         while not self.done:
@@ -100,9 +114,11 @@ def run(self, *args: Any, **kwargs: Any) -> Optional[Any]:
                 break

         output = self.on_run_end()
-        self.teardown()
         return output

+    def restore(self) -> None:
+        """Restore the internal state of the loop at the beginning of :attr:`run` if ``restarting`` is ``True``."""
+
     @abstractmethod
     def reset(self) -> None:
         """Resets the internal state of the loop at the beginning of each call to :attr:`run`."""
@@ -132,7 +148,7 @@ def on_run_end(self) -> Any:
         """Hook to be called at the end of the run. Its return argument is returned from :attr:`run`."""

     def teardown(self) -> None:
-        """The very last method called inside :meth:`run`. Use to release memory etc."""
+        """Use to release memory etc."""

     def load_state_dict(self, state_dict: Dict) -> None:
         """Restore the loop state from the provided state_dict."""

pytorch_lightning/loops/batch/training_batch_loop.py

Lines changed: 7 additions & 5 deletions
@@ -29,6 +29,7 @@
 from pytorch_lightning.trainer.connectors.logger_connector.result import ResultCollection
 from pytorch_lightning.trainer.supporters import TensorRunningAccum
 from pytorch_lightning.utilities import AMPType, AttributeDict, DeviceType, grad_norm
+from pytorch_lightning.utilities.apply_func import apply_to_collection
 from pytorch_lightning.utilities.exceptions import MisconfigurationException
 from pytorch_lightning.utilities.finite_checks import detect_nan_parameters
 from pytorch_lightning.utilities.imports import _TPU_AVAILABLE
@@ -47,7 +48,7 @@ def __init__(self) -> None:
         self.running_loss: TensorRunningAccum = TensorRunningAccum(window_length=20)
         self.batch_idx: int = 0
         self.split_idx: Optional[int] = None
-        self.warning_cache: WarningCache = WarningCache()
+        self._warning_cache: WarningCache = WarningCache()

         self._hiddens: Optional[Tensor] = None
         self._optimizer_freq_cumsum: Optional[int] = None
@@ -75,7 +76,7 @@ def run(self, batch: Any, batch_idx: int, dataloader_idx: int) -> AttributeDict:
             dataloader_idx: the index of the dataloader producing the current batch
         """
         if batch is None:
-            self.warning_cache.warn("train_dataloader yielded None. If this was on purpose, ignore this warning...")
+            self._warning_cache.warn("train_dataloader yielded None. If this was on purpose, ignore this warning...")
             return AttributeDict(signal=0, training_step_output=[[]])

         # hook
@@ -349,7 +350,8 @@ def _process_training_step_output(self, training_step_output: STEP_OUTPUT) -> Op
         if isinstance(training_step_output, dict):
             loss = training_step_output.pop("loss", None)
             hiddens = training_step_output.pop("hiddens", None)
-
+            # detach hiddens to avoid `RuntimeError: Trying to backward through the graph a second time`
+            hiddens = apply_to_collection(hiddens, Tensor, lambda t: t.detach())
             results.extra = training_step_output

         # handle scalar return
@@ -546,7 +548,7 @@ def training_step_and_backward(
                 self._check_finite(result.loss)

         else:
-            self.warning_cache.warn(
+            self._warning_cache.warn(
                 "training_step returned None. If this was on purpose, ignore this warning..."
             )

@@ -648,7 +650,7 @@ def _build_kwargs(self, batch: Any, batch_idx: int, opt_idx: int, hiddens: Optio
         has_opt_idx_in_train_step = is_param_in_hook_signature(training_step_fx, "optimizer_idx")
         if has_opt_idx_in_train_step:
             if not lightning_module.automatic_optimization:
-                self.warning_cache.deprecation(
+                self._warning_cache.deprecation(
                     "`training_step` hook signature has changed in v1.3."
                     " `optimizer_idx` argument has been removed in case of manual optimization. Support for"
                     " the old signature will be removed in v1.5"

pytorch_lightning/loops/dataloader/evaluation_loop.py

Lines changed: 9 additions & 8 deletions
@@ -33,9 +33,11 @@ def __init__(self):
         super().__init__()
         self._max_batches: Optional[Union[int, Sequence[int]]] = None
         self.outputs = []
+
         self.epoch_loop = EvaluationEpochLoop()
-        self._has_run: bool = False
+
         self._results = ResultCollection(training=False)
+        self._has_run: bool = False

     @property
     def num_dataloaders(self) -> int:
@@ -57,11 +59,6 @@ def dataloaders(self) -> Sequence[DataLoader]:
             return self.trainer.test_dataloaders
         return self.trainer.val_dataloaders

-    @property
-    def results(self) -> ResultCollection:
-        """Returns the current results"""
-        return self._results
-
     @property
     def predictions(self):
         """Returns the predictions from all dataloaders"""
@@ -184,8 +181,8 @@ def on_evaluation_start(self, *args: Any, **kwargs: Any) -> None:
         """Runs ``on_{validation/test}_start`` hooks"""
         self.should_track_batch_outputs_for_epoch_end: bool = self._should_track_batch_outputs_for_epoch_end()

-        assert self.results is not None
-        self.results.to(device=self.trainer.lightning_module.device)
+        assert self._results is not None
+        self._results.to(device=self.trainer.lightning_module.device)

         if self.trainer.testing:
             self.trainer.call_hook("on_test_start", *args, **kwargs)
@@ -266,3 +263,7 @@ def on_evaluation_epoch_end(self) -> None:
         self.trainer.call_hook(hook_name)
         self.trainer.call_hook("on_epoch_end")
         self.trainer.logger_connector.on_epoch_end()
+
+    def teardown(self) -> None:
+        self._results.cpu()
+        self.epoch_loop.teardown()
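The new `teardown` is the counterpart of `on_evaluation_start` above: results are moved to the model's device for the run and back to CPU afterwards, releasing accelerator memory. A rough sketch of the round trip (illustrative only; the device string is an assumption):

```python
from pytorch_lightning.trainer.connectors.logger_connector.result import ResultCollection

results = ResultCollection(training=False)
results.to(device="cuda:0")  # on_evaluation_start: metrics live on the GPU
# ... evaluation batches update `results` here ...
results.cpu()                # teardown: move back to host, freeing GPU memory
```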

pytorch_lightning/loops/dataloader/prediction_loop.py

Lines changed: 4 additions & 1 deletion
@@ -16,9 +16,12 @@ class PredictionLoop(DataLoaderLoop):

     def __init__(self):
         super().__init__()
-        self.epoch_loop: PredictionEpochLoop = PredictionEpochLoop()
         self.predictions: Optional[List[List[Any]]] = None
         self.epoch_batch_indices: Optional[List[List[int]]] = None
+
+        self.epoch_loop: PredictionEpochLoop = PredictionEpochLoop()
+
+        self._results = None  # for `trainer._results` access
         self._return_predictions: bool = False

     @property

pytorch_lightning/loops/epoch/evaluation_epoch_loop.py

Lines changed: 3 additions & 4 deletions
@@ -122,11 +122,10 @@ def advance(

     def on_run_end(self) -> List[STEP_OUTPUT]:
         """Returns the outputs of the whole run"""
-        return self.outputs
-
-    def teardown(self) -> None:
-        """Frees memory of tracked outputs"""
+        outputs = self.outputs
+        # free memory
         self.outputs = []
+        return outputs

     def evaluation_step(self, batch: Any, batch_idx: int, dataloader_idx: int) -> Optional[STEP_OUTPUT]:
         """The evaluation step (validation_step or test_step depending on the trainer's state).
