
Commit 721a7a6

Merge branch 'master' into add/rich_logging
2 parents: 266bd66 + 92c7eec


11 files changed (+188 -104 lines)


CHANGELOG.md

Lines changed: 26 additions & 30 deletions
@@ -63,12 +63,15 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added `DataLoaderIterDataFetcher` ([#9020](https://github.com/PyTorchLightning/pytorch-lightning/pull/9020))
 
 
-- Added Rich Progress Bar ([#8929](https://github.com/PyTorchLightning/pytorch-lightning/pull/8929))
+- Added `DataFetcher` within `Fit / Evaluation` Loop ([#9047](https://github.com/PyTorchLightning/pytorch-lightning/pull/9047))
 
 
 - Added a friendly error message when DDP attempts to spawn new distributed processes with rank > 0 ([#9005](https://github.com/PyTorchLightning/pytorch-lightning/pull/9005))
 
 
+- Added Rich Progress Bar ([#8929](https://github.com/PyTorchLightning/pytorch-lightning/pull/8929))
+
+
 ### Changed
 
 - Parsing of the `gpus` Trainer argument has changed: `gpus="n"` (str) no longer selects the GPU index n and instead selects the first n devices. ([#8770](https://github.com/PyTorchLightning/pytorch-lightning/pull/8770))
@@ -183,55 +186,48 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 
 ### Fixed
 
-- Ensure the existence of `DDPPlugin._sync_dir` in `reconciliate_processes` ([#8939](https://github.com/PyTorchLightning/pytorch-lightning/pull/8939))
-
-- Restore original loaders if replaced by entrypoint ([#8885](https://github.com/PyTorchLightning/pytorch-lightning/pull/8885))
-
-- Fixed `trainer.fit_loop.split_idx` always returning `None` ([#8601](https://github.com/PyTorchLightning/pytorch-lightning/pull/8601))
-
-
-- Fixed references for `ResultCollection.extra` ([#8622](https://github.com/PyTorchLightning/pytorch-lightning/pull/8622))
-
-
-- Fixed reference issues during epoch end result collection ([#8621](https://github.com/PyTorchLightning/pytorch-lightning/pull/8621))
-
-
-- Fixed horovod auto-detection when horovod is not installed and the launcher is `mpirun` ([#8610](https://github.com/PyTorchLightning/pytorch-lightning/pull/8610))
-
-
-- Fixed an issue with `training_step` outputs not getting collected correctly for `training_epoch_end` ([#8613](https://github.com/PyTorchLightning/pytorch-lightning/pull/8613))
-
-
 - Fixed save/load/resume from checkpoint for DeepSpeed Plugin (
   [#8397](https://github.com/PyTorchLightning/pytorch-lightning/pull/8397),
   [#8644](https://github.com/PyTorchLightning/pytorch-lightning/pull/8644),
   [#8627](https://github.com/PyTorchLightning/pytorch-lightning/pull/8627))
 
 
-- Fixed recursive call for `apply_to_collection(include_none=False)` ([#8719](https://github.com/PyTorchLightning/pytorch-lightning/pull/8719))
-
-
 - Fixed an issue with logger outputs not being finalized correctly after prediction runs ([#8333](https://github.com/PyTorchLightning/pytorch-lightning/issues/8333))
 
 
-- Fixed `StochasticWeightAveraging` with a list of learning rates not applying them to each param group ([#8747](https://github.com/PyTorchLightning/pytorch-lightning/issues/8747))
+- Fixed bug where data-loading functions where not getting the correct running stage passed ([#8858](https://github.com/PyTorchLightning/pytorch-lightning/pull/8858))
 
+- Fixed a bug in the binary search mode of auto batch size scaling where exception was thrown if the first trainer run resulted in OOM ([#8954](https://github.com/PyTorchLightning/pytorch-lightning/pull/8954))
 
-- Fixed truncated backprop through time enablement when set as a property on the LightningModule and not the Trainer ([#8804](https://github.com/PyTorchLightning/pytorch-lightning/pull/8804/))
 
+## [1.4.3] - 2021-08-17
 
 - Fixed plateau scheduler stepping on incomplete epoch ([#8861](https://github.com/PyTorchLightning/pytorch-lightning/pull/8861))
+- Fixed infinite loop with `CycleIterator` and multiple loaders ([#8889](https://github.com/PyTorchLightning/pytorch-lightning/pull/8889))
+- Fixed `StochasticWeightAveraging` with a list of learning rates not applying them to each param group ([#8747](https://github.com/PyTorchLightning/pytorch-lightning/issues/8747))
+- Restore original loaders if replaced by entrypoint ([#8885](https://github.com/PyTorchLightning/pytorch-lightning/pull/8885))
+- Fixed lost reference to `_Metadata` object in `ResultMetricCollection` ([#8932](https://github.com/PyTorchLightning/pytorch-lightning/pull/8932))
+- Ensure the existence of `DDPPlugin._sync_dir` in `reconciliate_processes` ([#8939](https://github.com/PyTorchLightning/pytorch-lightning/pull/8939))
 
 
-- Fixed infinite loop with CycleIterator and multiple loaders ([#8889](https://github.com/PyTorchLightning/pytorch-lightning/pull/8889))
-
+## [1.4.2] - 2021-08-10
 
-- Fixed lost reference to `_Metadata` object in `ResultMetricCollection` ([#8932](https://github.com/PyTorchLightning/pytorch-lightning/pull/8932))
+- Fixed recursive call for `apply_to_collection(include_none=False)` ([#8719](https://github.com/PyTorchLightning/pytorch-lightning/pull/8719))
+- Fixed truncated backprop through time enablement when set as a property on the LightningModule and not the Trainer ([#8804](https://github.com/PyTorchLightning/pytorch-lightning/pull/8804/))
+- Fixed comments and exception message for metrics_to_scalars ([#8782](https://github.com/PyTorchLightning/pytorch-lightning/pull/8782/))
+- Fixed typo error in LightningLoggerBase.after_save_checkpoint docstring ([#8737](https://github.com/PyTorchLightning/pytorch-lightning/pull/8737/))
 
 
-- Fixed bug where data-loading functions where not getting the correct running stage passed ([#8858](https://github.com/PyTorchLightning/pytorch-lightning/pull/8858))
+## [1.4.1] - 2021-08-03
 
-- Fixed a bug in the binary search mode of auto batch size scaling where exception was thrown if the first trainer run resulted in OOM ([#8954](https://github.com/PyTorchLightning/pytorch-lightning/pull/8954))
+- Fixed `trainer.fit_loop.split_idx` always returning `None` ([#8601](https://github.com/PyTorchLightning/pytorch-lightning/pull/8601))
+- Fixed references for `ResultCollection.extra` ([#8622](https://github.com/PyTorchLightning/pytorch-lightning/pull/8622))
+- Fixed reference issues during epoch end result collection ([#8621](https://github.com/PyTorchLightning/pytorch-lightning/pull/8621))
+- Fixed horovod auto-detection when horovod is not installed and the launcher is `mpirun` ([#8610](https://github.com/PyTorchLightning/pytorch-lightning/pull/8610))
+- Fixed an issue with `training_step` outputs not getting collected correctly for `training_epoch_end` ([#8613](https://github.com/PyTorchLightning/pytorch-lightning/pull/8613))
+- Fixed distributed types support for CPUs ([#8667](https://github.com/PyTorchLightning/pytorch-lightning/pull/8667))
+- Fixed a deadlock issue with DDP and torchelastic ([#8655](https://github.com/PyTorchLightning/pytorch-lightning/pull/8655))
+- Fixed `accelerator=ddp` choice for CPU ([#8645](https://github.com/PyTorchLightning/pytorch-lightning/pull/8645))
 
 
 ## [1.4.0] - 2021-07-27
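The `gpus` parsing change listed under "Changed" is the most likely upgrade surprise in this file; a minimal sketch of the new behavior, assuming a machine with at least four CUDA devices:

```python
from pytorch_lightning import Trainer

# Since #8770, a string digit is treated as a count, not an index:
# this now selects the first 3 GPUs (cuda:0, cuda:1, cuda:2), not the device with index 3.
trainer = Trainer(gpus="3")

# To target specific device indices, pass a list or a comma-separated string.
trainer = Trainer(gpus=[3])     # only cuda:3
trainer = Trainer(gpus="1,3")   # cuda:1 and cuda:3
```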

docs/source/starter/new-project.rst

Lines changed: 5 additions & 4 deletions
@@ -120,13 +120,14 @@ A :doc:`lightning module <../common/lightning_module>` defines a *system* not a
 Examples of systems are:
 
 - `Autoencoder <https://github.com/PyTorchLightning/lightning-bolts/blob/master/pl_bolts/models/autoencoders/basic_ae/basic_ae_module.py>`_
-- `BERT <https://colab.research.google.com/github/PytorchLightning/pytorch-lightning/blob/master/notebooks/04-transformers-text-classification.ipynb>`_
-- `DQN <https://colab.research.google.com/github/PytorchLightning/pytorch-lightning/blob/master/notebooks/08-Domain-specific-demos.ipynb>`_
-- `GAN <https://colab.research.google.com/github/PytorchLightning/pytorch-lightning/blob/master/notebooks/03-basic-gan.ipynb>`_
-- `Image classifier <https://colab.research.google.com/github/PytorchLightning/pytorch-lightning/blob/master/notebooks/01-mnist-hello-world.ipynb>`_
+- `BERT <https://colab.research.google.com/github/PyTorchLightning/lightning-tutorials/blob/publication/.notebooks/lightning_examples/text-transformers.ipynb>`_
+- `DQN <https://colab.research.google.com/github/PyTorchLightning/lightning-tutorials/blob/publication/.notebooks/lightning_examples/reinforce-learning-DQN.ipynb>`_
+- `GAN <https://colab.research.google.com/github/PyTorchLightning/lightning-tutorials/blob/publication/.notebooks/lightning_examples/basic-gan.ipynb>`_
+- `Image classifier <https://colab.research.google.com/github/PyTorchLightning/lightning-tutorials/blob/publication/.notebooks/lightning_examples/mnist-hello-world.ipynb>`_
 - Seq2seq
 - `SimCLR <https://github.com/PyTorchLightning/lightning-bolts/blob/master/pl_bolts/models/self_supervised/simclr/simclr_module.py>`_
 - `VAE <https://github.com/PyTorchLightning/lightning-bolts/blob/master/pl_bolts/models/autoencoders/basic_vae/basic_vae_module.py>`_
+- `and a lot more <https://github.com/PyTorchLightning/lightning-tutorials/tree/publication/.notebooks/lightning_examples>`_
 
 Under the hood a LightningModule is still just a :class:`torch.nn.Module` that groups all research code into a single file to make it self-contained:
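As context for the last line of this hunk, a minimal sketch of such a self-contained "system"; the module and hyperparameters here are illustrative, not taken from the linked notebooks:

```python
import torch
from torch import nn
import pytorch_lightning as pl


class LitAutoEncoder(pl.LightningModule):
    """A tiny autoencoder system: model, training step and optimizer in one class."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

    def training_step(self, batch, batch_idx):
        x, _ = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return nn.functional.mse_loss(x_hat, x)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```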

pytorch_lightning/loops/batch/training_batch_loop.py

Lines changed: 2 additions & 0 deletions
@@ -280,6 +280,8 @@ def _training_step(
                 training_step_output = self.trainer.accelerator.training_step(step_kwargs)
                 self.trainer.accelerator.post_training_step()
 
+            del step_kwargs
+
             training_step_output = self.trainer.call_hook("training_step_end", training_step_output)
 
             _check_training_step_output(self.trainer.lightning_module, training_step_output)

pytorch_lightning/loops/dataloader/evaluation_loop.py

Lines changed: 7 additions & 5 deletions
@@ -12,15 +12,14 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-from typing import Any, List, Optional, Sequence, Union
+from typing import Any, Iterator, List, Optional, Sequence, Union
 
 from deprecate.utils import void
 from torch.utils.data.dataloader import DataLoader
 
 from pytorch_lightning.loops.dataloader import DataLoaderLoop
 from pytorch_lightning.loops.epoch import EvaluationEpochLoop
 from pytorch_lightning.trainer.connectors.logger_connector.result import ResultCollection
-from pytorch_lightning.utilities.fetching import DataFetcher
 from pytorch_lightning.utilities.model_helpers import is_overridden
 from pytorch_lightning.utilities.types import EPOCH_OUTPUT
 
@@ -98,10 +97,13 @@ def on_run_start(self, *args: Any, **kwargs: Any) -> None:
     def advance(self, *args: Any, **kwargs: Any) -> None:
         """Performs evaluation on one single dataloader"""
         void(*args, **kwargs)
+
         dataloader = self.trainer.accelerator.process_dataloader(self.current_dataloader)
-        data_fetcher = DataFetcher()
-        data_fetcher.setup(dataloader)
-        dataloader_iter = enumerate(data_fetcher)
+        dataloader = self.trainer.data_connector.get_profiled_dataloader(
+            dataloader, dataloader_idx=self.current_dataloader_idx
+        )
+        dataloader_iter = iter(dataloader)
+
         dl_max_batches = self._max_batches[self.current_dataloader_idx]
 
         dl_outputs = self.epoch_loop.run(

pytorch_lightning/loops/epoch/evaluation_epoch_loop.py

Lines changed: 3 additions & 2 deletions
@@ -91,8 +91,9 @@ def advance(
         if batch is None:
             raise StopIteration
 
-        with self.trainer.profiler.profile("evaluation_batch_to_device"):
-            batch = self.trainer.accelerator.batch_to_device(batch, dataloader_idx=dataloader_idx)
+        if not self.trainer.data_connector.evaluation_data_fetcher.store_on_device:
+            with self.trainer.profiler.profile("evaluation_batch_to_device"):
+                batch = self.trainer.accelerator.batch_to_device(batch, dataloader_idx=dataloader_idx)
 
         self.batch_progress.increment_ready()
pytorch_lightning/loops/epoch/training_epoch_loop.py

Lines changed: 4 additions & 5 deletions
@@ -132,14 +132,13 @@ def advance(self, dataloader_iter: Iterator, **kwargs: Any) -> None:
         else:
             _, (batch, is_last) = next(dataloader_iter)
 
-        # ------------------------------------
-        # TRAINING_STEP + TRAINING_STEP_END
-        # ------------------------------------
-        # FIXME: Remove with InterBatchProcessor.
-        if not self.trainer.data_connector.data_fetcher.store_on_device:
+        if not self.trainer.data_connector.train_data_fetcher.store_on_device:
             with self.trainer.profiler.profile("training_batch_to_device"):
                 batch = self.trainer.accelerator.batch_to_device(batch)
 
+        # ------------------------------------
+        # TRAINING_STEP + TRAINING_STEP_END
+        # ------------------------------------
         self.batch_progress.increment_ready()
 
         with self.trainer.profiler.profile("run_training_batch"):
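The `_, (batch, is_last) = next(dataloader_iter)` unpacking reflects the iterator contract produced by `DataConnector.get_profiled_dataloader`, which wraps the regular fetchers in `enumerate(...)` so each step yields an index plus a `(batch, is_last)` pair. A rough standalone sketch of that contract, assuming `DataFetcher.setup` still accepts a bare dataloader as in the pre-change `EvaluationLoop.advance`:

```python
from torch.utils.data import DataLoader
from pytorch_lightning.utilities.fetching import DataFetcher

fetcher = DataFetcher()
fetcher.setup(DataLoader(range(6), batch_size=2))

# Each iteration yields the next pre-fetched batch plus a flag that turns True
# once the underlying dataloader is exhausted after this batch.
for batch_idx, (batch, is_last) in enumerate(fetcher):
    print(batch_idx, batch, is_last)
```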

pytorch_lightning/loops/fit_loop.py

Lines changed: 5 additions & 4 deletions
@@ -14,7 +14,7 @@
 
 import logging
 from contextlib import suppress
-from typing import Optional
+from typing import Iterator, Optional
 
 from pytorch_lightning.loops import Loop
 from pytorch_lightning.loops.epoch import TrainingEpochLoop
@@ -192,12 +192,13 @@ def on_advance_start(self) -> None:
 
     def advance(self) -> None:
         """Runs one whole epoch."""
-        train_dataloader = self.trainer.accelerator.process_dataloader(self.trainer.train_dataloader)
-        train_dataloader = self.trainer.data_connector.get_profiled_train_dataloader(train_dataloader)
+        dataloader = self.trainer.accelerator.process_dataloader(self.trainer.train_dataloader)
+        dataloader = self.trainer.data_connector.get_profiled_dataloader(dataloader)
+        dataloader_iter = iter(dataloader)
 
         with self.trainer.profiler.profile("run_training_epoch"):
             # run train epoch
-            epoch_output = self.epoch_loop.run(train_dataloader)
+            epoch_output = self.epoch_loop.run(dataloader_iter)
 
             if epoch_output is None:
                 return

pytorch_lightning/loops/processors/iterator_batch_processor.py

Lines changed: 1 addition & 1 deletion
@@ -113,7 +113,7 @@ def run(self, dataloader_iter: Iterator) -> Optional[AttributeDict]:
         Args:
             dataloader_iter: the iterator over the dataloader producing the new batch
         """
-        _, (dataloader_iter, batch_idx, is_last) = next(dataloader_iter)
+        batch_idx, (dataloader_iter, is_last) = next(dataloader_iter)
 
         self.trainer.logger_connector.on_batch_start()
         response = self.trainer.call_hook("on_batch_start")

pytorch_lightning/trainer/connectors/data_connector.py

Lines changed: 65 additions & 13 deletions
@@ -11,22 +11,47 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-
+import os
+from functools import partial
 from typing import Callable, Iterable, Optional, Union
 
 import pytorch_lightning as pl
 from pytorch_lightning.utilities import rank_zero_deprecation
 from pytorch_lightning.utilities.exceptions import MisconfigurationException
-from pytorch_lightning.utilities.fetching import AbstractDataFetcher, DataFetcher, InterBatchParallelDataFetcher
+from pytorch_lightning.utilities.fetching import (
+    AbstractDataFetcher,
+    DataFetcher,
+    DataLoaderIterDataFetcher,
+    InterBatchParallelDataFetcher,
+)
 from pytorch_lightning.utilities.model_helpers import is_overridden
+from pytorch_lightning.utilities.signature_utils import is_param_in_hook_signature
 from pytorch_lightning.utilities.types import EVAL_DATALOADERS, TRAIN_DATALOADERS
+from pytorch_lightning.utilities.warnings import rank_zero_warn
 
 
 class DataConnector:
-    def __init__(self, trainer: "pl.Trainer", multiple_trainloader_mode: str = "max_size_cycle"):
+    def __init__(
+        self,
+        trainer: "pl.Trainer",
+        multiple_trainloader_mode: str = "max_size_cycle",
+        train_data_fetcher: Optional[AbstractDataFetcher] = None,
+        validate_data_fetcher: Optional[AbstractDataFetcher] = None,
+        test_data_fetcher: Optional[AbstractDataFetcher] = None,
+    ):
         self.trainer = trainer
         self.multiple_trainloader_mode = multiple_trainloader_mode
-        self.data_fetcher: AbstractDataFetcher = DataFetcher()
+
+        self.train_data_fetcher = train_data_fetcher
+        self.validate_data_fetcher = validate_data_fetcher
+        self.test_data_fetcher = test_data_fetcher
+        self.sanity_check_data_fetcher: Optional[AbstractDataFetcher] = None
+
+    @property
+    def evaluation_data_fetcher(self) -> Optional[AbstractDataFetcher]:
+        if self.trainer.sanity_checking:
+            return self.sanity_check_data_fetcher
+        return self.test_data_fetcher if self.trainer.testing else self.validate_data_fetcher
 
     def on_trainer_init(
         self,
@@ -66,15 +91,42 @@ def on_trainer_init(
         self.trainer.reload_dataloaders_every_n_epochs = reload_dataloaders_every_n_epochs
         self.trainer._is_data_prepared = False
 
-    def get_profiled_train_dataloader(self, train_dataloader) -> Iterable:
-        # FIXME: Temporary hack
-        if isinstance(self.data_fetcher, InterBatchParallelDataFetcher):
-            self.data_fetcher.setup(train_dataloader, batch_to_device=self.trainer.accelerator.batch_to_device)
-        else:
-            self.data_fetcher.setup(train_dataloader)
-        prefetcher_iter = iter(self.data_fetcher)
-        profiled_dl = self.trainer.profiler.profile_iterable(enumerate(prefetcher_iter), "get_train_batch")
-        return profiled_dl
+    def _check_training_step_requires_dataloader_iter(self) -> bool:
+        training_step_fx = getattr(self.trainer.lightning_module, "training_step")
+        contains_dataloader_iter = is_param_in_hook_signature(training_step_fx, "dataloader_iter", explicit=True)
+        return contains_dataloader_iter
+
+    def _select_data_fetcher(self) -> AbstractDataFetcher:
+        if self.trainer.sanity_checking:
+            return DataFetcher()
+
+        if self.trainer.training and self._check_training_step_requires_dataloader_iter():
+            rank_zero_warn(
+                "Found `dataloader_iter` argument in the `training_step`. Note that the support for "
+                "this signature is experimental and the behavior is subject to change."
+            )
+            return DataLoaderIterDataFetcher()
+        elif self.trainer.training and os.getenv("PL_INTER_BATCH_PARALLELISM", "0") == "1":
+            # note: this is an experimental feature
+            if not self.trainer.training_type_plugin.on_gpu:
+                raise MisconfigurationException("Inter batch parallelism is available only when using Nvidia GPUs.")
+            return InterBatchParallelDataFetcher()
+
+        return DataFetcher()
+
+    def get_profiled_dataloader(self, dataloader: Iterable, dataloader_idx: int = 0) -> Iterable:
+        stage: str = self.trainer.state.stage.value
+        data_fetcher = setattr(self, f"{stage}_data_fetcher", None) or self._select_data_fetcher()
+        data_fetcher.setup(
+            dataloader,
+            stage=stage,
+            batch_to_device=partial(self.trainer.accelerator.batch_to_device, dataloader_idx=dataloader_idx),
+            profiler=self.trainer.profiler,
+        )
+        setattr(self, f"{stage}_data_fetcher", data_fetcher)
+        if isinstance(data_fetcher, DataLoaderIterDataFetcher):
+            return data_fetcher
+        return enumerate(data_fetcher)
 
     def prepare_data(self) -> None:
         # on multi-gpu jobs we only want to manipulate (download, etc) on node_rank=0, local_rank=0
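For orientation, the selection logic added above can be exercised in two (both experimental) ways; a rough sketch, assuming the rest of the script is an ordinary `Trainer.fit(...)` run:

```python
import os
import pytorch_lightning as pl


# 1) Requesting the raw iterator: declaring `dataloader_iter` explicitly in
#    `training_step` makes `_select_data_fetcher()` pick a
#    `DataLoaderIterDataFetcher` and emit the "experimental" warning above.
class ManualFetchModel(pl.LightningModule):
    def training_step(self, dataloader_iter):
        # The loop hands over the fetcher itself instead of a pre-fetched batch;
        # how batches are drawn from it is experimental and may change.
        ...


# 2) Inter-batch parallelism: opt in via the environment flag checked in
#    `_select_data_fetcher()`; GPU-only, otherwise a MisconfigurationException is raised.
os.environ["PL_INTER_BATCH_PARALLELISM"] = "1"
```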
