
Commit e7b6379

Merge branch 'master' into rich/default
2 parents: 0bdc732 + fde326d

102 files changed (+1758 additions, -1568 deletions)


.circleci/config.yml

Lines changed: 13 additions & 0 deletions

@@ -5,6 +5,19 @@ orbs:
   go: circleci/[email protected]
   codecov: codecov/[email protected]
 
+trigger:
+  tags:
+    include:
+      - '*'
+  branches:
+    include:
+      - "master"
+      - "release/*"
+      - "refs/tags/*"
+pr:
+  - "master"
+  - "release/*"
+
 # Workflow Steps:
 # 1. Checkout
 # 2. Install GO

CHANGELOG.md

Lines changed: 75 additions & 10 deletions

@@ -39,6 +39,13 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 
 - Added `LightningCLI.configure_optimizers` to override the `configure_optimizers` return value ([#10860](https://github.com/PyTorchLightning/pytorch-lightning/issues/10860))
 
+
+- Added a warning that shows when `max_epochs` in the `Trainer` is not set ([#10700](https://github.com/PyTorchLightning/pytorch-lightning/issues/10700))
+
+
+- Added `console_kwargs` for `RichProgressBar` to initialize inner Console ([#10875](https://github.com/PyTorchLightning/pytorch-lightning/pull/10875))
+
+
 ### Changed
 
 - Raised exception in `init_dist_connection()` when torch distibuted is not available ([#10418](https://github.com/PyTorchLightning/pytorch-lightning/issues/10418))
@@ -59,7 +66,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Changes in `LightningCLI` required for the new major release of jsonargparse v4.0.0 ([#10426](https://github.com/PyTorchLightning/pytorch-lightning/pull/10426))
 
 
-- Renamed `refresh_rate_per_second` parameter to `referesh_rate` for `RichProgressBar` signature ([#10497](https://github.com/PyTorchLightning/pytorch-lightning/pull/10497))
+- Renamed `refresh_rate_per_second` parameter to `refresh_rate` for `RichProgressBar` signature ([#10497](https://github.com/PyTorchLightning/pytorch-lightning/pull/10497))
 
 
 - Moved ownership of the `PrecisionPlugin` into `TrainingTypePlugin` and updated all references ([#10570](https://github.com/PyTorchLightning/pytorch-lightning/pull/10570))
@@ -89,6 +96,28 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - `RichProgressBar` is promoted to be the default progress bar ([#10912](https://github.com/PyTorchLightning/pytorch-lightning/pull/10912)
 
 
+- Changed `training_step`, `validation_step`, `test_step` and `predict_step` method signatures in `Accelerator` and updated input from caller side ([#10908](https://github.com/PyTorchLightning/pytorch-lightning/pull/10908))
+
+
+- Changed the name of the temporary checkpoint that the `DDPSpawnPlugin` and related plugins save ([#10934](https://github.com/PyTorchLightning/pytorch-lightning/pull/10934))
+
+
+- Redesigned process creation for spawn-based plugins (`DDPSpawnPlugin`, `TPUSpawnPlugin`, etc.) ([#10896](https://github.com/PyTorchLightning/pytorch-lightning/pull/10896))
+  * All spawn-based plugins now spawn processes immediately upon calling `Trainer.{fit,validate,test,predict}`
+  * The hooks/callbacks `prepare_data`, `setup`, `configure_sharded_model` and `teardown` now run under initialized process group for spawn-based plugins just like their non-spawn counterparts
+  * Some configuration errors that were previously raised as `MisconfigurationException`s will now be raised as `ProcessRaisedException` (torch>=1.8) or as `Exception` (torch<1.8)
+
+
+- Changed `batch_to_device` entry in profiling from stage-specific to generic, to match profiling of other hooks ([#11031](https://github.com/PyTorchLightning/pytorch-lightning/pull/11031))
+
+
+- Changed the info message for finalizing ddp-spawn worker processes to a debug-level message ([#10864](https://github.com/PyTorchLightning/pytorch-lightning/pull/10864))
+
+
+- Removed duplicated file extension when uploading model checkpoints with `NeptuneLogger` ([#11015](https://github.com/PyTorchLightning/pytorch-lightning/pull/11015))
+
+
+
 ### Deprecated
 
 - Deprecated `ClusterEnvironment.master_{address,port}` in favor of `ClusterEnvironment.main_{address,port}` ([#10103](https://github.com/PyTorchLightning/pytorch-lightning/issues/10103))
@@ -109,6 +138,18 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Deprecated the access to the attribute `IndexBatchSamplerWrapper.batch_indices` in favor of `IndexBatchSamplerWrapper.seen_batch_indices` ([#10870](https://github.com/PyTorchLightning/pytorch-lightning/pull/10870))
 
 
+- Deprecated `on_init_start` and `on_init_end` callback hooks ([#10940](https://github.com/PyTorchLightning/pytorch-lightning/pull/10940))
+
+
+- Deprecated `Trainer.call_hook` in favor of `Trainer._call_callback_hooks`, `Trainer._call_lightning_module_hook`, `Trainer._call_ttp_hook`, and `Trainer._call_accelerator_hook` ([#10979](https://github.com/PyTorchLightning/pytorch-lightning/pull/10979))
+
+
+- Deprecated `TrainingTypePlugin.post_dispatch` in favor of `TrainingTypePlugin.teardown` ([#10939](https://github.com/PyTorchLightning/pytorch-lightning/pull/10939))
+
+
+- Deprecated `ModelIO.on_hpc_{save/load}` in favor of `CheckpointHooks.on_{save/load}_checkpoint` ([#10911](https://github.com/PyTorchLightning/pytorch-lightning/pull/10911))
+
+
 ### Removed
 
 - Removed deprecated parameter `method` in `pytorch_lightning.utilities.model_helpers.is_overridden` ([#10507](https://github.com/PyTorchLightning/pytorch-lightning/pull/10507))
@@ -208,36 +249,60 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Removed `model_sharded_context` method from `Accelerator` ([#10886](https://github.com/PyTorchLightning/pytorch-lightning/pull/10886))
 
 
-- Removed method `pre_dispatch` from the `PrecisionPlugin` method ([#10887](https://github.com/PyTorchLightning/pytorch-lightning/pull/10887))
+- Removed method `pre_dispatch` from the `PrecisionPlugin` ([#10887](https://github.com/PyTorchLightning/pytorch-lightning/pull/10887))
 
 
-### Fixed
+- Removed method `setup_optimizers_in_pre_dispatch` from the `strategies` and achieve the same logic in `setup` and `pre_dispatch` methods ([#10906](https://github.com/PyTorchLightning/pytorch-lightning/pull/10906))
 
-- Fixed an issue with `SignalConnector` not restoring the default signal handlers on teardown when running on SLURM or with fault-tolerant training enabled ([#10611](https://github.com/PyTorchLightning/pytorch-lightning/pull/10611))
 
+- Removed methods `pre_dispatch`, `dispatch` and `post_dispatch` from the `Accelerator` ([#10885](https://github.com/PyTorchLightning/pytorch-lightning/pull/10885))
 
-- Fixed `SignalConnector._has_already_handler` check for callable type ([#10483](https://github.com/PyTorchLightning/pytorch-lightning/pull/10483))
 
+- Removed method `training_step`, `test_step`, `validation_step` and `predict_step` from the `Accelerator` ([#10890](https://github.com/PyTorchLightning/pytorch-lightning/pull/10890))
 
-- Disabled batch_size extraction for torchmetric instances because they accumulate the metrics internally ([#10815](https://github.com/PyTorchLightning/pytorch-lightning/pull/10815))
 
+- Removed `TrainingTypePlugin.start_{training,evaluating,predicting}` hooks and the same in all subclasses ([#10989](https://github.com/PyTorchLightning/pytorch-lightning/pull/10989), [#10896](https://github.com/PyTorchLightning/pytorch-lightning/pull/10896))
 
-- Improved exception message if `rich` version is less than `10.2.2` ([#10839](https://github.com/PyTorchLightning/pytorch-lightning/pull/10839))
 
+- Removed `Accelerator.on_train_start` ([#10999](https://github.com/PyTorchLightning/pytorch-lightning/pull/10999))
 
-- Fixed uploading best model checkpoint in NeptuneLogger ([#10369](https://github.com/PyTorchLightning/pytorch-lightning/pull/10369))
+### Fixed
 
+- Fixed running sanity check with `RichProgressBar` ([#10913](https://github.com/PyTorchLightning/pytorch-lightning/pull/10913))
 
-- Fixed early schedule reset logic in PyTorch profiler that was causing data leak ([#10837](https://github.com/PyTorchLightning/pytorch-lightning/pull/10837))
 
+- Fixed support for `CombinedLoader` while checking for warning raised with eval dataloaders ([#10994](https://github.com/PyTorchLightning/pytorch-lightning/pull/10994))
 
-- Fixed a bug that caused incorrect batch indices to be passed to the `BasePredictionWriter` hooks when using a dataloader with `num_workers > 0` ([#10870](https://github.com/PyTorchLightning/pytorch-lightning/pull/10870))
+
+- Fixed a bug where the DeepSpeedPlugin arguments `cpu_checkpointing` and `contiguous_memory_optimization` were not being forwarded to deepspeed correctly ([#10874](https://github.com/PyTorchLightning/pytorch-lightning/issues/10874))
 
 
+- Fixed support for logging within callbacks returned from `LightningModule` ([#10991](https://github.com/PyTorchLightning/pytorch-lightning/pull/10991))
+
 
 -
 
 
+-
+
+
+## [1.5.5] - 2021-12-07
+
+### Fixed
+
+- Disabled batch_size extraction for torchmetric instances because they accumulate the metrics internally ([#10815](https://github.com/PyTorchLightning/pytorch-lightning/pull/10815))
+- Fixed an issue with `SignalConnector` not restoring the default signal handlers on teardown when running on SLURM or with fault-tolerant training enabled ([#10611](https://github.com/PyTorchLightning/pytorch-lightning/pull/10611))
+- Fixed `SignalConnector._has_already_handler` check for callable type ([#10483](https://github.com/PyTorchLightning/pytorch-lightning/pull/10483))
+- Fixed an issue to return the results for each dataloader separately instead of duplicating them for each ([#10810](https://github.com/PyTorchLightning/pytorch-lightning/pull/10810))
+- Improved exception message if `rich` version is less than `10.2.2` ([#10839](https://github.com/PyTorchLightning/pytorch-lightning/pull/10839))
+- Fixed uploading best model checkpoint in NeptuneLogger ([#10369](https://github.com/PyTorchLightning/pytorch-lightning/pull/10369))
+- Fixed early schedule reset logic in PyTorch profiler that was causing data leak ([#10837](https://github.com/PyTorchLightning/pytorch-lightning/pull/10837))
+- Fixed a bug that caused incorrect batch indices to be passed to the `BasePredictionWriter` hooks when using a dataloader with `num_workers > 0` ([#10870](https://github.com/PyTorchLightning/pytorch-lightning/pull/10870))
+- Fixed an issue with item assignment on the logger on rank > 0 for those who support it ([#10917](https://github.com/PyTorchLightning/pytorch-lightning/pull/10917))
+- Fixed importing `torch_xla.debug` for `torch-xla<1.8` ([#10836](https://github.com/PyTorchLightning/pytorch-lightning/pull/10836))
+- Fixed an issue with `DDPSpawnPlugin` and related plugins leaving a temporary checkpoint behind ([#10934](https://github.com/PyTorchLightning/pytorch-lightning/pull/10934))
+- Fixed a `TypeError` occuring in the `SingalConnector.teardown()` method ([#10961](https://github.com/PyTorchLightning/pytorch-lightning/pull/10961))
+
 
 ## [1.5.4] - 2021-11-30
 
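Taken together, the two `RichProgressBar` entries above (the `refresh_rate` rename from #10497 and the new `console_kwargs` from #10875) would be used roughly as follows. This is a minimal sketch, assuming `console_kwargs` is forwarded to the inner `rich.console.Console` and that the chosen options exist in the installed `rich` version:

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import RichProgressBar

    # `refresh_rate` replaces the old `refresh_rate_per_second` argument (#10497);
    # `console_kwargs` initializes the rich Console used by the progress bar (#10875).
    progress_bar = RichProgressBar(
        refresh_rate=1,
        console_kwargs={"force_terminal": True, "width": 120},  # assumed rich.Console options
    )

    trainer = Trainer(callbacks=[progress_bar])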
docs/source/_templates/layout.html

Lines changed: 1 addition & 1 deletion

@@ -4,7 +4,7 @@
 {% block footer %}
 {{ super() }}
 <script script type="text/javascript">
-  var collapsedSections = ['Best practices', 'Optional extensions', 'Tutorials', 'API References', 'Bolts', 'Examples', 'Partner Domain Frameworks', 'Community'];
+  var collapsedSections = ['Best practices', 'Optional Extensions', 'Tutorials', 'API References', 'Bolts', 'Examples', 'Partner Domain Frameworks', 'Community'];
 </script>
 
 {% endblock %}

docs/source/advanced/sequences.rst

Lines changed: 0 additions & 32 deletions
This file was deleted.

docs/source/advanced/training_tricks.rst

Lines changed: 48 additions & 0 deletions

@@ -154,3 +154,51 @@ Advanced GPU Optimizations
 
 When training on single or multiple GPU machines, Lightning offers a host of advanced optimizations to improve throughput, memory efficiency, and model scaling.
 Refer to :doc:`Advanced GPU Optimized Training for more details <../advanced/advanced_gpu>`.
+
+----------
+
+Sharing Datasets Across Process Boundaries
+------------------------------------------
+The :class:`~pytorch_lightning.DataModule` class provides an organized way to decouple data loading from training logic, with :meth:`~pytorch_lightning.DataModule.prepare_data` being used for downloading and pre-processing the dataset on a single process, and :meth:`~pytorch_lightning.DataModule.setup` loading the pre-processed data for each process individually:
+
+.. code-block:: python
+
+    class MNISTDataModule(pl.LightningDataModule):
+        def prepare_data(self):
+            MNIST(self.data_dir, download=True)
+
+        def setup(self, stage: Optional[str] = None):
+            self.mnist = MNIST(self.data_dir)
+
+        def train_loader(self):
+            return DataLoader(self.mnist, batch_size=128)
+
+However, for in-memory datasets, that means that each process will hold a (redundant) replica of the dataset in memory, which may be impractical when using many processes while utilizing datasets that nearly fit into CPU memory, as the memory consumption will scale up linearly with the number of processes.
+For example, when training Graph Neural Networks, a common strategy is to load the entire graph into CPU memory for fast access to the entire graph structure and its features, and to then perform neighbor sampling to obtain mini-batches that fit onto the GPU.
+
+A simple way to prevent redundant dataset replicas is to rely on :obj:`torch.multiprocessing` to share the `data automatically between spawned processes via shared memory <https://pytorch.org/docs/stable/notes/multiprocessing.html>`_.
+For this, all data pre-loading should be done on the main process inside :meth:`DataModule.__init__`.
+As a result, all tensor-data will get automatically shared when using the :class:`~pytorch_lightning.plugins.DDPSpawnPlugin` training type plugin:
+
+.. warning::
+
+    :obj:`torch.multiprocessing` will send a handle of each individual tensor to other processes.
+    In order to prevent any errors due to too many open file handles, try to reduce the number of tensors to share, *e.g.*, by stacking your data into a single tensor.
+
+.. code-block:: python
+
+    class MNISTDataModule(pl.LightningDataModule):
+        def __init__(self, data_dir: str):
+            self.mnist = MNIST(data_dir, download=True, transform=T.ToTensor())
+
+        def train_loader(self):
+            return DataLoader(self.mnist, batch_size=128)
+
+
+    model = Model(...)
+    datamodule = MNISTDataModule("data/MNIST")
+
+    trainer = Trainer(gpus=2, strategy="ddp_spawn")
+    trainer.fit(model, datamodule)
+
+See the `graph-level <https://github.com/pyg-team/pytorch_geometric/blob/master/examples/pytorch_lightning/gin.py>`_ and `node-level <https://github.com/pyg-team/pytorch_geometric/blob/master/examples/pytorch_lightning/graph_sage.py>`_ prediction examples in PyTorch Geometric for practical use-cases.
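The warning in the new section above suggests stacking data into a single tensor to keep the number of shared-memory handles small. A rough sketch of that advice, with all pre-loading done once in `__init__`; the class and argument names are illustrative, not part of the Lightning API:

    import torch
    import pytorch_lightning as pl
    from torch.utils.data import DataLoader, TensorDataset


    class StackedDataModule(pl.LightningDataModule):
        def __init__(self, images: torch.Tensor, targets: torch.Tensor):
            super().__init__()
            # One contiguous tensor per field, so only two handles are shared with the spawned workers.
            self.dataset = TensorDataset(images.contiguous(), targets.contiguous())

        def train_dataloader(self):
            return DataLoader(self.dataset, batch_size=128)


    # Hypothetical usage with the spawn-based plugin discussed above.
    datamodule = StackedDataModule(torch.randn(60000, 1, 28, 28), torch.randint(0, 10, (60000,)))
    trainer = pl.Trainer(gpus=2, strategy="ddp_spawn")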

docs/source/common/debugging.rst

Lines changed: 2 additions & 2 deletions

@@ -79,10 +79,10 @@ argument of :class:`~pytorch_lightning.trainer.trainer.Trainer`)
 
 .. testcode::
 
-    # use only 1% of training data (and use the same training dataloader (with shuffle off) in val and test)
+    # use only 1% of training data (and turn off validation)
     trainer = Trainer(overfit_batches=0.01)
 
-    # similar, but with a fixed 10 batches no matter the size of the dataset
+    # similar, but with a fixed 10 batches
    trainer = Trainer(overfit_batches=10)
 
 With this flag, the train, val, and test sets will all be the same train set. We will also replace the sampler

docs/source/common/lightning_cli.rst

Lines changed: 48 additions & 1 deletion

@@ -80,7 +80,7 @@ LightningCLI
 
 The implementation of training command line tools is done via the :class:`~pytorch_lightning.utilities.cli.LightningCLI`
 class. The minimal installation of pytorch-lightning does not include this support. To enable it, either install
-Lightning as :code:`pytorch-lightning[extra]` or install the package :code:`jsonargparse[signatures]`.
+Lightning as :code:`pytorch-lightning[extra]` or install the package :code:`pip install -U jsonargparse[signatures]`.
 
 The case in which the user's :class:`~pytorch_lightning.core.lightning.LightningModule` class implements all required
 :code:`*_dataloader` methods, a :code:`trainer.py` tool can be as simple as:
@@ -757,6 +757,34 @@ Instantiation links are used to automatically determine the order of instantiati
 <https://jsonargparse.readthedocs.io/en/stable/#jsonargparse.core.ArgumentParser.link_arguments>`_.
 
 
+Variable Interpolation
+^^^^^^^^^^^^^^^^^^^^^^
+
+The linking of arguments is intended for things that are meant to be non-configurable. This improves the CLI user
+experience since it avoids the need for providing more parameters. A related concept is
+variable interpolation which in contrast keeps things being configurable.
+
+The YAML standard defines anchors and aliases which is a way to reuse the content in multiple places of the YAML. This is
+supported in the ``LightningCLI`` though it has limitations. Support for OmegaConf's more powerful `variable
+interpolation <https://omegaconf.readthedocs.io/en/2.1_branch/usage.html#variable-interpolation>`__ will be available
+out of the box if this package is installed. To install it run :code:`pip install omegaconf`. Then to enable the use
+of OmegaConf in a ``LightningCLI``, when instantiating a parameter needs to be given for the parser as follows:
+
+.. testcode::
+
+    cli = LightningCLI(MyModel, parser_kwargs={"parser_mode": "omegaconf"})
+
+With the encoder-decoder example model above a possible YAML that uses variable interpolation could be the following:
+
+.. code-block:: yaml
+
+    model:
+      encoder_layers: 12
+      decoder_layers:
+      - ${model.encoder_layers}
+      - 4
+
+
 Optimizers and learning rate schedulers
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -929,6 +957,25 @@ You can also pass the class path directly, for example, if the optimizer hasn't
     --gen_discriminator.init_args.lr=0.0001
 
 
+Troubleshooting
+^^^^^^^^^^^^^^^
+
+The standard behavior for CLIs, when they fail, is to terminate the process with a non-zero exit code and a short message
+to hint the user about the cause. This is problematic while developing the CLI since there is no information to track
+down the root of the problem. A simple change in the instantiation of the ``LightningCLI`` can be used such that when
+there is a failure an exception is raised and the full stack trace printed.
+
+.. testcode::
+
+    cli = LightningCLI(MyModel, parser_kwargs={"error_handler": None})
+
+.. note::
+
+    When asking about problems and reporting issues please set the ``error_handler`` to ``None`` and include the stack
+    trace in your description. With this, it is more likely for people to help out identifying the cause without needing
+    to create a reproducible script.
+
+
 Notes related to reproducibility
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
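The two `parser_kwargs` shown in the documentation changes above can be combined in a single entry point. A minimal sketch, assuming `omegaconf` and `jsonargparse[signatures]` are installed; `MyModel` here is only a placeholder so the script is self-contained:

    import torch
    from pytorch_lightning import LightningModule
    from pytorch_lightning.utilities.cli import LightningCLI


    class MyModel(LightningModule):
        """Placeholder module; replace with your own LightningModule."""

        def __init__(self, encoder_layers: int = 12):
            super().__init__()
            self.save_hyperparameters()
            self.layer = torch.nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return torch.nn.functional.cross_entropy(self.layer(x), y)

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)


    if __name__ == "__main__":
        # "parser_mode": "omegaconf" enables ${...} interpolation in YAML configs;
        # "error_handler": None raises the underlying exception with a full stack trace.
        cli = LightningCLI(MyModel, parser_kwargs={"parser_mode": "omegaconf", "error_handler": None})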
