
Commit ca2ff4b

Merge branch 'master' into refactor/loops/loops_everywhere
2 parents b090e4f + 0c958c5 commit ca2ff4b

Large commit: only part of the diff is shown below.

Showing 61 changed files with 1,408 additions and 866 deletions.

.azure-pipelines/gpu-tests.yml

Lines changed: 2 additions & 12 deletions
```diff
@@ -55,13 +55,9 @@ jobs:
     displayName: 'Image info & NVIDIA'

   - bash: |
-      export GIT_TERMINAL_PROMPT=1
-      #sudo apt-get install -y cmake
-      # python -m pip install "pip==20.1"
-      pip install --requirement requirements.txt
       python -c "fname = 'requirements/extra.txt' ; lines = [line for line in open(fname).readlines() if 'horovod' not in line] ; open(fname, 'w').writelines(lines)"
-      pip install --requirement ./requirements/devel.txt --upgrade-strategy only-if-needed
-      pip install fairscale>=0.3.4 --upgrade-strategy only-if-needed
+      pip install fairscale>=0.3.4
+      pip install . --requirement requirements/devel.txt
       pip list
     displayName: 'Install dependencies'
@@ -114,15 +110,9 @@ jobs:

   - script: |
       set -e
-      python setup.py install --user
-      rm -rf pytorch_lightning
-      pip list
       python -m pytest pl_examples -v --maxfail=2 --durations=0
       bash pl_examples/run_examples-args.sh --trainer.gpus 1 --trainer.max_epochs 1 --data.batch_size 64 --trainer.limit_train_batches 5 --trainer.limit_val_batches 3
       bash pl_examples/run_ddp-examples.sh --trainer.max_epochs 1 --data.batch_size 32 --trainer.limit_train_batches 2 --trainer.limit_val_batches 2
-      # cd pl_examples/basic_examples
-      # bash submit_ddp_job.sh
-      # bash submit_ddp2_job.sh
     env:
       PL_USE_MOCKED_MNIST: "1"
     displayName: 'Examples'
```

.github/workflows/ci_pkg-install.yml

Lines changed: 1 addition & 2 deletions
```diff
@@ -15,8 +15,7 @@ jobs:
       fail-fast: false
       # max-parallel: 6
       matrix:
-        # PyTorch 1.5 is failing on Win and bolts requires torchvision>=0.5
-        os: [ubuntu-20.04, macOS-10.15 , windows-2019]  #
+        os: [ubuntu-20.04, macOS-10.15, windows-2019]
         python-version: [3.6, 3.9]

     steps:
```

CHANGELOG.md

Lines changed: 35 additions & 9 deletions
```diff
@@ -9,6 +9,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

 ### Added

+- Added support to `LightningModule.to_torchscript` for saving to custom filesystems with fsspec ([#7617](https://github.com/PyTorchLightning/pytorch-lightning/pull/7617))
+
+
 - Added `KubeflowEnvironment` for use with the `PyTorchJob` operator in Kubeflow


@@ -18,7 +21,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added support for checkpointing based on a provided time interval during training ([#7515](https://github.com/PyTorchLightning/pytorch-lightning/pull/7515))


-- Added dataclasses for progress tracking ([#6603](https://github.com/PyTorchLightning/pytorch-lightning/pull/6603))
+- Added dataclasses for progress tracking (
+    [#6603](https://github.com/PyTorchLightning/pytorch-lightning/pull/6603),
+    [#7574](https://github.com/PyTorchLightning/pytorch-lightning/pull/7574))


 - Added argument `trainer.predict(ckpt_path)` ([#7430](https://github.com/PyTorchLightning/pytorch-lightning/pull/7430))
@@ -33,8 +38,12 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added correct `dataloader_idx` to batch transfer hooks ([#6241](https://github.com/PyTorchLightning/pytorch-lightning/pull/6241))


+- Added `ddp_fully_sharded` support ([#7487](https://github.com/PyTorchLightning/pytorch-lightning/pull/7487))
+
+
 ### Changed

+- Changed calling of `untoggle_optimizer(opt_idx)` out of the closure function ([#7563](https://github.com/PyTorchLightning/pytorch-lightning/pull/7563)

 - Changed the `Trainer`'s `checkpoint_callback` argument to allow only boolean values ([#7539](https://github.com/PyTorchLightning/pytorch-lightning/pull/7539))

@@ -74,38 +83,55 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - MLflowLogger now uses the env variable `MLFLOW_TRACKING_URI` as default tracking uri ([#7457](https://github.com/PyTorchLightning/pytorch-lightning/pull/7457))


+- MLFlowLogger now accepts `run_name` as an constructor argument ([#7622](https://github.com/PyTorchLightning/pytorch-lightning/issues/7622))
+
+
+- Changed `teardown()` in `Accelerator` to allow `training_type_plugin` to customize `teardown` logic ([#7579](https://github.com/PyTorchLightning/pytorch-lightning/pull/7579))
+
+
 ### Deprecated


 - Deprecated `TrainerModelHooksMixin` in favor of `pytorch_lightning.utilities.signature_utils` ([#7422](https://github.com/PyTorchLightning/pytorch-lightning/pull/7422))


-- Deprecated `num_nodes` and `sync_batchnorm` arguments in `DDPPlugin` and `DDPSpawnPlugin` ([7026](https://github.com/PyTorchLightning/pytorch-lightning/pull/7026))
+- Deprecated `num_nodes` and `sync_batchnorm` arguments in `DDPPlugin` and `DDPSpawnPlugin` ([#7026](https://github.com/PyTorchLightning/pytorch-lightning/pull/7026))


 ### Removed

-- Prune deprecated classif. metrics from `pytorch_lightning.metrics.functional.classification` ([7499](https://github.com/PyTorchLightning/pytorch-lightning/pull/7499))
+- Prune deprecated classif. metrics from `pytorch_lightning.metrics.functional.classification` ([#7499](https://github.com/PyTorchLightning/pytorch-lightning/pull/7499))


-- Removed deprecated data parallel classes `LightningDataParallel` and `LightningDistributedDataParallel` from `pytorch_lightning.overrides.data_parallel` ([7510](https://github.com/PyTorchLightning/pytorch-lightning/pull/7510))
+- Removed deprecated data parallel classes `LightningDataParallel` and `LightningDistributedDataParallel` from `pytorch_lightning.overrides.data_parallel` ([#7510](https://github.com/PyTorchLightning/pytorch-lightning/pull/7510))


-- Removed deprecated trainer attributes - `get_model` and `accelerator_backend` ([7502](https://github.com/PyTorchLightning/pytorch-lightning/pull/7502))
+- Removed deprecated trainer attributes - `get_model` and `accelerator_backend` ([#7502](https://github.com/PyTorchLightning/pytorch-lightning/pull/7502))


-- Removed deprecated utils modules `model_utils`, `warning_utils`, `xla_device_utils` and partially `argparse_utils` ([7503](https://github.com/PyTorchLightning/pytorch-lightning/pull/7503))
+- Removed support for `self.log(tbptt_reduce_fx)` and `self.log(tbptt_pad_token)`. Please, open a discussion explaining your use-case if you relied on these. ([#7644](https://github.com/PyTorchLightning/pytorch-lightning/pull/7644))
+
+
+- Removed deprecated utils modules `model_utils`, `warning_utils`, `xla_device_utils` and partially `argparse_utils` ([#7503](https://github.com/PyTorchLightning/pytorch-lightning/pull/7503))


 - Removed deprecated trainer attributes - `on_cpu`, `on_tpu`, `use_tpu`, `on_gpu`, `use_dp`, `use_ddp`, `use_ddp2`, `use_horovod`, `use_single_gpu` ([#7501](https://github.com/PyTorchLightning/pytorch-lightning/pull/7501))


 ### Fixed

+- Fixed dataloaders are not reset when tuning the model ([#7566](https://github.com/PyTorchLightning/pytorch-lightning/pull/7566))
+

 - Fixed parsing of multiple training dataloaders ([#7433](https://github.com/PyTorchLightning/pytorch-lightning/pull/7433))


+- Fixed broadcasting in multi-node, multi-gpu DDP using torch 1.7 ([#7592](https://github.com/PyTorchLightning/pytorch-lightning/pull/7592))
+
+
+- Fixed `ProgressBar` pickling after calling `trainer.predict` ([#7608](https://github.com/PyTorchLightning/pytorch-lightning/pull/7608))
+
+
 - Fixed recursive passing of `wrong_type` keyword argument in `pytorch_lightning.utilities.apply_to_collection` ([#7433](https://github.com/PyTorchLightning/pytorch-lightning/pull/7433))


@@ -1326,7 +1352,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Fixed getting `experiment_id` from MLFlow only once instead of each training loop ([#3394](https://github.com/PyTorchLightning/pytorch-lightning/pull/3394))
 - Fixed `overfit_batches` which now correctly disables shuffling for the training loader. ([#3501](https://github.com/PyTorchLightning/pytorch-lightning/pull/3501))
 - Fixed gradient norm tracking for `row_log_interval > 1` ([#3489](https://github.com/PyTorchLightning/pytorch-lightning/pull/3489))
-- Fixed `ModelCheckpoint` name formatting ([3164](https://github.com/PyTorchLightning/pytorch-lightning/pull/3163))
+- Fixed `ModelCheckpoint` name formatting ([#3164](https://github.com/PyTorchLightning/pytorch-lightning/pull/3163))
 - Fixed example implementation of AutoEncoder ([#3190](https://github.com/PyTorchLightning/pytorch-lightning/pull/3190))
 - Fixed invalid paths when remote logging with TensorBoard ([#3236](https://github.com/PyTorchLightning/pytorch-lightning/pull/3236))
 - Fixed change `t()` to `transpose()` as XLA devices do not support `.t()` on 1-dim tensor ([#3252](https://github.com/PyTorchLightning/pytorch-lightning/pull/3252))
@@ -1586,8 +1612,8 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added option `save_last` to save the model at the end of every epoch in `ModelCheckpoint` ([#1908](https://github.com/PyTorchLightning/pytorch-lightning/pull/1908))
 - Early stopping checks `on_validation_end` ([#1458](https://github.com/PyTorchLightning/pytorch-lightning/pull/1458))
 - Speed up single-core TPU training by loading data using `ParallelLoader` ([#2033](https://github.com/PyTorchLightning/pytorch-lightning/pull/2033))
-- Added a model hook `transfer_batch_to_device` that enables moving custom data structures to the target device ([1756](https://github.com/PyTorchLightning/pytorch-lightning/pull/1756))
-- Added [black](https://black.readthedocs.io/en/stable/) formatter for the code with code-checker on pull ([1610](https://github.com/PyTorchLightning/pytorch-lightning/pull/1610))
+- Added a model hook `transfer_batch_to_device` that enables moving custom data structures to the target device ([#1756](https://github.com/PyTorchLightning/pytorch-lightning/pull/1756))
+- Added [black](https://black.readthedocs.io/en/stable/) formatter for the code with code-checker on pull ([#1610](https://github.com/PyTorchLightning/pytorch-lightning/pull/1610))
 - Added back the slow spawn ddp implementation as `ddp_spawn` ([#2115](https://github.com/PyTorchLightning/pytorch-lightning/pull/2115))
 - Added loading checkpoints from URLs ([#1667](https://github.com/PyTorchLightning/pytorch-lightning/pull/1667))
 - Added a callback method `on_keyboard_interrupt` for handling KeyboardInterrupt events during training ([#2134](https://github.com/PyTorchLightning/pytorch-lightning/pull/2134))
```

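For the fsspec-backed `to_torchscript` entry in the first `### Added` hunk above, a minimal sketch of the usage it enables — the bucket path and module are hypothetical, and this assumes the standard `to_torchscript(file_path=..., method=...)` signature plus an installed fsspec backend such as `s3fs`:

```python
import pytorch_lightning as pl
from torch import nn


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.layer(x.view(x.size(0), -1))


model = LitModel()
# file_path may now point at any fsspec-supported filesystem (s3://, gs://, ...)
# instead of only the local disk; the matching backend package must be installed.
model.to_torchscript(file_path="s3://my-bucket/model.ts", method="script")
```
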
docs/source/advanced/multi_gpu.rst

Lines changed: 2 additions & 31 deletions
```diff
@@ -75,39 +75,10 @@ register the tensor as a buffer in your modules's ``__init__`` method with :meth

 Remove samplers
 ^^^^^^^^^^^^^^^
-In PyTorch, you must use :class:`~torch.utils.data.distributed.DistributedSampler`
-for multi-node or TPU training. The sampler makes sure each GPU sees the appropriate part of your data.

-.. testcode::
-
-    # without lightning
-    def train_dataloader(self):
-        dataset = MNIST(...)
-        sampler = None
-
-        if self.on_tpu:
-            sampler = DistributedSampler(dataset)
-
-        return DataLoader(dataset, sampler=sampler)
-
-Lightning adds the correct samplers when needed, so no need to explicitly add samplers.
-
-.. testcode::
-
-    # with lightning
-    def train_dataloader(self):
-        dataset = MNIST(...)
-        return DataLoader(dataset)
-
-.. note::
-    By default it will add ``shuffle=True`` for train sampler and ``shuffle=False`` for val/test sampler.
-    ``drop_last`` in :class:`~torch.utils.data.distributed.DistributedSampler` will be set to its default value in PyTorch.
-    If you called :func:`~pytorch_lightning.utilities.seed.seed_everyting`, Lightning will set the same seed for the
-    sampler.
-
-.. note:: You can disable this behavior with ``Trainer(replace_sampler_ddp=False)``
+:class:`~torch.utils.data.distributed.DistributedSampler` is automatically handled by Lightning.

-.. note:: For iterable datasets, we don't do this automatically.
+See :ref:`replace-sampler-ddp` for more information.


 Synchronize validation and test logging
```

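The pattern that the removed docs example walked through now reduces to returning a plain dataloader. A short sketch of that reduced form (the `MNIST` dataset and transform are illustrative placeholders only):

```python
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor


# inside your LightningModule / LightningDataModule
def train_dataloader(self):
    dataset = MNIST("./data", train=True, download=True, transform=ToTensor())
    # no sampler argument: under DDP/TPU, Lightning wraps the loader in a
    # DistributedSampler automatically (opt out with replace_sampler_ddp=False)
    return DataLoader(dataset, batch_size=32)
```
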
docs/source/clouds/cluster.rst

Lines changed: 1 addition & 24 deletions
```diff
@@ -300,27 +300,4 @@ Set the ``NCCL_DEBUG=INFO`` environment variable to see the ACTUAL error.

 .. code-block:: bash

-    python NCCL_DEBUG=INFO train.py ...
-
-
-Distributed sampler
--------------------
-
-Normally now you would need to add a
-:class:`~torch.utils.data.distributed.DistributedSampler` to your dataset, however
-Lightning automates this for you. But if you still need to set a sampler set the Trainer flag
-:paramref:`~pytorch_lightning.Trainer.replace_sampler_ddp` to ``False``.
-
-Here's an example of how to add your own sampler (again, not needed with Lightning).
-
-.. testcode::
-
-    # in your LightningModule
-    def train_dataloader(self):
-        dataset = MyDataset()
-        dist_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
-        dataloader = Dataloader(dataset, sampler=dist_sampler)
-        return dataloader
-
-    # in your training script
-    trainer = Trainer(replace_sampler_ddp=False)
+    NCCL_DEBUG=INFO python train.py ...
```

docs/source/common/hyperparameters.rst

Lines changed: 1 addition & 17 deletions
```diff
@@ -152,23 +152,7 @@ improve readability and reproducibility.
     model = LitMNIST.load_from_checkpoint(PATH, loss_fx=torch.nn.SomeOtherLoss, generator_network=MyGenerator())


-3. Assign to `self.hparams`. Anything assigned to `self.hparams` will also be saved automatically.
-
-.. code-block:: python
-
-    # using a argparse.Namespace
-    class LitMNIST(LightningModule):
-        def __init__(self, hparams, *args, **kwargs):
-            super().__init__()
-            self.hparams = hparams
-            self.layer_1 = nn.Linear(28 * 28, self.hparams.layer_1_dim)
-            self.layer_2 = nn.Linear(self.hparams.layer_1_dim, self.hparams.layer_2_dim)
-            self.layer_3 = nn.Linear(self.hparams.layer_2_dim, 10)
-        def train_dataloader(self):
-            return DataLoader(mnist_train, batch_size=self.hparams.batch_size)
-
-
-4. You can also save full objects such as `dict` or `Namespace` to the checkpoint.
+3. You can also save full objects such as `dict` or `Namespace` to the checkpoint.

 .. code-block:: python

```

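To make the renumbered item 3 concrete, a small sketch of saving a full `dict`/`Namespace`, assuming the standard `save_hyperparameters` API (the `conf` keys are hypothetical):

```python
from argparse import Namespace

from pytorch_lightning import LightningModule
from torch import nn


class LitMNIST(LightningModule):
    def __init__(self, conf, *args, **kwargs):
        super().__init__()
        # a dict or argparse.Namespace passed here is exposed as self.hparams
        # and written into the checkpoint alongside the weights
        self.save_hyperparameters(conf)
        self.layer_1 = nn.Linear(28 * 28, self.hparams.layer_1_dim)


model = LitMNIST({"layer_1_dim": 128})
# or equivalently
model = LitMNIST(Namespace(layer_1_dim=128))
```
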
docs/source/common/lightning_module.rst

Lines changed: 1 addition & 1 deletion
```diff
@@ -54,7 +54,7 @@ Notice a few things.
         new_x = torch.Tensor(2, 3)
         new_x = new_x.type_as(x)

-5. There are no samplers for distributed, Lightning also does this for you.
+5. Lightning by default handles the distributed sampler for you.

 |

```

docs/source/common/trainer.rst

Lines changed: 15 additions & 10 deletions
```diff
@@ -1278,6 +1278,8 @@ Set to True to reload dataloaders every epoch.
     train_loader = model.train_dataloader()
     for batch in train_loader:

+.. _replace-sampler-ddp:
+
 replace_sampler_ddp
 ^^^^^^^^^^^^^^^^^^^

@@ -1289,9 +1291,10 @@ replace_sampler_ddp

 |

-Enables auto adding of distributed sampler. By default it will add ``shuffle=True``
-for train sampler and ``shuffle=False`` for val/test sampler. If you want to customize
-it, you can set ``replace_sampler_ddp=False`` and add your own distributed sampler.
+Enables auto adding of :class:`~torch.utils.data.distributed.DistributedSampler`. In PyTorch, you must use it in
+distributed settings such as TPUs or multi-node. The sampler makes sure each GPU sees the appropriate part of your data.
+By default it will add ``shuffle=True`` for train sampler and ``shuffle=False`` for val/test sampler.
+If you want to customize it, you can set ``replace_sampler_ddp=False`` and add your own distributed sampler.
 If ``replace_sampler_ddp=True`` and a distributed sampler was already added,
 Lightning will not replace the existing one.

@@ -1304,9 +1307,15 @@ By setting to False, you have to add your own distributed sampler:

 .. code-block:: python

-    # default used by the Trainer
-    sampler = torch.utils.data.distributed.DistributedSampler(dataset, shuffle=True)
-    dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)
+
+    # in your LightningModule or LightningDataModule
+    def train_dataloader(self):
+        # default used by the Trainer
+        sampler = torch.utils.data.distributed.DistributedSampler(dataset, shuffle=True)
+        dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)
+        return dataloader
+
+.. note:: For iterable datasets, we don't do this automatically.

 resume_from_checkpoint
 ^^^^^^^^^^^^^^^^^^^^^^
@@ -1389,10 +1398,6 @@ as you request.

 Your effective batch size is batch_size * total tpu cores.

-.. note::
-    No need to add a :class:`~torch.utils.data.distributed.DistributedSampler`,
-    Lightning automatically does it for you.
-
 This parameter can be either 1 or 8.

 Example::
```

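The new docs snippet above shows the custom `train_dataloader`; the matching Trainer call is roughly the following sketch (assuming the 1.x-era `replace_sampler_ddp` and `accelerator` arguments):

```python
from pytorch_lightning import Trainer

# Lightning will not inject its own DistributedSampler; the sampler built in
# train_dataloader() is used as-is on every process.
trainer = Trainer(gpus=2, accelerator="ddp", replace_sampler_ddp=False)
```
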
docs/source/extensions/logging.rst

Lines changed: 2 additions & 2 deletions
```diff
@@ -208,8 +208,8 @@ To change this behaviour, set the `log_every_n_steps` :class:`~pytorch_lightning
 Log writing frequency
 =====================

-Writing to a logger can be expensive, so by default Lightning write logs to disc or to the given logger every 100 training steps.
-To change this behaviour, set the interval at which you wish to flush logs to the filesystem using `log_every_n_steps` :class:`~pytorch_lightning.trainer.trainer.Trainer` flag.
+Writing to a logger can be expensive, so by default Lightning writes logs to disk or to the given logger every 100 training steps.
+To change this behaviour, set the interval at which you wish to flush logs to the filesystem using the `flush_logs_every_n_steps` :class:`~pytorch_lightning.trainer.trainer.Trainer` flag.

 .. testcode::

```

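As a usage illustration of the corrected flag name (a sketch; both are Trainer arguments in this release line):

```python
from pytorch_lightning import Trainer

# log metrics every 50 steps, but only flush the accumulated logs to disk /
# the logger backend every 500 steps
trainer = Trainer(log_every_n_steps=50, flush_logs_every_n_steps=500)
```
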
pl_examples/basic_examples/README.md

Lines changed: 0 additions & 13 deletions
````diff
@@ -58,16 +58,3 @@ To run this demo do the following:
 1. Log into the jumphost node of your SLURM-managed cluster.
 2. Create a conda environment with Lightning and a GPU PyTorch version.
 3. Choose a script to submit
-
-#### DDP
-Submit this job to run with DistributedDataParallel (2 nodes, 2 gpus each)
-```bash
-sbatch submit_ddp_job.sh YourEnv
-```
-
-#### DDP2
-Submit this job to run with a different implementation of DistributedDataParallel.
-In this version, each node acts like DataParallel but syncs across nodes like DDP.
-```bash
-sbatch submit_ddp2_job.sh YourEnv
-```
````
