Commit ead7602

Merge branch 'master' into tests/deprecated-checkpoint_callback
2 parents: 8b9adc3 + 20b806a

23 files changed: +1059 -76 lines

.github/workflows/release-pypi.yml
Lines changed: 11 additions & 0 deletions

@@ -28,6 +28,17 @@ jobs:
         python setup.py sdist bdist_wheel
         ls -lh dist/

+    - name: Upload to release
+      if: startsWith(github.event.ref, 'refs/tags') || github.event_name == 'release'
+      uses: svenstaro/upload-release-action@v2
+      with:
+        repo_token: ${{ secrets.GITHUB_TOKEN }}
+        file: dist/*
+        tag: ${{ github.ref }}
+        asset_name: packages
+        overwrite: false
+        file_glob: true
+
     - name: Delay releasing
       if: startsWith(github.event.ref, 'refs/tags') || github.event_name == 'release'
       uses: juliangruber/sleep-action@v1

.pre-commit-config.yaml
Lines changed: 1 addition & 0 deletions

@@ -32,5 +32,6 @@ repos:
         types: [python]

   - repo: https://github.com/pre-commit/mirrors-mypy
+    rev: v0.790
     hooks:
       - id: mypy

CHANGELOG.md
Lines changed: 26 additions & 13 deletions

@@ -4,18 +4,8 @@ All notable changes to this project will be documented in this file.

 The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

-## Unreleased
-
-### Added
-
-- Added `all_gather` method to `LightningModule` which allows gradient based tensor synchronizations for use-cases such as negative sampling. ([#5012](https://github.com/PyTorchLightning/pytorch-lightning/pull/5012))
-
-### Fixed
-
-- Fixed `LoggerConnector` to have logged metrics on root device in DP ([#4138](https://github.com/PyTorchLightning/pytorch-lightning/pull/4138))
-
-## [1.1.0rc] - 2020-12-02
+## [1.1.0rc2] - 2020-12-02

 ### Added

@@ -89,6 +79,15 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added `Pytorch Geometric` integration example with Lightning ([#4568](https://github.com/PyTorchLightning/pytorch-lightning/pull/4568))


+- Added `all_gather` method to `LightningModule` which allows gradient based tensor synchronizations for use-cases such as negative sampling. ([#5012](https://github.com/PyTorchLightning/pytorch-lightning/pull/5012))
+
+
+- Enabled `self.log` in most functions ([#4969](https://github.com/PyTorchLightning/pytorch-lightning/pull/4969))
+
+
+- Added changeable extension variable for `ModelCheckpoint` ([#4977](https://github.com/PyTorchLightning/pytorch-lightning/pull/4977))
+
+
 ### Changed

 - Removed `multiclass_roc` and `multiclass_precision_recall_curve`, use `roc` and `precision_recall_curve` instead ([#4549](https://github.com/PyTorchLightning/pytorch-lightning/pull/4549))

@@ -108,6 +107,12 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Changed `Simple Profiler` report to order by percentage time spent + num calls ([#4880](https://github.com/PyTorchLightning/pytorch-lightning/pull/4880))


+- Simplify optimization Logic ([#4984](https://github.com/PyTorchLightning/pytorch-lightning/pull/4984))
+
+
+- Classification metrics overhaul ([#4837](https://github.com/PyTorchLightning/pytorch-lightning/pull/4837))
+
+
 ### Deprecated

 - Deprecated `prefix` argument in `ModelCheckpoint` ([#4765](https://github.com/PyTorchLightning/pytorch-lightning/pull/4765))

@@ -127,12 +132,22 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

 - Added feature to move tensors to CPU before saving ([#4309](https://github.com/PyTorchLightning/pytorch-lightning/pull/4309))

+
 - Fixed `LoggerConnector` to have logged metrics on root device in DP ([#4138](https://github.com/PyTorchLightning/pytorch-lightning/pull/4138))


 - Auto convert tensors to contiguous format when `gather_all` ([#4907](https://github.com/PyTorchLightning/pytorch-lightning/pull/4907))


+- Fixed `PYTHONPATH` for ddp test model ([#4528](https://github.com/PyTorchLightning/pytorch-lightning/pull/4528))
+
+
+- Fixed allowing logger to support indexing ([#4595](https://github.com/PyTorchLightning/pytorch-lightning/pull/4595))
+
+
+- Fixed DDP and manual_optimization ([#4976](https://github.com/PyTorchLightning/pytorch-lightning/pull/4976))
+
+
 ## [1.0.8] - 2020-11-24

 ### Added

@@ -166,11 +181,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

 - Added lambda closure to `manual_optimizer_step` ([#4618](https://github.com/PyTorchLightning/pytorch-lightning/pull/4618))

-
 ### Changed

 - Change Metrics `persistent` default mode to `False` ([#4685](https://github.com/PyTorchLightning/pytorch-lightning/pull/4685))
-
 - LoggerConnector log_metrics will use `total_batch_idx` instead of `global_step` when logging on `training step` ([#4738](https://github.com/PyTorchLightning/pytorch-lightning/pull/4738))
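
Two of the entries folded into 1.1.0rc2 above, `all_gather` (#5012) and broader `self.log` support (#4969), are easier to read with a sketch. The module below is hypothetical — the encoder and the loss are placeholders, not code from this commit — and only the `self.all_gather` and `self.log` calls reflect the API surface the changelog describes:

import torch
from pytorch_lightning import LightningModule


class NegativeSamplingModel(LightningModule):
    # Hypothetical module: the encoder and loss are stand-ins; only the
    # all_gather/self.log calls illustrate the new API (#5012, #4969).

    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Linear(128, 64)  # stand-in encoder

    def training_step(self, batch, batch_idx):
        embeddings = self.encoder(batch)
        # Gather embeddings from every process; per #5012 the LightningModule
        # method supports gradient-based synchronization, which is what makes
        # in-batch negative sampling across devices possible.
        all_embeddings = self.all_gather(embeddings)
        loss = embeddings.sum() - all_embeddings.mean()  # placeholder loss
        self.log("train_loss", loss)  # self.log now usable in most hooks (#4969)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)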

MANIFEST.in
Lines changed: 0 additions & 1 deletion

@@ -42,7 +42,6 @@ exclude tests
 recursive-exclude docs *
 exclude docs
 recursive-include docs/source/_images/logos/ *
-recursive-include docs/source/_images/badges/ *
 recursive-include docs/source/_images/general/ pl_overview* tf_* tutorial_* PTL101_*

 # Include the Requirements

README.md
Lines changed: 7 additions & 7 deletions

@@ -44,7 +44,7 @@ Scale your models, not the boilerplate.**

 ## PyTorch Lightning is just organized PyTorch
 Lightning disentangles PyTorch code to decouple the science from the engineering.
-![PT to PL](/docs/source/_images/general/pl_quick_start_full_compressed.gif)
+![PT to PL](docs/source/_images/general/pl_quick_start_full_compressed.gif)

 ---

@@ -91,12 +91,12 @@ Lightning can automatically export to ONNX or TorchScript for those cases.

 | System / PyTorch ver. | 1.3 (min. req.)* | 1.4 | 1.5 | 1.6 | 1.7 (latest) | 1.8 (nightly) |
 | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
-| Conda py3.7 [linux] | [![PyTorch & Conda](https://github.com/PyTorchLightning/pytorch-lightning/workflows/PyTorch%20&%20Conda/badge.svg)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22PyTorch+%26+Conda%22+branch%3Amaster) | [![PyTorch & Conda](https://github.com/PyTorchLightning/pytorch-lightning/workflows/PyTorch%20&%20Conda/badge.svg)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22PyTorch+%26+Conda%22+branch%3Amaster) | [![PyTorch & Conda](https://github.com/PyTorchLightning/pytorch-lightning/workflows/PyTorch%20&%20Conda/badge.svg)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22PyTorch+%26+Conda%22+branch%3Amaster) | [![PyTorch & Conda](https://github.com/PyTorchLightning/pytorch-lightning/workflows/PyTorch%20&%20Conda/badge.svg)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22PyTorch+%26+Conda%22+branch%3Amaster) | [![PyTorch & Conda](https://github.com/PyTorchLightning/pytorch-lightning/workflows/PyTorch%20&%20Conda/badge.svg)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22PyTorch+%26+Conda%22+branch%3Amaster) | [![PyTorch & Conda](https://github.com/PyTorchLightning/pytorch-lightning/workflows/PyTorch%20&%20Conda/badge.svg)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22PyTorch+%26+Conda%22+branch%3Amaster) |
-| Linux py3.7 [GPUs**] | - | - | - | [![Build Status](http://104.154.220.231/api/badges/PyTorchLightning/pytorch-lightning/status.svg)](http://104.154.220.231/PyTorchLightning/pytorch-lightning) | - | - |
-| Linux py3.{6,7} [TPUs***] | - | - | - | [![TPU tests](https://github.com/PyTorchLightning/pytorch-lightning/workflows/TPU%20tests/badge.svg)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22TPU+tests%22+branch%3Amaster) | [![TPU tests](https://github.com/PyTorchLightning/pytorch-lightning/workflows/TPU%20tests/badge.svg)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22TPU+tests%22+branch%3Amaster) | - |
-| Linux py3.{6,7} | [![CI complete testing](https://github.com/PyTorchLightning/pytorch-lightning/workflows/CI%20complete%20testing/badge.svg?event=push)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22CI+testing%22) | - | - | - | [![CI complete testing](https://github.com/PyTorchLightning/pytorch-lightning/workflows/CI%20complete%20testing/badge.svg?event=push)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22CI+testing%22) | - |
-| OSX py3.{6,7,8} | - | [![CI complete testing](https://github.com/PyTorchLightning/pytorch-lightning/workflows/CI%20complete%20testing/badge.svg?event=push)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22CI+testing%22) | - | - | [![CI complete testing](https://github.com/PyTorchLightning/pytorch-lightning/workflows/CI%20complete%20testing/badge.svg?event=push)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22CI+testing%22) | - |
-| Windows py3.{6,7,8} | [![CI complete testing](https://github.com/PyTorchLightning/pytorch-lightning/workflows/CI%20complete%20testing/badge.svg?event=push)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22CI+testing%22) | - | - | - | [![CI complete testing](https://github.com/PyTorchLightning/pytorch-lightning/workflows/CI%20complete%20testing/badge.svg?event=push)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22CI+testing%22) | - |
+| Conda py3.7 [linux] | [![PyTorch & Conda](https://github.com/PyTorchLightning/pytorch-lightning/workflows/PyTorch%20&%20Conda/badge.svg?branch=master&event=push)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22PyTorch+%26+Conda%22+branch%3Amaster) | [![PyTorch & Conda](https://github.com/PyTorchLightning/pytorch-lightning/workflows/PyTorch%20&%20Conda/badge.svg?branch=master&event=push)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22PyTorch+%26+Conda%22+branch%3Amaster) | [![PyTorch & Conda](https://github.com/PyTorchLightning/pytorch-lightning/workflows/PyTorch%20&%20Conda/badge.svg?branch=master&event=push)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22PyTorch+%26+Conda%22+branch%3Amaster) | [![PyTorch & Conda](https://github.com/PyTorchLightning/pytorch-lightning/workflows/PyTorch%20&%20Conda/badge.svg?branch=master&event=push)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22PyTorch+%26+Conda%22+branch%3Amaster) | [![PyTorch & Conda](https://github.com/PyTorchLightning/pytorch-lightning/workflows/PyTorch%20&%20Conda/badge.svg?branch=master&event=push)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22PyTorch+%26+Conda%22+branch%3Amaster) | [![PyTorch & Conda](https://github.com/PyTorchLightning/pytorch-lightning/workflows/PyTorch%20&%20Conda/badge.svg?branch=master&event=push)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22PyTorch+%26+Conda%22+branch%3Amaster) |
+| Linux py3.7 [GPUs**] | - | - | - | [![GPUs Status](http://104.154.220.231/api/badges/PyTorchLightning/pytorch-lightning/status.svg)](http://104.154.220.231/PyTorchLightning/pytorch-lightning) | - | - |
+| Linux py3.{6,7} [TPUs***] | - | - | - | [![TPU tests](https://github.com/PyTorchLightning/pytorch-lightning/workflows/TPU%20tests/badge.svg?branch=master&event=push)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22TPU+tests%22+branch%3Amaster) | [![TPU tests](https://github.com/PyTorchLightning/pytorch-lightning/workflows/TPU%20tests/badge.svg?branch=master&event=push)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22TPU+tests%22+branch%3Amaster) | - |
+| Linux py3.{6,7} | [![CI complete testing](https://github.com/PyTorchLightning/pytorch-lightning/workflows/CI%20complete%20testing/badge.svg?branch=master&event=push)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22CI+testing%22) | - | - | - | [![CI complete testing](https://github.com/PyTorchLightning/pytorch-lightning/workflows/CI%20complete%20testing/badge.svg?branch=master&event=push)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22CI+testing%22) | - |
+| OSX py3.{6,7,8} | - | [![CI complete testing](https://github.com/PyTorchLightning/pytorch-lightning/workflows/CI%20complete%20testing/badge.svg?branch=master&event=push)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22CI+testing%22) | - | - | [![CI complete testing](https://github.com/PyTorchLightning/pytorch-lightning/workflows/CI%20complete%20testing/badge.svg?branch=master&event=push)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22CI+testing%22) | - |
+| Windows py3.{6,7,8} | [![CI complete testing](https://github.com/PyTorchLightning/pytorch-lightning/workflows/CI%20complete%20testing/badge.svg?branch=master&event=push)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22CI+testing%22) | - | - | - | [![CI complete testing](https://github.com/PyTorchLightning/pytorch-lightning/workflows/CI%20complete%20testing/badge.svg?branch=master&event=push)](https://github.com/PyTorchLightning/pytorch-lightning/actions?query=workflow%3A%22CI+testing%22) | - |

 - _\* `torch>=1.4` is the minimal pytorch version for Python 3.8_
 - _\** tests run on two NVIDIA K80_

benchmarks/test_sharded_parity.py
Lines changed: 3 additions & 1 deletion

@@ -131,6 +131,7 @@ def test_ddp_sharded_plugin_correctness_amp_multi_gpu_ddp(tmpdir, args=None):
     )


+@pytest.mark.skip(reason="Current issue with multiple optimizers and FairScale.")
 @pytest.mark.skipif(torch.cuda.device_count() < 2, reason="test requires multi-GPU machine")
 @pytest.mark.skipif(platform.system() == "Windows",
                     reason="Distributed training is not supported on Windows")

@@ -148,6 +149,7 @@ def test_ddp_sharded_plugin_correctness_multi_gpu_multi_optim():
     )


+@pytest.mark.skip(reason="Current issue with multiple optimizers and FairScale.")
 @pytest.mark.skipif(torch.cuda.device_count() < 2, reason="test requires multi-GPU machine")
 @pytest.mark.skipif(platform.system() == "Windows",
                     reason="Distributed training is not supported on Windows")

@@ -189,7 +191,7 @@ def training_step(self, batch, batch_idx, optimizer_idx):

         # ensure we forward the correct params to the optimizer
         # without retain_graph we can't do multiple backward passes
-        self.manual_backward(loss_2, opt_b, retain_graph=True)
+        self.manual_backward(loss_2, opt_b)
         # todo: understand why synchronization breaks there.
         # self.manual_backward(loss_2, opt_a, retain_graph=True)
         opt_b.step()
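
The last hunk drops `retain_graph=True` from the second `manual_backward` call. For orientation, here is a minimal sketch of the two-optimizer manual-optimization pattern this benchmark exercises, assuming the 1.1-era API (`self.optimizers()`, `Trainer(automatic_optimization=False)`); the loss helpers are placeholders, not the benchmark's actual code:

def training_step(self, batch, batch_idx, optimizer_idx):
    # Manual optimization: the Trainer is configured with
    # automatic_optimization=False, so we drive both optimizers by hand.
    opt_a, opt_b = self.optimizers()

    loss_1 = self.compute_loss_a(batch)  # placeholder helper
    self.manual_backward(loss_1, opt_a)
    opt_a.step()
    opt_a.zero_grad()

    # Each loss builds its own graph, so after this commit the second
    # backward pass no longer needs retain_graph=True.
    loss_2 = self.compute_loss_b(batch)  # placeholder helper
    self.manual_backward(loss_2, opt_b)
    opt_b.step()
    opt_b.zero_grad()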

docs/source/governance.rst
Lines changed: 1 addition & 0 deletions

@@ -24,3 +24,4 @@ Core Maintainers
 - Lezwon Castelino (`lezwon <https://github.com/lezwon>`_)
 - Jeff Yang (`ydcjeff <https://github.com/ydcjeff>`_)
 - Roger Shieh (`s-rog <https://github.com/s-rog>`_)
+- Carlos Mocholí (`carmocca <https://github.com/carmocca>`_)

docs/source/multi_gpu.rst
Lines changed: 81 additions & 6 deletions

@@ -612,6 +612,7 @@ This is useful when dealing with large Transformer based models, or in environme
 Lightning currently offers the following methods to leverage model parallelism:

 - Sharded Training (partitioning your gradients and optimizer state across multiple GPUs, for reduced memory overhead with **no performance loss**)
+- Sequential Model Parallelism with Checkpointing (partition your :class:`nn.Sequential <torch.nn.Sequential>` module across multiple GPUs, leverage checkpointing and microbatching for further memory improvements and device utilization)

 Sharded Training
 ^^^^^^^^^^^^^^^^

@@ -666,7 +667,7 @@ To use Sharded Training, you need to first install FairScale using the command b

 .. code-block:: bash

-    pip install https://github.com/facebookresearch/fairscale/archive/bb468670838b98dc8f8d67be4eabf195042a7994.zip
+    pip install https://github.com/PyTorchLightning/fairscale/archive/pl_1.1.0.zip


 .. code-block:: python

@@ -678,6 +679,80 @@ Sharded Training can work across all DDP variants by adding the additional ``--p

 Internally we re-initialize your optimizers and shard them across your machines and processes. We handle all communication using PyTorch distributed, so no code changes are required.

+----------
+
+.. _sequential-parallelism:
+
+Sequential Model Parallelism with Checkpointing
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+PyTorch Lightning integration for Sequential Model Parallelism using `FairScale <https://github.com/facebookresearch/fairscale>`_.
+Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially.
+We also provide auto-balancing techniques through FairScale, to find optimal balances for the model across GPUs.
+In addition, we use Gradient Checkpointing to reduce GPU memory requirements further, and micro-batches to minimize device under-utilization automatically.
+
+Reference: https://arxiv.org/abs/1811.06965
+
+.. note:: DDPSequentialPlugin is currently supported only for PyTorch 1.6.
+
+To get started, install FairScale through extras with ``pip install pytorch-lightning["extra"]``
+
+or directly using
+
+.. code-block:: bash
+
+    pip install https://github.com/PyTorchLightning/fairscale/archive/pl_1.1.0.zip
+
+To use Sequential Model Parallelism, you must define a :class:`nn.Sequential <torch.nn.Sequential>` module that contains the layers you wish to parallelize across GPUs.
+This should be kept within the ``sequential_module`` variable within your ``LightningModule`` like below.
+
+.. code-block:: python
+
+    from pytorch_lightning.plugins.ddp_sequential_plugin import DDPSequentialPlugin
+    from pytorch_lightning import LightningModule
+
+    class MyModel(LightningModule):
+        def __init__(self):
+            ...
+            self.sequential_module = torch.nn.Sequential(my_layers)
+
+    # Split my module across 4 gpus, one layer each
+    model = MyModel()
+    plugin = DDPSequentialPlugin(balance=[1, 1, 1, 1])
+    trainer = Trainer(accelerator='ddp', gpus=4, plugins=[plugin])
+    trainer.fit(model)
+
+
+We provide a minimal example of Sequential Model Parallelism using a convolutional model training on CIFAR-10, split onto GPUs `here <https://github.com/PyTorchLightning/pytorch-lightning/tree/master/pl_examples/basic_examples/conv_sequential_example.py>`_.
+To run the example, you need to install `Bolts <https://github.com/PyTorchLightning/pytorch-lightning-bolts>`_. Install with ``pip install pytorch-lightning-bolts``.
+
+When running the Sequential Model Parallelism example on 2 GPUs we achieve these memory savings:
+
+.. list-table:: GPU Memory Utilization
+   :widths: 25 25 50
+   :header-rows: 1
+
+   * - GPUs
+     - Without Balancing
+     - With Balancing
+   * - GPU 0
+     - 4436 MB
+     - 1554 MB
+   * - GPU 1
+     - ~0
+     - 994 MB
+
+To run the example with Sequential Model Parallelism:
+
+.. code-block:: bash
+
+    python pl_examples/basic_examples/conv_sequential_example.py --batch_size 1024 --gpus 2 --accelerator ddp --use_ddp_sequential
+
+To run the same example without Sequential Model Parallelism:
+
+.. code-block:: bash
+
+    python pl_examples/basic_examples/conv_sequential_example.py --batch_size 1024 --gpus 1
+

 Batch size
 ----------

@@ -728,17 +803,17 @@ Lightning supports the use of TorchElastic to enable fault-tolerant and elastic

 .. code-block:: python

     Trainer(gpus=8, accelerator='ddp')
-
-
+
+
 Following the `TorchElastic Quickstart documentation <https://pytorch.org/elastic/latest/quickstart.html>`_, you then need to start a single-node etcd server on one of the hosts:

 .. code-block:: bash

    etcd --enable-v2
    --listen-client-urls http://0.0.0.0:2379,http://127.0.0.1:4001
    --advertise-client-urls PUBLIC_HOSTNAME:2379
-
-
+
+
 And then launch the elastic job with:

 .. code-block:: bash

@@ -750,7 +825,7 @@ And then launch the elastic job with:
     --rdzv_backend=etcd
     --rdzv_endpoint=ETCD_HOST:ETCD_PORT
     YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...)
-
+

 See the official `TorchElastic documentation <https://pytorch.org/elastic>`_ for details
 on installation and more use cases.
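
Both the Sharded Training and Sequential Model Parallelism sections now point at the pinned `pl_1.1.0` FairScale archive. For completeness, a minimal sketch of enabling Sharded Training itself, assuming the 1.1-era `plugins='ddp_sharded'` shorthand this docs page describes:

from pytorch_lightning import Trainer

# Sharded Training partitions optimizer state and gradients across GPUs;
# FairScale must be installed first (see the pinned pl_1.1.0 archive above).
trainer = Trainer(gpus=2, accelerator='ddp', plugins='ddp_sharded')
# trainer.fit(model)  # any LightningModule; no model changes are required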
