
Commit d3d5cf7

Merge branch 'master' into hyperparameters_for_datamodule
2 parents 43c75fe + 3102922 commit d3d5cf7

28 files changed: +539 / -44 lines

CHANGELOG.md

Lines changed: 11 additions & 2 deletions
@@ -81,6 +81,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added trainer stage hooks for Training Plugins and Accelerators ([#7864](https://github.com/PyTorchLightning/pytorch-lightning/pull/7864))

+- Added the `on_before_optimizer_step` hook ([#8048](https://github.com/PyTorchLightning/pytorch-lightning/pull/8048))
+
 - Added IPU Accelerator ([#7867](https://github.com/PyTorchLightning/pytorch-lightning/pull/7867))

@@ -149,6 +152,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added support for `save_hyperparameters` in `LightningDataModule` ([#3792](https://github.com/PyTorchLightning/pytorch-lightning/pull/3792))

+- Added `LSFEnvironment` for distributed training with the LSF resource manager `jsrun` ([#5102](https://github.com/PyTorchLightning/pytorch-lightning/pull/5102))
+
 ### Changed

@@ -247,10 +253,10 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Moved profilers to their own file ([#7822](https://github.com/PyTorchLightning/pytorch-lightning/pull/7822))

-- The `on_after_backward` hook is now called on accumulating iterations ([#8328](https://github.com/PyTorchLightning/pytorch-lightning/pull/8328))
+- The `on_after_backward` hook is now called on accumulating iterations. Use the `on_before_optimizer_step` hook to mimic the old behaviour ([#8328](https://github.com/PyTorchLightning/pytorch-lightning/pull/8328))

-- The mixed precision loss is no longer unscaled before the `on_after_backward` hook ([#8328](https://github.com/PyTorchLightning/pytorch-lightning/pull/8328))
+- The mixed precision loss is no longer unscaled before the `on_after_backward` hook. Use the `on_before_optimizer_step` hook to mimic the old behaviour ([#8328](https://github.com/PyTorchLightning/pytorch-lightning/pull/8328))

 - The `TrainingTypePlugin.{pre,post}_backward` hooks no longer take the `optimizer, opt_idx, should_accumulate` arguments ([#8328](https://github.com/PyTorchLightning/pytorch-lightning/pull/8328))

@@ -262,6 +268,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - The `PrecisionPlugin.backward` hook no longer takes a `should_accumulate` argument ([#8328](https://github.com/PyTorchLightning/pytorch-lightning/pull/8328))

+- Added the `on_before_backward` hook ([#7865](https://github.com/PyTorchLightning/pytorch-lightning/pull/7865))
+
 - `LightningCLI` now aborts with a clearer message if config already exists and disables save config during `fast_dev_run` ([#7963](https://github.com/PyTorchLightning/pytorch-lightning/pull/7963))
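
The two `on_after_backward` entries above change what that hook sees: it now also runs on accumulating iterations and, under native AMP, before the gradients are unscaled. A minimal migration sketch that moves gradient inspection to `on_before_optimizer_step`, assuming a TensorBoard logger; the module name and logging cadence are placeholders:

    import pytorch_lightning as pl


    class LitModel(pl.LightningModule):
        # before #8328 this kind of logging typically lived in `on_after_backward`
        def on_before_optimizer_step(self, optimizer, optimizer_idx):
            # runs once per real optimizer step, with gradients already unscaled under native AMP
            if self.trainer.global_step % 25 == 0:  # keep the log volume small
                for name, param in self.named_parameters():
                    if param.grad is not None:
                        self.logger.experiment.add_histogram(
                            name, param.grad, global_step=self.trainer.global_step
                        )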

docs/source/common/lightning_module.rst

Lines changed: 14 additions & 0 deletions
@@ -1191,9 +1191,11 @@ for more information.
     on_before_zero_grad()
     optimizer_zero_grad()

+    on_before_backward()
     backward()
     on_after_backward()

+    on_before_optimizer_step()
     optimizer_step()

     on_train_batch_end()
@@ -1246,6 +1248,12 @@ get_progress_bar_dict
 .. automethod:: pytorch_lightning.core.lightning.LightningModule.get_progress_bar_dict
    :noindex:

+on_before_backward
+~~~~~~~~~~~~~~~~~~
+
+.. automethod:: pytorch_lightning.core.hooks.ModelHooks.on_before_backward
+   :noindex:
+
 on_after_backward
 ~~~~~~~~~~~~~~~~~

@@ -1444,6 +1452,12 @@ on_test_model_train
 .. automethod:: pytorch_lightning.core.hooks.ModelHooks.on_test_model_train
    :noindex:

+on_before_optimizer_step
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. automethod:: pytorch_lightning.core.hooks.ModelHooks.on_before_optimizer_step
+   :noindex:
+
 optimizer_step
 ~~~~~~~~~~~~~~
docs/source/extensions/callbacks.rst

Lines changed: 12 additions & 0 deletions
@@ -351,12 +351,24 @@ on_load_checkpoint
 .. automethod:: pytorch_lightning.callbacks.Callback.on_load_checkpoint
    :noindex:

+on_before_backward
+^^^^^^^^^^^^^^^^^^
+
+.. automethod:: pytorch_lightning.callbacks.Callback.on_before_backward
+   :noindex:
+
 on_after_backward
 ^^^^^^^^^^^^^^^^^

 .. automethod:: pytorch_lightning.callbacks.Callback.on_after_backward
    :noindex:

+on_before_optimizer_step
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. automethod:: pytorch_lightning.callbacks.Callback.on_before_optimizer_step
+   :noindex:
+
 on_before_zero_grad
 ^^^^^^^^^^^^^^^^^^^
docs/source/extensions/plugins.rst

Lines changed: 1 addition & 0 deletions
@@ -148,6 +148,7 @@ Cluster Environments

     ClusterEnvironment
     LightningEnvironment
+    LSFEnvironment
     TorchElasticEnvironment
     KubeflowEnvironment
     SLURMEnvironment

docs/source/guides/speed.rst

Lines changed: 20 additions & 0 deletions
@@ -90,6 +90,26 @@ This by default comes with a performance hit, and can be disabled in most cases.
         plugins=DDPPlugin(find_unused_parameters=False),
     )

+When using DDP on a multi-node cluster, set NCCL parameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+`NCCL <https://developer.nvidia.com/nccl>`__ is the NVIDIA Collective Communications Library, used under the hood by PyTorch to handle communication across nodes and GPUs. Adjusting NCCL parameters has been reported to yield speedups, as shown in this `issue <https://github.com/PyTorchLightning/pytorch-lightning/issues/7179>`__: a 30% speed improvement when training the Transformer XLM-RoBERTa and a 15% improvement when training with Detectron2.
+
+NCCL parameters can be adjusted via environment variables.
+
+.. note::
+
+    AWS and GCP already set default values for these on their clusters. Tuning them is typically useful for custom cluster setups.
+
+* `NCCL_NSOCKS_PERTHREAD <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-nsocks-perthread>`__
+* `NCCL_SOCKET_NTHREADS <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-socket-nthreads>`__
+* `NCCL_MIN_NCHANNELS <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-min-nchannels>`__
+
+.. code-block:: bash
+
+    export NCCL_NSOCKS_PERTHREAD=4
+    export NCCL_SOCKET_NTHREADS=2
+
 Dataloaders
 ^^^^^^^^^^^
 When building your DataLoader set ``num_workers > 0`` and ``pin_memory=True`` (only for GPUs).
pytorch_lightning/callbacks/base.py

Lines changed: 12 additions & 1 deletion
@@ -19,6 +19,7 @@
 import abc
 from typing import Any, Dict, List, Optional

+import torch
 from torch.optim import Optimizer

 import pytorch_lightning as pl
@@ -296,8 +297,18 @@ def on_load_checkpoint(
         """
         pass

+    def on_before_backward(self, trainer: 'pl.Trainer', pl_module: 'pl.LightningModule', loss: torch.Tensor) -> None:
+        """Called before ``loss.backward()``."""
+        pass
+
     def on_after_backward(self, trainer: 'pl.Trainer', pl_module: 'pl.LightningModule') -> None:
-        """Called after ``loss.backward()`` and before optimizers do anything."""
+        """Called after ``loss.backward()`` and before optimizers are stepped."""
+        pass
+
+    def on_before_optimizer_step(
+        self, trainer: 'pl.Trainer', pl_module: 'pl.LightningModule', optimizer: Optimizer, opt_idx: int
+    ) -> None:
+        """Called before ``optimizer.step()``."""
         pass

     def on_before_zero_grad(self, trainer: 'pl.Trainer', pl_module: 'pl.LightningModule', optimizer: Optimizer) -> None:
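
A sketch of a user callback wired to the new hooks, matching the signatures defined above; `GradientDebugCallback` and its print-based diagnostics are purely illustrative:

    import torch
    from pytorch_lightning.callbacks import Callback


    class GradientDebugCallback(Callback):
        def on_before_backward(self, trainer, pl_module, loss):
            # `loss` is already divided for gradient accumulation and scaled under native AMP
            if trainer.is_global_zero and not torch.isfinite(loss):
                print(f"non-finite loss before backward: {loss}")

        def on_before_optimizer_step(self, trainer, pl_module, optimizer, opt_idx):
            # runs once per real optimizer step; with native AMP the gradients are already unscaled here
            grads = [p.grad.norm() for p in pl_module.parameters() if p.grad is not None]
            if trainer.is_global_zero and grads:
                total = torch.stack(grads).norm().item()
                print(f"step {trainer.global_step}: total grad norm {total:.4f}")

It would be attached like any other callback, e.g. `Trainer(callbacks=[GradientDebugCallback()])`.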

pytorch_lightning/callbacks/lambda_function.py

Lines changed: 2 additions & 0 deletions
@@ -77,7 +77,9 @@ def __init__(
         on_keyboard_interrupt: Optional[Callable] = None,
         on_save_checkpoint: Optional[Callable] = None,
         on_load_checkpoint: Optional[Callable] = None,
+        on_before_backward: Optional[Callable] = None,
         on_after_backward: Optional[Callable] = None,
+        on_before_optimizer_step: Optional[Callable] = None,
         on_before_zero_grad: Optional[Callable] = None,
         on_predict_start: Optional[Callable] = None,
         on_predict_end: Optional[Callable] = None,
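
The same hooks can also be attached ad hoc through `LambdaCallback`; a sketch with purely illustrative lambdas whose arguments mirror the `Callback` signatures above:

    from pytorch_lightning.callbacks import LambdaCallback

    hook_probe = LambdaCallback(
        on_before_backward=lambda trainer, pl_module, loss: print("loss:", loss.item()),
        on_before_optimizer_step=lambda trainer, pl_module, optimizer, opt_idx: print(
            "stepping optimizer", opt_idx
        ),
    )
    # trainer = pl.Trainer(callbacks=[hook_probe])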

pytorch_lightning/core/hooks.py

Lines changed: 30 additions & 4 deletions
@@ -295,21 +295,47 @@ def on_before_zero_grad(self, optimizer: Optimizer) -> None:
             optimizer: The optimizer for which grads should be zeroed.
         """

+    def on_before_backward(self, loss: torch.Tensor) -> None:
+        """
+        Called before ``loss.backward()``.
+
+        Args:
+            loss: Loss divided by number of batches for gradient accumulation and scaled if using native AMP.
+        """
+        pass
+
     def on_after_backward(self) -> None:
         """
-        Called in the training loop after loss.backward() and before optimizers do anything.
-        This is the ideal place to inspect or log gradient information.
+        Called after ``loss.backward()`` and before optimizers are stepped.
+
+        Note:
+            If using native AMP, the gradients will not be unscaled at this point.
+            Use the ``on_before_optimizer_step`` hook if you need the unscaled gradients.
+        """
+
+    def on_before_optimizer_step(self, optimizer: Optimizer, optimizer_idx: int) -> None:
+        """
+        Called before ``optimizer.step()``.
+
+        The hook is only called if gradients do not need to be accumulated.
+        See: :paramref:`~pytorch_lightning.trainer.Trainer.accumulate_grad_batches`.
+        If using native AMP, the loss will be unscaled before calling this hook.
+        See these `docs <https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-unscaled-gradients>`__
+        for more information on the scaling of gradients.
+
+        Args:
+            optimizer: Current optimizer being used.
+            optimizer_idx: Index of the current optimizer being used.

         Example::

-            def on_after_backward(self):
+            def on_before_optimizer_step(self, optimizer, optimizer_idx):
                 # example to inspect gradient information in tensorboard
                 if self.trainer.global_step % 25 == 0:  # don't make the tf file huge
                     for k, v in self.named_parameters():
                         self.logger.experiment.add_histogram(
                             tag=k, values=v.grad, global_step=self.trainer.global_step
                         )
-
         """

     def on_post_move_to_device(self) -> None:
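
A small sketch of how the paired hooks can be used from a `LightningModule`, here to time the backward pass; `LitModel` and the print cadence are placeholders:

    import time

    import pytorch_lightning as pl


    class LitModel(pl.LightningModule):
        def on_before_backward(self, loss):
            # `loss` is already divided for accumulation and scaled under native AMP
            self._backward_start = time.monotonic()

        def on_after_backward(self):
            # gradients exist now; under native AMP they are still scaled at this point
            if self.trainer.global_step % 100 == 0:
                print(f"backward took {time.monotonic() - self._backward_start:.3f}s")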

pytorch_lightning/plugins/environments/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -14,5 +14,6 @@
 from pytorch_lightning.plugins.environments.cluster_environment import ClusterEnvironment  # noqa: F401
 from pytorch_lightning.plugins.environments.kubeflow_environment import KubeflowEnvironment  # noqa: F401
 from pytorch_lightning.plugins.environments.lightning_environment import LightningEnvironment  # noqa: F401
+from pytorch_lightning.plugins.environments.lsf_environment import LSFEnvironment  # noqa: F401
 from pytorch_lightning.plugins.environments.slurm_environment import SLURMEnvironment  # noqa: F401
 from pytorch_lightning.plugins.environments.torchelastic_environment import TorchElasticEnvironment  # noqa: F401
pytorch_lightning/plugins/environments/lsf_environment.py (new file)

Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
# Copyright The PyTorch Lightning team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import socket

from pytorch_lightning import _logger as log
from pytorch_lightning.plugins.environments import ClusterEnvironment


class LSFEnvironment(ClusterEnvironment):
    """
    An environment for running on clusters managed by the LSF resource manager.

    It is expected that any execution using this ClusterEnvironment was executed
    using the Job Step Manager, i.e. ``jsrun``.

    This plugin expects the following environment variables:

    LSB_JOBID:
        The LSF assigned job ID

    LSB_HOSTS:
        The hosts used in the job. This string is expected to have the format "batch <rank_0_host> ..."

    JSM_NAMESPACE_LOCAL_RANK:
        The node-local rank for the task. This environment variable is set by ``jsrun``.

    JSM_NAMESPACE_SIZE:
        The world size for the task. This environment variable is set by ``jsrun``.
    """

    def __init__(self):
        self._master_address = self._get_master_address()
        self._master_port = self._get_master_port()
        log.debug(f"MASTER_ADDR: {self._master_address}")
        log.debug(f"MASTER_PORT: {self._master_port}")

    @staticmethod
    def is_using_lsf() -> bool:
        """ Returns ``True`` if the current process was launched using the ``jsrun`` command. """
        required_env_vars = (
            "LSB_JOBID",
            "LSB_HOSTS",
            "JSM_NAMESPACE_LOCAL_RANK",
            "JSM_NAMESPACE_SIZE",
        )
        return all(v in os.environ for v in required_env_vars)

    def creates_children(self) -> bool:
        return True

    def master_address(self):
        """ The master address is read from a list of hosts contained in the environment variable `LSB_HOSTS`. """
        return self._master_address

    def master_port(self):
        """ The master port gets calculated from the LSF job ID. """
        return self._master_port

    def world_size(self):
        """ The world size is read from the environment variable `JSM_NAMESPACE_SIZE`. """
        var = "JSM_NAMESPACE_SIZE"
        world_size = os.environ.get(var)
        if world_size is None:
            raise ValueError(
                f"Cannot determine world size from environment variable {var}."
                " Make sure you run your executable with `jsrun`"
            )
        return int(world_size)

    def set_world_size(self, size: int) -> None:
        log.debug("LSFEnvironment.set_world_size was called, but setting world size is not allowed. Ignored.")

    def global_rank(self):
        """ The global rank is read from the environment variable `JSM_NAMESPACE_RANK`. """
        var = "JSM_NAMESPACE_RANK"
        global_rank = os.environ.get(var)
        if global_rank is None:
            raise ValueError(
                f"Cannot determine global rank from environment variable {var}."
                " Make sure you run your executable with `jsrun`"
            )
        return int(global_rank)

    def set_global_rank(self, rank: int) -> None:
        log.debug("LSFEnvironment.set_global_rank was called, but setting global rank is not allowed. Ignored.")

    def local_rank(self):
        """ The local rank is read from the environment variable `JSM_NAMESPACE_LOCAL_RANK`. """
        var = "JSM_NAMESPACE_LOCAL_RANK"
        local_rank = os.environ.get(var)
        if local_rank is None:
            raise ValueError(
                f"Cannot determine local rank from environment variable {var}."
                " Make sure you run your executable with `jsrun`"
            )
        return int(local_rank)

    def node_rank(self):
        """
        The node rank is determined by the position of the current hostname in the list of hosts stored in
        the environment variable `LSB_HOSTS`.
        """
        hosts = self._read_hosts()
        count = dict()
        for host in hosts:
            if "batch" in host or "login" in host:
                continue
            if host not in count:
                count[host] = len(count)
        return count[socket.gethostname()]

    @staticmethod
    def _read_hosts():
        hosts = os.environ.get("LSB_HOSTS")
        if not hosts:
            raise ValueError("Could not find hosts in environment variable LSB_HOSTS")
        hosts = hosts.split()
        if len(hosts) < 2:
            raise ValueError(
                "Cannot parse hosts from LSB_HOSTS environment variable."
                " Expected format: \"batch <rank_0_host> ...\""
            )
        return hosts

    def _get_master_address(self):
        hosts = self._read_hosts()
        return hosts[1]

    @staticmethod
    def _get_master_port():
        """
        A helper function for accessing the master port.
        Uses the LSF job ID so all ranks can compute the master port.
        """
        # check for a user-specified master port
        port = os.environ.get("MASTER_PORT")
        if not port:
            jobid = os.environ.get("LSB_JOBID")
            if not jobid:
                raise ValueError("Could not find job id in environment variable LSB_JOBID")
            port = int(jobid)
            # all ports should be in the 10k+ range
            port = int(port) % 1000 + 10000
            log.debug(f"calculated LSF master port: {port}")
        else:
            log.debug(f"using externally specified master port: {port}")
        return int(port)
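
A sketch of how this environment would typically be selected, assuming the script is launched with `jsrun` and that cluster environments can be passed through the Trainer's `plugins` argument like the other environments in this release; the resource numbers are placeholders:

    import pytorch_lightning as pl
    from pytorch_lightning.plugins.environments import LSFEnvironment

    if LSFEnvironment.is_using_lsf():
        trainer = pl.Trainer(
            num_nodes=2,           # placeholder
            gpus=6,                # GPUs per node, placeholder
            accelerator="ddp",
            plugins=[LSFEnvironment()],
        )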
