
Commit ffe9c85

Merge branch 'master' into feature/1947_load_disparity
2 parents: aa82615 + ee35907

43 files changed: +7500 -7015 lines

.drone.yml

Lines changed: 2 additions & 0 deletions
@@ -32,6 +32,8 @@ steps:
     - pip --version
     - nvidia-smi
     - pip install -r ./requirements/devel.txt --upgrade-strategy only-if-needed -v --no-cache-dir
+    # when Image has defined CUDa version we can switch to this package spec "nvidia-dali-cuda${CUDA_VERSION%%.*}0"
+    - pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda100 --upgrade-strategy only-if-needed
     - pip list
     - coverage run --source pytorch_lightning -m pytest pytorch_lightning tests -v --color=yes --durations=25 # --flake8
     - python -m pytest benchmarks pl_examples -v --color=yes --maxfail=2 --durations=0 # --flake8

.pyrightconfig.json

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@
     "pytorch_lightning/trainer/training_tricks.py",
     "pytorch_lightning/trainer/batch_size_scaling.py",
     "pytorch_lightning/trainer/distrib_data_parallel.py",
+    "pytorch_lightning/trainer/properties.py",
     "pytorch_lightning/trainer/lr_scheduler_connector.py",
     "pytorch_lightning/trainer/training_loop_temp.py",
     "pytorch_lightning/trainer/connectors/checkpoint_connector.py",

docs/source/accelerators.rst

Lines changed: 182 additions & 0 deletions
@@ -0,0 +1,182 @@
+############
+Accelerators
+############
+Accelerators connect a Lightning Trainer to arbitrary accelerators (CPUs, GPUs, TPUs, etc). Accelerators
+also manage distributed accelerators (like DP, DDP, HPC cluster).
+
+Accelerators can also be configured to run on arbitrary clusters using Plugins or to link up to arbitrary
+computational strategies like 16-bit precision via AMP and Apex.
+
+----------
+
+******************************
+Implement a custom accelerator
+******************************
+To link up arbitrary hardware, implement your own Accelerator subclass
+
+.. code-block:: python
+
+    from pytorch_lightning.accelerators.accelerator import Accelerator
+
+    class MyAccelerator(Accelerator):
+        def __init__(self, trainer, cluster_environment=None):
+            super().__init__(trainer, cluster_environment)
+            self.nickname = 'my_accelator'
+
+        def setup(self):
+            # find local rank, etc, custom things to implement
+
+        def train(self):
+            # implement what happens during training
+
+        def training_step(self):
+            # implement how to do a training_step on this accelerator
+
+        def validation_step(self):
+            # implement how to do a validation_step on this accelerator
+
+        def test_step(self):
+            # implement how to do a test_step on this accelerator
+
+        def backward(self, closure_loss, optimizer, opt_idx, *args, **kwargs):
+            # implement how to do a backward pass with this accelerator
+
+        def barrier(self, name: Optional[str] = None):
+            # implement this accelerator's barrier
+
+        def broadcast(self, obj, src=0):
+            # implement this accelerator's broadcast function
+
+        def sync_tensor(self,
+                        tensor: Union[torch.Tensor],
+                        group: Optional[Any] = None,
+                        reduce_op: Optional[Union[ReduceOp, str]] = None) -> torch.Tensor:
+            # implement how to sync tensors when reducing metrics across accelerators
+
+********
+Examples
+********
+The following examples illustrate customizing accelerators.
+
+Example 1: Arbitrary HPC cluster
+================================
+To link any accelerator with an arbitrary cluster (SLURM, Condor, etc), pass in a Cluster Plugin which will be passed
+into any accelerator.
+
+First, implement your own ClusterEnvironment. Here is the torch elastic implementation.
+
+.. code-block:: python
+
+    import os
+    from pytorch_lightning import _logger as log
+    from pytorch_lightning.utilities import rank_zero_warn
+    from pytorch_lightning.cluster_environments.cluster_environment import ClusterEnvironment
+
+    class TorchElasticEnvironment(ClusterEnvironment):
+
+        def __init__(self):
+            super().__init__()
+
+        def master_address(self):
+            if "MASTER_ADDR" not in os.environ:
+                rank_zero_warn(
+                    "MASTER_ADDR environment variable is not defined. Set as localhost"
+                )
+                os.environ["MASTER_ADDR"] = "127.0.0.1"
+            log.debug(f"MASTER_ADDR: {os.environ['MASTER_ADDR']}")
+            master_address = os.environ.get('MASTER_ADDR')
+            return master_address
+
+        def master_port(self):
+            if "MASTER_PORT" not in os.environ:
+                rank_zero_warn(
+                    "MASTER_PORT environment variable is not defined. Set as 12910"
+                )
+                os.environ["MASTER_PORT"] = "12910"
+            log.debug(f"MASTER_PORT: {os.environ['MASTER_PORT']}")
+
+            port = os.environ.get('MASTER_PORT')
+            return port
+
+        def world_size(self):
+            return os.environ.get('WORLD_SIZE')
+
+        def local_rank(self):
+            return int(os.environ['LOCAL_RANK'])
+
+Now, pass it into the trainer which will use Torch Elastic across your accelerator of choice.
+
+.. code-block:: python
+
+    cluster = TorchElasticEnvironment()
+    accelerator = MyAccelerator()
+    trainer = Trainer(plugins=[cluster], accelerator=MyAccelerator())
+
+In this example, MyAccelerator can define arbitrary hardware (like IPUs or TPUs) and links it to an arbitrary
+compute cluster.
+
+------------
+
+**********************
+Available Accelerators
+**********************
+
+CPU Accelerator
+===============
+
+.. autoclass:: pytorch_lightning.accelerators.cpu_accelerator.CPUAccelerator
+    :noindex:
+
+DDP Accelerator
+===============
+
+.. autoclass:: pytorch_lightning.accelerators.ddp_accelerator.DDPAccelerator
+    :noindex:
+
+DDP2 Accelerator
+================
+
+.. autoclass:: pytorch_lightning.accelerators.ddp2_accelerator.DDP2Accelerator
+    :noindex:
+
+DDP CPU HPC Accelerator
+=======================
+
+.. autoclass:: pytorch_lightning.accelerators.ddp_cpu_hpc_accelerator.DDPCPUHPCAccelerator
+    :noindex:
+
+DDP CPU Spawn Accelerator
+=========================
+
+.. autoclass:: pytorch_lightning.accelerators.ddp_cpu_spawn_accelerator.DDPCPUSpawnAccelerator
+    :noindex:
+
+DDP HPC Accelerator
+===================
+
+.. autoclass:: pytorch_lightning.accelerators.ddp_hpc_accelerator.DDPHPCAccelerator
+    :noindex:
+
+DDP Spawn Accelerator
+=====================
+
+.. autoclass:: pytorch_lightning.accelerators.ddp_spawn_accelerator.DDPSpawnAccelerator
+    :noindex:
+
+GPU Accelerator
+===============
+
+.. autoclass:: pytorch_lightning.accelerators.gpu_accelerator.GPUAccelerator
+    :noindex:
+
+Horovod Accelerator
+===================
+
+.. autoclass:: pytorch_lightning.accelerators.horovod_accelerator.HorovodAccelerator
+    :noindex:
+
+TPU Accelerator
+===============
+
+.. autoclass:: pytorch_lightning.accelerators.tpu_accelerator.TPUAccelerator
+    :noindex:

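The `ClusterEnvironment` hooks shown above can target schedulers other than torch elastic. Below is a minimal sketch for a hypothetical scheduler; the `MyClusterEnvironment` name and the `MY_*` environment variables are made up for illustration and are not part of this commit.

.. code-block:: python

    import os

    from pytorch_lightning import Trainer
    from pytorch_lightning.cluster_environments.cluster_environment import ClusterEnvironment


    class MyClusterEnvironment(ClusterEnvironment):
        """Hypothetical scheduler that publishes rendezvous info via MY_* variables."""

        def master_address(self):
            # fall back to localhost when the scheduler exports no address
            return os.environ.get("MY_MASTER_ADDR", "127.0.0.1")

        def master_port(self):
            return os.environ.get("MY_MASTER_PORT", "12910")

        def world_size(self):
            return int(os.environ.get("MY_WORLD_SIZE", 1))

        def local_rank(self):
            return int(os.environ.get("MY_LOCAL_RANK", 0))


    # wired up exactly like the torch elastic example above
    trainer = Trainer(plugins=[MyClusterEnvironment()])
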
docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -39,6 +39,7 @@ PyTorch Lightning Documentation
    :name: docs
    :caption: Optional extensions

+   accelerators
    callbacks
    datamodules
    logging

docs/source/lightning_module.rst

Lines changed: 4 additions & 2 deletions
@@ -172,10 +172,11 @@ Under the hood, Lightning does the following (pseudocode):
     model.train()
     torch.set_grad_enabled(True)

-    outs = []
+    losses = []
     for batch in train_dataloader:
         # forward
-        out = training_step(val_batch)
+        loss = training_step(batch)
+        losses.append(loss.detach())

         # backward
         loss.backward()
@@ -184,6 +185,7 @@ Under the hood, Lightning does the following (pseudocode):
         optimizer.step()
         optimizer.zero_grad()

+
 Training epoch-level metrics
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 If you want to calculate epoch-level metrics and log them, use the `.log` method

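For the epoch-level metrics mentioned at the end of this hunk, a minimal sketch of the `.log` usage; the `LitClassifier` model is illustrative and not taken from the docs.

.. code-block:: python

    import torch
    import torch.nn.functional as F
    from pytorch_lightning import LightningModule


    class LitClassifier(LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(28 * 28, 10)

        def forward(self, x):
            return self.layer(x.view(x.size(0), -1))

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = F.cross_entropy(self(x), y)
            # on_epoch=True also aggregates this value across the whole epoch
            self.log('train_loss', loss, on_step=True, on_epoch=True)
            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)
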
docs/source/loggers.rst

Lines changed: 9 additions & 0 deletions
@@ -19,6 +19,15 @@ but you can pass to the :class:`~pytorch_lightning.trainer.trainer.Trainer` any

 Read more about :ref:`logging` options.

+To log arbitrary artifacts like images or audio samples use the `trainer.log_dir` property to resolve
+the path.
+
+.. code-block:: python
+
+    def training_step(self, batch, batch_idx):
+        img = ...
+        log_image(img, self.trainer.log_dir)
+
 Comet.ml
 ========

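`log_image` in the snippet above is only a placeholder. One possible sketch of such a helper, assuming torchvision is available; the helper itself is not part of this commit.

.. code-block:: python

    import os

    import torch
    from torchvision.utils import save_image


    def log_image(img: torch.Tensor, log_dir: str, name: str = "sample.png") -> str:
        """Save a (C, H, W) image tensor with values in [0, 1] under the trainer's log_dir."""
        os.makedirs(log_dir, exist_ok=True)
        path = os.path.join(log_dir, name)
        save_image(img, path)
        return path
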
docs/source/new-project.rst

Lines changed: 7 additions & 1 deletion
@@ -608,7 +608,13 @@ Here's an example adding a not-so-fancy learning rate decay rule:
             new_lr_group.append(new_lr)
             param_group['lr'] = new_lr
         self.old_lrs[opt_idx] = new_lr_group
-
+
+And pass the callback to the Trainer
+
+.. code-block:: python
+
+    decay_callback = DecayLearningRate()
+    trainer = Trainer(callbacks=[decay_callback])

 Things you can do with a callback:

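The body of `DecayLearningRate` sits mostly outside this hunk; below is a sketch consistent with its visible tail, where the decay factor and the `on_epoch_end` hook are assumptions.

.. code-block:: python

    from pytorch_lightning.callbacks import Callback


    class DecayLearningRate(Callback):

        def __init__(self, decay=0.99):
            self.decay = decay
            self.old_lrs = []

        def on_train_start(self, trainer, pl_module):
            # remember the starting learning rate of every param group, per optimizer
            self.old_lrs = [
                [group['lr'] for group in opt.param_groups] for opt in trainer.optimizers
            ]

        def on_epoch_end(self, trainer, pl_module):
            for opt_idx, opt in enumerate(trainer.optimizers):
                new_lr_group = []
                for p_idx, param_group in enumerate(opt.param_groups):
                    new_lr = self.old_lrs[opt_idx][p_idx] * self.decay
                    new_lr_group.append(new_lr)
                    param_group['lr'] = new_lr
                self.old_lrs[opt_idx] = new_lr_group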