
Commit 6d7c01b

areshytko, oreshytko, tchaton, Borda, and awaelchli committed
[docs] Add docs for non-SLURM cluster setup (#5754)
* Add docs for non-slurm cluster setup

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <[email protected]>

* Update docs/source/cluster.rst

Co-authored-by: Jirka Borovec <[email protected]>

* Update docs/source/cluster.rst

Co-authored-by: Alexander <[email protected]>
Co-authored-by: chaton <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Adrian Wälchli <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Rohit Gupta <[email protected]>
1 parent 834f4bb commit 6d7c01b

File tree

4 files changed: +243 -14 lines changed


docs/source/accelerators.rst

Lines changed: 185 additions & 0 deletions
@@ -0,0 +1,185 @@
.. _accelerators:

############
Accelerators
############
Accelerators connect a Lightning Trainer to arbitrary hardware (CPUs, GPUs, TPUs, etc.). Accelerators
also manage the distributed training modes used on that hardware (such as DP, DDP, or an HPC cluster).

Accelerators can also be configured to run on arbitrary clusters using Plugins, or to link up to arbitrary
computational strategies such as 16-bit precision via AMP and Apex.

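For most users an accelerator is selected through Trainer flags rather than subclassed directly. A minimal sketch combining the distributed and precision options mentioned above, using the flag values accepted by the Trainer at the time of this commit:

.. code-block:: python

    from pytorch_lightning import Trainer

    # DDP across 2 GPUs with 16-bit (AMP) precision
    trainer = Trainer(gpus=2, accelerator='ddp', precision=16)
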
----------

******************************
Implement a custom accelerator
******************************
To link up arbitrary hardware, implement your own Accelerator subclass:

.. code-block:: python

    from typing import Any, Optional, Union

    import torch
    from torch.distributed import ReduceOp

    from pytorch_lightning.accelerators.accelerator import Accelerator


    class MyAccelerator(Accelerator):

        def __init__(self, trainer, cluster_environment=None):
            super().__init__(trainer, cluster_environment)
            self.nickname = 'my_accelerator'

        def setup(self):
            # find local rank, etc., custom things to implement
            ...

        def train(self):
            # implement what happens during training
            ...

        def training_step(self):
            # implement how to do a training_step on this accelerator
            ...

        def validation_step(self):
            # implement how to do a validation_step on this accelerator
            ...

        def test_step(self):
            # implement how to do a test_step on this accelerator
            ...

        def backward(self, closure_loss, optimizer, opt_idx, *args, **kwargs):
            # implement how to do a backward pass with this accelerator
            ...

        def barrier(self, name: Optional[str] = None):
            # implement this accelerator's barrier
            ...

        def broadcast(self, obj, src=0):
            # implement this accelerator's broadcast function
            ...

        def sync_tensor(self,
                        tensor: Union[torch.Tensor],
                        group: Optional[Any] = None,
                        reduce_op: Optional[Union[ReduceOp, str]] = None) -> torch.Tensor:
            # implement how to sync tensors when reducing metrics across accelerators
            ...

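As a concrete illustration, a ``torch.distributed``-backed subclass of the ``MyAccelerator`` skeleton above might fill in ``barrier`` and ``sync_tensor`` roughly as follows. This is only a sketch built on the standard ``torch.distributed`` collectives, not the implementation shipped with Lightning:

.. code-block:: python

    from typing import Any, Optional, Union

    import torch
    import torch.distributed as dist
    from torch.distributed import ReduceOp


    class MyDistributedAccelerator(MyAccelerator):

        def barrier(self, name: Optional[str] = None):
            # wait for every process in the default process group
            if dist.is_available() and dist.is_initialized():
                dist.barrier()

        def sync_tensor(self,
                        tensor: torch.Tensor,
                        group: Optional[Any] = None,
                        reduce_op: Optional[Union[ReduceOp, str]] = None) -> torch.Tensor:
            # sum-reduce across all processes unless a ReduceOp is given explicitly
            if dist.is_available() and dist.is_initialized():
                op = reduce_op if isinstance(reduce_op, ReduceOp) else ReduceOp.SUM
                dist.all_reduce(tensor, op=op, group=group)
            return tensor
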
********
Examples
********
The following examples illustrate customizing accelerators.

Example 1: Arbitrary HPC cluster
================================
To link any accelerator with an arbitrary cluster (SLURM, Condor, etc.), pass in a Cluster Plugin, which the Trainer
will hand to the accelerator.

First, implement your own ClusterEnvironment. Here is the Torch Elastic implementation.

.. code-block:: python

    import os

    from pytorch_lightning import _logger as log
    from pytorch_lightning.utilities import rank_zero_warn
    from pytorch_lightning.cluster_environments.cluster_environment import ClusterEnvironment


    class TorchElasticEnvironment(ClusterEnvironment):

        def __init__(self):
            super().__init__()

        def master_address(self):
            if "MASTER_ADDR" not in os.environ:
                rank_zero_warn(
                    "MASTER_ADDR environment variable is not defined. Set as localhost"
                )
                os.environ["MASTER_ADDR"] = "127.0.0.1"
            log.debug(f"MASTER_ADDR: {os.environ['MASTER_ADDR']}")
            master_address = os.environ.get('MASTER_ADDR')
            return master_address

        def master_port(self):
            if "MASTER_PORT" not in os.environ:
                rank_zero_warn(
                    "MASTER_PORT environment variable is not defined. Set as 12910"
                )
                os.environ["MASTER_PORT"] = "12910"
            log.debug(f"MASTER_PORT: {os.environ['MASTER_PORT']}")

            port = os.environ.get('MASTER_PORT')
            return port

        def world_size(self):
            return os.environ.get('WORLD_SIZE')

        def local_rank(self):
            return int(os.environ['LOCAL_RANK'])

Now, pass it into the Trainer, which will use Torch Elastic across your accelerator of choice.

.. code-block:: python

    cluster = TorchElasticEnvironment()
    accelerator = MyAccelerator()
    trainer = Trainer(plugins=[cluster], accelerator=accelerator)

In this example, MyAccelerator can target arbitrary hardware (like IPUs or TPUs) and link it to an arbitrary
compute cluster.

------------

**********************
Available Accelerators
**********************

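Each of the accelerators below is normally selected through the Trainer rather than instantiated by hand. A minimal sketch using string values accepted by the ``accelerator`` flag at the time of this commit:

.. code-block:: python

    from pytorch_lightning import Trainer

    # multi-GPU training with DistributedDataParallel, one spawned process per GPU
    trainer = Trainer(gpus=4, accelerator='ddp_spawn')

    # CPU-only distributed training with several processes on a single machine
    trainer = Trainer(num_processes=4, accelerator='ddp_cpu')

    # Horovod-based training (requires Horovod to be installed)
    trainer = Trainer(gpus=1, accelerator='horovod')
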
CPU Accelerator
===============

.. autoclass:: pytorch_lightning.accelerators.cpu_accelerator.CPUAccelerator
    :noindex:

DDP Accelerator
===============

.. autoclass:: pytorch_lightning.accelerators.ddp_accelerator.DDPAccelerator
    :noindex:

DDP2 Accelerator
================

.. autoclass:: pytorch_lightning.accelerators.ddp2_accelerator.DDP2Accelerator
    :noindex:

DDP CPU HPC Accelerator
=======================

.. autoclass:: pytorch_lightning.accelerators.ddp_cpu_hpc_accelerator.DDPCPUHPCAccelerator
    :noindex:

DDP CPU Spawn Accelerator
=========================

.. autoclass:: pytorch_lightning.accelerators.ddp_cpu_spawn_accelerator.DDPCPUSpawnAccelerator
    :noindex:

DDP HPC Accelerator
===================

.. autoclass:: pytorch_lightning.accelerators.ddp_hpc_accelerator.DDPHPCAccelerator
    :noindex:

DDP Spawn Accelerator
=====================

.. autoclass:: pytorch_lightning.accelerators.ddp_spawn_accelerator.DDPSpawnAccelerator
    :noindex:

GPU Accelerator
===============

.. autoclass:: pytorch_lightning.accelerators.gpu_accelerator.GPUAccelerator
    :noindex:

Horovod Accelerator
===================

.. autoclass:: pytorch_lightning.accelerators.horovod_accelerator.HorovodAccelerator
    :noindex:

TPU Accelerator
===============

.. autoclass:: pytorch_lightning.accelerators.tpu_accelerator.TPUAccelerator
    :noindex:

docs/source/advanced/cluster.rst

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
.. _non-slurm:

Computing cluster
=================

With Lightning it is easy to run your training script on a computing cluster with almost no modifications to the script.
This guide shows how to run a training job on a general-purpose cluster.

Also, check :ref:`accelerators` for a newer and more general approach to cluster setup.

--------


Cluster setup
-------------

To set up a multi-node computing cluster you need:

1) Multiple computers with PyTorch Lightning installed
2) Network connectivity between them, with firewall rules that allow traffic flow on a specified *MASTER_PORT*
3) The environment variables required for PyTorch Lightning multi-node distributed training defined on each node

PyTorch Lightning follows the design of the `PyTorch distributed communication package <https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization>`_ and requires the following environment variables to be defined on each node:

- *MASTER_PORT* - required; must be a free port on the machine with NODE_RANK 0
- *MASTER_ADDR* - required (except on the NODE_RANK 0 node); address of the NODE_RANK 0 node
- *WORLD_SIZE* - required; the number of nodes in the cluster
- *NODE_RANK* - required; the ID of the node within the cluster

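Since a typo in any of these variables usually surfaces as a hang at startup, it can help to fail fast. A small illustrative check (not part of Lightning) that could run at the top of the training script:

.. code-block:: python

    import os

    # illustrative sanity check for the variables listed above;
    # MASTER_ADDR may be omitted on the NODE_RANK 0 machine itself
    required = ["MASTER_PORT", "WORLD_SIZE", "NODE_RANK"]
    if os.environ.get("NODE_RANK", "0") != "0":
        required.append("MASTER_ADDR")
    missing = [name for name in required if name not in os.environ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
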
Training script design
----------------------

To train a model using multiple nodes, do the following:

1. Design your :ref:`lightning_module` (no need to add anything specific here).

2. Enable DDP in the Trainer:

.. code-block:: python

    # train on 32 GPUs across 4 nodes
    trainer = Trainer(gpus=8, num_nodes=4, accelerator='ddp')

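Putting the two steps together, a multi-node training script can stay as small as a single-machine one. A minimal sketch, assuming a hypothetical ``MyLightningModule`` defined elsewhere in your project:

.. code-block:: python

    # train.py - run unchanged on every node of the cluster
    from pytorch_lightning import Trainer

    from my_project import MyLightningModule  # hypothetical LightningModule


    def main():
        model = MyLightningModule()

        # 8 GPUs per node on 4 nodes -> 32 processes in total
        trainer = Trainer(gpus=8, num_nodes=4, accelerator='ddp')
        trainer.fit(model)


    if __name__ == '__main__':
        main()
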
Submit a job to the cluster
---------------------------

To submit a training job to the cluster, you need to run the same training script on each node of the cluster.
This means that you need to:

1. Copy all third-party libraries to each node (usually this means distributing a requirements.txt file and installing it).

2. Copy all your import dependencies and the script itself to each node.

3. Run the script on each node.

docs/source/conf.py

Lines changed: 0 additions & 14 deletions
@@ -369,23 +369,11 @@ def package_list_from_file(file):
 # only run doctests marked with a ".. doctest::" directive
 doctest_test_doctest_blocks = ''
 doctest_global_setup = """
-
 import importlib
 import os
 import torch
 from torch import nn
-
 import pytorch_lightning as pl
-<<<<<<< HEAD
-from pytorch_lightning import LightningDataModule, LightningModule, Trainer
-from pytorch_lightning.utilities import (
-    _NATIVE_AMP_AVAILABLE,
-    _APEX_AVAILABLE,
-    _XLA_AVAILABLE,
-    _TPU_AVAILABLE,
-)
-_TORCHVISION_AVAILABLE = importlib.util.find_spec("torchvision") is not None
-=======
 from pytorch_lightning import LightningModule, Trainer
 from pytorch_lightning.utilities import (
     NATIVE_AMP_AVAILABLE,
@@ -395,7 +383,5 @@ def package_list_from_file(file):
     _module_available,
 )
 TORCHVISION_AVAILABLE = _module_available("torchvision")
->>>>>>> d71659b4 (Fix docs typo (#4930))
-
 """
 coverage_skip_undoc_in_source = True

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -113,6 +113,7 @@ PyTorch Lightning Documentation
    advanced/training_tricks
    advanced/transfer_learning
    advanced/tpu
+   advanced/cluster
    common/test_set
    common/production_inference
