
Commit e94d48c

Jeff Yang authored and rohitgr7 committed
[docs] distributed_backend -> accelerator (#4429)
* distributed_backend -> accelerator
* distributed_backend -> accelerator
* use_amp -> precision
* format

Co-authored-by: rohitgr7 <[email protected]>
(cherry picked from commit ebe3a31)
1 parent 7f48c87 commit e94d48c

File tree

7 files changed: +37 -37 lines changed


docs/source/introduction_guide.rst

Lines changed: 1 addition & 1 deletion
@@ -543,7 +543,7 @@ Or multiple nodes
  # (32 GPUs)
  model = LitMNIST()
- trainer = Trainer(gpus=8, num_nodes=4, distributed_backend='ddp')
+ trainer = Trainer(gpus=8, num_nodes=4, accelerator='ddp')
  trainer.fit(model, train_loader)

Refer to the :ref:`distributed computing guide for more details <multi_gpu>`.

docs/source/lightning_module.rst

Lines changed: 2 additions & 2 deletions
@@ -256,7 +256,7 @@ The matching pseudocode is:
Training with DataParallel
~~~~~~~~~~~~~~~~~~~~~~~~~~
- When training using a `distributed_backend` that splits data from each batch across GPUs, sometimes you might
+ When training using an `accelerator` that splits data from each batch across GPUs, sometimes you might
need to aggregate them on the master GPU for processing (dp, or ddp2).

In this case, implement the `training_step_end` method
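For readers landing on this hunk: `training_step_end` is the hook that receives the per-GPU outputs of `training_step` and combines them. A minimal sketch under `accelerator='dp'` or `'ddp2'`; the class name, the linear layer and the dictionary keys are illustrative assumptions, not the file's own example:

.. code-block:: python

    import torch.nn.functional as F
    from torch import nn
    import pytorch_lightning as pl


    class LitClassifier(pl.LightningModule):  # hypothetical module, for illustration only
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(28 * 28, 10)

        def forward(self, x):
            return self.layer(x.view(x.size(0), -1))

        def training_step(self, batch, batch_idx):
            x, y = batch
            # under dp/ddp2 this runs once per GPU, on a slice of the batch
            return {'y_hat': self(x), 'y': y}

        def training_step_end(self, outputs):
            # the gathered per-GPU results arrive here; aggregate on the master GPU
            loss = F.cross_entropy(outputs['y_hat'], outputs['y'])
            return loss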
@@ -360,7 +360,7 @@ If you need to do something with all the outputs of each `validation_step`, over
Validating with DataParallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- When training using a `distributed_backend` that splits data from each batch across GPUs, sometimes you might
+ When training using an `accelerator` that splits data from each batch across GPUs, sometimes you might
need to aggregate them on the master GPU for processing (dp, or ddp2).

In this case, implement the `validation_step_end` method
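The validation path mirrors the training one; a hedged sketch of `validation_step_end` for the same hypothetical module as above, with the accuracy computation being an illustrative choice:

.. code-block:: python

    def validation_step(self, batch, batch_idx):
        x, y = batch
        return {'y_hat': self(x), 'y': y}

    def validation_step_end(self, outputs):
        # gathered per-GPU outputs when using dp/ddp2
        acc = (outputs['y_hat'].argmax(dim=-1) == outputs['y']).float().mean()
        self.log('val_acc', acc)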

docs/source/multi_gpu.rst

Lines changed: 22 additions & 22 deletions
@@ -231,11 +231,11 @@ Distributed modes
-----------------
Lightning allows multiple ways of training

- - Data Parallel (`distributed_backend='dp'`) (multiple-gpus, 1 machine)
- - DistributedDataParallel (`distributed_backend='ddp'`) (multiple-gpus across many machines (python script based)).
- - DistributedDataParallel (`distributed_backend='ddp_spawn'`) (multiple-gpus across many machines (spawn based)).
- - DistributedDataParallel 2 (`distributed_backend='ddp2'`) (DP in a machine, DDP across machines).
- - Horovod (`distributed_backend='horovod'`) (multi-machine, multi-gpu, configured at runtime)
+ - Data Parallel (`accelerator='dp'`) (multiple-gpus, 1 machine)
+ - DistributedDataParallel (`accelerator='ddp'`) (multiple-gpus across many machines (python script based)).
+ - DistributedDataParallel (`accelerator='ddp_spawn'`) (multiple-gpus across many machines (spawn based)).
+ - DistributedDataParallel 2 (`accelerator='ddp2'`) (DP in a machine, DDP across machines).
+ - Horovod (`accelerator='horovod'`) (multi-machine, multi-gpu, configured at runtime)
  - TPUs (`tpu_cores=8|x`) (tpu or TPU pod)

.. note::
@@ -258,7 +258,7 @@ after which the root node will aggregate the results.
:skipif: torch.cuda.device_count() < 2

  # train on 2 GPUs (using DP mode)
- trainer = Trainer(gpus=2, distributed_backend='dp')
+ trainer = Trainer(gpus=2, accelerator='dp')

Distributed Data Parallel
^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -281,10 +281,10 @@ Distributed Data Parallel
.. code-block:: python

  # train on 8 GPUs (same machine (ie: node))
- trainer = Trainer(gpus=8, distributed_backend='ddp')
+ trainer = Trainer(gpus=8, accelerator='ddp')

  # train on 32 GPUs (4 nodes)
- trainer = Trainer(gpus=8, distributed_backend='ddp', num_nodes=4)
+ trainer = Trainer(gpus=8, accelerator='ddp', num_nodes=4)

This Lightning implementation of DDP calls your script under the hood multiple times with the correct environment
variables:
@@ -330,7 +330,7 @@ In this case, we can use DDP2 which behaves like DP in a machine and DDP across
.. code-block:: python

  # train on 32 GPUs (4 nodes)
- trainer = Trainer(gpus=8, distributed_backend='ddp2', num_nodes=4)
+ trainer = Trainer(gpus=8, accelerator='ddp2', num_nodes=4)

Distributed Data Parallel Spawn
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -348,7 +348,7 @@ project module) you can use the following method:
.. code-block:: python

  # train on 8 GPUs (same machine (ie: node))
- trainer = Trainer(gpus=8, distributed_backend='ddp')
+ trainer = Trainer(gpus=8, accelerator='ddp')

We STRONGLY discourage this use because it has limitations (due to Python and PyTorch):
@@ -400,7 +400,7 @@ You can then call your scripts anywhere
.. code-block:: bash

  cd /project/src
- python some_file.py --distributed_backend 'ddp' --gpus 8
+ python some_file.py --accelerator 'ddp' --gpus 8


Horovod
@@ -421,10 +421,10 @@ Horovod can be configured in the training script to run with any number of GPUs
.. code-block:: python

  # train Horovod on GPU (number of GPUs / machines provided on command-line)
- trainer = Trainer(distributed_backend='horovod', gpus=1)
+ trainer = Trainer(accelerator='horovod', gpus=1)

  # train Horovod on CPU (number of processes / machines provided on command-line)
- trainer = Trainer(distributed_backend='horovod')
+ trainer = Trainer(accelerator='horovod')

When starting the training job, the driver application will then be used to specify the total
number of worker processes:
@@ -554,13 +554,13 @@ Below are the possible configurations we support.
  +=======+=========+====+=====+=========+============================================================+
  | Y | | | | | `Trainer(gpus=1)` |
  +-------+---------+----+-----+---------+------------------------------------------------------------+
- | Y | | | | Y | `Trainer(gpus=1, use_amp=True)` |
+ | Y | | | | Y | `Trainer(gpus=1, precision=16)` |
  +-------+---------+----+-----+---------+------------------------------------------------------------+
- | | Y | Y | | | `Trainer(gpus=k, distributed_backend='dp')` |
+ | | Y | Y | | | `Trainer(gpus=k, accelerator='dp')` |
  +-------+---------+----+-----+---------+------------------------------------------------------------+
- | | Y | | Y | | `Trainer(gpus=k, distributed_backend='ddp')` |
+ | | Y | | Y | | `Trainer(gpus=k, accelerator='ddp')` |
  +-------+---------+----+-----+---------+------------------------------------------------------------+
- | | Y | | Y | Y | `Trainer(gpus=k, distributed_backend='ddp', use_amp=True)` |
+ | | Y | | Y | Y | `Trainer(gpus=k, accelerator='ddp', precision=16)` |
  +-------+---------+----+-----+---------+------------------------------------------------------------+
@@ -590,10 +590,10 @@ In (DDP, Horovod) your effective batch size will be 7 * gpus * num_nodes.
.. code-block:: python

  # effective batch size = 7 * 8
- Trainer(gpus=8, distributed_backend='ddp|horovod')
+ Trainer(gpus=8, accelerator='ddp|horovod')

  # effective batch size = 7 * 8 * 10
- Trainer(gpus=8, num_nodes=10, distributed_backend='ddp|horovod')
+ Trainer(gpus=8, num_nodes=10, accelerator='ddp|horovod')


In DDP2, your effective batch size will be 7 * num_nodes.
@@ -602,10 +602,10 @@ The reason is that the full batch is visible to all GPUs on the node when using
.. code-block:: python

  # effective batch size = 7
- Trainer(gpus=8, distributed_backend='ddp2')
+ Trainer(gpus=8, accelerator='ddp2')

  # effective batch size = 7 * 10
- Trainer(gpus=8, num_nodes=10, distributed_backend='ddp2')
+ Trainer(gpus=8, num_nodes=10, accelerator='ddp2')


.. note:: Huge batch sizes are actually really bad for convergence. Check out:
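To make the batch-size arithmetic in these two hunks concrete, a small helper; it is not part of the changed docs, and the per-process batch size of 7 comes from the surrounding example:

.. code-block:: python

    def effective_batch_size(batch_size, gpus=1, num_nodes=1, accelerator=None):
        # Rule of thumb matching the examples above (illustrative only).
        if accelerator in ('ddp', 'ddp_spawn', 'horovod'):
            # every GPU process gets its own batch of `batch_size`
            return batch_size * gpus * num_nodes
        if accelerator == 'ddp2':
            # DP within a node: the node-level batch is split across its GPUs
            return batch_size * num_nodes
        return batch_size  # single device, or dp on one machine


    assert effective_batch_size(7, gpus=8, num_nodes=10, accelerator='ddp') == 560
    assert effective_batch_size(7, gpus=8, num_nodes=10, accelerator='ddp2') == 70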
@@ -619,7 +619,7 @@ Lightning supports the use of PytorchElastic to enable fault-tolerent and elasti
.. code-block:: python

- Trainer(gpus=8, distributed_backend='ddp')
+ Trainer(gpus=8, accelerator='ddp')


Following the `PytorchElastic Quickstart documentation <https://pytorch.org/elastic/latest/quickstart.html>`_, you then need to start a single-node etcd server on one of the hosts:

docs/source/performance.rst

Lines changed: 2 additions & 2 deletions
@@ -33,9 +33,9 @@ The best thing to do is to increase the ``num_workers`` slowly and stop once you
Spawn
^^^^^
- When using ``distributed_backend=ddp_spawn`` (the ddp default) or TPU training, the way multiple GPUs/TPU cores are used is by calling ``.spawn()`` under the hood.
+ When using ``accelerator=ddp_spawn`` (the ddp default) or TPU training, the way multiple GPUs/TPU cores are used is by calling ``.spawn()`` under the hood.
The problem is that PyTorch has issues with ``num_workers > 0`` when using ``.spawn()``. For this reason we recommend you
- use ``distributed_backend=ddp`` so you can increase the ``num_workers``, however your script has to be callable like so:
+ use ``accelerator=ddp`` so you can increase the ``num_workers``, however your script has to be callable like so:

.. code-block:: bash
docs/source/slurm.rst

Lines changed: 2 additions & 2 deletions
@@ -24,7 +24,7 @@ To train a model using multiple nodes, do the following:
.. code-block:: python

  # train on 32 GPUs across 4 nodes
- trainer = Trainer(gpus=8, num_nodes=4, distributed_backend='ddp')
+ trainer = Trainer(gpus=8, num_nodes=4, accelerator='ddp')

3. It's a good idea to structure your training script like this:
@@ -37,7 +37,7 @@ To train a model using multiple nodes, do the following:
  trainer = pl.Trainer(
      gpus=8,
      num_nodes=4,
-     distributed_backend='ddp'
+     accelerator='ddp'
  )

  trainer.fit(model)
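For step 3 ("structure your training script like this"), a hedged, self-contained sketch of such a script; the module name `MyLightningModule` and the argparse wiring are illustrative assumptions rather than the file's own example:

.. code-block:: python

    # train.py
    from argparse import ArgumentParser

    import pytorch_lightning as pl

    from my_project import MyLightningModule  # hypothetical import


    def main(args):
        model = MyLightningModule(args)

        trainer = pl.Trainer(
            gpus=8,
            num_nodes=4,
            accelerator='ddp'
        )

        trainer.fit(model)


    if __name__ == '__main__':
        parser = ArgumentParser()
        parser.add_argument('--learning_rate', type=float, default=1e-3)
        main(parser.parse_args())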

docs/source/tpu.rst

Lines changed: 1 addition & 1 deletion
@@ -140,7 +140,7 @@ Lightning supports training on a single TPU core. Just pass the TPU core ID [1-8
Distributed Backend with TPU
----------------------------
- The ```distributed_backend``` option used for GPUs does not apply to TPUs.
+ The ``accelerator`` option used for GPUs does not apply to TPUs.
TPUs work in DDP mode by default (distributing over each core)

----------------
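In other words, TPUs are selected through `tpu_cores` and the `accelerator` flag is simply not used; a minimal sketch, assuming `model` is an existing LightningModule:

.. code-block:: python

    import pytorch_lightning as pl

    # 8 cores of a single TPU; a list such as [1] picks one specific core
    trainer = pl.Trainer(tpu_cores=8)
    trainer.fit(model)  # `model` is your existing LightningModule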

pytorch_lightning/trainer/__init__.py

Lines changed: 7 additions & 7 deletions
@@ -203,18 +203,18 @@ def forward(self, x):
.. testcode::

  # default used by the Trainer
- trainer = Trainer(distributed_backend=None)
+ trainer = Trainer(accelerator=None)

Example::

  # dp = DataParallel
- trainer = Trainer(gpus=2, distributed_backend='dp')
+ trainer = Trainer(gpus=2, accelerator='dp')

  # ddp = DistributedDataParallel
- trainer = Trainer(gpus=2, num_nodes=2, distributed_backend='ddp')
+ trainer = Trainer(gpus=2, num_nodes=2, accelerator='ddp')

  # ddp2 = DistributedDataParallel + dp
- trainer = Trainer(gpus=2, num_nodes=2, distributed_backend='ddp2')
+ trainer = Trainer(gpus=2, num_nodes=2, accelerator='ddp2')

.. note:: this option does not apply to TPU. TPUs use ```ddp``` by default (over each core)
@@ -948,16 +948,16 @@ def on_train_end(self, trainer, pl_module):
|

Number of processes to train with. Automatically set to the number of GPUs
- when using ``distrbuted_backend="ddp"``. Set to a number greater than 1 when
- using ``distributed_backend="ddp_cpu"`` to mimic distributed training on a
+ when using ``accelerator="ddp"``. Set to a number greater than 1 when
+ using ``accelerator="ddp_cpu"`` to mimic distributed training on a
machine without GPUs. This is useful for debugging, but **will not** provide
any speedup, since single-process Torch already makes efficient use of multiple
CPUs.

.. testcode::

  # Simulate DDP for debugging on your GPU-less laptop
- trainer = Trainer(distributed_backend="ddp_cpu", num_processes=2)
+ trainer = Trainer(accelerator="ddp_cpu", num_processes=2)

num_sanity_val_steps
^^^^^^^^^^^^^^^^^^^^
