
Commit 7ac8660

Merge branch 'master' into refactor/legacy-ddp
2 parents: 9c04996 + 7da931d

82 files changed (+1821 / -431 lines)

.azure-pipelines/gpu-tests.yml

Lines changed: 1 addition & 0 deletions
@@ -52,6 +52,7 @@ jobs:
 python -c "fname = 'requirements/extra.txt' ; lines = [line for line in open(fname).readlines() if 'horovod' not in line] ; open(fname, 'w').writelines(lines)"
 pip install fairscale==0.4.0
 pip install deepspeed==0.5.7
+pip install bagua-cuda102==0.9.0
 pip install . --requirement requirements/devel.txt
 pip list
 displayName: 'Install dependencies'
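
As a quick, hypothetical sanity check for this CI change (not part of the commit), the pinned distributed-training backends can be imported right after the install step above; the assumption here is that ``bagua-cuda102`` provides the ``bagua`` package, as the docs below suggest.

.. code-block:: python

    # Hypothetical post-install check, assuming the CI step above succeeded.
    import bagua.torch_api  # noqa: F401  (installed via `pip install bagua-cuda102==0.9.0`)
    import deepspeed
    import fairscale

    print("deepspeed", deepspeed.__version__)
    print("fairscale", fairscale.__version__)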

CHANGELOG.md

Lines changed: 76 additions & 7 deletions
@@ -60,12 +60,21 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added `console_kwargs` for `RichProgressBar` to initialize inner Console ([#10875](https://github.com/PyTorchLightning/pytorch-lightning/pull/10875))


+- Added support for shorthand notation to instantiate loggers with the `LightningCLI` ([#11533](https://github.com/PyTorchLightning/pytorch-lightning/pull/11533))
+
+
+- Added a `LOGGER_REGISTRY` instance to register custom loggers to the `LightningCLI` ([#11533](https://github.com/PyTorchLightning/pytorch-lightning/pull/11533))
+
+
 - Added a `PrecisionPlugin.teardown` method ([#10990](https://github.com/PyTorchLightning/pytorch-lightning/pull/10990))


 - Added `LightningModule.lr_scheduler_step` ([#10249](https://github.com/PyTorchLightning/pytorch-lightning/pull/10249))


+- Added support for no pre-fetching to `DataFetcher` ([#11606](https://github.com/PyTorchLightning/pytorch-lightning/pull/11606))
+
+
 - Added `opt_idx` to scheduler config if not assigned by user ([#11247](https://github.com/PyTorchLightning/pytorch-lightning/pull/11247))


@@ -74,9 +83,20 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

 - Added a `MisconfigurationException` if user provided `opt_idx` in scheduler config doesn't match with actual optimizer index of its respective optimizer ([#11247](https://github.com/PyTorchLightning/pytorch-lightning/pull/11247))

+- Added support for DDP when using a `CombinedLoader` for the training data ([#11648](https://github.com/PyTorchLightning/pytorch-lightning/pull/11648))
+
+
+- Added a warning when using `DistributedSampler` during validation/testing ([#11479](https://github.com/PyTorchLightning/pytorch-lightning/pull/11479))
+
+
+- Added support for `Bagua` training strategy ([#11146](https://github.com/PyTorchLightning/pytorch-lightning/pull/11146))
+

 ### Changed

+- Implemented a new native and rich format in `_print_results` method of the `EvaluationLoop` ([#11332](https://github.com/PyTorchLightning/pytorch-lightning/pull/11332))
+
+
 - Set the `prog_bar` flag to False in `LightningModule.log_grad_norm` ([#11472](https://github.com/PyTorchLightning/pytorch-lightning/pull/11472))


@@ -162,9 +182,6 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Removed duplicated file extension when uploading model checkpoints with `NeptuneLogger` ([#11015](https://github.com/PyTorchLightning/pytorch-lightning/pull/11015))


-- Changed `LSFEnvironment` to use `LSB_DJOB_RANKFILE` environment variable instead of `LSB_HOSTS` for determining node rank and main address ([#10825](https://github.com/PyTorchLightning/pytorch-lightning/pull/10825))
-
-
 - Removed `__getstate__` and `__setstate__` of `RichProgressBar` ([#11100](https://github.com/PyTorchLightning/pytorch-lightning/pull/11100))


@@ -195,6 +212,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Marked the `ResultCollection`, `ResultMetric`, and `ResultMetricCollection` classes as protected ([#11130](https://github.com/PyTorchLightning/pytorch-lightning/pull/11130))


+- Marked `trainer.checkpoint_connector` as protected ([#11550](https://github.com/PyTorchLightning/pytorch-lightning/pull/11550))
+
+
 - The epoch start/end hooks are now called by the `FitLoop` instead of the `TrainingEpochLoop` ([#11201](https://github.com/PyTorchLightning/pytorch-lightning/pull/11201))


@@ -224,10 +244,18 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

 - Changed `MisconfigurationException` to `ModuleNotFoundError` when `rich` isn't available ([#11360](https://github.com/PyTorchLightning/pytorch-lightning/pull/11360))

+- Inherit from `ABC` for `Accelerator`: Users need to implement `auto_device_count` ([#11521](https://github.com/PyTorchLightning/pytorch-lightning/pull/11521))
+
+
+- Changed `parallel_devices` property in `ParallelStrategy` to be lazy initialized ([#11572](https://github.com/PyTorchLightning/pytorch-lightning/pull/11572))
+

 - Sorted `SimpleProfiler(extended=False)` summary based on mean duration for each hook ([#11671](https://github.com/PyTorchLightning/pytorch-lightning/pull/11671))


+- Avoid enforcing `shuffle=False` for eval dataloaders ([#11575](https://github.com/PyTorchLightning/pytorch-lightning/pull/11575))
+
+
 ### Deprecated

 - Deprecated `ClusterEnvironment.master_{address,port}` in favor of `ClusterEnvironment.main_{address,port}` ([#10103](https://github.com/PyTorchLightning/pytorch-lightning/pull/10103))
@@ -287,6 +315,12 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Deprecated function `pytorch_lightning.callbacks.device_stats_monitor.prefix_metric_keys` ([#11254](https://github.com/PyTorchLightning/pytorch-lightning/pull/11254))


+- Deprecated `on_batch_start` and `on_batch_end` callback hooks in favor of `on_train_batch_start` and `on_train_batch_end` ([#11577](https://github.com/PyTorchLightning/pytorch-lightning/pull/11577))
+
+
+- Deprecated `on_configure_sharded_model` callback hook in favor of `setup` ([#11627](https://github.com/PyTorchLightning/pytorch-lightning/pull/11627))
+
+
 ### Removed

 - Removed deprecated parameter `method` in `pytorch_lightning.utilities.model_helpers.is_overridden` ([#10507](https://github.com/PyTorchLightning/pytorch-lightning/pull/10507))
@@ -454,6 +488,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Fixed wrong typehint for `Trainer.lightning_optimizers` ([#11155](https://github.com/PyTorchLightning/pytorch-lightning/pull/11155))


+- Fixed the lr-scheduler state not being dumped to checkpoint when using the deepspeed strategy ([#11307](https://github.com/PyTorchLightning/pytorch-lightning/pull/11307))
+
+
 - Fixed bug where the path for "last" checkpoints was not getting saved correctly which caused newer runs to not remove the previous "last" checkpoint ([#11481](https://github.com/PyTorchLightning/pytorch-lightning/pull/11481))


@@ -469,21 +506,53 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Fixed type promotion when tensors of higher category than float are logged ([#11401](https://github.com/PyTorchLightning/pytorch-lightning/pull/11401))


-- Fixed the lr-scheduler state not being dumped to checkpoint when using the deepspeed strategy ([#11307](https://github.com/PyTorchLightning/pytorch-lightning/pull/11307))
+- Fixed `SimpleProfiler` summary ([#11414](https://github.com/PyTorchLightning/pytorch-lightning/pull/11414))


-- Fixed `SimpleProfiler` summary ([#11414](https://github.com/PyTorchLightning/pytorch-lightning/pull/11414))
+- Fixed bug where progress bar was not being disabled when not in rank zero during predict ([#11377](https://github.com/PyTorchLightning/pytorch-lightning/pull/11377))
+
+
+- Fixed an issue to avoid val bar disappear after `trainer.validate()` ([#11700](https://github.com/PyTorchLightning/pytorch-lightning/pull/11700))
+
+
+- Fixed the mid-epoch warning call while resuming training ([#11556](https://github.com/PyTorchLightning/pytorch-lightning/pull/11556))
+
+
+- Fixed an issue in `RichProgressbar` to display the metrics logged only on main progress bar ([#11690](https://github.com/PyTorchLightning/pytorch-lightning/pull/11690))
+
+
+- Fixed `RichProgressBar` progress when refresh rate does not evenly divide the total counter ([#11668](https://github.com/PyTorchLightning/pytorch-lightning/pull/11668))
+

+- Fixed `RichProgressBar` progress validation bar total when using multiple validation runs within a single training epoch ([#11668](https://github.com/PyTorchLightning/pytorch-lightning/pull/11668))

-- Disbled sampler replacement when using `IterableDataset` ([#11507](https://github.com/PyTorchLightning/pytorch-lightning/pull/11507))

+- The `RichProgressBar` now correctly shows the `on_epoch` logged values on train epoch end ([#11689](https://github.com/PyTorchLightning/pytorch-lightning/pull/11689))

-- The Rich progress bar now correctly shows the `on_epoch` logged values on train epoch end ([#11689](https://github.com/PyTorchLightning/pytorch-lightning/pull/11689))
+
+- Fixed check for available modules ([#11526](https://github.com/PyTorchLightning/pytorch-lightning/pull/11526))


 - Fixed an issue to avoid validation loop run on restart ([#11552](https://github.com/PyTorchLightning/pytorch-lightning/pull/11552))


+- Fixed an issue to make the `step` argument in `WandbLogger.log_image` work ([#11716](https://github.com/PyTorchLightning/pytorch-lightning/pull/11716))
+
+
+## [1.5.9] - 2022-01-20
+
+### Fixed
+
+- Pinned sphinx-autodoc-typehints with <v1.15 ([#11400](https://github.com/PyTorchLightning/pytorch-lightning/pull/11400))
+- Skipped testing with PyTorch 1.7 and Python 3.9 on Ubuntu ([#11217](https://github.com/PyTorchLightning/pytorch-lightning/pull/11217))
+- Fixed type promotion when tensors of higher category than float are logged ([#11401](https://github.com/PyTorchLightning/pytorch-lightning/pull/11401))
+- Fixed the format of the configuration saved automatically by the CLI's `SaveConfigCallback` ([#11532](https://github.com/PyTorchLightning/pytorch-lightning/pull/11532))
+
+### Changed
+- Changed `LSFEnvironment` to use `LSB_DJOB_RANKFILE` environment variable instead of `LSB_HOSTS` for determining node rank and main address ([#10825](https://github.com/PyTorchLightning/pytorch-lightning/pull/10825))
+- Disabled sampler replacement when using `IterableDataset` ([#11507](https://github.com/PyTorchLightning/pytorch-lightning/pull/11507))
+
+
 ## [1.5.8] - 2022-01-05

 ### Fixed
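
For context on the deprecation entry for `on_batch_start`/`on_batch_end` ([#11577]), a minimal migration sketch follows. It is illustrative only and not part of the commit; the hook signatures are simplified, and extra positional arguments are absorbed defensively since they vary slightly across Lightning versions.

.. code-block:: python

    # Illustrative migration off the deprecated `on_batch_start`/`on_batch_end`
    # callback hooks onto the train-specific hooks (simplified signatures).
    import time

    import pytorch_lightning as pl


    class BatchTimer(pl.Callback):
        def on_train_batch_start(self, trainer, pl_module, batch, batch_idx, *args):
            self._start = time.monotonic()

        def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, *args):
            pl_module.log("batch_time_sec", time.monotonic() - self._start)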

docs/source/accelerators/gpu.rst

Lines changed: 115 additions & 0 deletions
@@ -282,6 +282,7 @@ Lightning allows multiple ways of training
 - DistributedDataParallel (``strategy='ddp_spawn'``) (multiple-gpus across many machines (spawn based)).
 - DistributedDataParallel 2 (``strategy='ddp2'``) (DP in a machine, DDP across machines).
 - Horovod (``strategy='horovod'``) (multi-machine, multi-gpu, configured at runtime)
+- Bagua (``strategy='bagua'``) (multiple-gpus across many machines with advanced training algorithms)
 - TPUs (``tpu_cores=8|x``) (tpu or TPU pod)

 .. note::
@@ -489,6 +490,120 @@ number of worker processes:
 See the official `Horovod documentation <https://horovod.readthedocs.io/en/stable>`_ for details
 on installation and performance tuning.

+
+Bagua
+^^^^^
+`Bagua <https://github.com/BaguaSys/bagua>`_ is a deep learning training acceleration framework which supports
+multiple advanced distributed training algorithms, including:
+
+- `Gradient AllReduce <https://tutorials.baguasys.com/algorithms/gradient-allreduce>`_ for centralized synchronous communication, where gradients are averaged among all workers.
+- `Decentralized SGD <https://tutorials.baguasys.com/algorithms/decentralized>`_ for decentralized synchronous communication, where each worker exchanges data with one or a few specific workers.
+- `ByteGrad <https://tutorials.baguasys.com/algorithms/bytegrad>`_ and `QAdam <https://tutorials.baguasys.com/algorithms/q-adam>`_ for low precision communication, where data is compressed into low precision before communication.
+- `Asynchronous Model Average <https://tutorials.baguasys.com/algorithms/async-model-average>`_ for asynchronous communication, where workers are not required to be synchronized in the same iteration in a lock-step style.
+
+By default, Bagua uses the *Gradient AllReduce* algorithm, which is also the algorithm implemented in Distributed Data Parallel and Horovod,
+but Bagua can usually produce a higher training throughput due to its backend written in Rust.
+
+.. code-block:: python
+
+    # train on 4 GPUs (using Bagua mode)
+    trainer = Trainer(strategy="bagua", accelerator="gpu", devices=4)
+
+
+By specifying the ``algorithm`` in the ``BaguaStrategy``, you can select more advanced training algorithms featured by Bagua:
+
+
+.. code-block:: python
+
+    # train on 4 GPUs, using the Bagua Gradient AllReduce algorithm
+    trainer = Trainer(
+        strategy=BaguaStrategy(algorithm="gradient_allreduce"),
+        accelerator="gpu",
+        devices=4,
+    )
+
+    # train on 4 GPUs, using the Bagua ByteGrad algorithm
+    trainer = Trainer(
+        strategy=BaguaStrategy(algorithm="bytegrad"),
+        accelerator="gpu",
+        devices=4,
+    )
+
+    # train on 4 GPUs, using Bagua Decentralized SGD
+    trainer = Trainer(
+        strategy=BaguaStrategy(algorithm="decentralized"),
+        accelerator="gpu",
+        devices=4,
+    )
+
+    # train on 4 GPUs, using Bagua Low Precision Decentralized SGD
+    trainer = Trainer(
+        strategy=BaguaStrategy(algorithm="low_precision_decentralized"),
+        accelerator="gpu",
+        devices=4,
+    )
+
+    # train on 4 GPUs, using the Asynchronous Model Average algorithm, with a synchronization interval of 100ms
+    trainer = Trainer(
+        strategy=BaguaStrategy(algorithm="async", sync_interval_ms=100),
+        accelerator="gpu",
+        devices=4,
+    )
+
+To use *QAdam*, we need to initialize
+`QAdamOptimizer <https://bagua.readthedocs.io/en/latest/autoapi/bagua/torch_api/algorithms/q_adam/index.html#bagua.torch_api.algorithms.q_adam.QAdamOptimizer>`_ first:
+
+.. code-block:: python
+
+    from pytorch_lightning.strategies import BaguaStrategy
+    from bagua.torch_api.algorithms.q_adam import QAdamOptimizer
+
+
+    class MyModel(pl.LightningModule):
+        ...
+
+        def configure_optimizers(self):
+            # initialize QAdam Optimizer
+            return QAdamOptimizer(self.parameters(), lr=0.05, warmup_steps=100)
+
+
+    model = MyModel()
+    trainer = Trainer(
+        accelerator="gpu",
+        devices=4,
+        strategy=BaguaStrategy(algorithm="qadam"),
+    )
+    trainer.fit(model)
+
+Bagua relies on its own `launcher <https://tutorials.baguasys.com/getting-started/#launch-job>`_ to schedule jobs.
+Below, find examples using ``bagua.distributed.launch``, which follows the ``torch.distributed.launch`` API:
+
+.. code-block:: bash
+
+    # start training with 8 GPUs on a single node
+    python -m bagua.distributed.launch --nproc_per_node=8 train.py
+
+If an SSH service with passwordless login is available on each node, you can launch the distributed job from a
+single node with ``baguarun``, which has a similar syntax to ``mpirun``. When starting the job, ``baguarun`` will
+automatically spawn new processes on each of the training nodes provided by the ``--host_list`` option, where each node
+is described as an IP address followed by an SSH port.
+
+.. code-block:: bash
+
+    # Run on node1 (or node2) to start training on two nodes (node1 and node2), 8 GPUs per node
+    baguarun --host_list hostname1:ssh_port1,hostname2:ssh_port2 --nproc_per_node=8 --master_port=port1 train.py
+
+
+.. note:: You can also start training in the same way as Distributed Data Parallel. However, system optimizations like
+    `Bagua-Net <https://tutorials.baguasys.com/more-optimizations/bagua-net>`_ and
+    `Performance autotuning <https://tutorials.baguasys.com/performance-autotuning/>`_ can only be enabled through the Bagua
+    launcher. It is worth noting that with ``Bagua-Net``, Distributed Data Parallel can also achieve
+    better performance without modifying the training script.
+
+
+See `Bagua Tutorials <https://tutorials.baguasys.com/>`_ for more details on installation and advanced features.
+

 DP/DDP2 caveats
 ^^^^^^^^^^^^^^^
 In DP and DDP2 each GPU within a machine sees a portion of a batch.
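
The documentation added above assumes Bagua is installed. As a small usage sketch (not part of the commit), a script can fall back to the stock DDP strategy when the optional ``bagua`` package is missing:

.. code-block:: python

    # Minimal sketch: pick the Bagua strategy only when the optional
    # `bagua` package is importable, otherwise fall back to plain DDP.
    import importlib.util

    from pytorch_lightning import Trainer

    strategy = "bagua" if importlib.util.find_spec("bagua") else "ddp"
    trainer = Trainer(strategy=strategy, accelerator="gpu", devices=4)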

docs/source/advanced/profiler.rst

Lines changed: 2 additions & 2 deletions
@@ -18,12 +18,12 @@ PyTorch Lightning supports profiling standard actions in the training loop out of the box
 
 - on_epoch_start
 - on_epoch_end
-- on_batch_start
+- on_train_batch_start
 - model_forward
 - model_backward
 - on_after_backward
 - optimizer_step
-- on_batch_end
+- on_train_batch_end
 - training_step_end
 - on_training_end
 - etc...
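
As a brief usage sketch (not part of the commit), the renamed actions appear in the profiler summary once profiling is enabled on the Trainer:

.. code-block:: python

    # Minimal sketch: enable the built-in SimpleProfiler so the renamed
    # `on_train_batch_start` / `on_train_batch_end` actions show up in the
    # summary printed after `trainer.fit(...)`.
    from pytorch_lightning import Trainer

    trainer = Trainer(profiler="simple", max_epochs=1)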

docs/source/api_references.rst

Lines changed: 1 addition & 0 deletions
@@ -43,6 +43,7 @@ Strategy API
     :nosignatures:
     :template: classtemplate.rst
 
+    BaguaStrategy
     DDP2Strategy
     DDPFullyShardedStrategy
     DDPShardedStrategy