
Commit b96f570: Resolve merge conflicts
2 parents: d54ea35 + 1026ceb

File tree: 71 files changed, +1281 / -942 lines


.github/workflows/ci_test-conda.yml

Lines changed: 3 additions & 2 deletions

@@ -53,8 +53,9 @@ jobs:
       - name: Upload pytest results
         uses: actions/upload-artifact@v2
         with:
-          name: pytest-results-${{ runner.os }}-${{ matrix.python-version }}-${{ matrix.requires }}
-          path: junit/test-results-${{ runner.os }}-${{ matrix.python-version }}-${{ matrix.requires }}.xml
+          name: pytest-results-${{ runner.os }}-torch${{ matrix.pytorch-version }}
+          path: junit/test-results-${{ runner.os }}-torch${{ matrix.pytorch-version }}.xml
+          if-no-files-found: error
         if: failure()

       - name: Statistics

.github/workflows/ci_test-full.yml

Lines changed: 3 additions & 2 deletions

@@ -147,8 +147,9 @@ jobs:
       - name: Upload pytest results
         uses: actions/upload-artifact@v2
         with:
-          name: pytest-results-${{ runner.os }}-${{ matrix.python-version }}-${{ matrix.requires }}-${{ matrix.release }}
-          path: junit/test-results-${{ runner.os }}-${{ matrix.python-version }}-${{ matrix.requires }}-${{ matrix.release }}.xml
+          name: pytest-results-${{ runner.os }}-py${{ matrix.python-version }}-${{ matrix.requires }}-${{ matrix.release }}
+          path: junit/test-results-${{ runner.os }}-py${{ matrix.python-version }}-${{ matrix.requires }}-${{ matrix.release }}.xml
+          if-no-files-found: error
         if: failure()

       - name: Statistics

.github/workflows/ci_test-slow.yml

Lines changed: 4 additions & 3 deletions

@@ -57,15 +57,16 @@ jobs:

       - name: Tests
         run: |
-          coverage run --source pytorch_lightning -m pytest tests -v --junitxml=junit/test-results-${{ runner.os }}-${{ matrix.python-version }}.xml
+          coverage run --source pytorch_lightning -m pytest tests -v --junitxml=junit/test-results-${{ runner.os }}-py${{ matrix.python-version }}.xml
         env:
           PL_RUN_SLOW_TESTS: 1

       - name: Upload pytest test results
         uses: actions/upload-artifact@v2
         with:
-          name: pytest-results-${{ runner.os }}-${{ matrix.python-version }}
-          path: junit/test-results-${{ runner.os }}-${{ matrix.python-version }}.xml
+          name: pytest-results-${{ runner.os }}-py${{ matrix.python-version }}
+          path: junit/test-results-${{ runner.os }}-py${{ matrix.python-version }}.xml
+          if-no-files-found: error
         if: failure()

       - name: Statistics
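The three workflow diffs above rename the uploaded artifacts using GitHub Actions expression templates such as `pytest-results-${{ runner.os }}-py${{ matrix.python-version }}`. As a rough illustration of how such a template expands (plain Python standing in for the Actions expression engine; the matrix values below are hypothetical examples, not taken from this commit):

```python
# Illustration only: GitHub Actions, not Python, performs this substitution
# at run time. The context values below are made-up examples.
def expand(template: str, context: dict) -> str:
    """Replace each `${{ key }}` placeholder with its value from `context`."""
    for key, value in context.items():
        template = template.replace("${{ " + key + " }}", value)
    return template

name = expand(
    "pytest-results-${{ runner.os }}-py${{ matrix.python-version }}",
    {"runner.os": "Linux", "matrix.python-version": "3.9"},
)
print(name)  # pytest-results-Linux-py3.9
```

With the old template, two matrix entries differing only in PyTorch version would have produced colliding artifact names; including the version in the name keeps each upload distinct, and `if-no-files-found: error` makes a missing results file fail loudly instead of silently uploading nothing.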

CHANGELOG.md

Lines changed: 34 additions & 1 deletion

@@ -9,7 +9,10 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

 ### Added

-- Add new `DETAIL` log level to provide useful logs for improving monitoring and debugging of batch jobs
+- Enable gradient accumulation using Horovod's `backward_passes_per_step` ([#11911](https://github.com/PyTorchLightning/pytorch-lightning/pull/11911))
+
+
+- Add new `DETAIL` log level to provide useful logs for improving monitoring and debugging of batch jobs ([#11008](https://github.com/PyTorchLightning/pytorch-lightning/pull/11008))


 - Added a flag `SLURMEnvironment(auto_requeue=True|False)` to control whether Lightning handles the requeuing ([#10601](https://github.com/PyTorchLightning/pytorch-lightning/pull/10601))

@@ -33,6 +36,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added a function to validate if fault tolerant training is supported. ([#10465](https://github.com/PyTorchLightning/pytorch-lightning/pull/10465))


+- Added a private callback to manage the creation and deletion of fault-tolerance checkpoints ([#11862](https://github.com/PyTorchLightning/pytorch-lightning/pull/11862))
+
+
 - Show a better error message when a custom `DataLoader` implementation is not well implemented and we need to reconstruct it ([#10719](https://github.com/PyTorchLightning/pytorch-lightning/pull/10719))


@@ -66,6 +72,8 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added a `LOGGER_REGISTRY` instance to register custom loggers to the `LightningCLI` ([#11533](https://github.com/PyTorchLightning/pytorch-lightning/pull/11533))


+- Added info message when the `Trainer` arguments `limit_*_batches`, `overfit_batches`, or `val_check_interval` are set to `1` or `1.0` ([#11950](https://github.com/PyTorchLightning/pytorch-lightning/pull/11950))
+
 - Added a `PrecisionPlugin.teardown` method ([#10990](https://github.com/PyTorchLightning/pytorch-lightning/pull/10990))


@@ -117,9 +125,13 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added `Accelerator.is_available` to check device availability ([#11797](https://github.com/PyTorchLightning/pytorch-lightning/pull/11797))


+- Enabled static type-checking on the signature of `Trainer` ([#11888](https://github.com/PyTorchLightning/pytorch-lightning/pull/11888))
+
+
 - Added utility functions for moving optimizers to devices ([#11758](https://github.com/PyTorchLightning/pytorch-lightning/pull/11758))


+
 ### Changed

 - Implemented a new native and rich format in `_print_results` method of the `EvaluationLoop` ([#11332](https://github.com/PyTorchLightning/pytorch-lightning/pull/11332))

@@ -296,6 +308,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

 - Changed default logger name to `lightning_logs` for consistency ([#11762](https://github.com/PyTorchLightning/pytorch-lightning/pull/11762))

+
+- Rewrote `accelerator_connector` ([#11448](https://github.com/PyTorchLightning/pytorch-lightning/pull/11448))
+
 ### Deprecated

 - Deprecated `training_type_plugin` property in favor of `strategy` in `Trainer` and updated the references ([#11141](https://github.com/PyTorchLightning/pytorch-lightning/pull/11141))

@@ -400,6 +415,12 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Deprecated `pytorch_lightning.utilities.warnings.LightningDeprecationWarning` in favor of `pytorch_lightning.utilities.rank_zero.LightningDeprecationWarning`


+- Deprecated `agg_key_funcs` and `agg_default_func` parameters from `LightningLoggerBase` ([#11871](https://github.com/PyTorchLightning/pytorch-lightning/pull/11871))
+
+
+- Deprecated `LightningLoggerBase.update_agg_funcs` ([#11871](https://github.com/PyTorchLightning/pytorch-lightning/pull/11871))
+
+
 - Deprecated `LightningLoggerBase.agg_and_log_metrics` in favor of `LightningLoggerBase.log_metrics` ([#11832](https://github.com/PyTorchLightning/pytorch-lightning/pull/11832))


@@ -553,6 +574,12 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Removed `log_text` and `log_image` from the `LightningLoggerBase` API ([#11857](https://github.com/PyTorchLightning/pytorch-lightning/pull/11857))


+- Removed calls to `profile("model_forward")` in favor of profiling `training_step` ([#12032](https://github.com/PyTorchLightning/pytorch-lightning/pull/12032))
+
+
+- Removed `get_mp_spawn_kwargs` from `DDPSpawnStrategy` and `TPUSpawnStrategy` in favor of configuration in the `_SpawnLauncher` ([#11966](https://github.com/PyTorchLightning/pytorch-lightning/pull/11966))
+
+
 ### Fixed

 - Fixed an issue where `HorovodStrategy.teardown()` did not complete gracefully if an exception was thrown during callback setup [#11752](https://github.com/PyTorchLightning/pytorch-lightning/pull/11752)

@@ -605,6 +632,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Configure native Deepspeed schedulers with interval='step' ([#11788](https://github.com/PyTorchLightning/pytorch-lightning/pull/11788))


+- Update `RichProgressBarTheme` styles after detecting light theme on colab ([#10993](https://github.com/PyTorchLightning/pytorch-lightning/pull/10993))
+
+
 ## [1.5.10] - 2022-02-08

 ### Fixed

@@ -641,6 +671,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Disabled sampler replacement when using `IterableDataset` ([#11507](https://github.com/PyTorchLightning/pytorch-lightning/pull/11507))


+- Disable loading dataloaders if the corresponding `limit_batches=0` ([#11576](https://github.com/PyTorchLightning/pytorch-lightning/pull/11576))
+
+
 ## [1.5.8] - 2022-01-05

 ### Fixed
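The CHANGELOG's Horovod entry enables gradient accumulation through `backward_passes_per_step`. The underlying idea, sketched framework-free (the function name and numbers below are illustrative, not Lightning's or Horovod's API):

```python
# Framework-free sketch of gradient accumulation: sum the gradients of
# several backward passes, then apply a single optimizer step.
def accumulate_steps(batch_grads, passes_per_step):
    """Return the accumulated gradient applied at each optimizer step."""
    steps = []
    running = 0
    for i, grad in enumerate(batch_grads, start=1):
        running += grad  # backward pass: accumulate, no parameter update yet
        if i % passes_per_step == 0:
            steps.append(running)  # optimizer step uses the summed gradient
            running = 0
    return steps

print(accumulate_steps([1, 2, 3, 4], passes_per_step=2))  # [3, 7]
```

This is why accumulation raises the effective batch size without raising memory use: only one batch's activations are live at a time, while the summed gradient stands in for a larger batch.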

docs/source/accelerators/gpu.rst

Lines changed: 1 addition & 1 deletion

@@ -506,7 +506,7 @@ but Bagua can usually produce a higher training throughput due to its backend wr

 .. code-block:: python

-        # train on 2 GPUs (using Bagua mode)
+        # train on 4 GPUs (using Bagua mode)
         trainer = Trainer(strategy="bagua", accelerator="gpu", devices=4)


docs/source/advanced/profiler.rst

Lines changed: 44 additions & 2 deletions

@@ -19,7 +19,6 @@ PyTorch Lightning supports profiling standard actions in the training loop out o
 - on_train_epoch_start
 - on_train_epoch_end
 - on_train_batch_start
-- model_forward
 - model_backward
 - on_after_backward
 - optimizer_step

@@ -66,7 +65,6 @@ The profiler's results will be printed at the completion of a training ``trainer
 | run_training_epoch                          | 6.1558     | 6.1558     |
 | run_training_batch                          | 0.0022506  | 0.015754   |
 | [LightningModule]BoringModel.optimizer_step | 0.0017477  | 0.012234   |
-| model_forward                               | 0.00055868 | 0.0039108  |
 | [LightningModule]BoringModel.val_dataloader | 0.00024388 | 0.00024388 |
 | on_train_batch_start                        | 0.00014637 | 0.0010246  |
 | [LightningModule]BoringModel.teardown       | 2.15e-06   | 2.15e-06   |

@@ -210,6 +208,50 @@ To visualize the profiled operation, you can either:
     python -c 'import torch; print(torch.autograd.profiler.load_nvprof("trace_name.prof"))'

+XLA Profiler
+============
+
+:class:`~pytorch_lightning.profiler.xla.XLAProfiler` will help you debug and optimize training
+workload performance for your models using Cloud TPU performance tools.
+
+.. code-block:: python
+
+    # by passing the `XLAProfiler` alias
+    trainer = Trainer(..., profiler="xla")
+
+    # or by passing an instance
+    from pytorch_lightning.profiler import XLAProfiler
+
+    profiler = XLAProfiler(port=9001)
+    trainer = Trainer(..., profiler=profiler)
+
+Manual Capture via TensorBoard
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The following instructions are for capturing traces from a running program:
+
+0. This `guide <https://cloud.google.com/tpu/docs/pytorch-xla-performance-profiling-tpu-vm#tpu-vm>`_ will
+   help you set up a Cloud TPU with the required installations.
+
+1. Start a `TensorBoard <https://www.tensorflow.org/tensorboard>`_ server. You can view the TensorBoard
+   output at ``http://localhost:9001`` on your local machine, and then open the ``PROFILE`` plugin from
+   the top right dropdown or open ``http://localhost:9001/#profile``:
+
+   .. code-block:: bash
+
+       tensorboard --logdir ./tensorboard --port 9001
+
+2. Once the code you'd like to profile is running, click the ``CAPTURE PROFILE`` button. Enter
+   ``localhost:9001`` (the XLA Profiler's default port) as the Profile Service URL. Then enter
+   the number of milliseconds for the profiling duration, and click ``CAPTURE``.
+
+3. Make sure the code keeps running while you capture the traces. You will get better
+   performance insights if the profiling duration is longer than the step time.
+
+4. Once the capture is finished, the page will refresh and you can browse through the insights using
+   the ``Tools`` dropdown at the top left.
+
 ----------------

 ****************

docs/source/common/trainer.rst

Lines changed: 2 additions & 2 deletions

@@ -1544,8 +1544,8 @@ val_check_interval
 How often within one training epoch to check the validation set.
 Can specify as float or int.

-- use (float) to check within a training epoch
-- use (int) to check every n steps (batches)
+- pass a ``float`` in the range [0.0, 1.0] to check after that fraction of the training epoch.
+- pass an ``int`` to check after a fixed number of training batches.

 .. testcode::
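The float/int semantics in the rewritten bullets can be sketched as follows. This is a simplified illustration, not Lightning's actual resolution logic, and the helper name is made up:

```python
# Simplified sketch of how a `val_check_interval` value could resolve to a
# number of training batches between validation checks (not Lightning's code).
def resolve_val_check_interval(val_check_interval, batches_per_epoch):
    if isinstance(val_check_interval, float):
        if not 0.0 <= val_check_interval <= 1.0:
            raise ValueError("float val_check_interval must lie in [0.0, 1.0]")
        # a fraction of the epoch -> that many batches between checks
        return max(1, int(batches_per_epoch * val_check_interval))
    # an int is already a fixed number of training batches between checks
    return val_check_interval

print(resolve_val_check_interval(0.25, 100))  # 25
print(resolve_val_check_interval(50, 100))    # 50
```

So with 100 batches per epoch, `val_check_interval=0.25` validates four times per epoch, while `val_check_interval=50` validates every 50 batches regardless of epoch length.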
