Conversation

@awaelchli
Contributor

@awaelchli awaelchli commented Feb 11, 2021

What does this PR do?

Solves a TODO for decoupling the TorchElasticEnvironment from the Lightning DDP "environment" (this TODO was present even before the major accelerator refactor; it was just in a different place).

Also adds more tests (100% coverage for all cluster environments)

Follow up to #5743

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Check that target branch and milestone match!

Did you have fun?

Make sure you had fun coding 🙃

@awaelchli awaelchli mentioned this pull request Feb 12, 2021
Base automatically changed from accelerator-refactor-sharded to master February 12, 2021 20:48
@awaelchli awaelchli force-pushed the refactor/cluster-spawn branch from e9c2459 to f9cff77 on February 15, 2021 00:46
@codecov

codecov bot commented Feb 15, 2021

Codecov Report

Merging #5915 (03bcc33) into master (248a8e8) will increase coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #5915   +/-   ##
======================================
  Coverage      93%     93%           
======================================
  Files         160     161    +1     
  Lines       11415   11456   +41     
======================================
+ Hits        10644   10691   +47     
+ Misses        771     765    -6     

@awaelchli awaelchli added this to the 1.3 milestone Feb 15, 2021
@awaelchli awaelchli changed the title refactor - default cluster environment for lightning specific ddp introduce default cluster environment for lightning specific ddp Feb 19, 2021
@awaelchli awaelchli changed the title introduce default cluster environment for lightning specific ddp introduce default cluster environment for lightning-specific ddp Feb 19, 2021
@awaelchli awaelchli marked this pull request as ready for review February 21, 2021 22:14
@awaelchli awaelchli added distributed Generic distributed-related topic environment: slurm labels Feb 21, 2021
from pytorch_lightning.utilities.distributed import find_free_network_port


class DefaultEnvironment(ClusterEnvironment):
Contributor

Why not let ClusterEnvironment have these default implementations instead of separating the interface ClusterEnvironment from DefaultEnvironment?

Contributor Author

Yes, good question. I thought about it too. The idea was to have the abstract class as an interface: by marking methods as abstract, the user knows exactly what to implement. Taking this into consideration, would you still recommend merging them together?

Contributor
@carmocca carmocca Feb 22, 2021

Notice how master_address() and master_port() are duplicated in both DefaultEnvironment and TorchElasticEnvironment, because each implements (subclasses) the interface directly.

I'm not an expert on ABC tricks. Is there any way we can keep ClusterEnvironment as the abstract interface AND make DefaultEnvironment the default implementation? That would be ideal.

I guess the simplest way to solve it is to merge the interface and the default implementation.
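
For context, Python ABCs do allow exactly that split: ClusterEnvironment stays the abstract interface, and DefaultEnvironment is just a concrete subclass providing defaults that other environments may extend. A minimal sketch (the method bodies and localhost defaults here are illustrative, not the actual Lightning code):

```python
import os
from abc import ABC, abstractmethod


class ClusterEnvironment(ABC):
    """Pure interface: the abstract methods spell out what implementers must provide."""

    @abstractmethod
    def master_address(self) -> str:
        ...

    @abstractmethod
    def master_port(self) -> int:
        ...


class DefaultEnvironment(ClusterEnvironment):
    """Concrete default implementation satisfying the interface."""

    def master_address(self) -> str:
        return os.environ.get("MASTER_ADDR", "127.0.0.1")

    def master_port(self) -> int:
        return int(os.environ.get("MASTER_PORT", 12910))


class TorchElasticEnvironment(DefaultEnvironment):
    """Overrides only what differs; the rest falls back to the defaults above."""

    def master_address(self) -> str:
        return os.environ["MASTER_ADDR"]
```

isinstance checks against ClusterEnvironment keep working for all of these and the duplicated method bodies disappear; the trade-off is that TorchElasticEnvironment then inherits behavior rather than just an interface.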

Contributor Author
@awaelchli awaelchli Feb 22, 2021

They are not exactly duplicates, but I know what you mean.
Maybe the name DefaultEnvironment is not a good choice.
Essentially, there is a key difference between the "DefaultEnvironment" and the others we currently have: the way processes are spawned (controlled by the spawns_children bool method). Making this the superclass from which all the others derive would be suboptimal. Perhaps it should be called LightningEnvironment or something like that?

Contributor

BasicEnvironment?

Contributor

Looking into it further, I'm not so sure anymore. I guess it depends on whether users are encouraged (or only allowed) to inherit from cluster or default/base.

Contributor
@tchaton tchaton Feb 22, 2021

As we are using localhost 127.0.0.1, we could call it LocalhostEnvironment, which might be more appropriate.

Contributor
@ajtritt ajtritt Feb 24, 2021

The typical naming convention here would be to add Abstract to the abstract base class. In this case, you'd change ClusterEnvironment to AbstractClusterEnvironment. This quite explicitly tells users that they cannot instantiate it directly. The basic instantiable subclass would then be called ClusterEnvironment. Anyone looking to provide a custom environment would then extend AbstractClusterEnvironment.

Contributor

Regarding the duplicate code in DefaultEnvironment and TorchElasticEnvironment.... one option would be to make a LocalhostEnvironment class, which extends the abstract ClusterEnvironment and only implements master_port and master_address to default to 127.0.0.1. DefaultEnvironment and TorchElasticEnvironment would then extend LocalhostEnvironment.
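
A rough sketch of that factoring (purely illustrative; the class layout follows the suggestion above, not the code that was eventually merged):

```python
import os
from abc import ABC, abstractmethod


class ClusterEnvironment(ABC):
    """Abstract interface (abridged)."""

    @abstractmethod
    def master_address(self) -> str:
        ...

    @abstractmethod
    def master_port(self) -> int:
        ...

    @abstractmethod
    def spawns_children(self) -> bool:
        ...


class LocalhostEnvironment(ClusterEnvironment):
    """Holds only the shared localhost defaults; still abstract for the rest."""

    def master_address(self) -> str:
        return os.environ.get("MASTER_ADDR", "127.0.0.1")

    def master_port(self) -> int:
        # a real implementation could reuse find_free_network_port() here
        return int(os.environ.get("MASTER_PORT", 12910))


class DefaultEnvironment(LocalhostEnvironment):
    def spawns_children(self) -> bool:
        return False  # Lightning launches the worker processes itself


class TorchElasticEnvironment(LocalhostEnvironment):
    def spawns_children(self) -> bool:
        return True  # the cluster already launched one process per device
```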

Contributor Author
@awaelchli awaelchli Mar 1, 2021

I renamed DefaultEnvironment to LightningEnvironment for now. I want to clear up some potential points of confusion: the previous name was a poor choice, I admit, because it suggests that DefaultEnvironment holds the default implementation and that others should inherit from it. That is the opposite of what I wanted to achieve. The goal here is to provide a specification, hence the abstract class.
I do not plan to provide a base class that implements all methods and that all environments inherit from. We need to hold back here and not abuse inheritance for the sole purpose of "code reuse".
I'll see what I can do about the "duplicated code" for master address and master port.

@carmocca carmocca mentioned this pull request Feb 22, 2021
@Borda
Collaborator

Borda commented Mar 1, 2021

@awaelchli any update here? ;]

@mergify mergify bot removed the has conflicts label Mar 2, 2021
@mergify mergify bot added the has conflicts label Mar 2, 2021
Contributor
@carmocca carmocca left a comment

> I'll see what I can do about the "duplicated code" for master address and master port.

Do you plan to do that here?

@mergify mergify bot removed the has conflicts label Mar 2, 2021
@awaelchli
Contributor Author

@carmocca What would be your advice? The motivation of this PR was to disentangle the TorchElasticEnvironment from Lightning's own mechanism. I'd be happy to follow up with smaller refactors after discussion. I'd also like to unblock @ajtritt for their LSF addition.

@carmocca
Contributor

carmocca commented Mar 2, 2021

> I'd be happy to follow up with smaller refactors after discussion. I'd also like to unblock @ajtritt for their LSF addition.

Sure! No problem. My only advice would be to open an issue if you don't plan to do it soon™️ so we don't forget

@carmocca carmocca added the ready PRs ready to be merged label Mar 2, 2021
@awaelchli
Contributor Author

Great. I made a bullet-point list of potential improvements here: #6303 :)

Comment on lines 22 to 23
def spawns_children(self) -> bool:
""" Whether the environment spawns the subprocesses or not. """
Member

Does this control whether children will be spawned at all, or whether this is done by the environment or the plugin?

Nit: Maybe spawn is not the best term here, since we don't always spawn processes, but sometimes also fork them. Maybe create would be better?

Contributor Author

> Does this control whether children will be spawned at all, or whether this is done by the environment or the plugin?

The latter: whether this is done by the environment or the plugin.
If spawns_children returns False, the DDPPlugin will launch the subprocesses, one for each GPU. Otherwise we assume the process we are in is already a worker process.

I took your suggestion and renamed it to creates_children, that's better.
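
A minimal sketch of that branching (the function and variable names here are made up for illustration and are not the actual DDPPlugin internals):

```python
import os
import subprocess
import sys


def maybe_launch_workers(env, num_devices: int) -> None:
    """Launch one worker process per device unless the environment already did."""
    if env.creates_children():
        # torchelastic/SLURM-style environment: the cluster launcher already
        # started one process per device, so this process is itself a worker.
        return

    # Lightning-style environment: this process acts as the launcher and
    # re-executes the current script once per additional local device.
    for local_rank in range(1, num_devices):
        worker_env = {**os.environ, "LOCAL_RANK": str(local_rank)}
        subprocess.Popen([sys.executable] + sys.argv, env=worker_env)
```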

def __init__(self):
    self._world_size = None

@abstractmethod
def spawns_children(self) -> bool:
Member

Should all of those be properties instead of functions?

Contributor Author
@awaelchli awaelchli Mar 3, 2021

Could be, not sure; I added your suggestion to #6303.
For this PR I just followed the design that was already there.
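
For comparison, the property-based variant now tracked in #6303 could look roughly like this (PropertyStyleEnvironment is a made-up name, used only to show the call-site difference):

```python
from abc import ABC, abstractmethod


class ClusterEnvironment(ABC):
    # current style: plain methods, used as env.creates_children()
    @abstractmethod
    def creates_children(self) -> bool:
        ...


class PropertyStyleEnvironment(ABC):
    # suggested alternative: read-only properties, used as env.creates_children
    @property
    @abstractmethod
    def creates_children(self) -> bool:
        ...
```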

@mergify mergify bot removed the has conflicts label Mar 3, 2021
@Borda Borda requested review from s-rog and tchaton March 4, 2021 18:59
@Borda Borda enabled auto-merge (squash) March 4, 2021 19:00
@Borda
Collaborator

Borda commented Mar 4, 2021

@awaelchli not sure why there are 312 commits, mind rebasing so it's clear what the change is?

@awaelchli awaelchli force-pushed the refactor/cluster-spawn branch from 13fa3c5 to 6c9b21e on March 4, 2021 19:12
@awaelchli
Contributor Author

@Borda that was from the accelerator refactor.
Just one commit now.

@mergify mergify bot added the has conflicts label Mar 4, 2021
Contributor
@s-rog s-rog left a comment

LGTM, much clearer than before!

@mergify mergify bot removed the has conflicts label Mar 5, 2021
@Borda Borda merged commit ec8d46e into master Mar 5, 2021
@Borda Borda deleted the refactor/cluster-spawn branch March 5, 2021 01:47
@justusschock justusschock mentioned this pull request Mar 8, 2021
@tchaton tchaton mentioned this pull request Mar 8, 2021
awaelchli added a commit to awaelchli/pytorch-lightning that referenced this pull request Apr 13, 2021
…htning-AI#5915)

* handle distributed_sampler_kwargs

* move emptying cache to accelertor

* fix a few tests

* restoring the result from subprocess

* fix queue.get() order for results

* add missing "block_backward_sync" context manager

* add missing "block_backward_sync" context manager

* fix sync_batchnorm

* fix supported gpu-ids for tuple

* fix clip gradients and inf recursion

* accelerator selection: added cluster_environment plugin

* fix torchelastic test

* fix reduce early stopping decision for DDP

* fix tests: callbacks, conversion to lightning optimizer

* fix lightning optimizer does not pickle

* fix setting benchmark and deterministic option

* fix slurm amp test

* fix prepare_data test and determine node_rank

* fix retrieving last path when testing

* remove obsolete plugin argument

* fix test: test_trainer_config

* fix torchscript tests

* fix trainer.model access

* move properties

* fix test_transfer_batch_hook

* fix auto_select_gpus

* fix omegaconf test

* fix test that needs to simulate slurm ddp

* add horovod plugin

* fix test with named arguments

* clean up whitespace

* fix datamodules test

* remove old accelerators

* fix naming

* move old plugins

* move to plugins

* create precision subpackage

* create training_type subpackage

* fix all new import errors

* fix wrong arguments order passed to test

* fix LR finder

* Added sharded training type and amp plugin

* Move clip grad to precision plugin

* Added sharded spawn, select accelerators based on distributed_backend + enable custom fp16 plugin automatically

* Fix import issue, attempting to fix tests

* Fix initial test

* Reflect hook logic from master, should wrap model after move to device

* Optional state consolidation, since master has optimizers not wrapped

* change attribute for instance test

* reset optimizers

optimizers are not used in main process, so state would be wrong.

* legacy

* imports in accel

* legacy2

* trainer imports

* fix import errors after rebase

* move hook to new setup location

* provide unwrapping logic

* fix trainer callback system

* added ddp2 implementation

* fix imports .legacy

* move plugins

* restore legacy

* drop test.py from root

* add tpu accelerator and plugins

* fixes

* fix lightning optimizer merge

* reset bugreportmodel

* unwrapping

* step routing forward

* model access

* unwrap

* opt

* integrate distrib_type

* sync changes

* sync

* fixes

* add forgotten generators

* add missing logic

* update

* import

* missed imports

* import fixes

* isort

* mv f

* changelog

* format

* move helper to parallel plugin

* d

* add world size

* clean up

* duplicate

* activate ddp_sharded and tpu

* set nvidia flags

* remove unused colab var

* use_tpu <-> on_tpu attrs

* make some ddp_cpu and clusterplugin tests pass

* Ref/accelerator connector (Lightning-AI#5742)

* final cleanup

Co-authored-by: Adrian Wälchli <[email protected]>

* connector cleanup

Co-authored-by: Adrian Wälchli <[email protected]>

* trainer cleanup

Co-authored-by: Adrian Wälchli <[email protected]>

* accelerator cleanup + missing logic in accelerator connector

Co-authored-by: Adrian Wälchli <[email protected]>

* add missing changes to callbacks

Co-authored-by: Adrian Wälchli <[email protected]>

* reflect accelerator changes to lightning module

Co-authored-by: Adrian Wälchli <[email protected]>

* clean cluster envs

Co-authored-by: Adrian Wälchli <[email protected]>

* cleanup plugins

Co-authored-by: Adrian Wälchli <[email protected]>

* add broadcasting

Co-authored-by: Adrian Wälchli <[email protected]>

* yapf

* remove plugin connector

Co-authored-by: Adrian Wälchli <[email protected]>

* plugins

* manual optimization

* update optimizer routing

* add rank to torchelastic

* fix memory mixed precision

* setstate on trainer for pickling in ddp spawn

* add predict method

* add back commented accelerator code

* adapt test for sync_batch_norm to new plugin

* fix deprecated tests

* fix ddp cpu choice when no num_processes are given

* yapf format

* skip a memory test that cannot pass anymore

* fix pickle error in spawn plugin

* x

* avoid

* x

* fix cyclic import in docs build

* add support for sharded

* update typing

* add sharded and sharded_spawn to distributed types

* make unwrap model default

* refactor LightningShardedDataParallel similar to LightningDistributedDataParallel

* update sharded spawn to reflect changes

* update sharded to reflect changes

* Merge 1.1.5 changes

* fix merge

* fix merge

* yapf isort

* fix merge

* yapf isort

* fix indentation in test

* copy over reinit scheduler implementation from dev1.2

* fix apex tracking calls with dev_debugger

* reduce diff to dev1.2, clean up

* fix trainer config test  when gpus>0 and num_processes >0 and ddp_cpu

* sort plugin tests legacy/new

* fix error handling for amp on cpu

* fix merge

fix merge

fix merge

* [Feat] Resolve manual_backward (Lightning-AI#5837)

* resolve manual_backward

* resolve flake8

* update

* resolve for ddp_spawn

* resolve flake8

* resolve flake8

* resolve flake8

Co-authored-by: Ubuntu <[email protected]>

* fix tests/accelerator tests on cpu

* [BugFix] Resolve manual optimization (Lightning-AI#5852)

* resolve manual_optimization

* update

* update

Co-authored-by: Ubuntu <[email protected]>

* Remove copy trainer parameters to happen earlier within the loop and add safe guard to get ref model (Lightning-AI#5856)

* resovle a bug

* Accelerator refactor sharded rpc (Lightning-AI#5854)

* rpc branch

* merge

* update handling of rpc

* make devices etc. Optional in RPC

* set devices etc. later if necessary

* remove devices from sequential

* make devices optional in rpc

* fix import

* uncomment everything

* fix cluster selection

Co-authored-by: Ubuntu <[email protected]>

* resolve bug

* fix assert in rpc test

* resolve a test

* fix docs compilation

* accelerator refactor - fix for sharded parity test (Lightning-AI#5866)

* fix memory issue with ddp_spawn

* x

x

x

x

x

x

x

x

x

* x

* Remove DDP2 as this does not apply

* Add missing pre optimizer hook to ensure lambda closure is called

* fix apex docstring

* [accelerator][BugFix] Resolve some test for 1 gpu (Lightning-AI#5863)

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* update

* update

* revert init

* resolve a bug

* update

* resolve flake8

* update

* update

* update

* revert init

* update

* resolve flake8

* update

* update

* update

* update

* update

* all_gather

* update

* make plugins work, add misconfig for RPC

* update

* update

* remove breaking test

* resolve some tests

* resolve flake8

* revert to ddp_spawn

Co-authored-by: root <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Justus Schock <[email protected]>

* yapf isort

* resolve flake8

* fix apex doctests

* fix apex doctests 2

* resolve docs

* update drone

* clean env

* update

* update

* update

* update

* merge

* Fix RPC related tests, clean out old API, update for new accelerator API [skip ci] (Lightning-AI#5881)

* Fix RPC related tests, clean out old API, update for new accelerator API

* Move tests out of legacy folder, update paths and names

* Update test_remove_1-4.py

* Expose properties for tpu cores/gpus/num_gpus

* Add root GPU property

* Move properties to properties.py

* move tests that were previously in drone

* Fix root GPU property (Lightning-AI#5908)

* Move root GPU to property, remove horovod set as this is handled in horovod plugin, ensure we mock correctly to set GPU accelerator

* Add missing tests back

* fix best model path transfer when no checkpoint callback available

* Fix setup hook order [wip] (Lightning-AI#5858)

* Call trainer setup hook before accelerator setup

* Add test case

* add new test

* typo

* fix callback order in test

Co-authored-by: tchaton <[email protected]>
Co-authored-by: Adrian Wälchli <[email protected]>

* rename ddp sequential -> rpc sequential for special test

* revert

* fix stupid merge problem

* abstract the cluster plugins

* default plugin

* integrate default environment

* fix property

* adapt tests

* adjust test

* fix world size access

* base cluster env

* revert rebase errors

* revert rebase errors

* missing import

* revert unrelated change

* remove unused cluster local rank

* remove unrelated changes

* fix unrelated changes

* fix pep8

* remove unused var

* reset permissions

* ypaf

* test default environment

* test torchelastic environment

* world  size as int

* tests for slurm environment

* changelog

* test comments

* remove unintended change

* keep master port fixed after it is generated

* test random master port

* yapf

* add missing default environment

* move helper function

* rename default environment

* rename

* rename

* yapf

* Update pytorch_lightning/plugins/environments/lightning_environment.py

Co-authored-by: Carlos Mocholí <[email protected]>

* Update CHANGELOG.md

Co-authored-by: Justus Schock <[email protected]>

* spawn -> create

Co-authored-by: justusschock <[email protected]>
Co-authored-by: SeanNaren <[email protected]>
Co-authored-by: Justus Schock <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Justus Schock <[email protected]>
Co-authored-by: chaton <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Sean Naren <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Carlos Mocholí <[email protected]>
facebook-github-bot pushed a commit to facebookresearch/d2go that referenced this pull request Apr 14, 2021
…ter) to github/third-party/PyTorchLightning/pytorch-lightning

Summary:
### New commit log messages
## [UnReleased] - 2021-MM-DD

### Added

- Added more explicit exception message when trying to execute `trainer.test()` or `trainer.validate()` with `fast_dev_run=True` ([#6667](Lightning-AI/pytorch-lightning#6667))

- Added `LightningCLI` class to provide simple reproducibility with minimum boilerplate training cli. ([#4492](Lightning-AI/pytorch-lightning#4492))

- Trigger warning when non-metric logged value with multi processes hasn't been reduced ([#6417](Lightning-AI/pytorch-lightning#6417))

- Added `gradient_clip_algorithm` argument to Trainer for gradient clipping by value ([#6123](Lightning-AI/pytorch-lightning#6123)).

- Added a way to print to terminal without breaking up the progress bar ([#5470](Lightning-AI/pytorch-lightning#5470))

- Added support to checkpoint after training steps in `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146))

- Added `checkpoint` parameter to callback's `on_save_checkpoint` hook ([#6072](Lightning-AI/pytorch-lightning#6072))

- Added `RunningStage.SANITY_CHECKING` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Added `TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Added `Trainer.validate()` method to perform one evaluation epoch over the validation set ([#4948](Lightning-AI/pytorch-lightning#4948))

- Added `LightningEnvironment` for Lightning-specific DDP ([#5915](Lightning-AI/pytorch-lightning#5915))

- Added `teardown()` hook to LightningDataModule ([#4673](Lightning-AI/pytorch-lightning#4673))

- Added `auto_insert_metric_name` parameter to `ModelCheckpoint` ([#6277](Lightning-AI/pytorch-lightning#6277))

- Added arg to `self.log` that enables users to give custom names when dealing with multiple dataloaders ([#6274](Lightning-AI/pytorch-lightning#6274))

- Added `teardown` method to `BaseProfiler` to enable subclasses defining post-profiling steps outside of `__del__` ([#6370](Lightning-AI/pytorch-lightning#6370))

- Added `setup` method to `BaseProfiler` to enable subclasses defining pre-profiling steps for every process ([#6633](Lightning-AI/pytorch-lightning#6633))

- Added no return warning to predict ([#6139](Lightning-AI/pytorch-lightning#6139))

- Added `Trainer.predict` config validation ([#6543](Lightning-AI/pytorch-lightning#6543))

- Added `AbstractProfiler` interface ([#6621](Lightning-AI/pytorch-lightning#6621))

- Added support for including module names for forward in the autograd trace of `PyTorchProfiler` ([#6349](Lightning-AI/pytorch-lightning#6349))

- Added support for the PyTorch 1.8.1 autograd profiler ([#6618](Lightning-AI/pytorch-lightning#6618))

- Added `outputs` parameter to callback's `on_validation_epoch_end` & `on_test_epoch_end` hooks ([#6120](Lightning-AI/pytorch-lightning#6120))

- Added `configure_sharded_model` hook ([#6679](Lightning-AI/pytorch-lightning#6679))

- Added support for `precision=64`, enabling training with double precision ([#6595](Lightning-AI/pytorch-lightning#6595))

- Added support for DDP communication hooks ([#6736](Lightning-AI/pytorch-lightning#6736))

- Added `artifact_location` argument to `MLFlowLogger` which will be passed to the `MlflowClient.create_experiment` call ([#6677](Lightning-AI/pytorch-lightning#6677))

- Added `model` parameter to precision plugins' `clip_gradients` signature ([#6764](Lightning-AI/pytorch-lightning#6764))

### Changed

- Renamed `pytorch_lightning.callbacks.swa` to `pytorch_lightning.callbacks.stochastic_weight_avg` ([#6259](Lightning-AI/pytorch-lightning#6259))

- Refactor `RunningStage` and `TrainerState` usage ([#4945](Lightning-AI/pytorch-lightning#4945))

- Changed `trainer.evaluating` to return `True` if validating or testing ([#4945](Lightning-AI/pytorch-lightning#4945))

- Changed `setup()` and `teardown()` stage argument to take any of `{fit,validate,test,predict}` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Changed profilers to save separate report files per state and rank ([#6621](Lightning-AI/pytorch-lightning#6621))

- Changed `PyTorchProfiler` to use `torch.autograd.profiler.record_function` to record functions ([#6349](Lightning-AI/pytorch-lightning#6349))

### Deprecated

- `period` has been deprecated in favor of `every_n_val_epochs` in the `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146))

- Deprecated `trainer.running_sanity_check` in favor of `trainer.sanity_checking` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Deprecated `Profiler(output_filename)` in favor of `dirpath` and `filename` ([#6621](Lightning-AI/pytorch-lightning#6621))

- Deprecated `PytorchProfiler(profiled_functions)` in favor of `record_functions` ([#6349](Lightning-AI/pytorch-lightning#6349))

- Deprecated metrics in favor of `torchmetrics` ([#6505](Lightning-AI/pytorch-lightning#6505),
    [#6530](Lightning-AI/pytorch-lightning#6530),
    [#6540](Lightning-AI/pytorch-lightning#6540),
    [#6547](Lightning-AI/pytorch-lightning#6547),
    [#6515](Lightning-AI/pytorch-lightning#6515),
    [#6572](Lightning-AI/pytorch-lightning#6572),
    [#6573](Lightning-AI/pytorch-lightning#6573),
    [#6584](Lightning-AI/pytorch-lightning#6584),
    [#6636](Lightning-AI/pytorch-lightning#6636),
    [#6637](Lightning-AI/pytorch-lightning#6637),
    [#6649](Lightning-AI/pytorch-lightning#6649),
    [#6659](Lightning-AI/pytorch-lightning#6659),
)

### Removed

- Removed support for passing a bool value to `profiler` argument of Trainer ([#6164](Lightning-AI/pytorch-lightning#6164))

- Removed no return warning from val/test step ([#6139](Lightning-AI/pytorch-lightning#6139))

- Removed passing a `ModelCheckpoint` instance to `Trainer(checkpoint_callback)` ([#6166](Lightning-AI/pytorch-lightning#6166))

- Removed deprecated Trainer argument `enable_pl_optimizer` and `automatic_optimization` ([#6163](Lightning-AI/pytorch-lightning#6163))

- Removed deprecated metrics ([#6161](Lightning-AI/pytorch-lightning#6161))
    * from `pytorch_lightning.metrics.functional.classification` removed `to_onehot`, `to_categorical`, `get_num_classes`, `roc`, `multiclass_roc`, `average_precision`, `precision_recall_curve`, `multiclass_precision_recall_curve`
    * from `pytorch_lightning.metrics.functional.reduction` removed `reduce`, `class_reduce`

- Removed deprecated `ModelCheckpoint` arguments `prefix`, `mode="auto"` ([#6162](Lightning-AI/pytorch-lightning#6162))

- Removed `mode='auto'` from `EarlyStopping` ([#6167](Lightning-AI/pytorch-lightning#6167))

- Removed legacy references for magic keys in the `Result` object ([#6016](Lightning-AI/pytorch-lightning#6016))

- Removed deprecated `LightningModule` `hparams` setter ([#6207](Lightning-AI/pytorch-lightning#6207))

- Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the `"log"/"progress_bar"` magic keys. Use `self.log` instead ([#6734](Lightning-AI/pytorch-lightning#6734))

- Removed `optimizer_idx` argument from `training_step` in manual optimization ([#6093](Lightning-AI/pytorch-lightning#6093))

### Fixed

- Set better defaults for `rank_zero_only.rank` when training is launched with SLURM and torchelastic ([#6802](Lightning-AI/pytorch-lightning#6802))

- Made the `Plugin.reduce` method more consistent across all Plugins to reflect a mean-reduction by default ([#6011](Lightning-AI/pytorch-lightning#6011))

- Move lightning module to correct device type when using LightningDistributedWrapper ([#6070](Lightning-AI/pytorch-lightning#6070))

- Do not print top-k verbose log with `ModelCheckpoint(monitor=None)` ([#6109](Lightning-AI/pytorch-lightning#6109))

- Fixed csv extension check ([#6436](Lightning-AI/pytorch-lightning#6436))

- Fixed `ModelCheckpoint(monitor=None, save_last=True)` not saving checkpoints ([#6136](Lightning-AI/pytorch-lightning#6136))

- Fixed `ModelCheckpoint(save_top_k=0, save_last=True)` not saving the `last` checkpoint ([#6136](Lightning-AI/pytorch-lightning#6136))

- Fixed `.teardown(stage='fit')` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Fixed `.on_fit_{start,end}()` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Fixed LightningModule `all_gather` on cpu tensors ([#6416](Lightning-AI/pytorch-lightning#6416))

- Fixed torch distributed not available in setup hook for DDP ([#6506](Lightning-AI/pytorch-lightning#6506))

- Fixed `EarlyStopping` logic when `min_epochs` or `min_steps` requirement is not met ([#6705](Lightning-AI/pytorch-lightning#6705))

## [1.2.7] - 2021-04-06

### Fixed

- Fixed resolve a bug with omegaconf and xm.save ([#6741](Lightning-AI/pytorch-lightning#6741))
- Fixed an issue with IterableDataset when __len__ is not defined ([#6828](Lightning-AI/pytorch-lightning#6828))
- Sanitize None params during pruning ([#6836](Lightning-AI/pytorch-lightning#6836))
- Enforce an epoch scheduler interval when using SWA ([#6588](Lightning-AI/pytorch-lightning#6588))
- Fixed TPU Colab hang issue, post training ([#6816](Lightning-AI/pytorch-lightning#6816))
- Fixed a bug where `TensorBoardLogger` would give a warning and not log correctly to a symbolic link `save_dir` ([#6730](Lightning-AI/pytorch-lightning#6730))

## [1.2.6] - 2021-03-30

### Changed

- Changed the behavior of `on_epoch_start` to run at the beginning of validation & test epoch ([#6498](Lightning-AI/pytorch-lightning#6498))

### Removed

- Removed legacy code to include `step` dictionary returns in `callback_metrics`. Use `self.log_dict` instead. ([#6682](Lightning-AI/pytorch-lightning#6682))

### Fixed

- Fixed `DummyLogger.log_hyperparams` raising a `TypeError` when running with `fast_dev_run=True` ([#6398](Lightning-AI/pytorch-lightning#6398))
- Fixed error on TPUs when there was no `ModelCheckpoint` ([#6654](Lightning-AI/pytorch-lightning#6654))
- Fixed `trainer.test` freeze on TPUs ([#6654](Lightning-AI/pytorch-lightning#6654))
- Fixed a bug where gradients were disabled after calling `Trainer.predict` ([#6657](Lightning-AI/pytorch-lightning#6657))
- Fixed bug where no TPUs were detected in a TPU pod env ([#6719](Lightning-AI/pytorch-lightning#6719))

## [1.2.5] - 2021-03-23

### Changed

- Update Gradient Clipping for the TPU Accelerator ([#6576](Lightning-AI/pytorch-lightning#6576))
- Refactored setup for typing friendly ([#6590](Lightning-AI/pytorch-lightning#6590))

### Fixed

- Fixed a bug where `all_gather` would not work correctly with `tpu_cores=8` ([#6587](Lightning-AI/pytorch-lightning#6587))
- Fixed comparing required versions ([#6434](Lightning-AI/pytorch-lightning#6434))
- Fixed duplicate logs appearing in console when using the python logging module ([#6275](Lightning-AI/pytorch-lightning#6275))
- Added Autocast in validation, test and predict modes for Native AMP ([#6565](Lightning-AI/pytorch-lightning#6565))

Reviewed By: shuyingsunshine21

Differential Revision: D27528929

fbshipit-source-id: 311c88f71461c2c79bbf185e28d7a6d683ccc26f

Labels

distributed (Generic distributed-related topic), environment: slurm, ready (PRs ready to be merged), refactor
