introduce default cluster environment for lightning-specific ddp #5915

awaelchli · 2021-02-11T00:56:38Z

What does this PR do?

Solves a TODO for decoupling the TorchElasticEnvironment from the Lightning DDP "environment" (this todo was present even before the major acc-refactor, it was just in a different place).

Also adds more tests (100% coverage for all cluster environments)

Follow up to #5743

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified
Check that target branch and milestone match!

Did you have fun?

Make sure you had fun coding 🙃

codecov · 2021-02-15T00:50:42Z

Codecov Report

Merging #5915 (03bcc33) into master (248a8e8) will increase coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #5915   +/-   ##
======================================
  Coverage      93%     93%           
======================================
  Files         160     161    +1     
  Lines       11415   11456   +41     
======================================
+ Hits        10644   10691   +47     
+ Misses        771     765    -6

pytorch_lightning/plugins/environments/default_environment.py

pytorch_lightning/plugins/training_type/ddp_spawn.py

carmocca · 2021-02-22T00:16:20Z

pytorch_lightning/plugins/environments/default_environment.py

+from pytorch_lightning.utilities.distributed import find_free_network_port
+
+
+class DefaultEnvironment(ClusterEnvironment):


Why not let ClusterEnvironment have these default implementations instead of separating the interface ClusterEnvironment from DefaultEnvironment?

Yes, good question. I thought about it too. The idea was to have the abstract class as an interface and by marking methods as abstract the user would know exactly what to implement. Taking this into consideration, would you still recommend merging them together?

Notice how master_address() and master_port() are duplicate in both DefaultEnvironment and TorchElasticEnvironment because each implements (subclasses) the interface directly.

I'm not an expert on ABC's tricks. Is there any way we can set ClusterEnvironment as the abstract interface AND DefaultEnvironment as the default implementation? That would be ideal.

I guess the simplest way to solve it is by merging the interface and the default implementation.

They are not exactly duplicates, But I know what you mean.
Maybe the name DefaultEnvironment is not a good choice..
Essentially, there is a key difference between the "DefaultEnvironment" and the others that we currently have, and that is the way processes are spawned (controlled by spawns_children bool method). Making this the superclass from which all derive would be suboptimal. Perhaps it should be called LightningEnvironment or something like that?

BasicEnvironment?

Looking into it further, I'm not so sure anymore. I guess it depends on whether users are encouraged (or only allowed) to inherit from cluster or default/base.

As we are using localhost 127.0.0.1, we could call it LocalhostEnvironment which might be more appropriate.

The typical naming convention here would be to add Abstract to the abstract base class. In this case, you'd change ClusterEnvironment to AbstractClusterEnvironment. This quite explicitly tells users that they cannot instantiate it directly. The basic instantiable subclass would then be called ClusterEnvironment. Anyone looking to provide a custom environment would then extend AbstractClusterEnvironment.

Regarding the duplicate code in DefaultEnvironment and TorchElasticEnvironment.... one option would be to make a LocalhostEnvironment class, which extends the abstract ClusterEnvironment and only implements master_port and master_address to default to 127.0.0.1. DefaultEnvironment and TorchElasticEnvironment would then extend LocalhostEnvironment.

I renamed DefaultEnvironment to LightningEnvironment for now. I want to clear up some potential points of confusion: The previous name was a poor choice, I admit that, because it would suggest that DefaultEnvironment holds the default implementation and others should inherit from that. But that is the opposite I wanted to achieve. The goal here is to provide a specification, and therefore the abstract class.
I do not plan to provide a base class that implements all methods, and that all environments inherit from. We need to hold back here and not abuse inheritance for the sole purpose of "code reuse".
I'll see what I can do about the "duplicated code" for master address and master port.

tchaton · 2021-02-22T09:31:14Z

pytorch_lightning/plugins/environments/default_environment.py

+from pytorch_lightning.utilities.distributed import find_free_network_port
+
+
+class DefaultEnvironment(ClusterEnvironment):


As we are using localhost 127.0.0.1, we could call it LocalhostEnvironment which might be more appropriate.

pytorch_lightning/plugins/environments/slurm_environment.py

Borda · 2021-03-01T09:49:15Z

@awaelchli any update here? ;]

carmocca

I'll see what I can do about the "duplicated code" for master address and master port.

Do you plan to do that here?

pytorch_lightning/plugins/environments/lightning_environment.py

awaelchli · 2021-03-02T15:12:45Z

@carmocca What would be your advice? Motivation of this PR was to disentangle the TorchElasticEnvironment from Lightnings own mechanism. I'd be happy to follow up with smaller refactors after discussion. I'd also like to unblock @ajtritt for their LSF addition

carmocca · 2021-03-02T15:20:19Z

I'd be happy to follow up with smaller refactors after discussion. I'd also like to unblock @ajtritt for their LSF addition

Sure! No problem. My only advice would be to open an issue if you don't plan to do it soon™️ so we don't forget

awaelchli · 2021-03-02T15:37:49Z

Great. Made a bullet point list here #6303 of potential improvements :)

CHANGELOG.md

justusschock · 2021-03-03T13:47:31Z

pytorch_lightning/plugins/environments/cluster_environment.py

+    def spawns_children(self) -> bool:
+        """ Whether the environment spawns the subprocesses or not. """


Does this control whether at all children will be spawned or whether this is done by the environment or the plugin?

Nit: Maybe spawn is not the best term here, since we don't always spawn processes, but sometimes also fork them. Maybe create would be better?

Does this control whether at all children will be spawned or whether this is done by the environment or the plugin?

The latter: whether this is done by the environment or the plugin.
If spawns_children returns False, the DDPPlugin will lauch the subprocesses for each gpu. Otherwise we assume the process we are in is already a worker process.

I took your suggestion and renamed to creates_children, that's better.

justusschock · 2021-03-03T13:47:58Z

pytorch_lightning/plugins/environments/cluster_environment.py

-    def __init__(self):
-        self._world_size = None
+    @abstractmethod
+    def spawns_children(self) -> bool:


Should all of those be functions or properties instead?

could be, not sure, added your suggestion to #6303
for this PR I just followed the design that was already there.

Borda · 2021-03-04T19:02:04Z

@awaelchli not sure why there are 312 commits, mind rebase so it clear what is the change?

awaelchli · 2021-03-04T19:13:51Z

@Borda that was from the accelerator refactor.
Just one commit now

s-rog

LGTM, much clearer than before!

…htning-AI#5915) * handle distributed_sampler_kwargs * move emptying cache to accelertor * fix a few tests * restoring the result from subprocess * fix queue.get() order for results * add missing "block_backward_sync" context manager * add missing "block_backward_sync" context manager * fix sync_batchnorm * fix supported gpu-ids for tuple * fix clip gradients and inf recursion * accelerator selection: added cluster_environment plugin * fix torchelastic test * fix reduce early stopping decision for DDP * fix tests: callbacks, conversion to lightning optimizer * fix lightning optimizer does not pickle * fix setting benchmark and deterministic option * fix slurm amp test * fix prepare_data test and determine node_rank * fix retrieving last path when testing * remove obsolete plugin argument * fix test: test_trainer_config * fix torchscript tests * fix trainer.model access * move properties * fix test_transfer_batch_hook * fix auto_select_gpus * fix omegaconf test * fix test that needs to simulate slurm ddp * add horovod plugin * fix test with named arguments * clean up whitespace * fix datamodules test * remove old accelerators * fix naming * move old plugins * move to plugins * create precision subpackage * create training_type subpackage * fix all new import errors * fix wrong arguments order passed to test * fix LR finder * Added sharded training type and amp plugin * Move clip grad to precision plugin * Added sharded spawn, select accelerators based on distributed_backend + enable custom fp16 plugin automatically * Fix import issue, attempting to fix tests * Fix initial test * Reflect hook logic from master, should wrap model after move to device * Optional state consolidation, since master has optimizers not wrapped * change attribute for instance test * reset optimizers optimizers are not used in main process, so state would be wrong. * legacy * imports in accel * legacy2 * trainer imports * fix import errors after rebase * move hook to new setup location * provide unwrapping logic * fix trainer callback system * added ddp2 implementation * fix imports .legacy * move plugins * restore legacy * drop test.py from root * add tpu accelerator and plugins * fixes * fix lightning optimizer merge * reset bugreportmodel * unwrapping * step routing forward * model access * unwrap * opt * integrate distrib_type * sync changes * sync * fixes * add forgotten generators * add missing logic * update * import * missed imports * import fixes * isort * mv f * changelog * format * move helper to parallel plugin * d * add world size * clean up * duplicate * activate ddp_sharded and tpu * set nvidia flags * remove unused colab var * use_tpu <-> on_tpu attrs * make some ddp_cpu and clusterplugin tests pass * Ref/accelerator connector (Lightning-AI#5742) * final cleanup Co-authored-by: Adrian Wälchli <[email protected]> * connector cleanup Co-authored-by: Adrian Wälchli <[email protected]> * trainer cleanup Co-authored-by: Adrian Wälchli <[email protected]> * accelerator cleanup + missing logic in accelerator connector Co-authored-by: Adrian Wälchli <[email protected]> * add missing changes to callbacks Co-authored-by: Adrian Wälchli <[email protected]> * reflect accelerator changes to lightning module Co-authored-by: Adrian Wälchli <[email protected]> * clean cluster envs Co-authored-by: Adrian Wälchli <[email protected]> * cleanup plugins Co-authored-by: Adrian Wälchli <[email protected]> * add broadcasting Co-authored-by: Adrian Wälchli <[email protected]> * yapf * remove plugin connector Co-authored-by: Adrian Wälchli <[email protected]> * plugins * manual optimization * update optimizer routing * add rank to torchelastic * fix memory mixed precision * setstate on trainer for pickling in ddp spawn * add predict method * add back commented accelerator code * adapt test for sync_batch_norm to new plugin * fix deprecated tests * fix ddp cpu choice when no num_processes are given * yapf format * skip a memory test that cannot pass anymore * fix pickle error in spawn plugin * x * avoid * x * fix cyclic import in docs build * add support for sharded * update typing * add sharded and sharded_spawn to distributed types * make unwrap model default * refactor LightningShardedDataParallel similar to LightningDistributedDataParallel * update sharded spawn to reflect changes * update sharded to reflect changes * Merge 1.1.5 changes * fix merge * fix merge * yapf isort * fix merge * yapf isort * fix indentation in test * copy over reinit scheduler implementation from dev1.2 * fix apex tracking calls with dev_debugger * reduce diff to dev1.2, clean up * fix trainer config test when gpus>0 and num_processes >0 and ddp_cpu * sort plugin tests legacy/new * fix error handling for amp on cpu * fix merge fix merge fix merge * [Feat] Resolve manual_backward (Lightning-AI#5837) * resolve manual_backward * resolve flake8 * update * resolve for ddp_spawn * resolve flake8 * resolve flake8 * resolve flake8 Co-authored-by: Ubuntu <[email protected]> * fix tests/accelerator tests on cpu * [BugFix] Resolve manual optimization (Lightning-AI#5852) * resolve manual_optimization * update * update Co-authored-by: Ubuntu <[email protected]> * Remove copy trainer parameters to happen earlier within the loop and add safe guard to get ref model (Lightning-AI#5856) * resovle a bug * Accelerator refactor sharded rpc (Lightning-AI#5854) * rpc branch * merge * update handling of rpc * make devices etc. Optional in RPC * set devices etc. later if necessary * remove devices from sequential * make devices optional in rpc * fix import * uncomment everything * fix cluster selection Co-authored-by: Ubuntu <[email protected]> * resolve bug * fix assert in rpc test * resolve a test * fix docs compilation * accelerator refactor - fix for sharded parity test (Lightning-AI#5866) * fix memory issue with ddp_spawn * x x x x x x x x x * x * Remove DDP2 as this does not apply * Add missing pre optimizer hook to ensure lambda closure is called * fix apex docstring * [accelerator][BugFix] Resolve some test for 1 gpu (Lightning-AI#5863) * update * revert init * resolve a bug * update * resolve flake8 * update * update * update * revert init * resolve a bug * update * resolve flake8 * update * update * update * update * update * revert init * resolve a bug * update * resolve flake8 * update * update * update * revert init * update * resolve flake8 * update * update * update * update * update * all_gather * update * make plugins work, add misconfig for RPC * update * update * remove breaking test * resolve some tests * resolve flake8 * revert to ddp_spawn Co-authored-by: root <[email protected]> Co-authored-by: Ubuntu <[email protected]> Co-authored-by: Justus Schock <[email protected]> * yapf isort * resolve flake8 * fix apex doctests * fix apex doctests 2 * resolve docs * update drone * clean env * update * update * update * update * merge * Fix RPC related tests, clean out old API, update for new accelerator API [skip ci] (Lightning-AI#5881) * Fix RPC related tests, clean out old API, update for new accelerator API * Move tests out of legacy folder, update paths and names * Update test_remove_1-4.py * Expose properties for tpu cores/gpus/num_gpus * Add root GPU property * Move properties to properties.py * move tests that were previously in drone * Fix root GPU property (Lightning-AI#5908) * Move root GPU to property, remove horovod set as this is handled in horovod plugin, ensure we mock correctly to set GPU accelerator * Add missing tests back * fix best model path transfer when no checkpoint callback available * Fix setup hook order [wip] (Lightning-AI#5858) * Call trainer setup hook before accelerator setup * Add test case * add new test * typo * fix callback order in test Co-authored-by: tchaton <[email protected]> Co-authored-by: Adrian Wälchli <[email protected]> * rename ddp sequential -> rpc sequential for special test * revert * fix stupid merge problem * abstract the cluster plugins * default plugin * integrate default environment * fix property * adapt tests * adjust test * fix world size access * base cluster env * revert rebase errors * revert rebase errors * missing import * revert unrelated change * remove unused cluster local rank * remove unrelated changes * fix unrelated changes * fix pep8 * remove unused var * reset permissions * ypaf * test default environment * test torchelastic environment * world size as int * tests for slurm environment * changelog * test comments * remove unintended change * keep master port fixed after it is generated * test random master port * yapf * add missing default environment * move helper function * rename default environment * rename * rename * yapf * Update pytorch_lightning/plugins/environments/lightning_environment.py Co-authored-by: Carlos Mocholí <[email protected]> * Update CHANGELOG.md Co-authored-by: Justus Schock <[email protected]> * spawn -> create Co-authored-by: justusschock <[email protected]> Co-authored-by: SeanNaren <[email protected]> Co-authored-by: Justus Schock <[email protected]> Co-authored-by: Jirka Borovec <[email protected]> Co-authored-by: Justus Schock <[email protected]> Co-authored-by: chaton <[email protected]> Co-authored-by: Ubuntu <[email protected]> Co-authored-by: Sean Naren <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: Carlos Mocholí <[email protected]>

…ter) to github/third-party/PyTorchLightning/pytorch-lightning Summary: ### New commit log messages ## [UnReleased] - 2021-MM-DD ### Added - Added more explicit exception message when trying to execute `trainer.test()` or `trainer.validate()` with `fast_dev_run=True` ([#6667](Lightning-AI/pytorch-lightning#6667)) - Added `LightningCLI` class to provide simple reproducibility with minimum boilerplate training cli. ([#4492](Lightning-AI/pytorch-lightning#4492)) - Trigger warning when non-metric logged value with multi processes hasn't been reduced ([#6417](Lightning-AI/pytorch-lightning#6417)) - Added `gradient_clip_algorithm` argument to Trainer for gradient clipping by value ([#6123](Lightning-AI/pytorch-lightning#6123)). - Added a way to print to terminal without breaking up the progress bar ([#5470](Lightning-AI/pytorch-lightning#5470)) - Added support to checkpoint after training steps in `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146)) - Added `checkpoint` parameter to callback's `on_save_checkpoint` hook ([#6072](Lightning-AI/pytorch-lightning#6072)) - Added `RunningStage.SANITY_CHECKING` ([#4945](Lightning-AI/pytorch-lightning#4945)) - Added `TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}` ([#4945](Lightning-AI/pytorch-lightning#4945)) - Added `Trainer.validate()` method to perform one evaluation epoch over the validation set ([#4948](Lightning-AI/pytorch-lightning#4948)) - Added `LightningEnvironment` for Lightning-specific DDP ([#5915](Lightning-AI/pytorch-lightning#5915)) - Added `teardown()` hook to LightningDataModule ([#4673](Lightning-AI/pytorch-lightning#4673)) - Added `auto_insert_metric_name` parameter to `ModelCheckpoint` ([#6277](Lightning-AI/pytorch-lightning#6277)) - Added arg to `self.log` that enables users to give custom names when dealing with multiple dataloaders ([#6274](Lightning-AI/pytorch-lightning#6274)) - Added `teardown` method to `BaseProfiler` to enable subclasses defining post-profiling steps outside of `__del__` ([#6370](Lightning-AI/pytorch-lightning#6370)) - Added `setup` method to `BaseProfiler` to enable subclasses defining pre-profiling steps for every process ([#6633](Lightning-AI/pytorch-lightning#6633)) - Added no return warning to predict ([#6139](Lightning-AI/pytorch-lightning#6139)) - Added `Trainer.predict` config validation ([#6543](Lightning-AI/pytorch-lightning#6543)) - Added `AbstractProfiler` interface ([#6621](Lightning-AI/pytorch-lightning#6621)) - Added support for including module names for forward in the autograd trace of `PyTorchProfiler` ([#6349](Lightning-AI/pytorch-lightning#6349)) - Added support for the PyTorch 1.8.1 autograd profiler ([#6618](Lightning-AI/pytorch-lightning#6618)) - Added `outputs` parameter to callback's `on_validation_epoch_end` & `on_test_epoch_end` hooks ([#6120](Lightning-AI/pytorch-lightning#6120)) - Added `configure_sharded_model` hook ([#6679](Lightning-AI/pytorch-lightning#6679)) - Added support for `precision=64`, enabling training with double precision ([#6595](Lightning-AI/pytorch-lightning#6595)) - Added support for DDP communication hooks ([#6736](Lightning-AI/pytorch-lightning#6736)) - Added `artifact_location` argument to `MLFlowLogger` which will be passed to the `MlflowClient.create_experiment` call ([#6677](Lightning-AI/pytorch-lightning#6677)) - Added `model` parameter to precision plugins' `clip_gradients` signature ([#6764](Lightning-AI/pytorch-lightning#6764)) ### Changed - Renamed `pytorch_lightning.callbacks.swa` to `pytorch_lightning.callbacks.stochastic_weight_avg` ([#6259](Lightning-AI/pytorch-lightning#6259)) - Refactor `RunningStage` and `TrainerState` usage ([#4945](Lightning-AI/pytorch-lightning#4945)) - Changed `trainer.evaluating` to return `True` if validating or testing ([#4945](Lightning-AI/pytorch-lightning#4945)) - Changed `setup()` and `teardown()` stage argument to take any of `{fit,validate,test,predict}` ([#6386](Lightning-AI/pytorch-lightning#6386)) - Changed profilers to save separate report files per state and rank ([#6621](Lightning-AI/pytorch-lightning#6621)) - Changed `PyTorchProfiler` to use `torch.autograd.profiler.record_function` to record functions ([#6349](Lightning-AI/pytorch-lightning#6349)) ### Deprecated - `period` has been deprecated in favor of `every_n_val_epochs` in the `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146)) - Deprecated `trainer.running_sanity_check` in favor of `trainer.sanity_checking` ([#4945](Lightning-AI/pytorch-lightning#4945)) - Deprecated `Profiler(output_filename)` in favor of `dirpath` and `filename` ([#6621](Lightning-AI/pytorch-lightning#6621)) - Deprecated `PytorchProfiler(profiled_functions)` in favor of `record_functions` ([#6349](Lightning-AI/pytorch-lightning#6349)) - Deprecated metrics in favor of `torchmetrics` ([#6505](Lightning-AI/pytorch-lightning#6505), [#6530](Lightning-AI/pytorch-lightning#6530), [#6540](Lightning-AI/pytorch-lightning#6540), [#6547](Lightning-AI/pytorch-lightning#6547), [#6515](Lightning-AI/pytorch-lightning#6515), [#6572](Lightning-AI/pytorch-lightning#6572), [#6573](Lightning-AI/pytorch-lightning#6573), [#6584](Lightning-AI/pytorch-lightning#6584), [#6636](Lightning-AI/pytorch-lightning#6636), [#6637](Lightning-AI/pytorch-lightning#6637), [#6649](Lightning-AI/pytorch-lightning#6649), [#6659](Lightning-AI/pytorch-lightning#6659), ) ### Removed - Removed support for passing a bool value to `profiler` argument of Trainer ([#6164](Lightning-AI/pytorch-lightning#6164)) - Removed no return warning from val/test step ([#6139](Lightning-AI/pytorch-lightning#6139)) - Removed passing a `ModelCheckpoint` instance to `Trainer(checkpoint_callback)` ([#6166](Lightning-AI/pytorch-lightning#6166)) - Removed deprecated Trainer argument `enable_pl_optimizer` and `automatic_optimization` ([#6163](Lightning-AI/pytorch-lightning#6163)) - Removed deprecated metrics ([#6161](Lightning-AI/pytorch-lightning#6161)) * from `pytorch_lightning.metrics.functional.classification` removed `to_onehot`, `to_categorical`, `get_num_classes`, `roc`, `multiclass_roc`, `average_precision`, `precision_recall_curve`, `multiclass_precision_recall_curve` * from `pytorch_lightning.metrics.functional.reduction` removed `reduce`, `class_reduce` - Removed deprecated `ModelCheckpoint` arguments `prefix`, `mode="auto"` ([#6162](Lightning-AI/pytorch-lightning#6162)) - Removed `mode='auto'` from `EarlyStopping` ([#6167](Lightning-AI/pytorch-lightning#6167)) - Removed legacy references for magic keys in the `Result` object ([#6016](Lightning-AI/pytorch-lightning#6016)) - Removed deprecated `LightningModule` `hparams` setter ([#6207](Lightning-AI/pytorch-lightning#6207)) - Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the `"log"/"progress_bar"` magic keys. Use `self.log` instead ([#6734](Lightning-AI/pytorch-lightning#6734)) - Removed `optimizer_idx` argument from `training_step` in manual optimization ([#6093](Lightning-AI/pytorch-lightning#6093)) ### Fixed - Set better defaults for `rank_zero_only.rank` when training is launched with SLURM and torchelastic ([#6802](Lightning-AI/pytorch-lightning#6802)) - Made the `Plugin.reduce` method more consistent across all Plugins to reflect a mean-reduction by default ([#6011](Lightning-AI/pytorch-lightning#6011)) - Move lightning module to correct device type when using LightningDistributedWrapper ([#6070](Lightning-AI/pytorch-lightning#6070)) - Do not print top-k verbose log with `ModelCheckpoint(monitor=None)` ([#6109](Lightning-AI/pytorch-lightning#6109)) - Fixed csv extension check ([#6436](Lightning-AI/pytorch-lightning#6436)) - Fixed `ModelCheckpoint(monitor=None, save_last=True)` not saving checkpoints ([#6136](Lightning-AI/pytorch-lightning#6136)) - Fixed `ModelCheckpoint(save_top_k=0, save_last=True)` not saving the `last` checkpoint ([#6136](Lightning-AI/pytorch-lightning#6136)) - Fixed `.teardown(stage='fit')` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386)) - Fixed `.on_fit_{start,end}()` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386)) - Fixed LightningModule `all_gather` on cpu tensors ([#6416](Lightning-AI/pytorch-lightning#6416)) - Fixed torch distributed not available in setup hook for DDP ([#6506](Lightning-AI/pytorch-lightning#6506)) - Fixed `EarlyStopping` logic when `min_epochs` or `min_steps` requirement is not met ([#6705](Lightning-AI/pytorch-lightning#6705)) ## [1.2.7] - 2021-04-06 ### Fixed - Fixed resolve a bug with omegaconf and xm.save ([#6741](Lightning-AI/pytorch-lightning#6741)) - Fixed an issue with IterableDataset when __len__ is not defined ([#6828](Lightning-AI/pytorch-lightning#6828)) - Sanitize None params during pruning ([#6836](Lightning-AI/pytorch-lightning#6836)) - Enforce an epoch scheduler interval when using SWA ([#6588](Lightning-AI/pytorch-lightning#6588)) - Fixed TPU Colab hang issue, post training ([#6816](Lightning-AI/pytorch-lightning#6816)) - Fixed a bug where `TensorBoardLogger` would give a warning and not log correctly to a symbolic link `save_dir` ([#6730](Lightning-AI/pytorch-lightning#6730)) ## [1.2.6] - 2021-03-30 ### Changed - Changed the behavior of `on_epoch_start` to run at the beginning of validation & test epoch ([#6498](Lightning-AI/pytorch-lightning#6498)) ### Removed - Removed legacy code to include `step` dictionary returns in `callback_metrics`. Use `self.log_dict` instead. ([#6682](Lightning-AI/pytorch-lightning#6682)) ### Fixed - Fixed `DummyLogger.log_hyperparams` raising a `TypeError` when running with `fast_dev_run=True` ([#6398](Lightning-AI/pytorch-lightning#6398)) - Fixed error on TPUs when there was no `ModelCheckpoint` ([#6654](Lightning-AI/pytorch-lightning#6654)) - Fixed `trainer.test` freeze on TPUs ([#6654](Lightning-AI/pytorch-lightning#6654)) - Fixed a bug where gradients were disabled after calling `Trainer.predict` ([#6657](Lightning-AI/pytorch-lightning#6657)) - Fixed bug where no TPUs were detected in a TPU pod env ([#6719](Lightning-AI/pytorch-lightning#6719)) ## [1.2.5] - 2021-03-23 ### Changed - Update Gradient Clipping for the TPU Accelerator ([#6576](Lightning-AI/pytorch-lightning#6576)) - Refactored setup for typing friendly ([#6590](Lightning-AI/pytorch-lightning#6590)) ### Fixed - Fixed a bug where `all_gather` would not work correctly with `tpu_cores=8` ([#6587](Lightning-AI/pytorch-lightning#6587)) - Fixed comparing required versions ([#6434](Lightning-AI/pytorch-lightning#6434)) - Fixed duplicate logs appearing in console when using the python logging module ([#6275](Lightning-AI/pytorch-lightning#6275)) - Added Autocast in validation, test and predict modes for Native AMP ([#6565](Lightning-AI/pytorch-lightning#6565)) Reviewed By: shuyingsunshine21 Differential Revision: D27528929 fbshipit-source-id: 311c88f71461c2c79bbf185e28d7a6d683ccc26f

awaelchli added the refactor label Feb 11, 2021

awaelchli mentioned this pull request Feb 12, 2021

PoC: Accelerator refactor #5743

Merged

16 tasks

Base automatically changed from accelerator-refactor-sharded to master February 12, 2021 20:48

awaelchli force-pushed the refactor/cluster-spawn branch from e9c2459 to f9cff77 Compare February 15, 2021 00:46

awaelchli added this to the 1.3 milestone Feb 15, 2021

awaelchli changed the title ~~refactor - default cluster environment for lightning specific ddp~~ introduce default cluster environment for lightning specific ddp Feb 19, 2021

awaelchli changed the title ~~introduce default cluster environment for lightning specific ddp~~ introduce default cluster environment for lightning-specific ddp Feb 19, 2021

awaelchli marked this pull request as ready for review February 21, 2021 22:14

awaelchli requested review from Borda, SeanNaren, carmocca, justusschock, tchaton and williamFalcon as code owners February 21, 2021 22:14

awaelchli added distributed Generic distributed-related topic environment: slurm labels Feb 21, 2021

carmocca reviewed Feb 22, 2021

View reviewed changes

tchaton reviewed Feb 22, 2021

View reviewed changes

carmocca mentioned this pull request Feb 22, 2021

Add LSF support #5102

Merged

11 tasks

mergify bot added the has conflicts label Feb 23, 2021

mergify bot removed the has conflicts label Mar 2, 2021

SeanNaren approved these changes Mar 2, 2021

View reviewed changes

mergify bot added the has conflicts label Mar 2, 2021

carmocca reviewed Mar 2, 2021

View reviewed changes

pytorch_lightning/plugins/environments/lightning_environment.py Outdated Show resolved Hide resolved

mergify bot removed the has conflicts label Mar 2, 2021

carmocca approved these changes Mar 2, 2021

View reviewed changes

carmocca added the ready PRs ready to be merged label Mar 2, 2021

awaelchli mentioned this pull request Mar 2, 2021

ClusterEnvironment interface and improvements #6303

Closed

8 tasks

mergify bot added has conflicts and removed has conflicts labels Mar 2, 2021

justusschock approved these changes Mar 3, 2021

View reviewed changes

mergify bot removed the has conflicts label Mar 3, 2021

Borda requested review from s-rog and tchaton March 4, 2021 18:59

Borda enabled auto-merge (squash) March 4, 2021 19:00

all commits in one

6c9b21e

awaelchli force-pushed the refactor/cluster-spawn branch from 13fa3c5 to 6c9b21e Compare March 4, 2021 19:12

mergify bot added the has conflicts label Mar 4, 2021

s-rog approved these changes Mar 5, 2021

View reviewed changes

Merge branch 'master' into refactor/cluster-spawn

b1136ac

mergify bot removed the has conflicts label Mar 5, 2021

Borda merged commit ec8d46e into master Mar 5, 2021

Borda deleted the refactor/cluster-spawn branch March 5, 2021 01:47

justusschock mentioned this pull request Mar 8, 2021

Add Sagemaker DDP Plugin #6271

Closed

11 tasks

tchaton mentioned this pull request Mar 8, 2021

1.2.x cherries 🍒 #6083

Closed

		from pytorch_lightning.utilities.distributed import find_free_network_port


		class DefaultEnvironment(ClusterEnvironment):

		def spawns_children(self) -> bool:
		""" Whether the environment spawns the subprocesses or not. """

introduce default cluster environment for lightning-specific ddp #5915

introduce default cluster environment for lightning-specific ddp #5915

Uh oh!

Conversation

awaelchli commented Feb 11, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Before submitting

PR review

Did you have fun?

Uh oh!

codecov bot commented Feb 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

carmocca Feb 22, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

awaelchli Feb 22, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tchaton Feb 22, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ajtritt Feb 24, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

awaelchli Mar 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tchaton Feb 22, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Borda commented Mar 1, 2021

Uh oh!

carmocca left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

awaelchli commented Mar 2, 2021

Uh oh!

carmocca commented Mar 2, 2021

Uh oh!

awaelchli commented Mar 2, 2021

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

awaelchli Mar 3, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Borda commented Mar 4, 2021

Uh oh!

awaelchli commented Mar 4, 2021

Uh oh!

s-rog left a comment

awaelchli commented Feb 11, 2021 •

edited

Loading

codecov bot commented Feb 15, 2021 •

edited

Loading

carmocca Feb 22, 2021 •

edited

Loading

awaelchli Feb 22, 2021 •

edited

Loading

tchaton Feb 22, 2021 •

edited

Loading

ajtritt Feb 24, 2021 •

edited

Loading

awaelchli Mar 1, 2021 •

edited

Loading

tchaton Feb 22, 2021 •

edited

Loading

awaelchli Mar 3, 2021 •

edited

Loading