
Commit 266bd66

Merge branch 'master' into add/rich_logging

2 parents 7df6033 + c8e9fb4

10 files changed: +120 −25 lines changed

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -66,6 +66,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added Rich Progress Bar ([#8929](https://github.com/PyTorchLightning/pytorch-lightning/pull/8929))


+- Added a friendly error message when DDP attempts to spawn new distributed processes with rank > 0 ([#9005](https://github.com/PyTorchLightning/pytorch-lightning/pull/9005))
+
+
 ### Changed

 - Parsing of the `gpus` Trainer argument has changed: `gpus="n"` (str) no longer selects the GPU index n and instead selects the first n devices. ([#8770](https://github.com/PyTorchLightning/pytorch-lightning/pull/8770))

@@ -169,6 +172,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Removed deprecated `GradInformation` module in favor of `pytorch_lightning.utilities.grads` ([#8831](https://github.com/PyTorchLightning/pytorch-lightning/pull/8831/))


+- Removed `TrainingTypePlugin.on_save` and `Accelerator.on_save` ([#9023](https://github.com/PyTorchLightning/pytorch-lightning/pull/9023))
+
+
 - Removed deprecated `connect_precision_plugin` and `connect_training_type_plugin` from `Accelerator` ([#9019](https://github.com/PyTorchLightning/pytorch-lightning/pull/9019))

docs/source/governance.rst

Lines changed: 57 additions & 3 deletions
@@ -1,7 +1,14 @@
 .. _governance:

-Lightning Governance | Persons of interest
-==========================================
+Lightning Governance
+####################
+
+This document describes governance processes we follow in developing PyTorch Lightning.
+
+Persons of Interest
+*******************
+
+.. _governance_bdfl:

 BDFL
 ----

@@ -14,7 +21,7 @@ Leads
 -----
 - Jirka Borovec (`Borda <https://github.com/Borda>`_)
 - Ethan Harris (`ethanwharris <https://github.com/ethanwharris>`_) (Torchbearer founder)
-- Justus Schock (`justusschock <https://github.com/justusschock>`_) (Former Core Member PyTorch Ignite)
+- Justus Schock (`justusschock <https://github.com/justusschock>`_)
 - Adrian Wälchli (`awaelchli <https://github.com/awaelchli>`_)
 - Thomas Chaton (`tchaton <https://github.com/tchaton>`_)
 - Sean Narenthiran (`SeanNaren <https://github.com/SeanNaren>`_)

@@ -44,3 +51,50 @@ Alumni
 - Teddy Koker (`teddykoker <https://github.com/teddykoker>`_)
 - Nate Raw (`nateraw <https://github.com/nateraw>`_)
 - Peter Yu (`yukw777 <https://github.com/yukw777>`_)
+
+
+Releases
+********
+
+We release a new minor version (e.g., 1.5.0) every three months and bugfix releases every week.
+The minor versions contain new features, API changes, deprecations, removals, potential backward-incompatible
+changes and also all previous bugfixes included in any bugfix release. With every release, we publish a changelog
+where we list additions, removals, changed functionality and fixes.
+
+Project Management and Decision Making
+**************************************
+
+The decision about what goes into a release is governed by the :ref:`staff contributors and leaders <governance>` of
+Lightning development. Whenever possible, discussion happens publicly on GitHub and includes the whole community.
+For controversial changes, it is mandatory to seek consultation from the :ref:`governance_bdfl` for a final decision.
+When a consensus is reached, staff and core contributors assign milestones and labels to the issue and/or pull request
+and start tracking the development. It is possible that priorities change over time.
+
+Commits to the project are exclusively to be added by pull requests on GitHub and anyone in the community is welcome to
+review them. However, reviews submitted by
+`code owners <https://github.com/PyTorchLightning/pytorch-lightning/blob/master/.github/CODEOWNERS>`_
+have higher weight and it is necessary to get the approval of code owners before a pull request can be merged.
+Additional requirements may apply case by case.
+
+API Evolution
+*************
+
+Lightning's development is driven by research and best practices in a rapidly developing field of AI and machine
+learning. Change is inevitable and when it happens, the Lightning team is committed to minimizing user friction and
+maximizing ease of transition from one version to the next. We take backward compatibility and reproducibility very
+seriously.
+
+For API removal, renaming or other forms of backward-incompatible changes, the procedure is:
+
+#. A deprecation process is initiated at version X, producing warning messages at runtime and in the documentation.
+#. Calls to the deprecated API remain unchanged in their function during the deprecation phase.
+#. Two minor versions later, at version X+2, the breaking change takes effect.
+
+The "X+2" rule is a recommendation and not a strict requirement. Longer deprecation cycles may apply in some cases.
+
+New API and features are declared as:
+
+- *Experimental*: Anything labelled as *experimental* or *beta* in the documentation is considered unstable and should
+  not be used in production. The community is encouraged to test the feature and report issues directly on GitHub.
+- *Stable*: Everything not specifically labelled as experimental should be considered stable. Reported issues will be
+  treated with priority.

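The deprecation procedure above amounts to: at version X a deprecated call keeps working but emits a warning, and the removal takes effect around X+2. A minimal, hypothetical sketch of that pattern in plain Python (Lightning ships its own deprecation utilities; `old_api` and `new_api` are made-up names for illustration):

import warnings


def new_api(x):
    # The replacement the deprecated call forwards to.
    return x * 2


def old_api(x):
    # Deprecated at version X: warn, but keep behaviour unchanged until the
    # removal lands at version X+2 (or later).
    warnings.warn("`old_api` is deprecated; use `new_api` instead.", DeprecationWarning)
    return new_api(x)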
pytorch_lightning/accelerators/accelerator.py

Lines changed: 0 additions & 3 deletions
@@ -371,9 +371,6 @@ def lightning_module_state_dict(self) -> Dict[str, Union[Any, Tensor]]:
         """
         return self.training_type_plugin.lightning_module_state_dict()

-    def on_save(self, checkpoint: Dict[str, Union[Any, Tensor]]) -> Dict[str, Union[Any, Tensor]]:
-        return self.training_type_plugin.on_save(checkpoint)
-
     def barrier(self, name: Optional[str] = None) -> None:
         self.training_type_plugin.barrier(name=name)

pytorch_lightning/loggers/mlflow.py

Lines changed: 26 additions & 4 deletions
@@ -171,14 +171,24 @@ def experiment(self) -> MlflowClient:
         return self._mlflow_client

     @property
-    def run_id(self):
-        # create the experiment if it does not exist to get the run id
+    def run_id(self) -> str:
+        """
+        Create the experiment if it does not exist to get the run id.
+
+        Returns:
+            The run id.
+        """
         _ = self.experiment
         return self._run_id

     @property
-    def experiment_id(self):
-        # create the experiment if it does not exist to get the experiment id
+    def experiment_id(self) -> str:
+        """
+        Create the experiment if it does not exist to get the experiment id.
+
+        Returns:
+            The experiment id.
+        """
         _ = self.experiment
         return self._experiment_id

@@ -239,8 +249,20 @@ def save_dir(self) -> Optional[str]:

     @property
     def name(self) -> str:
+        """
+        Get the experiment id.
+
+        Returns:
+            The experiment id.
+        """
         return self.experiment_id

     @property
     def version(self) -> str:
+        """
+        Get the run id.
+
+        Returns:
+            The run id.
+        """
         return self.run_id

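The docstrings added above describe lazily created identifiers: reading either property first creates the MLflow experiment if it does not exist. A small usage sketch (the `experiment_name` and `tracking_uri` values are placeholders for a local file-based MLflow store):

from pytorch_lightning.loggers import MLFlowLogger

# Placeholder values; point tracking_uri at your own MLflow server or directory.
mlf_logger = MLFlowLogger(experiment_name="default", tracking_uri="file:./ml-runs")

# Accessing either property lazily creates the experiment/run on the MLflow side.
print(mlf_logger.experiment_id)  # also exposed as `mlf_logger.name`
print(mlf_logger.run_id)         # also exposed as `mlf_logger.version`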
pytorch_lightning/plugins/training_type/ddp.py

Lines changed: 4 additions & 6 deletions
@@ -106,7 +106,6 @@ def __init__(
         self.dist = LightningDistributed()
         self.num_processes = len(self.parallel_devices) if self.parallel_devices is not None else 0
         self._ddp_kwargs = kwargs
-        self._has_spawned_children = False
         self._task_idx = None
         self._ddp_comm_state = ddp_comm_state
         self._ddp_comm_hook = ddp_comm_hook

@@ -174,9 +173,7 @@ def setup_environment(self) -> None:

     def _call_children_scripts(self):
         # bookkeeping of spawned processes
-        assert self.local_rank == 0
         self._check_can_spawn_children()
-        self._has_spawned_children = True

         # DDP Environment variables
         os.environ["MASTER_ADDR"] = self.cluster_environment.master_address()

@@ -260,10 +257,11 @@ def setup_distributed(self):
         self.dist.device = self.root_device

     def _check_can_spawn_children(self):
-        if self._has_spawned_children:
+        if self.local_rank != 0:
             raise RuntimeError(
-                "You tried to run `.fit` or `.test` multiple times in the same script."
-                " This is not supported in DDP mode, switch to `distributed_backend='ddp_spawn'` instead."
+                "Lightning attempted to launch new distributed processes with `local_rank > 0`. This should not happen."
+                " Possible reasons: 1) LOCAL_RANK environment variable was incorrectly modified by the user,"
+                " 2) `ClusterEnvironment.creates_children()` incorrectly implemented."
             )

     def set_world_ranks(self) -> None:

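The reworded error points at `ClusterEnvironment.creates_children()`: when it returns True, Lightning assumes the distributed processes were launched externally and never calls `_call_children_scripts()`, so the new `local_rank` check is never reached. A hypothetical environment for such an external launcher might look roughly like this (illustrative sketch, not library code):

from pytorch_lightning.plugins.environments import LightningEnvironment


class ExternallyLaunchedEnvironment(LightningEnvironment):
    """Hypothetical environment where every process is started by an external launcher."""

    def creates_children(self) -> bool:
        # True tells Lightning the processes already exist, so the DDP plugin
        # must not spawn new ones on any rank.
        return True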
pytorch_lightning/plugins/training_type/ddp_spawn.py

Lines changed: 1 addition & 1 deletion
@@ -280,7 +280,7 @@ def __transfer_distrib_spawn_state_on_fit_end(self, trainer: "pl.Trainer", resul
         last_path = None
         if trainer.state.fn == TrainerFn.FITTING and best_model_path is not None and len(best_model_path) > 0:
             last_path = re.sub(".ckpt", ".tmp_end.ckpt", best_model_path)
-            atomic_save(self.on_save(state_dict), last_path)
+            atomic_save(state_dict, last_path)

         # todo, pass complete checkpoint as state dictionary
         self.mp_queue.put(best_model_path)

pytorch_lightning/plugins/training_type/training_type_plugin.py

Lines changed: 0 additions & 5 deletions
@@ -201,9 +201,6 @@ def validation_step_end(self, output):
     def test_step_end(self, output):
         return output

-    def on_save(self, checkpoint: Dict[str, Union[Any, torch.Tensor]]) -> Dict[str, Union[Any, torch.Tensor]]:
-        return checkpoint
-
     def process_dataloader(self, dataloader: Union[Iterable, DataLoader]) -> Union[Iterable, DataLoader]:
         """Wraps the dataloader if necessary

@@ -273,8 +270,6 @@ def save_checkpoint(self, checkpoint: Dict[str, Any], filepath: str) -> None:
             checkpoint: dict containing model and trainer state
             filepath: write-target file's path
         """
-        # dump states as a checkpoint dictionary object
-        checkpoint = self.on_save(checkpoint)
         if self.should_rank_save_checkpoint:
             return self.checkpoint_io.save_checkpoint(checkpoint, filepath)

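With `on_save` removed, `save_checkpoint` hands the checkpoint dictionary straight to the plugin's `checkpoint_io`. Logic that previously hooked into `on_save` to tweak the checkpoint could instead live in a custom `CheckpointIO`; a rough, hypothetical sketch (the exact import path and signature of `TorchCheckpointIO` may differ between versions):

from pytorch_lightning.plugins.io import TorchCheckpointIO


class MyCheckpointIO(TorchCheckpointIO):
    """Hypothetical plugin replicating logic that previously lived in `on_save`."""

    def save_checkpoint(self, checkpoint, path, *args, **kwargs):
        # Adjust the checkpoint dict just before it is written, then defer to
        # the default torch-based saving.
        checkpoint.setdefault("extra_state", {"note": "added by MyCheckpointIO"})
        return super().save_checkpoint(checkpoint, path, *args, **kwargs)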
pytorch_lightning/trainer/connectors/checkpoint_connector.py

Lines changed: 0 additions & 2 deletions
@@ -294,8 +294,6 @@ def hpc_save(self, folderpath: str, logger):

         model.on_hpc_save(checkpoint)

-        checkpoint = self.trainer.accelerator.on_save(checkpoint)
-
         # do the actual save
         # TODO: fix for anything with multiprocess DP, DDP, DDP2
         try:

pytorch_lightning/utilities/distributed.py

Lines changed: 1 addition & 1 deletion
@@ -353,7 +353,7 @@ def init_ddp_connection(
     torch_distributed_backend: str,
     global_rank: Optional[int] = None,
     world_size: Optional[int] = None,
-    **kwargs,
+    **kwargs: Any,
 ) -> None:
     """
     Utility function to initialize DDP connection by setting env variables

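The only functional point of this hunk is the annotation on `**kwargs`: under strict type-checking settings, an unannotated variadic parameter leaves the signature partially untyped. A minimal, generic illustration (the function name is made up):

from typing import Any


def configure(backend: str, **kwargs: Any) -> None:
    # `**kwargs: Any` types every extra keyword value as Any; without the
    # annotation, strict mypy treats the parameter as untyped.
    print(backend, kwargs)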
tests/plugins/test_ddp_plugin.py

Lines changed: 25 additions & 0 deletions
@@ -11,13 +11,16 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+import os
 from unittest import mock

+import pytest
 import torch
 from torch.nn.parallel import DistributedDataParallel

 from pytorch_lightning import Trainer
 from pytorch_lightning.plugins import DDPPlugin
+from pytorch_lightning.plugins.environments import LightningEnvironment
 from tests.helpers.boring_model import BoringModel
 from tests.helpers.runif import RunIf

@@ -69,3 +72,25 @@ def test_ddp_barrier_non_consecutive_device_ids(barrier_mock, tmpdir):
     trainer = Trainer(default_root_dir=tmpdir, max_steps=1, gpus=gpus, accelerator="ddp")
     trainer.fit(model)
     barrier_mock.assert_any_call(device_ids=[gpus[trainer.local_rank]])
+
+
+@mock.patch.dict(os.environ, {"LOCAL_RANK": "1"})
+def test_incorrect_ddp_script_spawning(tmpdir):
+    """Test the error message when the user accidentally instructs Lightning to spawn child processes on rank > 0."""
+
+    class WronglyImplementedEnvironment(LightningEnvironment):
+        def creates_children(self):
+            # Returning False unconditionally means Lightning would also try to spawn new processes on ranks > 0.
+            return False
+
+    model = BoringModel()
+    trainer = Trainer(
+        default_root_dir=tmpdir,
+        accelerator="ddp",
+        num_processes=2,
+        plugins=[DDPPlugin(), WronglyImplementedEnvironment()],
+    )
+    with pytest.raises(
+        RuntimeError, match="Lightning attempted to launch new distributed processes with `local_rank > 0`."
+    ):
+        trainer.fit(model)
