
Commit 266bd66

Merge branch 'master' into add/rich_logging

2 parents 7df6033 + c8e9fb4

10 files changed: +120 −25 lines changed

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -66,6 +66,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added Rich Progress Bar ([#8929](https://github.com/PyTorchLightning/pytorch-lightning/pull/8929))


+- Added a friendly error message when DDP attempts to spawn new distributed processes with rank > 0 ([#9005](https://github.com/PyTorchLightning/pytorch-lightning/pull/9005))
+
+
 ### Changed

 - Parsing of the `gpus` Trainer argument has changed: `gpus="n"` (str) no longer selects the GPU index n and instead selects the first n devices. ([#8770](https://github.com/PyTorchLightning/pytorch-lightning/pull/8770))

@@ -169,6 +172,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Removed deprecated `GradInformation` module in favor of `pytorch_lightning.utilities.grads` ([#8831](https://github.com/PyTorchLightning/pytorch-lightning/pull/8831/))


+- Removed `TrainingTypePlugin.on_save` and `Accelerator.on_save` ([#9023](https://github.com/PyTorchLightning/pytorch-lightning/pull/9023))
+
+
 - Removed deprecated `connect_precision_plugin` and `connect_training_type_plugin` from `Accelerator` ([#9019](https://github.com/PyTorchLightning/pytorch-lightning/pull/9019))

docs/source/governance.rst

Lines changed: 57 additions & 3 deletions
@@ -1,7 +1,14 @@
 .. _governance:

-Lightning Governance | Persons of interest
-==========================================
+Lightning Governance
+####################
+
+This document describes governance processes we follow in developing PyTorch Lightning.
+
+Persons of Interest
+*******************
+
+.. _governance_bdfl:

 BDFL
 ----

@@ -14,7 +21,7 @@ Leads
 -----
 - Jirka Borovec (`Borda <https://github.com/Borda>`_)
 - Ethan Harris (`ethanwharris <https://github.com/ethanwharris>`_) (Torchbearer founder)
-- Justus Schock (`justusschock <https://github.com/justusschock>`_) (Former Core Member PyTorch Ignite)
+- Justus Schock (`justusschock <https://github.com/justusschock>`_)
 - Adrian Wälchli (`awaelchli <https://github.com/awaelchli>`_)
 - Thomas Chaton (`tchaton <https://github.com/tchaton>`_)
 - Sean Narenthiran (`SeanNaren <https://github.com/SeanNaren>`_)

@@ -44,3 +51,50 @@ Alumni
 - Teddy Koker (`teddykoker <https://github.com/teddykoker>`_)
 - Nate Raw (`nateraw <https://github.com/nateraw>`_)
 - Peter Yu (`yukw777 <https://github.com/yukw777>`_)
+
+
+Releases
+********
+
+We release a new minor version (e.g., 1.5.0) every three months and bugfix releases every week.
+The minor versions contain new features, API changes, deprecations, removals, potential backward-incompatible
+changes and also all previous bugfixes included in any bugfix release. With every release, we publish a changelog
+where we list additions, removals, changed functionality and fixes.
+
+Project Management and Decision Making
+**************************************
+
+The decision about what goes into a release is governed by the :ref:`staff contributors and leaders <governance>` of
+Lightning development. Whenever possible, discussion happens publicly on GitHub and includes the whole community.
+For controversial changes, it is mandatory to seek consultation from the :ref:`governance_bdfl` for a final decision.
+When a consensus is reached, staff and core contributors assign milestones and labels to the issue and/or pull request
+and start tracking the development. It is possible that priorities change over time.
+
+Commits to the project are exclusively to be added by pull requests on GitHub and anyone in the community is welcome to
+review them. However, reviews submitted by
+`code owners <https://github.com/PyTorchLightning/pytorch-lightning/blob/master/.github/CODEOWNERS>`_
+have higher weight and it is necessary to get the approval of code owners before a pull request can be merged.
+Additional requirements may apply case by case.
+
+API Evolution
+*************
+
+Lightning's development is driven by research and best practices in a rapidly developing field of AI and machine
+learning. Change is inevitable and when it happens, the Lightning team is committed to minimizing user friction and
+maximizing ease of transition from one version to the next. We take backward compatibility and reproducibility very
+seriously.
+
+For API removal, renaming or other forms of backward-incompatible changes, the procedure is:
+
+#. A deprecation process is initiated at version X, producing warning messages at runtime and in the documentation.
+#. Calls to the deprecated API remain unchanged in their function during the deprecation phase.
+#. Two minor versions later, at version X+2, the breaking change takes effect.
+
+The "X+2" rule is a recommendation and not a strict requirement. Longer deprecation cycles may apply in some cases.
+
+New API and features are declared as:
+
+- *Experimental*: Anything labelled as *experimental* or *beta* in the documentation is considered unstable and should
+  not be used in production. The community is encouraged to test the feature and report issues directly on GitHub.
+- *Stable*: Everything not specifically labelled as experimental should be considered stable. Reported issues will be
+  treated with priority.

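The deprecation procedure above amounts to: at version X a deprecated call keeps working but emits a warning, and the removal takes effect around X+2. A minimal, hypothetical sketch of that pattern in plain Python (Lightning ships its own deprecation utilities; `old_api` and `new_api` are made-up names for illustration):

import warnings


def new_api(x):
    # The replacement the deprecated call forwards to.
    return x * 2


def old_api(x):
    # Deprecated at version X: warn, but keep behaviour unchanged until the
    # removal lands at version X+2 (or later).
    warnings.warn("`old_api` is deprecated; use `new_api` instead.", DeprecationWarning)
    return new_api(x)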
pytorch_lightning/accelerators/accelerator.py

Lines changed: 0 additions & 3 deletions
@@ -371,9 +371,6 @@ def lightning_module_state_dict(self) -> Dict[str, Union[Any, Tensor]]:
         """
         return self.training_type_plugin.lightning_module_state_dict()

-    def on_save(self, checkpoint: Dict[str, Union[Any, Tensor]]) -> Dict[str, Union[Any, Tensor]]:
-        return self.training_type_plugin.on_save(checkpoint)
-
     def barrier(self, name: Optional[str] = None) -> None:
         self.training_type_plugin.barrier(name=name)

pytorch_lightning/loggers/mlflow.py

Lines changed: 26 additions & 4 deletions
@@ -171,14 +171,24 @@ def experiment(self) -> MlflowClient:
         return self._mlflow_client

     @property
-    def run_id(self):
-        # create the experiment if it does not exist to get the run id
+    def run_id(self) -> str:
+        """
+        Create the experiment if it does not exist to get the run id.
+
+        Returns:
+            The run id.
+        """
         _ = self.experiment
         return self._run_id

     @property
-    def experiment_id(self):
-        # create the experiment if it does not exist to get the experiment id
+    def experiment_id(self) -> str:
+        """
+        Create the experiment if it does not exist to get the experiment id.
+
+        Returns:
+            The experiment id.
+        """
         _ = self.experiment
         return self._experiment_id

@@ -239,8 +249,20 @@ def save_dir(self) -> Optional[str]:

     @property
     def name(self) -> str:
+        """
+        Get the experiment id.
+
+        Returns:
+            The experiment id.
+        """
         return self.experiment_id

     @property
     def version(self) -> str:
+        """
+        Get the run id.
+
+        Returns:
+            The run id.
+        """
         return self.run_id

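The docstrings added above describe lazily created identifiers: reading either property first creates the MLflow experiment if it does not exist. A small usage sketch (the `experiment_name` and `tracking_uri` values are placeholders for a local file-based MLflow store):

from pytorch_lightning.loggers import MLFlowLogger

# Placeholder values; point tracking_uri at your own MLflow server or directory.
mlf_logger = MLFlowLogger(experiment_name="default", tracking_uri="file:./ml-runs")

# Accessing either property lazily creates the experiment/run on the MLflow side.
print(mlf_logger.experiment_id)  # also exposed as `mlf_logger.name`
print(mlf_logger.run_id)         # also exposed as `mlf_logger.version`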
pytorch_lightning/plugins/training_type/ddp.py

Lines changed: 4 additions & 6 deletions
@@ -106,7 +106,6 @@ def __init__(
         self.dist = LightningDistributed()
         self.num_processes = len(self.parallel_devices) if self.parallel_devices is not None else 0
         self._ddp_kwargs = kwargs
-        self._has_spawned_children = False
         self._task_idx = None
         self._ddp_comm_state = ddp_comm_state
         self._ddp_comm_hook = ddp_comm_hook

@@ -174,9 +173,7 @@ def setup_environment(self) -> None:

     def _call_children_scripts(self):
         # bookkeeping of spawned processes
-        assert self.local_rank == 0
         self._check_can_spawn_children()
-        self._has_spawned_children = True

         # DDP Environment variables
         os.environ["MASTER_ADDR"] = self.cluster_environment.master_address()

@@ -260,10 +257,11 @@ def setup_distributed(self):
         self.dist.device = self.root_device

     def _check_can_spawn_children(self):
-        if self._has_spawned_children:
+        if self.local_rank != 0:
             raise RuntimeError(
-                "You tried to run `.fit` or `.test` multiple times in the same script."
-                " This is not supported in DDP mode, switch to `distributed_backend='ddp_spawn'` instead."
+                "Lightning attempted to launch new distributed processes with `local_rank > 0`. This should not happen."
+                " Possible reasons: 1) LOCAL_RANK environment variable was incorrectly modified by the user,"
+                " 2) `ClusterEnvironment.creates_children()` incorrectly implemented."
             )

     def set_world_ranks(self) -> None:

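The reworded error points at `ClusterEnvironment.creates_children()`: when it returns True, Lightning assumes the distributed processes were launched externally and never calls `_call_children_scripts()`, so the new `local_rank` check is never reached. A hypothetical environment for such an external launcher might look roughly like this (illustrative sketch, not library code):

from pytorch_lightning.plugins.environments import LightningEnvironment


class ExternallyLaunchedEnvironment(LightningEnvironment):
    """Hypothetical environment where every process is started by an external launcher."""

    def creates_children(self) -> bool:
        # True tells Lightning the processes already exist, so the DDP plugin
        # must not spawn new ones on any rank.
        return True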
pytorch_lightning/plugins/training_type/ddp_spawn.py

Lines changed: 1 addition & 1 deletion
@@ -280,7 +280,7 @@ def __transfer_distrib_spawn_state_on_fit_end(self, trainer: "pl.Trainer", resul
         last_path = None
         if trainer.state.fn == TrainerFn.FITTING and best_model_path is not None and len(best_model_path) > 0:
             last_path = re.sub(".ckpt", ".tmp_end.ckpt", best_model_path)
-            atomic_save(self.on_save(state_dict), last_path)
+            atomic_save(state_dict, last_path)

         # todo, pass complete checkpoint as state dictionary
         self.mp_queue.put(best_model_path)

pytorch_lightning/plugins/training_type/training_type_plugin.py

Lines changed: 0 additions & 5 deletions
@@ -201,9 +201,6 @@ def validation_step_end(self, output):
     def test_step_end(self, output):
         return output

-    def on_save(self, checkpoint: Dict[str, Union[Any, torch.Tensor]]) -> Dict[str, Union[Any, torch.Tensor]]:
-        return checkpoint
-
     def process_dataloader(self, dataloader: Union[Iterable, DataLoader]) -> Union[Iterable, DataLoader]:
         """Wraps the dataloader if necessary

@@ -273,8 +270,6 @@ def save_checkpoint(self, checkpoint: Dict[str, Any], filepath: str) -> None:
             checkpoint: dict containing model and trainer state
             filepath: write-target file's path
         """
-        # dump states as a checkpoint dictionary object
-        checkpoint = self.on_save(checkpoint)
         if self.should_rank_save_checkpoint:
             return self.checkpoint_io.save_checkpoint(checkpoint, filepath)

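With `on_save` removed, `save_checkpoint` hands the checkpoint dictionary straight to the plugin's `checkpoint_io`. Logic that previously hooked into `on_save` to tweak the checkpoint could instead live in a custom `CheckpointIO`; a rough, hypothetical sketch (the exact import path and signature of `TorchCheckpointIO` may differ between versions):

from pytorch_lightning.plugins.io import TorchCheckpointIO


class MyCheckpointIO(TorchCheckpointIO):
    """Hypothetical plugin replicating logic that previously lived in `on_save`."""

    def save_checkpoint(self, checkpoint, path, *args, **kwargs):
        # Adjust the checkpoint dict just before it is written, then defer to
        # the default torch-based saving.
        checkpoint.setdefault("extra_state", {"note": "added by MyCheckpointIO"})
        return super().save_checkpoint(checkpoint, path, *args, **kwargs)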
pytorch_lightning/trainer/connectors/checkpoint_connector.py

Lines changed: 0 additions & 2 deletions
@@ -294,8 +294,6 @@ def hpc_save(self, folderpath: str, logger):

         model.on_hpc_save(checkpoint)

-        checkpoint = self.trainer.accelerator.on_save(checkpoint)
-
         # do the actual save
         # TODO: fix for anything with multiprocess DP, DDP, DDP2
         try:

pytorch_lightning/utilities/distributed.py

Lines changed: 1 addition & 1 deletion
@@ -353,7 +353,7 @@ def init_ddp_connection(
     torch_distributed_backend: str,
     global_rank: Optional[int] = None,
     world_size: Optional[int] = None,
-    **kwargs,
+    **kwargs: Any,
 ) -> None:
     """
     Utility function to initialize DDP connection by setting env variables

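The only functional point of this hunk is the annotation on `**kwargs`: under strict type-checking settings, an unannotated variadic parameter leaves the signature partially untyped. A minimal, generic illustration (the function name is made up):

from typing import Any


def configure(backend: str, **kwargs: Any) -> None:
    # `**kwargs: Any` types every extra keyword value as Any; without the
    # annotation, strict mypy treats the parameter as untyped.
    print(backend, kwargs)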
tests/plugins/test_ddp_plugin.py

Lines changed: 25 additions & 0 deletions
@@ -11,13 +11,16 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+import os
 from unittest import mock

+import pytest
 import torch
 from torch.nn.parallel import DistributedDataParallel

 from pytorch_lightning import Trainer
 from pytorch_lightning.plugins import DDPPlugin
+from pytorch_lightning.plugins.environments import LightningEnvironment
 from tests.helpers.boring_model import BoringModel
 from tests.helpers.runif import RunIf

@@ -69,3 +72,25 @@ def test_ddp_barrier_non_consecutive_device_ids(barrier_mock, tmpdir):
     trainer = Trainer(default_root_dir=tmpdir, max_steps=1, gpus=gpus, accelerator="ddp")
     trainer.fit(model)
     barrier_mock.assert_any_call(device_ids=[gpus[trainer.local_rank]])
+
+
+@mock.patch.dict(os.environ, {"LOCAL_RANK": "1"})
+def test_incorrect_ddp_script_spawning(tmpdir):
+    """Test the error message when the user accidentally instructs Lightning to spawn child processes on rank > 0."""
+
+    class WronglyImplementedEnvironment(LightningEnvironment):
+        def creates_children(self):
+            # Returning False unconditionally means Lightning would also try to spawn new processes on ranks > 0.
+            return False
+
+    model = BoringModel()
+    trainer = Trainer(
+        default_root_dir=tmpdir,
+        accelerator="ddp",
+        num_processes=2,
+        plugins=[DDPPlugin(), WronglyImplementedEnvironment()],
+    )
+    with pytest.raises(
+        RuntimeError, match="Lightning attempted to launch new distributed processes with `local_rank > 0`."
+    ):
+        trainer.fit(model)
