### Bug description
When EarlyStopping is used in a distributed context, early stopping conditions may be met in some processes before others.
Per the intended behavior described in the EarlyStopping callback:
https://github.com/Lightning-AI/lightning/blob/be1eb5e86d07fe22b53a59184089aac569875117/src/pytorch_lightning/callbacks/early_stopping.py#L204-L205
all training processes should be stopped when an EarlyStopping threshold is reached in any process. However, reduce_boolean_decision currently returns True only when the decisions of all processes are True:
https://github.com/Lightning-AI/lightning/blob/be1eb5e86d07fe22b53a59184089aac569875117/src/lightning_lite/strategies/parallel.py#L88-L92
This issue can be avoided by logging the monitored metric with sync_dist=True, but since that configuration is not mandatory, reduce_boolean_decision should be adapted to behave as the EarlyStopping callback expects.
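To illustrate why syncing the monitored metric sidesteps the problem, here is a minimal pure-Python sketch that does not involve Lightning or torch.distributed. The per-rank losses, the threshold, and the helper name `should_stop` are hypothetical, and the mean reduction is an assumption about how the metric is synced across processes.

```python
from statistics import mean

# Hypothetical per-rank values of the monitored metric after one validation epoch.
rank_losses = [0.48, 0.53]
STOPPING_THRESHOLD = 0.5  # hypothetical: stop once the loss drops below this value


def should_stop(loss: float) -> bool:
    """Local early-stopping decision for a single process."""
    return loss < STOPPING_THRESHOLD


# Without syncing, each rank decides from its own value, so decisions can differ.
unsynced_decisions = [should_stop(loss) for loss in rank_losses]

# With syncing (assumed mean reduction), every rank sees the same value
# and therefore reaches the same decision.
synced_loss = mean(rank_losses)
synced_decisions = [should_stop(synced_loss) for _ in rank_losses]

print(unsynced_decisions)
print(synced_decisions)
```

In the unsynced case only the first rank wants to stop, which is exactly the situation where the way the boolean decisions are reduced matters.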
I will shortly submit a PR that keeps the current reduce_boolean_decision behavior as the default but extends the function to support any-like semantics, as the EarlyStopping callback expects. The PR will also include a test validating that the new behavior resolves the issue described above.
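The difference between the two semantics can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from the PR: the function name `reduce_decisions_sketch` and the `require_all` flag are hypothetical, and the list of booleans stands in for the per-process decisions that a real strategy would combine with a collective sum.

```python
from typing import Sequence


def reduce_decisions_sketch(decisions: Sequence[bool], require_all: bool = True) -> bool:
    """Combine per-process boolean decisions into one global decision.

    Hypothetical helper: `decisions` simulates one flag per process, and the
    sum below mimics what a SUM all-reduce over 0/1 flags would produce.
    """
    world_size = len(decisions)
    total = sum(int(d) for d in decisions)
    if require_all:
        # "all" semantics: True only if every process voted True.
        return total == world_size
    # "any" semantics: True as soon as at least one process voted True.
    return total > 0


# One of two processes has hit its early-stopping threshold.
decisions = [True, False]
stop_with_all = reduce_decisions_sketch(decisions)                     # default behavior
stop_with_any = reduce_decisions_sketch(decisions, require_all=False)  # what EarlyStopping expects
print(stop_with_all, stop_with_any)
```

Keeping the "all" behavior as the default in such a design leaves existing callers unaffected, while a caller like EarlyStopping can opt into the "any" behavior explicitly.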
### How to reproduce the bug
The easiest way to reproduce the bug is to check out the forthcoming PR and run the new test against the original ``EarlyStopping`` callback usage of ``reduce_boolean_decision``:
pytest -v tests/tests_pytorch/callbacks/test_early_stopping.py::test_multiple_early_stopping_callbacks[callbacks2-2-False-ddp_spawn-2-2]
### Error messages and logs
_No response_
### Environment
- CUDA:
- GPU:
- NVIDIA GeForce RTX 2070 SUPER
- NVIDIA GeForce RTX 2070
- available: True
- version: 11.7
- GPU:
- Lightning:
- lightning-utilities: 0.3.0
- pt-lightning-sphinx-theme: 0.0.31
- pytorch-lightning: 1.8.0rc0
- torch: 1.13.0
- torchmetrics: 0.10.0
- torchtext: 0.14.0
- torchvision: 0.14.0
- Packages:
- absl-py: 1.3.0
- aiohttp: 3.8.3
- aiosignal: 1.2.0
- alabaster: 0.7.12
- alembic: 1.8.1
- antlr4-python3-runtime: 4.9.3
- anyio: 3.6.2
- argon2-cffi: 21.3.0
- argon2-cffi-bindings: 21.2.0
- asttokens: 2.0.8
- async-generator: 1.10
- async-timeout: 4.0.2
- attrs: 22.1.0
- babel: 2.10.3
- backcall: 0.2.0
- beautifulsoup4: 4.11.1
- black: 22.10.0
- bleach: 5.0.1
- boto3: 1.24.95
- botocore: 1.27.95
- bracex: 2.3.post1
- bravado: 11.0.3
- bravado-core: 5.17.1
- brotlipy: 0.7.0
- cachetools: 5.2.0
- certifi: 2022.9.24
- cffi: 1.15.1
- cfgv: 3.3.1
- charset-normalizer: 2.0.4
- click: 8.1.3
- cloudpickle: 2.2.0
- codecov: 2.1.12
- coloredlogs: 15.0.1
- comet-ml: 3.31.15
- commonmark: 0.9.1
- configobj: 5.0.6
- contourpy: 1.0.5
- coverage: 6.5.0
- cryptography: 37.0.1
- curio: 1.5
- cycler: 0.11.0
- databricks-cli: 0.17.3
- debugpy: 1.6.3
- decorator: 5.1.1
- deepspeed: 0.7.3
- defusedxml: 0.7.1
- distlib: 0.3.6
- docker: 6.0.0
- docker-pycreds: 0.4.0
- docstring-parser: 0.15
- docutils: 0.17.1
- dulwich: 0.20.46
- entrypoints: 0.4
- everett: 3.0.0
- exceptiongroup: 1.0.0rc9
- executing: 1.1.1
- fairscale: 0.4.12
- fastapi: 0.85.1
- fastjsonschema: 2.16.2
- filelock: 3.8.0
- fire: 0.4.0
- flask: 2.2.2
- flatbuffers: 22.9.24
- fonttools: 4.37.4
- frozenlist: 1.3.1
- fsspec: 2022.10.0
- future: 0.18.2
- gitdb: 4.0.9
- gitpython: 3.1.29
- google-auth: 2.13.0
- google-auth-oauthlib: 0.4.6
- greenlet: 1.1.3.post0
- grpcio: 1.50.0
- gunicorn: 20.1.0
- gym: 0.26.2
- gym-notices: 0.0.8
- h11: 0.14.0
- hjson: 3.1.0
- humanfriendly: 10.0
- hydra-core: 1.2.0
- identify: 2.5.6
- idna: 3.4
- imagesize: 1.4.1
- importlib-metadata: 5.0.0
- iniconfig: 1.1.1
- ipykernel: 6.16.1
- ipyparallel: 8.4.1
- ipython: 8.5.0
- ipython-genutils: 0.2.0
- ipywidgets: 8.0.2
- itsdangerous: 2.1.2
- jedi: 0.18.1
- jinja2: 3.0.3
- jmespath: 1.0.1
- joblib: 1.2.0
- jsonargparse: 4.15.2
- jsonpointer: 2.3
- jsonref: 0.3.0
- jsonschema: 3.2.0
- jupyter-client: 7.4.3
- jupyter-core: 4.11.2
- jupyter-server: 1.21.0
- jupyterlab-pygments: 0.2.2
- jupyterlab-widgets: 3.0.3
- kiwisolver: 1.4.4
- lightning-utilities: 0.3.0
- mako: 1.2.3
- markdown: 3.4.1
- markdown-it-py: 2.1.0
- markupsafe: 2.1.1
- matplotlib: 3.6.1
- matplotlib-inline: 0.1.6
- mdit-py-plugins: 0.3.1
- mdurl: 0.1.2
- mistune: 2.0.4
- mkl-fft: 1.3.1
- mkl-random: 1.2.2
- mkl-service: 2.4.0
- mlflow: 1.30.0
- monotonic: 1.6
- mpmath: 1.2.1
- msgpack: 1.0.4
- multidict: 6.0.2
- mypy: 0.971
- mypy-extensions: 0.4.3
- myst-parser: 0.16.1
- nbclassic: 0.4.5
- nbclient: 0.7.0
- nbconvert: 7.2.2
- nbformat: 5.7.0
- nbsphinx: 0.8.9
- neptune-client: 0.16.9
- nest-asyncio: 1.5.6
- ninja: 1.10.2.4
- nodeenv: 1.7.0
- notebook: 6.5.1
- notebook-shim: 0.2.0
- numpy: 1.23.3
- oauthlib: 3.2.2
- omegaconf: 2.2.3
- onnxruntime: 1.12.1
- outcome: 1.2.0
- packaging: 21.3
- pandas: 1.5.1
- pandoc: 2.2
- pandocfilters: 1.5.0
- parso: 0.8.3
- pathspec: 0.10.1
- pathtools: 0.1.2
- pexpect: 4.8.0
- pickleshare: 0.7.5
- pillow: 9.2.0
- pip: 22.2.2
- platformdirs: 2.5.2
- pluggy: 1.0.0
- plumbum: 1.8.0
- ply: 3.11
- pre-commit: 2.20.0
- prometheus-client: 0.15.0
- prometheus-flask-exporter: 0.20.3
- promise: 2.3
- prompt-toolkit: 3.0.31
- protobuf: 3.19.6
- psutil: 5.9.3
- pt-lightning-sphinx-theme: 0.0.31
- ptyprocess: 0.7.0
- pure-eval: 0.2.2
- py: 1.11.0
- py-cpuinfo: 8.0.0
- pyasn1: 0.4.8
- pyasn1-modules: 0.2.8
- pycparser: 2.21
- pydantic: 1.10.2
- pygame: 2.1.0
- pygments: 2.13.0
- pyjwt: 2.6.0
- pyopenssl: 22.0.0
- pyparsing: 3.0.9
- pyrsistent: 0.18.1
- pysocks: 1.7.1
- pytest: 7.0.1
- pytest-asyncio: 0.20.1
- pytest-cov: 4.0.0
- pytest-forked: 1.4.0
- pytest-rerunfailures: 10.2
- python-dateutil: 2.8.2
- pytorch-lightning: 1.8.0rc0
- pytz: 2022.5
- pyyaml: 6.0
- pyzmq: 24.0.1
- qtconsole: 5.3.2
- qtpy: 2.2.1
- querystring-parser: 1.2.4
- requests: 2.28.1
- requests-oauthlib: 1.3.1
- requests-toolbelt: 0.10.0
- rfc3987: 1.3.8
- rich: 12.6.0
- rsa: 4.9
- s3transfer: 0.6.0
- scikit-learn: 1.1.2
- scipy: 1.9.3
- semantic-version: 2.10.0
- send2trash: 1.8.0
- sentry-sdk: 1.10.1
- setproctitle: 1.3.2
- setuptools: 63.4.1
- shortuuid: 1.0.9
- simplejson: 3.17.6
- six: 1.16.0
- smmap: 5.0.0
- sniffio: 1.3.0
- snowballstemmer: 2.2.0
- sortedcontainers: 2.4.0
- soupsieve: 2.3.2.post1
- sphinx: 4.5.0
- sphinx-autodoc-typehints: 1.19.1
- sphinx-copybutton: 0.5.0
- sphinx-multiproject: 1.0.0rc1
- sphinx-paramlinks: 0.5.4
- sphinx-togglebutton: 0.3.2
- sphinxcontrib-applehelp: 1.0.2
- sphinxcontrib-devhelp: 1.0.2
- sphinxcontrib-fulltoc: 1.2.0
- sphinxcontrib-htmlhelp: 2.0.0
- sphinxcontrib-jsmath: 1.0.1
- sphinxcontrib-mockautodoc: 0.0.1.dev20130518
- sphinxcontrib-qthelp: 1.0.3
- sphinxcontrib-serializinghtml: 1.1.5
- sqlalchemy: 1.4.42
- sqlparse: 0.4.3
- stack-data: 0.5.1
- starlette: 0.20.4
- strict-rfc3339: 0.7
- swagger-spec-validator: 3.0.2
- sympy: 1.11.1
- tabulate: 0.9.0
- tensorboard: 2.10.1
- tensorboard-data-server: 0.6.1
- tensorboard-plugin-wit: 1.8.1
- termcolor: 2.0.1
- terminado: 0.16.0
- testpath: 0.6.0
- threadpoolctl: 3.1.0
- tinycss2: 1.2.1
- toml: 0.10.2
- tomli: 2.0.1
- torch: 1.13.0
- torchmetrics: 0.10.0
- torchtext: 0.14.0
- torchvision: 0.14.0
- tornado: 6.2
- tqdm: 4.64.1
- traitlets: 5.5.0
- trio: 0.22.0
- types-croniter: 1.3.2
- types-cryptography: 3.3.23.1
- types-protobuf: 3.20.4.1
- types-pyopenssl: 22.1.0.1
- types-python-dateutil: 2.8.19.2
- types-pyyaml: 6.0.12
- types-redis: 4.3.21.2
- types-requests: 2.28.11.2
- types-setuptools: 65.5.0.1
- types-six: 1.16.21
- types-tabulate: 0.9.0.0
- types-ujson: 5.5.0
- types-urllib3: 1.26.25.1
- typing-extensions: 4.3.0
- urllib3: 1.26.12
- uvicorn: 0.19.0
- virtualenv: 20.16.5
- wandb: 0.13.4
- wcmatch: 8.4.1
- wcwidth: 0.2.5
- webcolors: 1.12
- webencodings: 0.5.1
- websocket-client: 1.3.3
- werkzeug: 2.2.2
- wheel: 0.37.1
- widgetsnbextension: 4.0.3
- wrapt: 1.14.1
- wurlitzer: 3.0.2
- yarl: 1.8.1
- zipp: 3.9.0
- System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.10.6
- version: #144-Ubuntu SMP Tue Sep 20 11:00:04 UTC 2022
### More info
_No response_