Add LSF support #5102
Merged
41 commits
9abe28e  add ClusterEnvironment for LSF systems (ajtritt)
f2e44c1  update init file (ajtritt)
615f08d  add available cluster environments (ajtritt)
86f2fa1  clean up LSFEnvironment (ajtritt)
b72b42d  add ddp_hpc as a distributed backend (ajtritt)
6a9a4ca  clean up SLURMEnvironment (ajtritt)
5bbba77  Merge branch 'master' into lsf_env (ajtritt)
94e4d4b  remove extra blank line (ajtritt)
113e787  init device for DDPHPCAccelerator (ajtritt)
d12d652  committing current state (ajtritt)
d0ac793  Merge branch 'master' into lsf_env (ajtritt)
b53d153  add additional methods to ClusterEnvironments (ajtritt)
0b6edfe  add NVIDIA mixin for setting up CUDA envars (ajtritt)
f7d87f6  remove troubleshooting prints (ajtritt)
3c9edf9  cleanup SLURMEnvironment (ajtritt)
77f3b71  fix docstring (ajtritt)
eb7d07c  cleanup TorchElasticEnvironment and add documentation (ajtritt)
09064e1  PEP8 puts a cork in it (ajtritt)
fb30942  Merge branch 'master' into lsf_env (ajtritt)
7be8f1d  add set_ranks_to_trainer (ajtritt)
5c04b8e  Merge remote-tracking branch 'pl/master' into lsf_env (ajtritt)
004daef  Merge remote-tracking branch 'pl/master' into lsf_env (ajtritt)
a113210  remove unused import (ajtritt)
d17281c  move to new location (ajtritt)
b4028a7  Merge branch 'master' into lsf_env (awaelchli)
7a23376  update LSF environment (awaelchli)
02410ff  remove mixin (awaelchli)
7f91740  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
1b3bc7a  changelog (awaelchli)
5ec0e9f  Merge remote-tracking branch 'ajtritt/lsf_env' into lsf_env (awaelchli)
92215ab  reset slurm env (awaelchli)
a613759  add tests (awaelchli)
f7c5e0e  add licence (awaelchli)
00de88e  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
cfd59b8  test node_rank (awaelchli)
5ec99e9  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
cfe544f  add lsf env to docs (awaelchli)
71569de  add auto detection for lsf environment (awaelchli)
7c26b41  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
077964d  fix is_using_lsf() and test (awaelchli)
7f127c8  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
160 changes: 160 additions & 0 deletions
pytorch_lightning/plugins/environments/lsf_environment.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,160 @@ | ||
| # Copyright The PyTorch Lightning team. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| import os | ||
| import socket | ||
|
|
||
| from pytorch_lightning import _logger as log | ||
| from pytorch_lightning.plugins.environments import ClusterEnvironment | ||
|
|
||
|
|
||
| class LSFEnvironment(ClusterEnvironment): | ||
| """ | ||
| An environment for running on clusters managed by the LSF resource manager. | ||
|
|
||
| Processes using this ClusterEnvironment are expected to have been launched | ||
| with the Job Step Manager, i.e. ``jsrun``. | ||
|
|
||
| This plugin expects the following environment variables. | ||
|
|
||
| LSB_JOBID: | ||
| The LSF assigned job ID | ||
|
|
||
| LSB_HOSTS: | ||
| The hosts used in the job. This string is expected to have the format "batch <rank_0_host> ...." | ||
|
|
||
| JSM_NAMESPACE_LOCAL_RANK: | ||
| The node local rank for the task. This environment variable is set by jsrun | ||
|
|
||
| JSM_NAMESPACE_SIZE: | ||
| The world size for the task. This environment variable is set by jsrun | ||
| """ | ||
|
|
||
| def __init__(self): | ||
| self._master_address = self._get_master_address() | ||
| self._master_port = self._get_master_port() | ||
| log.debug(f"MASTER_ADDR: {self._master_address}") | ||
| log.debug(f"MASTER_PORT: {self._master_port}") | ||
|
|
||
| @staticmethod | ||
| def is_using_lsf() -> bool: | ||
| """ Returns ``True`` if the current process was launched using the jsrun command. """ | ||
| required_env_vars = ( | ||
| "LSB_JOBID", | ||
| "LSB_HOSTS", | ||
| "JSM_NAMESPACE_LOCAL_RANK", | ||
| "JSM_NAMESPACE_SIZE", | ||
| ) | ||
| return all(v in os.environ for v in required_env_vars) | ||
|
|
||
| def creates_children(self) -> bool: | ||
| return True | ||
|
|
||
| def master_address(self): | ||
| """ The master address is read from a list of hosts contained in the environment variable `LSB_HOSTS`. """ | ||
| return self._master_address | ||
|
|
||
| def master_port(self): | ||
| """ The master port is calculated from the LSF job ID. """ | ||
| return self._master_port | ||
|
|
||
| def world_size(self): | ||
| """ The world size is read from the environment variable `JSM_NAMESPACE_SIZE`. """ | ||
| var = "JSM_NAMESPACE_SIZE" | ||
|
||
| world_size = os.environ.get(var) | ||
| if world_size is None: | ||
| raise ValueError( | ||
| f"Cannot determine world size from environment variable {var}." | ||
| " Make sure you run your executable with `jsrun`" | ||
| ) | ||
| return int(world_size) | ||
|
|
||
| def set_world_size(self, size: int) -> None: | ||
| log.debug("LSFEnvironment.set_world_size was called, but setting world size is not allowed. Ignored.") | ||
|
|
||
| def global_rank(self): | ||
| """ The global rank is read from the environment variable `JSM_NAMESPACE_RANK`. """ | ||
| var = "JSM_NAMESPACE_RANK" | ||
| global_rank = os.environ.get(var) | ||
| if global_rank is None: | ||
| raise ValueError( | ||
| f"Cannot determine global rank from environment variable {var}." | ||
| " Make sure you run your executable with `jsrun`" | ||
| ) | ||
| return int(global_rank) | ||
|
|
||
| def set_global_rank(self, rank: int) -> None: | ||
| log.debug("LSFEnvironment.set_global_rank was called, but setting global rank is not allowed. Ignored.") | ||
|
|
||
| def local_rank(self): | ||
| """ The local rank is read from the environment variable `JSM_NAMESPACE_LOCAL_RANK`. """ | ||
| var = "JSM_NAMESPACE_LOCAL_RANK" | ||
| local_rank = os.environ.get(var) | ||
| if local_rank is None: | ||
| raise ValueError( | ||
| f"Cannot determine local rank from environment variable {var}." | ||
| " Make sure you run your executable with `jsrun`" | ||
| ) | ||
| return int(local_rank) | ||
|
|
||
| def node_rank(self): | ||
| """ | ||
| The node rank is determined by the position of the current hostname in the list of hosts stored in | ||
| the environment variable `LSB_HOSTS`. | ||
| """ | ||
| hosts = self._read_hosts() | ||
| count = dict() | ||
| for host in hosts: | ||
| if "batch" in host or "login" in host: | ||
| continue | ||
| if host not in count: | ||
| count[host] = len(count) | ||
| return count[socket.gethostname()] | ||
|
|
||
| @staticmethod | ||
| def _read_hosts(): | ||
| hosts = os.environ.get("LSB_HOSTS") | ||
| if not hosts: | ||
| raise ValueError("Could not find hosts in environment variable LSB_HOSTS") | ||
| hosts = hosts.split() | ||
| if len(hosts) < 2: | ||
| raise ValueError( | ||
| "Cannot parse hosts from LSB_HOSTS environment variable." | ||
| " Expected format: \"batch <rank_0_host> ...\"" | ||
| ) | ||
| return hosts | ||
|
|
||
| def _get_master_address(self): | ||
| hosts = self._read_hosts() | ||
| return hosts[1] | ||
|
|
||
| @staticmethod | ||
| def _get_master_port(): | ||
| """ | ||
| A helper function for accessing the master port. | ||
| Uses the LSF job ID so all ranks can compute the master port. | ||
| """ | ||
| # check for user-specified master port | ||
| port = os.environ.get("MASTER_PORT") | ||
| if not port: | ||
| jobid = os.environ.get("LSB_JOBID") | ||
| if not jobid: | ||
| raise ValueError("Could not find job id in environment variable LSB_JOBID") | ||
| port = int(jobid) | ||
| # all ports should be in the 10k+ range | ||
| port = int(port) % 1000 + 10000 | ||
| log.debug(f"calculated LSF master port: {port}") | ||
| else: | ||
| log.debug(f"using externally specified master port: {port}") | ||
| return int(port) | ||
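
To check the port logic without constructing an `LSFEnvironment`, here is a minimal standalone sketch of what `_get_master_port` in the diff above appears to do. The function name `derive_master_port` and its `env` parameter are hypothetical and not part of this PR; only the arithmetic and the error message are taken from the diff.

```python
import os
from typing import Mapping, Optional


def derive_master_port(env: Optional[Mapping[str, str]] = None) -> int:
    """Hypothetical helper mirroring the port derivation shown in the diff."""
    env = os.environ if env is None else env

    # A user-specified MASTER_PORT takes precedence and is used unchanged.
    port = env.get("MASTER_PORT")
    if port:
        return int(port)

    # Otherwise every rank derives the same port from the shared LSF job ID.
    jobid = env.get("LSB_JOBID")
    if not jobid:
        raise ValueError("Could not find job id in environment variable LSB_JOBID")

    # Keep the last three digits of the job ID and shift into the 10k range.
    return int(jobid) % 1000 + 10000


print(derive_master_port({"LSB_JOBID": "1234"}))                         # 10234
print(derive_master_port({"LSB_JOBID": "1234", "MASTER_PORT": "4321"}))  # 4321
```

Because the job ID is reduced modulo 1000, derived ports always fall in the range 10000 to 10999. Two jobs whose IDs differ by a multiple of 1000 derive the same port, which could matter if they share a master host; setting `MASTER_PORT` explicitly sidesteps that.
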
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,89 @@ | ||
| # Copyright The PyTorch Lightning team. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| import os | ||
| from unittest import mock | ||
|
|
||
| import pytest | ||
|
|
||
| from pytorch_lightning.plugins.environments import LSFEnvironment | ||
|
|
||
|
|
||
| @mock.patch.dict(os.environ, { | ||
| "LSB_HOSTS": "batch 10.10.10.0 10.10.10.1", | ||
| "LSB_JOBID": "1234", | ||
| }) | ||
| def test_missing_lsb_hosts(): | ||
| """ Test an error when the lsb hosts list cannot be found. """ | ||
| del os.environ["LSB_HOSTS"] | ||
| with pytest.raises(ValueError, match="Could not find hosts in environment variable LSB_HOSTS"): | ||
| LSFEnvironment() | ||
|
|
||
|
|
||
| @mock.patch.dict(os.environ, { | ||
| "LSB_HOSTS": "batch 10.10.10.0 10.10.10.1", | ||
| "LSB_JOBID": "1234", | ||
| }) | ||
| def test_missing_lsb_job_id(): | ||
| """ Test an error when the job id cannot be found. """ | ||
| del os.environ["LSB_JOBID"] | ||
| with pytest.raises(ValueError, match="Could not find job id in environment variable LSB_JOBID"): | ||
| LSFEnvironment() | ||
|
|
||
|
|
||
| @mock.patch.dict( | ||
| os.environ, { | ||
| "MASTER_PORT": "4321", | ||
| "LSB_JOBID": "1234", | ||
| "LSB_HOSTS": "batch 10.10.10.0 10.10.10.1", | ||
| } | ||
| ) | ||
| def test_manual_master_port_and_address(): | ||
| """ Test a user can set the port manually through the MASTER_PORT env variable. """ | ||
| env = LSFEnvironment() | ||
| assert env.master_port() == 4321 | ||
|
|
||
|
|
||
| @mock.patch.dict( | ||
| os.environ, { | ||
| "LSB_HOSTS": "batch 10.10.10.0 10.10.10.1 10.10.10.2 10.10.10.3", | ||
| "LSB_JOBID": "1234", | ||
| "JSM_NAMESPACE_SIZE": "4", | ||
| "JSM_NAMESPACE_RANK": "3", | ||
| "JSM_NAMESPACE_LOCAL_RANK": "1" | ||
| } | ||
| ) | ||
| def test_attributes_from_environment_variables(): | ||
| """ Test that the LSF environment takes the attributes from the environment variables. """ | ||
| env = LSFEnvironment() | ||
| assert env.creates_children() | ||
| assert env.master_address() == "10.10.10.0" | ||
| assert env.master_port() == 10234 | ||
| assert env.world_size() == 4 | ||
| assert env.global_rank() == 3 | ||
| assert env.local_rank() == 1 | ||
| env.set_global_rank(100) | ||
| assert env.global_rank() == 3 | ||
| env.set_world_size(100) | ||
| assert env.world_size() == 4 | ||
| assert LSFEnvironment.is_using_lsf() | ||
|
|
||
|
|
||
| @mock.patch("socket.gethostname", return_value="host2") | ||
| @mock.patch.dict(os.environ, { | ||
| "LSB_HOSTS": "batch host0 host1 host2 host3", | ||
| "LSB_JOBID": "1234", | ||
| }) | ||
| def test_node_rank(_): | ||
| env = LSFEnvironment() | ||
| assert env.node_rank() == 2 |
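
The expected value in `test_node_rank` follows from how `node_rank` numbers hosts: entries containing "batch" or "login" are skipped, and each remaining host gets the index at which it is first seen. The sketch below isolates that numbering. The function name `node_ranks` is hypothetical, and the second example assumes that a host can appear more than once in `LSB_HOSTS`, which the de-duplication in the diff suggests but this PR does not state.

```python
from typing import Dict


def node_ranks(lsb_hosts: str) -> Dict[str, int]:
    """Hypothetical helper: map each compute host to its node rank."""
    ranks: Dict[str, int] = {}
    for host in lsb_hosts.split():
        # Skip the launch/batch and login entries, as the diff does.
        if "batch" in host or "login" in host:
            continue
        # A host keeps the rank assigned at its first occurrence.
        if host not in ranks:
            ranks[host] = len(ranks)
    return ranks


# Same host list as test_node_rank: host2 is the third compute host.
print(node_ranks("batch host0 host1 host2 host3")["host2"])  # 2

# Assumed case with repeated host names: duplicates do not advance the rank.
print(node_ranks("batch host0 host0 host1 host1"))  # {'host0': 0, 'host1': 1}
```

Note that the lookup in the diff uses `socket.gethostname()` as the key, so it relies on the local hostname matching an entry in `LSB_HOSTS` exactly.
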
13 changes: 13 additions & 0 deletions
tests/plugins/environments/test_torchelastic_environment.py