add multinode support via slurm trainer, large scale race condition fix #63
Merged
4 commits
6ee8941 add multinode support via slurm trainer (lessw2020)
d345408 updated license agreement (lessw2020)
5dda674 Merge branch 'pytorch-labs:main' into expand_multi_node (lessw2020)
3e59daf add info comments in readme and slurm file for usage tips. (lessw2020)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,56 @@ | ||
| #!/bin/bash | ||
| | ||
| # --- This script is optimized for AWS with EFA | ||
| # --- adjust NCCL_BUFFSIZE if you encounter memory | ||
| # --- constraint issues or to tune for improved performance. | ||
| # --- | ||
| | ||
| #SBATCH --job-name=torchtrain_multi_node | ||
| | ||
| #SBATCH --ntasks=2 | ||
| | ||
| #SBATCH --nodes=2 | ||
| | ||
| #SBATCH --gpus-per-task=8 | ||
| | ||
| #SBATCH --cpus-per-task=96 | ||
| | ||
| #SBATCH --partition=train | ||
| | ||
| | ||
| nodes=( $( scontrol show hostnames "$SLURM_JOB_NODELIST" ) ) | ||
| nodes_array=( "${nodes[@]}" ) | ||
| head_node=${nodes_array[0]} | ||
| head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address) | ||
| | ||
| echo Node IP: $head_node_ip | ||
| export LOGLEVEL=INFO | ||
| # Enable for A100 | ||
| export FI_PROVIDER="efa" | ||
| # Ensure that P2P is available | ||
| # export NCCL_P2P_DISABLE=1 | ||
| export NCCL_IB_DISABLE=1 | ||
| | ||
| # debugging flags (optional) | ||
| export NCCL_DEBUG=WARN | ||
| export PYTHONFAULTHANDLER=1 | ||
| # optional, more verbose debug settings | ||
| # export NCCL_DEBUG=INFO | ||
| # export NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV | ||
| | ||
| export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH | ||
| export LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH | ||
| export CUDA_LAUNCH_BLOCKING=0 | ||
| | ||
| # on your cluster you might need these: | ||
| # set the network interface | ||
| export NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond" | ||
| export NCCL_BUFFSIZE=2097152 | ||
| #export TORCH_DIST_INIT_BARRIER=1 | ||
| export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 | ||
| | ||
| dcgmi profile --pause | ||
| # adjust sbatch --ntasks and sbatch --nodes above and --nnodes below | ||
| # to your specific node count, and update target launch file. | ||
| srun torchrun --nnodes 2 --nproc_per_node 8 --rdzv_id 101 --rdzv_backend c10d --rdzv_endpoint "$head_node_ip:29500" ./train.py --steps 10 | ||
| dcgmi profile --resume | ||
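
The script's comments ask the user to keep `--ntasks`, `--nodes`, and `--nnodes` in sync by hand. As a rough sketch only (not part of this PR), the node count could instead be read from `SLURM_JOB_NUM_NODES`, which Slurm sets inside an allocation. The helper names `first_host` and `build_launch_cmd` are hypothetical, and the GPU count of 8 and port 29500 are simply carried over from the script above.

```shell
#!/bin/bash
# Sketch: build the torchrun command from Slurm's environment rather than
# hard-coding the node count. first_host and build_launch_cmd are
# hypothetical helpers, not functions from the PR.

# Print the first hostname from a newline-separated list on stdin.
first_host() {
    head -n 1
}

# Assemble the torchrun command line.
# Arguments: <nnodes> <gpus_per_node> <head_ip> [port, default 29500]
build_launch_cmd() {
    local nnodes="$1" gpus="$2" head_ip="$3" port="${4:-29500}"
    printf 'torchrun --nnodes %s --nproc_per_node %s --rdzv_id 101 --rdzv_backend c10d --rdzv_endpoint %s:%s ./train.py --steps 10\n' \
        "$nnodes" "$gpus" "$head_ip" "$port"
}

# Only touch Slurm when running inside an allocation with scontrol available.
if [ -n "${SLURM_JOB_NODELIST:-}" ] && command -v scontrol >/dev/null 2>&1; then
    head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | first_host)
    head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
    launch_cmd=$(build_launch_cmd "$SLURM_JOB_NUM_NODES" 8 "$head_node_ip")
    echo "Would run: srun $launch_cmd"
fi
```

Outside a Slurm allocation the final block is skipped, so the helpers can be tried on their own, for example with `build_launch_cmd 4 8 10.0.0.1`. The `#SBATCH --nodes` and `--ntasks` directives would still need to be set to the desired node count at submission time.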