# Source Separation Example

## Usage

### Overview

To train a model, you can use [`train.py`](./train.py). The command line takes the form of
`[parameters for distributed training] -- [parameters for model/training]`.

If you would like to just try out the training script, run it without any parameters
for distributed training.

```
python train.py \
    [--worker-id WORKER_ID] \
    [--device-id DEVICE_ID] \
    [--num-workers NUM_WORKERS] \
    [--sync-protocol SYNC_PROTOCOL] \
    -- \
    <model specific training parameters>

# For the details of the distributed training parameters, use:
python train.py --help

# For the details of the model/training parameters, use:
python train.py -- --help
```

This script runs training with the Distributed Data Parallel (DDP) framework and has two major
operation modes, depending on whether the `--worker-id` argument is given.

1. (`--worker-id` is not given) Launches worker subprocesses that perform the actual training.
2. (`--worker-id` is given) Performs the training as part of a distributed training run.

When launching the script without any distributed training parameters (operation mode 1),
it checks the number of GPUs available on the local system and spawns the same
number of training subprocesses (each running in operation mode 2). You can reduce the number of
GPUs used with `--num-workers`. If no GPU is available, only one subprocess is launched, and
providing `--num-workers` larger than 1 results in an error.

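For example, the following is a sketch of a single-node run in operation mode 1. The model
arguments after `--` mirror the SLURM example further below; the dataset path, save directory,
and batch size are illustrative placeholders you would adjust for your setup.

```bash
# Operation mode 1: train.py detects the local GPUs and spawns the workers itself.
# Here it is restricted to 2 workers; the paths below are placeholders.
python train.py \
    --num-workers 2 \
    -- \
    --sample-rate 8000 \
    --batch-size 16 \
    --dataset-dir /path/to/wsj0-mix/2speakers/wav8k/min \
    --save-dir ./checkpoints
```
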
When launching the script as a worker process of a distributed training run, you need to configure
how the workers coordinate.

- `--num-workers` is the total number of training processes being launched.
- `--worker-id` is the process rank (must be unique across all the processes).
- `--device-id` is the GPU device ID (should be unique within a node).
- `--sync-protocol` is how the worker processes communicate and synchronize.
  If the training is carried out on a single node, the default `"env://"` should do.
  If the training processes span multiple nodes, a path to a file that all the
  training processes can access has to be provided with the `"file://..."` protocol.
  (See the sketch after this list for an example.)

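As an illustration only, a two-node run with two GPUs per node (four workers in total) could be
launched by hand roughly as follows; `/shared/sync` stands in for any path visible to all nodes,
and the model parameters are elided as in the usage summary above.

```bash
# Hypothetical manual launch in operation mode 2: two nodes, two GPUs each,
# so four workers in total. "/shared/sync" is a placeholder for a path that
# every node can read and write.
SYNC="file:///shared/sync"

# On node 0:
python train.py --worker-id 0 --device-id 0 --num-workers 4 --sync-protocol "${SYNC}" -- <model parameters> &
python train.py --worker-id 1 --device-id 1 --num-workers 4 --sync-protocol "${SYNC}" -- <model parameters> &

# On node 1:
python train.py --worker-id 2 --device-id 0 --num-workers 4 --sync-protocol "${SYNC}" -- <model parameters> &
python train.py --worker-id 3 --device-id 1 --num-workers 4 --sync-protocol "${SYNC}" -- <model parameters> &
```
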
### Distributed Training Notes

<details><summary>Quick overview of DDP (distributed data parallel)</summary>

DDP is a single-program, multiple-data training paradigm.
With DDP, the model is replicated on every process,
and every model replica is fed a different set of input data samples.

- Process: Worker process (as in a Linux process). There are P processes per node.
- Node: A machine. There are N machines, each of which holds P processes.
- World: The network of nodes, composed of N nodes and N * P processes.
- Rank: Global process ID (unique across nodes), in [0, N * P).
- Local Rank: Local process ID (unique only within a node), in [0, P).

```
          Node 0                     Node 1                           Node N-1
┌─────────────────────────┐┌─────────────────────────┐       ┌─────────────────────────┐
│ ╔═══════════╗ ┌───────┐ ││ ┌───────────┐ ┌───────┐ │       │ ┌───────────┐ ┌───────┐ │
│ ║  Process  ╟─┤ GPU:0 │ ││ │  Process  ├─┤ GPU:0 │ │       │ │  Process  ├─┤ GPU:0 │ │
│ ║  Rank: 0  ║ └───────┘ ││ │  Rank: P  │ └───────┘ │       │ │ Rank:NP-P │ └───────┘ │
│ ╚═══════════╝           ││ └───────────┘           │       │ └───────────┘           │
│ ┌───────────┐ ┌───────┐ ││ ┌───────────┐ ┌───────┐ │       │ ┌───────────┐ ┌───────┐ │
│ │  Process  ├─┤ GPU:1 │ ││ │  Process  ├─┤ GPU:1 │ │       │ │  Process  ├─┤ GPU:1 │ │
│ │  Rank: 1  │ └───────┘ ││ │ Rank: P+1 │ └───────┘ │       │ │Rank:NP-P+1│ └───────┘ │
│ └───────────┘           ││ └───────────┘           │  ...  │ └───────────┘           │
│                         ││                         │       │                         │
│           ...           ││           ...           │       │           ...           │
│                         ││                         │       │                         │
│ ┌───────────┐ ┌───────┐ ││ ┌───────────┐ ┌───────┐ │       │ ┌───────────┐ ┌───────┐ │
│ │  Process  ├─┤GPU:P-1│ ││ │  Process  ├─┤GPU:P-1│ │       │ │  Process  ├─┤GPU:P-1│ │
│ │ Rank: P-1 │ └───────┘ ││ │ Rank:2P-1 │ └───────┘ │       │ │ Rank:NP-1 │ └───────┘ │
│ └───────────┘           ││ └───────────┘           │       │ └───────────┘           │
└─────────────────────────┘└─────────────────────────┘       └─────────────────────────┘
```

</details>

### SLURM

When launched as a SLURM job, the following environment variables correspond to the training parameters:

- `SLURM_PROCID`: `--worker-id` (Rank)
- `SLURM_NTASKS` (or the legacy `SLURM_NPROCS`): the total number of processes (`--num-workers` == world size)
- `SLURM_LOCALID`: Local Rank (to be mapped to a GPU index\*)

\* Even when GPU resources are allocated with `--gpus-per-task=1`, if multiple
tasks are allocated on the same node (and thus multiple GPUs of the node are allocated to the job),
each task can see all the GPUs allocated for those tasks. Therefore we need to use
`SLURM_LOCALID` to tell each task which GPU it should use.

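As a quick sanity check (purely illustrative), you can print the mapping each task would use;
the resource flags here are placeholders for whatever your job actually requests:

```bash
# Print rank / local rank / world size for each task in a 2-task, 1-GPU-per-task step.
srun --ntasks-per-node=2 --gpus-per-task=1 \
    bash -c 'echo "rank=${SLURM_PROCID} local_rank=${SLURM_LOCALID} world_size=${SLURM_NTASKS}"'
```
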
<details><summary>Example scripts for running the training on a SLURM cluster</summary>

- **launch_job.sh**

```bash
#!/bin/bash

#SBATCH --job-name=source_separation
#SBATCH --output=/checkpoint/%u/jobs/%x/%j.out
#SBATCH --error=/checkpoint/%u/jobs/%x/%j.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=16G
#SBATCH --gpus-per-task=1

# Uncomment the next line to dump the environment of each task:
#srun env
srun wrapper.sh "$@"
```

- **wrapper.sh**

```bash
#!/bin/bash
this_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
save_dir="/checkpoint/${USER}/jobs/${SLURM_JOB_NAME}/${SLURM_JOB_ID}"
dataset_dir="/dataset/wsj0-mix/2speakers/wav8k/min"

# Use a shared file for worker synchronization when the job spans multiple nodes;
# on a single node the default "env://" protocol is sufficient.
if [ "${SLURM_JOB_NUM_NODES}" -gt 1 ]; then
    protocol="file:///checkpoint/${USER}/jobs/source_separation/${SLURM_JOB_ID}/sync"
else
    protocol="env://"
fi

mkdir -p "${save_dir}"

# The global batch size of 32 is split evenly across the worker processes.
python -u \
    "${this_dir}/train.py" \
    --worker-id "${SLURM_PROCID}" \
    --num-workers "${SLURM_NTASKS}" \
    --device-id "${SLURM_LOCALID}" \
    --sync-protocol "${protocol}" \
    -- \
    --sample-rate 8000 \
    --batch-size $((32 / SLURM_NTASKS)) \
    --dataset-dir "${dataset_dir}" \
    --save-dir "${save_dir}"
```

</details>
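
Assuming both scripts sit in the same directory as `train.py` and `wrapper.sh` is executable,
the job would be submitted with something like:

```bash
chmod +x wrapper.sh
sbatch launch_job.sh
```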