
Commit 25403d0

Add training script
1 parent 52a18a9 commit 25403d0

6 files changed: +657 -0

Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
# Source Separation Example

## Usage

### Overview

To train a model, you can use [`train.py`](./train.py). This script takes arguments in the form of

`[parameters for distributed training] -- [parameters for model/training]`

If you would like to just try out the training script, run it without any parameters
for distributed training.

```
python train.py \
    [--worker-id WORKER_ID] \
    [--device-id DEVICE_ID] \
    [--num-workers NUM_WORKERS] \
    [--sync-protocol SYNC_PROTOCOL] \
    -- \
    <model specific training parameters>

# For the details of the distributed training parameters, use:
python train.py --help

# For the details of the model/training parameters, use:
python train.py -- --help
```
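
For example, the following is a minimal single-node invocation without any distributed training parameters. The dataset and save paths are placeholders; replace them with your own.

```
python train.py -- \
    --sample-rate 8000 \
    --dataset-dir /path/to/wsj0-mix/2speakers/wav8k/min \
    --save-dir /path/to/checkpoints
```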

This script runs training with the Distributed Data Parallel (DDP) framework and has two major
operation modes. The behavior depends on whether the `--worker-id` argument is given.

1. (`--worker-id` is not given) Launches worker subprocesses that perform the actual training.
2. (`--worker-id` is given) Performs the training as a part of distributed training.

When launching the script without any distributed training parameters (operation mode 1),
it checks the number of GPUs available on the local system and spawns the same
number of training subprocesses (running in operation mode 2). You can reduce the number of GPUs used with
`--num-workers`. If no GPU is available, only one subprocess is launched, and providing
`--num-workers` larger than 1 results in an error.

When launching the script as a worker process of a distributed training job, you need to configure
the coordination of the workers.

- `--num-workers` is the number of training processes being launched.
- `--worker-id` is the process rank (must be unique across all the processes).
- `--device-id` is the GPU device ID (should be unique within a node).
- `--sync-protocol` is how the worker processes communicate and synchronize.
  If the training is carried out on a single node, the default `"env://"` should do.
  If the training processes span multiple nodes, a path to a file to which all the
  training processes have access has to be provided with the `"file://..."` protocol.
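
As a sketch of operation mode 2, the workers on a single node with two GPUs could be launched manually like this (one command per terminal; note that, depending on how the process group is initialized, the `env://` protocol may also require `MASTER_ADDR`/`MASTER_PORT` to be set in the environment):

```
# worker 0
python train.py --worker-id 0 --device-id 0 --num-workers 2 --sync-protocol "env://" -- \
    <model specific training parameters>

# worker 1
python train.py --worker-id 1 --device-id 1 --num-workers 2 --sync-protocol "env://" -- \
    <model specific training parameters>
```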

### Distributed Training Notes

<details><summary>Quick overview of DDP (distributed data parallel)</summary>

DDP is a single-program multiple-data training paradigm.
With DDP, the model is replicated on every process,
and every model replica is fed a different set of input data samples.

- Process: a worker process (as in an OS process). There are P processes per node.
- Node: a machine. There are N nodes, each of which hosts P processes.
- World: the network of nodes, composed of N nodes and N * P processes.
- Rank: global process ID (unique across nodes), in the range [0, N * P).
- Local Rank: local process ID (unique only within a node), in the range [0, P).

```
          Node 0                     Node 1                         Node N-1
┌────────────────────────┐┌─────────────────────────┐     ┌───────────────────────────┐
│╔══════════╗ ┌─────────┐││┌───────────┐ ┌─────────┐│     │┌─────────────┐ ┌─────────┐│
│║ Process  ╟─┤ GPU: 0  ││││ Process   ├─┤ GPU: 0  ││     ││ Process     ├─┤ GPU: 0  ││
│║ Rank: 0  ║ └─────────┘│││ Rank:P    │ └─────────┘│     ││ Rank:NP-P   │ └─────────┘│
│╚══════════╝            ││└───────────┘            │     │└─────────────┘            │
│┌──────────┐ ┌─────────┐││┌───────────┐ ┌─────────┐│     │┌─────────────┐ ┌─────────┐│
││ Process  ├─┤ GPU: 1  ││││ Process   ├─┤ GPU: 1  ││     ││ Process     ├─┤ GPU: 1  ││
││ Rank: 1  │ └─────────┘│││ Rank:P+1  │ └─────────┘│     ││ Rank:NP-P+1 │ └─────────┘│
│└──────────┘            ││└───────────┘            │ ... │└─────────────┘            │
│                        ││                         │     │                           │
│          ...           ││           ...           │     │            ...            │
│                        ││                         │     │                           │
│┌──────────┐ ┌─────────┐││┌───────────┐ ┌─────────┐│     │┌─────────────┐ ┌─────────┐│
││ Process  ├─┤ GPU:P-1 ││││ Process   ├─┤ GPU:P-1 ││     ││ Process     ├─┤ GPU:P-1 ││
││ Rank:P-1 │ └─────────┘│││ Rank:2P-1 │ └─────────┘│     ││ Rank:NP-1   │ └─────────┘│
│└──────────┘            ││└───────────┘            │     │└─────────────┘            │
└────────────────────────┘└─────────────────────────┘     └───────────────────────────┘
```

</details>
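
For reference, here is a minimal sketch of how the bookkeeping above typically maps onto `torch.distributed`. It is illustrative only: the actual process-group setup in this example is handled by `dist_utils` (not shown in this diff), and the function and argument names (`init_worker`, `node_index`, `procs_per_node`, `num_nodes`) are hypothetical.

```python
import torch
import torch.distributed as dist


def init_worker(node_index: int, local_rank: int, procs_per_node: int, num_nodes: int):
    """Illustrative sketch: register one worker process with the DDP world."""
    world_size = num_nodes * procs_per_node          # N * P processes in total
    rank = node_index * procs_per_node + local_rank  # global Rank in [0, N * P)

    # "env://" reads MASTER_ADDR / MASTER_PORT from the environment;
    # for multi-node jobs, a shared "file://..." path can be used instead.
    dist.init_process_group(
        backend="nccl" if torch.cuda.is_available() else "gloo",
        init_method="env://",
        rank=rank,
        world_size=world_size,
    )
    if torch.cuda.is_available():
        # The Local Rank is what maps a process to one GPU on its node.
        torch.cuda.set_device(local_rank)
```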

### SLURM

When launched as a SLURM job, the following environment variables correspond to the script parameters:

- `SLURM_PROCID`: `--worker-id` (Rank)
- `SLURM_NTASKS` (or the legacy `SLURM_NPROCS`): the total number of processes (`--num-workers`, i.e. the world size)
- `SLURM_LOCALID`: Local Rank (to be mapped to the GPU index\*)

\* Even when the GPU resource is allocated with `--gpus-per-task=1`, if there are multiple
tasks allocated on the same node (and thus multiple GPUs of the node are allocated to the job),
each task can see all the GPUs allocated for those tasks. Therefore we need to use
`SLURM_LOCALID` to tell each task which GPU it should use, as the wrapper script below does.

<details><summary>Example scripts for running the training on a SLURM cluster</summary>

- **launch_job.sh**

```bash
#!/bin/bash

#SBATCH --job-name=source_separation

#SBATCH --output=/checkpoint/%u/jobs/%x/%j.out

#SBATCH --error=/checkpoint/%u/jobs/%x/%j.err

#SBATCH --nodes=1

#SBATCH --ntasks-per-node=8

#SBATCH --cpus-per-task=8

#SBATCH --mem-per-cpu=16G

#SBATCH --gpus-per-task=1

#srun env
srun wrapper.sh $@
```

- **wrapper.sh**

```bash
#!/bin/bash
this_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
save_dir="/checkpoint/${USER}/jobs/${SLURM_JOB_NAME}/${SLURM_JOB_ID}"
dataset_dir="/dataset/wsj0-mix/2speakers/wav8k/min"

if [ "${SLURM_JOB_NUM_NODES}" -gt 1 ]; then
  protocol="file:///checkpoint/${USER}/jobs/source_separation/${SLURM_JOB_ID}/sync"
else
  protocol="env://"
fi

mkdir -p "${save_dir}"

python -u \
  "${this_dir}/train.py" \
  --worker-id "${SLURM_PROCID}" \
  --num-workers "${SLURM_NTASKS}" \
  --device-id "${SLURM_LOCALID}" \
  --sync-protocol "${protocol}" \
  -- \
  --sample-rate 8000 \
  --batch-size $((32 / SLURM_NTASKS)) \
  --dataset-dir "${dataset_dir}" \
  --save-dir "${save_dir}"
```

</details>
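
The job can then be submitted with `sbatch`; for example, assuming both scripts are in the current directory and `wrapper.sh` is executable:

```bash
sbatch ./launch_job.sh
```
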
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
from . import (
    model,
    metrics,
    trainer,
)
Lines changed: 173 additions & 0 deletions
@@ -0,0 +1,173 @@
#!/usr/bin/env python3
"""Train Conv-TasNet"""
import pathlib
import argparse

import torch.utils.data

import conv_tasnet
import dataset_utils
import dist_utils

_LG = dist_utils.getLogger(__name__)


def _parse_args(args):
    parser = argparse.ArgumentParser(
        description=__doc__, formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    group = parser.add_argument_group("model")
    group.add_argument(
        "--num-speakers", default=2, type=int, help="The number of speakers."
    )
    group = parser.add_argument_group("dataset")
    group.add_argument(
        "--sample-rate",
        required=True,
        type=int,
        help="Sample rate of audio files in the given dataset.",
    )
    group.add_argument(
        "--dataset", default="wsj0mix",
        choices=["wsj0mix"]
    )
    group.add_argument(
        "--dataset-dir",
        required=True,
        type=pathlib.Path,
        help=(
            "Directory where dataset is found. "
            'If the dataset type is "wsj0mix", then this is the directory where '
            '"cv", "tt" and "tr" subdirectories are found.'
        ),
    )
    group = parser.add_argument_group("save")
    group.add_argument(
        "--save-dir",
        required=True,
        type=pathlib.Path,
        help=(
            "Directory where the checkpoints are saved. "
            "Though only worker 0 saves the checkpoint data, all the worker processes must "
            "have access to the directory."
        ),
    )
    group = parser.add_argument_group("dataloader")
    group.add_argument(
        "--batch-size", default=32, type=int,
    )
    group = parser.add_argument_group("training")
    group.add_argument(
        "--epochs", default=100, type=int, help="The number of epochs to train."
    )
    group.add_argument(
        "--learning-rate", default=1e-3, type=float, help="Initial learning rate."
    )
    group.add_argument(
        "--grad-clip", default=5.0, type=float, help="Gradient clip value (l2 norm)."
    )
    group.add_argument(
        "--resume",
        help="Previous checkpoint file from which the training is resumed."
    )
    return parser.parse_args(args)


def train(args):
    args = _parse_args(args)
    _LG.info("%s", args)

    args.save_dir.mkdir(parents=True, exist_ok=True)

    start_epoch = 1
    if args.resume:
        checkpoint = torch.load(args.resume)
        if args.sample_rate != checkpoint['sample_rate']:
            raise ValueError(
                f"The provided sample rate ({args.sample_rate}) does not match "
                f"the sample rate from the checkpoint ({checkpoint['sample_rate']})."
            )
        if args.num_speakers != checkpoint['num_speakers']:
            raise ValueError(
                f"The provided number of speakers ({args.num_speakers}) does not match "
                f"the number of speakers from the checkpoint ({checkpoint['num_speakers']})."
            )
        start_epoch = checkpoint['epoch']

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    _LG.info("Using: %s", device)

    model = conv_tasnet.model.ConvTasNet(
        num_speakers=args.num_speakers, enc_kernel_size=args.sample_rate * 2 // 1000
    )
    model.to(device)

    # Wrap the model so that gradients are synchronized across the worker processes.
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[device])
    optimizer = torch.optim.Adam(model.parameters(), lr=args.learning_rate)

    if args.resume:
        _LG.info("Loading parameters from the checkpoint...")
        model.module.load_state_dict(checkpoint['model'])
        optimizer.load_state_dict(checkpoint['optimizer'])
    else:
        dist_utils.synchronize_params(str(args.save_dir / "tmp.pt"), model, optimizer)

    _LG.info_on_master("Model:\n%s", model)

    lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, factor=0.5, patience=3
    )

    train_loss_func = conv_tasnet.metrics.PITLoss(
        loss_func=conv_tasnet.metrics.neg_si_snr
    )

    train_dataset, eval_dataset = dataset_utils.get_dataset(
        args.dataset, args.dataset_dir, args.num_speakers
    )
    collate_fn = dataset_utils.get_collate_fn(args.dataset)

    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=args.batch_size,
        sampler=torch.utils.data.distributed.DistributedSampler(train_dataset),
        collate_fn=collate_fn,
    )
    eval_loader = torch.utils.data.DataLoader(
        eval_dataset,
        batch_size=args.batch_size,
        sampler=torch.utils.data.distributed.DistributedSampler(eval_dataset),
        collate_fn=collate_fn,
    )

    trainer = conv_tasnet.trainer.Trainer(
        model,
        optimizer,
        train_loader,
        eval_loader,
        train_loss_func,
        args.grad_clip,
        device,
    )

    _LG.info_on_master("Running %s epochs", args.epochs)
    for epoch in range(start_epoch, start_epoch + args.epochs):
        _LG.info_on_master("-" * 70)
        _LG.info_on_master("Epoch: %s", epoch)
        _LG.info_on_master("Learning rate: %s", optimizer.param_groups[0]["lr"])
        _LG.info_on_master("-" * 70)

        trainer.train_one_epoch()
        loss = trainer.evaluate()
        lr_scheduler.step(loss)

        save_path = args.save_dir / f"epoch_{epoch}.pt"
        dist_utils.save_on_master(
            {
                "model": model.module.state_dict(),
                "optimizer": optimizer.state_dict(),
                "num_speakers": args.num_speakers,
                "sample_rate": args.sample_rate,
                "epoch": epoch,
            },
            save_path,
        )
