This repository was archived by the owner on Mar 21, 2024. It is now read-only.

Commit f0d2337

Improve setup for running the HelloWorld model in AzureML (#693)

1 parent: 63fb868
4 files changed: 158 additions, 87 deletions (three of the four changed files are shown below)

CHANGELOG.md

14 additions, 14 deletions
The diff adds the new #693 entry to the "Upcoming" section; the remaining changes only touch blank lines and list indentation.

@@ -9,12 +9,12 @@ For each Pull Request, the affected code parts should be briefly described and a
Once a release is done, the "Upcoming" section becomes the release changelog, and a new empty "Upcoming" should be
created.

-
## Upcoming

### Added

- ([#671](https://github.com/microsoft/InnerEye-DeepLearning/pull/671)) Remove sequence models and unused variables. Simplify README.
+- ([#693](https://github.com/microsoft/InnerEye-DeepLearning/pull/693)) Improve instructions for HelloWorld model in AzureML.
- ([#678](https://github.com/microsoft/InnerEye-DeepLearning/pull/678)) Add function to get log level name and use it for logging.
- ([#666](https://github.com/microsoft/InnerEye-DeepLearning/pull/666)) Replace RadIO with TorchIO for patch-based inference.
- ([#643](https://github.com/microsoft/InnerEye-DeepLearning/pull/643)) Test for recovery of SSL job. Tracks learning rate and train
@@ -160,7 +160,6 @@ in inference-only runs when using lightning containers.
- ([#633](https://github.com/microsoft/InnerEye-DeepLearning/pull/633)) Model fields `recovery_checkpoint_save_interval` and `recovery_checkpoints_save_last_k` have been retired.
  Recovery checkpoint handling is now controlled by `autosave_every_n_val_epochs`.

-
## 0.3 (2021-06-01)

### Added
@@ -291,6 +290,7 @@ console for easier diagnostics.
  container models on machines with >1 GPU

### Removed
+
- ([#439](https://github.com/microsoft/InnerEye-DeepLearning/pull/439)) Deprecated `start_epoch` config argument.
- ([#450](https://github.com/microsoft/InnerEye-DeepLearning/pull/450)) Delete unused `classification_report.ipynb`.
- ([#455](https://github.com/microsoft/InnerEye-DeepLearning/pull/455)) Removed the AzureRunner conda environment.
@@ -307,11 +307,11 @@ console for easier diagnostics.

- ([#323](https://github.com/microsoft/InnerEye-DeepLearning/pull/323)) There are new model configuration fields
  (and hence, commandline options), in particular for controlling PyTorch Lightning (PL) training:
-    - `max_num_gpus` controls how many GPUs are used at most for training (default: all GPUs, value -1).
-    - `pl_num_sanity_val_steps` controls the PL trainer flag `num_sanity_val_steps`
-    - `pl_deterministic` controls the PL trainer flags `benchmark` and `deterministic`
-    - `generate_report` controls if a HTML report will be written (default: True)
-    - `recovery_checkpoint_save_interval` determines how often a checkpoint for training recovery is saved.
+  - `max_num_gpus` controls how many GPUs are used at most for training (default: all GPUs, value -1).
+  - `pl_num_sanity_val_steps` controls the PL trainer flag `num_sanity_val_steps`
+  - `pl_deterministic` controls the PL trainer flags `benchmark` and `deterministic`
+  - `generate_report` controls if a HTML report will be written (default: True)
+  - `recovery_checkpoint_save_interval` determines how often a checkpoint for training recovery is saved.
- ([#336](https://github.com/microsoft/InnerEye-DeepLearning/pull/336)) New extensions of
  SegmentationModelBases `HeadAndNeckBase` and `ProstateBase`. Use these classes to build your own Head&Neck or Prostate
  models, by just providing a list of foreground classes.
@@ -326,17 +326,17 @@ console for easier diagnostics.

- ([#323](https://github.com/microsoft/InnerEye-DeepLearning/pull/323)) The codebase has undergone a massive
  refactoring, to use PyTorch Lightning as the foundation for all training. As a consequence of that:
-    - Training is now using Distributed Data Parallel with synchronized `batchnorm`. The number of GPUs to use can be
+  - Training is now using Distributed Data Parallel with synchronized `batchnorm`. The number of GPUs to use can be
    controlled by a new commandline argument `max_num_gpus`.
-    - Several classes, like `ModelTrainingSteps*`, have been removed completely.
-    - The final model is now always the one that is written at the end of all training epochs.
-    - The old code that options to run full image inference at multiple epochs (i.e., multiple checkpoints), this has
+  - Several classes, like `ModelTrainingSteps*`, have been removed completely.
+  - The final model is now always the one that is written at the end of all training epochs.
+  - The old code that options to run full image inference at multiple epochs (i.e., multiple checkpoints), this has
    been removed, alongside the respective commandline options `save_start_epoch`, `save_step_epochs`,
    `epochs_to_test`, `test_diff_epochs`, `test_step_epochs`, `test_start_epoch`
-    - The commandline option `register_model_only_for_epoch` is now called `only_register_model`, and is boolean.
-    - All metrics are written to AzureML and Tensorboard in a unified format. A training Dice score for 'bladder' would
+  - The commandline option `register_model_only_for_epoch` is now called `only_register_model`, and is boolean.
+  - All metrics are written to AzureML and Tensorboard in a unified format. A training Dice score for 'bladder' would
    previously be called Train_Dice/bladder, now it is train/Dice/bladder.
-    - Due to a different checkpoint format, it is no longer possible to use checkpoints written by the previous version
+  - Due to a different checkpoint format, it is no longer possible to use checkpoints written by the previous version
    of the code.
  - The arguments of the `score.py` script changed: `data_root` -> `data_folder`, it no longer assumes a fixed
    `data` subfolder. `project_root` -> `model_root`, `test_image_channels` -> `image_files`.
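The configuration fields listed in the #323 entry above are also exposed as commandline options of the InnerEye runner. As a rough, hypothetical illustration (the exact flag syntax may differ between versions), a run that caps training at one GPU and skips the HTML report could look like this:

```shell
# Hypothetical invocation using two of the options described in the #323 changelog entry
python InnerEye/ML/runner.py --model=HelloWorld --max_num_gpus=1 --generate_report=False
```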

InnerEye/ML/configs/segmentation/HelloWorld.py

2 additions, 5 deletions
@@ -28,11 +28,8 @@ class HelloWorld(SegmentationModelBase):

    * This model can be trained from the commandline: python InnerEye/runner.py --model=HelloWorld

-    * If you want to test that your AzureML workspace is working:
-        - Upload to datasets storage account for your AzureML workspace: Test/ML/test_data/dataset.csv and
-          Test/ML/test_data/train_and_test_data and name the folder "hello_world"
-        - If you have set up AzureML then parameter search can be performed for this model by running:
-          python InnerEye/ML/ runner.py --model=HelloWorld --azureml=True --hyperdrive=True
+    * If you want to test that your AzureML workspace is working, please follow the instructions in
+      <repo_root>/docs/hello_world_model.md.

    In this example, the model is trained on 2 input image channels channel1 and channel2, and
    predicts 2 foreground classes region, region_1.

docs/hello_world_model.md

73 additions, 9 deletions
The page is rewritten and substantially expanded. The old instructions are removed:

-* This model can be trained from the commandline, from the root of the repo: `python InnerEye/ML/runner.py --model=HelloWorld`
-* If you want to test your AzureML workspace with the HelloWorld model:
-  * Make sure your AzureML workspace has been set up. You should have inside the folder InnerEye a settings.yml file
-    that specifies the datastore, the resource group, and the workspace on which to run
-  * Upload to datasets storage account for your AzureML workspace: `Tests/ML/test_data/dataset.csv` and
-    `Test/ML/test_data/train_and_test_data` and name the folder "hello_world"
-  * If you have set up AzureML then parameter search can be performed for this model by running:
-    `python InnerEye/ML/runner.py --model=HelloWorld --azureml --hyperdrive`

The new version of the page reads as follows.

# Training a Hello World segmentation model

In the configs folder, you will find a config file called [HelloWorld.py](../InnerEye/ML/configs/segmentation/HelloWorld.py).
We have created this file to demonstrate how to:

1. Subclass SegmentationModelBase, which is the base config for all segmentation model configs
1. Configure the UNet3D implemented in this package
1. Configure Azure HyperDrive based parameter search
This model can be trained from the commandline, from the root of the repo: `python InnerEye/ML/runner.py --model=HelloWorld`.
When used like this, it trains on dummy 3D scans that are included in this repository, and training runs on your local dev machine.

In order to get this model to train in AzureML, you need to upload the data to blob storage. This can be done via
[Azure Storage Explorer](https://azure.microsoft.com/en-gb/features/storage-explorer/) or via the
[Azure commandline tools](https://docs.microsoft.com/en-us/cli/azure/). Detailed instructions for both options are given below.

Before uploading, you need to know the name of the storage account that you have set up to hold the data for your
AzureML workspace; see [Step 4 in the Azure setup](setting_up_aml.md).
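If you are not sure of the storage account name, the Azure CLI can list the accounts that your Azure login can see (this assumes the CLI is installed and you have already run `az login`, as described under Option 2 below):

```shell
# List the names of all storage accounts visible to the logged-in Azure account
az storage account list --query "[].name" --output table
```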
## Option 1: Upload via Azure Storage Explorer

First, install [Azure Storage Explorer](https://azure.microsoft.com/en-gb/features/storage-explorer/).

When starting Storage Explorer, you need to [log in to Azure](https://docs.microsoft.com/en-gb/azure/vs-azure-tools-storage-manage-with-storage-explorer?tabs=windows).

* Select your subscription in the left-hand navigation, and then the storage account that you set up earlier.
* There should be a section "Blob Containers" for that account.
* Right-click on "Blob Containers" and choose "Create Blob Container". Give that container the name "datasets".
* Click on the newly created container "datasets". You should see no files present.
* Press "Upload" / "Upload folder".
* As the folder to upload, select `<repo_root>/Tests/ML/test_data/train_and_test_data`.
* As the destination directory, select `/hello_world`.
* Start the upload. Press the "Refresh" button after a couple of seconds; you should now see a folder `hello_world`, and inside of it, a subfolder `train_and_test_data`.
* Press "Upload" / "Upload files".
* Choose `<repo_root>/Tests/ML/test_data/dataset.csv`, and `/hello_world` as the destination directory.
* Start the upload and refresh.
* Verify that you now have the files `/hello_world/dataset.csv` and `/hello_world/train_and_test_data/id1_channel1.nii.gz`.
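The same check can also be done from the commandline. A blob listing such as the one below should show `hello_world/dataset.csv` and the files under `hello_world/train_and_test_data/` (replace `stor_acct` with your storage account name; depending on how the account is set up, you may also need to pass credentials, for example via `--auth-mode login` or `--account-key`):

```shell
# List everything that was uploaded under the hello_world folder of the "datasets" container
az storage blob list --account-name stor_acct --container-name datasets --prefix hello_world --output table
```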
## Option 2: Upload via the Azure CLI

First, install the [Azure commandline tools](https://docs.microsoft.com/en-us/cli/azure/).

Run the following in the command prompt:

```shell
az login
az account list
```

If the `az account list` command returns more than one subscription, run `az account set --subscription "your subscription name"` to select the subscription that contains your storage account.

The commands below assume that you are uploading to a storage account named `stor_acct`; replace that with your actual storage account name.

```shell
cd <your_repository_root>
az storage container create --account-name stor_acct --name datasets
az storage blob upload --account-name stor_acct --container-name datasets --file ./Tests/ML/test_data/dataset.csv --name hello_world/dataset.csv
az storage blob upload-batch --account-name stor_acct --destination datasets --source ./Tests/ML/test_data/train_and_test_data --destination-path hello_world/train_and_test_data
```
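If the upload commands prompt for credentials, one option is to pass the storage account key explicitly. A minimal sketch, assuming the account lives in a resource group called `<your_resource_group>`:

```shell
# Look up an access key for the storage account (replace the resource group placeholder)
az storage account keys list --account-name stor_acct --resource-group <your_resource_group> --query "[0].value" --output tsv
```

The returned key can be passed to the `az storage` commands via `--account-key`; alternatively, `--auth-mode login` works if your Azure identity has a data-plane role on the storage account.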
## Create an AzureML datastore

A "datastore" in AzureML lingo is an abstraction for the ML systems to access files that can come from different places. In our case, the datastore points to the storage container to which we have just uploaded the data.

Instructions to create the datastore are given [in the AML setup instructions](setting_up_aml.md), in step 5.
## Run the HelloWorld model in AzureML

Double-check that you have copied your Azure settings into the settings file, as described
[in the AML setup instructions](setting_up_aml.md), in step 6.

Then execute:

```shell
conda activate InnerEye
python InnerEye/ML/runner.py --model=HelloWorld
```
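Note that without further flags the runner trains on the local machine, as described at the top of this page. To submit the run to AzureML, the earlier version of these instructions added the `--azureml` flag, with `--hyperdrive` for an Azure HyperDrive parameter search; a sketch using the flag syntax from the removed lines above:

```shell
# Submit the HelloWorld run to AzureML instead of running it locally
python InnerEye/ML/runner.py --model=HelloWorld --azureml=True
# Optionally, also perform a HyperDrive parameter search in AzureML
python InnerEye/ML/runner.py --model=HelloWorld --azureml=True --hyperdrive=True
```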
