This repository was archived by the owner on Mar 21, 2024. It is now read-only.

Commit f0d2337

Improve setup for running the HelloWorld model in AzureML (#693)

1 parent: 63fb868
4 files changed: 158 additions, 87 deletions (three of the four changed files are shown below)

CHANGELOG.md

14 additions, 14 deletions
The diff adds the new #693 entry to the "Upcoming" section; the remaining changes only touch blank lines and list indentation.

@@ -9,12 +9,12 @@ For each Pull Request, the affected code parts should be briefly described and a
Once a release is done, the "Upcoming" section becomes the release changelog, and a new empty "Upcoming" should be
created.

-
## Upcoming

### Added

- ([#671](https://github.com/microsoft/InnerEye-DeepLearning/pull/671)) Remove sequence models and unused variables. Simplify README.
+- ([#693](https://github.com/microsoft/InnerEye-DeepLearning/pull/693)) Improve instructions for HelloWorld model in AzureML.
- ([#678](https://github.com/microsoft/InnerEye-DeepLearning/pull/678)) Add function to get log level name and use it for logging.
- ([#666](https://github.com/microsoft/InnerEye-DeepLearning/pull/666)) Replace RadIO with TorchIO for patch-based inference.
- ([#643](https://github.com/microsoft/InnerEye-DeepLearning/pull/643)) Test for recovery of SSL job. Tracks learning rate and train
@@ -160,7 +160,6 @@ in inference-only runs when using lightning containers.
- ([#633](https://github.com/microsoft/InnerEye-DeepLearning/pull/633)) Model fields `recovery_checkpoint_save_interval` and `recovery_checkpoints_save_last_k` have been retired.
  Recovery checkpoint handling is now controlled by `autosave_every_n_val_epochs`.

-
## 0.3 (2021-06-01)

### Added
@@ -291,6 +290,7 @@ console for easier diagnostics.
  container models on machines with >1 GPU

### Removed
+
- ([#439](https://github.com/microsoft/InnerEye-DeepLearning/pull/439)) Deprecated `start_epoch` config argument.
- ([#450](https://github.com/microsoft/InnerEye-DeepLearning/pull/450)) Delete unused `classification_report.ipynb`.
- ([#455](https://github.com/microsoft/InnerEye-DeepLearning/pull/455)) Removed the AzureRunner conda environment.
@@ -307,11 +307,11 @@ console for easier diagnostics.

- ([#323](https://github.com/microsoft/InnerEye-DeepLearning/pull/323)) There are new model configuration fields
  (and hence, commandline options), in particular for controlling PyTorch Lightning (PL) training:
-    - `max_num_gpus` controls how many GPUs are used at most for training (default: all GPUs, value -1).
-    - `pl_num_sanity_val_steps` controls the PL trainer flag `num_sanity_val_steps`
-    - `pl_deterministic` controls the PL trainer flags `benchmark` and `deterministic`
-    - `generate_report` controls if a HTML report will be written (default: True)
-    - `recovery_checkpoint_save_interval` determines how often a checkpoint for training recovery is saved.
+  - `max_num_gpus` controls how many GPUs are used at most for training (default: all GPUs, value -1).
+  - `pl_num_sanity_val_steps` controls the PL trainer flag `num_sanity_val_steps`
+  - `pl_deterministic` controls the PL trainer flags `benchmark` and `deterministic`
+  - `generate_report` controls if a HTML report will be written (default: True)
+  - `recovery_checkpoint_save_interval` determines how often a checkpoint for training recovery is saved.
- ([#336](https://github.com/microsoft/InnerEye-DeepLearning/pull/336)) New extensions of
  SegmentationModelBases `HeadAndNeckBase` and `ProstateBase`. Use these classes to build your own Head&Neck or Prostate
  models, by just providing a list of foreground classes.
@@ -326,17 +326,17 @@ console for easier diagnostics.

- ([#323](https://github.com/microsoft/InnerEye-DeepLearning/pull/323)) The codebase has undergone a massive
  refactoring, to use PyTorch Lightning as the foundation for all training. As a consequence of that:
-    - Training is now using Distributed Data Parallel with synchronized `batchnorm`. The number of GPUs to use can be
+  - Training is now using Distributed Data Parallel with synchronized `batchnorm`. The number of GPUs to use can be
    controlled by a new commandline argument `max_num_gpus`.
-    - Several classes, like `ModelTrainingSteps*`, have been removed completely.
-    - The final model is now always the one that is written at the end of all training epochs.
-    - The old code that options to run full image inference at multiple epochs (i.e., multiple checkpoints), this has
+  - Several classes, like `ModelTrainingSteps*`, have been removed completely.
+  - The final model is now always the one that is written at the end of all training epochs.
+  - The old code that options to run full image inference at multiple epochs (i.e., multiple checkpoints), this has
    been removed, alongside the respective commandline options `save_start_epoch`, `save_step_epochs`,
    `epochs_to_test`, `test_diff_epochs`, `test_step_epochs`, `test_start_epoch`
-    - The commandline option `register_model_only_for_epoch` is now called `only_register_model`, and is boolean.
-    - All metrics are written to AzureML and Tensorboard in a unified format. A training Dice score for 'bladder' would
+  - The commandline option `register_model_only_for_epoch` is now called `only_register_model`, and is boolean.
+  - All metrics are written to AzureML and Tensorboard in a unified format. A training Dice score for 'bladder' would
    previously be called Train_Dice/bladder, now it is train/Dice/bladder.
-    - Due to a different checkpoint format, it is no longer possible to use checkpoints written by the previous version
+  - Due to a different checkpoint format, it is no longer possible to use checkpoints written by the previous version
    of the code.
  - The arguments of the `score.py` script changed: `data_root` -> `data_folder`, it no longer assumes a fixed
    `data` subfolder. `project_root` -> `model_root`, `test_image_channels` -> `image_files`.
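The configuration fields listed in the #323 entry above are also exposed as commandline options of the InnerEye runner. As a rough, hypothetical illustration (the exact flag syntax may differ between versions), a run that caps training at one GPU and skips the HTML report could look like this:

```shell
# Hypothetical invocation using two of the options described in the #323 changelog entry
python InnerEye/ML/runner.py --model=HelloWorld --max_num_gpus=1 --generate_report=False
```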

InnerEye/ML/configs/segmentation/HelloWorld.py

2 additions, 5 deletions
@@ -28,11 +28,8 @@ class HelloWorld(SegmentationModelBase):

    * This model can be trained from the commandline: python InnerEye/runner.py --model=HelloWorld

-    * If you want to test that your AzureML workspace is working:
-        - Upload to datasets storage account for your AzureML workspace: Test/ML/test_data/dataset.csv and
-          Test/ML/test_data/train_and_test_data and name the folder "hello_world"
-        - If you have set up AzureML then parameter search can be performed for this model by running:
-          python InnerEye/ML/ runner.py --model=HelloWorld --azureml=True --hyperdrive=True
+    * If you want to test that your AzureML workspace is working, please follow the instructions in
+      <repo_root>/docs/hello_world_model.md.

    In this example, the model is trained on 2 input image channels channel1 and channel2, and
    predicts 2 foreground classes region, region_1.

docs/hello_world_model.md

73 additions, 9 deletions
The page is rewritten and substantially expanded. The old instructions are removed:

-* This model can be trained from the commandline, from the root of the repo: `python InnerEye/ML/runner.py --model=HelloWorld`
-* If you want to test your AzureML workspace with the HelloWorld model:
-  * Make sure your AzureML workspace has been set up. You should have inside the folder InnerEye a settings.yml file
-    that specifies the datastore, the resource group, and the workspace on which to run
-  * Upload to datasets storage account for your AzureML workspace: `Tests/ML/test_data/dataset.csv` and
-    `Test/ML/test_data/train_and_test_data` and name the folder "hello_world"
-  * If you have set up AzureML then parameter search can be performed for this model by running:
-    `python InnerEye/ML/runner.py --model=HelloWorld --azureml --hyperdrive`

The new version of the page reads as follows.

# Training a Hello World segmentation model

In the configs folder, you will find a config file called [HelloWorld.py](../InnerEye/ML/configs/segmentation/HelloWorld.py).
We have created this file to demonstrate how to:

1. Subclass SegmentationModelBase, which is the base config for all segmentation model configs
1. Configure the UNet3D implemented in this package
1. Configure Azure HyperDrive based parameter search
This model can be trained from the commandline, from the root of the repo: `python InnerEye/ML/runner.py --model=HelloWorld`.
When used like this, it trains on dummy 3D scans that are included in this repository, and training runs on your local dev machine.

In order to get this model to train in AzureML, you need to upload the data to blob storage. This can be done via
[Azure Storage Explorer](https://azure.microsoft.com/en-gb/features/storage-explorer/) or via the
[Azure commandline tools](https://docs.microsoft.com/en-us/cli/azure/). Detailed instructions for both options are given below.

Before uploading, you need to know the name of the storage account that you have set up to hold the data for your
AzureML workspace; see [Step 4 in the Azure setup](setting_up_aml.md).
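If you are not sure of the storage account name, the Azure CLI can list the accounts that your Azure login can see (this assumes the CLI is installed and you have already run `az login`, as described under Option 2 below):

```shell
# List the names of all storage accounts visible to the logged-in Azure account
az storage account list --query "[].name" --output table
```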
## Option 1: Upload via Azure Storage Explorer

First, install [Azure Storage Explorer](https://azure.microsoft.com/en-gb/features/storage-explorer/).

When starting Storage Explorer, you need to [log in to Azure](https://docs.microsoft.com/en-gb/azure/vs-azure-tools-storage-manage-with-storage-explorer?tabs=windows).

* Select your subscription in the left-hand navigation, and then the storage account that you set up earlier.
* There should be a section "Blob Containers" for that account.
* Right-click on "Blob Containers" and choose "Create Blob Container". Give that container the name "datasets".
* Click on the newly created container "datasets". You should see no files present.
* Press "Upload" / "Upload folder".
* As the folder to upload, select `<repo_root>/Tests/ML/test_data/train_and_test_data`.
* As the destination directory, select `/hello_world`.
* Start the upload. Press the "Refresh" button after a couple of seconds; you should now see a folder `hello_world`, and inside of it, a subfolder `train_and_test_data`.
* Press "Upload" / "Upload files".
* Choose `<repo_root>/Tests/ML/test_data/dataset.csv`, and `/hello_world` as the destination directory.
* Start the upload and refresh.
* Verify that you now have the files `/hello_world/dataset.csv` and `/hello_world/train_and_test_data/id1_channel1.nii.gz`.
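The same check can also be done from the commandline. A blob listing such as the one below should show `hello_world/dataset.csv` and the files under `hello_world/train_and_test_data/` (replace `stor_acct` with your storage account name; depending on how the account is set up, you may also need to pass credentials, for example via `--auth-mode login` or `--account-key`):

```shell
# List everything that was uploaded under the hello_world folder of the "datasets" container
az storage blob list --account-name stor_acct --container-name datasets --prefix hello_world --output table
```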
## Option 2: Upload via the Azure CLI

First, install the [Azure commandline tools](https://docs.microsoft.com/en-us/cli/azure/).

Run the following in the command prompt:

```shell
az login
az account list
```

If the `az account list` command returns more than one subscription, run `az account set --subscription "your subscription name"` to select the subscription that contains your storage account.

The commands below assume that you are uploading to a storage account named `stor_acct`; replace that with your actual storage account name.

```shell
cd <your_repository_root>
az storage container create --account-name stor_acct --name datasets
az storage blob upload --account-name stor_acct --container-name datasets --file ./Tests/ML/test_data/dataset.csv --name hello_world/dataset.csv
az storage blob upload-batch --account-name stor_acct --destination datasets --source ./Tests/ML/test_data/train_and_test_data --destination-path hello_world/train_and_test_data
```
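If the upload commands prompt for credentials, one option is to pass the storage account key explicitly. A minimal sketch, assuming the account lives in a resource group called `<your_resource_group>`:

```shell
# Look up an access key for the storage account (replace the resource group placeholder)
az storage account keys list --account-name stor_acct --resource-group <your_resource_group> --query "[0].value" --output tsv
```

The returned key can be passed to the `az storage` commands via `--account-key`; alternatively, `--auth-mode login` works if your Azure identity has a data-plane role on the storage account.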
## Create an AzureML datastore

A "datastore" in AzureML lingo is an abstraction for the ML systems to access files that can come from different places. In our case, the datastore points to the storage container to which we have just uploaded the data.

Instructions to create the datastore are given [in the AML setup instructions](setting_up_aml.md), in step 5.
## Run the HelloWorld model in AzureML

Double-check that you have copied your Azure settings into the settings file, as described
[in the AML setup instructions](setting_up_aml.md), in step 6.

Then execute:

```shell
conda activate InnerEye
python InnerEye/ML/runner.py --model=HelloWorld
```
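Note that without further flags the runner trains on the local machine, as described at the top of this page. To submit the run to AzureML, the earlier version of these instructions added the `--azureml` flag, with `--hyperdrive` for an Azure HyperDrive parameter search; a sketch using the flag syntax from the removed lines above:

```shell
# Submit the HelloWorld run to AzureML instead of running it locally
python InnerEye/ML/runner.py --model=HelloWorld --azureml=True
# Optionally, also perform a HyperDrive parameter search in AzureML
python InnerEye/ML/runner.py --model=HelloWorld --azureml=True --hyperdrive=True
```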
