This repository was archived by the owner on Mar 21, 2024. It is now read-only.

Conversation

ant0nsc (Contributor) commented Sep 20, 2021

When SSL training is interrupted on low-priority nodes, the metrics for the linear head currently show strange glitches after the job resumes. We suspect that these come from the fact that the optimizer for the linear head is not saved to the checkpoint, and hence has to re-learn all of its statistics after recovery.
This PR exposes the linear head optimizer to PL, so that its state is included in the checkpoint.
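
As a rough illustration of the idea (the class and parameter names below are made up for this sketch and are not the actual InnerEye code): any optimizer returned from a LightningModule's configure_optimizers is tracked by PL, so its state (e.g. Adam moments) is written to and restored from checkpoints along with the main SSL optimizer.

```python
import torch
from pytorch_lightning import LightningModule


class SSLModuleSketch(LightningModule):
    """Illustrative only: an SSL encoder plus a linear evaluation head."""

    def __init__(self, encoder: torch.nn.Module, linear_head: torch.nn.Module) -> None:
        super().__init__()
        self.encoder = encoder
        self.linear_head = linear_head

    def configure_optimizers(self):
        # Both optimizers are returned to Lightning, so both optimizer states
        # are saved to and restored from checkpoints, not just the SSL one.
        ssl_optimizer = torch.optim.Adam(self.encoder.parameters(), lr=1e-3)
        linear_head_optimizer = torch.optim.Adam(self.linear_head.parameters(), lr=1e-4)
        return [ssl_optimizer, linear_head_optimizer]
```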

In addition, the semantics of batch_size in SSL training have changed: previously it was the effective batch size, taking multiple nodes into account, which meant the code was effectively hardcoding 16 GPUs. New behaviour: batch size is now the batch size on a single GPU. As a consequence, we can scale to any number of GPUs without code changes.
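
To illustrate the new semantics with made-up numbers: the effective (global) batch size is now the per-GPU value multiplied by the number of GPUs per node and the number of nodes.

```python
# Hypothetical numbers, purely to illustrate the new per-GPU semantics.
batch_size_per_gpu = 64   # what the config now specifies
gpus_per_node = 4
num_nodes = 2

effective_batch_size = batch_size_per_gpu * gpus_per_node * num_nodes
print(effective_batch_size)  # 512 samples contribute to each optimizer step
```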

There are also new flags pl_limit_train_batches and pl_limit_val_batches to speed up training by reducing the number of batches processed per epoch.
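
These flags are forwarded to the corresponding PyTorch Lightning Trainer arguments. A minimal sketch, assuming a plain Trainer construction (the real argument plumbing in the codebase is more involved):

```python
from pytorch_lightning import Trainer


def build_trainer(pl_limit_train_batches: int = 10,
                  pl_limit_val_batches: int = 5) -> Trainer:
    # Lightning caps the number of batches per epoch at these values.
    return Trainer(max_epochs=1,
                   limit_train_batches=pl_limit_train_batches,
                   limit_val_batches=pl_limit_val_batches)
```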

ant0nsc changed the title from "Fix recovery of SSL training" to "Fix recovery of SSL training, scale SSL training to multiple nodes" on Oct 6, 2021
ant0nsc requested a review from Shruthi42 on October 6, 2021 14:41
@@ -1,8 +1,7 @@
# ------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
ozan-oktay (Contributor) commented Oct 6, 2021

I think it would be good to attach SSL training run results for future reference, before and after this change (for both SimCLR and BYOL).

super().__init__(ssl_training_dataset_name=SSLDatasetName.CIFAR10,
                 linear_head_dataset_name=SSLDatasetName.CIFAR10,
-                ssl_training_batch_size=512,
+                ssl_training_batch_size=64,
Contributor

Weren't we training these models on machines with 4 GPUs? Should this be 128 instead?

recovery_checkpoint_save_interval=200,
num_epochs=1000,
-ssl_training_batch_size=1200,
+ssl_training_batch_size=75,
Contributor

Same question here.

param.Boolean(default=False,
doc="Controls the PyTorch Lightning flag 'find_unused_parameters' for the DDP plugin. "
"Setting it to True comes with a performance hit.")
pl_limit_train_batches: Optional[int] = \
Contributor

What happens if the user specifies zero? Would it skip training/validation automatically, and is that tested?

ant0nsc (Contributor, Author)

0 is valid. We pass this value straight through to PL, and it does whatever it would normally do: yes, it would skip training/validation.
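
For example (a quick sketch, not part of this PR): constructing the Trainer with a limit of 0 should result in no training or validation batches being run.

```python
from pytorch_lightning import Trainer

# With limits of 0, Lightning runs zero training/validation batches,
# so the corresponding loops are effectively skipped.
trainer = Trainer(max_epochs=1, limit_train_batches=0, limit_val_batches=0)
```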

ant0nsc (Contributor, Author) commented Nov 15, 2021

Superseded by #584.

ant0nsc closed this on Nov 15, 2021
ant0nsc deleted the antonsc/recovery branch on January 31, 2022