From 6686a1ba0effee0aaf0213a72209c93a825ea8e8 Mon Sep 17 00:00:00 2001
From: Tim Regan
Date: Tue, 8 Jun 2021 10:54:35 +0100
Subject: [PATCH 01/10] This is a combination of 36 commits.

This is the 1st commit message:
remove exception on lightningcontainer as ensembles with cross-validation

This is the commit message #2:
Removing unnecessary lambda

This is the commit message #3:
Removing block for local run

This is the commit message #4:
Dealing with missing config for hyperdrive in LightningContainer CrossVal

This is the commit message #5:
Old typo in comment

This is the commit message #6:
Refactoring hyperdrive cross validation support into WorkflowParams

This is the commit message #7:
property not blocked for cross validation

This is the commit message #8:
inconsistent blank rows

This is the commit message #9:
tidy up

This is the commit message #10:
Restoring block on offline segmentation cross validation

AzureML is working, but offline is not. How come? The AzureML experiment runs
'offline' on Azure so how does it pass? Anyway, enabling block for now while
writing test.

This is the commit message #11:
Commenting out blocks on offline cross validation for segmentation models

When I remove lines 290, 342, and 343 from run_ml.py, i.e. when I remove the
blocks on offline cross validation for segmentation models, then I hit the error:
'NoneType' object has no attribute 'number_of_cross_validation_splits'
because in line 305, in spawn_offline_cross_val_classification_child_runs:
for i in range(self.innereye_config.number_of_cross_validation_splits):
self.innereye_config is None

This is the commit message #12:
Parking offline, since it will only work for 1 GPU anyway

This is the commit message #13:
unused param better as _

This is the commit message #14:
unit test for lightningcontainer get_hyperdrive_config

This is the commit message #15:
MyPy fix

This is the commit message #16:
Removing get_total_number_of_cross_validation_runs

Remnant of an old feature

This is the commit message #17:
flake8 fixes

This is the commit message #18:
Expanding abstract method documentation

As per https://github.com/microsoft/InnerEye-DeepLearning/pull/483#discussion_r650810864

This is the commit message #19:
Reverting, mypy errors

How can changing two abstract methods' doc-strings cause mypy errors?
This is the commit message #20:
Removing unneeded Optional

This is the commit message #21:
We do need that property

But fixing typing via an exception feels all wrong

This is the commit message #22:
Extending abstract method documentation

This is the commit message #23:
Unit test for cross validation lightning changes to runner

This is the commit message #24:
Adding cross validation to HelloContainer

This is the commit message #25:
Correcting HelloContainer xval splits + comments

This is the commit message #26:
Unit test for xval work in HelloContainer

But test not passing MyPy yet, grrrrrrr

This is the commit message #27:
mypy fixes on unit test

This is the commit message #28:
updating changelog

This is the commit message #29:
testing val sets add up correctly

This is the commit message #30:
finishing documentation

This is the commit message #31:
adding comments

This is the commit message #32:
Refactoring HelloDataset

To remove clumsy init method parameters

This is the commit message #33:
homegrown -> sklearn.model_selection.KFold

This is the commit message #34:
Dropping notimplemented override

This is the commit message #35:
Restoring override in correct place

This is the commit message #36:
Refactoring get_parameter_search_hyperdrive_config

As per Anton's comment

Moving more xval methods into LightningContainer

import fix

restoring get_hyperdrive_config for non lightning models

restoring required sampler method too

reverting cosmetic change

restoring unit test
---
 CHANGELOG.md                                  |  1 +
 InnerEye/ML/configs/other/HelloContainer.py   | 90 +++++++++++++++----
 InnerEye/ML/lightning_container.py            | 47 ++++++++++
 InnerEye/ML/model_config_base.py              |  9 +-
 InnerEye/ML/run_ml.py                         |  2 +-
 InnerEye/ML/runner.py                         |  8 +-
 .../ML/visualizers/plot_cross_validation.py   |  2 +-
 .../ML/configs/other/test_hello_container.py  | 86 ++++++++++++++++++
 Tests/ML/models/test_scalar_model.py          |  4 +-
 Tests/ML/runners/test_runner.py               | 24 ++++-
 Tests/ML/test_lightning_containers.py         | 34 ++++++-
 docs/bring_your_own_model.md                  |  4 +-
 docs/building_models.md                       |  4 +-
 13 files changed, 277 insertions(+), 38 deletions(-)
 create mode 100644 Tests/ML/configs/other/test_hello_container.py

diff --git a/CHANGELOG.md b/CHANGELOG.md
index dc41842f2..92f6631b8 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -42,6 +42,7 @@ multiple large checkpoints can time out.

 ### Added

+- ([#483](https://github.com/microsoft/InnerEye-DeepLearning/pull/483)) Allow cross validation with 'bring your own' Lightning models (without ensemble building).
 - ([#489](https://github.com/microsoft/InnerEye-DeepLearning/pull/489)) Remove portal query for outliers.
 - ([#488](https://github.com/microsoft/InnerEye-DeepLearning/pull/488)) Better handling of missing seriesId in segmentation cross validation reports.
 - ([#454](https://github.com/microsoft/InnerEye-DeepLearning/pull/454)) Checking that labels are mutually exclusive.

diff --git a/InnerEye/ML/configs/other/HelloContainer.py b/InnerEye/ML/configs/other/HelloContainer.py
index b911ccad7..8f7f33006 100644
--- a/InnerEye/ML/configs/other/HelloContainer.py
+++ b/InnerEye/ML/configs/other/HelloContainer.py
@@ -3,7 +3,7 @@
 # Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
# ------------------------------------------------------------------------------------------ from pathlib import Path -from typing import Any, Dict, List, Tuple +from typing import Any, Dict, List, Optional, Tuple import numpy as np import torch @@ -12,6 +12,8 @@ from torch.optim import Adam, Optimizer from torch.optim.lr_scheduler import StepLR, _LRScheduler from torch.utils.data import DataLoader, Dataset +from pytorch_lightning.metrics import MeanAbsoluteError +from sklearn.model_selection import KFold from InnerEye.Common import fixed_paths from InnerEye.ML.lightning_container import LightningContainer @@ -30,16 +32,13 @@ class HelloDataset(Dataset): # x = torch.rand((N, 1)) * 10 # y = 0.2 * x + 0.1 * torch.randn(x.size()) # xy = torch.cat((x, y), dim=1) - # np.savetxt("InnerEye/ML/configs/other/hellocontainer.csv", xy.numpy(), delimiter=",") - def __init__(self, root_folder: Path, start_index: int, end_index: int) -> None: + # np.savetxt("Tests/ML/test_data/hellocontainer.csv", xy.numpy(), delimiter=",") + def __init__(self, raw_data: List) -> None: """ Creates the 1-dim regression dataset. - :param root_folder: The folder in which the data file lives ("hellocontainer.csv") - :param start_index: The first row to read. - :param end_index: The last row to read (exclusive) + :param raw_data: The raw data, e.g. from a cross validation split or loaded from file """ - super().__init__() - raw_data = np.loadtxt(str(root_folder / "hellocontainer.csv"), delimiter=",")[start_index:end_index] + super().__init__() self.data = torch.tensor(raw_data, dtype=torch.float) def __len__(self) -> int: @@ -48,17 +47,67 @@ def __len__(self) -> int: def __getitem__(self, item: int) -> Dict[str, torch.Tensor]: return {'x': self.data[item][0:1], 'y': self.data[item][1:2]} + @staticmethod + def from_path_and_indexes( + root_folder: Path, + start_index: int, + end_index: int) -> 'HelloDataset': + ''' + Static method to instantiate a HelloDataset from the root folder with the start and end indexes. + :param root_folder: The folder in which the data file lives ("hellocontainer.csv") + :param start_index: The first row to read. + :param end_index: The last row to read (exclusive) + :return: A new instance based on the root folder and the start and end indexes. + ''' + raw_data = np.loadtxt(root_folder / "hellocontainer.csv", delimiter=",")[start_index:end_index] + return HelloDataset(raw_data) + class HelloDataModule(LightningDataModule): """ A data module that gives the training, validation and test data for a simple 1-dim regression task. + If not using cross validation a basic 50% / 20% / 30% split between train, validation, and test data + is made on the whole dataset. + For cross validation (if required) we use k-fold cross-validation. The test set remains unchanged + while the training and validation data cycle through the k-folds of the remaining data. """ - - def __init__(self, root_folder: Path) -> None: + def __init__( + self, + root_folder: Path, + number_of_cross_validation_splits: int = 0, + cross_validation_split_index: int = 0) -> None: super().__init__() - self.train = HelloDataset(root_folder, start_index=0, end_index=50) - self.val = HelloDataset(root_folder, start_index=50, end_index=70) - self.test = HelloDataset(root_folder, start_index=70, end_index=100) + if number_of_cross_validation_splits <= 1: + # For 0 or 1 splits just use the default values on the whole data-set. 
+ self.train = HelloDataset.from_path_and_indexes(root_folder, start_index=0, end_index=50) + self.val = HelloDataset.from_path_and_indexes(root_folder, start_index=50, end_index=70) + self.test = HelloDataset.from_path_and_indexes(root_folder, start_index=70, end_index=100) + else: + # Raise exceptions for unreasonable values + if cross_validation_split_index >= number_of_cross_validation_splits: + raise IndexError(f"The cross_validation_split_index ({cross_validation_split_index}) is too large " + f"given the number_of_cross_validation_splits ({number_of_cross_validation_splits}) requested") + raw_data = np.loadtxt(root_folder / "hellocontainer.csv", delimiter=",") + np.random.seed(42) + np.random.shuffle(raw_data) + if number_of_cross_validation_splits >= len(raw_data): + raise ValueError(f"Asked for {number_of_cross_validation_splits} cross validation splits from a " + f"dataset of length {len(raw_data)}") + # Hold out the last 30% as test data + self.test = HelloDataset(raw_data[70:100]) + # Create k-folds from the remaining 70% of the data-set. Use one for the validation + # data and the rest for the training data + raw_data_remaining = raw_data[0:70] + k_fold = KFold(n_splits=number_of_cross_validation_splits) + train_indexes, val_indexes = list(k_fold.split(raw_data_remaining))[cross_validation_split_index] + self.train = HelloDataset(raw_data_remaining[train_indexes]) + self.val = HelloDataset(raw_data_remaining[val_indexes]) + + def prepare_data(self, *args: Any, **kwargs: Any) -> None: + pass + + def setup(self, stage: Optional[str] = None) -> None: + pass def train_dataloader(self, *args: Any, **kwargs: Any) -> DataLoader: return DataLoader(self.train, batch_size=5) @@ -135,7 +184,7 @@ def configure_optimizers(self) -> Tuple[List[Optimizer], List[_LRScheduler]]: This method is part of the standard PyTorch Lightning interface. For an introduction, please see https://pytorch-lightning.readthedocs.io/en/stable/starter/converting.html It returns the PyTorch optimizer(s) and learning rate scheduler(s) that should be used for training. -= """ + """ optimizer = Adam(self.parameters(), lr=1e-1) scheduler = StepLR(optimizer, step_size=20, gamma=0.5) return [optimizer], [scheduler] @@ -203,10 +252,19 @@ def create_model(self) -> LightningModule: return HelloRegression() # This method must be overridden by any subclass of LightningContainer. It returns a data module, which - # in turn contains 3 data loaders for training, validation, and test set. + # in turn contains 3 data loaders for training, validation, and test set. + # + # If the container is used for cross validation then this method must handle the cross validation splits. + # Because this deals with data loaders, not loaded data, we cannot check automatically that cross validation is + # handled correctly within the LightningContainer base class, i.e. if you forget to do the cross validation split + # in your subclass nothing will fail, but each child run will be identical since they will each be given the full + # dataset. 
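Before the container wiring below, it may help to see the fold arithmetic on its own. This is a minimal standalone sketch (not part of the patch) of the `sklearn.model_selection.KFold` behaviour that the `HelloDataModule` change above relies on; the toy array merely stands in for the 70% non-test portion of hellocontainer.csv:

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten fake samples standing in for the non-test portion of the dataset.
raw_data_remaining = np.arange(10).reshape(-1, 1)

# KFold yields one (train_indexes, val_indexes) pair per fold; HelloDataModule
# picks the pair at position cross_validation_split_index.
k_fold = KFold(n_splits=5)
for split_index, (train_indexes, val_indexes) in enumerate(k_fold.split(raw_data_remaining)):
    print(split_index, train_indexes, val_indexes)
# The validation folds are disjoint and together cover every sample, which is
# what the unit test added later in this patch asserts on the real data.
```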
     def get_data_module(self) -> LightningDataModule:
         assert self.local_dataset is not None
-        return HelloDataModule(root_folder=self.local_dataset)  # type: ignore
+        return HelloDataModule(
+            root_folder=self.local_dataset,
+            number_of_cross_validation_splits=self.number_of_cross_validation_splits,
+            cross_validation_split_index=self.cross_validation_split_index)  # type: ignore

     # This is an optional override: This report creation method can read out any files that were written during
     # training, and cook them into a nice looking report. Here, the report is a simple text file.

diff --git a/InnerEye/ML/lightning_container.py b/InnerEye/ML/lightning_container.py
index f2f081a45..e87386888 100644
--- a/InnerEye/ML/lightning_container.py
+++ b/InnerEye/ML/lightning_container.py
@@ -11,8 +11,12 @@
 from pytorch_lightning import LightningDataModule, LightningModule
 from torch.optim import Optimizer
 from torch.optim.lr_scheduler import _LRScheduler
+from azureml.core import ScriptRunConfig
+from azureml.train.hyperdrive import GridParameterSampling, HyperDriveConfig, PrimaryMetricGoal, choice
+from InnerEye.Azure.azure_util import CROSS_VALIDATION_SPLIT_INDEX_TAG_KEY
 from InnerEye.Common.generic_parsing import GenericConfig, create_from_matching_params
+from InnerEye.Common.metrics_constants import TrackedMetrics
 from InnerEye.ML.common import ModelExecutionMode
 from InnerEye.ML.deep_learning_config import DatasetParams, OptimizerParams, OutputParams, TrainerParams, \
     WorkflowParams, load_checkpoint
@@ -175,6 +179,9 @@ def get_data_module(self) -> LightningDataModule:
         The format of the data is not specified any further. The method must take cross validation into account,
         and ensure that the logic to create training and validation sets correctly handles the given number of splits.
+        Because the method deals with data loaders, not loaded data, we cannot check automatically that cross validation
+        is handled correctly within the base class, i.e. if the cross validation split is not handled in the method then
+        nothing will fail, but each child run will be identical since they will each be given the full dataset.
         :return: A LightningDataModule
         """
         return None  # type: ignore
@@ -200,6 +207,12 @@ def get_trainer_arguments(self) -> Dict[str, Any]:
         """
         return dict()

+    def get_parameter_search_hyperdrive_config(self, _: ScriptRunConfig) -> HyperDriveConfig:  # type: ignore
+        """
+        Parameter search is not implemented. It should be implemented in a subclass if needed.
+        """
+        raise NotImplementedError("Parameter search is not implemented. It should be implemented in a subclass if needed.")
+
     def create_report(self) -> None:
         """
         This method is called after training and testing has been completed. It can aggregate all files that were
@@ -288,6 +301,40 @@ def create_lightning_module_and_store(self) -> None:
             self._model._optimizer_params = create_from_matching_params(self, OptimizerParams)
             self._model._trainer_params = create_from_matching_params(self, TrainerParams)

+    def get_cross_validation_hyperdrive_config(self, run_config: ScriptRunConfig) -> HyperDriveConfig:
+        """
+        Returns a configuration for AzureML Hyperdrive that varies the cross validation split index.
+        :param run_config: The AzureML run configuration object used to train an individual model.
+        :return: A hyperdrive configuration object.
+ """ + return HyperDriveConfig( + run_config=run_config, + hyperparameter_sampling=self.get_cross_validation_hyperdrive_sampler(), + primary_metric_name=TrackedMetrics.Val_Loss.value, + primary_metric_goal=PrimaryMetricGoal.MINIMIZE, + max_total_runs=self.number_of_cross_validation_splits + ) + + def get_hyperdrive_config(self, run_config: ScriptRunConfig) -> HyperDriveConfig: + """ + Returns the HyperDrive config for either parameter search or cross validation + (if number_of_cross_validation_splits > 1). + :param run_config: AzureML estimator + :return: HyperDriveConfigs + """ + if self.perform_cross_validation: + return self.get_cross_validation_hyperdrive_config(run_config) + else: + return self.get_parameter_search_hyperdrive_config(run_config) + + def get_cross_validation_hyperdrive_sampler(self) -> GridParameterSampling: + """ + Returns the cross validation sampler, required to sample the entire parameter space for cross validation. + """ + return GridParameterSampling(parameter_space={ + CROSS_VALIDATION_SPLIT_INDEX_TAG_KEY: choice(list(range(self.number_of_cross_validation_splits))), + }) + def __str__(self) -> str: """Returns a string describing the present object, as a list of key: value strings.""" arguments_str = "\nContainer:\n" diff --git a/InnerEye/ML/model_config_base.py b/InnerEye/ML/model_config_base.py index 0454233e4..edd849e7f 100644 --- a/InnerEye/ML/model_config_base.py +++ b/InnerEye/ML/model_config_base.py @@ -150,13 +150,6 @@ def create_model(self) -> Any: # because this would prevent us from easily instantiating this class in tests. raise NotImplementedError("create_model must be overridden") - def get_total_number_of_cross_validation_runs(self) -> int: - """ - Returns the total number of HyperDrive/offline runs required to sample the entire - cross validation parameter space. - """ - return self.number_of_cross_validation_splits - def get_cross_validation_hyperdrive_sampler(self) -> GridParameterSampling: """ Returns the cross validation sampler, required to sample the entire parameter space for cross validation. 
@@ -176,7 +169,7 @@ def get_cross_validation_hyperdrive_config(self, run_config: ScriptRunConfig) -> hyperparameter_sampling=self.get_cross_validation_hyperdrive_sampler(), primary_metric_name=TrackedMetrics.Val_Loss.value, primary_metric_goal=PrimaryMetricGoal.MINIMIZE, - max_total_runs=self.get_total_number_of_cross_validation_runs() + max_total_runs=self.number_of_cross_validation_splits ) def get_cross_validation_dataset_splits(self, dataset_split: DatasetSplits) -> DatasetSplits: diff --git a/InnerEye/ML/run_ml.py b/InnerEye/ML/run_ml.py index a46473285..b6f018f13 100644 --- a/InnerEye/ML/run_ml.py +++ b/InnerEye/ML/run_ml.py @@ -802,7 +802,7 @@ def are_sibling_runs_finished(self) -> bool: """ if (not self.is_offline_run) \ and (azure_util.is_cross_validation_child_run(RUN_CONTEXT)): - n_splits = self.innereye_config.get_total_number_of_cross_validation_runs() + n_splits = self.innereye_config.number_of_cross_validation_splits child_runs = azure_util.fetch_child_runs(PARENT_RUN_CONTEXT, expected_number_cross_validation_splits=n_splits) pending_runs = [x.id for x in child_runs diff --git a/InnerEye/ML/runner.py b/InnerEye/ML/runner.py index 0cac30dd8..9ba94d2d4 100755 --- a/InnerEye/ML/runner.py +++ b/InnerEye/ML/runner.py @@ -192,8 +192,6 @@ def run(self) -> Tuple[Optional[DeepLearningConfig], Optional[Run]]: user_agent.append(azure_util.INNEREYE_SDK_NAME, azure_util.INNEREYE_SDK_VERSION) self.parse_and_load_model() if self.lightning_container.perform_cross_validation: - if self.model_config is None: - raise NotImplementedError("Cross validation for LightingContainer models is not yet supported.") # force hyperdrive usage if performing cross validation self.azure_config.hyperdrive = True run_object: Optional[Run] = None @@ -219,14 +217,14 @@ def submit_to_azureml(self) -> Run: if isinstance(self.model_config, DeepLearningConfig) and not self.lightning_container.azure_dataset_id: raise ValueError("When running an InnerEye built-in model in AzureML, the 'azure_dataset_id' " "property must be set.") - hyperdrive_func = lambda run_config: self.model_config.get_hyperdrive_config(run_config) # type: ignore source_config = SourceConfig( root_folder=self.project_root, entry_script=Path(sys.argv[0]).resolve(), conda_dependencies_files=get_all_environment_files(self.project_root), - hyperdrive_config_func=hyperdrive_func, + hyperdrive_config_func=(self.model_config.get_hyperdrive_config if self.model_config + else self.lightning_container.get_hyperdrive_config), # For large jobs, upload of results can time out because of large checkpoint files. Default is 600 - upload_timeout_seconds=86400, + upload_timeout_seconds=86400 ) source_config.set_script_params_except_submit_flag() # Reduce the size of the snapshot by adding unused folders to amlignore. 
The Test* subfolders are only needed

diff --git a/InnerEye/ML/visualizers/plot_cross_validation.py b/InnerEye/ML/visualizers/plot_cross_validation.py
index 143d05925..baf09eb7d 100644
--- a/InnerEye/ML/visualizers/plot_cross_validation.py
+++ b/InnerEye/ML/visualizers/plot_cross_validation.py
@@ -440,7 +440,7 @@ def crossval_config_from_model_config(train_config: DeepLearningConfig) -> PlotC
         model_category=train_config.model_category,
         epoch=epoch,
         should_validate=False,
-        number_of_cross_validation_splits=train_config.get_total_number_of_cross_validation_runs())
+        number_of_cross_validation_splits=train_config.number_of_cross_validation_splits)


 def get_config_and_results_for_offline_runs(train_config: DeepLearningConfig) -> OfflineCrossvalConfigAndFiles:

diff --git a/Tests/ML/configs/other/test_hello_container.py b/Tests/ML/configs/other/test_hello_container.py
new file mode 100644
index 000000000..bcf38a1fb
--- /dev/null
+++ b/Tests/ML/configs/other/test_hello_container.py
@@ -0,0 +1,86 @@
+# ------------------------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
+# ------------------------------------------------------------------------------------------
+import torch
+import pytest
+from typing import Optional
+from InnerEye.Common import fixed_paths_for_tests
+from InnerEye.ML.configs.other.HelloContainer import HelloDataModule
+
+
+def test_cross_validation_splits() -> None:
+    '''
+    Test that the exemplar HelloDataModule correctly splits the data set for cross validation.
+
+    - We test that unreasonable values raise an exception (i.e. a split index greater than the number of splits,
+    or requesting a number of splits larger than the available data).
+    - We test that between splits the train and validation data differ, and the test data remains the same.
+    - We test that all the data is used in each split.
+    - We test that across all the splits the validation data uses all of the non-test data.
+    '''
+    # Get full data-set for comparison
+    root_folder = fixed_paths_for_tests.full_ml_test_data_path()
+    data_module_no_xval = HelloDataModule(root_folder=root_folder)
+
+    for number_of_cross_validation_splits in [0, 1, 5]:
+        previous_data_module_xval: Optional[HelloDataModule] = None
+        for cross_validation_split_index in range(number_of_cross_validation_splits):
+            data_module_xval = HelloDataModule(
+                root_folder=root_folder,
+                cross_validation_split_index=cross_validation_split_index,
+                number_of_cross_validation_splits=number_of_cross_validation_splits)
+            _assert_total_data_identical(
+                data_module_no_xval,
+                data_module_xval,
+                f"Total data mismatch for cross validation split ({number_of_cross_validation_splits}, {cross_validation_split_index})")
+            if number_of_cross_validation_splits <= 1:
+                break
+            if previous_data_module_xval:
+                _check_train_val_test(previous_data_module_xval, data_module_xval)
+            previous_data_module_xval = data_module_xval
+            if cross_validation_split_index == 0:
+                accrued_val_data = data_module_xval.val.data
+            else:
+                accrued_val_data = torch.cat((accrued_val_data, data_module_xval.val.data), dim=0)
+            if cross_validation_split_index == number_of_cross_validation_splits - 1:
+                all_non_test_data = torch.cat((data_module_xval.train.data, data_module_xval.val.data), dim=0)
+                msg = "Accrued validation sets from all the cross validations do not match the total non-test data"
+                assert torch.equal(torch.sort(all_non_test_data, dim=0)[0], torch.sort(accrued_val_data, dim=0)[0]), msg
+
+    with pytest.raises(IndexError) as index_error:
+        data_module_xval = HelloDataModule(
+            root_folder=root_folder,
+            cross_validation_split_index=6,
+            number_of_cross_validation_splits=5)
+    assert "too large given the number_of_cross_validation_splits" in str(index_error.value)
+
+    with pytest.raises(ValueError) as value_error:
+        data_module_xval = HelloDataModule(
+            root_folder=root_folder,
+            cross_validation_split_index=0,
+            number_of_cross_validation_splits=10_000)
+    assert "Asked for 10000 cross validation splits from a dataset of length" in str(value_error.value)
+
+
+def _assert_total_data_identical(dm1: HelloDataModule, dm2: HelloDataModule, msg: str) -> None:
+    '''
+    Check that the combined train, val, and test data of the two HelloDataModules is identical
+    '''
+    all_data1 = torch.cat((dm1.train.data, dm1.val.data, dm1.test.data), dim=0)
+    all_data2 = torch.cat((dm2.train.data, dm2.val.data, dm2.test.data), dim=0)
+    all_data1_sorted, _ = torch.sort(all_data1, dim=0)
+    all_data2_sorted, _ = torch.sort(all_data2, dim=0)
+    assert torch.equal(all_data1_sorted, all_data2_sorted), msg
+
+
+def _check_train_val_test(dm1: HelloDataModule, dm2: HelloDataModule) -> None:
+    '''
+    Check that the two HelloDataModules' train and val data differ, but that their test data is identical
+    '''
+    msg = "Two cross validation sets have the same training data"
+    assert not torch.equal(torch.sort(dm1.train.data, dim=0)[0], torch.sort(dm2.train.data, dim=0)[0]), msg
+    msg = "Two cross validation sets have the same validation data"
+    assert not torch.equal(torch.sort(dm1.val.data, dim=0)[0], torch.sort(dm2.val.data, dim=0)[0]), msg
+    msg = "Two cross validation sets have differing test data"
+    assert torch.equal(torch.sort(dm1.test.data, dim=0)[0], torch.sort(dm2.test.data, dim=0)[0]), msg

diff --git a/Tests/ML/models/test_scalar_model.py b/Tests/ML/models/test_scalar_model.py
index 6194aa7ad..12201ac9a 100644
--- a/Tests/ML/models/test_scalar_model.py
+++ b/Tests/ML/models/test_scalar_model.py
@@ -293,7 +293,7 @@ def test_run_ml_with_classification_model(test_output_dirs: OutputFolderForTests
     config_and_files = get_config_and_results_for_offline_runs(config)
     result_files = config_and_files.files
     # One file for VAL, one for TRAIN and one for TEST for each child run
-    assert len(result_files) == config.get_total_number_of_cross_validation_runs() * 3
+    assert len(result_files) == config.number_of_cross_validation_splits * 3
     for file in result_files:
         assert file.dataset_csv_file is not None
         assert file.dataset_csv_file.exists()
@@ -526,7 +526,7 @@ def test_is_offline_cross_val_parent_run(offline_parent_cv_run: bool) -> None:
 def _check_offline_cross_validation_output_files(train_config: ScalarModelBase) -> None:
     metrics: Dict[ModelExecutionMode, List[pd.DataFrame]] = dict()
     root = Path(train_config.file_system_config.outputs_folder)
-    for x in range(train_config.get_total_number_of_cross_validation_runs()):
+    for x in range(train_config.number_of_cross_validation_splits):
         expected_outputs_folder = root / str(x)
         assert expected_outputs_folder.exists()
         for m in [ModelExecutionMode.TRAIN, ModelExecutionMode.VAL, ModelExecutionMode.TEST]:

diff --git a/Tests/ML/runners/test_runner.py b/Tests/ML/runners/test_runner.py
index 186c3ea27..a0d8eede9 100644
--- a/Tests/ML/runners/test_runner.py
+++ b/Tests/ML/runners/test_runner.py
@@ -4,15 +4,18 @@
 # ------------------------------------------------------------------------------------------
 import logging
 import time
+from unittest import mock

 import pytest
+from azureml.train.hyperdrive.runconfig import HyperDriveConfig

-from InnerEye.Common import common_util
+from InnerEye.Common import common_util, fixed_paths
 from InnerEye.Common.fixed_paths_for_tests import full_ml_test_data_path
 from InnerEye.Common.output_directories import OutputFolderForTests
 from InnerEye.ML.common import BEST_CHECKPOINT_FILE_NAME_WITH_SUFFIX, ModelExecutionMode
 from InnerEye.ML.metrics import InferenceMetricsForSegmentation
 from InnerEye.ML.run_ml import MLRunner
+from InnerEye.ML.runner import Runner
 from Tests.ML.configs.DummyModel import DummyModel
 from Tests.ML.util import get_default_checkpoint_handler
 from Tests.ML.utils.test_model_util import create_model_and_store_checkpoint
@@ -71,3 +74,22 @@ def test_logging_to_file(test_output_dirs: OutputFolderForTests) -> None:
     assert file_path.exists()
     assert log_line in file_path.read_text()
     assert should_not_be_present not in file_path.read_text()
+
+
+def test_cross_validation_for_LightningContainer_models_is_supported() -> None:
+    '''
+    Prior to https://github.com/microsoft/InnerEye-DeepLearning/pull/483 we raised an exception in
+    runner.run when cross validation was attempted on a lightning container. This test checks that
+    we do not raise the exception anymore, and instead pass on a cross validation hyperdrive config
+    to azure_runner's submit_to_azureml method.
+    '''
+    args_list = ["--model=HelloContainer", "--number_of_cross_validation_splits=5", "--azureml=True"]
+    with mock.patch("sys.argv", [""] + args_list):
+        runner = Runner(project_root=fixed_paths.repository_root_directory(), yaml_config_file=fixed_paths.SETTINGS_YAML_FILE)
+        with mock.patch("InnerEye.Azure.azure_runner.create_and_submit_experiment", return_value=None) as create_and_submit_experiment_patch:
+            runner.run()
+            assert runner.lightning_container.model_name == 'HelloContainer'
+            assert runner.lightning_container.number_of_cross_validation_splits == 5
+            args, _ = create_and_submit_experiment_patch.call_args
+            script_run_config = args[1]
+            assert isinstance(script_run_config, HyperDriveConfig)

diff --git a/Tests/ML/test_lightning_containers.py b/Tests/ML/test_lightning_containers.py
index 51a852d74..c8c53a3b3 100644
--- a/Tests/ML/test_lightning_containers.py
+++ b/Tests/ML/test_lightning_containers.py
@@ -10,6 +10,8 @@
 import pandas as pd
 import pytest
 from pytorch_lightning import LightningModule
+from azureml.core import ScriptRunConfig
+from azureml.train.hyperdrive.runconfig import HyperDriveConfig

 from InnerEye.Common.output_directories import OutputFolderForTests
 from InnerEye.ML.common import ModelExecutionMode
@@ -19,7 +21,7 @@
 from InnerEye.ML.model_config_base import ModelConfigBase
 from InnerEye.ML.run_ml import MLRunner
 from Tests.ML.configs.DummyModel import DummyModel
-from Tests.ML.configs.lightning_test_containers import DummyContainerWithHooks, DummyContainerWithModel, \
+from Tests.ML.configs.lightning_test_containers import DummyContainerWithAzureDataset, DummyContainerWithHooks, DummyContainerWithModel, \
     DummyContainerWithPlainLightning
 from Tests.ML.util import default_runner
@@ -280,3 +282,33 @@ def test_container_hooks(test_output_dirs: OutputFolderForTests) -> None:
     # only check that they have all been called.
     for file in ["global_rank_zero.txt", "local_rank_zero.txt", "all_ranks.txt"]:
         assert (runner.container.outputs_folder / file).is_file(), f"Missing file: {file}"
+
+@pytest.mark.parametrize("number_of_cross_validation_splits", [0, 2])
+def test_get_hyperdrive_config(number_of_cross_validation_splits: int,
+                               test_output_dirs: OutputFolderForTests) -> None:
+    """
+    Testing that the hyperdrive config returned for the lightning container is right for submitting
+    to AzureML.
+
+    Note that because the function get_hyperdrive_config now lives in the super class WorkflowParams,
+    it is also tested for other aspects of functionality by a test of the same name in
+    Tests.ML.test_model_config_base.
+    """
+    container = DummyContainerWithAzureDataset()
+    container.number_of_cross_validation_splits = number_of_cross_validation_splits
+    run_config = ScriptRunConfig(
+        source_directory=str(test_output_dirs.root_dir),
+        script=str(Path("something.py")),
+        arguments=["foo"],
+        compute_target="EnormousCluster")
+    if number_of_cross_validation_splits == 0:
+        with pytest.raises(NotImplementedError) as not_implemented_error:
+            container.get_hyperdrive_config(run_config=run_config)
+        assert 'Parameter search is not implemented' in str(not_implemented_error.value)
+        # The error should be thrown by
+        # InnerEye.ML.lightning_container.LightningContainer.get_parameter_search_hyperdrive_config
+        # since number_of_cross_validation_splits == 0 implies a parameter search hyperdrive config and
+        # not a cross validation one.
+    else:
+        hd_config = container.get_hyperdrive_config(run_config=run_config)
+        assert isinstance(hd_config, HyperDriveConfig)

diff --git a/docs/bring_your_own_model.md b/docs/bring_your_own_model.md
index b8b56b26c..8edd844d7 100644
--- a/docs/bring_your_own_model.md
+++ b/docs/bring_your_own_model.md
@@ -27,7 +27,9 @@ encapsulates everything that is needed for training with PyTorch Lightning:
 all the usual PyTorch Lightning methods required for training, like the `training_step` and `forward` methods. This
 object needs to adhere to additional constraints, see below.
 - The `get_data_module` method of the container needs to return a `LightningDataModule` that has the data loaders for
-training and validation data.
+training and validation data. If you are using cross validation this method needs to take into account the number of
+cross validation splits, and the cross validation split index. You can find an example of handling cross validation in
+the [HelloContainer](../InnerEye/ML/configs/other/HelloContainer.py) class.
 - The optional `get_inference_data_module` returns a `LightningDataModule` that is used to read the data for inference
 (that is, evaluating the trained model). By default, this returns the same data as `get_training_data_module`, but you
 can override this for special models like segmentation models that are trained on equal sized image patches, but

diff --git a/docs/building_models.md b/docs/building_models.md
index 0ec3e7bb6..92a6a5d8f 100755
--- a/docs/building_models.md
+++ b/docs/building_models.md
@@ -127,9 +127,9 @@ at the same time (provided that the cluster has capacity). This means that a com
 takes as long as a single training run.

 To start cross validation, you can either modify the `number_of_cross_validation_splits` property of your model,
-or supply it on the command line: Provide all the usual switches, and add `--number_of_cross_validation_splits=N`,
+or supply it on the command line: provide all the usual switches, and add `--number_of_cross_validation_splits=N`,
 for some `N` greater than 1; a value of 5 is typical. This will start a
-[HyperDrive run](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters): A parent
+[HyperDrive run](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters): a parent
 AzureML job, with `N` child runs that will execute in parallel. You can see the child runs in the AzureML UI in the
 "Child Runs" tab.

From 4c078f267073fb26b7ca5e141b3fd72e0d48efb0 Mon Sep 17 00:00:00 2001
From: dumbledad
Date: Tue, 22 Jun 2021 17:21:12 +0100
Subject: [PATCH 02/10] fixing new position of hellocontainer CSV
---
 Tests/ML/configs/other/test_hello_container.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Tests/ML/configs/other/test_hello_container.py b/Tests/ML/configs/other/test_hello_container.py
index bcf38a1fb..fad05c1df 100644
--- a/Tests/ML/configs/other/test_hello_container.py
+++ b/Tests/ML/configs/other/test_hello_container.py
@@ -6,7 +6,7 @@
 import pytest
 from typing import Optional
 from InnerEye.Common import fixed_paths_for_tests
-from InnerEye.ML.configs.other.HelloContainer import HelloDataModule
+from InnerEye.ML.configs.other.HelloContainer import HelloContainer, HelloDataModule


 def test_cross_validation_splits() -> None:
@@ -20,7 +20,7 @@ def test_cross_validation_splits() -> None:
    - We test that across all the splits the validation data uses all of the non-test data.
''' # Get full data-set for comparison - root_folder = fixed_paths_for_tests.full_ml_test_data_path() + root_folder = HelloContainer().local_dataset data_module_no_xval = HelloDataModule(root_folder=root_folder) for number_of_cross_validation_splits in [0, 1, 5]: From 53cf2719bd29f130730e0c2f07c36bd3ff33390c Mon Sep 17 00:00:00 2001 From: Tim Regan Date: Wed, 23 Jun 2021 08:18:57 +0100 Subject: [PATCH 03/10] Removing test of deleted getter --- Tests/ML/test_model_config_base.py | 1 - 1 file changed, 1 deletion(-) diff --git a/Tests/ML/test_model_config_base.py b/Tests/ML/test_model_config_base.py index 8fdd19d64..172b3b30f 100644 --- a/Tests/ML/test_model_config_base.py +++ b/Tests/ML/test_model_config_base.py @@ -80,7 +80,6 @@ def test_get_total_number_of_cross_validation_runs() -> None: config = ModelConfigBase(should_validate=False) config.number_of_cross_validation_splits = 2 assert config.perform_cross_validation - assert config.get_total_number_of_cross_validation_runs() == config.number_of_cross_validation_splits def test_config_with_typo() -> None: From ac8a4133aee7081a457eb90b48554e9c96a936a6 Mon Sep 17 00:00:00 2001 From: Tim Regan Date: Wed, 23 Jun 2021 08:49:56 +0100 Subject: [PATCH 04/10] assertion to fix optional + flake fixes --- InnerEye/ML/configs/other/HelloContainer.py | 1 - Tests/ML/configs/other/test_hello_container.py | 2 +- 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/InnerEye/ML/configs/other/HelloContainer.py b/InnerEye/ML/configs/other/HelloContainer.py index 8f7f33006..42b1dba80 100644 --- a/InnerEye/ML/configs/other/HelloContainer.py +++ b/InnerEye/ML/configs/other/HelloContainer.py @@ -12,7 +12,6 @@ from torch.optim import Adam, Optimizer from torch.optim.lr_scheduler import StepLR, _LRScheduler from torch.utils.data import DataLoader, Dataset -from pytorch_lightning.metrics import MeanAbsoluteError from sklearn.model_selection import KFold from InnerEye.Common import fixed_paths diff --git a/Tests/ML/configs/other/test_hello_container.py b/Tests/ML/configs/other/test_hello_container.py index fad05c1df..c32b302ea 100644 --- a/Tests/ML/configs/other/test_hello_container.py +++ b/Tests/ML/configs/other/test_hello_container.py @@ -5,7 +5,6 @@ import torch import pytest from typing import Optional -from InnerEye.Common import fixed_paths_for_tests from InnerEye.ML.configs.other.HelloContainer import HelloContainer, HelloDataModule @@ -21,6 +20,7 @@ def test_cross_validation_splits() -> None: ''' # Get full data-set for comparison root_folder = HelloContainer().local_dataset + assert root_folder is not None data_module_no_xval = HelloDataModule(root_folder=root_folder) for number_of_cross_validation_splits in [0, 1, 5]: From 2a896ee4484cdc76cfee30edc4302c69dbe10778 Mon Sep 17 00:00:00 2001 From: Tim Regan Date: Wed, 23 Jun 2021 11:44:42 +0100 Subject: [PATCH 05/10] Moving method content to single use place --- InnerEye/ML/lightning_container.py | 13 ++++--------- 1 file changed, 4 insertions(+), 9 deletions(-) diff --git a/InnerEye/ML/lightning_container.py b/InnerEye/ML/lightning_container.py index e87386888..27c647292 100644 --- a/InnerEye/ML/lightning_container.py +++ b/InnerEye/ML/lightning_container.py @@ -309,7 +309,10 @@ def get_cross_validation_hyperdrive_config(self, run_config: ScriptRunConfig) -> """ return HyperDriveConfig( run_config=run_config, - hyperparameter_sampling=self.get_cross_validation_hyperdrive_sampler(), + hyperparameter_sampling=GridParameterSampling( + parameter_space={ + CROSS_VALIDATION_SPLIT_INDEX_TAG_KEY: 
choice(list(range(self.number_of_cross_validation_splits))) + }), primary_metric_name=TrackedMetrics.Val_Loss.value, primary_metric_goal=PrimaryMetricGoal.MINIMIZE, max_total_runs=self.number_of_cross_validation_splits @@ -327,14 +330,6 @@ def get_hyperdrive_config(self, run_config: ScriptRunConfig) -> HyperDriveConfig else: return self.get_parameter_search_hyperdrive_config(run_config) - def get_cross_validation_hyperdrive_sampler(self) -> GridParameterSampling: - """ - Returns the cross validation sampler, required to sample the entire parameter space for cross validation. - """ - return GridParameterSampling(parameter_space={ - CROSS_VALIDATION_SPLIT_INDEX_TAG_KEY: choice(list(range(self.number_of_cross_validation_splits))), - }) - def __str__(self) -> str: """Returns a string describing the present object, as a list of key: value strings.""" arguments_str = "\nContainer:\n" From 3fc1e7d91a58ccfea8664db1a937b23d2b47394d Mon Sep 17 00:00:00 2001 From: Tim Regan Date: Wed, 23 Jun 2021 11:45:24 +0100 Subject: [PATCH 06/10] Tightening type hint and doc on HelloContainer --- InnerEye/ML/configs/other/HelloContainer.py | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/InnerEye/ML/configs/other/HelloContainer.py b/InnerEye/ML/configs/other/HelloContainer.py index 42b1dba80..250e7f98b 100644 --- a/InnerEye/ML/configs/other/HelloContainer.py +++ b/InnerEye/ML/configs/other/HelloContainer.py @@ -32,10 +32,12 @@ class HelloDataset(Dataset): # y = 0.2 * x + 0.1 * torch.randn(x.size()) # xy = torch.cat((x, y), dim=1) # np.savetxt("Tests/ML/test_data/hellocontainer.csv", xy.numpy(), delimiter=",") - def __init__(self, raw_data: List) -> None: + def __init__(self, raw_data: List[List[float]]) -> None: """ Creates the 1-dim regression dataset. - :param raw_data: The raw data, e.g. from a cross validation split or loaded from file + :param raw_data: The raw data, e.g. from a cross validation split or loaded from file. This + must be numeric data which can be converted into a tensor. See the static method + from_path_and_indexes for an example call. """ super().__init__() self.data = torch.tensor(raw_data, dtype=torch.float) From 85c46d7e175748746a83d4b9919a7f4292b5608e Mon Sep 17 00:00:00 2001 From: Tim Regan Date: Thu, 24 Jun 2021 09:29:38 +0100 Subject: [PATCH 07/10] Fixing file path for saving hellocontainer.csv --- InnerEye/ML/configs/other/HelloContainer.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/InnerEye/ML/configs/other/HelloContainer.py b/InnerEye/ML/configs/other/HelloContainer.py index 250e7f98b..6b6d5e9cb 100644 --- a/InnerEye/ML/configs/other/HelloContainer.py +++ b/InnerEye/ML/configs/other/HelloContainer.py @@ -31,7 +31,7 @@ class HelloDataset(Dataset): # x = torch.rand((N, 1)) * 10 # y = 0.2 * x + 0.1 * torch.randn(x.size()) # xy = torch.cat((x, y), dim=1) - # np.savetxt("Tests/ML/test_data/hellocontainer.csv", xy.numpy(), delimiter=",") + # np.savetxt("InnerEye/ML/configs/other/hellocontainer.csv", xy.numpy(), delimiter=",") def __init__(self, raw_data: List[List[float]]) -> None: """ Creates the 1-dim regression dataset. 
From 54b75daf6dfc724b390f4904ad8fb229144788b7 Mon Sep 17 00:00:00 2001
From: Tim Regan
Date: Thu, 24 Jun 2021 14:10:50 +0100
Subject: [PATCH 08/10] Adding comment and documentation about handling val/Loss for xval in LightningContainer
---
 InnerEye/ML/lightning_container.py |  3 +++
 docs/bring_your_own_model.md       | 13 ++++++++++---
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/InnerEye/ML/lightning_container.py b/InnerEye/ML/lightning_container.py
index 27c647292..bfc1b4e09 100644
--- a/InnerEye/ML/lightning_container.py
+++ b/InnerEye/ML/lightning_container.py
@@ -304,6 +304,9 @@ def create_lightning_module_and_store(self) -> None:
     def get_cross_validation_hyperdrive_config(self, run_config: ScriptRunConfig) -> HyperDriveConfig:
         """
         Returns a configuration for AzureML Hyperdrive that varies the cross validation split index.
+        Because this adds a val/Loss metric it is important that when subclassing LightningContainer
+        your implementation of LightningModule logs val/Loss. There is an example of this in
+        HelloRegression's validation_step method.
         :param run_config: The AzureML run configuration object used to train an individual model.
         :return: A hyperdrive configuration object.
         """

diff --git a/docs/bring_your_own_model.md b/docs/bring_your_own_model.md
index 8edd844d7..2a7c71c93 100644
--- a/docs/bring_your_own_model.md
+++ b/docs/bring_your_own_model.md
@@ -27,9 +27,7 @@ encapsulates everything that is needed for training with PyTorch Lightning:
 all the usual PyTorch Lightning methods required for training, like the `training_step` and `forward` methods. This
 object needs to adhere to additional constraints, see below.
 - The `get_data_module` method of the container needs to return a `LightningDataModule` that has the data loaders for
-training and validation data. If you are using cross validation this method needs to take into account the number of
-cross validation splits, and the cross validation split index. You can find an example of handling cross validation in
-the [HelloContainer](../InnerEye/ML/configs/other/HelloContainer.py) class.
+training and validation data.
 - The optional `get_inference_data_module` returns a `LightningDataModule` that is used to read the data for inference
 (that is, evaluating the trained model). By default, this returns the same data as `get_training_data_module`, but you
 can override this for special models like segmentation models that are trained on equal sized image patches, but
@@ -40,6 +38,15 @@ correctly. If you'd like to have your model defined in a different folder, pleas
 the `--model_configs_namespace` argument. For example, use `--model_configs_namespace=My.Own.configs` if your model
 configuration classes reside in folder `My/Own/configs` from the repository root.

+### Cross Validation
+
+If you are doing cross validation you need to ensure that the `LightningDataModule` returned by your container's
+`get_data_module` method:
+- Needs to take into account the number of cross validation splits, and the cross validation split index when
+preparing the data.
+- Needs to log val/Loss in its `validation_step` method.
+You can find an example of handling cross validation in the [HelloContainer](../InnerEye/ML/configs/other/HelloContainer.py) class.
+ *Example*: ```python from pathlib import Path From d695d16afbd5eebf5c28078fa249f016750b2c14 Mon Sep 17 00:00:00 2001 From: Tim Regan Date: Thu, 24 Jun 2021 14:15:31 +0100 Subject: [PATCH 09/10] removing stray whitespace --- docs/bring_your_own_model.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/bring_your_own_model.md b/docs/bring_your_own_model.md index 2a7c71c93..7db5b2864 100644 --- a/docs/bring_your_own_model.md +++ b/docs/bring_your_own_model.md @@ -27,7 +27,7 @@ encapsulates everything that is needed for training with PyTorch Lightning: all the usual PyTorch Lightning methods required for training, like the `training_step` and `forward` methods. This object needs to adhere to additional constraints, see below. - The `get_data_module` method of the container needs to return a `LightningDataModule` that has the data loaders for -training and validation data. +training and validation data. - The optional `get_inference_data_module` returns a `LightningDataModule` that is used to read the data for inference (that is, evaluating the trained model). By default, this returns the same data as `get_training_data_module`, but you can override this for special models like segmentation models that are trained on equal sized image patches, but From f6f15325aafc3f768bba2b191207371035d3fe94 Mon Sep 17 00:00:00 2001 From: Tim Regan Date: Thu, 24 Jun 2021 14:16:44 +0100 Subject: [PATCH 10/10] rewording --- docs/bring_your_own_model.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/bring_your_own_model.md b/docs/bring_your_own_model.md index 7db5b2864..a2df9c50e 100644 --- a/docs/bring_your_own_model.md +++ b/docs/bring_your_own_model.md @@ -45,7 +45,8 @@ If you are doing cross validation you need to ensure that the `LightningDataModu - Needs to take into account the number of cross validation splits, and the cross validation split index when preparing the data. - Needs to log val/Loss in its `validation_step` method. -You can find an example of handling cross validation in the [HelloContainer](../InnerEye/ML/configs/other/HelloContainer.py) class. +You can find a working example of handling cross validation in the +[HelloContainer](../InnerEye/ML/configs/other/HelloContainer.py) class. *Example*: ```python
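PATCH 08 notes that the cross validation HyperDrive config relies on val/Loss being logged, with HelloRegression's `validation_step` as the reference. As a closing illustration, this is a minimal sketch (not taken from the patch) of a `validation_step` that logs that metric; the batch keys mirror `HelloDataset.__getitem__`, and the literal "val/Loss" is assumed to match `TrackedMetrics.Val_Loss.value`:

```python
import torch
import torch.nn.functional as F
from pytorch_lightning import LightningModule


class MyRegression(LightningModule):
    # training_step and configure_optimizers elided for brevity.
    def __init__(self) -> None:
        super().__init__()
        self.model = torch.nn.Linear(in_features=1, out_features=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # type: ignore
        return self.model(x)

    def validation_step(self, batch, batch_idx):  # type: ignore
        # Batch keys follow HelloDataset.__getitem__, which yields {'x': ..., 'y': ...}.
        loss = F.mse_loss(self.forward(batch["x"]), batch["y"])
        # HyperDrive uses this metric name as its primary metric, so it must be
        # logged here for cross validation child runs to be compared.
        self.log("val/Loss", loss)
        return loss
```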