This repository was archived by the owner on Mar 21, 2024. It is now read-only.

Commit c42937c

ant0nsc authored and javier-alvarez committed
Cancel queued AzureML jobs when starting a PR build (#640)
AzureML jobs from previous PR builds on the same branch were not cancelled, consuming excessive resources. The PR build now cancels all queued and running jobs before starting new ones.
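The mechanism relies on the convention that every run of the PR build writes its AzureML runs into an experiment named after the source branch, with slashes replaced by underscores. A minimal sketch of that mapping (the helper name is hypothetical; the replacement mirrors the `branch.replace("/", "_")` step in `cancel_aml_jobs.py` further down):

```python
def experiment_name_for_branch(branch: str) -> str:
    """Map an Azure DevOps source branch to the AzureML experiment name.

    Mirrors the branch.replace("/", "_") step in cancel_aml_jobs.py, so a new
    build can find (and cancel) runs from earlier builds of the same PR.
    """
    return branch.replace("/", "_")


# Azure DevOps reports a PR build's source branch as refs/pull/<id>/merge:
print(experiment_name_for_branch("refs/pull/640/merge"))  # refs_pull_640_merge
```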
1 parent cad2e04 commit c42937c

File tree

9 files changed: +112 −18 lines

CHANGELOG.md

Lines changed: 1 addition & 0 deletions

@@ -17,6 +17,7 @@ loss.
 ### Added
 - ([#648](https://github.com/microsoft/InnerEye-DeepLearning/pull/648)) Add torch_ort to SSL SimCLR. This makes training faster.
 - ([#594](https://github.com/microsoft/InnerEye-DeepLearning/pull/594)) When supplying a "--tag" argument, the AzureML jobs use that value as the display name, to more easily distinguish runs.
+- ([#640](https://github.com/microsoft/InnerEye-DeepLearning/pull/640)) Cancel AzureML jobs from previous runs of the PR build in the same branch to reduce AML load
 - ([#577](https://github.com/microsoft/InnerEye-DeepLearning/pull/577)) Commandline switch `monitor_gpu` to monitor
   GPU utilization via Lightning's `GpuStatsMonitor`, switch `monitor_loading` to check batch loading times via
   `BatchTimeCallback`, and `pl_profiler` to turn on the Lightning profiler (`simple`, `advanced`, or `pytorch`)
azure-pipelines/azureml-conda-environment.yml

Lines changed: 8 additions & 0 deletions

@@ -0,0 +1,8 @@
+name: AzureML_SDK
+channels:
+  - defaults
+dependencies:
+  - pip=20.1.1
+  - python=3.7.3
+  - pip:
+    - azureml-sdk==1.36.0

azure-pipelines/build-pr.yml

Lines changed: 14 additions & 0 deletions

@@ -17,6 +17,12 @@ variables:
   disable.coverage.autogenerate: 'true'

 jobs:
+- job: CancelPreviousJobs
+  pool:
+    vmImage: 'ubuntu-18.04'
+  steps:
+    - template: cancel_aml_jobs.yml
+
 - job: Windows
   pool:
     vmImage: 'windows-2019'
@@ -30,6 +36,7 @@ jobs:
 - template: build.yaml

 - job: TrainInAzureML
+  dependsOn: CancelPreviousJobs
   variables:
   - name: tag
     value: 'TrainBasicModel'
@@ -48,6 +55,7 @@ jobs:
   test_run_title: tests_after_training_single_run

 - job: RunGpuTestsInAzureML
+  dependsOn: CancelPreviousJobs
   variables:
   - name: tag
     value: 'RunGpuTests'
@@ -70,6 +78,7 @@ jobs:
 # is trained, because we use this build to also check the "submit_for_inference" code, that
 # presently only handles single channel models.
 - job: TrainInAzureMLViaSubmodule
+  dependsOn: CancelPreviousJobs
   variables:
   - name: model
     value: 'BasicModel2Epochs1Channel'
@@ -90,6 +99,7 @@ jobs:

 # Train a 2-element ensemble model
 - job: TrainEnsemble
+  dependsOn: CancelPreviousJobs
   variables:
   - name: model
     value: 'BasicModelForEnsembleTest'
@@ -114,6 +124,7 @@ jobs:

 # Train a model on 2 nodes
 - job: Train2Nodes
+  dependsOn: CancelPreviousJobs
   variables:
   - name: model
     value: 'BasicModel2EpochsMoreData'
@@ -135,6 +146,7 @@ jobs:
   test_run_title: tests_after_training_2node_run

 - job: TrainHelloWorld
+  dependsOn: CancelPreviousJobs
   variables:
   - name: model
     value: 'HelloWorld'
@@ -152,6 +164,7 @@ jobs:
 # Run HelloContainer on 2 nodes. HelloContainer uses native Lightning test set inference, which can get
 # confused after doing multi-node training in the same script.
 - job: TrainHelloContainer
+  dependsOn: CancelPreviousJobs
   variables:
   - name: model
     value: 'HelloContainer'
@@ -176,6 +189,7 @@ jobs:
 # regressions in AML when requesting more than the default amount of memory. This needs to run with all subjects to
 # trigger the bug, total runtime 10min
 - job: TrainLung
+  dependsOn: CancelPreviousJobs
   variables:
   - name: model
     value: 'Lung'

azure-pipelines/build_data_quality.yaml

Lines changed: 2 additions & 0 deletions

@@ -1,6 +1,8 @@
 steps:
 - template: checkout.yml

+- template: prepare_conda.yml
+
 - bash: |
     conda env create --file InnerEye-DataQuality/environment.yml --name InnerEyeDataQuality
     source activate InnerEyeDataQuality

azure-pipelines/cancel_aml_jobs.py

Lines changed: 46 additions & 0 deletions

@@ -0,0 +1,46 @@
+# ------------------------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
+# ------------------------------------------------------------------------------------------
+import os
+
+from azureml._restclient.constants import RunStatus
+from azureml.core import Experiment, Run, Workspace
+from azureml.core.authentication import ServicePrincipalAuthentication
+
+
+def cancel_running_and_queued_jobs() -> None:
+    environ = os.environ
+    print("Authenticating")
+    auth = ServicePrincipalAuthentication(
+        tenant_id='72f988bf-86f1-41af-91ab-2d7cd011db47',
+        service_principal_id=environ["APPLICATION_ID"],
+        service_principal_password=environ["APPLICATION_KEY"])
+    print("Getting AML workspace")
+    workspace = Workspace.get(
+        name="InnerEye-DeepLearning",
+        auth=auth,
+        subscription_id=environ["SUBSCRIPTION_ID"],
+        resource_group="InnerEye-DeepLearning")
+    branch = environ["BRANCH"]
+    print(f"Branch: {branch}")
+    if not branch.startswith("refs/pull/"):
+        print("This branch is not a PR branch, hence not cancelling anything.")
+        exit(0)
+    experiment_name = branch.replace("/", "_")
+    print(f"Experiment: {experiment_name}")
+    experiment = Experiment(workspace, name=experiment_name)
+    print(f"Retrieved experiment {experiment.name}")
+    for run in experiment.get_runs(include_children=True, properties={}):
+        assert isinstance(run, Run)
+        status_suffix = f"'{run.status}' run {run.id} ({run.display_name})"
+        if run.status in (RunStatus.COMPLETED, RunStatus.FAILED, RunStatus.FINALIZING, RunStatus.CANCELED,
+                          RunStatus.CANCEL_REQUESTED):
+            print(f"Skipping {status_suffix}")
+        else:
+            print(f"Cancelling {status_suffix}")
+            run.cancel()
+
+
+if __name__ == "__main__":
+    cancel_running_and_queued_jobs()
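The status filter in this script leaves finished or already-cancelling runs alone and cancels everything else (queued, starting, running). Isolated as plain strings, the decision logic looks roughly like the sketch below; the string values are assumptions about what azureml's `RunStatus` constants expand to, not taken from the source:

```python
# Runs in these states need no cancellation; the strings are assumed to match
# the azureml RunStatus constants used in cancel_aml_jobs.py.
TERMINAL_OR_CANCELLING = {"Completed", "Failed", "Finalizing", "Canceled", "CancelRequested"}


def should_cancel(status: str) -> bool:
    """True for runs a fresh PR build should cancel (e.g. Queued, Running)."""
    return status not in TERMINAL_OR_CANCELLING


for status in ("Queued", "Running", "Completed", "CancelRequested"):
    print(f"{status}: {'cancel' if should_cancel(status) else 'skip'}")
```

Filtering by an allow-list of terminal states, rather than a deny-list of active ones, means any new or unknown active state defaults to being cancelled, which is the safe direction for resource cleanup.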
azure-pipelines/cancel_aml_jobs.yml

Lines changed: 27 additions & 0 deletions

@@ -0,0 +1,27 @@
+steps:
+- checkout: self
+
+- template: prepare_conda.yml
+
+# https://docs.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops#pythonanaconda
+- task: Cache@2
+  displayName: Use cached Conda environment AzureML_SDK
+  inputs:
+    # Beware of changing the cache key or path independently, safest to change in sync
+    key: 'usr_share_miniconda_azureml_conda | "$(Agent.OS)" | azure-pipelines/azureml-conda-environment.yml'
+    path: /usr/share/miniconda/envs
+    cacheHitVar: CONDA_CACHE_RESTORED
+
+- script: conda env create --file azure-pipelines/azureml-conda-environment.yml
+  displayName: Create Conda environment AzureML_SDK
+  condition: eq(variables.CONDA_CACHE_RESTORED, 'false')
+
+- bash: |
+    source activate AzureML_SDK
+    python azure-pipelines/cancel_aml_jobs.py
+  displayName: Cancel jobs from previous run
+  env:
+    SUBSCRIPTION_ID: $(InnerEyeDevSubscriptionID)
+    APPLICATION_ID: $(InnerEyeDeepLearningServicePrincipalID)
+    APPLICATION_KEY: $(InnerEyeDeepLearningServicePrincipalKey)
+    BRANCH: $(Build.SourceBranch)

azure-pipelines/checkout.yml

Lines changed: 0 additions & 18 deletions

@@ -2,21 +2,3 @@ steps:
 - checkout: self
   lfs: true
   submodules: true
-
-- bash: |
-    subdir=bin
-    echo "Adding this directory to PATH: $CONDA/$subdir"
-    echo "##vso[task.prependpath]$CONDA/$subdir"
-  displayName: Add conda to PATH
-  condition: succeeded()
-
-- bash: |
-    conda install conda=4.8.3 -y
-    conda --version
-    conda list
-  displayName: Print conda version and initial package list
-
-- bash: |
-    sudo chown -R $USER /usr/share/miniconda
-  condition: and(succeeded(), eq( variables['Agent.OS'], 'Linux' ))
-  displayName: Take ownership of conda installation

azure-pipelines/inner_eye_env.yml

Lines changed: 2 additions & 0 deletions

@@ -3,6 +3,8 @@ steps:

 - template: store_settings.yml

+- template: prepare_conda.yml
+
 # https://docs.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops#pythonanaconda
 - task: Cache@2
   displayName: Use cached Conda environment

azure-pipelines/prepare_conda.yml

Lines changed: 12 additions & 0 deletions

@@ -0,0 +1,12 @@
+steps:
+- bash: |
+    subdir=bin
+    echo "Adding this directory to PATH: $CONDA/$subdir"
+    echo "##vso[task.prependpath]$CONDA/$subdir"
+  displayName: Add conda to PATH
+  condition: succeeded()
+
+- bash: |
+    sudo chown -R $USER /usr/share/miniconda
+  condition: and(succeeded(), eq( variables['Agent.OS'], 'Linux' ))
+  displayName: Take ownership of conda installation

0 commit comments