Description
Describe the bug
We have multiple data science teams onboarded on SageMaker Studio. We require all users to tag SageMaker resources so that we can enforce tag-based access control. However, when using SageMaker Experiments with a PyTorch training job, the job fails with an 'Access Denied' error in the load_run() method, apparently because the required tags are not passed on when the experiment object is created. Even if an experiment with the right tags already exists, the Experiment._load_or_create method in the SDK seems to try to create the experiment first and falls back to loading it only if it encounters a 'resource already exists' exception. In my case the create call fails with 'Access Denied' instead, so the error is not caught. I can confirm that this happens only at the training job level; I am able to create both the experiment and the trial using their specific create methods.
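The control flow described above can be modelled with a small in-memory sketch. This is not the SDK's code: `FakeExperimentService`, `ResourceInUse`, `AccessDenied` and `create_first` are hypothetical stand-ins, and the sketch assumes that authorization is evaluated before the service checks whether the resource exists, which is consistent with the traceback in this report.

```python
class ResourceInUse(Exception):
    """Stand-in for the service's 'resource already exists' error."""


class AccessDenied(Exception):
    """Stand-in for AccessDeniedException."""


class FakeExperimentService:
    """Hypothetical in-memory model of a tag-gated experiment API."""

    def __init__(self, principal_team):
        self.principal_team = principal_team
        self.experiments = {}

    def create(self, name, tags=None):
        tag_map = {t["Key"]: t["Value"] for t in (tags or [])}
        # Assumption: authorization is evaluated before the existence check.
        if tag_map.get("team") != self.principal_team:
            raise AccessDenied(f"create denied for {name!r}")
        if name in self.experiments:
            raise ResourceInUse(name)
        self.experiments[name] = tag_map
        return name

    def load(self, name):
        if name not in self.experiments:
            raise KeyError(name)
        return name


def create_first(service, name, tags=None):
    """Create-first flow: only 'already exists' triggers the load fallback."""
    try:
        return service.create(name, tags=tags)
    except ResourceInUse:
        return service.load(name)


team_tags = [{"Key": "team", "Value": "test"}]
service = FakeExperimentService(principal_team="test")
service.create("MNIST", tags=team_tags)  # the experiment now exists

# With tags, the create attempt hits 'already exists' and falls back to load.
with_tags = create_first(service, "MNIST", tags=team_tags)

# Without tags, the create attempt is denied and the fallback never runs.
try:
    create_first(service, "MNIST")
    outcome = "loaded"
except AccessDenied:
    outcome = "access denied"
```

In this model the second call fails even though the experiment exists, because the denial is raised before the 'already exists' branch can be reached.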
To reproduce
The SageMaker execution role should have the policy below in addition to the sagemaker:AddTags permission on all resources. Tag the execution role itself with the key 'team' and the value 'test' to allow for the aws:PrincipalTag comparison.
[
  {
    "Condition": {
      "StringEquals": {
        "aws:RequestTag/team": "${aws:PrincipalTag/team}"
      }
    },
    "Action": "sagemaker:Create*",
    "Resource": "*",
    "Effect": "Allow"
  },
  {
    "Condition": {
      "StringEquals": {
        "aws:ResourceTag/key": "${aws:PrincipalTag/team}"
      }
    },
    "Effect": "Allow",
    "Action": "sagemaker:CreateTrial",
    "Resource": ["arn:aws:sagemaker:*:*:experiment/*"]
  }
]
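To illustrate why a create request without tags is rejected under the first statement, here is a deliberately simplified model of the `StringEquals` condition on `aws:RequestTag/team`. It is a sketch, not IAM's evaluation engine: `request_context`, `string_equals` and `create_allowed` are hypothetical helpers, and only `${aws:PrincipalTag/...}` variables are handled.

```python
import re

# Matches policy variables of the form ${aws:PrincipalTag/<key>}.
_PRINCIPAL_TAG = re.compile(r"\$\{aws:PrincipalTag/([^}]+)\}")


def request_context(tags):
    """Build aws:RequestTag/<key> context entries from a request's tag list."""
    return {f"aws:RequestTag/{t['Key']}": t["Value"] for t in tags}


def string_equals(condition, context, principal_tags):
    """Simplified StringEquals: every key must be present and match exactly."""
    for context_key, pattern in condition.items():
        try:
            expected = _PRINCIPAL_TAG.sub(
                lambda m: principal_tags[m.group(1)], pattern
            )
        except KeyError:
            return False  # unresolved variable: treated as no match here
        if context.get(context_key) != expected:
            return False
    return True


condition = {"aws:RequestTag/team": "${aws:PrincipalTag/team}"}
principal_tags = {"team": "test"}  # tag on the execution role


def create_allowed(tags):
    return string_equals(condition, request_context(tags), principal_tags)


tagged = create_allowed([{"Key": "team", "Value": "test"}])
untagged = create_allowed([])
wrong_team = create_allowed([{"Key": "team", "Value": "other"}])
```

Under this model a request that carries no `team` tag has no `aws:RequestTag/team` entry in its context, so the condition does not match and the Allow statement does not apply.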
The following works
from smexperiments import experiment
from smexperiments.trial import Trial

default_tags = [{'Key': 'team', 'Value': 'test'}]
experiment = experiment.Experiment.create(experiment_name='MNIST', tags=default_tags)
trial = Trial.create(trial_name="linear-learner2", experiment_name="MNIST", tags=default_tags)
The below throws an 'Access Denied' exception
from sagemaker.experiments.run import Run
from sagemaker.pytorch import PyTorch

with Run(experiment_name='MNIST', run_name=run_name, sagemaker_session=sess, tags=default_tags) as run:
    pytorch_estimator = PyTorch(entry_point='train_script_expt.py',
                                ....
                                tags=default_tags)
    pytorch_estimator.fit({'train': trainpath,
                           'test': testpath})
Training job script which throws the error
from sagemaker.experiments.run import load_run

with load_run(sagemaker_session=session) as run:
    run.log_parameters(
        {"device": device, "epochs": args.epochs}
    )
Expected behavior
I expected the tags to be passed on to the experiment and trial creation within the SageMaker job. Alternatively, if the guidance is to make sure that the experiment and trial already exist, then the SDK should load the experiment regardless of the error thrown by the create call in the try block.
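One way to picture the second option is a load-first helper that only attempts creation when the resource is genuinely missing. This is a generic sketch, not a proposed patch to the SDK: `load_or_create` and the callables passed to it are hypothetical, `LookupError` and `PermissionError` stand in for the service's 'not found' and 'Access Denied' errors, and it assumes the role is allowed to read the existing experiment.

```python
def load_or_create(load, create, not_found=(LookupError,)):
    """Try to load first; call create only if the resource was not found."""
    try:
        return load()
    except not_found:
        return create()


# In-memory stand-in for an experiment that already exists with the right tag.
store = {"MNIST": {"team": "test"}}
create_calls = []


def load_mnist():
    return store["MNIST"]  # raises KeyError (a LookupError) if missing


def create_denied():
    # Models a create call that would be rejected for lack of tags.
    create_calls.append("MNIST")
    raise PermissionError("create not allowed without tags")


# The existing experiment is returned and the denied create is never invoked.
loaded = load_or_create(load_mnist, create_denied)

# A missing resource still falls through to creation.
created = load_or_create(lambda: store["CIFAR"], lambda: {"team": "test"})
```

The trade-off of this ordering is one extra read call when the resource does not exist yet, in exchange for never needing create permissions when it does.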
Screenshots or logs
Error stack trace
Traceback (most recent call last):
  File "train_script_expt.py", line 190, in <module>
    with load_run(sagemaker_session=session) as run:
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/experiments/run.py", line 847, in load_run
    run_instance = Run(
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/experiments/run.py", line 177, in __init__
    self._experiment = Experiment._load_or_create(
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/experiments/experiment.py", line 171, in _load_or_create
    raise ce
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/experiments/experiment.py", line 160, in _load_or_create
    experiment = Experiment.create(
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/experiments/experiment.py", line 120, in create
    return cls._construct(
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/apiutils/_base_types.py", line 190, in _construct
    return instance._invoke_api(boto_method_name, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/apiutils/_base_types.py", line 226, in _invoke_api
    api_boto_response = api_method(**api_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/botocore/client.py", line 534, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/botocore/client.py", line 976, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDeniedException) when calling the CreateExperiment operation: User: arn:aws:sts::XXXX:assumed-role/aimlSagemakerStudio-xyzteam-LTWZ4NF3WWVM/SageMaker is not authorized to perform: sagemaker:CreateExperiment on resource: arn:aws:sagemaker:us-east-2:XXXexperiment/xxxx because no identity-based policy allows the sagemaker:CreateExperiment action
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 2.171.0
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
- Framework version: 1.12
- Python version: 3.8
- CPU or GPU: CPU
- Custom Docker image (Y/N): N