SageMaker experiments not being created within training job when mandated to have tags #3989

@inchara1990

Describe the bug
We have multiple data science teams onboarded on SageMaker Studio, and we require all users to tag SageMaker resources so that we can enforce tag-based access control. When SageMaker Experiments is used with a PyTorch training job, the job fails with an 'Access Denied' error in load_run(), apparently because the required tags are not passed on to the experiment creation call. Even when an experiment with the right tags already exists, the Experiment._load_or_create method in the SDK first tries to create the experiment and falls back to loading it only if it encounters a 'resource already exists' exception. In my case the create call fails with 'Access Denied' instead, so the exception is not caught and is re-raised. I can confirm that this happens only inside the training job; I am able to create both the experiment and the trial using their respective create methods.

To reproduce
The SageMaker execution role should have the policy below, in addition to the sagemaker:AddTags permission on all resources. Tag the execution role itself with the key 'team' and the value 'test' to allow the aws:PrincipalTag comparison.

[
  {
    "Condition": {
      "StringEquals": {
        "aws:RequestTag/team": "${aws:PrincipalTag/team}"
      }
    },
    "Action": "sagemaker:Create*",
    "Resource": "*",
    "Effect": "Allow"
  },
  {
    "Condition": {
      "StringEquals": {
        "aws:ResourceTag/key": "${aws:PrincipalTag/team}"
      }
    },
    "Effect": "Allow",
    "Action": "sagemaker:CreateTrial",
    "Resource": ["arn:aws:sagemaker:*:*:experiment/*"]
  }
]
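To make the effect of the first statement concrete, here is a deliberately simplified Python model of its condition. This is not how IAM is implemented, and the function name `request_tag_statement_allows` is hypothetical. It only captures one rule: a `StringEquals` condition on `aws:RequestTag/team` does not match when the request carries no `team` tag, so a create call made without tags gets no Allow from this statement.

```python
def request_tag_statement_allows(action, request_tags, principal_tags):
    """Simplified model of the first policy statement (hypothetical helper).

    Returns True only if the action is a sagemaker:Create* action and the
    request carries a 'team' tag equal to the caller's 'team' principal tag.
    """
    if not action.startswith("sagemaker:Create"):
        return False
    principal_team = principal_tags.get("team")
    request_team = request_tags.get("team")
    # StringEquals does not match when the condition key is missing from
    # the request, so an untagged create request is not allowed here.
    return request_team is not None and request_team == principal_team


role_tags = {"team": "test"}

# Create call that passes the mandated tag.
tagged = request_tag_statement_allows(
    "sagemaker:CreateExperiment", {"team": "test"}, role_tags
)

# Create call without tags, as seems to happen inside the training job.
untagged = request_tag_statement_allows(
    "sagemaker:CreateExperiment", {}, role_tags
)

print(tagged, untagged)
```

Under this model, only the tagged request is allowed, which is consistent with the explicit create calls succeeding and the untagged create inside the training job being denied.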

The following works

    from smexperiments.experiment import Experiment
    from smexperiments.trial import Trial

    # Tags required by the tag-based access control policy
    default_tags = [{'Key': 'team', 'Value': 'test'}]
    experiment = Experiment.create(experiment_name='MNIST', tags=default_tags)
    trial = Trial.create(trial_name='linear-learner2', experiment_name='MNIST', tags=default_tags)

The following throws an 'Access Denied' exception

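    from sagemaker.experiments.run import Run
    from sagemaker.pytorch import PyTorch

    # run_name, sess, trainpath, and testpath are defined elsewhere in the notebook
    with Run(experiment_name='MNIST', run_name=run_name, sagemaker_session=sess, tags=default_tags) as run:
        pytorch_estimator = PyTorch(entry_point='train_script_expt.py',
                                    ....
                                    tags=default_tags)
        pytorch_estimator.fit({'train': trainpath,
                               'test': testpath})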

Training job script which throws the error

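    from sagemaker.experiments.run import load_run

    # session, device, and args are defined earlier in the training script
    with load_run(sagemaker_session=session) as run:
        run.log_parameters({"device": device, "epochs": args.epochs})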

Expected behavior
I expected the tags to be passed on to experiment and trial creation inside the SageMaker training job. Alternatively, if the guidance is to make sure that the experiment and trial already exist, then the SDK should load the existing experiment regardless of the error thrown by the create call in the try block.
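The difference between the two orderings can be sketched with a small in-memory stand-in. Everything here is hypothetical and written for illustration only: `FakeClientError`, `FakeExperimentService`, `create_then_load`, and `load_then_create` are not SDK names, and the error codes are placeholders. The fake service assumes, as the traceback suggests, that an untagged create is rejected with an access-denied error before any "already exists" check happens.

```python
class FakeClientError(Exception):
    """Stand-in for a service client error; carries only an error code."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


class FakeExperimentService:
    """Hypothetical in-memory service that requires a 'team' tag on create."""

    def __init__(self, required_team):
        self.required_team = required_team
        self.experiments = {}

    def create(self, name, tags=None):
        team = {t["Key"]: t["Value"] for t in (tags or [])}.get("team")
        # Assumption: authorization is evaluated first, so an untagged
        # create is denied even when the experiment already exists.
        if team != self.required_team:
            raise FakeClientError("AccessDeniedException")
        if name in self.experiments:
            raise FakeClientError("ResourceAlreadyExists")
        self.experiments[name] = {"name": name, "tags": tags}
        return self.experiments[name]

    def load(self, name):
        if name not in self.experiments:
            raise FakeClientError("ResourceNotFound")
        return self.experiments[name]


def create_then_load(service, name, tags=None):
    """Ordering described in this report: create first, load on 'exists'."""
    try:
        return service.create(name, tags)
    except FakeClientError as err:
        if err.code != "ResourceAlreadyExists":
            raise  # access-denied errors propagate from here
        return service.load(name)


def load_then_create(service, name, tags=None):
    """Alternative ordering: load first, create only when nothing is found."""
    try:
        return service.load(name)
    except FakeClientError as err:
        if err.code != "ResourceNotFound":
            raise
        return service.create(name, tags)


service = FakeExperimentService(required_team="test")
service.create("MNIST", tags=[{"Key": "team", "Value": "test"}])

try:
    create_then_load(service, "MNIST")  # no tags, as inside the training job
    outcome = "loaded"
except FakeClientError as err:
    outcome = err.code

loaded = load_then_create(service, "MNIST")  # no tags needed to load
print(outcome, loaded["name"])
```

In this sketch the create-first ordering fails on the pre-existing, correctly tagged experiment, while the load-first ordering returns it without needing create permission. Whether the real service and SDK can adopt that ordering is a question for the maintainers.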

Screenshots or logs
Error stack trace:

Traceback (most recent call last):
  File "train_script_expt.py", line 190, in <module>
    with load_run(sagemaker_session=session) as run:
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/experiments/run.py", line 847, in load_run
    run_instance = Run(
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/experiments/run.py", line 177, in __init__
    self._experiment = Experiment._load_or_create(
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/experiments/experiment.py", line 171, in _load_or_create
    raise ce
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/experiments/experiment.py", line 160, in _load_or_create
    experiment = Experiment.create(
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/experiments/experiment.py", line 120, in create
    return cls._construct(
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/apiutils/_base_types.py", line 190, in _construct
    return instance._invoke_api(boto_method_name, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/apiutils/_base_types.py", line 226, in _invoke_api
    api_boto_response = api_method(**api_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/botocore/client.py", line 534, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/botocore/client.py", line 976, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDeniedException) when calling the CreateExperiment operation: User: arn:aws:sts::XXXX:assumed-role/aimlSagemakerStudio-xyzteam-LTWZ4NF3WWVM/SageMaker is not authorized to perform: sagemaker:CreateExperiment on resource: arn:aws:sagemaker:us-east-2:XXXexperiment/xxxx because no identity-based policy allows the sagemaker:CreateExperiment action

System information

  • SageMaker Python SDK version: 2.171.0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
  • Framework version: 1.12
  • Python version: 3.8
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N
