[Bug Report] - SageMaker Pipelines to Run Jobs Locally #3635

@fjpa121197

Description

Link to the notebook

Following example from: https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-pipelines/tabular/local-mode/sagemaker-pipelines-local-mode.ipynb

My notebook with the error (I made some minor modifications to it): https://github.com/fjpa121197/aws-sagemaker-training/blob/main/sagemaker-pipelines-local-mode.ipynb

Describe the bug
I'm trying to follow this tutorial to run SageMaker Pipelines locally and test them before moving to managed resources. I have created a pipeline definition that includes preprocessing, training, and evaluation steps. I can create the pipeline without any problem, but when executing it, I hit an error in the evaluation step: the model.tar.gz file is not downloaded into the container at the expected directory, so the evaluation script cannot load the model.
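
For reference, the pipeline itself is assembled with a LocalPipelineSession, roughly like this (a minimal sketch following the example notebook; the variable names role, step_process, step_train, and step_eval are assumptions):

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import LocalPipelineSession

local_pipeline_session = LocalPipelineSession()

pipeline = Pipeline(
    name="AbalonePipelineLocal",
    steps=[step_process, step_train, step_eval],  # steps are defined below
    sagemaker_session=local_pipeline_session,
)

pipeline.upsert(role_arn=role)  # register the pipeline definition
execution = pipeline.start()    # execution fails at the AbaloneEval step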

Error:

Starting pipeline step: 'AbaloneEval'
Container jhais7c823-algo-1-7ko39  Creating
Container jhais7c823-algo-1-7ko39  Created
Attaching to jhais7c823-algo-1-7ko39
jhais7c823-algo-1-7ko39  | Traceback (most recent call last):
jhais7c823-algo-1-7ko39  |   File "/opt/ml/processing/input/code/evaluation.py", line 16, in <module>
jhais7c823-algo-1-7ko39  |     with tarfile.open(model_path) as tar:
jhais7c823-algo-1-7ko39  |   File "/miniconda3/lib/python3.8/tarfile.py", line 1603, in open
jhais7c823-algo-1-7ko39  |     return func(name, "r", fileobj, **kwargs)
jhais7c823-algo-1-7ko39  |   File "/miniconda3/lib/python3.8/tarfile.py", line 1667, in gzopen
jhais7c823-algo-1-7ko39  |     fileobj = GzipFile(name, mode + "b", compresslevel, fileobj)
jhais7c823-algo-1-7ko39  |   File "/miniconda3/lib/python3.8/gzip.py", line 173, in __init__
jhais7c823-algo-1-7ko39  |     fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
jhais7c823-algo-1-7ko39  | FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/processing/model/model.tar.gz'

jhais7c823-algo-1-7ko39 exited with code 1
Aborting on container exit...
Container jhais7c823-algo-1-7ko39  Stopping
Container jhais7c823-algo-1-7ko39  Stopped
Pipeline step 'AbaloneEval' FAILED. Failure message is: RuntimeError: Failed to run: ['docker-compose', '-f', 'C:\\Users\\FRANCI~1.PAR\\AppData\\Local\\Temp\\tmp188wz79r\\docker-compose.yaml', 'up', '--build', '--abort-on-container-exit']
Pipeline execution 1012b92d-36c6-4499-b898-d78d7a2bea8a FAILED because step 'AbaloneEval' failed.
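
The traceback points at the top of evaluation.py, which (reconstructed from the traceback; the full script is in the example repo) does roughly the following:

import tarfile

model_path = "/opt/ml/processing/model/model.tar.gz"
with tarfile.open(model_path) as tar:  # evaluation.py line 16, where the FileNotFoundError is raised
    tar.extractall(path=".")

So the script itself looks fine; the file simply never arrives in /opt/ml/processing/model.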

For reference, the evaluation step definition is as follows:

Job Name:  script-abalone-eval-2022-10-25-09-04-44-205
Inputs:  [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': <sagemaker.workflow.properties.Properties object at 0x000002647A7F1DC0>, 'LocalPath': '/opt/ml/processing/model', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'input-2', 'AppManaged': False, 'S3Input': {'S3Uri': <sagemaker.workflow.properties.Properties object at 0x000002647A11EB80>, 'LocalPath': '/opt/ml/processing/test', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-local-pipeline-tutorials/script-abalone-eval-2022-10-25-09-04-44-205/input/code/evaluation.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'evaluation', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-local-pipeline-tutorials/script-abalone-eval-2022-10-25-09-04-44-205/output/evaluation', 'LocalPath': '/opt/ml/processing/evaluation', 'S3UploadMode': 'EndOfJob'}}]

And my eval_args definition is as follows:

from sagemaker.processing import ProcessingInput, ProcessingOutput

eval_args = script_eval.run(
    inputs=[
        # Model artifacts from the training step; this is the input that is
        # not being downloaded into the container.
        ProcessingInput(
            source=step_train.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model",
        ),
        # Test split produced by the preprocessing step; this one downloads fine.
        ProcessingInput(
            source=step_process.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
            destination="/opt/ml/processing/test",
        ),
    ],
    outputs=[
        ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation"),
    ],
    code="code/evaluation.py",
)

Here, the source for the first input refers to the step_train step defined earlier and should download the model artifacts, but it does not. The second input works: the test data is downloaded into the container, but the model artifacts never are.
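
For reference, step_train is the training step defined earlier in the notebook, along these lines (a sketch following the example notebook; the step name and train_args are assumptions):

from sagemaker.workflow.steps import TrainingStep

step_train = TrainingStep(
    name="AbaloneTrain",
    step_args=train_args,  # train_args comes from estimator.fit(...) under the local pipeline session
)
# At run time, step_train.properties.ModelArtifacts.S3ModelArtifacts should
# resolve to the S3 URI of the model.tar.gz produced by this step.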

I'm not sure if there is a replacement for the source=step_train.properties.ModelArtifacts.S3ModelArtifacts argument.
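
A debugging idea (a sketch, not a confirmed fix): temporarily replace the property with a hard-coded model.tar.gz URI from an earlier training run. If the file then shows up in /opt/ml/processing/model, the problem is in how local mode resolves step properties rather than in the processing input itself.

ProcessingInput(
    # Hypothetical hard-coded path for debugging only; substitute a real
    # model.tar.gz URI from a previous training job in your bucket.
    source="s3://sagemaker-local-pipeline-tutorials/<training-job-name>/output/model.tar.gz",
    destination="/opt/ml/processing/model",
),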

Am I doing something wrong? I don't think it is permission/policy related, since I don't get any AccessDenied errors.

I'm using sagemaker 2.113.0.

Thanks in advance
