bug in local mode docker-compose.yaml  #2623

@quetil

Describe the bug
When using local mode on Windows, Estimator.fit() does not work.
To reproduce

import sagemaker
from sagemaker.local import LocalSession
from sagemaker.tensorflow import TensorFlow

sagemaker_session = LocalSession()
sagemaker_session.config = {'local': {'local_code': True}}
tf_estimator = TensorFlow(
        # env
        py_version='py37',
        framework_version='2.4',

        # instance
        instance_count=1,
        instance_type='local',

        # job
        base_job_name='test',
        source_dir=r"C:/Users/me/project/",
        entry_point=r"C:/Users/me/project/train.py",
        role=role,  # IAM role ARN, assumed to be defined earlier
        sagemaker_session=sagemaker_session,
        script_mode=True,
)
tf_estimator.fit(r"file://C:/Users/me/project/data", wait=False)

However, the generated docker-compose.yaml file is incorrect. Its volumes entry contains:
- C:\Users\me\project\data:/opt/ml/input/data/training
- /Users/me/project/:/opt/ml/code

The last line is wrong: the drive letter (C:) is missing from the host path. I tried several values for source_dir:

  • source_dir="C:/Users/me/project/"
  • source_dir="file://C:/Users/me/project/"
  • source_dir="file:///C:/Users/me/project/"
  • source_dir=r"file://C:/Users/me/project/"
  • source_dir=r"C:/Users/me/project/"

None of them fixes the problem.

EDIT
The error seems to originate in sagemaker.local.image._prepare_training_volumes() and its use of the urllib.parse.urlparse() function:

training_dir
Out[2]: 'file://C:/Users/me/project/'
urlparse(training_dir)
Out[3]: ParseResult(scheme='file', netloc='C:', path='/Users/me/project/', params='', query='', fragment='')

Only the "path" component is then written to the docker-compose.yaml file. This leads to an error because Docker can't find /Users/me/project/: the C: drive letter, which urlparse() placed in "netloc", is missing.
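
The parsing behaviour can be reproduced outside the SDK with the standard library alone. This is a minimal sketch using the same example path as above. The three-slash form is included only for comparison and is not taken from the SDK code:

```python
from urllib.parse import urlparse

# Two-slash form: the drive letter is treated as the network location.
two_slash = urlparse("file://C:/Users/me/project/")
print(two_slash.netloc)  # C:
print(two_slash.path)    # /Users/me/project/

# Three-slash form: netloc is empty and the drive letter stays in the path,
# but with a leading slash in front of it.
three_slash = urlparse("file:///C:/Users/me/project/")
print(repr(three_slash.netloc))  # ''
print(three_slash.path)          # /C:/Users/me/project/
```

In both cases, using "path" unchanged as the host side of a volume mapping does not give a valid Windows path.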

Is there a workaround? I fixed it for my use case by replacing
volumes.append(_Volume(parsed_uri.path, "/opt/ml/code")) with volumes.append(_Volume(parsed_uri.netloc+parsed_uri.path, "/opt/ml/code")), but I don't know what side effects this may have.
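
One possible way to limit those side effects is to reattach netloc only when it looks like a Windows drive letter. The sketch below is an illustration, not the SDK's implementation. The helper name host_path_from_file_uri is hypothetical, and it assumes the input is always a file:// URI:

```python
import re
from urllib.parse import urlparse


def host_path_from_file_uri(uri):
    """Hypothetical helper: return the host path of a file:// URI,
    keeping a Windows drive letter if one is present."""
    parsed = urlparse(uri)
    # file://C:/Users/... -> netloc is "C:", path is "/Users/..."
    if re.fullmatch(r"[A-Za-z]:", parsed.netloc):
        return parsed.netloc + parsed.path
    # file:///C:/Users/... -> netloc is "", path is "/C:/Users/..."
    if re.match(r"/[A-Za-z]:/", parsed.path):
        return parsed.path[1:]
    # POSIX-style URI such as file:///opt/ml/code: the path is already usable.
    return parsed.path


print(host_path_from_file_uri("file://C:/Users/me/project/"))   # C:/Users/me/project/
print(host_path_from_file_uri("file:///C:/Users/me/project/"))  # C:/Users/me/project/
print(host_path_from_file_uri("file:///opt/ml/code"))           # /opt/ml/code
```

Checking for a drive-letter pattern leaves POSIX paths untouched, which plain concatenation of netloc and path would not guarantee if netloc ever held a host name.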

Expected behavior
The docker-compose.yaml file is written correctly and the training job launches.

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.59.1
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): TensorFlow
  • Framework version: 2.4
  • Python version: 3.7
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N

Additional context
Docker image: 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-training (2.4-cpu-py37)

Thank you for your help!
