CalledProcessError pulling PyTorch image for local training job #1105

@elicutler

Description


System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): PyTorch
  • Framework Version: 1.1, 1.2
  • Python Version: 3.7.4
  • CPU or GPU: CPU (ml.t2.xlarge, ml.t2.medium)
  • Python SDK Version: 1.43.3
  • Are you using a custom image: No

Conda env:

channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _pytorch_select=0.2=gpu_0
  - asn1crypto=0.24.0=py37_0
  - astroid=2.3.1=py37_0
  - attrs=19.1.0=py37_1
  - backcall=0.1.0=py37_0
  - blas=1.0=mkl
  - bleach=3.1.0=py37_0
  - ca-certificates=2019.8.28=0
  - certifi=2019.9.11=py37_0
  - cffi=1.12.3=py37h2e261b9_0
  - chardet=3.0.4=py37_1003
  - cryptography=2.7=py37h1ba5d50_0
  - cudatoolkit=10.0.130=0
  - cudnn=7.6.0=cuda10.0_0
  - dbus=1.13.6=h746ee38_0
  - decorator=4.4.0=py37_1
  - defusedxml=0.6.0=py_0
  - entrypoints=0.3=py37_0
  - expat=2.2.6=he6710b0_0
  - fontconfig=2.13.0=h9420a91_0
  - freetype=2.9.1=h8a8886c_1
  - glib=2.56.2=hd408876_0
  - gmp=6.1.2=h6c8ec71_1
  - gst-plugins-base=1.14.0=hbbd80ab_1
  - gstreamer=1.14.0=hb453b48_1
  - icu=58.2=h9c2bf20_1
  - idna=2.8=py37_0
  - intel-openmp=2019.4=243
  - ipykernel=5.1.2=py37h39e3cac_0
  - ipython=7.8.0=py37h39e3cac_0
  - ipython_genutils=0.2.0=py37_0
  - ipywidgets=7.5.1=py_0
  - isort=4.3.21=py37_0
  - jedi=0.15.1=py37_0
  - jinja2=2.10.1=py37_0
  - jpeg=9b=h024ee3a_2
  - jsonschema=3.0.2=py37_0
  - jupyter=1.0.0=py37_7
  - jupyter_client=5.3.3=py37_1
  - jupyter_console=6.0.0=py37_0
  - jupyter_core=4.5.0=py_0
  - lazy-object-proxy=1.4.2=py37h7b6447c_0
  - libedit=3.1.20181209=hc058e9b_0
  - libffi=3.2.1=hd88cf55_4
  - libgcc-ng=9.1.0=hdf63c60_0
  - libgfortran-ng=7.3.0=hdf63c60_0
  - libpng=1.6.37=hbc83047_0
  - libsodium=1.0.16=h1bed415_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - libtiff=4.0.10=h2733197_2
  - libuuid=1.0.3=h1bed415_2
  - libxcb=1.13=h1bed415_1
  - libxml2=2.9.9=hea5a465_1
  - markupsafe=1.1.1=py37h7b6447c_0
  - mccabe=0.6.1=py37_1
  - mistune=0.8.4=py37h7b6447c_0
  - mkl=2019.4=243
  - mkl-service=2.3.0=py37he904b0f_0
  - mkl_fft=1.0.14=py37ha843d7b_0
  - mkl_random=1.1.0=py37hd6b4f25_0
  - nb_conda_kernels=2.2.2=py37_0
  - nbconvert=5.6.0=py37_1
  - nbformat=4.4.0=py37_0
  - ncurses=6.1=he6710b0_1
  - ninja=1.9.0=py37hfd86e86_0
  - notebook=6.0.1=py37_0
  - numpy=1.17.2=py37haad9e8e_0
  - numpy-base=1.17.2=py37hde5b4d6_0
  - olefile=0.46=py37_0
  - openssl=1.1.1d=h7b6447c_2
  - pandas=0.25.1=py37he6710b0_0
  - pandoc=2.2.3.2=0
  - pandocfilters=1.4.2=py37_1
  - parso=0.5.1=py_0
  - pcre=8.43=he6710b0_0
  - pexpect=4.7.0=py37_0
  - pickleshare=0.7.5=py37_0
  - pillow=6.1.0=py37h34e0f95_0
  - pip=19.2.3=py37_0
  - prometheus_client=0.7.1=py_0
  - prompt_toolkit=2.0.9=py37_0
  - ptyprocess=0.6.0=py37_0
  - pycparser=2.19=py37_0
  - pygments=2.4.2=py_0
  - pylint=2.4.2=py37_0
  - pyopenssl=19.0.0=py37_0
  - pyqt=5.9.2=py37h05f1152_2
  - pyrsistent=0.15.4=py37h7b6447c_0
  - pysocks=1.7.1=py37_0
  - python=3.7.4=h265db76_1
  - python-dateutil=2.8.0=py37_0
  - pytorch=1.2.0=cuda100py37h938c94c_0
  - pytz=2019.2=py_0
  - pyzmq=18.1.0=py37he6710b0_0
  - qt=5.9.7=h5867ecd_1
  - qtconsole=4.5.5=py_0
  - readline=7.0=h7b6447c_5
  - send2trash=1.5.0=py37_0
  - setuptools=41.2.0=py37_0
  - sip=4.19.8=py37hf484d3e_0
  - six=1.12.0=py37_0
  - sqlite=3.30.0=h7b6447c_0
  - terminado=0.8.2=py37_0
  - testpath=0.4.2=py37_0
  - tk=8.6.8=hbc83047_0
  - torchvision=0.4.0=cuda100py37hecfc37a_0
  - tornado=6.0.3=py37h7b6447c_0
  - traitlets=4.3.2=py37_0
  - urllib3=1.24.2=py37_0
  - wcwidth=0.1.7=py37_0
  - webencodings=0.5.1=py37_1
  - wheel=0.33.6=py37_0
  - widgetsnbextension=3.5.1=py37_0
  - wrapt=1.11.2=py37h7b6447c_0
  - xz=5.2.4=h14c3975_4
  - zeromq=4.3.1=he6710b0_3
  - zlib=1.2.11=h7b6447c_3
  - zstd=1.3.7=h0b5b093_0
  - pip:
    - bcrypt==3.1.7
    - boto3==1.9.243
    - botocore==1.12.243
    - cached-property==1.5.1
    - docker==3.7.3
    - docker-compose==1.24.1
    - docker-pycreds==0.4.0
    - dockerpty==0.4.1
    - docopt==0.6.2
    - docutils==0.15.2
    - fabric==2.5.0
    - invoke==1.3.0
    - jmespath==0.9.4
    - paramiko==2.6.0
    - protobuf==3.10.0
    - protobuf3-to-dict==0.1.5
    - pynacl==1.3.0
    - pyyaml==3.13
    - requests==2.20.1
    - s3transfer==0.2.1
    - sagemaker==1.43.3
    - scipy==1.3.1
    - texttable==0.9.1
    - torch==1.2.0
    - websocket-client==0.56.0

Describe the problem

Fitting an estimator in local mode (train_instance_type='local') on a SageMaker notebook instance fails with a CalledProcessError while pulling the PyTorch image. The error started a couple of days ago, on code that previously ran fine.

Minimal repro / logs

I run the following code in a SageMaker notebook instance of type ml.t2.xlarge or ml.t2.medium and get the same error in both cases. I also get a very similar error if I switch framework_version to 1.2.

import os
import sagemaker

from sagemaker.pytorch import PyTorch

from constants import S3_PREFIX

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

estimator = PyTorch(
    entry_point='train.py', source_dir='.', role=role,
    train_instance_count=1, train_instance_type='local',
    framework_version='1.1'
)
try:
    estimator.fit({
        'train_dir': f's3://{bucket}/{S3_PREFIX}/train',
        'val_dir': f's3://{bucket}/{S3_PREFIX}/val'
    })
finally:
    # otherwise docker tmp garbage will fill up disk
    os.system('sudo rm -rf /tmp/tmp*')

This yields the following error:

---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
<ipython-input-14-e10fe05fd9ee> in <module>
      7     estimator.fit({
      8         'train_dir': f's3://{bucket}/{S3_PREFIX}/train',
----> 9         'val_dir': f's3://{bucket}/{S3_PREFIX}/val'
     10     })
     11 finally:

~/anaconda3/envs/home-listings/lib/python3.7/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name)
    337         self._prepare_for_training(job_name=job_name)
    338 
--> 339         self.latest_training_job = _TrainingJob.start_new(self, inputs)
    340         if wait:
    341             self.latest_training_job.wait(logs=logs)

~/anaconda3/envs/home-listings/lib/python3.7/site-packages/sagemaker/estimator.py in start_new(cls, estimator, inputs)
    861         cls._add_spot_checkpoint_args(local_mode, estimator, train_args)
    862 
--> 863         estimator.sagemaker_session.train(**train_args)
    864 
    865         return cls(estimator.sagemaker_session, estimator._current_job_name)

~/anaconda3/envs/home-listings/lib/python3.7/site-packages/sagemaker/session.py in train(self, input_mode, input_config, role, job_name, output_config, resource_config, vpc_config, hyperparameters, stop_condition, tags, metric_definitions, enable_network_isolation, image, algorithm_arn, encrypt_inter_container_traffic, train_use_spot_instances, checkpoint_s3_uri, checkpoint_local_path)
    392         LOGGER.info("Creating training-job with name: %s", job_name)
    393         LOGGER.debug("train request: %s", json.dumps(train_request, indent=4))
--> 394         self.sagemaker_client.create_training_job(**train_request)
    395 
    396     def compile_model(

~/anaconda3/envs/home-listings/lib/python3.7/site-packages/sagemaker/local/local_session.py in create_training_job(self, TrainingJobName, AlgorithmSpecification, OutputDataConfig, ResourceConfig, InputDataConfig, **kwargs)
     99         training_job = _LocalTrainingJob(container)
    100         hyperparameters = kwargs["HyperParameters"] if "HyperParameters" in kwargs else {}
--> 101         training_job.start(InputDataConfig, OutputDataConfig, hyperparameters, TrainingJobName)
    102 
    103         LocalSagemakerClient._training_jobs[TrainingJobName] = training_job

~/anaconda3/envs/home-listings/lib/python3.7/site-packages/sagemaker/local/entities.py in start(self, input_data_config, output_data_config, hyperparameters, job_name)
     87 
     88         self.model_artifacts = self.container.train(
---> 89             input_data_config, output_data_config, hyperparameters, job_name
     90         )
     91         self.end_time = datetime.datetime.now()

~/anaconda3/envs/home-listings/lib/python3.7/site-packages/sagemaker/local/image.py in train(self, input_data_config, output_data_config, hyperparameters, job_name)
    139 
    140         if _ecr_login_if_needed(self.sagemaker_session.boto_session, self.image):
--> 141             _pull_image(self.image)
    142 
    143         process = subprocess.Popen(

~/anaconda3/envs/home-listings/lib/python3.7/site-packages/sagemaker/local/image.py in _pull_image(image)
    831     logger.info("docker command: %s", pull_image_command)
    832 
--> 833     subprocess.check_output(pull_image_command, shell=True)
    834     logger.info("image pulled: %s", image)

~/anaconda3/envs/home-listings/lib/python3.7/subprocess.py in check_output(timeout, *popenargs, **kwargs)
    393 
    394     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
--> 395                **kwargs).stdout
    396 
    397 

~/anaconda3/envs/home-listings/lib/python3.7/subprocess.py in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    485         if check and retcode:
    486             raise CalledProcessError(retcode, process.args,
--> 487                                      output=stdout, stderr=stderr)
    488     return CompletedProcess(process.args, retcode, stdout, stderr)
    489 

CalledProcessError: Command 'docker pull 520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-pytorch:1.1-cpu-py3' returned non-zero exit status 1.
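The traceback only reports the exit status; the text that docker wrote to stderr is not shown. A minimal sketch for rerunning the same pull by hand and printing docker's own message, assuming docker is on the PATH of the notebook instance. The helper names `run_and_report` and `pull_image_verbose` are made up for illustration and are not part of the SageMaker SDK:

```python
import subprocess
import sys


def run_and_report(command):
    """Run a command given as a list of arguments.

    Returns (returncode, stdout, stderr) as text and does not raise
    on a non-zero exit status, so stderr stays available for inspection.
    """
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.7)
    )
    return result.returncode, result.stdout, result.stderr


def pull_image_verbose(image):
    """Hypothetical helper: run `docker pull <image>` and echo docker's stderr."""
    code, out, err = run_and_report(["docker", "pull", image])
    if code != 0:
        # This is the message that the CalledProcessError above does not display.
        print(f"docker pull exited with status {code}", file=sys.stderr)
        print(err, file=sys.stderr)
    return code


# Example call, using the image URI from the error message above:
# pull_image_verbose(
#     "520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-pytorch:1.1-cpu-py3"
# )
```

Running the commented call in a notebook cell should show whatever reason docker gives for the failed pull, which may help narrow down the cause.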

No error occurs if I switch the train_instance_type to a GPU instance; this error only happens with train_instance_type='local'.
