Closed
Description
Please fill out the form below.
System Information
- Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): PyTorch
- Framework Version: 1.1, 1.2
- Python Version: 3.7.4
- CPU or GPU: CPU (ml.t2.xlarge, ml.t2.medium)
- Python SDK Version: 1.43.3
- Are you using a custom image: No
Conda env:
channels:
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- _pytorch_select=0.2=gpu_0
- asn1crypto=0.24.0=py37_0
- astroid=2.3.1=py37_0
- attrs=19.1.0=py37_1
- backcall=0.1.0=py37_0
- blas=1.0=mkl
- bleach=3.1.0=py37_0
- ca-certificates=2019.8.28=0
- certifi=2019.9.11=py37_0
- cffi=1.12.3=py37h2e261b9_0
- chardet=3.0.4=py37_1003
- cryptography=2.7=py37h1ba5d50_0
- cudatoolkit=10.0.130=0
- cudnn=7.6.0=cuda10.0_0
- dbus=1.13.6=h746ee38_0
- decorator=4.4.0=py37_1
- defusedxml=0.6.0=py_0
- entrypoints=0.3=py37_0
- expat=2.2.6=he6710b0_0
- fontconfig=2.13.0=h9420a91_0
- freetype=2.9.1=h8a8886c_1
- glib=2.56.2=hd408876_0
- gmp=6.1.2=h6c8ec71_1
- gst-plugins-base=1.14.0=hbbd80ab_1
- gstreamer=1.14.0=hb453b48_1
- icu=58.2=h9c2bf20_1
- idna=2.8=py37_0
- intel-openmp=2019.4=243
- ipykernel=5.1.2=py37h39e3cac_0
- ipython=7.8.0=py37h39e3cac_0
- ipython_genutils=0.2.0=py37_0
- ipywidgets=7.5.1=py_0
- isort=4.3.21=py37_0
- jedi=0.15.1=py37_0
- jinja2=2.10.1=py37_0
- jpeg=9b=h024ee3a_2
- jsonschema=3.0.2=py37_0
- jupyter=1.0.0=py37_7
- jupyter_client=5.3.3=py37_1
- jupyter_console=6.0.0=py37_0
- jupyter_core=4.5.0=py_0
- lazy-object-proxy=1.4.2=py37h7b6447c_0
- libedit=3.1.20181209=hc058e9b_0
- libffi=3.2.1=hd88cf55_4
- libgcc-ng=9.1.0=hdf63c60_0
- libgfortran-ng=7.3.0=hdf63c60_0
- libpng=1.6.37=hbc83047_0
- libsodium=1.0.16=h1bed415_0
- libstdcxx-ng=9.1.0=hdf63c60_0
- libtiff=4.0.10=h2733197_2
- libuuid=1.0.3=h1bed415_2
- libxcb=1.13=h1bed415_1
- libxml2=2.9.9=hea5a465_1
- markupsafe=1.1.1=py37h7b6447c_0
- mccabe=0.6.1=py37_1
- mistune=0.8.4=py37h7b6447c_0
- mkl=2019.4=243
- mkl-service=2.3.0=py37he904b0f_0
- mkl_fft=1.0.14=py37ha843d7b_0
- mkl_random=1.1.0=py37hd6b4f25_0
- nb_conda_kernels=2.2.2=py37_0
- nbconvert=5.6.0=py37_1
- nbformat=4.4.0=py37_0
- ncurses=6.1=he6710b0_1
- ninja=1.9.0=py37hfd86e86_0
- notebook=6.0.1=py37_0
- numpy=1.17.2=py37haad9e8e_0
- numpy-base=1.17.2=py37hde5b4d6_0
- olefile=0.46=py37_0
- openssl=1.1.1d=h7b6447c_2
- pandas=0.25.1=py37he6710b0_0
- pandoc=2.2.3.2=0
- pandocfilters=1.4.2=py37_1
- parso=0.5.1=py_0
- pcre=8.43=he6710b0_0
- pexpect=4.7.0=py37_0
- pickleshare=0.7.5=py37_0
- pillow=6.1.0=py37h34e0f95_0
- pip=19.2.3=py37_0
- prometheus_client=0.7.1=py_0
- prompt_toolkit=2.0.9=py37_0
- ptyprocess=0.6.0=py37_0
- pycparser=2.19=py37_0
- pygments=2.4.2=py_0
- pylint=2.4.2=py37_0
- pyopenssl=19.0.0=py37_0
- pyqt=5.9.2=py37h05f1152_2
- pyrsistent=0.15.4=py37h7b6447c_0
- pysocks=1.7.1=py37_0
- python=3.7.4=h265db76_1
- python-dateutil=2.8.0=py37_0
- pytorch=1.2.0=cuda100py37h938c94c_0
- pytz=2019.2=py_0
- pyzmq=18.1.0=py37he6710b0_0
- qt=5.9.7=h5867ecd_1
- qtconsole=4.5.5=py_0
- readline=7.0=h7b6447c_5
- send2trash=1.5.0=py37_0
- setuptools=41.2.0=py37_0
- sip=4.19.8=py37hf484d3e_0
- six=1.12.0=py37_0
- sqlite=3.30.0=h7b6447c_0
- terminado=0.8.2=py37_0
- testpath=0.4.2=py37_0
- tk=8.6.8=hbc83047_0
- torchvision=0.4.0=cuda100py37hecfc37a_0
- tornado=6.0.3=py37h7b6447c_0
- traitlets=4.3.2=py37_0
- urllib3=1.24.2=py37_0
- wcwidth=0.1.7=py37_0
- webencodings=0.5.1=py37_1
- wheel=0.33.6=py37_0
- widgetsnbextension=3.5.1=py37_0
- wrapt=1.11.2=py37h7b6447c_0
- xz=5.2.4=h14c3975_4
- zeromq=4.3.1=he6710b0_3
- zlib=1.2.11=h7b6447c_3
- zstd=1.3.7=h0b5b093_0
- pip:
  - bcrypt==3.1.7
  - boto3==1.9.243
  - botocore==1.12.243
  - cached-property==1.5.1
  - docker==3.7.3
  - docker-compose==1.24.1
  - docker-pycreds==0.4.0
  - dockerpty==0.4.1
  - docopt==0.6.2
  - docutils==0.15.2
  - fabric==2.5.0
  - invoke==1.3.0
  - jmespath==0.9.4
  - paramiko==2.6.0
  - protobuf==3.10.0
  - protobuf3-to-dict==0.1.5
  - pynacl==1.3.0
  - pyyaml==3.13
  - requests==2.20.1
  - s3transfer==0.2.1
  - sagemaker==1.43.3
  - scipy==1.3.1
  - texttable==0.9.1
  - torch==1.2.0
  - websocket-client==0.56.0
Describe the problem
Attempting to fit an estimator in local mode on a SageMaker notebook instance raises an error. The error first appeared a couple of days ago, on code that previously ran without problems.
Minimal repro / logs
I run the following code in a SageMaker notebook instance of type ml.t2.xlarge or ml.t2.medium, and get the same error in both cases. I also get a very similar error if I switch the framework_version to 1.2.
import os

import sagemaker
from sagemaker.pytorch import PyTorch

from constants import S3_PREFIX

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

estimator = PyTorch(
    entry_point='train.py', source_dir='.', role=role,
    train_instance_count=1, train_instance_type='local',
    framework_version='1.1'
)

try:
    estimator.fit({
        'train_dir': f's3://{bucket}/{S3_PREFIX}/train',
        'val_dir': f's3://{bucket}/{S3_PREFIX}/val'
    })
finally:
    # otherwise docker tmp garbage will fill up disk
    os.system('sudo rm -rf /tmp/tmp*')
This yields the following error:
---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
<ipython-input-14-e10fe05fd9ee> in <module>
7 estimator.fit({
8 'train_dir': f's3://{bucket}/{S3_PREFIX}/train',
----> 9 'val_dir': f's3://{bucket}/{S3_PREFIX}/val'
10 })
11 finally:
~/anaconda3/envs/home-listings/lib/python3.7/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name)
337 self._prepare_for_training(job_name=job_name)
338
--> 339 self.latest_training_job = _TrainingJob.start_new(self, inputs)
340 if wait:
341 self.latest_training_job.wait(logs=logs)
~/anaconda3/envs/home-listings/lib/python3.7/site-packages/sagemaker/estimator.py in start_new(cls, estimator, inputs)
861 cls._add_spot_checkpoint_args(local_mode, estimator, train_args)
862
--> 863 estimator.sagemaker_session.train(**train_args)
864
865 return cls(estimator.sagemaker_session, estimator._current_job_name)
~/anaconda3/envs/home-listings/lib/python3.7/site-packages/sagemaker/session.py in train(self, input_mode, input_config, role, job_name, output_config, resource_config, vpc_config, hyperparameters, stop_condition, tags, metric_definitions, enable_network_isolation, image, algorithm_arn, encrypt_inter_container_traffic, train_use_spot_instances, checkpoint_s3_uri, checkpoint_local_path)
392 LOGGER.info("Creating training-job with name: %s", job_name)
393 LOGGER.debug("train request: %s", json.dumps(train_request, indent=4))
--> 394 self.sagemaker_client.create_training_job(**train_request)
395
396 def compile_model(
~/anaconda3/envs/home-listings/lib/python3.7/site-packages/sagemaker/local/local_session.py in create_training_job(self, TrainingJobName, AlgorithmSpecification, OutputDataConfig, ResourceConfig, InputDataConfig, **kwargs)
99 training_job = _LocalTrainingJob(container)
100 hyperparameters = kwargs["HyperParameters"] if "HyperParameters" in kwargs else {}
--> 101 training_job.start(InputDataConfig, OutputDataConfig, hyperparameters, TrainingJobName)
102
103 LocalSagemakerClient._training_jobs[TrainingJobName] = training_job
~/anaconda3/envs/home-listings/lib/python3.7/site-packages/sagemaker/local/entities.py in start(self, input_data_config, output_data_config, hyperparameters, job_name)
87
88 self.model_artifacts = self.container.train(
---> 89 input_data_config, output_data_config, hyperparameters, job_name
90 )
91 self.end_time = datetime.datetime.now()
~/anaconda3/envs/home-listings/lib/python3.7/site-packages/sagemaker/local/image.py in train(self, input_data_config, output_data_config, hyperparameters, job_name)
139
140 if _ecr_login_if_needed(self.sagemaker_session.boto_session, self.image):
--> 141 _pull_image(self.image)
142
143 process = subprocess.Popen(
~/anaconda3/envs/home-listings/lib/python3.7/site-packages/sagemaker/local/image.py in _pull_image(image)
831 logger.info("docker command: %s", pull_image_command)
832
--> 833 subprocess.check_output(pull_image_command, shell=True)
834 logger.info("image pulled: %s", image)
~/anaconda3/envs/home-listings/lib/python3.7/subprocess.py in check_output(timeout, *popenargs, **kwargs)
393
394 return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
--> 395 **kwargs).stdout
396
397
~/anaconda3/envs/home-listings/lib/python3.7/subprocess.py in run(input, capture_output, timeout, check, *popenargs, **kwargs)
485 if check and retcode:
486 raise CalledProcessError(retcode, process.args,
--> 487 output=stdout, stderr=stderr)
488 return CompletedProcess(process.args, retcode, stdout, stderr)
489
CalledProcessError: Command 'docker pull 520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-pytorch:1.1-cpu-py3' returned non-zero exit status 1.
No error occurs if I switch the train_instance_type to a GPU instance; this error only happens with train_instance_type='local'.
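The traceback above reports only the exit status of the docker pull command, not Docker's own message. A minimal sketch for surfacing that message is to rerun the same command with stderr merged into the captured output. The helper name run_and_capture is hypothetical and is not part of the SageMaker SDK; it uses only the standard-library subprocess module.

```python
import subprocess


def run_and_capture(command):
    """Run a shell command and return (exit_code, combined stdout and stderr).

    Hypothetical helper for illustration; not part of the SageMaker SDK.
    """
    result = subprocess.run(
        command,
        shell=True,                 # same invocation style as the SDK's pull step
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # fold stderr into stdout so nothing is lost
        universal_newlines=True,    # return text instead of bytes
    )
    return result.returncode, result.stdout
```

Calling run_and_capture('docker pull 520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-pytorch:1.1-cpu-py3') from the notebook and printing the second element of the result should show whatever Docker printed before exiting with status 1.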
BedirYilmaz