Closed
Description
Describe the bug
This is related to the already resolved issue #1349. After training in local mode via LocalSession(), deploying the estimator locally fails every time with this error:
algo-1-zu1b2_1 | 2020-03-13 17:36:32,149 [WARN ] W-9001-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - in DEFAULT_MODEL_FILENAMES.items()]))
algo-1-zu1b2_1 | 2020-03-13 17:36:32,149 [WARN ] W-9001-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ValueError: Failed to load model with default model_fn: missing file model-symbol.json.Expected files: ['model-symbol.json', 'model-0000.params', 'model-shapes.json']
algo-1-zu1b2_1 | 2020-03-13 17:36:32,149 [WARN ] W-9001-model com.amazonaws.ml.mms.wlm.BatchAggregator - Load model failed: model, error: Worker died.
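The log above names three files that the default model_fn expects. As a rough sketch (not part of the SageMaker SDK), the contents of a locally downloaded model.tar.gz could be checked against that list before deploying. The helper name `missing_model_files` is hypothetical, and the sketch assumes the files are expected at the top level of the archive.

```python
import posixpath
import tarfile

# File names copied from the "Expected files" list in the error message above.
EXPECTED_FILES = ["model-symbol.json", "model-0000.params", "model-shapes.json"]


def missing_model_files(tar_path, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from the archive root.

    Hypothetical helper: it only inspects member names and does not load the model.
    """
    with tarfile.open(tar_path, "r:gz") as tar:
        # Normalize names such as "./model-symbol.json" to "model-symbol.json".
        names = {posixpath.normpath(name) for name in tar.getnames()}
    return [name for name in expected if name not in names]
```

An empty list means all three files are present at the archive root; anything else lists what the default model_fn would fail to find.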
I investigated where the issue might be and manually created an MXNetModel. It works in non-local mode but fails in local mode with:
---------------------------------------------------------------------------
ClientError Traceback (most recent call last)
<timed exec> in <module>()
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, update_endpoint, tags, kms_key, wait, data_capture_config)
440 self.name = "{}{}".format(name_prefix, compiled_model_suffix)
441
--> 442 self._create_sagemaker_model(instance_type, accelerator_type, tags)
443 production_variant = sagemaker.production_variant(
444 self.name, instance_type, initial_instance_count, accelerator_type=accelerator_type
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/model.py in _create_sagemaker_model(self, instance_type, accelerator_type, tags)
175 /api/latest/reference/services/sagemaker.html#SageMaker.Client.add_tags
176 """
--> 177 container_def = self.prepare_container_def(instance_type, accelerator_type=accelerator_type)
178 self.name = self.name or utils.name_from_image(container_def["Image"])
179 enable_network_isolation = self.enable_network_isolation()
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/mxnet/model.py in prepare_container_def(self, instance_type, accelerator_type)
150
151 deploy_key_prefix = model_code_key_prefix(self.key_prefix, self.name, deploy_image)
--> 152 self._upload_code(deploy_key_prefix, self._is_mms_version())
153 deploy_env = dict(self.env)
154 deploy_env.update(self._framework_env_vars())
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/model.py in _upload_code(self, key_prefix, repack)
823 repacked_model_uri=repacked_model_data,
824 sagemaker_session=self.sagemaker_session,
--> 825 kms_key=self.model_kms_key,
826 )
827
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/utils.py in repack_model(inference_script, source_directory, dependencies, model_uri, repacked_model_uri, sagemaker_session, kms_key)
481
482 with _tmpdir() as tmp:
--> 483 model_dir = _extract_model(model_uri, sagemaker_session, tmp)
484
485 _create_or_update_code_dir(
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/utils.py in _extract_model(model_uri, sagemaker_session, tmp)
571 if model_uri.lower().startswith("s3://"):
572 local_model_path = os.path.join(tmp, "tar_file")
--> 573 download_file_from_url(model_uri, local_model_path, sagemaker_session)
574 else:
575 local_model_path = model_uri.replace("file://", "")
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/utils.py in download_file_from_url(url, dst, sagemaker_session)
589 bucket, key = url.netloc, url.path.lstrip("/")
590
--> 591 download_file(bucket, key, dst, sagemaker_session)
592
593
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/utils.py in download_file(bucket_name, path, target, sagemaker_session)
607 s3 = boto_session.resource("s3")
608 bucket = s3.Bucket(bucket_name)
--> 609 bucket.download_file(path, target)
610
611
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/boto3/s3/inject.py in bucket_download_file(self, Key, Filename, ExtraArgs, Callback, Config)
244 return self.meta.client.download_file(
245 Bucket=self.name, Key=Key, Filename=Filename,
--> 246 ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
247
248
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/boto3/s3/inject.py in download_file(self, Bucket, Key, Filename, ExtraArgs, Callback, Config)
170 return transfer.download_file(
171 bucket=Bucket, key=Key, filename=Filename,
--> 172 extra_args=ExtraArgs, callback=Callback)
173
174
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/boto3/s3/transfer.py in download_file(self, bucket, key, filename, extra_args, callback)
305 bucket, key, filename, extra_args, subscribers)
306 try:
--> 307 future.result()
308 # This is for backwards compatibility where when retries are
309 # exceeded we need to throw the same error from boto3 instead of
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/s3transfer/futures.py in result(self)
104 # however if a KeyboardInterrupt is raised we want want to exit
105 # out of this and propogate the exception.
--> 106 return self._coordinator.result()
107 except KeyboardInterrupt as e:
108 self.cancel()
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/s3transfer/futures.py in result(self)
263 # final result.
264 if self._exception:
--> 265 raise self._exception
266 return self._result
267
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/s3transfer/tasks.py in _main(self, transfer_future, **kwargs)
253 # Call the submit method to start submitting tasks to execute the
254 # transfer.
--> 255 self._submit(transfer_future=transfer_future, **kwargs)
256 except BaseException as e:
257 # If there was an exception raised during the submission of task
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/s3transfer/download.py in _submit(self, client, config, osutil, request_executor, io_executor, transfer_future, bandwidth_limiter)
341 Bucket=transfer_future.meta.call_args.bucket,
342 Key=transfer_future.meta.call_args.key,
--> 343 **transfer_future.meta.call_args.extra_args
344 )
345 transfer_future.meta.provide_transfer_size(
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
314 "%s() only accepts keyword arguments." % py_operation_name)
315 # The "self" in this scope is referring to the BaseClient.
--> 316 return self._make_api_call(operation_name, kwargs)
317
318 _api_call.__name__ = str(py_operation_name)
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
624 error_code = parsed_response.get("Error", {}).get("Code")
625 error_class = self.exceptions.from_code(error_code)
--> 626 raise error_class(parsed_response, operation_name)
627 else:
628 return parsed_response
ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

Looking into it further, I found that in local mode the model.tar.gz location URI is stored differently from non-local mode:
- Local model URI: s3://{bucket}/{prefix}/outputmxnet-training-2020-03-12-23-19-23-971/model.tar.gz
- Non-local URI: s3://{bucket}/{prefix}/output/mxnet-training-2020-03-12-23-19-23-971/output/model.tar.gz
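To make the difference between the two URIs easier to see, here is a minimal sketch that splits each one into bucket and key the same way the traceback shows `download_file_from_url` doing it (`url.netloc` and `url.path.lstrip("/")`). The function name `split_s3_uri` and the bucket and prefix values `my-bucket` and `my-prefix` are placeholders, not values from my setup.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key). Hypothetical helper for illustration."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError("expected an s3:// URI, got: {}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder bucket and prefix; the job name matches the URIs reported above.
local_uri = "s3://my-bucket/my-prefix/outputmxnet-training-2020-03-12-23-19-23-971/model.tar.gz"
non_local_uri = "s3://my-bucket/my-prefix/output/mxnet-training-2020-03-12-23-19-23-971/output/model.tar.gz"

_, local_key = split_s3_uri(local_uri)
_, non_local_key = split_s3_uri(non_local_uri)

# Comparing the key segments shows where the two layouts diverge.
print(local_key.split("/"))
print(non_local_key.split("/"))
```

The local-mode key has three segments while the non-local key has five, so a HeadObject call on the local-mode key looks up a different object than the one a non-local training job writes.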
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 1.51.3
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): MXNet
- Framework version: 1.6.0
- Python version: py3.6
- CPU or GPU: Both
- Custom Docker image (Y/N): N