Local deploy failure for MXNet estimator due to non-uniform output uri #1354

@ehsanmok

Describe the bug

This is related to the already resolved issue #1349. After training in local mode via LocalSession(), deploying the estimator locally repeatedly throws this error:

algo-1-zu1b2_1  | 2020-03-13 17:36:32,149 [WARN ] W-9001-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     in DEFAULT_MODEL_FILENAMES.items()]))
algo-1-zu1b2_1  | 2020-03-13 17:36:32,149 [WARN ] W-9001-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ValueError: Failed to load model with default model_fn: missing file model-symbol.json.Expected files: ['model-symbol.json', 'model-0000.params', 'model-shapes.json']
algo-1-zu1b2_1  | 2020-03-13 17:36:32,149 [WARN ] W-9001-model com.amazonaws.ml.mms.wlm.BatchAggregator - Load model failed: model, error: Worker died.
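One way to confirm what the default model_fn is complaining about is to list the contents of the downloaded model.tar.gz and compare them with the file names in the error message. The following is a minimal sketch, assuming the archive has already been copied to a local path; `missing_model_files` is a hypothetical helper, not part of the SageMaker SDK, and the expected names are taken verbatim from the ValueError above.

```python
import tarfile

# File names copied from the ValueError in the log above.
EXPECTED_FILES = ("model-symbol.json", "model-0000.params", "model-shapes.json")


def missing_model_files(archive_path, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from the archive.

    Hypothetical diagnostic helper, not part of the SageMaker SDK.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        # Compare base names so that "./model-symbol.json" also matches.
        present = {name.rsplit("/", 1)[-1] for name in tar.getnames()}
    return [name for name in expected if name not in present]
```

Calling `missing_model_files("model.tar.gz")` on the local-mode archive returns the files the container would fail to find; an empty list would suggest the archive itself is fine and the problem lies elsewhere.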

I investigated where the issue might be by manually creating the MXNetModel. Deploying it works in non-local mode but fails in local mode with:

---------------------------------------------------------------------------
ClientError                               Traceback (most recent call last)
<timed exec> in <module>()

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, update_endpoint, tags, kms_key, wait, data_capture_config)
    440             self.name = "{}{}".format(name_prefix, compiled_model_suffix)
    441 
--> 442         self._create_sagemaker_model(instance_type, accelerator_type, tags)
    443         production_variant = sagemaker.production_variant(
    444             self.name, instance_type, initial_instance_count, accelerator_type=accelerator_type

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/model.py in _create_sagemaker_model(self, instance_type, accelerator_type, tags)
    175                 /api/latest/reference/services/sagemaker.html#SageMaker.Client.add_tags
    176         """
--> 177         container_def = self.prepare_container_def(instance_type, accelerator_type=accelerator_type)
    178         self.name = self.name or utils.name_from_image(container_def["Image"])
    179         enable_network_isolation = self.enable_network_isolation()

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/mxnet/model.py in prepare_container_def(self, instance_type, accelerator_type)
    150 
    151         deploy_key_prefix = model_code_key_prefix(self.key_prefix, self.name, deploy_image)
--> 152         self._upload_code(deploy_key_prefix, self._is_mms_version())
    153         deploy_env = dict(self.env)
    154         deploy_env.update(self._framework_env_vars())

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/model.py in _upload_code(self, key_prefix, repack)
    823                 repacked_model_uri=repacked_model_data,
    824                 sagemaker_session=self.sagemaker_session,
--> 825                 kms_key=self.model_kms_key,
    826             )
    827 

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/utils.py in repack_model(inference_script, source_directory, dependencies, model_uri, repacked_model_uri, sagemaker_session, kms_key)
    481 
    482     with _tmpdir() as tmp:
--> 483         model_dir = _extract_model(model_uri, sagemaker_session, tmp)
    484 
    485         _create_or_update_code_dir(

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/utils.py in _extract_model(model_uri, sagemaker_session, tmp)
    571     if model_uri.lower().startswith("s3://"):
    572         local_model_path = os.path.join(tmp, "tar_file")
--> 573         download_file_from_url(model_uri, local_model_path, sagemaker_session)
    574     else:
    575         local_model_path = model_uri.replace("file://", "")

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/utils.py in download_file_from_url(url, dst, sagemaker_session)
    589     bucket, key = url.netloc, url.path.lstrip("/")
    590 
--> 591     download_file(bucket, key, dst, sagemaker_session)
    592 
    593 

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/utils.py in download_file(bucket_name, path, target, sagemaker_session)
    607     s3 = boto_session.resource("s3")
    608     bucket = s3.Bucket(bucket_name)
--> 609     bucket.download_file(path, target)
    610 
    611 

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/boto3/s3/inject.py in bucket_download_file(self, Key, Filename, ExtraArgs, Callback, Config)
    244     return self.meta.client.download_file(
    245         Bucket=self.name, Key=Key, Filename=Filename,
--> 246         ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
    247 
    248 

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/boto3/s3/inject.py in download_file(self, Bucket, Key, Filename, ExtraArgs, Callback, Config)
    170         return transfer.download_file(
    171             bucket=Bucket, key=Key, filename=Filename,
--> 172             extra_args=ExtraArgs, callback=Callback)
    173 
    174 

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/boto3/s3/transfer.py in download_file(self, bucket, key, filename, extra_args, callback)
    305             bucket, key, filename, extra_args, subscribers)
    306         try:
--> 307             future.result()
    308         # This is for backwards compatibility where when retries are
    309         # exceeded we need to throw the same error from boto3 instead of

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/s3transfer/futures.py in result(self)
    104             # however if a KeyboardInterrupt is raised we want want to exit
    105             # out of this and propogate the exception.
--> 106             return self._coordinator.result()
    107         except KeyboardInterrupt as e:
    108             self.cancel()

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/s3transfer/futures.py in result(self)
    263         # final result.
    264         if self._exception:
--> 265             raise self._exception
    266         return self._result
    267 

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/s3transfer/tasks.py in _main(self, transfer_future, **kwargs)
    253             # Call the submit method to start submitting tasks to execute the
    254             # transfer.
--> 255             self._submit(transfer_future=transfer_future, **kwargs)
    256         except BaseException as e:
    257             # If there was an exception raised during the submission of task

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/s3transfer/download.py in _submit(self, client, config, osutil, request_executor, io_executor, transfer_future, bandwidth_limiter)
    341                 Bucket=transfer_future.meta.call_args.bucket,
    342                 Key=transfer_future.meta.call_args.key,
--> 343                 **transfer_future.meta.call_args.extra_args
    344             )
    345             transfer_future.meta.provide_transfer_size(

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    314                     "%s() only accepts keyword arguments." % py_operation_name)
    315             # The "self" in this scope is referring to the BaseClient.
--> 316             return self._make_api_call(operation_name, kwargs)
    317 
    318         _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    624             error_code = parsed_response.get("Error", {}).get("Code")
    625             error_class = self.exceptions.from_code(error_code)
--> 626             raise error_class(parsed_response, operation_name)
    627         else:
    628             return parsed_response

ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

Looking into it further, I found that in local mode the model.tar.gz location URI is stored differently from non-local mode:

  • Local URI: s3://{bucket}/{prefix}/outputmxnet-training-2020-03-12-23-19-23-971/model.tar.gz

  • Non-local URI: s3://{bucket}/{prefix}/output/mxnet-training-2020-03-12-23-19-23-971/output/model.tar.gz
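To make the difference concrete, the sketch below splits both URIs into bucket and key the same way `download_file_from_url` does in the traceback (`url.netloc` and `url.path.lstrip("/")`). `my-bucket` and `my-prefix` are placeholders standing in for `{bucket}` and `{prefix}`, and `split_s3_uri` is a hypothetical helper written for illustration only.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key).

    Hypothetical helper mirroring the netloc/path split shown in the traceback.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError("expected an s3:// URI, got: {}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder bucket and prefix; the job name is the one from this report.
job = "mxnet-training-2020-03-12-23-19-23-971"
local_uri = "s3://my-bucket/my-prefix/output{}/model.tar.gz".format(job)
non_local_uri = "s3://my-bucket/my-prefix/output/{}/output/model.tar.gz".format(job)

for label, uri in (("local", local_uri), ("non-local", non_local_uri)):
    bucket, key = split_s3_uri(uri)
    print("{:>9}: bucket={} key={}".format(label, bucket, key))
```

The local-mode key has no "/" between `output` and the job name and no trailing `output/` directory. If no object exists under that key, the HeadObject call fails with the 404 shown above. Passing each bucket and key pair to boto3's `head_object` is one way to check which of the two objects actually exists.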

System information

  • SageMaker Python SDK version: 1.51.3
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): MXNet
  • Framework version: 1.6.0
  • Python version: py3.6
  • CPU or GPU: Both
  • Custom Docker image (Y/N): N
