Local mode failure for MXNet estimator #1349

@ehsanmok

Describe the bug
I'm on an ml.p3.2xlarge instance with the mxnet_p36 conda env, and I installed local mode support with python -m pip install "sagemaker[local]". Training with the estimator below fails in local mode (train_instance_type='local' or 'local_gpu') but works with any non-local instance type.

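from sagemaker.mxnet import MXNet

# role, hyperparameters, train_output, code_location, session and input_data
# are defined earlier in the notebook (not shown here).
estimator = MXNet(entry_point='main.py',
                  source_dir='code',
                  role=role,
                  train_instance_count=1,
                  train_instance_type='local',  # works with e.g. 'ml.c4.2xlarge'
                  framework_version="1.4.1",
                  py_version='py3',
                  hyperparameters=hyperparameters,
                  output_path=train_output,
                  code_location=code_location,
                  sagemaker_session=session,
                 )

estimator.fit(input_data)  # raises the ClientError shown below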

Screenshots or logs

ClientError                               Traceback (most recent call last)
<ipython-input-11-2c228015dcf6> in <module>()
     15                  )
     16 
---> 17 estimator.fit(input_data)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    460         self._prepare_for_training(job_name=job_name)
    461 
--> 462         self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
    463         self.jobs.append(self.latest_training_job)
    464         if wait:

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/estimator.py in start_new(cls, estimator, inputs, experiment_config)
   1008             train_args["enable_sagemaker_metrics"] = estimator.enable_sagemaker_metrics
   1009 
-> 1010         estimator.sagemaker_session.train(**train_args)
   1011 
   1012         return cls(estimator.sagemaker_session, estimator._current_job_name)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/session.py in train(self, input_mode, input_config, role, job_name, output_config, resource_config, vpc_config, hyperparameters, stop_condition, tags, metric_definitions, enable_network_isolation, image, algorithm_arn, encrypt_inter_container_traffic, train_use_spot_instances, checkpoint_s3_uri, checkpoint_local_path, experiment_config, debugger_rule_configs, debugger_hook_config, tensorboard_output_config, enable_sagemaker_metrics)
    567         LOGGER.info("Creating training-job with name: %s", job_name)
    568         LOGGER.debug("train request: %s", json.dumps(train_request, indent=4))
--> 569         self.sagemaker_client.create_training_job(**train_request)
    570 
    571     def process(

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    314                     "%s() only accepts keyword arguments." % py_operation_name)
    315             # The "self" in this scope is referring to the BaseClient.
--> 316             return self._make_api_call(operation_name, kwargs)
    317 
    318         _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    624             error_code = parsed_response.get("Error", {}).get("Code")
    625             error_class = self.exceptions.from_code(error_code)
--> 626             raise error_class(parsed_response, operation_name)
    627         else:
    628             return parsed_response

ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: Value 'local' at 'resourceConfig.instanceType' failed to satisfy constraint: Member must satisfy enum value set: [ml.p2.xlarge, ml.m5.4xlarge, ml.m4.16xlarge, ml.p3.16xlarge, ml.m5.large, ml.p2.16xlarge, ml.c4.2xlarge, ml.c5.2xlarge, ml.c4.4xlarge, ml.c5.4xlarge, ml.g4dn.xlarge, ml.g4dn.12xlarge, ml.c4.8xlarge, ml.g4dn.2xlarge, ml.c5.9xlarge, ml.g4dn.4xlarge, ml.c5.xlarge, ml.g4dn.16xlarge, ml.c4.xlarge, ml.g4dn.8xlarge, ml.c5.18xlarge, ml.p3dn.24xlarge, ml.p3.2xlarge, ml.m5.xlarge, ml.m4.10xlarge, ml.m5.12xlarge, ml.m4.xlarge, ml.m5.24xlarge, ml.m4.2xlarge, ml.p2.8xlarge, ml.m5.2xlarge, ml.p3.8xlarge, ml.m4.4xlarge]
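
To make the message above easier to read, here is a minimal sketch that pulls the rejected value and the allowed enum set out of a ValidationException string using only the standard library. The helper names `rejected_value` and `allowed_instance_types` are hypothetical (they are not part of the SageMaker SDK or botocore), and the message is shortened to three instance types for brevity. It only illustrates that 'local' reached the hosted CreateTrainingJob validation, which accepts ml.* instance types only; it does not diagnose why the request was not handled locally.

```python
import re
from typing import List, Optional, Tuple


def rejected_value(message: str) -> Optional[Tuple[str, str]]:
    """Return (value, field) from a "Value 'x' at 'y'" validation message."""
    match = re.search(r"Value '([^']*)' at '([^']*)'", message)
    return (match.group(1), match.group(2)) if match else None


def allowed_instance_types(message: str) -> List[str]:
    """Return the members listed after 'enum value set:' in the message."""
    match = re.search(r"enum value set: \[([^\]]*)\]", message)
    if match is None:
        return []
    return [item.strip() for item in match.group(1).split(",") if item.strip()]


# Shortened copy of the error above (the real list has many more entries).
message = (
    "An error occurred (ValidationException) when calling the "
    "CreateTrainingJob operation: 1 validation error detected: Value 'local' "
    "at 'resourceConfig.instanceType' failed to satisfy constraint: Member "
    "must satisfy enum value set: [ml.p2.xlarge, ml.c4.2xlarge, ml.p3.2xlarge]"
)

value, field = rejected_value(message)
allowed = allowed_instance_types(message)
print(value, field, value in allowed)
```

With the shortened message this prints `local resourceConfig.instanceType False`, matching what the full error reports: 'local' is not in the enum set that the service accepts for that field.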

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 1.50.16
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): MXNet
  • Framework version: 1.4.1 and 1.6.0
  • Python version: py3
  • CPU or GPU: Both
  • Custom Docker image (Y/N):
