Closed
Description
Describe the bug
I'm on an ml.p3.2xlarge instance with the mxnet_p36 conda env, and I installed the SDK with python -m pip install "sagemaker[local]". Training with the estimator below fails in local mode (train_instance_type='local' or 'local_gpu') but works with any non-local instance type:
estimator = MXNet(entry_point='main.py',
                  source_dir='code',
                  role=role,
                  train_instance_count=1,
                  train_instance_type='local',  # 'ml.c4.2xlarge'
                  framework_version="1.4.1",
                  py_version='py3',
                  hyperparameters=hyperparameters,
                  output_path=train_output,
                  code_location=code_location,
                  sagemaker_session=session,
                  )
Screenshots or logs
ClientError Traceback (most recent call last)
<ipython-input-11-2c228015dcf6> in <module>()
15 )
16
---> 17 estimator.fit(input_data)
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
460 self._prepare_for_training(job_name=job_name)
461
--> 462 self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
463 self.jobs.append(self.latest_training_job)
464 if wait:
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/estimator.py in start_new(cls, estimator, inputs, experiment_config)
1008 train_args["enable_sagemaker_metrics"] = estimator.enable_sagemaker_metrics
1009
-> 1010 estimator.sagemaker_session.train(**train_args)
1011
1012 return cls(estimator.sagemaker_session, estimator._current_job_name)
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/session.py in train(self, input_mode, input_config, role, job_name, output_config, resource_config, vpc_config, hyperparameters, stop_condition, tags, metric_definitions, enable_network_isolation, image, algorithm_arn, encrypt_inter_container_traffic, train_use_spot_instances, checkpoint_s3_uri, checkpoint_local_path, experiment_config, debugger_rule_configs, debugger_hook_config, tensorboard_output_config, enable_sagemaker_metrics)
567 LOGGER.info("Creating training-job with name: %s", job_name)
568 LOGGER.debug("train request: %s", json.dumps(train_request, indent=4))
--> 569 self.sagemaker_client.create_training_job(**train_request)
570
571 def process(
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
314 "%s() only accepts keyword arguments." % py_operation_name)
315 # The "self" in this scope is referring to the BaseClient.
--> 316 return self._make_api_call(operation_name, kwargs)
317
318 _api_call.__name__ = str(py_operation_name)
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
624 error_code = parsed_response.get("Error", {}).get("Code")
625 error_class = self.exceptions.from_code(error_code)
--> 626 raise error_class(parsed_response, operation_name)
627 else:
628 return parsed_response
ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: Value 'local' at 'resourceConfig.instanceType' failed to satisfy constraint: Member must satisfy enum value set: [ml.p2.xlarge, ml.m5.4xlarge, ml.m4.16xlarge, ml.p3.16xlarge, ml.m5.large, ml.p2.16xlarge, ml.c4.2xlarge, ml.c5.2xlarge, ml.c4.4xlarge, ml.c5.4xlarge, ml.g4dn.xlarge, ml.g4dn.12xlarge, ml.c4.8xlarge, ml.g4dn.2xlarge, ml.c5.9xlarge, ml.g4dn.4xlarge, ml.c5.xlarge, ml.g4dn.16xlarge, ml.c4.xlarge, ml.g4dn.8xlarge, ml.c5.18xlarge, ml.p3dn.24xlarge, ml.p3.2xlarge, ml.m5.xlarge, ml.m4.10xlarge, ml.m5.12xlarge, ml.m4.xlarge, ml.m5.24xlarge, ml.m4.2xlarge, ml.p2.8xlarge, ml.m5.2xlarge, ml.p3.8xlarge, ml.m4.4xlarge]
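The traceback shows the request going through botocore to the real CreateTrainingJob API, which rejects 'local' as an instance type. One possible explanation, and this is an assumption because the report does not show how session was created, is that session is a regular sagemaker.Session rather than a sagemaker.local.LocalSession, so the estimator sends the job to the service instead of running it in a local container. A minimal sketch of picking the session from the instance type follows; is_local_mode and choose_session are hypothetical helpers, not part of the SDK:

```python
def is_local_mode(instance_type):
    """Return True for the two local-mode instance type strings."""
    return instance_type in ("local", "local_gpu")


def choose_session(instance_type, make_local_session, make_remote_session):
    """Hypothetical helper: build the session that matches the instance type.

    The session factories are passed in, so this function has no SDK
    dependency of its own and can be exercised with plain callables.
    """
    if is_local_mode(instance_type):
        return make_local_session()
    return make_remote_session()


# Possible usage with the SageMaker Python SDK (needs sagemaker[local]);
# left as comments because it requires the SDK and AWS credentials:
#
#   import sagemaker
#   from sagemaker.local import LocalSession
#
#   instance_type = "local"
#   session = choose_session(instance_type, LocalSession, sagemaker.Session)
#   # then pass train_instance_type=instance_type and
#   # sagemaker_session=session to the MXNet estimator above
```

If that assumption holds, simply omitting the sagemaker_session argument when using 'local' or 'local_gpu' may also work, since the estimator should then be able to create a local session on its own.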
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 1.50.16
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): MXNet
- Framework version: 1.4.1 and 1.6.0
- Python version: py3
- CPU or GPU: Both
- Custom Docker image (Y/N):