Closed
Description
I'm not exactly sure what happened. All of a sudden, all of my training jobs fail even though I have made no code changes. There are definitely authentication issues with AWS credentials, even though I am training from the hosted Jupyter notebook and my session is active.
This is how I am constructing the classifier:

    classifier = TensorFlow(entry_point='sm_transcript_classifier_ep.py',
                            role=role,
                            training_steps=1e4,
                            evaluation_steps=100,
                            train_instance_count=1,
                            train_instance_type=INSTANCE_TYPE,
                            hyperparameters={
                                "question": QUESTION,
                                "n_words": _get_n_words()
                            })

Model function:

    def estimator_fn(run_config, params):
        bow_column = tf.feature_column.categorical_column_with_identity(
            WORDS_FEATURE, num_buckets=params["n_words"])
        bow_embedding_column = tf.feature_column.embedding_column(
            bow_column, dimension=EMBEDDING_SIZE, combiner="sqrtn")
        return tf.estimator.LinearClassifier(
            feature_columns=[bow_embedding_column],
            config=run_config
            # loss_reduction=tf.losses.Reduction.SUM_BY_NONZERO_WEIGHTS  # this doesn't work even though SageMaker should support TF 1.6??
        )

Full error log:

...........................................................
2018-04-17 20:34:49,194 INFO - root - running container entrypoint
2018-04-17 20:34:49,194 INFO - root - starting train task
2018-04-17 20:34:49,199 INFO - container_support.training - Training starting
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
2018-04-17 20:34:51,095 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTP connection (1): 169.254.170.2
2018-04-17 20:34:51,305 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): sagemaker-us-east-1-245511257894.s3.amazonaws.com
2018-04-17 20:34:51,983 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): s3.amazonaws.com
2018-04-17 20:34:52,246 INFO - tf_container - ----------------------TF_CONFIG--------------------------
2018-04-17 20:34:52,246 INFO - tf_container - {"environment": "cloud", "cluster": {"master": ["algo-1:2222"]}, "task": {"index": 0, "type": "master"}}
2018-04-17 20:34:52,246 INFO - tf_container - ---------------------------------------------------------
2018-04-17 20:34:52,246 INFO - tf_container - creating RunConfig:
2018-04-17 20:34:52,246 INFO - tf_container - {'save_checkpoints_secs': 300}
2018-04-17 20:34:52,247 INFO - tensorflow - TF_CONFIG environment variable: {u'environment': u'cloud', u'cluster': {u'master': [u'algo-1:2222']}, u'task': {u'index': 0, u'type': u'master'}}
2018-04-17 20:34:52,247 INFO - tf_container - invoking estimator_fn
2018-04-17 20:34:52,247 INFO - tensorflow - Using config: {'_save_checkpoints_secs': 300, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': None, '_task_type': u'master', '_global_id_in_cluster': 0, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb3b40d4190>, '_model_dir': u's3://sagemaker-us-east-1-245511257894/sagemaker-tensorflow-2018-04-17-20-30-05-729/checkpoints', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_service': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}
2018-04-17 20:34:52,248 INFO - tensorflow - Skip starting Tensorflow server as there is only one node in the cluster.
2018-04-17 20:34:52.265465: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing config loader against fileName /root//.aws/config and using profilePrefix = 1
2018-04-17 20:34:52.267103: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing config loader against fileName /root//.aws/credentials and using profilePrefix = 0
2018-04-17 20:34:52.267120: I tensorflow/core/platform/s3/aws_logging.cc:54] Setting provider to read credentials from /root//.aws/credentials for credentials file and /root//.aws/config for the config file , for use with profile default
2018-04-17 20:34:52.267133: I tensorflow/core/platform/s3/aws_logging.cc:54] Creating HttpClient with max connections2 and scheme http
2018-04-17 20:34:52.267154: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing CurlHandleContainer with size 2
2018-04-17 20:34:52.267175: I tensorflow/core/platform/s3/aws_logging.cc:54] Creating TaskRole with default ECSCredentialsClient and refresh rate 900000
2018-04-17 20:34:52.267213: I tensorflow/core/platform/s3/aws_logging.cc:54] Unable to open config file /root//.aws/credentials for reading.
2018-04-17 20:34:52.267228: I tensorflow/core/platform/s3/aws_logging.cc:54] Failed to reload configuration.
2018-04-17 20:34:52.267238: I tensorflow/core/platform/s3/aws_logging.cc:54] Unable to open config file /root//.aws/config for reading.
2018-04-17 20:34:52.267244: I tensorflow/core/platform/s3/aws_logging.cc:54] Failed to reload configuration.
2018-04-17 20:34:52.267255: I tensorflow/core/platform/s3/aws_logging.cc:54] Credentials have expired or will expire, attempting to repull from ECS IAM Service.
2018-04-17 20:34:52.267342: I tensorflow/core/platform/s3/aws_logging.cc:54] Pool grown by 2
2018-04-17 20:34:52.267357: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-17 20:34:52.271264: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing CurlHandleContainer with size 25
2018-04-17 20:34:52.275164: I tensorflow/core/platform/s3/aws_logging.cc:54] Pool grown by 2
2018-04-17 20:34:52.275184: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-17 20:34:52.337301: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-04-17 20:34:52.337347: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-04-17 20:34:52.338141: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-17 20:34:56,292 INFO - tensorflow - Calling model_fn.
2018-04-17 20:34:56,293 ERROR - container_support.training - uncaught exception during training: features should be a dictionary of `Tensor`s. Given type: <type 'function'>
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/container_support/training.py", line 38, in start
    fw.train()
  File "/usr/local/lib/python2.7/dist-packages/tf_container/train.py", line 139, in train
    train_wrapper.train()
  File "/usr/local/lib/python2.7/dist-packages/tf_container/trainer.py", line 73, in train
    tf.estimator.train_and_evaluate(estimator=estimator, train_spec=train_spec, eval_spec=eval_spec)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 421, in train_and_evaluate
    executor.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 522, in run
    getattr(self, task_to_run)()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 577, in run_master
    self._start_distributed_training(saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 715, in _start_distributed_training
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 352, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 812, in _train_model
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 793, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/canned/linear.py", line 316, in _model_fn
    config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/canned/linear.py", line 138, in _linear_model_fn
    'Given type: {}'.format(type(features)))
ValueError: features should be a dictionary of `Tensor`s. Given type: <type 'function'>
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-2a854a24dd88> in <module>()
17 })
18
---> 19 classifier.fit(inputs)
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py in fit(self, inputs, wait, logs, job_name, run_tensorboard_locally)
234 tensorboard.event.set()
235 else:
--> 236 fit_super()
237
238 @classmethod
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py in fit_super()
219 """
220 def fit_super():
--> 221 super(TensorFlow, self).fit(inputs, wait, logs, job_name)
222
223 if run_tensorboard_locally and wait is False:
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name)
608 self._hyperparameters[JOB_NAME_PARAM_NAME] = self._current_job_name
609 self._hyperparameters[SAGEMAKER_REGION_PARAM_NAME] = self.sagemaker_session.boto_session.region_name
--> 610 super(Framework, self).fit(inputs, wait, logs, self._current_job_name)
611
612 def hyperparameters(self):
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name)
163 self.latest_training_job = _TrainingJob.start_new(self, inputs)
164 if wait:
--> 165 self.latest_training_job.wait(logs=logs)
166
167 @classmethod
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
396 def wait(self, logs=True):
397 if logs:
--> 398 self.sagemaker_session.logs_for_job(self.job_name, wait=True)
399 else:
400 self.sagemaker_session.wait_for_job(self.job_name)
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll)
649
650 if wait:
--> 651 self._check_job_status(job_name, description)
652 if dot:
653 print()
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc)
393 if status != 'Completed':
394 reason = desc.get('FailureReason', '(No reason provided)')
--> 395 raise ValueError('Error training {}: {} Reason: {}'.format(job, status, reason))
396
397 def wait_for_endpoint(self, endpoint, poll=5):
ValueError: Error training sagemaker-tensorflow-2018-04-17-20-30-05-729: Failed Reason: AlgorithmError: uncaught exception during training: features should be a dictionary of `Tensor`s. Given type: <type 'function'>
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/container_support/training.py", line 38, in start
    fw.train()
  File "/usr/local/lib/python2.7/dist-packages/tf_container/train.py", line 139, in train
    train_wrapper.train()
  File "/usr/local/lib/python2.7/dist-packages/tf_container/trainer.py", line 73, in train
    tf.estimator.train_and_evaluate(estimator=estimator, train_spec=train_spec, eval_spec=eval_spec)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 421, in train_and_evaluate
    executor.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 522, in run
    getattr(self, task_to_run)()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 577, in run_master
    self._start_distributed_training(saving_liste
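
For reference, here is a minimal sketch in plain Python (no TensorFlow) of one way the final message in the log could arise. It assumes the type check behaves like the one reported at linear.py line 138 in the traceback, and that whatever the input function returns is passed on as `features`. The names `check_features`, `bad_input_fn`, `good_input_fn`, `call_model` and the `"words"` key are hypothetical and are not taken from my entry point. Under Python 3 the type prints as `<class 'function'>` rather than `<type 'function'>`.

```python
def check_features(features):
    """Simplified stand-in for the type check named in the traceback."""
    if not isinstance(features, dict):
        raise ValueError(
            "features should be a dictionary of `Tensor`s. "
            "Given type: {}".format(type(features)))
    return features


def bad_input_fn():
    """Hypothetical input function that returns another function by mistake."""
    def inner():
        return {"words": [[1, 2, 3]]}, [0]
    return inner  # a function object, not (features, labels)


def good_input_fn():
    """Hypothetical input function that returns (features, labels) directly."""
    return {"words": [[1, 2, 3]]}, [0]


def call_model(input_fn):
    """Roughly mimic how an estimator consumes an input function's result."""
    result = input_fn()
    # A 2-tuple is unpacked into (features, labels);
    # anything else is treated as features with no labels.
    if isinstance(result, tuple) and len(result) == 2:
        features, labels = result
    else:
        features, labels = result, None
    return check_features(features), labels


if __name__ == "__main__":
    print(call_model(good_input_fn))
    try:
        call_model(bad_input_fn)
    except ValueError as err:
        print("ValueError:", err)
```

If this assumption holds, the place to look would be what the entry point's input functions return, rather than the credential messages earlier in the log.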