
Failed Reason: AlgorithmError: uncaught exception during training: features should be a dictionary of Tensors. Given type: <type 'function'> #153

@ghost

Description

I'm not exactly sure what happened. All of a sudden, all of my training jobs fail even though I made no code changes. There are definitely authentication issues with AWS credentials, even though I am training from the online Jupyter notebook and my session is active.

This is how I am constructing the classifier.

classifier = TensorFlow(entry_point='sm_transcript_classifier_ep.py',
                        role=role,
                        training_steps=1e4,
                        evaluation_steps=100,
                        train_instance_count=1,
                        train_instance_type=INSTANCE_TYPE,
                        hyperparameters={
                            "question": QUESTION,
                            "n_words": _get_n_words()
                        })

Model function:

def estimator_fn(run_config, params):
    bow_column = tf.feature_column.categorical_column_with_identity(
        WORDS_FEATURE, num_buckets=params["n_words"])
    bow_embedding_column = tf.feature_column.embedding_column(
        bow_column, dimension=EMBEDDING_SIZE, combiner="sqrtn")
    return tf.estimator.LinearClassifier(
        feature_columns=[bow_embedding_column],
        config=run_config
        # loss_reduction=tf.losses.Reduction.SUM_BY_NONZERO_WEIGHTS  # Passing this fails, even though SageMaker should support TF 1.6.
    )

Full error log:

...........................................................
2018-04-17 20:34:49,194 INFO - root - running container entrypoint
2018-04-17 20:34:49,194 INFO - root - starting train task
2018-04-17 20:34:49,199 INFO - container_support.training - Training starting
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
2018-04-17 20:34:51,095 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTP connection (1): 169.254.170.2
2018-04-17 20:34:51,305 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): sagemaker-us-east-1-245511257894.s3.amazonaws.com
2018-04-17 20:34:51,983 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): s3.amazonaws.com
2018-04-17 20:34:52,246 INFO - tf_container - ----------------------TF_CONFIG--------------------------
2018-04-17 20:34:52,246 INFO - tf_container - {"environment": "cloud", "cluster": {"master": ["algo-1:2222"]}, "task": {"index": 0, "type": "master"}}
2018-04-17 20:34:52,246 INFO - tf_container - ---------------------------------------------------------
2018-04-17 20:34:52,246 INFO - tf_container - creating RunConfig:
2018-04-17 20:34:52,246 INFO - tf_container - {'save_checkpoints_secs': 300}
2018-04-17 20:34:52,247 INFO - tensorflow - TF_CONFIG environment variable: {u'environment': u'cloud', u'cluster': {u'master': [u'algo-1:2222']}, u'task': {u'index': 0, u'type': u'master'}}
2018-04-17 20:34:52,247 INFO - tf_container - invoking estimator_fn
2018-04-17 20:34:52,247 INFO - tensorflow - Using config: {'_save_checkpoints_secs': 300, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': None, '_task_type': u'master', '_global_id_in_cluster': 0, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb3b40d4190>, '_model_dir': u's3://sagemaker-us-east-1-245511257894/sagemaker-tensorflow-2018-04-17-20-30-05-729/checkpoints', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_service': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}
2018-04-17 20:34:52,248 INFO - tensorflow - Skip starting Tensorflow server as there is only one node in the cluster.
2018-04-17 20:34:52.265465: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing config loader against fileName /root//.aws/config and using profilePrefix = 1
2018-04-17 20:34:52.267103: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing config loader against fileName /root//.aws/credentials and using profilePrefix = 0
2018-04-17 20:34:52.267120: I tensorflow/core/platform/s3/aws_logging.cc:54] Setting provider to read credentials from /root//.aws/credentials for credentials file and /root//.aws/config for the config file , for use with profile default
2018-04-17 20:34:52.267133: I tensorflow/core/platform/s3/aws_logging.cc:54] Creating HttpClient with max connections2 and scheme http
2018-04-17 20:34:52.267154: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing CurlHandleContainer with size 2
2018-04-17 20:34:52.267175: I tensorflow/core/platform/s3/aws_logging.cc:54] Creating TaskRole with default ECSCredentialsClient and refresh rate 900000
2018-04-17 20:34:52.267213: I tensorflow/core/platform/s3/aws_logging.cc:54] Unable to open config file /root//.aws/credentials for reading.
2018-04-17 20:34:52.267228: I tensorflow/core/platform/s3/aws_logging.cc:54] Failed to reload configuration.
2018-04-17 20:34:52.267238: I tensorflow/core/platform/s3/aws_logging.cc:54] Unable to open config file /root//.aws/config for reading.
2018-04-17 20:34:52.267244: I tensorflow/core/platform/s3/aws_logging.cc:54] Failed to reload configuration.
2018-04-17 20:34:52.267255: I tensorflow/core/platform/s3/aws_logging.cc:54] Credentials have expired or will expire, attempting to repull from ECS IAM Service.
2018-04-17 20:34:52.267342: I tensorflow/core/platform/s3/aws_logging.cc:54] Pool grown by 2
2018-04-17 20:34:52.267357: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-17 20:34:52.271264: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing CurlHandleContainer with size 25
2018-04-17 20:34:52.275164: I tensorflow/core/platform/s3/aws_logging.cc:54] Pool grown by 2
2018-04-17 20:34:52.275184: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-17 20:34:52.337301: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-04-17 20:34:52.337347: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-04-17 20:34:52.338141: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-17 20:34:56,292 INFO - tensorflow - Calling model_fn.
2018-04-17 20:34:56,293 ERROR - container_support.training - uncaught exception during training: features should be a dictionary of `Tensor`s. Given type: <type 'function'>
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/container_support/training.py", line 38, in start
    fw.train()
  File "/usr/local/lib/python2.7/dist-packages/tf_container/train.py", line 139, in train
    train_wrapper.train()
  File "/usr/local/lib/python2.7/dist-packages/tf_container/trainer.py", line 73, in train
    tf.estimator.train_and_evaluate(estimator=estimator, train_spec=train_spec, eval_spec=eval_spec)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 421, in train_and_evaluate
    executor.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 522, in run
    getattr(self, task_to_run)()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 577, in run_master
    self._start_distributed_training(saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 715, in _start_distributed_training
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 352, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 812, in _train_model
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 793, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/canned/linear.py", line 316, in _model_fn
    config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/canned/linear.py", line 138, in _linear_model_fn
    'Given type: {}'.format(type(features)))
ValueError: features should be a dictionary of `Tensor`s. Given type: <type 'function'>
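
The traceback above shows the canned linear model receiving a Python function where it expected a dict of tensors. One possible cause, offered only as an assumption because the entry point's input functions are not shown here, is that `train_input_fn` returns an input function instead of the `(features, labels)` that function produces. The sketch below reproduces that pattern in plain Python without TensorFlow; `linear_model_check`, `make_input_fn`, and `run` are hypothetical stand-ins, not SageMaker or TensorFlow APIs.

```python
def linear_model_check(features):
    """Hypothetical stand-in for the type check that raises in canned/linear.py."""
    if not isinstance(features, dict):
        raise ValueError(
            "features should be a dictionary of `Tensor`s. "
            "Given type: {}".format(type(features)))
    return features


def make_input_fn(data):
    """Hypothetical builder that returns an input function (a closure)."""
    def input_fn():
        # Plain lists stand in for tensors in this sketch.
        return {"words": data}, [0, 1]
    return input_fn


def train_input_fn_returns_builder(training_dir, hyperparameters):
    # Suspected mistake: hands back the function itself, not its result.
    return make_input_fn([[1, 2], [3, 4]])


def train_input_fn_returns_data(training_dir, hyperparameters):
    # Calls the built function so that (features, labels) come back.
    return make_input_fn([[1, 2], [3, 4]])()


def run(train_input_fn):
    """Assumed framework behavior: call train_input_fn once, treat the result as data."""
    result = train_input_fn("/tmp/data", {})
    features = result[0] if isinstance(result, tuple) else result
    return linear_model_check(features)
```

If this is what is happening, `run(train_input_fn_returns_builder)` raises the same `ValueError` text as the log, while `run(train_input_fn_returns_data)` passes the check. Whether the real entry point follows this pattern would need to be confirmed against `sm_transcript_classifier_ep.py`.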


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-2a854a24dd88> in <module>()
     17                                })
     18 
---> 19 classifier.fit(inputs)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py in fit(self, inputs, wait, logs, job_name, run_tensorboard_locally)
    234                 tensorboard.event.set()
    235         else:
--> 236             fit_super()
    237 
    238     @classmethod

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py in fit_super()
    219         """
    220         def fit_super():
--> 221             super(TensorFlow, self).fit(inputs, wait, logs, job_name)
    222 
    223         if run_tensorboard_locally and wait is False:

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name)
    608         self._hyperparameters[JOB_NAME_PARAM_NAME] = self._current_job_name
    609         self._hyperparameters[SAGEMAKER_REGION_PARAM_NAME] = self.sagemaker_session.boto_session.region_name
--> 610         super(Framework, self).fit(inputs, wait, logs, self._current_job_name)
    611 
    612     def hyperparameters(self):

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name)
    163         self.latest_training_job = _TrainingJob.start_new(self, inputs)
    164         if wait:
--> 165             self.latest_training_job.wait(logs=logs)
    166 
    167     @classmethod

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
    396     def wait(self, logs=True):
    397         if logs:
--> 398             self.sagemaker_session.logs_for_job(self.job_name, wait=True)
    399         else:
    400             self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll)
    649 
    650         if wait:
--> 651             self._check_job_status(job_name, description)
    652             if dot:
    653                 print()

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc)
    393         if status != 'Completed':
    394             reason = desc.get('FailureReason', '(No reason provided)')
--> 395             raise ValueError('Error training {}: {} Reason: {}'.format(job, status, reason))
    396 
    397     def wait_for_endpoint(self, endpoint, poll=5):

ValueError: Error training sagemaker-tensorflow-2018-04-17-20-30-05-729: Failed Reason: AlgorithmError: uncaught exception during training: features should be a dictionary of `Tensor`s. Given type: <type 'function'>
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/container_support/training.py", line 38, in start
    fw.train()
  File "/usr/local/lib/python2.7/dist-packages/tf_container/train.py", line 139, in train
    train_wrapper.train()
  File "/usr/local/lib/python2.7/dist-packages/tf_container/trainer.py", line 73, in train
    tf.estimator.train_and_evaluate(estimator=estimator, train_spec=train_spec, eval_spec=eval_spec)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 421, in train_and_evaluate
    executor.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 522, in run
    getattr(self, task_to_run)()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 577, in run_master
    self._start_distributed_training(saving_liste
