Reference: 0411850995
System Information
- Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): TensorFlow, OpenNMT-tf
- Framework Version: 1.14
- Python Version: py3
- CPU or GPU: GPU
- Python SDK Version: latest
- Are you using a custom image: No (script mode training)
Describe the problem
I am using OpenNMT-tf for training. When the training reaches the export step, the model is not exported properly and the job fails with an "Internal server error".
Minimal repro / logs
Unfortunately, the logs contain nothing that points to the cause of the issue.
- Exact command to reproduce:
import sagemaker
from sagemaker.tensorflow import TensorFlow

# Script mode estimator; paths and job names are defined earlier in my notebook.
estimator = TensorFlow(entry_point=local_training_script_path,
                       dependencies=['model.py'],
                       train_instance_type='ml.p3.2xlarge',
                       train_instance_count=1,
                       checkpoint_s3_uri=checkpoint_path,
                       output_path=model_artifacts_location,
                       code_location=custom_code_upload_location,
                       role=sagemaker.get_execution_role(),
                       framework_version='1.14',
                       py_version='py3',
                       script_mode=True,
                       train_use_spot_instances=train_use_spot_instances,
                       train_max_run=train_max_run,
                       train_max_wait=train_max_wait,
                       train_volume_size=75)
estimator.fit(training_inputs_location, job_name=job_name_sagemaker, wait=True)
The entry-point shell script contains:
pip install OpenNMT-tf
onmt-main ...
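In case it helps, this is roughly how I searched CloudWatch for the training job's output. The helper below is my own sketch; it assumes the standard `/aws/sagemaker/TrainingJobs` log group and that stream names are prefixed with the job name:

```python
def training_log_streams(job_name, client=None):
    """List CloudWatch log stream names for a SageMaker training job.

    Sketch only: assumes the standard /aws/sagemaker/TrainingJobs log
    group, and that stream names are prefixed with "<job_name>/".
    """
    if client is None:
        import boto3  # assumes boto3 is installed and AWS credentials are configured
        client = boto3.client("logs")
    response = client.describe_log_streams(
        logGroupName="/aws/sagemaker/TrainingJobs",
        logStreamNamePrefix=job_name + "/",
    )
    return [stream["logStreamName"] for stream in response["logStreams"]]
```

The streams I get back for this job still show nothing useful around the export step.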
Please advise.