Pytorch deployment failing with unexpected errors #752

@carlomazzaferro

System Information

  • Framework: PyTorch
  • Framework Version: 1.0.0
  • Python Version: 3
  • CPU or GPU: CPU
  • Python SDK Version: 1.18.4
  • Are you using a custom image: No

Describe the problem

Model deployment fails with cryptic errors (see the logs below). The code used to deploy the model is the following:

MODEL_PATH = 's3:///sagemaker-us-east-2-971148336196/improved-ner-training-0-25-0/output/model.tar.gz'

MODEL_NAME = 'improved-ner-model-model-' + os.environ['ENVIRONMENT']
ENDPOINT_NAME = 'improved-ner-model-sagemaker-endpoint-' + os.environ['ENVIRONMENT']

DEPLOY_INSTANCE = 'ml.m5.large'

model = PyTorchModel(model_data=MODEL_PATH, role=ROLE, entry_point='train_model.py',
                     sagemaker_session=sm_session, py_version='py3', framework_version='1.0.0',
                     name=ENDPOINT_NAME)

model.deploy(initial_instance_count=1, instance_type=DEPLOY_INSTANCE, endpoint_name=ENDPOINT_NAME)

The model is publicly available here:
https://s3.us-east-2.amazonaws.com/sagemaker-us-east-2-971148336196/improved-ner-training-0-25-0/output/model.tar.gz

The archive contains a directory called flair, which holds the model file final-model.pt.

The (relevant) part of the train_model.py script is the following:

def model_fn(model_dir):
    f_out = os.path.join(model_dir, 'flair')
    m = SequenceTagger.load_from_file(os.path.join(f_out, 'final-model.pt'))
    return m

def input_fn(request_body, request_content_type):
    if request_content_type.lower() != 'application/json':
        raise ValueError('Content type must be application/json')

    if 'sentence' not in request_body:
        raise ValueError('Request must be JSON formatted with key: sentence')
    return request_body['sentence']


def predict_fn(input_data, model):
    return model.predict(input_data)


if __name__ == "__main__":
    args, _ = parse_args()

    flair_out = os.path.join(args.model_dir, 'flair')
    trainer(flair_out)  # This trains a model using flair.trainer.ModelTrainer

    model = SequenceTagger.load_from_file(os.path.join(flair_out, 'final-model.pt'))
    # create example sentence
    sentence = Sentence('I love Berlin')

    # predict tags and print
    model.predict(sentence)
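
A side note on input_fn above: if the serving container passes the request body as a serialized JSON string or bytes rather than a dict (an assumption, not confirmed by the logs), then the `'sentence' not in request_body` check is only a substring test and `request_body['sentence']` would raise a TypeError. A minimal sketch of a variant that decodes the body first, using only the standard json module:

```python
import json


def input_fn(request_body, request_content_type):
    """Decode a JSON request body and return the value stored under 'sentence'."""
    if request_content_type.lower() != 'application/json':
        raise ValueError('Content type must be application/json')

    # Assumption: the body arrives as serialized JSON (str or bytes), not a dict.
    if isinstance(request_body, (bytes, bytearray)):
        request_body = request_body.decode('utf-8')
    payload = json.loads(request_body)

    if not isinstance(payload, dict) or 'sentence' not in payload:
        raise ValueError('Request must be JSON formatted with key: sentence')
    return payload['sentence']
```

This can be exercised locally, without deploying, by calling input_fn('{"sentence": "I love Berlin"}', 'application/json').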

Minimal repro / logs

The CloudWatch logs are very opaque. One of the errors is the following:

sagemaker_containers._errors.ClientError: [Errno 30] Read-only file system: '/opt/ml/model/flair/final-model.pt'

Then, much later, these errors pop up:

Processing /opt/ml/code
Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: '/tmp/pip-req-tracker-27gca9by/35241637574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3'
You are using pip version 18.1, however version 19.0.3 is available.
[2019-04-12 03:49:26 +0000] [25] [ERROR] Error handling request /ping

Any ideas of what is actually causing the error, or some other steps to take to make it easier to debug?
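
One debugging step that might help isolate the first error: assuming it comes from the loader opening final-model.pt in a mode that needs write access while /opt/ml/model is mounted read-only, the artifacts could be copied to a writable temporary directory before loading. The sketch below uses only the standard library; `copy_to_writable_dir` is a hypothetical helper, not part of SageMaker or flair.

```python
import os
import shutil
import tempfile


def copy_to_writable_dir(model_dir, subdir='flair'):
    """Copy model_dir/subdir into a fresh temporary directory and return the new path."""
    src = os.path.join(model_dir, subdir)
    # mkdtemp creates a new directory that the current process can write to.
    dst = os.path.join(tempfile.mkdtemp(prefix='model-'), subdir)
    shutil.copytree(src, dst)
    return dst


# Hypothetical use inside model_fn (untested against the SageMaker container):
#     f_out = copy_to_writable_dir(model_dir)
#     m = SequenceTagger.load_from_file(os.path.join(f_out, 'final-model.pt'))
```

If the Read-only file system error disappears with this change, that would point at the model loader rather than at the deployment call itself.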
