PyTorch deployment failing with unexpected errors #752

### System Information
- **Framework**: PyTorch
- **Framework Version**: 1.0.0
- **Python Version**: 3
- **CPU or GPU**: CPU
- **Python SDK Version**:  1.18.4
- **Are you using a custom image**: No

### Describe the problem
Model deployment fails with cryptic errors (see the logs below). The code used to deploy the model is the following:

```
import os

from sagemaker.pytorch import PyTorchModel

MODEL_PATH = 's3://sagemaker-us-east-2-971148336196/improved-ner-training-0-25-0/output/model.tar.gz'

MODEL_NAME = 'improved-ner-model-model-' + os.environ['ENVIRONMENT']
ENDPOINT_NAME = 'improved-ner-model-sagemaker-endpoint-' + os.environ['ENVIRONMENT']

DEPLOY_INSTANCE = 'ml.m5.large'

model = PyTorchModel(model_data=MODEL_PATH, role=ROLE, entry_point='train_model.py',
                             sagemaker_session=sm_session, py_version='py3', framework_version='1.0.0',
                             name=ENDPOINT_NAME)

model.deploy(initial_instance_count=1, instance_type=DEPLOY_INSTANCE, endpoint_name=ENDPOINT_NAME)
```

The model is publicly available here: 
https://s3.us-east-2.amazonaws.com/sagemaker-us-east-2-971148336196/improved-ner-training-0-25-0/output/model.tar.gz

It contains a directory called `flair`, which contains the `final-model.pt` file.
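
For anyone who wants to double-check the archive layout after downloading it, here is a minimal sketch using only the standard `tarfile` module. The helper names `list_archive_members` and `has_member` are hypothetical, and the local file name `model.tar.gz` in the usage note is an assumption.

```python
import tarfile


def list_archive_members(archive_path):
    """Return the member names of a tar archive (compression is auto-detected)."""
    with tarfile.open(archive_path, "r:*") as archive:
        return archive.getnames()


def has_member(archive_path, member_name):
    """Check whether the archive contains the given relative path."""
    names = []
    for name in list_archive_members(archive_path):
        # Members are sometimes stored with a leading "./"; strip it before comparing.
        names.append(name[2:] if name.startswith("./") else name)
    return member_name in names
```

With the archive saved locally, `has_member('model.tar.gz', 'flair/final-model.pt')` would be expected to return `True` if the layout is as described.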

The (relevant) part of the `train_model.py` script is the following:


```
def model_fn(model_dir):
    f_out = os.path.join(model_dir, 'flair')
    m = SequenceTagger.load_from_file(os.path.join(f_out, 'final-model.pt'))
    return m

def input_fn(request_body, request_content_type):
    if request_content_type.lower() != 'application/json':
        raise ValueError('Content type must be application/json')

    if 'sentence' not in request_body:
        raise ValueError('Request must be JSON formatted with key: sentence')
    return request_body['sentence']


def predict_fn(input_data, model):
    return model.predict(input_data)


if __name__ == "__main__":
    args, _ = parse_args()

    flair_out = os.path.join(args.model_dir, 'flair')
    trainer(flair_out)  # This trains a model using flair.trainer.ModelTrainer

    model = SequenceTagger.load_from_file(os.path.join(flair_out, 'final-model.pt'))
    # create example sentence
    sentence = Sentence('I love Berlin')

    # predict tags and print
    model.predict(sentence)
```
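
Separately from the errors below, the `input_fn` above indexes `request_body` as if it were already a dictionary. A hedged sketch of a variant that parses the body first, assuming the serving container passes the raw JSON payload as `str` or `bytes` (I have not confirmed this against the container source):

```python
import json


def input_fn(request_body, request_content_type):
    if request_content_type.lower() != 'application/json':
        raise ValueError('Content type must be application/json')

    # Assumption: the body arrives serialized, as either bytes or str.
    if isinstance(request_body, (bytes, bytearray)):
        request_body = request_body.decode('utf-8')

    payload = json.loads(request_body)
    if not isinstance(payload, dict) or 'sentence' not in payload:
        raise ValueError('Request must be JSON formatted with key: sentence')
    return payload['sentence']
```

The returned string would presumably still need to be wrapped in a flair `Sentence` before calling `model.predict`, as in the `__main__` block.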


### Minimal repro / logs

The CloudWatch logs are very opaque. One of the errors is the following:

```
sagemaker_containers._errors.ClientError: [Errno 30] Read-only file system: '/opt/ml/model/flair/final-model.pt'
```

Then, much later, these errors pop up:

```
Processing /opt/ml/code
Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: '/tmp/pip-req-tracker-27gca9by/35241637574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3'
You are using pip version 18.1, however version 19.0.3 is available.
[2019-04-12 03:49:26 +0000] [25] [ERROR] Error handling request /ping
```

Any ideas about what is actually causing the error, or what other steps would make it easier to debug?
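
One step that might make the logs less opaque is wrapping the model load so that the file's state and the full traceback are logged before the error propagates. This is only a sketch: `load_with_diagnostics` is a hypothetical helper, and `loader` stands in for whatever call loads the model (for example `SequenceTagger.load_from_file` inside `model_fn`). It does not address the underlying cause.

```python
import logging
import os

logger = logging.getLogger(__name__)


def load_with_diagnostics(path, loader):
    """Call loader(path), logging file permissions first and the traceback on failure."""
    logger.info(
        "Loading %s (exists=%s, readable=%s, writable=%s)",
        path,
        os.path.exists(path),
        os.access(path, os.R_OK),
        os.access(path, os.W_OK),
    )
    try:
        return loader(path)
    except Exception:
        # logger.exception records the full stack trace alongside the message.
        logger.exception("Failed to load model from %s", path)
        raise
```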
