InternalServerError: We encountered an internal error - Platform Error: SageMaker pipe channel timed out #1316

@koles289

I've been working with SageMaker ObjectDetection, modifying it to work with AugmentedManifest files on my own data and adding some tuning parameters.
The problem is that when I launch training, I get this InternalServerError. When I looked at the log files the training job generated, I saw the platform error copied below:


timestamp message
1594116610922 Docker entrypoint called with argument(s): train
1594116615923 [07/07/2020 10:10:15 INFO 140635090528064] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/default-input.json: {u'label_width': u'350', u'early_stopping_min_epochs': u'10', u'epochs': u'30', u'overlap_threshold': u'0.5', u'lr_scheduler_factor': u'0.1', u'_num_kv_servers': u'auto', u'weight_decay': u'0.0005', u'mini_batch_size': u'32', u'use_pretrained_model': u'0', u'freeze_layer_pattern': u'', u'lr_scheduler_step': u'', u'early_stopping': u'False', u'early_stopping_patience': u'5', u'momentum': u'0.9', u'num_training_samples': u'', u'optimizer': u'sgd', u'_tuning_objective_metric': u'', u'early_stopping_tolerance': u'0.0', u'learning_rate': u'0.001', u'kv_store': u'device', u'nms_threshold': u'0.45', u'num_classes': u'', u'base_network': u'vgg-16', u'nms_topk': u'400', u'_kvstore': u'device', u'image_shape': u'300'}
1594116615923 [07/07/2020 10:10:15 INFO 140635090528064] Merging with provided configuration from /opt/ml/input/config/hyperparameters.json: {u'learning_rate': u'0.002181994762971979', u'epochs': u'30', u'nms_threshold': u'0.45', u'optimizer': u'adam', u'_tuning_objective_metric': u'validation:mAP', u'base_network': u'resnet-50', u'image_shape': u'512', u'label_width': u'150', u'lr_scheduler_step': u'10,20', u'momentum': u'0.9779938194841434', u'overlap_threshold': u'0.5', u'num_training_samples': u'682', u'mini_batch_size': u'35', u'weight_decay': u'0.980545413373939', u'use_pretrained_model': u'1', u'num_classes': u'2', u'lr_scheduler_factor': u'0.25'}
1594116615923 [07/07/2020 10:10:15 INFO 140635090528064] Final configuration: {u'label_width': u'150', u'early_stopping_min_epochs': u'10', u'epochs': u'30', u'overlap_threshold': u'0.5', u'lr_scheduler_factor': u'0.25', u'_num_kv_servers': u'auto', u'weight_decay': u'0.980545413373939', u'mini_batch_size': u'35', u'use_pretrained_model': u'1', u'freeze_layer_pattern': u'', u'lr_scheduler_step': u'10,20', u'early_stopping': u'False', u'early_stopping_patience': u'5', u'momentum': u'0.9779938194841434', u'num_training_samples': u'682', u'optimizer': u'adam', u'_tuning_objective_metric': u'validation:mAP', u'early_stopping_tolerance': u'0.0', u'learning_rate': u'0.002181994762971979', u'kv_store': u'device', u'nms_threshold': u'0.45', u'num_classes': u'2', u'base_network': u'resnet-50', u'nms_topk': u'400', u'_kvstore': u'device', u'image_shape': u'512'}
1594116615923 Process 1 is a worker.
1594116615923 [07/07/2020 10:10:15 INFO 140635090528064] Using default worker.
1594116615923 [07/07/2020 10:10:15 INFO 140635090528064] Loaded iterator creator application/x-image for content type ('application/x-image', '1.0')
1594116615923 [07/07/2020 10:10:15 INFO 140635090528064] Loaded iterator creator application/x-recordio for content type ('application/x-recordio', '1.0')
1594116615923 [07/07/2020 10:10:15 INFO 140635090528064] Loaded iterator creator image/png for content type ('image/png', '1.0')
1594116615923 [07/07/2020 10:10:15 INFO 140635090528064] Loaded iterator creator image/jpeg for content type ('image/jpeg', '1.0')
1594116615923 [07/07/2020 10:10:15 INFO 140635090528064] Checkpoint loading and saving are disabled.
1594116615923 [07/07/2020 10:10:15 INFO 140635090528064] The channel 'train' is in pipe input mode under /opt/ml/input/data/train.
1594116615923 [07/07/2020 10:10:15 INFO 140635090528064] The channel 'validation' is in pipe input mode under /opt/ml/input/data/validation.
1594116615923 [07/07/2020 10:10:15 INFO 140635090528064] nvidia-smi took: 0.1007168293 secs to identify 4 gpus
1594116615924 [07/07/2020 10:10:15 INFO 140635090528064] Number of GPUs being used: 4
1594116615924 [07/07/2020 10:10:15 INFO 140635090528064] The channel 'train' is in pipe input mode under /opt/ml/input/data/train.
1594116615924 [07/07/2020 10:10:15 INFO 140635090528064] The channel 'validation' is in pipe input mode under /opt/ml/input/data/validation.
1594116615924 [07/07/2020 10:10:15 WARNING 140635090528064] Training images are resized to image shape (3, 512, 512)
1594116615924 [10:10:15] /opt/brazil-pkg-cache/packages/AIApplicationsPipeIterators/AIApplicationsPipeIterators-1.0.1118.0/AL2012/generic-flavor/src/data_iter/src/ease_det_image_iter.cpp:41: ImageDetRecordIOParser: pipe:///opt/ml/input/data/train, use 31 threads for decoding..
1594116624926 [07/07/2020 10:10:24 WARNING 140635090528064] Validation images are resized to image shape (3, 512, 512)
1594116624927 [10:10:24] /opt/brazil-pkg-cache/packages/AIApplicationsPipeIterators/AIApplicationsPipeIterators-1.0.1118.0/AL2012/generic-flavor/src/data_iter/src/ease_det_image_iter.cpp:41: ImageDetRecordIOParser: pipe:///opt/ml/input/data/validation, use 31 threads for decoding..
1594116684945 /opt/amazon/lib/python2.7/site-packages/ai_algorithms_sdk/base/signals_lib.so(+0x2516) [0x7fe7dbed1516]
1594116684945 /opt/amazon/lib/python2.7/site-packages/ai_algorithms_sdk/base/signals_lib.so(+0x299a) [0x7fe7dbed199a]
1594116684945 /opt/amazon/lib/python2.7/site-packages/ai_algorithms_sdk/base/signals_lib.so(+0x2ba9) [0x7fe7dbed1ba9]
1594116684945 /opt/amazon/lib/libstdc++.so.6(+0x63156) [0x7fe80f984156]
1594116684945 /opt/amazon/lib/libstdc++.so.6(+0x61d19) [0x7fe80f982d19]
1594116684945 /opt/amazon/lib/libstdc++.so.6(__gxx_personality_v0+0xd2) [0x7fe80f9833e2]
1594116684945 /opt/amazon/lib/libgcc_s.so.1(+0x10560) [0x7fe80f500560]
1594116684945 /opt/amazon/lib/libgcc_s.so.1(_Unwind_Resume+0x59) [0x7fe80f500be9]
1594116684945 /opt/amazon/lib/libaialgsdataiter.so(_ZN6aialgs2io17PipeChannelReader14ReadNextStringEv+0x1e3) [0x7fe826cd3d33]
1594116684945 /opt/amazon/lib/libaialgsdataiter.so(_ZZN6aialgs2io16EASEDetImageIterIfE22ReadNextVecFeatureListERSt6vectorINS0_11FeatureListESaIS4_EEENKUlvE_clEv+0x1b6) [0x7fe826ceecb6]
1594116684945 /opt/amazon/lib/libaialgsdataiter.so(+0x8809c) [0x7fe826cec09c]
1594116684945 /opt/amazon/lib/libgomp.so.1(+0xf6e6) [0x7fe80f7176e6]
1594116684945 /lib64/libpthread.so.0(+0x7dc5) [0x7fe8285a7dc5]
1594116684945 /lib64/libc.so.6(clone+0x6d) [0x7fe8279a46ed]
1594116684945 Platform Error: SageMaker pipe channel timed out.
1594116684945 [10:11:24] /opt/brazil-pkg-cache/packages/AIApplicationsPipeIterators/AIApplicationsPipeIterators-1.0.1118.0/AL2012/generic-flavor/src/data_iter/src/channel.cpp:124: (Platform Error) FIFO 0 of the SageMaker pipe channel '/opt/ml/input/data/validation' timed out.
1594116684945 Stack trace returned 8 entries:
1594116684946 [bt] (0) /opt/amazon/lib/libaialgsdataiter.so(dmlc::StackTrace()+0x3d) [0x7fe826cbbd3d]
1594116684946 [bt] (1) /opt/amazon/lib/libaialgsdataiter.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x1a) [0x7fe826cbbfda]
1594116684946 [bt] (2) /opt/amazon/lib/libaialgsdataiter.so(aialgs::io::PipeChannelReader::ReadNextString()+0x26f) [0x7fe826cd3dbf]
1594116684946 [bt] (3) /opt/amazon/lib/libaialgsdataiter.so(aialgs::io::EASEDetImageIter<float>::ReadNextVecFeatureList(std::vector<aialgs::io::FeatureList, std::allocator<aialgs::io::FeatureList> >&)::{lambda()#1}::operator()() const+0x1b6) [0x7fe826ceecb6]
1594116684946 [bt] (4) /opt/amazon/lib/libaialgsdataiter.so(+0x8809c) [0x7fe826cec09c]
1594116684946 [bt] (5) /opt/amazon/lib/libgomp.so.1(+0xf6e6) [0x7fe80f7176e6]
1594116684946 [bt] (6) /lib64/libpthread.so.0(+0x7dc5) [0x7fe8285a7dc5]
1594116684946 [bt] (7) /lib64/libc.so.6(clone+0x6d) [0x7fe8279a46ed]
1594116684946 Platform Error: SageMaker pipe channel timed out.
1594116684946 [10:11:24] /opt/brazil-pkg-cache/packages/AIApplicationsPipeIterators/AIApplicationsPipeIterators-1.0.1118.0/AL2012/generic-flavor/src/data_iter/src/channel.cpp:124: (Platform Error) FIFO 0 of the SageMaker pipe channel '/opt/ml/input/data/validation' timed out.
1594116684946 Stack trace returned 8 entries:
1594116684946 [bt] (0) /opt/amazon/lib/libaialgsdataiter.so(dmlc::StackTrace()+0x3d) [0x7fe826cbbd3d]
1594116684946 [bt] (1) /opt/amazon/lib/libaialgsdataiter.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x1a) [0x7fe826cbbfda]
1594116684946 [bt] (2) /opt/amazon/lib/libaialgsdataiter.so(aialgs::io::PipeChannelReader::ReadNextString()+0x26f) [0x7fe826cd3dbf]
1594116684946 [bt] (3) /opt/amazon/lib/libaialgsdataiter.so(aialgs::io::EASEDetImageIter<float>::ReadNextVecFeatureList(std::vector<aialgs::io::FeatureList, std::allocator<aialgs::io::FeatureList> >&)::{lambda()#1}::operator()() const+0x1b6) [0x7fe826ceecb6]
1594116684946 [bt] (4) /opt/amazon/lib/libaialgsdataiter.so(+0x8809c) [0x7fe826cec09c]
1594116684946 [bt] (5) /opt/amazon/lib/libgomp.so.1(+0xf6e6) [0x7fe80f7176e6]
1594116684946 [bt] (6) /lib64/libpthread.so.0(+0x7dc5) [0x7fe8285a7dc5]
1594116684946 [bt] (7) /lib64/libc.so.6(clone+0x6d) [0x7fe8279a46ed]
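
One detail that can be read off the log above: the reader for the validation pipe starts at 10:10:24 and the timeout is reported at 10:11:24, roughly 60 seconds later. The following is a minimal sketch in Python that computes this gap from the two epoch-millisecond timestamps copied from the log. The constant and function names are illustrative only, and the sketch makes no claim about why the pipe produced no data in that window.

```python
from datetime import datetime, timezone

# Epoch-millisecond timestamps copied verbatim from the log above.
# The constant names are illustrative, not part of any SageMaker API.
VALIDATION_READER_START_MS = 1594116624927  # ImageDetRecordIOParser opens the validation pipe
PIPE_TIMEOUT_MS = 1594116684945             # first entries of the timeout stack trace


def to_utc(ms: int) -> datetime:
    """Convert an epoch-millisecond timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def gap_seconds(start_ms: int, end_ms: int) -> float:
    """Return the elapsed time between two epoch-millisecond timestamps, in seconds."""
    return (end_ms - start_ms) / 1000


if __name__ == "__main__":
    print("reader started:", to_utc(VALIDATION_READER_START_MS).isoformat())
    print("timeout logged:", to_utc(PIPE_TIMEOUT_MS).isoformat())
    print("gap in seconds:", gap_seconds(VALIDATION_READER_START_MS, PIPE_TIMEOUT_MS))
```

The gap of about one minute suggests that no record arrived on the validation channel's FIFO between the reader starting and the timeout being raised, which matches the channel named in the error message.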

I am quite new to AWS, so any help would be appreciated.

Thank you
