Closed
Description
I've been working with SageMaker ObjectDetection, modifying it to work with AugmentedManifest files on my own data and adding some tuning parameters.
The problem is that when I launch training, I get an InternalServerError. The log files generated by the training job show a platform error, which I have copied below:
| timestamp | message |
|---|---|
| 1594116610922 | Docker entrypoint called with argument(s): train |
| 1594116615923 | [07/07/2020 10:10:15 INFO 140635090528064] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/default-input.json: {u'label_width': u'350', u'early_stopping_min_epochs': u'10', u'epochs': u'30', u'overlap_threshold': u'0.5', u'lr_scheduler_factor': u'0.1', u'_num_kv_servers': u'auto', u'weight_decay': u'0.0005', u'mini_batch_size': u'32', u'use_pretrained_model': u'0', u'freeze_layer_pattern': u'', u'lr_scheduler_step': u'', u'early_stopping': u'False', u'early_stopping_patience': u'5', u'momentum': u'0.9', u'num_training_samples': u'', u'optimizer': u'sgd', u'_tuning_objective_metric': u'', u'early_stopping_tolerance': u'0.0', u'learning_rate': u'0.001', u'kv_store': u'device', u'nms_threshold': u'0.45', u'num_classes': u'', u'base_network': u'vgg-16', u'nms_topk': u'400', u'_kvstore': u'device', u'image_shape': u'300'} |
| 1594116615923 | [07/07/2020 10:10:15 INFO 140635090528064] Merging with provided configuration from /opt/ml/input/config/hyperparameters.json: {u'learning_rate': u'0.002181994762971979', u'epochs': u'30', u'nms_threshold': u'0.45', u'optimizer': u'adam', u'_tuning_objective_metric': u'validation:mAP', u'base_network': u'resnet-50', u'image_shape': u'512', u'label_width': u'150', u'lr_scheduler_step': u'10,20', u'momentum': u'0.9779938194841434', u'overlap_threshold': u'0.5', u'num_training_samples': u'682', u'mini_batch_size': u'35', u'weight_decay': u'0.980545413373939', u'use_pretrained_model': u'1', u'num_classes': u'2', u'lr_scheduler_factor': u'0.25'} |
| 1594116615923 | [07/07/2020 10:10:15 INFO 140635090528064] Final configuration: {u'label_width': u'150', u'early_stopping_min_epochs': u'10', u'epochs': u'30', u'overlap_threshold': u'0.5', u'lr_scheduler_factor': u'0.25', u'_num_kv_servers': u'auto', u'weight_decay': u'0.980545413373939', u'mini_batch_size': u'35', u'use_pretrained_model': u'1', u'freeze_layer_pattern': u'', u'lr_scheduler_step': u'10,20', u'early_stopping': u'False', u'early_stopping_patience': u'5', u'momentum': u'0.9779938194841434', u'num_training_samples': u'682', u'optimizer': u'adam', u'_tuning_objective_metric': u'validation:mAP', u'early_stopping_tolerance': u'0.0', u'learning_rate': u'0.002181994762971979', u'kv_store': u'device', u'nms_threshold': u'0.45', u'num_classes': u'2', u'base_network': u'resnet-50', u'nms_topk': u'400', u'_kvstore': u'device', u'image_shape': u'512'} |
| 1594116615923 | Process 1 is a worker. |
| 1594116615923 | [07/07/2020 10:10:15 INFO 140635090528064] Using default worker. |
| 1594116615923 | [07/07/2020 10:10:15 INFO 140635090528064] Loaded iterator creator application/x-image for content type ('application/x-image', '1.0') |
| 1594116615923 | [07/07/2020 10:10:15 INFO 140635090528064] Loaded iterator creator application/x-recordio for content type ('application/x-recordio', '1.0') |
| 1594116615923 | [07/07/2020 10:10:15 INFO 140635090528064] Loaded iterator creator image/png for content type ('image/png', '1.0') |
| 1594116615923 | [07/07/2020 10:10:15 INFO 140635090528064] Loaded iterator creator image/jpeg for content type ('image/jpeg', '1.0') |
| 1594116615923 | [07/07/2020 10:10:15 INFO 140635090528064] Checkpoint loading and saving are disabled. |
| 1594116615923 | [07/07/2020 10:10:15 INFO 140635090528064] The channel 'train' is in pipe input mode under /opt/ml/input/data/train. |
| 1594116615923 | [07/07/2020 10:10:15 INFO 140635090528064] The channel 'validation' is in pipe input mode under /opt/ml/input/data/validation. |
| 1594116615923 | [07/07/2020 10:10:15 INFO 140635090528064] nvidia-smi took: 0.1007168293 secs to identify 4 gpus |
| 1594116615924 | [07/07/2020 10:10:15 INFO 140635090528064] Number of GPUs being used: 4 |
| 1594116615924 | [07/07/2020 10:10:15 INFO 140635090528064] The channel 'train' is in pipe input mode under /opt/ml/input/data/train. |
| 1594116615924 | [07/07/2020 10:10:15 INFO 140635090528064] The channel 'validation' is in pipe input mode under /opt/ml/input/data/validation. |
| 1594116615924 | [07/07/2020 10:10:15 WARNING 140635090528064] Training images are resized to image shape (3, 512, 512) |
| 1594116615924 | [10:10:15] /opt/brazil-pkg-cache/packages/AIApplicationsPipeIterators/AIApplicationsPipeIterators-1.0.1118.0/AL2012/generic-flavor/src/data_iter/src/ease_det_image_iter.cpp:41: ImageDetRecordIOParser: pipe:///opt/ml/input/data/train, use 31 threads for decoding.. |
| 1594116624926 | [07/07/2020 10:10:24 WARNING 140635090528064] Validation images are resized to image shape (3, 512, 512) |
| 1594116624927 | [10:10:24] /opt/brazil-pkg-cache/packages/AIApplicationsPipeIterators/AIApplicationsPipeIterators-1.0.1118.0/AL2012/generic-flavor/src/data_iter/src/ease_det_image_iter.cpp:41: ImageDetRecordIOParser: pipe:///opt/ml/input/data/validation, use 31 threads for decoding.. |
| 1594116684945 | /opt/amazon/lib/python2.7/site-packages/ai_algorithms_sdk/base/signals_lib.so(+0x2516) [0x7fe7dbed1516] |
| 1594116684945 | /opt/amazon/lib/python2.7/site-packages/ai_algorithms_sdk/base/signals_lib.so(+0x299a) [0x7fe7dbed199a] |
| 1594116684945 | /opt/amazon/lib/python2.7/site-packages/ai_algorithms_sdk/base/signals_lib.so(+0x2ba9) [0x7fe7dbed1ba9] |
| 1594116684945 | /opt/amazon/lib/libstdc++.so.6(+0x63156) [0x7fe80f984156] |
| 1594116684945 | /opt/amazon/lib/libstdc++.so.6(+0x61d19) [0x7fe80f982d19] |
| 1594116684945 | /opt/amazon/lib/libstdc++.so.6(__gxx_personality_v0+0xd2) [0x7fe80f9833e2] |
| 1594116684945 | /opt/amazon/lib/libgcc_s.so.1(+0x10560) [0x7fe80f500560] |
| 1594116684945 | /opt/amazon/lib/libgcc_s.so.1(_Unwind_Resume+0x59) [0x7fe80f500be9] |
| 1594116684945 | /opt/amazon/lib/libaialgsdataiter.so(_ZN6aialgs2io17PipeChannelReader14ReadNextStringEv+0x1e3) [0x7fe826cd3d33] |
| 1594116684945 | /opt/amazon/lib/libaialgsdataiter.so(_ZZN6aialgs2io16EASEDetImageIterIfE22ReadNextVecFeatureListERSt6vectorINS0_11FeatureListESaIS4_EEENKUlvE_clEv+0x1b6) [0x7fe826ceecb6] |
| 1594116684945 | /opt/amazon/lib/libaialgsdataiter.so(+0x8809c) [0x7fe826cec09c] |
| 1594116684945 | /opt/amazon/lib/libgomp.so.1(+0xf6e6) [0x7fe80f7176e6] |
| 1594116684945 | /lib64/libpthread.so.0(+0x7dc5) [0x7fe8285a7dc5] |
| 1594116684945 | /lib64/libc.so.6(clone+0x6d) [0x7fe8279a46ed] |
| 1594116684945 | Platform Error: SageMaker pipe channel timed out. |
| 1594116684945 | [10:11:24] /opt/brazil-pkg-cache/packages/AIApplicationsPipeIterators/AIApplicationsPipeIterators-1.0.1118.0/AL2012/generic-flavor/src/data_iter/src/channel.cpp:124: (Platform Error) FIFO 0 of the SageMaker pipe channel '/opt/ml/input/data/validation' timed out. |
| 1594116684945 | Stack trace returned 8 entries: |
| 1594116684946 | [bt] (0) /opt/amazon/lib/libaialgsdataiter.so(dmlc::StackTrace()+0x3d) [0x7fe826cbbd3d] |
| 1594116684946 | [bt] (1) /opt/amazon/lib/libaialgsdataiter.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x1a) [0x7fe826cbbfda] |
| 1594116684946 | [bt] (2) /opt/amazon/lib/libaialgsdataiter.so(aialgs::io::PipeChannelReader::ReadNextString()+0x26f) [0x7fe826cd3dbf] |
| 1594116684946 | [bt] (3) /opt/amazon/lib/libaialgsdataiter.so(aialgs::io::EASEDetImageIter::ReadNextVecFeatureList(std::vector<aialgs::io::FeatureList, std::allocator<aialgs::io::FeatureList> >&)::{lambda()#1}::operator()() const+0x1b6) [0x7fe826ceecb6] |
| 1594116684946 | [bt] (4) /opt/amazon/lib/libaialgsdataiter.so(+0x8809c) [0x7fe826cec09c] |
| 1594116684946 | [bt] (5) /opt/amazon/lib/libgomp.so.1(+0xf6e6) [0x7fe80f7176e6] |
| 1594116684946 | [bt] (6) /lib64/libpthread.so.0(+0x7dc5) [0x7fe8285a7dc5] |
| 1594116684946 | [bt] (7) /lib64/libc.so.6(clone+0x6d) [0x7fe8279a46ed] |
| 1594116684946 | Platform Error: SageMaker pipe channel timed out. |
| 1594116684946 | [10:11:24] /opt/brazil-pkg-cache/packages/AIApplicationsPipeIterators/AIApplicationsPipeIterators-1.0.1118.0/AL2012/generic-flavor/src/data_iter/src/channel.cpp:124: (Platform Error) FIFO 0 of the SageMaker pipe channel '/opt/ml/input/data/validation' timed out. |
| 1594116684946 | Stack trace returned 8 entries: |
| 1594116684946 | [bt] (0) /opt/amazon/lib/libaialgsdataiter.so(dmlc::StackTrace()+0x3d) [0x7fe826cbbd3d] |
| 1594116684946 | [bt] (1) /opt/amazon/lib/libaialgsdataiter.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x1a) [0x7fe826cbbfda] |
| 1594116684946 | [bt] (2) /opt/amazon/lib/libaialgsdataiter.so(aialgs::io::PipeChannelReader::ReadNextString()+0x26f) [0x7fe826cd3dbf] |
| 1594116684946 | [bt] (3) /opt/amazon/lib/libaialgsdataiter.so(aialgs::io::EASEDetImageIter::ReadNextVecFeatureList(std::vector<aialgs::io::FeatureList, std::allocator<aialgs::io::FeatureList> >&)::{lambda()#1}::operator()() const+0x1b6) [0x7fe826ceecb6] |
| 1594116684946 | [bt] (4) /opt/amazon/lib/libaialgsdataiter.so(+0x8809c) [0x7fe826cec09c] |
| 1594116684946 | [bt] (5) /opt/amazon/lib/libgomp.so.1(+0xf6e6) [0x7fe80f7176e6] |
| 1594116684946 | [bt] (6) /lib64/libpthread.so.0(+0x7dc5) [0x7fe8285a7dc5] |
| 1594116684946 | [bt] (7) /lib64/libc.so.6(clone+0x6d) [0x7fe8279a46ed] |
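
For context, the log shows both the 'train' and 'validation' channels in pipe input mode. Below is a minimal sketch of how such channels are usually described in the `InputDataConfig` of a `CreateTrainingJob` request when the data comes from augmented manifest files. It is not the configuration of the failing job: the helper `augmented_manifest_channel`, the bucket and manifest URIs, and the attribute names `source-ref` and `bounding-box` are placeholders chosen for illustration.

```python
def augmented_manifest_channel(name, manifest_uri, attribute_names):
    """Build one InputDataConfig entry for a pipe-mode augmented manifest.

    Hypothetical helper: it only assembles a dictionary and makes no AWS calls.
    """
    return {
        "ChannelName": name,
        # Augmented manifest files are streamed to the container in pipe mode.
        "InputMode": "Pipe",
        "ContentType": "application/x-recordio",
        "RecordWrapperType": "RecordIO",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "AugmentedManifestFile",
                "S3Uri": manifest_uri,
                "S3DataDistributionType": "FullyReplicated",
                # JSON keys to read from each manifest line (placeholders here).
                "AttributeNames": list(attribute_names),
            }
        },
    }


# Placeholder attribute names and S3 locations, not taken from the issue.
ATTRIBUTES = ["source-ref", "bounding-box"]

input_data_config = [
    augmented_manifest_channel(
        "train", "s3://example-bucket/train.manifest", ATTRIBUTES
    ),
    augmented_manifest_channel(
        "validation", "s3://example-bucket/validation.manifest", ATTRIBUTES
    ),
]
```

The resulting list could be passed as `InputDataConfig` to a training job request. As far as I understand, the names in `AttributeNames` need to match keys present in the manifest lines, so that is one thing worth comparing against the actual files.
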
I am quite new to AWS, so any help would be appreciated.
Thank you.