
DataParallel needs to work with apex #269

Description

@seongwook-ham

Similar to #227.
I have already checked that DistributedDataParallel works well.
But in my case the dataset is large (>200 GB) and I am using multiple GPUs (8).
Since each DistributedDataParallel process loads its own copy of the dataset, 8 processes would need at least 8 × 200 GB = 1.6 TB of RAM. Am I right?
I only have 512 GB of RAM, so I need to use DataParallel instead.
With the same code, DataParallel without apex works normally, and DistributedDataParallel with apex also works normally.
But DataParallel with apex throws the following error:
```
Traceback (most recent call last):
  File "run_pretrain_amp.py", line 1140, in <module>
    main()
  File "run_pretrain_amp.py", line 1038, in main
    loss = model(masked_input_ids, segment_ids, input_mask, masked_lm_labels, label_ids)
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/apex/amp/_initialize.py", line 193, in new_fwd
    **applier(kwargs, input_caster))
  File "/home/kizunasunhy/bert_seongwook_v1/modeling.py", line 713, in forward
    output_all_encoded_layers=False)
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kizunasunhy/bert_seongwook_v1/modeling.py", line 641, in forward
    embedding_output = self.embeddings(input_ids, token_type_ids)
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kizunasunhy/bert_seongwook_v1/modeling.py", line 208, in forward
    words_embeddings = self.word_embeddings(input_ids)
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 118, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/kizunasunhy/.conda/envs/temp1/lib/python3.6/site-packages/torch/nn/functional.py", line 1454, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1549630534704/work/aten/src/THC/generic/THCTensorIndex.cu:519
```
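
For context, the failing op is the word-embedding lookup inside the replicated module. Below is a minimal sketch of the call order I am using; the tiny model, vocabulary size, and `opt_level` are illustrative placeholders rather than my actual pretraining code, which follows the apex README pattern of calling `amp.initialize` before wrapping the model:

```python
import torch
import torch.nn as nn
from apex import amp

# Tiny stand-in for the BERT model in modeling.py: the traceback fails in
# the word-embedding lookup, so an Embedding layer is enough to show the pattern.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.word_embeddings = nn.Embedding(30522, 768)

    def forward(self, input_ids):
        return self.word_embeddings(input_ids).mean()

model = TinyModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# amp.initialize is called first, as in the apex examples.
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

# Then the model is wrapped for multi-GPU use. Wrapping with
# DistributedDataParallel works for me; wrapping with nn.DataParallel
# raises "arguments are located on different GPUs" in the embedding lookup.
model = nn.DataParallel(model)

input_ids = torch.randint(0, 30522, (8, 128)).cuda()
loss = model(input_ids)  # <-- fails here with apex + DataParallel
```

With the same sketch, replacing `nn.DataParallel` by DistributedDataParallel (one process per GPU) runs fine, which matches what I see with the full pretraining script.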
