Fix device mismatch bug in T5 implementation #1944

joecummings · 2022-10-13T15:01:28Z

Error

[[cure_business]TorchScriptTrain](https://www.internalfb.com/fblearner/details/379462033/operator/4373394157?tab) ran for 4 mins 9 s (Try #3):
    return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "/tmp/jetter.zjtu55hv/pytorch_lightning/plugins/precision/precision_plugin.py", line 153, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/tmp/jetter.zjtu55hv/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/tmp/jetter.zjtu55hv/fblearner/flow/projects/fluent2/definition/transformers/ecg/huggingface_transformers_4_6/optimization.py", line 368, in step
    loss = closure()
  File "/tmp/jetter.zjtu55hv/pytorch_lightning/plugins/precision/precision_plugin.py", line 138, in _wrap_closure
    closure_result = closure()
  File "/tmp/jetter.zjtu55hv/pytorch_lightning/loops/optimization/optimizer_loop.py", line 148, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/tmp/jetter.zjtu55hv/pytorch_lightning/loops/optimization/optimizer_loop.py", line 134, in closure
    step_output = self._step_fn()
  File "/tmp/jetter.zjtu55hv/pytorch_lightning/loops/optimization/optimizer_loop.py", line 422, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *step_kwargs.values())
  File "/tmp/jetter.zjtu55hv/pytorch_lightning/trainer/trainer.py", line 1752, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/tmp/jetter.zjtu55hv/pytorch_lightning/strategies/strategy.py", line 340, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/tmp/jetter.zjtu55hv/fblearner/flow/projects/fluent2/definition/transformers/ecg/ecg_two_tower.py", line 340, in training_step
    loss, _ = self.train_eval_batch(batch)
  File "/tmp/jetter.zjtu55hv/fblearner/flow/projects/fluent2/definition/transformers/ecg/ecg_two_tower.py", line 312, in train_eval_batch
    embeddings_a = self.model(**model_inputs_a)
  File "/tmp/jetter.zjtu55hv/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/tmp/jetter.zjtu55hv/fblearner/flow/projects/fluent2/definition/transformers/ecg/t5_sentence_embeddings.py", line 135, in forward
    model_output = self.model(
  File "/tmp/jetter.zjtu55hv/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/tmp/jetter.zjtu55hv/torchtext/prototype/models/t5/model.py", line 173, in forward
    encoder_output, encoder_hidden_states, encoder_position_bias, encoder_sa = self.encoder(
  File "/tmp/jetter.zjtu55hv/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/tmp/jetter.zjtu55hv/torchtext/prototype/models/t5/modules.py", line 865, in forward
    output, position_bias, sa_score = mod(
  File "/tmp/jetter.zjtu55hv/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/tmp/jetter.zjtu55hv/torchtext/prototype/models/t5/modules.py", line 616, in forward
    sa_out, position_bias, sa_scores = self._sa_block(self.norm1(x), tgt_mask, tgt_key_padding_mask, position_bias)
  File "/tmp/jetter.zjtu55hv/torchtext/prototype/models/t5/modules.py", line 630, in _sa_block
    attn = self.self_attn(
  File "/tmp/jetter.zjtu55hv/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/tmp/jetter.zjtu55hv/torchtext/prototype/models/t5/modules.py", line 132, in forward
    attn_output, position_bias, attn_output_weights = self._t5_multi_head_attention_forward(
  File "/tmp/jetter.zjtu55hv/torchtext/prototype/models/t5/modules.py", line 257, in _t5_multi_head_attention_forward
    position_bias = self._compute_bias(
  File "/tmp/jetter.zjtu55hv/torchtext/prototype/models/t5/modules.py", line 420, in _compute_bias
    relative_position_bucket = self._relative_position_bucket(
  File "/tmp/jetter.zjtu55hv/torchtext/prototype/models/t5/modules.py", line 454, in _relative_position_bucket
    relative_buckets += (relative_position > 0).to(torch.long) * num_buckets
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Fix & Context

relative_buckets is a Tensor created without specifying a device, so it is placed on the CPU by default. The other input Tensors are on CUDA, so the in-place addition in _relative_position_bucket fails with a device mismatch. Discovered while working on the AI for CS workflow.
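For readers outside the thread, here is a minimal sketch of the pattern described above. It is not the torchtext code: `sign_bucket_offset` is a hypothetical, simplified stand-in for the step in the traceback, and the default of 32 buckets is an assumption for illustration. The point is only that allocating the tensor on `relative_position.device` keeps the in-place addition on a single device.

```python
import torch


def sign_bucket_offset(relative_position: torch.Tensor, num_buckets: int = 32) -> torch.Tensor:
    """Hypothetical, simplified stand-in for the failing step in the traceback."""
    # Allocate on the input's device. Omitting `device=` would place this
    # tensor on the default device (normally CPU) and trigger the mismatch
    # when `relative_position` lives on CUDA.
    relative_buckets = torch.zeros(
        relative_position.shape, dtype=torch.long, device=relative_position.device
    )
    half = num_buckets // 2
    # Positive relative positions are shifted into the upper half of the buckets.
    relative_buckets += (relative_position > 0).to(torch.long) * half
    return relative_buckets


# Runs on CUDA when available, otherwise on CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
context = torch.arange(4, device=device)[:, None]
memory = torch.arange(4, device=device)[None, :]
relative_position = memory - context  # shape (4, 4)
offsets = sign_bucket_offset(relative_position)
```

On a CPU-only machine the unfixed pattern also happens to work, which may be why the existing tests did not catch it; the mismatch only appears once the inputs are moved to a GPU.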

Testing

Tested with Fluent2 and a Bento notebook; the existing tests here pass. Do we have any integration tests with CUDA that we could run in OSS to check this?

Nayef211

Thanks for the fix @joecummings

facebook-github-bot added the cla signed label Oct 13, 2022

joecummings requested review from Nayef211 and abhinavarora October 13, 2022 15:01

joecummings added the bug label Oct 13, 2022

Nayef211 approved these changes Oct 14, 2022

joecummings added 5 commits October 14, 2022 17:25

Move relative_buckets Tensor to same device as relative_position

104c4f8

Update code pointer comments

eb0073f

Reference self.device from within MultiHeadedAttention private methods

bf9e1f1

Remove faulty call with device to t5 forward method

1ef8661

Add device to Attention obj

99b0872

joecummings force-pushed the device-mismatch-butg branch from 882e48b to 99b0872 Compare October 14, 2022 21:25

joecummings merged commit 4570a56 into pytorch:main Oct 17, 2022

joecummings deleted the device-mismatch-butg branch October 17, 2022 17:57
