
Conversation

@wconstab (Contributor) commented May 10, 2024

Stack from ghstack (oldest at bottom):

Unchanged: we precompute freqs_cis for max_seqlen, which is >> seqlen for any given
batch.

Changed: instead of slicing self.freqs_cis down to seqlen at the top-level
transformer based on the input token shape, we slice it down to seqlen
inside the transformer layer, after the activation has been re-expanded to the
full seqlen in cases where TP has sharded across seqlen.

In the PP case, stage 1's input may be seqlen/TP instead of seqlen, but
we do not generally know this, which makes it hard for stage 1 to slice
freqs_cis correctly. It is easy to do the slicing deeper inside, since
at that point the full seqlen is known unambiguously.

Note: the full self.freqs_cis is stored in memory either way, and what gets
passed into every layer is just a view, so this change should not be
material for memory usage or otherwise. A rough sketch of the idea follows.
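
Below is a minimal, hypothetical sketch of where the slice moves, using assumed
llama-style names (Transformer, TransformerBlock, freqs_cis); it is not the exact
torchtitan code.

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    def forward(self, x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
        # Inside a layer, x has already been re-expanded to the full seqlen
        # (e.g. after TP gathers the sequence dimension), so the slice below
        # is unambiguous even when the top-level input was sharded.
        seqlen = x.shape[1]
        freqs_cis = freqs_cis[:seqlen]  # just a view of the precomputed table
        # ... apply rotary embeddings / attention / FFN using freqs_cis ...
        return x


class Transformer(nn.Module):
    def __init__(self, max_seqlen: int, dim: int, n_layers: int):
        super().__init__()
        # Precomputed once for max_seqlen, which is >> seqlen for any batch.
        # (Placeholder values here; the real table holds rotary frequencies.)
        self.register_buffer("freqs_cis", torch.randn(max_seqlen, dim))
        self.layers = nn.ModuleList(TransformerBlock() for _ in range(n_layers))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Previously: freqs_cis = self.freqs_cis[: h.shape[1]] here, which is
        # wrong when h arrives already sharded (seqlen/TP) on a PP stage.
        for layer in self.layers:
            h = layer(h, self.freqs_cis)  # pass the full table; layers slice it
        return h
```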

[ghstack-poisoned]
wconstab added a commit that referenced this pull request May 10, 2024
ghstack-source-id: 20ef05e
Pull Request resolved: #321
@facebook-github-bot added the CLA Signed label May 10, 2024
@lessw2020 (Contributor) left a comment:

makes sense - lgtm!

@wanchaol (Collaborator) left a comment:

lgtm!

@wconstab merged commit 231ebc1 into gh/wconstab/13/base May 13, 2024
wconstab added a commit that referenced this pull request May 13, 2024
@wconstab deleted the gh/wconstab/13/head branch May 13, 2024 21:46
torch.Tensor: Reshaped frequency tensor.
"""
ndim = x.ndim
assert 0 <= 1 < ndim

A Collaborator commented on this line:

not from this PR: I wonder what the point of the 0 <= 1 part is 😃 .

@wconstab (Contributor, Author) replied:

lol. It's always good to check your assumptions
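
(Aside, a hedged note for readers of the thread: Python chains comparisons, so the `0 <= 1` half of that assert is vacuously true and the check effectively reduces to `ndim > 1`.)

```python
import torch

x = torch.zeros(2, 3)
ndim = x.ndim
# Chained comparison: `0 <= 1 < ndim` means `(0 <= 1) and (1 < ndim)`,
# and `0 <= 1` is always true, so only `ndim > 1` is actually checked.
assert 0 <= 1 < ndim
assert ndim > 1  # equivalent in effect
```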

tianyu-l pushed a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Aug 16, 2024
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024