Conversation

@ad8e ad8e commented Apr 26, 2024

@facebook-github-bot

Hi @ad8e!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!


ad8e commented Apr 26, 2024

Probably I won't end up signing the CLA, for laziness reasons; my PyTorch CLA unfortunately doesn't apply here. Someone else can pick up these changes and submit them.

@wanchaol wanchaol requested a review from tianyu-l April 26, 2024 00:11
@wanchaol
Collaborator

> Probably I won't end up signing the CLA, for laziness reasons; my PyTorch CLA unfortunately doesn't apply here. Someone else can pick up these changes and submit them.

@ad8e Thanks for the PR! Sounds good. @tianyu-l, could you take a look at this and help submit the changes?

@tianyu-l
Contributor

> Excludes vocab embedding from FLOPS.

Thank you @ad8e for helping improve torchtitan!

May I ask why we should exclude vocab embedding from FLOPS computation? It seems to me that the embedding layer is involved in both the forward and backward computations.

Thanks!


ad8e commented Apr 26, 2024

Backward of the embedding: you're right, the vocab layer must be counted there, at FLOPS = 2 × vocab embedding params. The backward pass must compute the gradient for the vocab weights, but not the gradient with respect to the inputs.

Forward of the embedding: it acts as a lookup table, so the FLOPS are 0 (or equal to the hidden dimension, if you want to count the memory bandwidth).

So I would instead suggest 6 * num_params - 4 * v * d + 7 * l * h * q * t as my corrected formula.

EDIT: Actually, perhaps the vocab backward can also use the lookup-table method and skip the matmul, in which case the vocab layer wouldn't need any matmul in either the forward or the backward. I don't know the internal implementation of the embedding, though. If so, my original PR has the correct formula.
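
For concreteness, here is a minimal sketch of the bookkeeping behind this formula, reading v as vocab size, d as model dimension, and l, h, q, t as layers, heads, head dim, and sequence length (my reading of the symbols above; the function and flag names are illustrative, not from the PR). It covers both the matmul-backward case (the formula above) and the lookup-backward case from the edit:

```python
def flop_per_token(num_params, v, d, l, h, q, t, embedding_backward_is_matmul=True):
    """Per-token training FLOPs: 6 per parameter (2 forward + 4 backward),
    plus the attention term proposed above, minus the embedding corrections."""
    flops = 6 * num_params + 7 * l * h * q * t
    # Embedding forward is a table lookup: drop its 2 * v * d forward FLOPs.
    flops -= 2 * v * d
    if embedding_backward_is_matmul:
        # Weight gradient is a matmul (2 * v * d kept), but no input gradient
        # is needed, so drop 2 of the usual 4 backward FLOPs per parameter.
        flops -= 2 * v * d
    else:
        # Backward is also a lookup/scatter-add: drop all 4 * v * d backward
        # FLOPs, i.e. the vocab layer is excluded entirely (the original PR).
        flops -= 4 * v * d
    return flops
```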


ad8e commented Apr 26, 2024

https://discuss.pytorch.org/t/how-does-backward-work-for-embeddingbag/103342

Embedding backward uses the lookup table, as expected. So the vocab layer should be omitted entirely from FLOPS.
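
As a quick sanity check (a minimal standalone snippet using the public PyTorch API, not code from this PR): the weight gradient of nn.Embedding is just the output gradient rows accumulated at the looked-up indices, i.e. an index-add rather than a matmul.

```python
import torch
import torch.nn as nn

# nn.Embedding's backward scatters grad_output rows into weight.grad at the
# looked-up indices (an index_add), so it performs no matmul FLOPs.
emb = nn.Embedding(num_embeddings=10, embedding_dim=4)
tokens = torch.tensor([1, 3, 3])
emb(tokens).sum().backward()

# Reproduce the weight gradient by hand with an index_add of the upstream
# gradient (all ones here, since we summed the output).
expected = torch.zeros_like(emb.weight)
expected.index_add_(0, tokens, torch.ones(3, 4))
assert torch.allclose(emb.weight.grad, expected)
```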

@tianyu-l
Contributor

> https://discuss.pytorch.org/t/how-does-backward-work-for-embeddingbag/103342
>
> Embedding backward uses the lookup table, as expected. So the vocab layer should be omitted entirely from FLOPS.

That makes sense! I've sent PR #280 to address this. The flash attention part is trickier (e.g., the extra matmul recomputation in the backward pass shouldn't count toward MFU, and it's debatable whether to account for the sparsity introduced by causal attention, since hardware treats sparsity differently), so I'm keeping the factor of 12.
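
Roughly, the resulting per-token FLOP formula then looks like the sketch below (my paraphrase of the approach, not the exact code in #280), where the parameter count excludes the vocab embedding and the attention term keeps the factor of 12:

```python
def flop_per_token(num_params_excl_embedding, n_layers, n_heads, head_dim, seq_len):
    # 6 = 2 (forward) + 4 (backward) FLOPs per parameter per token.
    # 12 * l * h * q * t covers the attention score/value matmuls, without
    # counting flash-attention recomputation or causal-attention sparsity.
    return (6 * num_params_excl_embedding
            + 12 * n_layers * n_heads * head_dim * seq_len)
```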


ad8e commented Apr 27, 2024

tianyu's PR is better and has a CLA attached, so closing this in favor of the referenced PR.

@ad8e ad8e closed this Apr 27, 2024
@ad8e ad8e deleted the patch-1 branch April 27, 2024 01:04
@awgu awgu mentioned this pull request Jul 11, 2024