[DSV3] Adding 16B model training config, Enable FSDP and AC on DSV3-16B model #1330
Conversation
As titled: to save some H100 resources and avoid long waits, only run the integration test when the PR's base branch is main. No need to run H100 tests on PRs like #1330.
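For context, a sketch of how such gating could look in a GitHub Actions workflow; the workflow name and trigger below are illustrative, not the repo's actual CI file:

```yaml
# Illustrative only: restrict the H100 integration test to PRs targeting main.
name: integration_test_8gpu_h100  # hypothetical workflow name
on:
  pull_request:
    branches: [main]  # PRs whose base branch is not main will not trigger this
```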
Need to use a more realistic config, but we can revisit later.
H-Huang left a comment:
LGTM
```python
):
    # TODO: Add support for parallelizing the model; this is a placeholder function for now
    if job_config.activation_checkpoint.mode != "none":
        apply_ac(model, job_config.activation_checkpoint)
```
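For reference, a minimal sketch of what a full-AC `apply_ac` could do, assuming a torchtitan-style model where `model.layers` holds the transformer blocks (`apply_full_ac` is a made-up name, not the repo's actual implementation):

```python
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    checkpoint_wrapper,
)


def apply_full_ac(model: nn.Module) -> None:
    # Wrap each transformer block so its activations are recomputed in
    # backward instead of being kept alive through forward.
    for layer_id, block in model.layers.named_children():
        model.layers.register_module(
            layer_id, checkpoint_wrapper(block, preserve_rng_state=False)
        )
```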
My understanding is that for SAC we count the number of matmuls occurring during forward, then selectively save, say, every N-th matmul (see the counting-policy sketch below).
MoE might affect this in two ways:
- matmul imbalance: the gating/routing computation is lightweight, while the expert matmuls are heavy
- I'm not sure how this interacts with expert parallelism across multiple ranks

I'm not sure if we cover this in Llama4, any ideas @tianyu-l? Anyway, if SAC isn't covered I don't think it's that high priority, but maybe just add a comment.
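To make the counting concrete, here's a minimal sketch of an every-N-th-matmul SAC policy using PyTorch's selective-checkpoint API (`torch.utils.checkpoint`, 2.4+); `EveryNthMatmulPolicy` and `checkpoint_block` are made-up names, not torchtitan's actual policy:

```python
from functools import partial

import torch
from torch.utils.checkpoint import (
    CheckpointPolicy,
    checkpoint,
    create_selective_checkpoint_contexts,
)


class EveryNthMatmulPolicy:
    """Save the output of every N-th matmul seen; recompute everything else."""

    def __init__(self, n: int = 2):
        self.n = n
        self.counts = {"forward": 0, "recompute": 0}  # count per phase

    def __call__(self, ctx, op, *args, **kwargs):
        if op == torch.ops.aten.mm.default:
            phase = "recompute" if ctx.is_recompute else "forward"
            self.counts[phase] += 1
            if self.counts[phase] % self.n == 0:
                return CheckpointPolicy.MUST_SAVE
        return CheckpointPolicy.PREFER_RECOMPUTE


def checkpoint_block(block_fn, *inputs, n: int = 2):
    # Fresh policy per call so the matmul counter restarts each forward.
    context_fn = partial(create_selective_checkpoint_contexts, EveryNthMatmulPolicy(n))
    return checkpoint(block_fn, *inputs, use_reentrant=False, context_fn=context_fn)
```

Note that the counter ticks for the tiny router matmul and the heavy expert matmuls alike, which is exactly the imbalance flagged above.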
That's a great point I'd missed! Let me note this down and see how to resolve it. If we can identify the router/gating matmuls, we can just ignore them in AC.
SAC per layer should still be more or less useful.
I only tested full AC, not SAC. If we agree we won't support SAC, I can add a comment.
```diff
  # shape (bs*slen*top_k, dim)
  routed_output = self.experts(routed_input, num_local_tokens_per_expert)
- routed_output = routed_output * top_scores.unsqueeze(-1)
+ routed_output = (routed_output.to(torch.float32) * top_scores.unsqueeze(-1)).to(
+     x.dtype
+ )
```
Just curious: how come this is needed?
Router computation is in fp32, so `top_scores` is in fp32.
This step does the score × activation multiplication in high precision, then casts back.
Router precision in MoE seems critical for training stability.
After applying FSDP, `routed_output` at line 309 is bf16, while `top_scores` is float32. If we don't explicitly convert the dtype, `routed_output = routed_output * top_scores` at line 310 will have dtype float32 (auto-promoted to the higher precision). Then in

```python
out = out.scatter_add(dim=0, index=token_indices, src=routed_output)
```

`out` is bf16, since we applied FSDP, and `src` must match its dtype. So I added this explicit dtype conversion, following Llama4.
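A minimal standalone sketch of the promotion behavior described above, with made-up shapes:

```python
import torch

routed_output = torch.randn(8, 16, dtype=torch.bfloat16)  # expert output under FSDP mixed precision
top_scores = torch.rand(8, dtype=torch.float32)           # router scores stay in fp32

# bf16 * fp32 auto-promotes to fp32:
assert (routed_output * top_scores.unsqueeze(-1)).dtype == torch.float32

# scatter_add requires src to match out's dtype (bf16 here), so do the
# multiply in fp32 for stability, then cast back explicitly:
routed_output = (routed_output.to(torch.float32) * top_scores.unsqueeze(-1)).to(
    torch.bfloat16
)

out = torch.zeros(4, 16, dtype=torch.bfloat16)
token_indices = torch.randint(0, 4, (8, 1)).expand(-1, 16)
out = out.scatter_add(dim=0, index=token_indices, src=routed_output)
assert out.dtype == torch.bfloat16
```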
Thanks for the explanations!!
…(#1331) As titled: to save some H100 resources and avoid long waits, only run the integration test when the PR's base branch is main. No need to run H100 tests on PRs like pytorch#1330.
…6B model (pytorch#1330)

## Context
1. Introduced a basic DSV3-16B model training config
2. Enabled FSDP/HSDP on DSV3-16B model training

## Performance
The current profiler trace looks like this: the `to_copy` takes too long and needs to be optimized. The copy comes from the dtype conversion in class `MoE()`:

```
routed_output = (routed_output.to(torch.float32) * top_scores.unsqueeze(-1)).to(x.dtype)
```

With FSDP only:

<img width="1544" alt="Screenshot 2025-06-23 at 2 10 20 PM" src="https://github.com/user-attachments/assets/bcd698dc-3899-46e0-ae53-e7f8b0db13fc" />
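For reference, a hedged sketch of how one might measure that cast with `torch.profiler` (toy shapes, CUDA assumed); the conversion shows up as `aten::_to_copy` in the op table:

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Toy stand-in for the MoE score-scaling pattern above.
x = torch.randn(8192, 2048, device="cuda", dtype=torch.bfloat16)
scores = torch.rand(8192, 1, device="cuda", dtype=torch.float32)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = (x.to(torch.float32) * scores).to(torch.bfloat16)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```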