SimpleFSDP consists of two major components: (1) frontend composability with different parallelisms and distributed-training techniques; (2) backend optimizations in torch.compile to overlap communication with computation.
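The frontend idea is that FSDP is expressed as plain compute: each parameter is kept sharded across ranks and all-gathered before use, leaving the compiler backend to schedule and overlap the collectives. Below is a framework-free sketch of just the shard/all-gather layout, with plain Python lists standing in for tensors and a concatenation standing in for the collective; the function names are illustrative, not the actual API:

```python
import math

def shard(param, world_size, rank):
    """Split a flat parameter evenly across ranks, padding the tail so
    every rank holds an equal-size shard (mirrors a flat-parameter layout)."""
    shard_len = math.ceil(len(param) / world_size)
    padded = param + [0.0] * (shard_len * world_size - len(param))
    return padded[rank * shard_len : (rank + 1) * shard_len]

def all_gather(shards, orig_len):
    """Simulated all-gather: concatenate every rank's shard in rank order
    and drop the padding to recover the original parameter."""
    full = [x for s in shards for x in s]
    return full[:orig_len]
```

In the real implementation the all-gather is a `torch.distributed` collective issued in forward (with a reduce-scatter of gradients in backward), which is what gives torch.compile the opportunity to reorder and overlap it.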
Frontend Composability
Dense model (llama3)
- [Done] Parallelisms: TP/PP/CP
- [Done] Other techniques: distributed checkpointing / mixed-precision training / meta initialization / activation checkpointing
- [Done] Float8 training (@pianpwk): numeric differences appear in inductor mode because of the Triton kernel implementations, but we get bit-wise numeric equivalence in aot_eager.
MoE model (DSV3)
- [Done] Parallelisms: TP/EP/ETP
- [Need PoC] Activation checkpointing composability: graph breaks occur when AC is applied (related issue: SimpleFSDP AC HOP mutation issue when tracing token dispatch #1935)
- [In progress] Parallelism: PP (Interleaved 1F1B + TP): dynamic-shape errors (related issue: add support for simplefsdp+ep #1529 (comment)) @laithsakka @aorenste
- [Done] Zero2-style sharding + AC composability (related PR: [simplefsdp] fix region ac in zero2-style FSDP #1970) @ruisizhang123
Backend Optimization
Manual bucketing & reordering
- [In progress] Get results on DSV3 models & merge the PR (related PR: [SimpleFSDP] add manual bucketing pass #1881) @ruisizhang123
- [In progress] Numeric debugging recipe (@pianpwk @ezyang @yushangdi @ruisizhang123)
- [Not started] Allow users to specify module reordering positions via mode annotation (@ruisizhang123)
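The core of any bucketing pass, manual or auto, is deciding which per-parameter collectives to fuse into one larger collective while respecting execution order. A minimal sketch of that grouping step, under the assumption of a simple greedy size-cap policy (the actual pass in #1881 operates on the compiled graph and may use different criteria):

```python
def bucket_params(param_sizes, cap_bytes):
    """Greedily group parameter sizes (in bytes) into ordered buckets no
    larger than cap_bytes; each bucket would become one fused collective.
    A single oversized parameter gets a bucket of its own."""
    buckets, current, current_size = [], [], 0
    for size in param_sizes:
        # Flush the current bucket before it would exceed the cap.
        if current and current_size + size > cap_bytes:
            buckets.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        buckets.append(current)
    return buckets
```

Preserving order matters here: buckets must match the order in which the compute consumes the parameters, otherwise reordering the fused collectives for overlap can stall the first consumer.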
Auto bucketing & reordering
- [In progress] @eellison @IvanKobzarev (related PR: add auto_eager_graph_pass #1813)