
Conversation

@wconstab (Contributor) commented on Feb 24, 2024

For now this literally just runs `NGPU=4 ./run_llama_train.sh`, but I verified that it at least catches problems.

As a follow-up, we should integrate the multi-GPU (mgpu) test infra from pytorch and set up actual unit tests to run in this job.

We should probably also keep testing the `run_llama_train.sh` script and add other combinations of 2D parallelism to ensure they all keep working.
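
A minimal sketch (not part of this PR) of what such a follow-up unit test could look like on top of PyTorch's internal multi-process test infrastructure; the class name, test body, and 4-GPU requirement are illustrative assumptions, not code from this repository:

```python
# Hedged sketch: a hypothetical 4-GPU smoke test built on PyTorch's
# internal multi-process test utilities (torch.testing._internal).
# The test itself is illustrative, not part of this PR.
import torch
import torch.distributed as dist
from torch.testing._internal.common_distributed import (
    MultiProcessTestCase,
    skip_if_lt_x_gpu,
)
from torch.testing._internal.common_utils import run_tests


class FourGPUSmokeTest(MultiProcessTestCase):
    @property
    def world_size(self) -> int:
        return 4  # matches NGPU=4 in the CI job

    def setUp(self) -> None:
        super().setUp()
        self._spawn_processes()  # one process per rank

    @skip_if_lt_x_gpu(4)
    def test_allreduce_smoke(self) -> None:
        # Each spawned process joins the same process group via a file store.
        store = dist.FileStore(self.file_name, self.world_size)
        dist.init_process_group(
            "nccl", store=store, rank=self.rank, world_size=self.world_size
        )
        t = torch.ones(1, device=f"cuda:{self.rank}")
        dist.all_reduce(t)  # sums the tensor across the 4 ranks
        self.assertEqual(t.item(), self.world_size)
        dist.destroy_process_group()


if __name__ == "__main__":
    run_tests()
```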


@facebook-github-bot added the CLA Signed label on Feb 24, 2024
@gnadathur (Contributor) left a comment:
Great!

@wconstab merged commit 6e17001 into main on Feb 24, 2024
@wconstab deleted the whc/4gpu branch on February 24, 2024 at 01:10
lessw2020 pushed a commit that referenced this pull request on Apr 18, 2024
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request on Aug 17, 2024
payoto pushed a commit to graphcore-research/torchtitan-fork that referenced this pull request on Feb 7, 2025
Implement `metrics.distributed_mode`, controlling which processes report metrics. (pytorch#82)

* Implement `metrics.distributed_mode`, controlling which processes report metrics.

  On a large cluster, it is usually sufficient to have one process per node report metrics, rather than every GPU.

* Improvements following Alex's review.
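
As a hedged illustration of what such a config switch could look like, here is a minimal sketch of gating metric reporting by process; the mode names and helper function are hypothetical, not the fork's actual implementation:

```python
# Hypothetical sketch: decide whether this process should report metrics,
# based on a metrics.distributed_mode-style config value. The mode names
# ("all", "node", "rank0") are assumptions for illustration.
import os

import torch.distributed as dist


def should_report_metrics(distributed_mode: str) -> bool:
    rank = dist.get_rank() if dist.is_initialized() else 0
    if distributed_mode == "all":
        return True  # every process (GPU) reports
    if distributed_mode == "node":
        # One reporter per node: LOCAL_RANK is set by torchrun.
        return int(os.environ.get("LOCAL_RANK", "0")) == 0
    return rank == 0  # default: only global rank 0 reports
```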