
Conversation

@wconstab (Contributor) commented on Feb 24, 2024

For now this literally just runs `NGPU=4 ./run_llama_train.sh`, but I verified that it at least catches problems.

As a follow-up, we should integrate the multi-GPU (mgpu) test infra from pytorch and set up actual unit tests to run in this job.

We should probably also keep testing the `run_llama_train.sh` script and add other combinations of 2D parallelism to ensure they all keep working.
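
A minimal sketch (not part of this PR) of what such a follow-up unit test could look like on top of PyTorch's internal multi-process test infrastructure; the class name, test body, and 4-GPU requirement are illustrative assumptions, not code from this repository:

```python
# Hedged sketch: a hypothetical 4-GPU smoke test built on PyTorch's
# internal multi-process test utilities (torch.testing._internal).
# The test itself is illustrative, not part of this PR.
import torch
import torch.distributed as dist
from torch.testing._internal.common_distributed import (
    MultiProcessTestCase,
    skip_if_lt_x_gpu,
)
from torch.testing._internal.common_utils import run_tests


class FourGPUSmokeTest(MultiProcessTestCase):
    @property
    def world_size(self) -> int:
        return 4  # matches NGPU=4 in the CI job

    def setUp(self) -> None:
        super().setUp()
        self._spawn_processes()  # one process per rank

    @skip_if_lt_x_gpu(4)
    def test_allreduce_smoke(self) -> None:
        # Each spawned process joins the same process group via a file store.
        store = dist.FileStore(self.file_name, self.world_size)
        dist.init_process_group(
            "nccl", store=store, rank=self.rank, world_size=self.world_size
        )
        t = torch.ones(1, device=f"cuda:{self.rank}")
        dist.all_reduce(t)  # sums the tensor across the 4 ranks
        self.assertEqual(t.item(), self.world_size)
        dist.destroy_process_group()


if __name__ == "__main__":
    run_tests()
```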


@facebook-github-bot added the CLA Signed label on Feb 24, 2024
@gnadathur (Contributor) left a comment:
Great!

@wconstab merged commit 6e17001 into main on Feb 24, 2024
@wconstab deleted the whc/4gpu branch on February 24, 2024 at 01:10
lessw2020 pushed a commit that referenced this pull request on Apr 18, 2024
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request on Aug 17, 2024
payoto pushed a commit to graphcore-research/torchtitan-fork that referenced this pull request on Feb 7, 2025
Implement `metrics.distributed_mode`, controlling which processes report metrics. (pytorch#82)

* Implement `metrics.distributed_mode`, controlling which processes report metrics.

  On a large cluster, it is usually sufficient to have one process per node report metrics, rather than every GPU.

* Improvements following Alex's review.
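
As a hedged illustration of what such a config switch could look like, here is a minimal sketch of gating metric reporting by process; the mode names and helper function are hypothetical, not the fork's actual implementation:

```python
# Hypothetical sketch: decide whether this process should report metrics,
# based on a metrics.distributed_mode-style config value. The mode names
# ("all", "node", "rank0") are assumptions for illustration.
import os

import torch.distributed as dist


def should_report_metrics(distributed_mode: str) -> bool:
    rank = dist.get_rank() if dist.is_initialized() else 0
    if distributed_mode == "all":
        return True  # every process (GPU) reports
    if distributed_mode == "node":
        # One reporter per node: LOCAL_RANK is set by torchrun.
        return int(os.environ.get("LOCAL_RANK", "0")) == 0
    return rank == 0  # default: only global rank 0 reports
```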