@lessw2020
Contributor

This PR:
1 - Adds multi-node training support via a multinode_trainer.slurm file. Verified with Llama 7B on 20 nodes / 160 A100s.
2 - Fixes a race condition in profiling that can occur in larger-scale training: process 1 checks for the trace dir and finds it missing, another process (process 2) creates the directory in the interim, and process 1 then errors out when it tries to create the dir because the dir already exists.
The fix is simple: add exist_ok=True to both directory-creation calls (dump folder, trace folder).
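
The interleaving described in item 2 can be replayed in a single process. The sketch below is illustrative only and is not the code from this PR: the helper name `replay_race` and the directory names are hypothetical, and it assumes the directories are created with Python's `os.makedirs`, which accepts the `exist_ok` parameter.

```python
import os
import tempfile


def replay_race(path: str, exist_ok: bool) -> str:
    """Replay the check-then-create interleaving sequentially (hypothetical helper)."""
    # Process 1 checks for the directory and finds it missing.
    missing_for_p1 = not os.path.exists(path)
    # Process 2 creates the directory in the interim.
    os.makedirs(path, exist_ok=exist_ok)
    # Process 1 now acts on its stale check and tries to create the same directory.
    if missing_for_p1:
        try:
            os.makedirs(path, exist_ok=exist_ok)
        except FileExistsError:
            return "FileExistsError"
    return "ok"


with tempfile.TemporaryDirectory() as root:
    # Without exist_ok, the second creation attempt fails.
    before_fix = replay_race(os.path.join(root, "traces_a"), exist_ok=False)
    # With exist_ok=True, the second attempt is a no-op.
    after_fix = replay_race(os.path.join(root, "traces_b"), exist_ok=True)

print(before_fix)
print(after_fix)
```

With `exist_ok=True` the existence check becomes unnecessary, since every process can call `os.makedirs` unconditionally and the call succeeds whether or not another process got there first.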

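Screenshot 2024-02-15 at 10 53 18 PM: https://github.com/pytorch-labs/torchtrain/assets/46302957/20378637-4adb-425b-91d8-7fd36289d3b5
Screenshot 2024-02-15 at 10 55 02 PM: https://github.com/pytorch-labs/torchtrain/assets/46302957/28658614-cff6-42b5-ab57-bac578393d5c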

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 16, 2024
Collaborator

@wanchaol wanchaol left a comment

Awesome!!

@lessw2020 lessw2020 merged commit 70be86e into pytorch:main Feb 22, 2024
@lessw2020 lessw2020 deleted the expand_multi_node branch February 22, 2024 18:31
lessw2020 added a commit that referenced this pull request Apr 18, 2024
…ix (#63)
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
…ix (pytorch#63)