Commit ab75dbd

Add 4GPU unit test (#82)
For now this literally just runs `NGPU=4 ./run_llama_train.sh`, but I verified that it at least catches problems. As a follow-up, we should integrate the mgpu test infra from pytorch and set up actual unit tests to run in this job. We should probably also keep testing the run_llama_train.sh script and add other combinations of 2D parallelism to ensure they all keep working.
<img width="2120" alt="image" src="https://github.com/pytorch/torchtrain/assets/4984825/2c235e9a-04ed-4f2d-9915-67de39d78e1c">
1 parent bccad90 commit ab75dbd

2 files changed: +45 -2 lines changed

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+name: 4 GPU Unit Test
+
+on:
+  push:
+    branches: [ main ]
+  pull_request:
+
+concurrency:
+  group: unit-test${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
+  cancel-in-progress: true
+
+defaults:
+  run:
+    shell: bash -l -eo pipefail {0}
+
+jobs:
+  unit_tests_4gpu:
+    runs-on: linux.g5.12xlarge.nvidia.gpu
+    strategy:
+      matrix:
+        python-version: ['3.10']
+    steps:
+      - name: Check out repo
+        uses: actions/checkout@v3
+      - name: Setup conda env
+        uses: conda-incubator/setup-miniconda@v2
+        with:
+          auto-update-conda: true
+          miniconda-version: "latest"
+          activate-environment: test
+          python-version: ${{ matrix.python-version }}
+      - name: Update pip
+        run: python -m pip install --upgrade pip
+      - name: Install dependencies
+        run: |
+          pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
+          python -m pip install -r requirements.txt
+          python -m pip install -r dev-requirements.txt
+          python -m pip install -e .
+      - name: Run NGPU=4 ./run_llama_train.sh
+        run: NGPU=4 ./run_llama_train.sh
+      - name: Upload Coverage to Codecov
+        uses: codecov/codecov-action@v3
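
A note on the `concurrency.group` expression in this workflow: it uses the `cond && a || b` idiom, which behaves like a ternary in GitHub Actions expressions as long as `a` is truthy. The Python sketch below only models that string construction so the cancellation behavior is easier to see. The function name and the run numbers are hypothetical, chosen for illustration; the PR ref format `refs/pull/82/merge` assumes this workflow is triggered by pull request #82.

```python
def concurrency_group(workflow: str, ref: str, run_number: int) -> str:
    """Model of the workflow's `concurrency.group` string (illustrative only)."""
    # `ref == main && run_number || ref` picks run_number on main and the
    # ref everywhere else; run numbers start at 1, so they are always truthy.
    suffix = run_number if ref == "refs/heads/main" else ref
    return f"unit-test{workflow}-{suffix}"


# Pushes to main: every run gets a unique group, so no run is cancelled.
print(concurrency_group("4 GPU Unit Test", "refs/heads/main", 17))

# Pull requests: runs for the same ref share one group, so with
# `cancel-in-progress: true` a new push cancels the run still in progress.
print(concurrency_group("4 GPU Unit Test", "refs/pull/82/merge", 18))
```

The practical effect is that the 4 GPU runner is not tied up by stale pull request runs, while every commit that lands on main is still tested.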

.github/workflows/unit_test.yaml renamed to .github/workflows/unit_test_cpu.yaml

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-name: Unit Test
+name: CPU Unit Test
 
 on:
   push:
@@ -14,7 +14,7 @@ defaults:
     shell: bash -l -eo pipefail {0}
 
 jobs:
-  unit_tests:
+  cpu_unit_tests:
     runs-on: ubuntu-latest
     strategy:
       matrix:
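
Both workflows run their steps with `shell: bash -l -eo pipefail {0}`, which is what lets a failing training script fail the job. The sketch below, which assumes `bash` is on the `PATH`, shows the effect of the `-e` and `-o pipefail` flags on exit status. The helper name is hypothetical, and the `-l` (login shell) flag is left out so the result does not depend on local profile files.

```python
import subprocess


def bash_exit_code(script: str, *flags: str) -> int:
    """Run `script` under bash with the given flags and return its exit status."""
    return subprocess.run(["bash", *flags, "-c", script]).returncode


# A pipeline whose first command fails and whose last command succeeds.
pipeline = "false | true"
print(bash_exit_code(pipeline))                    # status of the last command
print(bash_exit_code(pipeline, "-o", "pipefail"))  # failure in the pipeline is reported

# With -e, bash exits at the first failing command instead of continuing.
sequence = "false; exit 0"
print(bash_exit_code(sequence))
print(bash_exit_code(sequence, "-e"))
```

Without these flags, a step could report success even though an earlier command or pipeline stage had failed.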
