Commit 90eb536

Update README to reflect 1.8 release. (#2812)
1 parent: ea15839

1 file changed

README.md

Lines changed: 14 additions & 14 deletions
@@ -50,7 +50,7 @@ Cloud TPU:
 * [Training FairSeq Transformer on Cloud TPUs](https://cloud.google.com/tpu/docs/tutorials/transformer-pytorch)
 * [Training Resnet50 on Cloud TPUs](https://cloud.google.com/tpu/docs/tutorials/resnet-pytorch)
 
-To start, [you create a Cloud TPU node](https://cloud.google.com/tpu/docs/tutorials/resnet-alpha-py#create_tpu) with the corresponding release you wish to consume (TPU software version: ex. `pytorch-1.7`):
+To start, [you create a Cloud TPU node](https://cloud.google.com/tpu/docs/tutorials/resnet-alpha-py#create_tpu) with the corresponding release you wish to consume (TPU software version: ex. `pytorch-1.8`):
 
 Once you've created a Cloud TPU node, you can train your PyTorch models by either:
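
For context, the TPU software version is chosen when the node is created. A minimal sketch of the matching `gcloud` call (the node name, zone, and accelerator type here are illustrative assumptions, not part of this commit):

```Shell
# Hypothetical example: create a TPU node pinned to the pytorch-1.8 runtime.
# "my-tpu-node", the zone, and the accelerator type are placeholders.
$ gcloud compute tpus create my-tpu-node \
    --zone=us-central1-b \
    --accelerator-type=v3-8 \
    --version=pytorch-1.8
```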

@@ -70,7 +70,7 @@ Follow these steps to train a PyTorch model with Docker on a Cloud TPU:
 
 2. SSH into the VM and pull a version of the docker image into the VM. The currently available versions are:
 
-   * `gcr.io/tpu-pytorch/xla:r1.7`: The current stable version.
+   * `gcr.io/tpu-pytorch/xla:r1.8`: The current stable version.
    * `gcr.io/tpu-pytorch/xla:nightly_3.6`: Nightly version using Python 3.6.
    * `gcr.io/tpu-pytorch/xla:nightly_3.7`: Nightly version using Python 3.7.
    * `gcr.io/tpu-pytorch/xla:nightly_3.6_YYYYMMDD (e.g.: gcr.io/tpu-pytorch/xla:nightly_3.6_20190531)`: The nightly version of the given day. You can replace `3.6` with `3.7` if desired.
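
As a usage note, the date-pinned tags follow the `nightly_<python-version>_YYYYMMDD` pattern, so pulling the example tag from the list above would look like:

```Shell
# Pull the date-pinned nightly image named as the example in the list above.
(vm)$ docker pull gcr.io/tpu-pytorch/xla:nightly_3.6_20190531
```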
@@ -89,19 +89,19 @@ Follow these steps to train a PyTorch model with Docker on a Cloud TPU:
    ```
 
    ```Shell
-   (vm)$ docker pull gcr.io/tpu-pytorch/xla:r1.7
+   (vm)$ docker pull gcr.io/tpu-pytorch/xla:r1.8
    ```
 
 3. Where `$TPU_IP_ADDRESS` (e.g.: `10.1.1.2`) is your TPU Internal IP displayed in GCP UI, after pulling the docker image you can either:
 
    * Run the container with a single command:
      ```Shell
-     (vm)$ docker run --shm-size 16G -e XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470" gcr.io/tpu-pytorch/xla:r1.7 python /pytorch/xla/test/test_train_mp_mnist.py
+     (vm)$ docker run --shm-size 16G -e XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470" gcr.io/tpu-pytorch/xla:r1.8 python /pytorch/xla/test/test_train_mp_mnist.py
      ```
 
    * Run the script in an interactive shell:
      ```Shell
-     (vm)$ docker run -it --shm-size 16G gcr.io/tpu-pytorch/xla:r1.7
+     (vm)$ docker run -it --shm-size 16G gcr.io/tpu-pytorch/xla:r1.8
      (pytorch) root@CONTAINERID:/$ export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
      (pytorch) root@CONTAINERID:/$ python pytorch/xla/test/test_train_mp_mnist.py
      ```
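
For clarity, the `XRT_TPU_CONFIG` value passed above appears to follow a `<worker name>;<task ordinal>;<host:port>` pattern. A minimal sketch of wiring it up on the VM, using the placeholder IP from the text:

```Shell
# Hypothetical setup: substitute the internal IP shown for your TPU node in the GCP UI.
(vm)$ export TPU_IP_ADDRESS=10.1.1.2
(vm)$ export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
```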
@@ -121,21 +121,21 @@ Follow these steps to train a PyTorch model with a VM Image on a Cloud TPU:
    * Click **Create** to create the instance.
 
 
-2. SSH into VM and activate the conda environment you wish to use. Each release (e.g.: `1.6`, `1.7`, `nightly`) is a separate conda environment.
+2. SSH into VM and activate the conda environment you wish to use. Each release (e.g.: `1.7`, `1.8`, `nightly`) is a separate conda environment.
 
    ```Shell
    (vm)$ export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
    (vm)$ conda env list
    # conda environments:
    #
    base                  *  /anaconda3
-   torch-xla-1.6            /anaconda3/envs/torch-xla-1.6
    torch-xla-1.7            /anaconda3/envs/torch-xla-1.7
+   torch-xla-1.8            /anaconda3/envs/torch-xla-1.8
    torch-xla-nightly        /anaconda3/envs/torch-xla-nightly
 
-   (vm)$ conda activate torch-xla-1.7
-   (torch-xla-1.7)$ cd /usr/share/torch-xla-1.7/pytorch/xla
-   (torch-xla-1.7)$ python test/test_train_mp_mnist.py
+   (vm)$ conda activate torch-xla-1.8
+   (torch-xla-1.8)$ cd /usr/share/torch-xla-1.8/pytorch/xla
+   (torch-xla-1.8)$ python test/test_train_mp_mnist.py
    ```
 
 To update the wheels `torch` and `torch_xla` to the latest nightly
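
As an optional sanity check (an illustration, not part of the README), a one-liner can confirm that the activated environment can reach an XLA device:

```Shell
# Hypothetical check: prints a device string such as xla:1 when torch_xla is healthy.
(torch-xla-1.8)$ python -c "import torch_xla.core.xla_model as xm; print(xm.xla_device())"
```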
@@ -188,19 +188,19 @@ Training on pods can be broken down to largely 3 different steps:
 2. Let's say the command you ran to run a v3-8 was: `XLA_USE_BF16=1 python test/test_train_mp_imagenet.py --fake_data`.
    * To distribute training as a conda environment process:
      ```
-     (torch-xla-1.7)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --conda-env=torch-xla-1.7 --env=XLA_USE_BF16=1 -- python /usr/share/torch-xla-1.7/pytorch/xla/test/test_train_mp_imagenet.py --fake_data
+     (torch-xla-1.8)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --conda-env=torch-xla-1.8 --env=XLA_USE_BF16=1 -- python /usr/share/torch-xla-1.8/pytorch/xla/test/test_train_mp_imagenet.py --fake_data
      ```
 
    * Or, to distribute training as a docker container:
      ```
-     (torch-xla-1.7)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --docker-image=gcr.io/tpu-pytorch/xla:r1.7 --docker-run-flag=--rm=true --docker-run-flag=--shm-size=50GB --env=XLA_USE_BF16=1 -- python /pytorch/xla/test/test_train_mp_imagenet.py --fake_data
+     (torch-xla-1.8)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --docker-image=gcr.io/tpu-pytorch/xla:r1.8 --docker-run-flag=--rm=true --docker-run-flag=--shm-size=50GB --env=XLA_USE_BF16=1 -- python /pytorch/xla/test/test_train_mp_imagenet.py --fake_data
      ```
 
 ### List of VMs
 If you prefer to not use an [instance group](#create-your-instance-group), you can decide to use a list of VM instances that you may have already created (or can create individually). Make sure that you create all the VM instances in the same zone as the TPU node, and also make sure that the VMs have the same configuration (datasets, VM size, disk size, etc.). Then you can [start distributed training](#start-distributed-training) after creating your TPU pod. The difference is in the `python -m torch_xla.distributed.xla_dist` command. For example, to use a list of VMs run the following command (ex. conda with v3-32):
 ```
-(torch-xla-1.7)$ cd /usr/share/torch-xla-1.7/pytorch/xla
-(torch-xla-1.7)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --vm $VM1 --vm $VM2 --vm $VM3 --vm $VM4 --conda-env=torch-xla-1.7 --env=XLA_USE_BF16=1 -- python test/test_train_mp_imagenet.py --fake_data
+(torch-xla-1.8)$ cd /usr/share/torch-xla-1.8/pytorch/xla
+(torch-xla-1.8)$ python -m torch_xla.distributed.xla_dist --tpu=$TPU_POD_NAME --vm $VM1 --vm $VM2 --vm $VM3 --vm $VM4 --conda-env=torch-xla-1.8 --env=XLA_USE_BF16=1 -- python test/test_train_mp_imagenet.py --fake_data
 ```
 
 ### Datasets for distributed training
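
The `xla_dist` invocations above (both the instance-group and list-of-VMs variants) assume `$TPU_POD_NAME` already refers to the TPU pod created earlier; for illustration (the pod name is a placeholder assumption):

```Shell
# Hypothetical: export the name of the TPU pod so xla_dist can find its workers.
(vm)$ export TPU_POD_NAME=my-tpu-pod
```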
