Cloud TPU:

* [Training FairSeq Transformer on Cloud TPUs](https://cloud.google.com/tpu/docs/tutorials/transformer-pytorch)
* [Training Resnet50 on Cloud TPUs](https://cloud.google.com/tpu/docs/tutorials/resnet-pytorch)

To start, [create a Cloud TPU node](https://cloud.google.com/tpu/docs/tutorials/resnet-alpha-py#create_tpu) with the release you wish to consume (TPU software version, e.g. `pytorch-1.8`).

Once you've created a Cloud TPU node, you can train your PyTorch models by either:
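As a sketch of the node-creation step (the node name, zone, and accelerator type below are placeholders; consult the linked tutorial for the exact flags your setup needs), selecting the `pytorch-1.8` software version might look like:

```Shell
# Placeholder name/zone/accelerator; --version selects the TPU software release.
gcloud compute tpus create my-tpu-node \
  --zone=us-central1-a \
  --network=default \
  --accelerator-type=v3-8 \
  --version=pytorch-1.8
```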
Follow these steps to train a PyTorch model with Docker on a Cloud TPU:

2. SSH into the VM and pull a version of the docker image into the VM. The currently available versions are:

   * `gcr.io/tpu-pytorch/xla:r1.8`: The current stable version.
   * `gcr.io/tpu-pytorch/xla:nightly_3.6`: Nightly version using Python 3.6.
   * `gcr.io/tpu-pytorch/xla:nightly_3.7`: Nightly version using Python 3.7.
   * `gcr.io/tpu-pytorch/xla:nightly_3.6_YYYYMMDD` (e.g. `gcr.io/tpu-pytorch/xla:nightly_3.6_20190531`): The nightly version of the given day. You can replace `3.6` with `3.7` if desired.
   ```Shell
   (vm)$ docker pull gcr.io/tpu-pytorch/xla:r1.8
   ```

3. After pulling the docker image, where `$TPU_IP_ADDRESS` (e.g. `10.1.1.2`) is your TPU's internal IP displayed in the GCP UI, you can either:

   * Run the container with a single command:

     ```Shell
     (vm)$ docker run --shm-size 16G -e XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470" gcr.io/tpu-pytorch/xla:r1.8 python /pytorch/xla/test/test_train_mp_mnist.py
     ```

   * Run the script in an interactive shell:

     ```Shell
     (vm)$ docker run -it --shm-size 16G gcr.io/tpu-pytorch/xla:r1.8
     ```
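The `XRT_TPU_CONFIG` value passed above follows the pattern `<worker_name>;<task_number>;<host>:<port>`. As a small illustration of that format (the `build_xrt_tpu_config` helper is hypothetical, not part of `torch_xla`):

```python
def build_xrt_tpu_config(tpu_ip, worker="tpu_worker", task=0, port=8470):
    """Assemble an XRT_TPU_CONFIG string for a single-host TPU node.

    The format is "<worker_name>;<task_number>;<host>:<port>", matching
    the value exported to the docker container above.
    """
    return f"{worker};{task};{tpu_ip}:{port}"


if __name__ == "__main__":
    # With the example internal IP from the text:
    print(build_xrt_tpu_config("10.1.1.2"))  # tpu_worker;0;10.1.1.2:8470
```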
If you prefer not to use an [instance group](#create-your-instance-group), you can instead use a list of VM instances that you may have already created (or can create individually). Make sure that you create all the VM instances in the same zone as the TPU node, and that the VMs have the same configuration (datasets, VM size, disk size, etc.). Then you can [start distributed training](#start-distributed-training) after creating your TPU pod. The difference is in the `python -m torch_xla.distributed.xla_dist` command. For example, to use a list of VMs run the following command (e.g. conda with v3-32):

```
(torch-xla-1.8)$ cd /usr/share/torch-xla-1.8/pytorch/xla
```
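A sketch of the subsequent launch command, under stated assumptions: `$TPU_POD_NAME` and the VM names are placeholders, and the repeated `--vm` flag for supplying an explicit VM list is assumed from the surrounding text rather than confirmed here; verify the exact options with `python -m torch_xla.distributed.xla_dist --help`.

```Shell
# Hypothetical VM names; one --vm flag per instance in the pod's zone.
(torch-xla-1.8)$ python -m torch_xla.distributed.xla_dist \
    --tpu=$TPU_POD_NAME \
    --vm vm-1 --vm vm-2 --vm vm-3 --vm vm-4 \
    --conda-env=torch-xla-1.8 \
    -- python /usr/share/torch-xla-1.8/pytorch/xla/test/test_train_mp_mnist.py
```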