
Use ``python -m paddle.distributed.launch`` to start a distributed training job.

+ Launch is the module that runs on every node and is responsible for distributed coordination and local process management. Starting distributed training with launch simplifies argument configuration, sets up the distributed process group stably and reliably, and provides improved debugging and log collection. Advanced distributed features such as fault tolerance and elasticity also depend on launch.
+
Usage
:::::::::
.. code-block:: bash

- ``--rank``: node rank, which can be assigned by the master node. Default ``--rank=-1``.

- - ``--log_level``: log level, one of CRITICAL/ERROR/WARNING/INFO/DEBUG/NOTSET, case-insensitive. Logs of node 0 are not written to stdout by default; enable debug mode to print them. Default ``--log_level=INFO``.
+ - ``--log_level``: log level, one of CRITICAL/ERROR/WARNING/INFO/DEBUG/NOTSET, case-insensitive. Default ``--log_level=INFO``.

- ``--nnodes``: number of nodes; a range such as ``--nnodes=2:3`` enables elastic mode. Default ``--nnodes=1``.

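The flags described above compose into a single launch invocation. A minimal sketch, assuming a 2-node job; the master address, the explicit rank value, and the ``train.py`` script are purely illustrative:

.. code-block:: bash

    # illustrative 2-node job: rank assigned explicitly, verbose logging
    python -m paddle.distributed.launch --master=10.0.0.1:38714 --nnodes=2 --rank=0 --log_level=DEBUG train.py
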
@@ -91,84 +93,82 @@ Elastic parameters
.. code-block:: bash
:name: code-block-example-bash0

- # For training on multi node, run the following command in one of the nodes
+ # Run the following command on one of the nodes to launch a 2-node job

python -m paddle.distributed.launch --nnodes 2 train.py

- # Then the following info will be print
+ # At this point, the log prints the following message:

# Copy the following command to other nodes to run.
# --------------------------------------------------------------------------------
# python -m paddle.distributed.launch --master 10.0.0.1:38714 --nnodes 2 train.py
# --------------------------------------------------------------------------------

- # Follow the instruction above and paste the command in other nodes can launch a multi nodes training job.
+ # Follow the prompt and run the copied command on the other nodes to start distributed training.

- # There are two ways to launch a job with the same command for multi nodes training
- # 1) using the following command in every nodes, make sure the ip is one of the training node and the port is available on that node
+ # There are two ways to launch distributed training with the same command on every node:
+ # 1) Use a preconfigured master, where the master ip is one of the training nodes and the port is an available port on that node
# python -m paddle.distributed.launch --master 10.0.0.1:38714 --nnodes 2 train.py
- # 2) using the following command in every nodes with a independent etcd service
+ # 2) Use a separately deployed etcd service as the master
# python -m paddle.distributed.launch --master etcd://10.0.0.1:2379 --nnodes 2 train.py

- # This functionality works will for both collective and ps mode and even with other arguments.
+ # The features described above can also be combined with other arguments.


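For instance, a hedged sketch of running the same command on every node with an etcd master while also passing ps-mode arguments; this particular combination of flags is illustrative, and the canonical forms appear in the examples below:

.. code-block:: bash

    # same command on every node; etcd coordinates the nodes,
    # --server_num/--trainer_num select ps mode (illustrative values)
    python -m paddle.distributed.launch --master etcd://10.0.0.1:2379 --nnodes 2 --server_num=1 --trainer_num=2 train.py --lr=0.01
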
Code example 1 (collective, single node)
:::::::::
.. code-block:: bash
:name: code-block-example-bash1

- # For training on single node using 4 gpus.
+ # Launch a single-node job with 4 GPUs

- python -m paddle.distributed.launch --gpus=0,1,2,3 train.py --lr=0.01
+ python -m paddle.distributed.launch --devices=0,1,2,3 train.py --lr=0.01

Code example 2 (collective, multi node)
:::::::::
.. code-block:: bash
:name: code-block-example-bash2

- # The parameters of --gpus and --ips must be consistent in each node.
-
- # For training on multiple nodes, e.g., 192.168.0.16, 192.168.0.17
+ # Launch a 2-node job on machines with ip 192.168.0.16 and 192.168.0.17

# On 192.168.0.16:

- python -m paddle.distributed.launch --gpus=0,1,2,3 --ips=192.168.0.16,192.168.0.17 train.py --lr=0.01
+ python -m paddle.distributed.launch --devices=0,1,2,3 --master=192.168.0.16:8090 --nnodes=2 train.py --lr=0.01

# On 192.168.0.17:

- python -m paddle.distributed.launch --gpus=0,1,2,3 --ips=192.168.0.16,192.168.0.17 train.py --lr=0.01
+ python -m paddle.distributed.launch --devices=0,1,2,3 --master=192.168.0.16:8090 --nnodes=2 train.py --lr=0.01

Code example 3 (ps, cpu, single node)
:::::::::
.. code-block:: bash
:name: code-block-example-bash3

- # To simulate distributed environment using single node, e.g., 2 servers and 4 workers.
+ # Launch multiple servers and trainers on a single node

- python -m paddle.distributed.launch --server_num=2 --worker_num=4 train.py --lr=0.01
+ python -m paddle.distributed.launch --server_num=2 --trainer_num=4 train.py --lr=0.01

Code example 4 (ps, cpu, multi node)
:::::::::
.. code-block:: bash
:name: code-block-example-bash4

- # For training on multiple nodes, e.g., 192.168.0.16, 192.168.0.17 where each node with 1 server and 2 workers.
+ # Launch on multiple nodes, e.g., start 1 server and 2 trainers on each of 192.168.0.16 and 192.168.0.17

# On 192.168.0.16:

- python -m paddle.distributed.launch --servers="192.168.0.16:6170,192.168.0.17:6170" --workers="192.168.0.16:6171,192.168.0.16:6172,192.168.0.17:6171,192.168.0.17:6172" train.py --lr=0.01
+ python -m paddle.distributed.launch --master=192.168.0.16:8090 --nnodes=2 --server_num=1 --trainer_num=2 train.py --lr=0.01

# On 192.168.0.17:

- python -m paddle.distributed.launch --servers="192.168.0.16:6170,192.168.0.17:6170" --workers="192.168.0.16:6171,192.168.0.16:6172,192.168.0.17:6171,192.168.0.17:6172" train.py --lr=0.01
+ python -m paddle.distributed.launch --master=192.168.0.16:8090 --nnodes=2 --server_num=1 --trainer_num=2 train.py --lr=0.01

Code example 5 (ps, gpu, single node)
:::::::::
.. code-block:: bash
:name: code-block-example-bash5

- # To simulate distributed environment using single node, e.g., 2 servers and 4 workers, each worker use single gpu.
+ # When launching a gpu ps job, specify which gpus to use:

export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --server_num=2 --worker_num=4 train.py --lr=0.01
@@ -178,7 +178,7 @@ Elastic parameters
.. code-block:: bash
:name: code-block-example-bash6

- # For training on multiple nodes, e.g., 192.168.0.16, 192.168.0.17 where each node with 1 server and 2 workers.
+ # Use the following command to launch a multi-node gpu ps job

# On 192.168.0.16:

@@ -195,7 +195,7 @@ Elastic parameters
.. code-block:: bash
:name: code-block-example-bash7

- # To simulate distributed environment using single node, e.g., 2 servers and 4 workers, two workers use gpu, two workers use cpu.
+ # Use the following command to launch a single-node heter ps job

export CUDA_VISIBLE_DEVICES=0,1
python -m paddle.distributed.launch --server_num=2 --worker_num=2 --heter_worker_num=2 train.py --lr=0.01
@@ -205,7 +205,7 @@ Elastic parameters
.. code-block:: bash
:name: code-block-example-bash8

- # For training on multiple nodes, e.g., 192.168.0.16, 192.168.0.17 where each node with 1 server, 1 gpu worker, 1 cpu worker.
+ # Use the following command to launch a multi-node heter ps job

# On 192.168.0.16:

@@ -222,8 +222,8 @@ Elastic parameters
.. code-block:: bash
:name: code-block-example-bash9

- # With the following command, the job will begin to run immediately if 4 nodes are ready,
- # or it will run after elastic_timeout if only 2 or 3 nodes ready
+ # Use the following command to launch elastic training
+ # When 4 nodes are ready, training starts immediately; when only 2 or 3 nodes are ready, it starts after the timeout elapses
python -m paddle.distributed.launch --master etcd://10.0.0.1:2379 --nnodes 2:4 train.py

- # once the number of nodes changes between 2:4 during training, the strategy holds
+ # If the number of nodes changes during training, the same logic still applies.