
Commit 24c08f9

add distributed deployment (#4652)
* add distributed deployment
* update index
* review and update
* add faq
* translate launch api doc and fix doc by comments
1 parent ab7345d commit 24c08f9

File tree

3 files changed: +453 -27 lines

docs/api/paddle/distributed/launch_cn.rst

Lines changed: 27 additions & 27 deletions
@@ -7,6 +7,8 @@ launch

 Use ``python -m paddle.distributed.launch`` to start a distributed training job.

+Launch is the module that runs on every node and handles distributed coordination and local process management. Starting distributed training with launch simplifies argument configuration, sets up the distributed training network reliably, and provides optimized debugging and log collection. Advanced distributed features such as fault tolerance and elasticity also rely on launch.
+
 Usage
 :::::::::
 .. code-block:: bash
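A minimal usage sketch of the entry point the added paragraph describes; it assumes a user training script named train.py (a placeholder, as in the examples further down):

    # run train.py as a distributed job on the local node;
    # launch starts the local worker processes and collects their logs
    python -m paddle.distributed.launch train.py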
@@ -29,7 +31,7 @@ launch

 - ``--rank``: rank of the node; it can be assigned by the master node. Default ``--rank=-1``.

-- ``--log_level``: log level; valid values are CRITICAL/ERROR/WARNING/INFO/DEBUG/NOTSET, case-insensitive. Logs of node 0 are not written to stdout by default; enable debug mode to print them. Default ``--log_level=INFO``.
+- ``--log_level``: log level; valid values are CRITICAL/ERROR/WARNING/INFO/DEBUG/NOTSET, case-insensitive. Default ``--log_level=INFO``.

 - ``--nnodes``: number of nodes; a range such as ``--nnodes=2:3`` enables elastic mode. Default ``--nnodes=1``.
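A sketch combining the flags documented above, reusing the placeholder master endpoint 10.0.0.1:38714 that appears later in this file:

    # one node of a 2-node job; --rank is left at its default (-1) so the master assigns it,
    # and the log level is raised to DEBUG
    python -m paddle.distributed.launch --master=10.0.0.1:38714 --nnodes=2 --log_level=DEBUG train.py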

@@ -91,84 +93,82 @@ Elastic parameters
 .. code-block:: bash
   :name: code-block-example-bash0

-  # For multi-node training, run the following command on one of the nodes
+  # Run the following command on one of the nodes to start a 2-node job

   python -m paddle.distributed.launch --nnodes 2 train.py

-  # Then the following info will be printed
+  # The log then prints the following information:

   # Copy the following command to other nodes to run.
   # --------------------------------------------------------------------------------
   # python -m paddle.distributed.launch --master 10.0.0.1:38714 --nnodes 2 train.py
   # --------------------------------------------------------------------------------

-  # Follow the instruction above and paste the command on the other nodes to launch a multi-node training job.
+  # Following the hint, copy the command and run it on the other nodes to start distributed training.

-  # There are two ways to launch a multi-node job with the same command on every node
-  # 1) use the following command on every node, making sure the ip belongs to one of the training nodes and the port is available on that node
+  # There are two ways to start distributed training by running the same command on every node:
+  # 1) use a preconfigured master, where the master ip is one of the training nodes and the port is an available port on that node
   # python -m paddle.distributed.launch --master 10.0.0.1:38714 --nnodes 2 train.py
-  # 2) use the following command on every node with an independently deployed etcd service
+  # 2) use a separately deployed etcd service as the master
   # python -m paddle.distributed.launch --master etcd://10.0.0.1:2379 --nnodes 2 train.py

-  # This functionality works well for both collective and ps modes, even with other arguments.
+  # The features above can be used together with the other arguments.


 Code example 1 (collective, single node)
 :::::::::
 .. code-block:: bash
   :name: code-block-example-bash1

-  # Start a single-node training job using 4 gpus.
+  # Start a single-node job with 4 devices

-  python -m paddle.distributed.launch --gpus=0,1,2,3 train.py --lr=0.01
+  python -m paddle.distributed.launch --devices=0,1,2,3 train.py --lr=0.01

 Code example 2 (collective, multi node)
 :::::::::
 .. code-block:: bash
   :name: code-block-example-bash2

-  # The --gpus and --ips arguments must be consistent on every node.
-
-  # For training on multiple nodes, e.g. 192.168.0.16 and 192.168.0.17
+  # Start a two-node job where the node ips are 192.168.0.16 and 192.168.0.17

   # On 192.168.0.16:

-  python -m paddle.distributed.launch --gpus=0,1,2,3 --ips=192.168.0.16,192.168.0.17 train.py --lr=0.01
+  python -m paddle.distributed.launch --devices=0,1,2,3 --master=192.168.0.16:8090 --nnodes=2 train.py --lr=0.01

   # On 192.168.0.17:

-  python -m paddle.distributed.launch --gpus=0,1,2,3 --ips=192.168.0.16,192.168.0.17 train.py --lr=0.01
+  python -m paddle.distributed.launch --devices=0,1,2,3 --master=192.168.0.16:8090 --nnodes=2 train.py --lr=0.01

 Code example 3 (ps, cpu, single node)
 :::::::::
 .. code-block:: bash
   :name: code-block-example-bash3

-  # Simulate a distributed environment on a single node, e.g. 2 servers and 4 workers.
+  # Start multiple servers and trainers on a single node

-  python -m paddle.distributed.launch --server_num=2 --worker_num=4 train.py --lr=0.01
+  python -m paddle.distributed.launch --server_num=2 --trainer_num=4 train.py --lr=0.01

 Code example 4 (ps, cpu, multi node)
 :::::::::
 .. code-block:: bash
   :name: code-block-example-bash4

-  # For training on multiple nodes, e.g. 192.168.0.16 and 192.168.0.17, with 1 server and 2 workers on each node.
+  # Start on multiple nodes, e.g. 1 server and 2 trainers on each of 192.168.0.16 and 192.168.0.17

   # On 192.168.0.16:

-  python -m paddle.distributed.launch --servers="192.168.0.16:6170,192.168.0.17:6170" --workers="192.168.0.16:6171,192.168.0.16:6172,192.168.0.17:6171,192.168.0.17:6172" train.py --lr=0.01
+  python -m paddle.distributed.launch --master=192.168.0.16:8090 --nnodes=2 --server_num=1 --trainer_num=2 train.py --lr=0.01

   # On 192.168.0.17:

-  python -m paddle.distributed.launch --servers="192.168.0.16:6170,192.168.0.17:6170" --workers="192.168.0.16:6171,192.168.0.16:6172,192.168.0.17:6171,192.168.0.17:6172" train.py --lr=0.01
+  python -m paddle.distributed.launch --master=192.168.0.16:8090 --nnodes=2 --server_num=1 --trainer_num=2 train.py --lr=0.01

 Code example 5 (ps, gpu, single node)
 :::::::::
 .. code-block:: bash
   :name: code-block-example-bash5

-  # Simulate a distributed environment on a single node, e.g. 2 servers and 4 workers, each worker using a single gpu.
+  # When starting a gpu ps job, the gpus to use must be specified

   export CUDA_VISIBLE_DEVICES=0,1,2,3
   python -m paddle.distributed.launch --server_num=2 --worker_num=4 train.py --lr=0.01
@@ -178,7 +178,7 @@ Elastic parameters
 .. code-block:: bash
   :name: code-block-example-bash6

-  # For training on multiple nodes, e.g. 192.168.0.16 and 192.168.0.17, with 1 server and 2 workers on each node.
+  # Use the following command to start a multi-node gpu ps job

   # On 192.168.0.16:
@@ -195,7 +195,7 @@ Elastic parameters
 .. code-block:: bash
   :name: code-block-example-bash7

-  # Simulate a distributed environment on a single node, e.g. 2 servers and 4 workers, two workers using gpu and two using cpu.
+  # Use the following command to start a single-node heter ps job

   export CUDA_VISIBLE_DEVICES=0,1
   python -m paddle.distributed.launch --server_num=2 --worker_num=2 --heter_worker_num=2 train.py --lr=0.01
@@ -205,7 +205,7 @@ Elastic parameters
 .. code-block:: bash
   :name: code-block-example-bash8

-  # For training on multiple nodes, e.g. 192.168.0.16 and 192.168.0.17, with 1 server, 1 gpu worker and 1 cpu worker on each node.
+  # Use the following command to start a multi-node heter ps job

   # On 192.168.0.16:
@@ -222,8 +222,8 @@ Elastic parameters
 .. code-block:: bash
   :name: code-block-example-bash9

-  # With the following command, the job starts immediately if 4 nodes are ready,
-  # or it starts after elastic_timeout if only 2 or 3 nodes are ready
+  # Use the following command to start elastic training
+  # When 4 nodes are ready, training starts immediately; when only 2 or 3 nodes are ready, it starts after the timeout

   python -m paddle.distributed.launch --master etcd://10.0.0.1:2379 --nnodes 2:4 train.py

-  # once the number of nodes changes within the 2:4 range during training, the strategy holds
+  # If the number of nodes changes during training, the same logic applies.
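To recap the command-line migration this diff applies to the collective multi-node example, a sketch contrasting the old and new invocations (the ips, port and train.py are the placeholders already used above):

    # before: device and peer lists repeated on every node
    python -m paddle.distributed.launch --gpus=0,1,2,3 --ips=192.168.0.16,192.168.0.17 train.py --lr=0.01

    # after: devices plus a master endpoint and the node count
    python -m paddle.distributed.launch --devices=0,1,2,3 --master=192.168.0.16:8090 --nnodes=2 train.py --lr=0.01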
