
Commit d793862

update
1 parent 84903cf commit d793862

2 files changed: 52 additions, 62 deletions

docs/guides/06_distributed_training/cluster_quick_start_ps_cn.rst

Lines changed: 51 additions & 62 deletions
@@ -15,12 +15,11 @@
 
 The parameter server mode is therefore well suited to training scenarios that need to store extremely large numbers of model parameters, and it is commonly used to train search and recommendation models with massive sparse parameters.
 
-This section uses wide_and_deep, a classic model in the recommendation field, as an example of how to run a parameter server training job with Paddle's distributed framework. The complete example code for this quick start is located at https://github.com/PaddlePaddle/FleetX/tree/develop/examples/wide_and_deep_dataset.
-
-2.1 Version Requirements
+2.1 Task Introduction
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Before writing a distributed training program, make sure that paddlepaddle-2.0.0-rc-cpu or paddlepaddle-2.0.0-rc-gpu (or a later version) of the PaddlePaddle open-source framework is installed.
+This section uses wide_and_deep, a classic model in the recommendation field, as an example of how to run a parameter server training job with Paddle's distributed framework. The complete example code for this quick start is located at https://github.com/PaddlePaddle/FleetX/tree/develop/examples/wide_and_deep_dataset.
+Before writing a distributed training program, make sure that PaddlePaddle 2.3 or a later version of the open-source framework is installed.
 
 2.2 How to Use
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
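The version requirement in the added line can be verified before launching anything. Below is a minimal, hypothetical check (not part of this commit) written in Python:

.. code-block:: python

    # Hypothetical sanity check: confirm the installed PaddlePaddle meets the
    # 2.3 requirement stated in the updated documentation.
    import paddle

    print(paddle.__version__)   # expect 2.3.0 or later
    paddle.utils.run_check()    # verifies that PaddlePaddle runs on this machine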
@@ -146,65 +145,55 @@
 
 .. code-block:: bash
 
-    fleetrun --server_num=1 --worker_num=2 train.py
+    fleetrun --server_num=1 --trainer_num=2 train.py
 
-You will see log messages like the following:
+You will see log messages like the following in the terminal where the command was run:
 
 .. code-block:: bash
 
-    ----------- Configuration Arguments -----------
-    gpus: 0,1
-    heter_worker_num: None
-    heter_workers:
-    http_port: None
-    ips: 127.0.0.1
-    log_dir: log
-    nproc_per_node: None
-    server_num: 1
-    servers:
-    training_script: train.py
-    training_script_args: []
-    worker_num: 2
-    workers:
-    ------------------------------------------------
-    INFO 2021-05-06 12:14:26,890 launch.py:298] Run parameter-sever mode. pserver arguments:['--worker_num', '--server_num'], cuda count:8
-    INFO 2021-05-06 12:14:26,892 launch_utils.py:973] Local server start 1 processes. First process distributed environment info (Only For Debug):
-    +=======================================================================================+
-    |                 Distributed Envs                    Value                              |
-    +---------------------------------------------------------------------------------------+
-    |                 PADDLE_TRAINERS_NUM                 2                                  |
-    |                 TRAINING_ROLE                       PSERVER                            |
-    |                 POD_IP                              127.0.0.1                          |
-    |                 PADDLE_GLOO_RENDEZVOUS              3                                  |
-    |                 PADDLE_PSERVERS_IP_PORT_LIST        127.0.0.1:34008                    |
-    |                 PADDLE_PORT                         34008                              |
-    |                 PADDLE_WITH_GLOO                    0                                  |
-    |                 PADDLE_HETER_TRAINER_IP_PORT_LIST                                      |
-    |                 PADDLE_TRAINER_ENDPOINTS            127.0.0.1:18913,127.0.0.1:10025    |
-    |                 PADDLE_GLOO_HTTP_ENDPOINT           127.0.0.1:23053                    |
-    |                 PADDLE_GLOO_FS_PATH                 /tmp/tmp8vqb8arq                   |
-    +=======================================================================================+
-
-    INFO 2021-05-06 12:14:26,902 launch_utils.py:1041] Local worker start 2 processes. First process distributed environment info (Only For Debug):
-    +=======================================================================================+
-    |                 Distributed Envs                    Value                              |
-    +---------------------------------------------------------------------------------------+
-    |                 PADDLE_GLOO_HTTP_ENDPOINT           127.0.0.1:23053                    |
-    |                 PADDLE_GLOO_RENDEZVOUS              3                                  |
-    |                 PADDLE_PSERVERS_IP_PORT_LIST        127.0.0.1:34008                    |
-    |                 PADDLE_WITH_GLOO                    0                                  |
-    |                 PADDLE_TRAINER_ENDPOINTS            127.0.0.1:18913,127.0.0.1:10025    |
-    |                 FLAGS_selected_gpus                 0                                  |
-    |                 PADDLE_GLOO_FS_PATH                 /tmp/tmp8vqb8arq                   |
-    |                 PADDLE_TRAINERS_NUM                 2                                  |
-    |                 TRAINING_ROLE                       TRAINER                            |
-    |                 XPU_VISIBLE_DEVICES                 0                                  |
-    |                 PADDLE_HETER_TRAINER_IP_PORT_LIST                                      |
-    |                 PADDLE_TRAINER_ID                   0                                  |
-    |                 CUDA_VISIBLE_DEVICES                0                                  |
-    |                 FLAGS_selected_xpus                 0                                  |
-    +=======================================================================================+
-
-    INFO 2021-05-06 12:14:26,921 launch_utils.py:903] Please check servers, workers and heter_worker logs in log/workerlog.*, log/serverlog.* and log/heterlog.*
-    INFO 2021-05-06 12:14:33,446 launch_utils.py:914] all workers exit, going to finish parameter server and heter_worker.
-    INFO 2021-05-06 12:14:33,446 launch_utils.py:926] all parameter server are killed
+    LAUNCH INFO 2022-05-18 11:27:17,761 -----------  Configuration  ----------------------
+    LAUNCH INFO 2022-05-18 11:27:17,761 devices: None
+    LAUNCH INFO 2022-05-18 11:27:17,761 elastic_level: -1
+    LAUNCH INFO 2022-05-18 11:27:17,761 elastic_timeout: 30
+    LAUNCH INFO 2022-05-18 11:27:17,761 gloo_port: 6767
+    LAUNCH INFO 2022-05-18 11:27:17,761 host: None
+    LAUNCH INFO 2022-05-18 11:27:17,761 job_id: default
+    LAUNCH INFO 2022-05-18 11:27:17,761 legacy: False
+    LAUNCH INFO 2022-05-18 11:27:17,761 log_dir: log
+    LAUNCH INFO 2022-05-18 11:27:17,761 log_level: INFO
+    LAUNCH INFO 2022-05-18 11:27:17,762 master: None
+    LAUNCH INFO 2022-05-18 11:27:17,762 max_restart: 3
+    LAUNCH INFO 2022-05-18 11:27:17,762 nnodes: 1
+    LAUNCH INFO 2022-05-18 11:27:17,762 nproc_per_node: None
+    LAUNCH INFO 2022-05-18 11:27:17,762 rank: -1
+    LAUNCH INFO 2022-05-18 11:27:17,762 run_mode: collective
+    LAUNCH INFO 2022-05-18 11:27:17,762 server_num: 1
+    LAUNCH INFO 2022-05-18 11:27:17,762 servers:
+    LAUNCH INFO 2022-05-18 11:27:17,762 trainer_num: 2
+    LAUNCH INFO 2022-05-18 11:27:17,762 trainers:
+    LAUNCH INFO 2022-05-18 11:27:17,762 training_script: train.py
+    LAUNCH INFO 2022-05-18 11:27:17,762 training_script_args: []
+    LAUNCH INFO 2022-05-18 11:27:17,762 with_gloo: 0
+    LAUNCH INFO 2022-05-18 11:27:17,762 --------------------------------------------------
+    LAUNCH INFO 2022-05-18 11:27:17,772 Job: default, mode ps, replicas 1[1:1], elastic False
+    LAUNCH INFO 2022-05-18 11:27:17,775 Run Pod: evjsyn, replicas 3, status ready
+    LAUNCH INFO 2022-05-18 11:27:17,795 Watching Pod: evjsyn, replicas 3, status running
+
+At the same time, log files for the server node and the trainer nodes are generated in the log directory.
+Server node log: default.evjsyn.ps.0.log. The log must contain the following line, which indicates that the server node started successfully and can provide service:
+
+.. code-block:: bash
+
+    I0518 11:27:20.730531 177420 brpc_ps_server.cc:73] running server with rank id: 0, endpoint: IP:PORT
+
+Trainer node log: default.evjsyn.trainer.0.log. The log prints some variable values produced during training:
+
+.. code-block:: bash
+
+    time: [2022-05-18 11:27:27], batch: [1], loss[1]:[0.666739]
+    time: [2022-05-18 11:27:27], batch: [2], loss[1]:[0.690405]
+    time: [2022-05-18 11:27:27], batch: [3], loss[1]:[0.681693]
+    time: [2022-05-18 11:27:27], batch: [4], loss[1]:[0.703863]
+    time: [2022-05-18 11:27:27], batch: [5], loss[1]:[0.670717]
+
+Note: for questions related to launching jobs, please refer to \ `launch <https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/distributed/launch_cn.html>`_\ .
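For context on what fleetrun is launching here, the following is a minimal sketch of the server/trainer role split that a train.py such as the one in the linked FleetX example typically contains; the model-building details are omitted, and the actual script in the repository may differ.

.. code-block:: python

    # Sketch only: the role (server or trainer) of each process is read from
    # environment variables that fleetrun sets when it spawns the process.
    import paddle
    import paddle.distributed.fleet as fleet

    paddle.enable_static()
    fleet.init()

    # ... build the wide_and_deep network, loss and optimizer here, then wrap
    # the optimizer with fleet.distributed_optimizer(...) before minimize() ...

    if fleet.is_server():
        fleet.init_server()
        fleet.run_server()      # blocks, serving parameter lookup/update requests
    elif fleet.is_worker():
        fleet.init_worker()
        # ... run the data pipeline and training loop, logging loss per batch ...
        fleet.stop_worker()

Once the job is running, the per-batch loss lines shown above can be followed live with, for example, tail -f log/default.evjsyn.trainer.0.log.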

docs/guides/06_distributed_training/index_cn.rst

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@
 Through the following contents, you can learn about the features of Paddle distributed training and how to use it:
 
 - `Distributed Training Quick Start <./cluster_quick_start_cn.html>`_ : Quickly get started with distributed training using the Paddle framework.
+- `Parameter Server Quick Start <./cluster_quick_start_ps_cn.html>`_ : Quickly get started with distributed training using the Paddle parameter server.
 - `Distributed Training with the Fleet API <./fleet_api_howto_cn.html>`_ : Complete distributed training with the Fleet API of the Paddle framework.
 
 .. toctree::
