|
15 | 15 |
|
16 | 16 | Therefore, the parameter server mode is well suited to training scenarios that need to store extremely large model parameters, and it is commonly used to train search and recommendation models with massive sparse parameters. |
17 | 17 |
|
18 | | -This section takes wide_and_deep, a classic model in the recommendation field, as an example to show how to complete a parameter server training task with Paddle's distributed training. The complete example code for this quick start is located at https://github.com/PaddlePaddle/FleetX/tree/develop/examples/wide_and_deep_dataset. |
19 | | - |
20 | | -2.1 Version requirements |
| 18 | +2.1 Task introduction |
21 | 19 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
22 | 20 |
|
23 | | -Before writing the distributed training program, users need to make sure that paddlepaddle-2.0.0-rc-cpu or paddlepaddle-2.0.0-rc-gpu, or a later version, of the PaddlePaddle open-source framework has been installed. |
| 21 | +This section takes wide_and_deep, a classic model in the recommendation field, as an example to show how to complete a parameter server training task with Paddle's distributed training. The complete example code for this quick start is located at https://github.com/PaddlePaddle/FleetX/tree/develop/examples/wide_and_deep_dataset. |
| 22 | +Before writing the distributed training program, users need to make sure that PaddlePaddle 2.3 or a later version of the open-source framework has been installed. |
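
If PaddlePaddle is not installed yet, a suitable build can be installed with pip. The command below is only a minimal sketch for a CPU build; GPU builds use a different package name and the exact version pin may vary, so please follow the official installation guide for your environment.

.. code-block:: bash

    # Illustrative only: install a CPU build of PaddlePaddle 2.3 or later.
    # For GPU builds and platform-specific wheels, follow the official
    # installation guide instead of this exact command.
    python -m pip install "paddlepaddle>=2.3.0"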
24 | 23 |
|
25 | 24 | 2.2 Operation steps |
26 | 25 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
146 | 145 |
|
147 | 146 | .. code-block:: bash |
148 | 147 |
|
149 | | - fleetrun --server_num=1 --worker_num=2 train.py |
| 148 | + fleetrun --server_num=1 --trainer_num=2 train.py |
150 | 149 |
|
151 | | -You will see the following log messages: |
| 150 | +You will see the following log messages in the terminal where the command is executed: |
152 | 151 |
|
153 | 152 | .. code-block:: bash |
154 | 153 | |
155 | | - ----------- Configuration Arguments ----------- |
156 | | - gpus: 0,1 |
157 | | - heter_worker_num: None |
158 | | - heter_workers: |
159 | | - http_port: None |
160 | | - ips: 127.0.0.1 |
161 | | - log_dir: log |
162 | | - nproc_per_node: None |
163 | | - server_num: 1 |
164 | | - servers: |
165 | | - training_script: train.py |
166 | | - training_script_args: [] |
167 | | - worker_num: 2 |
168 | | - workers: |
169 | | - ------------------------------------------------ |
170 | | - INFO 2021-05-06 12:14:26,890 launch.py:298] Run parameter-sever mode. pserver arguments:['--worker_num', '--server_num'], cuda count:8 |
171 | | - INFO 2021-05-06 12:14:26,892 launch_utils.py:973] Local server start 1 processes. First process distributed environment info (Only For Debug): |
172 | | - +=======================================================================================+ |
173 | | - | Distributed Envs Value | |
174 | | - +---------------------------------------------------------------------------------------+ |
175 | | - | PADDLE_TRAINERS_NUM 2 | |
176 | | - | TRAINING_ROLE PSERVER | |
177 | | - | POD_IP 127.0.0.1 | |
178 | | - | PADDLE_GLOO_RENDEZVOUS 3 | |
179 | | - | PADDLE_PSERVERS_IP_PORT_LIST 127.0.0.1:34008 | |
180 | | - | PADDLE_PORT 34008 | |
181 | | - | PADDLE_WITH_GLOO 0 | |
182 | | - | PADDLE_HETER_TRAINER_IP_PORT_LIST | |
183 | | - | PADDLE_TRAINER_ENDPOINTS 127.0.0.1:18913,127.0.0.1:10025 | |
184 | | - | PADDLE_GLOO_HTTP_ENDPOINT 127.0.0.1:23053 | |
185 | | - | PADDLE_GLOO_FS_PATH /tmp/tmp8vqb8arq | |
186 | | - +=======================================================================================+ |
187 | | - |
188 | | - INFO 2021-05-06 12:14:26,902 launch_utils.py:1041] Local worker start 2 processes. First process distributed environment info (Only For Debug): |
189 | | - +=======================================================================================+ |
190 | | - | Distributed Envs Value | |
191 | | - +---------------------------------------------------------------------------------------+ |
192 | | - | PADDLE_GLOO_HTTP_ENDPOINT 127.0.0.1:23053 | |
193 | | - | PADDLE_GLOO_RENDEZVOUS 3 | |
194 | | - | PADDLE_PSERVERS_IP_PORT_LIST 127.0.0.1:34008 | |
195 | | - | PADDLE_WITH_GLOO 0 | |
196 | | - | PADDLE_TRAINER_ENDPOINTS 127.0.0.1:18913,127.0.0.1:10025 | |
197 | | - | FLAGS_selected_gpus 0 | |
198 | | - | PADDLE_GLOO_FS_PATH /tmp/tmp8vqb8arq | |
199 | | - | PADDLE_TRAINERS_NUM 2 | |
200 | | - | TRAINING_ROLE TRAINER | |
201 | | - | XPU_VISIBLE_DEVICES 0 | |
202 | | - | PADDLE_HETER_TRAINER_IP_PORT_LIST | |
203 | | - | PADDLE_TRAINER_ID 0 | |
204 | | - | CUDA_VISIBLE_DEVICES 0 | |
205 | | - | FLAGS_selected_xpus 0 | |
206 | | - +=======================================================================================+ |
207 | | - |
208 | | - INFO 2021-05-06 12:14:26,921 launch_utils.py:903] Please check servers, workers and heter_worker logs in log/workerlog.*, log/serverlog.* and log/heterlog.* |
209 | | - INFO 2021-05-06 12:14:33,446 launch_utils.py:914] all workers exit, going to finish parameter server and heter_worker. |
210 | | - INFO 2021-05-06 12:14:33,446 launch_utils.py:926] all parameter server are killed |
| 154 | + LAUNCH INFO 2022-05-18 11:27:17,761 ----------- Configuration ---------------------- |
| 155 | + LAUNCH INFO 2022-05-18 11:27:17,761 devices: None |
| 156 | + LAUNCH INFO 2022-05-18 11:27:17,761 elastic_level: -1 |
| 157 | + LAUNCH INFO 2022-05-18 11:27:17,761 elastic_timeout: 30 |
| 158 | + LAUNCH INFO 2022-05-18 11:27:17,761 gloo_port: 6767 |
| 159 | + LAUNCH INFO 2022-05-18 11:27:17,761 host: None |
| 160 | + LAUNCH INFO 2022-05-18 11:27:17,761 job_id: default |
| 161 | + LAUNCH INFO 2022-05-18 11:27:17,761 legacy: False |
| 162 | + LAUNCH INFO 2022-05-18 11:27:17,761 log_dir: log |
| 163 | + LAUNCH INFO 2022-05-18 11:27:17,761 log_level: INFO |
| 164 | + LAUNCH INFO 2022-05-18 11:27:17,762 master: None |
| 165 | + LAUNCH INFO 2022-05-18 11:27:17,762 max_restart: 3 |
| 166 | + LAUNCH INFO 2022-05-18 11:27:17,762 nnodes: 1 |
| 167 | + LAUNCH INFO 2022-05-18 11:27:17,762 nproc_per_node: None |
| 168 | + LAUNCH INFO 2022-05-18 11:27:17,762 rank: -1 |
| 169 | + LAUNCH INFO 2022-05-18 11:27:17,762 run_mode: collective |
| 170 | + LAUNCH INFO 2022-05-18 11:27:17,762 server_num: 1 |
| 171 | + LAUNCH INFO 2022-05-18 11:27:17,762 servers: |
| 172 | + LAUNCH INFO 2022-05-18 11:27:17,762 trainer_num: 2 |
| 173 | + LAUNCH INFO 2022-05-18 11:27:17,762 trainers: |
| 174 | + LAUNCH INFO 2022-05-18 11:27:17,762 training_script: train.py |
| 175 | + LAUNCH INFO 2022-05-18 11:27:17,762 training_script_args: [] |
| 176 | + LAUNCH INFO 2022-05-18 11:27:17,762 with_gloo: 0 |
| 177 | + LAUNCH INFO 2022-05-18 11:27:17,762 -------------------------------------------------- |
| 178 | + LAUNCH INFO 2022-05-18 11:27:17,772 Job: default, mode ps, replicas 1[1:1], elastic False |
| 179 | + LAUNCH INFO 2022-05-18 11:27:17,775 Run Pod: evjsyn, replicas 3, status ready |
| 180 | + LAUNCH INFO 2022-05-18 11:27:17,795 Watching Pod: evjsyn, replicas 3, status running |
| 181 | +
|
| 182 | +Meanwhile, log files for the server node and the trainer nodes are generated under the log directory. |
| 183 | +Server node log: default.evjsyn.ps.0.log. It should contain the following line, which shows that the server node has started successfully and is ready to serve. |
| 184 | + |
| 185 | +.. code-block:: bash |
| 186 | +
|
| 187 | + I0518 11:27:20.730531 177420 brpc_ps_server.cc:73] running server with rank id: 0, endpoint: IP:PORT |
| 188 | +
|
| 189 | +Trainer node log: default.evjsyn.trainer.0.log. The log prints some of the variable values (such as the loss) produced during training. |
| 190 | + |
| 191 | +.. code-block:: bash |
| 192 | +
|
| 193 | + time: [2022-05-18 11:27:27], batch: [1], loss[1]:[0.666739] |
| 194 | + time: [2022-05-18 11:27:27], batch: [2], loss[1]:[0.690405] |
| 195 | + time: [2022-05-18 11:27:27], batch: [3], loss[1]:[0.681693] |
| 196 | + time: [2022-05-18 11:27:27], batch: [4], loss[1]:[0.703863] |
| 197 | + time: [2022-05-18 11:27:27], batch: [5], loss[1]:[0.670717] |
| 198 | +
|
| 199 | +Note: for questions about launching the job, please refer to \ `launch <https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/distributed/launch_cn.html>`_\ . |
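
For reference, the configuration listing above also shows ``servers`` and ``trainers`` fields, which suggests that endpoints can be listed explicitly for multi-machine jobs. The command below is only a sketch under that assumption; the exact flag syntax and semantics should be confirmed against the launch documentation linked above.

.. code-block:: bash

    # Sketch only: the --servers/--trainers flags and their endpoint-list syntax
    # are assumed from the configuration fields printed in the log above.
    # The same command would typically be run on every participating machine;
    # the IP addresses and ports here are placeholders.
    fleetrun --servers="192.168.0.1:8800" \
             --trainers="192.168.0.2:8900,192.168.0.3:8900" \
             train.py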