@@ -90,8 +90,8 @@ The key to further improving CPU distributed training speed lies in choosing a suitable distributed
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler.distributed_strategy_factory import DistributedStrategyFactory

-Next, specify the training strategy for CPU distributed training. Four configurations are currently available: synchronous training (Sync), asynchronous training (Async), half-asynchronous training (Half-Async), and GEO training. For details of the different strategies, see the design document:
-https://github.com/PaddlePaddle/Fleet/blob/develop/markdown_doc/transpiler/transpiler_cpu.md
+Next, specify the training strategy for CPU distributed training. Four configurations are currently available: synchronous training (Sync), asynchronous training (Async), half-asynchronous training (Half-Async), and GEO training.


The default configuration of the above strategies is introduced by the following code and used to run CPU distributed training:
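
A minimal sketch of that setup (the ``create_*_strategy`` helper names and the fleet calls below are assumptions based on the incubate Fleet API; ``loss`` stands for the loss variable of the user-defined network):

.. code-block:: python

    import paddle.fluid as fluid
    import paddle.fluid.incubate.fleet.base.role_maker as role_maker
    from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
    from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler.distributed_strategy_factory import DistributedStrategyFactory

    # Pick one of the four strategies; the factory returns its default configuration.
    strategy = DistributedStrategyFactory.create_async_strategy()
    # Alternatives: create_sync_strategy(), create_half_async_strategy(),
    # create_geo_strategy(update_frequency=400)

    role = role_maker.PaddleCloudRoleMaker()   # reads trainer/pserver info from environment variables
    fleet.init(role)

    optimizer = fluid.optimizer.SGD(learning_rate=0.01)
    optimizer = fleet.distributed_optimizer(optimizer, strategy)
    optimizer.minimize(loss)

    if fleet.is_server():
        fleet.init_server()
        fleet.run_server()
    elif fleet.is_worker():
        fleet.init_worker()
        # ... run the training loop here ...
        fleet.stop_worker()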

@@ -94,8 +94,8 @@ First, we need to introduce relevant libraries into the code:
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler.distributed_strategy_factory import DistributedStrategyFactory

-At present, there are four kinds of training strategies: synchronous training (Sync), asynchronous training (Async), half-asynchronous training (Half-Async), and GEO training. For details of the different strategies, see the design document:
-https://github.com/PaddlePaddle/Fleet/blob/develop/markdown_doc/transpiler/transpiler_cpu.md
+At present, there are four kinds of training strategies: synchronous training (Sync), asynchronous training (Async), half-asynchronous training (Half-Async), and GEO training.


The default configuration of the above strategies is introduced by the following code:

@@ -39,7 +39,7 @@ PaddlePaddle Fluid supports high-performance distributed training on modern GPU [#]_ server clusters
data_loader.reset()


-In addition, the DALI library can be used to improve data-processing performance. DALI is a data loading library developed by NVIDIA; for more details, see the `official documentation <https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html>`_ . For how to use DALI with PaddlePaddle, see this `usage example <https://github.com/PaddlePaddle/Fleet/tree/develop/benchmark/collective/resnet>`_ .
+In addition, the DALI library can be used to improve data-processing performance. DALI is a data loading library developed by NVIDIA; for more details, see the `official documentation <https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html>`_ . For how to use DALI with PaddlePaddle, see this `usage example <https://github.com/PaddlePaddle/FleetX/tree/old_develop/deprecated/benchmark/collective/resnet>`_ .

2. Training strategy configuration
==================================
@@ -115,7 +115,7 @@ In synchronous multi-node multi-GPU training there is a slow-trainer phenomenon: in each step, the training
- The Local SGD warmup steps :code:`local_sgd_is_warm_steps` affect the generalization of the final model. Local SGD training should generally start only after the model parameters have stabilized; empirically, the epoch at which the learning rate first decays can be used as the warmup step count, after which Local SGD training begins.
- The Local SGD steps :code:`local_sgd_steps`: in general, the larger this value, the fewer the communication rounds and the faster the training, but at the cost of lower model accuracy. Empirically, set it to 2 or 4.

-For concrete Local SGD training code, see: https://github.com/PaddlePaddle/Fleet/tree/develop/examples/local_sgd/resnet
+For concrete Local SGD training code, see: https://github.com/PaddlePaddle/FleetX/tree/old_develop/deprecated/examples/local_sgd/resnet
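
As an illustration of the two heuristics above, a small sketch (the learning-rate schedule value is a placeholder, not taken from the original example):

.. code-block:: python

    # Rule of thumb from above: start Local SGD after the epoch of the first LR decay.
    first_lr_decay_epoch = 30            # placeholder, depends on the learning-rate schedule
    local_sgd_is_warm_steps = first_lr_decay_epoch

    # Larger values mean fewer communications and faster training but lower accuracy;
    # empirically 2 or 4 works well.
    local_sgd_steps = 2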


2. Using mixed-precision training
@@ -14,7 +14,7 @@ optimizer = fluid.optimizer.DGCMomentumOptimizer(
learning_rate=0.001, momentum=0.9, rampup_begin_step=0)
optimizer.minimize(cost)
```
-In fleet we provide a [DGC example](https://github.com/PaddlePaddle/Fleet/tree/develop/examples/dgc_example). The example takes handwritten digit recognition, ports the program to a distributed version (note: DGC also supports single-node multi-GPU training), and then adds the DGC optimizer. You can follow this example to migrate a single-node single-GPU program to DGC. During such a migration, it is generally necessary to first align the accuracy of multi-node Momentum, and then align the accuracy of DGC.
+In fleet we provide a [DGC example](https://github.com/PaddlePaddle/FleetX/tree/old_develop/deprecated/examples/dgc_example). The example takes handwritten digit recognition, ports the program to a distributed version (note: DGC also supports single-node multi-GPU training), and then adds the DGC optimizer. You can follow this example to migrate a single-node single-GPU program to DGC. During such a migration, it is generally necessary to first align the accuracy of multi-node Momentum, and then align the accuracy of DGC.
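A sketch of the two-stage accuracy alignment mentioned above (an assumed workflow; the hyper-parameters are placeholders):

```python
import paddle.fluid as fluid

# Stage 1: run the distributed program with plain Momentum and align its accuracy
# with the single-node single-GPU baseline.
optimizer = fluid.optimizer.Momentum(learning_rate=0.001, momentum=0.9)

# Stage 2: once Momentum accuracy matches, switch to DGC and align accuracy again.
optimizer = fluid.optimizer.DGCMomentumOptimizer(
    learning_rate=0.001, momentum=0.9, rampup_begin_step=0)
optimizer.minimize(cost)  # `cost` is the loss variable of the network
```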

## 3. Hyper-parameter tuning & applicable scenarios
### 3.1 Warmup tuning
@@ -101,7 +101,7 @@ In principle, Recompute works with all Optimizers.

**2. Using Recompute in the Fleet API**

-`Fleet API <https://github.com/PaddlePaddle/Fleet>`_
+`Fleet API <https://github.com/PaddlePaddle/FleetX>`_
is a high-level API for distributed computation built on Fluid. Adding RecomputeOptimizer
in the Fleet API takes only two steps:
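
A rough sketch of enabling Recompute through a Fleet distributed strategy (the ``forward_recompute`` and ``recompute_checkpoints`` fields and the collective import below are assumptions; actual names may differ between versions):

.. code-block:: python

    import paddle.fluid as fluid
    from paddle.fluid.incubate.fleet.collective import fleet, DistributedStrategy

    dist_strategy = DistributedStrategy()
    dist_strategy.forward_recompute = True
    # Tell Fleet which intermediate variables to keep; everything else is recomputed.
    dist_strategy.recompute_checkpoints = checkpoint_vars   # user-chosen list of variables

    optimizer = fluid.optimizer.Adam(learning_rate=1e-4)
    optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy)
    optimizer.minimize(loss)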

@@ -121,7 +121,7 @@
To help you quickly use Recompute with the Fleet API, we provide several examples,
together with their training speed, accuracy, and GPU memory savings:

-- Bert fine-tuning with Recompute: `source <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/recompute/bert>`_
+- Bert fine-tuning with Recompute: `source <https://github.com/PaddlePaddle/FleetX/tree/old_develop/deprecated/examples/recompute/bert>`_

- Object detection with Recompute: under development.

Expand All @@ -136,7 +136,7 @@ Q&A
- **Are there more official Recompute examples?**

More Recompute examples will be added under the `examples <https://github.com/PaddlePaddle/examples/tree/master/community_examples/recompute>`_
-and `Fleet <https://github.com/PaddlePaddle/Fleet>`_ repositories; stay tuned.
+and `Fleet <https://github.com/PaddlePaddle/FleetX>`_ repositories; stay tuned.

- **Is there any advice on adding checkpoints?**

@@ -132,7 +132,7 @@ In principle, recompute works with all kinds of optimizers in Paddle.

**2. Using Recompute in Fleet API**

-`Fleet API <https://github.com/PaddlePaddle/Fleet>`_
+`Fleet API <https://github.com/PaddlePaddle/FleetX>`_
is a high-level API for distributed training in Fluid. Adding
RecomputeOptimizer via the Fleet API takes two steps:

Expand All @@ -154,7 +154,7 @@ We also post corresponding training speed,
test results and memory usages of these examples for reference.


-- Fine-tuning Bert Large model with recomputing: `source <https://github.com/PaddlePaddle/Fleet/tree/develop/examples/recompute/bert>`_
+- Fine-tuning Bert Large model with recomputing: `source <https://github.com/PaddlePaddle/FleetX/tree/old_develop/deprecated/examples/recompute/bert>`_

- Training object detection models with recomputing: under development.

Expand All @@ -171,7 +171,7 @@ first-computation and recomputation consistent.
- **Are there more official examples of Recompute?**

More examples will be updated at `examples <https://github.com/PaddlePaddle/examples/tree/master/community_examples/recompute>`_
-and `Fleet <https://github.com/PaddlePaddle/Fleet>`_ . Feel free to
+and `Fleet <https://github.com/PaddlePaddle/FleetX>`_ . Feel free to
raise issues if you run into any problems with these examples.

- **How should I set checkpoints?**
@@ -21,9 +21,9 @@

This section uses wide_and_deep, a classic model in the recommendation domain, as an example to show how to complete a parameter-server training task with PaddlePaddle distributed training.

-Parameter-server training is based on the PaddlePaddle static graph. To aid understanding, we provide a single-machine static-graph example of the wide_and_deep model: `single-machine static-graph example <https://github.com/PaddlePaddle/FleetX/tree/develop/eval/rec/wide_and_deep_single_static>`_.
+Parameter-server training is based on the PaddlePaddle static graph. To aid understanding, we provide a single-machine static-graph example of the wide_and_deep model: `single-machine static-graph example <https://github.com/PaddlePaddle/FleetX/tree/old_develop/eval/rec/wide_and_deep_single_static>`_.

-Starting from the single-machine static-graph example and following the steps in Section 1.2, it can be modified into a parameter-server training example. The complete example code for this quick start is available at: `complete parameter-server example <https://github.com/PaddlePaddle/FleetX/tree/develop/examples/wide_and_deep_dataset>`_.
+Starting from the single-machine static-graph example and following the steps in Section 1.2, it can be modified into a parameter-server training example. The complete example code for this quick start is available at: `complete parameter-server example <https://github.com/PaddlePaddle/FleetX/tree/old_develop/examples/wide_and_deep_dataset>`_.
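
A condensed sketch of that conversion, based on the ``paddle.distributed.fleet`` API (network and data-reading code are omitted; details may differ from the linked full example):

.. code-block:: python

    import paddle
    import paddle.distributed.fleet as fleet

    paddle.enable_static()
    fleet.init(is_collective=False)        # parameter-server mode

    # ... build the wide_and_deep network exactly as in the single-machine example ...

    strategy = fleet.DistributedStrategy()
    strategy.a_sync = True                 # asynchronous parameter-server training

    optimizer = paddle.optimizer.SGD(learning_rate=0.0001)
    optimizer = fleet.distributed_optimizer(optimizer, strategy)
    optimizer.minimize(loss)

    if fleet.is_server():
        fleet.init_server()
        fleet.run_server()
    else:
        fleet.init_worker()
        # ... run the training loop on each worker ...
        fleet.stop_worker()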

We have also created a parameter-server quick-start project on AIStudio: `parameter-server quick start <https://aistudio.baidu.com/aistudio/projectdetail/4189047?channelType=0&channel=0>`_, where you can run the parameter-server training code directly.

@@ -206,4 +206,4 @@
Time with AMP mode enabled:
total time = 1.222 sec

-The above example is located at: `example/amp/amp_dygraph.py <https://github.com/PaddlePaddle/FleetX/blob/develop/examples/amp/amp_dygraph.py>`_ .
+The above example is located at: `example/amp/amp_dygraph.py <https://github.com/PaddlePaddle/FleetX/blob/old_develop/examples/amp/amp_dygraph.py>`_ .
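
For reference, a minimal dygraph AMP sketch using ``paddle.amp.auto_cast`` and ``paddle.amp.GradScaler`` (a simplified stand-in, not the exact code from the linked file):

.. code-block:: python

    import paddle

    model = paddle.nn.Linear(10, 10)
    optimizer = paddle.optimizer.Adam(parameters=model.parameters())
    scaler = paddle.amp.GradScaler(init_loss_scaling=1024)

    data = paddle.rand([4, 10])
    with paddle.amp.auto_cast():        # forward pass runs in mixed precision
        loss = model(data).mean()

    scaled = scaler.scale(loss)         # scale the loss to avoid fp16 underflow
    scaled.backward()
    scaler.minimize(optimizer, scaled)  # unscale gradients and apply the update
    optimizer.clear_grad()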
@@ -171,7 +171,7 @@ batch size = seq * seq_max_len

python recompute_dygraph.py

-Recompute dynamic-graph code: `code example <https://github.com/PaddlePaddle/FleetX/tree/develop/examples/recompute>`__.
+Recompute dynamic-graph code: `code example <https://github.com/PaddlePaddle/FleetX/tree/old_develop/examples/recompute>`__.

Output:

2 changes: 1 addition & 1 deletion docs/guides/06_distributed_training/model_parallel_cn.rst
@@ -309,7 +309,7 @@
optimizer.clear_grad()
print("loss", loss.numpy())

-Model-parallel dynamic-graph code: `example/model_parallelism/mp_dygraph.py <https://github.com/PaddlePaddle/FleetX/tree/develop/examples/model_parallelism>`_.
+Model-parallel dynamic-graph code: `example/model_parallelism/mp_dygraph.py <https://github.com/PaddlePaddle/FleetX/tree/old_develop/examples/model_parallelism>`_.


How to run (the current machine must have two GPUs):
@@ -261,7 +261,7 @@ model.train_batch(...): this step mainly executes the 1F1B pipeline-parallel scheme
export CUDA_VISIBLE_DEVICES=0,1
python -m paddle.distributed.launch alexnet_dygraph_pipeline.py # alexnet_dygraph_pipeline.py is the user's Python script that runs the dynamic-graph pipeline

-Complete AlexNet-based pipeline-parallel dynamic-graph code: `alex <https://github.com/PaddlePaddle/FleetX/tree/develop/examples/pipeline>`_.
+Complete AlexNet-based pipeline-parallel dynamic-graph code: `alex <https://github.com/PaddlePaddle/FleetX/tree/old_develop/examples/pipeline>`_.

The console output is as follows:
