@typhoonzero typhoonzero commented Nov 23, 2018

Fix #292


@shanyi15 shanyi15 added the API Guide docs related to API Guide label Nov 23, 2018
@typhoonzero typhoonzero requested a review from shanyi15 November 26, 2018 01:56
@typhoonzero typhoonzero changed the title [WIP]add distributed sync training api guide add distributed sync training api guide Nov 26, 2018
@shanyi15 shanyi15 requested a review from luotao1 November 26, 2018 04:12
############

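Fluid supports data-parallel distributed synchronous training. The API uses :code:`DistributedTranspiler` to convert a single-machine network configuration into a
:code:`pserver`-side program and a :code:`trainer`-side program that can run on multiple machines, users run the same piece of code on different nodes, and depending on environment variables or startup arguments,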
Collaborator

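Change the comma to a full stop, so that "users run the same piece of code on different ..." starts a new sentence.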

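- :code:`pservers` : the list of IP:port endpoints of the pserver nodes in the current training job
- :code:`trainers` : the number of trainer nodes in the current training job (note that in NCCL2 mode this parameter is a string giving the list of IP:port endpoints of the trainer nodes).
Note that in pserver mode the number of trainer nodes need not match the number of pserver nodes, for example 20 pservers and 50 trainers. In a real training job
you can tune the number of pserver and trainer nodes to find the best performance.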
Collaborator

The part of the trainers description that covers NCCL2 mode: should it be moved to line 85 and introduced there?

low_level/metrics.rst
low_level/model_save_reader.rst
low_level/inference.rst
low_level/distributed/index.rst
Contributor

To avoid conflicts from several people editing this file at the same time, please revert the changes to the index file for now.

.. toctree::
:maxdepth: 1

sync_training.rst No newline at end of file
Contributor

To avoid conflicts from several people editing this file at the same time, please revert the changes to the index file for now.

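- :code:`pservers` : the list of IP:port endpoints of the pserver nodes in the current training job
- :code:`trainers` : the number of trainer nodes in the current training job (note that in NCCL2 mode this parameter is a string giving the list of IP:port endpoints of the trainer nodes).
Note that in pserver mode the number of trainer nodes need not match the number of pserver nodes, for example 20 pservers and 50 trainers. In a real training job
you can tune the number of pserver and trainer nodes to find the best performance.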
Contributor

There is a stray space between "任务" (job) and "中" (in) in the rendered text.

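but a large number of nodes is needed; this helps increase the computation parallelism on the pserver side
- :code:`split_method` : configures how the transpiler assigns parameters (or parameter slices) to multiple pservers.
The default is "RoundRobin"; "HashName" can also be used
- :code:`min_block_size` : if parameter slicing is configured, specifies the minimum size of a Tensor slice, which prevents RPC request packets from becoming too small. The default is 8192; in general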
Contributor

There is a stray space between "况" and "不" in the rendered text.
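
For readers of the `split_method` and `min_block_size` entries quoted above, here is a minimal pure-Python sketch of the two ideas: cutting a parameter into blocks that never fall below a minimum size, and placing those blocks on pservers either in turn or by a hash of the block name. It is an illustration only and is not taken from the transpiler source. The function names, the block names such as `w.block0`, the endpoints, and the use of `zlib.crc32` as the hash are all assumptions made for this example.

```python
import zlib


def split_into_blocks(num_elements, pserver_count, min_block_size=8192):
    """Split a parameter with num_elements values into block sizes.

    At most pserver_count blocks are produced, and no block is smaller
    than min_block_size unless the whole parameter is smaller than that.
    (Hypothetical helper; the real transpiler may slice differently.)
    """
    max_blocks = max(1, num_elements // min_block_size)
    block_count = min(pserver_count, max_blocks)
    base, extra = divmod(num_elements, block_count)
    # The first `extra` blocks carry one additional element each.
    return [base + (1 if i < extra else 0) for i in range(block_count)]


def round_robin(block_names, endpoints):
    """Assign blocks to endpoints in turn, wrapping around at the end."""
    return {
        name: endpoints[i % len(endpoints)]
        for i, name in enumerate(block_names)
    }


def hash_name(block_names, endpoints):
    """Assign each block by a stable hash of its name (crc32 is an assumption)."""
    return {
        name: endpoints[zlib.crc32(name.encode("utf-8")) % len(endpoints)]
        for name in block_names
    }
```

With two hypothetical endpoints, `round_robin(["w.block0", "w.block1", "b.block0"], endpoints)` alternates between them in list order, while `hash_name` places a block on the same endpoint every time regardless of its position in the list.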

@shanyi15
Contributor

Review has been finished, please have a look, thanks

Collaborator

@luotao1 luotao1 left a comment

LGTM

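- :code:`trainer_id` : the id of the trainer node, from 0 to n-1, where n is the number of trainer nodes in the current training job
- :code:`program` : the :code:`program` to be transpiled; defaults to :code:`fluid.default_main_program()`
- :code:`pservers` : the list of IP:port endpoints of the pserver nodes in the current training job
- :code:`trainers` : int, the number of trainer nodes in the current training job (in NCCL2 mode this parameter is a string giving the list of IP:port endpoints of the trainer nodes). Note that in pserver mode the number of trainer nodes need not match the number of pserver nodes, for example 20 pservers and 50 trainers. In a real training job you can tune the number of pserver and trainer nodes to find the best performance.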
Collaborator

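  • :code:`trainers` : int, the number of trainer nodes in the current training job. Note:
    • In pserver mode the number of trainer nodes need not match the number of pserver nodes, for example 20 pservers and 50 trainers. In a real training job you can tune the number of pserver and trainer nodes to find the best performance.
    • In NCCL2 mode this parameter is a string giving the list of IP:port endpoints of the trainer nodes.

@shanyi15 @tink2123 I think splitting this into two points would be clearer. Could you help make this change after the merge?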

Contributor

fixed
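
As a companion to the two-point wording of `trainers` suggested above, the following sketch checks the argument combinations the parameter list describes: `trainer_id` in the range 0 to n-1, `trainers` as an int in pserver mode, and `trainers` as an endpoint string in NCCL2 mode. It does not call Paddle. The helper name `build_transpile_args`, the `nccl2` flag, and the comma-separated `ip:port` string format are assumptions made for illustration.

```python
def build_transpile_args(trainer_id, pservers, trainers, nccl2=False):
    """Validate the arguments described in the parameter list.

    Hypothetical helper: endpoint lists are assumed to be comma-separated
    "ip:port" strings, which the reviewed text does not spell out.
    """
    pserver_endpoints = [ep for ep in pservers.split(",") if ep]

    if nccl2:
        # NCCL2 mode: `trainers` is a string listing the trainer endpoints.
        if not isinstance(trainers, str):
            raise TypeError("in NCCL2 mode `trainers` must be an endpoint string")
        trainer_count = len([ep for ep in trainers.split(",") if ep])
    else:
        # pserver mode: `trainers` is the number of trainer nodes.
        if not isinstance(trainers, int):
            raise TypeError("in pserver mode `trainers` must be an int")
        trainer_count = trainers

    # trainer_id runs from 0 to n-1, where n is the number of trainers.
    if not 0 <= trainer_id < trainer_count:
        raise ValueError("trainer_id must be in [0, %d)" % trainer_count)

    return {
        "trainer_id": trainer_id,
        "pservers": pservers,
        "trainers": trainers,
        "pserver_endpoints": pserver_endpoints,
        "trainer_count": trainer_count,
    }
```

In pserver mode the trainer and pserver counts are checked independently, which matches the note that the two numbers need not be equal.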

@shanyi15 shanyi15 requested a review from tink2123 November 30, 2018 05:48
Collaborator

@tink2123 tink2123 left a comment

LGTM

@shanyi15 shanyi15 merged commit ac3edb4 into PaddlePaddle:develop Nov 30, 2018
@luotao1 luotao1 mentioned this pull request Nov 30, 2018