add distributed sync training api guide #372
Conversation
> ############
>
> Fluid supports data-parallel distributed synchronous training. The API uses :code:`DistributedTranspiler` to convert a single-machine network configuration into a :code:`pserver`-side program and a :code:`trainer`-side program that can run on multiple machines, users run the same code on different nodes and, depending on environment variables or launch arguments,
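Since every node runs the same script and picks its role from the launch environment, the role-selection step can be sketched in plain Python. The environment-variable names below (`TRAINING_ROLE`, `PADDLE_PSERVERS`, `PADDLE_TRAINER_ID`) are illustrative assumptions, not mandated by the guide; actual deployments define their own convention.

```python
import os

def detect_role(env):
    """Decide what this process should do from launch-time environment
    variables. The variable names are illustrative placeholders."""
    role = env.get("TRAINING_ROLE", "TRAINER").upper()
    pserver_spec = env.get("PADDLE_PSERVERS", "")
    pservers = pserver_spec.split(",") if pserver_spec else []
    trainer_id = int(env.get("PADDLE_TRAINER_ID", "0"))
    return {"role": role, "pservers": pservers, "trainer_id": trainer_id}

# Every node runs this same script; only the environment differs.
cfg = detect_role({"TRAINING_ROLE": "PSERVER",
                   "PADDLE_PSERVERS": "192.168.0.1:6174,192.168.0.2:6174",
                   "PADDLE_TRAINER_ID": "0"})
```

In a real job, `detect_role(os.environ)` would be called once at startup and the result used to pick the pserver or trainer code path.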
The comma before "用户在不同" ("users run the same code on different nodes") should be a period.
> - :code:`pservers` : the list of IP:port endpoints of the pserver nodes in the current training job
> - :code:`trainers` : the number of trainer nodes in the current training job (note that in NCCL2 mode this argument is a string that lists the trainer nodes' IP:port endpoints).
>   Note that in pserver mode the trainer count may differ from the pserver count, e.g. 20 pservers and 50 trainers. In a real training job,
>   you can tune the numbers of pserver and trainer nodes to find the best performance.
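The guide describes :code:`pservers` as a list of IP:port endpoints; in practice such endpoint lists are often passed around as one comma-separated string. A minimal parsing sketch (the string format is an assumption based on the guide's description, not a documented requirement):

```python
def parse_endpoints(pserver_spec):
    """Split a comma-separated "ip:port" string into (ip, port) pairs."""
    endpoints = []
    for ep in pserver_spec.split(","):
        ip, port = ep.strip().rsplit(":", 1)  # rsplit tolerates ":" in hostnames
        endpoints.append((ip, int(port)))
    return endpoints

eps = parse_endpoints("192.168.0.1:6174,192.168.0.2:6174")
# Note the pserver count is independent of the trainer count:
# e.g. 2 (or 20) pservers can serve 50 trainers.
```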
Should the NCCL2-mode note inside the `trainers` description be moved to line 85 and introduced there instead?
doc/fluid/api/api_guides/index.rst
Outdated
> low_level/metrics.rst
> low_level/model_save_reader.rst
> low_level/inference.rst
> low_level/distributed/index.rst
To avoid merge conflicts from several people editing this file at the same time, please revert the changes to the index file for now.
> .. toctree::
>    :maxdepth: 1
>
>    sync_training.rst
> (no newline at end of file)
To avoid merge conflicts from several people editing this file at the same time, please revert the changes to the index file for now.
> - :code:`pservers` : the list of IP:port endpoints of the pserver nodes in the current training job
> - :code:`trainers` : the number of trainer nodes in the current training job (note that in NCCL2 mode this argument is a string that lists the trainer nodes' IP:port endpoints).
>   Note that in pserver mode the trainer count may differ from the pserver count, e.g. 20 pservers and 50 trainers. In a real training job,
>   you can tune the numbers of pserver and trainer nodes to find the best performance.
There is an extra space between "任务" and "中" ("…训练任务 中…").
> but for scenarios that need a large number of nodes, it helps raise the degree of computational parallelism on the pserver side
> - :code:`split_method` : how the transpiler assigns parameters (or slices of parameters) to the pservers;
>   the default is "RoundRobin", and "HashName" can also be used
> - :code:`min_block_size` : when parameter slicing is configured, the minimum slice size of a Tensor, which keeps RPC request packets from being too small; the default is 8192, and in general
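The two placement policies and the `min_block_size` floor can be sketched in plain Python. This is a sketch of the ideas only: the real transpiler's hash function and slicing arithmetic may differ, and `md5` here is a stand-in.

```python
import hashlib

def round_robin(params, pservers):
    # Default "RoundRobin" policy: hand parameters to pservers in turn.
    return {p: pservers[i % len(pservers)] for i, p in enumerate(params)}

def hash_name(params, pservers):
    # "HashName" policy: place each parameter by a hash of its name.
    # The real transpiler's hash function may differ; md5 is a stand-in.
    return {p: pservers[int(hashlib.md5(p.encode()).hexdigest(), 16) % len(pservers)]
            for p in params}

def num_slices(numel, pserver_count, min_block_size=8192):
    # With slicing on, a tensor of `numel` elements is cut into at most
    # `pserver_count` slices, but no slice drops below `min_block_size`
    # elements, so RPC request packets stay reasonably large.
    return max(1, min(pserver_count, numel // min_block_size))

pservers = ["192.168.0.1:6174", "192.168.0.2:6174"]
placement = round_robin(["fc_0.w", "fc_0.b", "fc_1.w"], pservers)
```

Hash-based placement is stable when the parameter set changes, while round-robin balances counts exactly; which matters more depends on the workload.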
There is an extra space between "况" and "不".
|
The review is finished, please take a look. Thanks!
luotao1
left a comment
LGTM
> - :code:`trainer_id` : the id of the trainer node, from 0 to n-1, where n is the number of trainer nodes in the current training job
> - :code:`program` : the :code:`program` to be transpiled; defaults to :code:`fluid.default_main_program()`
> - :code:`pservers` : the list of IP:port endpoints of the pserver nodes in the current training job
> - :code:`trainers` : an int, the number of trainer nodes in the current training job (in NCCL2 mode this argument is a string that lists the trainer nodes' IP:port endpoints). Note that in pserver mode the trainer count may differ from the pserver count, e.g. 20 pservers and 50 trainers. In a real training job, you can tune the numbers of pserver and trainer nodes to find the best performance.
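Putting the arguments above together, a rough end-to-end sketch of how the transpiler is typically driven in the legacy Fluid releases might look like the following. This is not runnable without a legacy PaddlePaddle install; the class name (those releases spell it :code:`DistributeTranspiler`) and method set vary across versions, and the role and endpoint values are hypothetical placeholders.

```python
import paddle.fluid as fluid

# Hypothetical values; in practice these come from environment
# variables or launch arguments, as the guide describes.
role = "PSERVER"
current_endpoint = "192.168.0.1:6174"
pserver_endpoints = "192.168.0.1:6174,192.168.0.2:6174"
trainer_id, trainer_num = 0, 50   # 2 pservers serving 50 trainers is fine

t = fluid.DistributeTranspiler()
t.transpile(trainer_id=trainer_id,
            pservers=pserver_endpoints,
            trainers=trainer_num)

if role == "PSERVER":
    # Each pserver extracts and runs its own slice of the program.
    pserver_prog = t.get_pserver_program(current_endpoint)
    startup_prog = t.get_startup_program(current_endpoint, pserver_prog)
    exe = fluid.Executor(fluid.CPUPlace())
    exe.run(startup_prog)
    exe.run(pserver_prog)   # blocks, serving parameters to trainers
else:
    # Trainers run the transpiled trainer program in the usual loop.
    trainer_prog = t.get_trainer_program()
```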
fixed
tink2123
left a comment
LGTM
Fix #292