add distributed sync training api guide #372
Conversation
> ############
>
> Fluid supports data-parallel distributed synchronous training. The API uses :code:`DistributedTranspiler` to convert a single-machine network configuration into a :code:`pserver`-side program and a :code:`trainer`-side program that can run on multiple machines, users run the same code on different nodes and, depending on environment variables or launch arguments,
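Since every node runs the same script and picks its role from the launch environment, the role-selection step can be sketched in plain Python. The environment-variable names below (`TRAINING_ROLE`, `PADDLE_PSERVERS`, `PADDLE_TRAINER_ID`) are illustrative assumptions, not mandated by the guide; actual deployments define their own convention.

```python
import os

def detect_role(env):
    """Decide what this process should do from launch-time environment
    variables. The variable names are illustrative placeholders."""
    role = env.get("TRAINING_ROLE", "TRAINER").upper()
    pserver_spec = env.get("PADDLE_PSERVERS", "")
    pservers = pserver_spec.split(",") if pserver_spec else []
    trainer_id = int(env.get("PADDLE_TRAINER_ID", "0"))
    return {"role": role, "pservers": pservers, "trainer_id": trainer_id}

# Every node runs this same script; only the environment differs.
cfg = detect_role({"TRAINING_ROLE": "PSERVER",
                   "PADDLE_PSERVERS": "192.168.0.1:6174,192.168.0.2:6174",
                   "PADDLE_TRAINER_ID": "0"})
```

In a real job, `detect_role(os.environ)` would be called once at startup and the result used to pick the pserver or trainer code path.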
The comma before "用户在不同" ("users run the same code on different nodes") should be a period.
> - :code:`pservers` : the list of IP:port endpoints of the pserver nodes in the current training job
> - :code:`trainers` : the number of trainer nodes in the current training job (note that in NCCL2 mode this argument is a string that lists the trainer nodes' IP:port endpoints).
>   Note that in pserver mode the trainer count may differ from the pserver count, e.g. 20 pservers and 50 trainers. In a real training job,
>   you can tune the numbers of pserver and trainer nodes to find the best performance.
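The guide describes :code:`pservers` as a list of IP:port endpoints; in practice such endpoint lists are often passed around as one comma-separated string. A minimal parsing sketch (the string format is an assumption based on the guide's description, not a documented requirement):

```python
def parse_endpoints(pserver_spec):
    """Split a comma-separated "ip:port" string into (ip, port) pairs."""
    endpoints = []
    for ep in pserver_spec.split(","):
        ip, port = ep.strip().rsplit(":", 1)  # rsplit tolerates ":" in hostnames
        endpoints.append((ip, int(port)))
    return endpoints

eps = parse_endpoints("192.168.0.1:6174,192.168.0.2:6174")
# Note the pserver count is independent of the trainer count:
# e.g. 2 (or 20) pservers can serve 50 trainers.
```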
Should the NCCL2-mode note inside the `trainers` description be moved to line 85 and introduced there instead?
doc/fluid/api/api_guides/index.rst
Outdated
> low_level/metrics.rst
> low_level/model_save_reader.rst
> low_level/inference.rst
> low_level/distributed/index.rst
To avoid merge conflicts from several people editing this file at the same time, please revert the changes to the index file for now.
> .. toctree::
>    :maxdepth: 1
>
>    sync_training.rst
> (no newline at end of file)
To avoid merge conflicts from several people editing this file at the same time, please revert the changes to the index file for now.
> - :code:`pservers` : the list of IP:port endpoints of the pserver nodes in the current training job
> - :code:`trainers` : the number of trainer nodes in the current training job (note that in NCCL2 mode this argument is a string that lists the trainer nodes' IP:port endpoints).
>   Note that in pserver mode the trainer count may differ from the pserver count, e.g. 20 pservers and 50 trainers. In a real training job,
>   you can tune the numbers of pserver and trainer nodes to find the best performance.
There is an extra space between "任务" and "中" ("…训练任务 中…").
> but for scenarios that need a large number of nodes, it helps raise the degree of computational parallelism on the pserver side
> - :code:`split_method` : how the transpiler assigns parameters (or slices of parameters) to the pservers;
>   the default is "RoundRobin", and "HashName" can also be used
> - :code:`min_block_size` : when parameter slicing is configured, the minimum slice size of a Tensor, which keeps RPC request packets from being too small; the default is 8192, and in general
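The two placement policies and the `min_block_size` floor can be sketched in plain Python. This is a sketch of the ideas only: the real transpiler's hash function and slicing arithmetic may differ, and `md5` here is a stand-in.

```python
import hashlib

def round_robin(params, pservers):
    # Default "RoundRobin" policy: hand parameters to pservers in turn.
    return {p: pservers[i % len(pservers)] for i, p in enumerate(params)}

def hash_name(params, pservers):
    # "HashName" policy: place each parameter by a hash of its name.
    # The real transpiler's hash function may differ; md5 is a stand-in.
    return {p: pservers[int(hashlib.md5(p.encode()).hexdigest(), 16) % len(pservers)]
            for p in params}

def num_slices(numel, pserver_count, min_block_size=8192):
    # With slicing on, a tensor of `numel` elements is cut into at most
    # `pserver_count` slices, but no slice drops below `min_block_size`
    # elements, so RPC request packets stay reasonably large.
    return max(1, min(pserver_count, numel // min_block_size))

pservers = ["192.168.0.1:6174", "192.168.0.2:6174"]
placement = round_robin(["fc_0.w", "fc_0.b", "fc_1.w"], pservers)
```

Hash-based placement is stable when the parameter set changes, while round-robin balances counts exactly; which matters more depends on the workload.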
There is an extra space between "况" and "不".
|
The review is finished, please take a look. Thanks!
luotao1
left a comment
LGTM
> - :code:`trainer_id` : the id of the trainer node, from 0 to n-1, where n is the number of trainer nodes in the current training job
> - :code:`program` : the :code:`program` to be transpiled; defaults to :code:`fluid.default_main_program()`
> - :code:`pservers` : the list of IP:port endpoints of the pserver nodes in the current training job
> - :code:`trainers` : an int, the number of trainer nodes in the current training job (in NCCL2 mode this argument is a string that lists the trainer nodes' IP:port endpoints). Note that in pserver mode the trainer count may differ from the pserver count, e.g. 20 pservers and 50 trainers. In a real training job, you can tune the numbers of pserver and trainer nodes to find the best performance.
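Putting the arguments above together, a rough end-to-end sketch of how the transpiler is typically driven in the legacy Fluid releases might look like the following. This is not runnable without a legacy PaddlePaddle install; the class name (those releases spell it :code:`DistributeTranspiler`) and method set vary across versions, and the role and endpoint values are hypothetical placeholders.

```python
import paddle.fluid as fluid

# Hypothetical values; in practice these come from environment
# variables or launch arguments, as the guide describes.
role = "PSERVER"
current_endpoint = "192.168.0.1:6174"
pserver_endpoints = "192.168.0.1:6174,192.168.0.2:6174"
trainer_id, trainer_num = 0, 50   # 2 pservers serving 50 trainers is fine

t = fluid.DistributeTranspiler()
t.transpile(trainer_id=trainer_id,
            pservers=pserver_endpoints,
            trainers=trainer_num)

if role == "PSERVER":
    # Each pserver extracts and runs its own slice of the program.
    pserver_prog = t.get_pserver_program(current_endpoint)
    startup_prog = t.get_startup_program(current_endpoint, pserver_prog)
    exe = fluid.Executor(fluid.CPUPlace())
    exe.run(startup_prog)
    exe.run(pserver_prog)   # blocks, serving parameters to trainers
else:
    # Trainers run the transpiled trainer program in the usual loop.
    trainer_prog = t.get_trainer_program()
```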
fixed
tink2123
left a comment
LGTM
Fix #292