Part 4 of cross-device sends/recvs support: Added initial support for SIL accelerator function partitioning based on TF devices {CPU, GPU}, including control flow. #17165
Merged
Conversation
… SIL accelerator function partitioning based on TF devices {CPU, GPU}, including control flow. (Full summary of changes below.)
@swift-ci please test tensorflow
rxwei approved these changes on Jun 13, 2018
@swift-ci please clean test tensorflow
Confirmed that CI is working, and also passed local tests, so I'm merging.
Great!
marcrasi pushed a commit that referenced this pull request on Jun 14, 2018
… SIL (#17165)
marcrasi pushed a commit to google/swift that referenced this pull request on Jun 15, 2018
… SIL (swiftlang#17165)
marcrasi pushed a commit that referenced this pull request on Jun 22, 2018
… SIL (#17165)
marcrasi pushed a commit that referenced this pull request on Jun 28, 2018
… SIL (#17165)
Summary of changes:
1. Extended DeviceType with an ALL enum value, indicating that the associated
instruction runs on all devices involved in the TF computation. For example,
promoted scalars run on ALL devices.
Also, for ease of control-flow handling, BB args are present on ALL devices. The
exception is the function input arguments, which are present only in the primary
device function (recall that the primary function is the partitioned function
that runs on the target device chosen by TensorFlow.enableGPU(),
TensorFlow.enableTPU(), or a default policy); the helper functions take no input
or output tensors.
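To make the placement rule concrete, here is a minimal Swift sketch of the semantics (the real DeviceType is a compiler-internal type, not user-visible API; the names below are illustrative):

```swift
// Hypothetical sketch of the device-placement semantics in item 1.
enum DeviceType {
  case cpu, gpu, tpu
  case all  // the instruction runs on every device involved in the computation
}

// A value produced on ALL (e.g. a promoted scalar or a BB arg) is visible on
// any concrete device; otherwise the producer's device must match the consumer's.
func isAvailable(producer: DeviceType, on consumer: DeviceType) -> Bool {
  return producer == .all || producer == consumer
}
```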
2. Added a new pass, DevicePartitioner, that sits between the PartitionerCloner
pass in TFPartition and the TFGraphLowering pass in TFLowerGraph. It has two
phases, described as follows.
In the analysis/mark phase, it inserts instructions for cross-device tensor
sends/recvs, represented by "__tfop_tfc.TensorTransfer" builtins. For example,
when tensor x is produced on device D1 and is then consumed by tensor op foo()
on device D2, it inserts a "__tfop_tfc.TensorTransfer" builtin right before
foo() to send that tensor from D1 to D2.
This builtin helps maintain the invariant that for any instruction I running on
some device D, every operand OP of I must be present on D (either because OP is
produced on D, or because it is transferred there via this builtin).
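A minimal sketch of the mark-phase rule, building on the DeviceType sketch above (the types and helper names here are hypothetical, not the compiler's actual implementation):

```swift
// Hypothetical model: for an instruction on device `d`, compute the
// transfers needed so that every operand becomes present on `d`,
// restoring the invariant described above.
struct TensorValue {
  var id: Int
  var device: DeviceType
}

struct TensorTransfer {
  var value: Int        // id of the tensor being moved
  var from: DeviceType  // source/send device
  var to: DeviceType    // dest/recv device
}

func transfersNeeded(operands: [TensorValue], on d: DeviceType) -> [TensorTransfer] {
  return operands
    .filter { !isAvailable(producer: $0.device, on: d) }
    .map { TensorTransfer(value: $0.id, from: $0.device, to: d) }
}
```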
When the tf-dump-graph flag is on, the output SIL of this phase is dumped under a header like:
--- TFDevicePartition Cross Device Tensor Transfer Annotation Result: $S3tmp10testScalar1fySf_tF.tf
In the partitioning phase (DevicePartitionCloner), it extracts all instructions
related to a given target device D into a new SIL function, to be lowered by
TFGraphLowering. For a "__tfop_tfc.TensorTransfer" builtin:
- If D is its source/send device, it gets lowered to a TF _Send op in the
CPU/GPU device context, via a "__tfop_tfc.D2DTensorSend" builtin.
- If D is its dest/recv device, it gets lowered to a TF _Recv op in the
CPU/GPU device context, via a "__tfop_tfc.D2DTensorRecv" builtin.
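A sketch of how these two cases could be modeled, continuing the hypothetical types above:

```swift
// Hypothetical: when extracting the function for device `d`, a
// TensorTransfer becomes a send on its source device, a recv on its
// dest device, and disappears everywhere else.
enum LoweredTransfer {
  case d2dTensorSend(TensorTransfer)  // lowered to a TF _Send op
  case d2dTensorRecv(TensorTransfer)  // lowered to a TF _Recv op
  case none                           // `d` is neither endpoint
}

func lower(_ t: TensorTransfer, for d: DeviceType) -> LoweredTransfer {
  if d == t.from { return .d2dTensorSend(t) }
  if d == t.to { return .d2dTensorRecv(t) }
  return .none
}
```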
For control flow support, each partitioned, device-specific SIL function
produced by DevicePartitionCloner retains all basic blocks from the input
accelerator SIL function, along with the BB args.
When the tf-dump-graph flag is on, the output of this phase is dumped under a header like:
--- TFDevicePartition Per-Device Function Extraction Result: $S3tmp10testScalar1fySf_tF.tf_CPU.device_partition
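Continuing the same hypothetical model, per-device extraction keeps the full CFG and BB args and filters only the instructions:

```swift
// Hypothetical: every device function keeps all basic blocks and their
// args; instructions are filtered down to device `d`, plus anything
// marked ALL.
struct Inst {
  var device: DeviceType
}

struct BasicBlock {
  var args: [TensorValue]
  var insts: [Inst]
}

func extract(_ blocks: [BasicBlock], for d: DeviceType) -> [BasicBlock] {
  return blocks.map { bb in
    BasicBlock(args: bb.args,
               insts: bb.insts.filter { $0.device == d || $0.device == .all })
  }
}
```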
3. Extended the TFGraphLowering pass to turn D2DTensorSend/D2DTensorRecv into TF
_Send and _Recv nodes. These nodes work on CPU and GPU.
In the TPU device context, the above can be lowered to infeed/outfeed or
HostCompute. This is to be explored later.
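For intuition, a _Send and its matching _Recv pair up through a rendezvous key. A conceptual Swift sketch follows; the field names are illustrative, though TF's actual _Send/_Recv node attributes do include tensor_name, send_device, and recv_device:

```swift
// Conceptual only: a send/recv pair is matched by a key carrying the
// transfer id and the two device endpoints.
struct RendezvousKey: Hashable {
  var tensorName: String  // e.g. "tensor_transfer_7" (illustrative naming)
  var sendDevice: String  // e.g. "/device:GPU:0"
  var recvDevice: String  // e.g. "/device:CPU:0"
}

func key(forTransferId id: Int, from sendDevice: String, to recvDevice: String) -> RendezvousKey {
  return RendezvousKey(tensorName: "tensor_transfer_\(id)",
                       sendDevice: sendDevice,
                       recvDevice: recvDevice)
}
```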
"tfc.SendToHost" and "tfc.RecvFromHost" builtins, with proper tfop attributes to
represent the tensor transfer id and send/recv devices.
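As a rough illustration of the kind of user code that exercises these host builtins (a hypothetical example, not a test from this PR; the exact Tensor API calls are assumptions):

```swift
// Hypothetical Swift for TensorFlow example: observing a tensor value on
// the host partway through an accelerator computation is the situation
// the tfc.SendToHost / tfc.RecvFromHost builtins represent, each tagged
// with a transfer id and the devices involved.
import TensorFlow

func hostObservation() {
  var x = Tensor<Float>(1.0)  // promoted scalar: present on ALL devices
  x += 2.0                    // runs on the accelerator
  print(x)                    // value is sent to the host (SendToHost)
  x += 3.0                    // accelerator computation continues
}
```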