Part 4 of cross-device sends/recvs support: Added initial support for SIL accelerator function partitioning based on TF devices {CPU, GPU}, including control flow. #17165
Merged
Conversation
… SIL accelerator function partitioning based on TF devices {CPU, GPU}, including control flow. (Full summary of changes below.)
@swift-ci please test tensorflow
rxwei approved these changes on Jun 13, 2018
@swift-ci please clean test tensorflow
Confirmed that CI is working, and also passed local tests, so I'm merging.
Great!
marcrasi pushed a commit that referenced this pull request on Jun 14, 2018
… SIL (#17165)
marcrasi pushed a commit to google/swift that referenced this pull request on Jun 15, 2018
… SIL (swiftlang#17165)
marcrasi pushed a commit that referenced this pull request on Jun 22, 2018
… SIL (#17165)
marcrasi pushed a commit that referenced this pull request on Jun 28, 2018
… SIL (#17165)
Summary of changes:
1. Extended DeviceType with an ALL enum value, indicating that the associated
instruction runs on all devices involved in the TF computation. For example,
promoted scalars run on ALL devices.
Also, for ease of control-flow handling, BB args are present on ALL devices. The
exception is the function input arguments, which are present only in the primary
device function (recall that the primary function is the partitioned function
that runs on the target device chosen by TensorFlow.enableGPU(),
TensorFlow.enableTPU(), or a default policy); the helper functions take no input
or output tensors.
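To make the placement rule concrete, here is a minimal Swift sketch of the semantics (the real DeviceType is a compiler-internal type, not user-visible API; the names below are illustrative):

```swift
// Hypothetical sketch of the device-placement semantics in item 1.
enum DeviceType {
  case cpu, gpu, tpu
  case all  // the instruction runs on every device involved in the computation
}

// A value produced on ALL (e.g. a promoted scalar or a BB arg) is visible on
// any concrete device; otherwise the producer's device must match the consumer's.
func isAvailable(producer: DeviceType, on consumer: DeviceType) -> Bool {
  return producer == .all || producer == consumer
}
```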
2. Added a new pass, DevicePartitioner, that sits between the PartitionerCloner
pass in TFPartition and the TFGraphLowering pass in TFLowerGraph. It has two
phases, described as follows.
In the analysis/mark phase, it inserts instructions for cross-device tensor
sends/recvs, represented by "__tfop_tfc.TensorTransfer" builtins. For example,
when tensor x is produced on device D1 and is then consumed by tensor op foo()
on device D2, it inserts a "__tfop_tfc.TensorTransfer" builtin right before
foo() to send that tensor from D1 to D2.
This builtin helps maintain the invariant that for any instruction I running on
some device D, every operand OP of I must be present on D (either because OP is
produced on D, or because it is transferred there via this builtin).
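A minimal sketch of the mark-phase rule, building on the DeviceType sketch above (the types and helper names here are hypothetical, not the compiler's actual implementation):

```swift
// Hypothetical model: for an instruction on device `d`, compute the
// transfers needed so that every operand becomes present on `d`,
// restoring the invariant described above.
struct TensorValue {
  var id: Int
  var device: DeviceType
}

struct TensorTransfer {
  var value: Int        // id of the tensor being moved
  var from: DeviceType  // source/send device
  var to: DeviceType    // dest/recv device
}

func transfersNeeded(operands: [TensorValue], on d: DeviceType) -> [TensorTransfer] {
  return operands
    .filter { !isAvailable(producer: $0.device, on: d) }
    .map { TensorTransfer(value: $0.id, from: $0.device, to: d) }
}
```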
When the tf-dump-graph flag is on, the output SIL of this phase is dumped under a header like:
--- TFDevicePartition Cross Device Tensor Transfer Annotation Result: $S3tmp10testScalar1fySf_tF.tf
In the partitioning phase (DevicePartitionCloner), it extracts all instructions
related to a given target device D into a new SIL function, to be lowered by
TFGraphLowering. For a "__tfop_tfc.TensorTransfer" builtin:
- If D is its source/send device, it gets lowered to a TF _Send op in the
CPU/GPU device context, via a "__tfop_tfc.D2DTensorSend" builtin.
- If D is its dest/recv device, it gets lowered to a TF _Recv op in the
CPU/GPU device context, via a "__tfop_tfc.D2DTensorRecv" builtin.
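A sketch of how these two cases could be modeled, continuing the hypothetical types above:

```swift
// Hypothetical: when extracting the function for device `d`, a
// TensorTransfer becomes a send on its source device, a recv on its
// dest device, and disappears everywhere else.
enum LoweredTransfer {
  case d2dTensorSend(TensorTransfer)  // lowered to a TF _Send op
  case d2dTensorRecv(TensorTransfer)  // lowered to a TF _Recv op
  case none                           // `d` is neither endpoint
}

func lower(_ t: TensorTransfer, for d: DeviceType) -> LoweredTransfer {
  if d == t.from { return .d2dTensorSend(t) }
  if d == t.to { return .d2dTensorRecv(t) }
  return .none
}
```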
For control flow support, each partitioned, device-specific SIL function
produced by DevicePartitionCloner retains all basic blocks from the input
accelerator SIL function, along with the BB args.
When the tf-dump-graph flag is on, the output of this phase is dumped under a header like:
--- TFDevicePartition Per-Device Function Extraction Result: $S3tmp10testScalar1fySf_tF.tf_CPU.device_partition
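Continuing the same hypothetical model, per-device extraction keeps the full CFG and BB args and filters only the instructions:

```swift
// Hypothetical: every device function keeps all basic blocks and their
// args; instructions are filtered down to device `d`, plus anything
// marked ALL.
struct Inst {
  var device: DeviceType
}

struct BasicBlock {
  var args: [TensorValue]
  var insts: [Inst]
}

func extract(_ blocks: [BasicBlock], for d: DeviceType) -> [BasicBlock] {
  return blocks.map { bb in
    BasicBlock(args: bb.args,
               insts: bb.insts.filter { $0.device == d || $0.device == .all })
  }
}
```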
3. Extended the TFGraphLowering pass to turn D2DTensorSend/D2DTensorRecv into TF
_Send and _Recv nodes. These nodes work on CPU and GPU.
In the TPU device context, the above can be lowered to infeed/outfeed or
HostCompute. This is to be explored later.
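For intuition, a _Send and its matching _Recv pair up through a rendezvous key. A conceptual Swift sketch follows; the field names are illustrative, though TF's actual _Send/_Recv node attributes do include tensor_name, send_device, and recv_device:

```swift
// Conceptual only: a send/recv pair is matched by a key carrying the
// transfer id and the two device endpoints.
struct RendezvousKey: Hashable {
  var tensorName: String  // e.g. "tensor_transfer_7" (illustrative naming)
  var sendDevice: String  // e.g. "/device:GPU:0"
  var recvDevice: String  // e.g. "/device:CPU:0"
}

func key(forTransferId id: Int, from sendDevice: String, to recvDevice: String) -> RendezvousKey {
  return RendezvousKey(tensorName: "tensor_transfer_\(id)",
                       sendDevice: sendDevice,
                       recvDevice: recvDevice)
}
```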
"tfc.SendToHost" and "tfc.RecvFromHost" builtins, with proper tfop attributes to
represent the tensor transfer id and send/recv devices.
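As a rough illustration of the kind of user code that exercises these host builtins (a hypothetical example, not a test from this PR; the exact Tensor API calls are assumptions):

```swift
// Hypothetical Swift for TensorFlow example: observing a tensor value on
// the host partway through an accelerator computation is the situation
// the tfc.SendToHost / tfc.RecvFromHost builtins represent, each tagged
// with a transfer id and the devices involved.
import TensorFlow

func hostObservation() {
  var x = Tensor<Float>(1.0)  // promoted scalar: present on ALL devices
  x += 2.0                    // runs on the accelerator
  print(x)                    // value is sent to the host (SendToHost)
  x += 3.0                    // accelerator computation continues
}
```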