
Part 4 of cross-device sends/recvs support: Added initial support for SIL accelerator function partitioning based on TF devices {CPU, GPU}, including control flow. #17165


Merged: mhong merged 1 commit into swiftlang:tensorflow on Jun 14, 2018

Conversation

@mhong commented on Jun 13, 2018

Summary of changes:

  1. Extended DeviceType with an ALL enum value, indicating that the associated
    instruction runs on all devices involved in the TF computation. For example,
    promoted scalars run on ALL devices (see the sketch after this item).

Also, for ease of control flow handling, BB args are present on ALL devices. The
exception is the function input arguments, which are only present in the primary
device function (recall the primary function is the partitioned function that
runs on a target device given by TensorFlow.enableGPU(), TensorFlow.enableTPU()
or a default policy), while the helper functions do not take input or output
tensors.
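
To make the shape of this change concrete, here is a minimal sketch of what the extended enum might look like, assuming C++ compiler internals; the actual declaration and member names in the compiler may differ:

```cpp
// Hypothetical sketch only; the real DeviceType declaration in the
// compiler's TensorFlow support code may differ in naming and members.
enum class DeviceType {
  CPU,
  GPU,
  TPU,
  // The associated instruction runs on every device involved in the TF
  // computation, e.g. promoted scalars and (for control flow) BB args.
  ALL,
};
```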

  2. Added a new pass, DevicePartitioner, that sits between the PartitionerCloner
    pass in TFPartition and the TFGraphLowering pass in TFLowerGraph. It has two
    phases, described below.

In the analysis/mark phase, it inserts instructions for cross-device tensor
sends/recvs, represented by "__tfop_tfc.TensorTransfer" builtins. For example,
when tensor x is produced on device D1 and then consumed by tensor op foo()
on device D2, the pass inserts, right before foo(), a "__tfop_tfc.TensorTransfer"
builtin to send that tensor from D1 to D2.

This builtin helps maintain the invariant that for any instruction I running on
some device D, every operand OP of I must be present on D (either because OP is
produced on D, or because it is transferred there via this builtin).
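
A rough sketch of how the mark phase might establish this invariant (the helpers getDevice and createTensorTransfer are hypothetical, for illustration only):

```cpp
// Hypothetical sketch of the analysis/mark phase: for each instruction,
// ensure every operand is available on the instruction's device, and
// insert a "__tfop_tfc.TensorTransfer" builtin where it is not.
void markTensorTransfers(SILFunction &fn) {
  for (SILBasicBlock &bb : fn) {
    for (SILInstruction &inst : bb) {
      DeviceType instDev = getDevice(&inst);    // hypothetical helper
      for (Operand &op : inst.getAllOperands()) {
        DeviceType opDev = getDevice(op.get()); // hypothetical helper
        if (opDev == instDev || opDev == DeviceType::ALL)
          continue;
        // Insert the transfer right before the consuming instruction,
        // sending the operand from opDev to instDev, and rewire the use.
        SILValue transferred = createTensorTransfer(
            /*insertBefore*/ &inst, op.get(), opDev, instDev);
        op.set(transferred);
      }
    }
  }
}
```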

When the tf-dump-graph flag is on, the output SIL of this phase is dumped under a header like:

--- TFDevicePartition Cross Device Tensor Transfer Annotation Result: $S3tmp10testScalar1fySf_tF.tf

In the partitioning phase (DevicePartitionCloner), it extracts all instructions
related to a given target device D into a new SIL function, to be lowered by
TFGraphLowering. For a "__tfop_tfc.TensorTransfer" builtin:

  • If D is its source/send device, it gets lowered to a TF _Send op in the
    CPU/GPU device context, via a "__tfop_tfc.D2DTensorSend" builtin.
  • If D is its dest/recv device, it gets lowered to a TF _Recv op in the CPU/GPU
    device context, via a "__tfop_tfc.D2DTensorRecv" builtin.
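
As a sketch, the per-device handling of a TensorTransfer builtin might look like the following (getSourceDevice, getDestDevice, and cloneAsBuiltin are hypothetical helper names):

```cpp
// Hypothetical sketch of how DevicePartitionCloner could rewrite a
// "__tfop_tfc.TensorTransfer" builtin when cloning for target device D.
void visitTensorTransfer(BuiltinInst *transfer, DeviceType D) {
  DeviceType srcDev = getSourceDevice(transfer);  // hypothetical helper
  DeviceType destDev = getDestDevice(transfer);   // hypothetical helper
  if (D == srcDev) {
    // This device produces the tensor: emit a D2DTensorSend builtin,
    // which TFGraphLowering later lowers to a TF _Send op.
    cloneAsBuiltin(transfer, "__tfop_tfc.D2DTensorSend");
  } else if (D == destDev) {
    // This device consumes the tensor: emit a D2DTensorRecv builtin,
    // which TFGraphLowering later lowers to a TF _Recv op.
    cloneAsBuiltin(transfer, "__tfop_tfc.D2DTensorRecv");
  }
  // A device not involved in this transfer clones nothing for it.
}
```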

For control flow support, each partitioned, device-specific SIL function
produced by DevicePartitionCloner retains all basic blocks from the input
accelerator SIL function, along with the BB args.

When the tf-dump-graph flag is on, the output of this phase is dumped under a header like:

--- TFDevicePartition Per-Device Function Extraction Result: $S3tmp10testScalar1fySf_tF.tf_CPU.device_partition

  3. Extended the TFGraphLowering pass to turn D2DTensorSend/D2DTensorRecv into TF
    _Send and _Recv nodes. These nodes work on CPU and GPU.

In the TPU device context, the above can be lowered to infeed/outfeed or
HostCompute. This is to be explored later.
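
For reference, creating a TF _Send node through the TensorFlow C API looks roughly like the sketch below; this illustrates the general mechanism rather than the compiler's actual lowering code (error handling elided, and the op name and incarnation value are placeholders):

```cpp
#include <cstring>
#include "tensorflow/c/c_api.h"

// Sketch: build a TF _Send node that ships `tensor` from sendDevice to
// recvDevice. The matching _Recv node must use the same tensor_name.
TF_Operation *emitSend(TF_Graph *graph, TF_Output tensor,
                       const char *tensorName, const char *sendDevice,
                       const char *recvDevice, TF_Status *status) {
  TF_OperationDescription *desc =
      TF_NewOperation(graph, "_Send", /*oper_name*/ "tfc_send");
  TF_AddInput(desc, tensor);
  TF_SetDevice(desc, sendDevice);
  TF_SetAttrString(desc, "tensor_name", tensorName, strlen(tensorName));
  TF_SetAttrString(desc, "send_device", sendDevice, strlen(sendDevice));
  TF_SetAttrInt(desc, "send_device_incarnation", 1);
  TF_SetAttrString(desc, "recv_device", recvDevice, strlen(recvDevice));
  return TF_FinishOperation(desc, status);
}
```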

  4. Also replaced the "tensorflowSend" and "tensorflowReceive" builtins with
    "tfc.SendToHost" and "tfc.RecvFromHost" builtins, which carry proper tfop
    attributes representing the tensor transfer id and the send/recv devices.

@rxwei (Contributor) commented on Jun 13, 2018

@swift-ci please test tensorflow

@rxwei added the tensorflow label (This is for "tensorflow" branch PRs.) on Jun 13, 2018
@rxwei (Contributor) commented on Jun 13, 2018

@swift-ci please clean test tensorflow

@mhong (Author) commented on Jun 13, 2018

Confirmed that CI is working, and also passed local tests, so I'm merging.

@rxwei (Contributor) commented on Jun 13, 2018

Great!

@mhong merged commit e6a4921 into swiftlang:tensorflow on Jun 14, 2018
@mhong deleted the mhong_gpu_sends_recvs branch on Jun 14, 2018
marcrasi pushed a commit that referenced this pull request on Jun 14, 2018
marcrasi pushed a commit to google/swift that referenced this pull request on Jun 15, 2018
marcrasi pushed a commit that referenced this pull request on Jun 22, 2018
marcrasi pushed a commit that referenced this pull request on Jun 28, 2018