# GRPC Fail Fast By Default

| Status        | Accepted                                                |
| :------------ | :------------------------------------------------------ |
| **RFC #**     | [355](https://github.com/tensorflow/community/pull/355) |
| **Author(s)** | Haoyu Zhang ([email protected])                        |
| **Sponsor**   | Bramandia Ramadhana ([email protected])           |
| **Updated**   | 2021-03-04                                              |

## Objective

We propose to set the default value of the `GRPC_FAIL_FAST` environment variable
to `use_caller`. This change prevents TensorFlow distributed jobs from hanging
indefinitely due to task failures, and allows users and TF libraries (e.g.,
distribution strategies) to handle connection errors for better failure and
preemption recovery.

## Background

`GRPC_FAIL_FAST` is a TensorFlow distributed runtime environment variable that
controls the behavior of RPC requests when a network disconnection from a
remote server is observed. It can be configured with the following values:

* `true`, which immediately reports an `UnavailableError` for all RPCs when
    there is a connection issue, regardless of the per-RPC configurations;
* `false`, which blocks and waits until successfully connected to the remote
    server (see
    [gRPC `wait_for_ready`](https://github.com/grpc/grpc/blob/master/doc/wait-for-ready.md)),
    regardless of the per-RPC configurations;
* `use_caller`, which allows customization on a per-RPC basis; in the current
    implementation, `true` is used for RPCs in distributed execution (such as
    `RecvTensor` and `RunComponentFunction`), and `false` is used for RPCs that
    initialize remote execution environments (e.g., `GetStatus`).
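
As an illustrative sketch only (not the actual C++ runtime implementation), the
three settings above can be expressed as a small resolver that decides whether
a given RPC should fail fast; `resolve_fail_fast` and its `per_rpc_default`
parameter are hypothetical names introduced for this example:

```python
import os


def resolve_fail_fast(per_rpc_default: bool) -> bool:
    """Resolve the effective fail-fast behavior for one RPC.

    Illustrative sketch of the GRPC_FAIL_FAST semantics described above,
    not TensorFlow's actual implementation.
    """
    setting = os.environ.get("GRPC_FAIL_FAST", "use_caller").lower()
    if setting == "true":
        return True   # always report UnavailableError immediately
    if setting == "false":
        return False  # always block via gRPC wait_for_ready
    # "use_caller": honor the per-RPC configuration, e.g. fail fast for
    # RecvTensor/RunComponentFunction but block for GetStatus.
    return per_rpc_default


# With the proposed default ("use_caller"), execution RPCs fail fast
# while initialization RPCs like GetStatus still wait for the server.
os.environ["GRPC_FAIL_FAST"] = "use_caller"
execution_rpc_fails_fast = resolve_fail_fast(per_rpc_default=True)
get_status_fails_fast = resolve_fail_fast(per_rpc_default=False)
```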

The default value of `GRPC_FAIL_FAST` is currently `false`. One consequence is
that users and/or high-level distribution libraries (such as
`ParameterServerStrategy`) need to
[manually configure this environment variable](https://github.com/tensorflow/tensorflow/blob/1178262a2a55fa634a2390291fc633c515e28884/tensorflow/python/distribute/parameter_server_strategy_v2.py#L106)
to receive reasonable exceptions when workers fail or get preempted; otherwise
the cluster hangs and cannot recover from failures.
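
Under the current default, that workaround typically looks like the following
in user code; the variable must be set before the runtime creates its gRPC
channels (i.e., before the cluster is connected):

```python
import os

# Opt in to fail-fast behavior under the current default ("false").
# This must run before TensorFlow's distributed runtime establishes
# any gRPC channels, since the variable is read at channel creation.
os.environ["GRPC_FAIL_FAST"] = "use_caller"
```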

## Proposed Change

We propose to set the default value of `GRPC_FAIL_FAST` to `use_caller`. With
this default, the runtime reports errors quickly to detect remote server
failures during execution, while still allowing the client to start early and
wait for remote servers when establishing initial connections. This should be
the desired behavior for most use cases.

In the context of TensorFlow 2, the default behavior of the following RPCs used
for distributed execution will change from hanging on failures (current
behavior) to immediately reporting failures (after the change):

* `EagerService.CreateContext`
* `EagerService.UpdateContext`
* `EagerService.WaitQueueDone`
* `EagerService.KeepAlive`
* `EagerService.Enqueue`
* `EagerService.RunComponentFunction`
* `WorkerService.RecvTensor`
* `WorkerService.RecvBuf`

The default behavior of the following RPC will not change: it will still hang if
the remote task cannot be reached.

* `WorkerService.GetStatus`

The `GetStatus` RPC is typically the first RPC sent from the client to
initialize a distributed execution environment, in both single-client and
multi-client modes. The underlying implementation uses gRPC's
[`wait_for_ready`](https://github.com/grpc/grpc/blob/master/doc/wait-for-ready.md)
flag, which allows the client to start before the remote servers in the
deployment.
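
The blocking behavior can be pictured as a polling loop: rather than failing
immediately when the channel is down, the call waits until the server becomes
reachable or a deadline expires. The sketch below is a simplified stand-in for
what gRPC implements internally; `call_with_wait_for_ready`, `send_rpc`, and
`is_connected` are hypothetical names, not real gRPC APIs:

```python
import time


def call_with_wait_for_ready(send_rpc, is_connected,
                             timeout_s=5.0, poll_s=0.1):
    """Sketch of wait_for_ready semantics (not gRPC's real mechanism).

    Blocks until `is_connected()` reports the server is reachable, then
    issues the RPC; raises if the server never comes up in time.
    """
    deadline = time.monotonic() + timeout_s
    while not is_connected():
        if time.monotonic() >= deadline:
            raise TimeoutError("server never became ready")
        time.sleep(poll_s)  # server not up yet; keep waiting
    return send_rpc()
```

This is why a client configured with `wait_for_ready` can be launched before
its servers: the first `GetStatus`-style call simply blocks until the cluster
is up instead of erroring out.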

## User Impact

Once this change lands in the codebase, subsequent TensorFlow 2 releases will
have the new default behavior, as will users who build TensorFlow directly from
source at head. TensorFlow 1.x users on stable releases (e.g., TensorFlow 1.15
or earlier) should not be affected by this change.

Most users should see the new default as the expected behavior in distributed
execution. Users can take advantage of the built-in fault tolerance support in
`ParameterServerStrategy` without having to configure the environment variable
themselves. In other setups, exceptions will be raised to the model training
loop code, where users can catch and handle these errors with custom logic
instead of hanging indefinitely.
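
A training loop that handles such failures might look like the following
sketch. The `UnavailableError` class here is a self-contained stand-in for
`tf.errors.UnavailableError` (so the example runs without TensorFlow), and
`train_step`/`run_training` are hypothetical helpers:

```python
class UnavailableError(Exception):
    """Stand-in for tf.errors.UnavailableError, which real TensorFlow
    code would catch instead."""


def train_step(step, fail_at):
    # Hypothetical training step whose worker is preempted at `fail_at`.
    if step == fail_at:
        raise UnavailableError("worker preempted")
    return f"step {step} ok"


def run_training(num_steps, fail_at):
    results = []
    for step in range(num_steps):
        try:
            results.append(train_step(step, fail_at))
        except UnavailableError:
            # Custom recovery logic instead of hanging indefinitely:
            # e.g. wait for the worker to rejoin, restore from a
            # checkpoint, then retry or skip the step.
            results.append(f"step {step} recovered")
    return results
```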

Some users might receive "false alarms" if there are transient connection
errors to the remote servers. We expect this to happen very rarely, since gRPC
(built on top of the HTTP/2 and TCP protocols) already handles packet drops and
transient network flakiness in most cases, and only reports errors when there
are real network or server failures. If this does happen, set
`GRPC_FAIL_FAST=false` to override the default value and revert to the previous
behavior, and please also file an issue to inform the TensorFlow Runtime team.