# GRPC Fail Fast By Default

| Status        | Accepted                                                 |
| :------------ | :------------------------------------------------------- |
| **RFC #**     | [355](https://github.com/tensorflow/community/pull/355)  |
| **Author(s)** | Haoyu Zhang ([email protected])                        |
| **Sponsor**   | Bramandia Ramadhana ([email protected])              |
| **Updated**   | 2021-03-04                                               |

## Objective

We propose to set the default value of the `GRPC_FAIL_FAST` environment variable
to `use_caller`. This change prevents TensorFlow distributed jobs from hanging
indefinitely due to task failures, and allows users and TF libraries (e.g.,
distribution strategies) to handle connection errors for better failure and
preemption recovery.

## Background

`GRPC_FAIL_FAST` is a TensorFlow distributed runtime environment variable that
controls the behavior of RPC requests when a network disconnection from a remote
server is observed. It can be set to the following values:

*   `true`, which immediately reports an `UnavailableError` when there is a
    connection issue, for all RPCs, regardless of the per-RPC configurations;
*   `false`, which blocks and waits until successfully connected to the remote
    server (see
    [gRPC `wait_for_ready`](https://github.com/grpc/grpc/blob/master/doc/wait-for-ready.md)),
    regardless of the per-RPC configurations;
*   `use_caller`, which allows customization on a per-RPC basis; in the current
    implementation, `true` is used for RPCs issued during distributed execution
    (such as `RecvTensor` and `RunComponentFunction`), and `false` is used for
    RPCs that initialize remote execution environments (e.g., `GetStatus`).
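
The dispatch implied by these three values can be sketched as follows. This is
an illustrative stand-in, not TensorFlow's actual implementation; the
`effective_fail_fast` function and the `FAIL_FAST_RPCS` set are hypothetical
names introduced only for this sketch:

```python
import os

# RPCs that prefer fail-fast behavior during distributed execution, per the
# examples in this RFC (illustrative subset, not the authoritative list).
FAIL_FAST_RPCS = {"RecvTensor", "RunComponentFunction", "Enqueue"}

def effective_fail_fast(rpc_name, env=os.environ):
    """Return True if the RPC should report UnavailableError immediately."""
    mode = env.get("GRPC_FAIL_FAST", "use_caller")  # proposed default
    if mode == "true":
        return True   # fail fast for every RPC, ignoring per-RPC preferences
    if mode == "false":
        return False  # always block with wait_for_ready
    # "use_caller": defer to the per-RPC preference.
    return rpc_name in FAIL_FAST_RPCS
```

Under the proposed `use_caller` default, execution RPCs such as `RecvTensor`
fail fast while startup RPCs such as `GetStatus` keep waiting.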

The default value of `GRPC_FAIL_FAST` is currently `false`. One consequence is
that users and/or high-level distribution libraries (such as
`ParameterServerStrategy`) need to
[manually configure this environment variable](https://github.com/tensorflow/tensorflow/blob/1178262a2a55fa634a2390291fc633c515e28884/tensorflow/python/distribute/parameter_server_strategy_v2.py#L106)
to receive reasonable exceptions when workers fail or get preempted; otherwise
the cluster hangs and cannot recover from failures.
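
A library applying such an override typically wants user configuration to take
precedence. A minimal sketch of that pattern, with a hypothetical helper name
(`ensure_fail_fast_default` is not a TensorFlow API):

```python
import os

def ensure_fail_fast_default(env=os.environ):
    # Set a default only if the user has not already chosen a value, so an
    # explicit GRPC_FAIL_FAST setting in the environment always wins.
    env.setdefault("GRPC_FAIL_FAST", "use_caller")
    return env["GRPC_FAIL_FAST"]
```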

## Proposed Change

We propose to set the default value of `GRPC_FAIL_FAST` to `use_caller`. By
doing so, the runtime reports errors quickly to detect remote server failures
during execution, while still allowing the client to start early and wait for
remote servers to establish initial connections. This should be the desired
behavior for most use cases.

In the context of TensorFlow 2, the default behavior of the following RPCs used
for distributed execution will be changed from hanging on failures (current
behavior) to immediately reporting failures (after the change):

*   `EagerService.CreateContext`
*   `EagerService.UpdateContext`
*   `EagerService.WaitQueueDone`
*   `EagerService.KeepAlive`
*   `EagerService.Enqueue`
*   `EagerService.RunComponentFunction`
*   `WorkerService.RecvTensor`
*   `WorkerService.RecvBuf`

The default behavior of the following RPC will not change: it will still hang if
the remote task cannot be reached.

*   `WorkerService.GetStatus`

The `GetStatus` RPC is typically the first RPC sent from the client to
initialize a distributed execution environment, in both the single- and the
multi-client modes. The underlying implementation uses gRPC's
[`wait_for_ready`](https://github.com/grpc/grpc/blob/master/doc/wait-for-ready.md)
flag, which allows the client to start before the remote servers in the
deployment.

## User Impact

When this change is made to the codebase, subsequent TensorFlow 2 releases will
have this new default behavior. TensorFlow 1.x users who use the stable releases
(e.g., TensorFlow 1.15 or earlier) should not be affected by this change. Users
who build TensorFlow directly from source at head, however, will be affected.

Most users should see the new default as the expected behavior in distributed
execution. Users can take advantage of the built-in fault tolerance support in
`ParameterServerStrategy` without having to change their environment variable
configurations. In other setups, exceptions will be raised to the model training
loop code, where users can catch and handle these errors with custom logic
instead of hanging indefinitely.
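
A hedged sketch of that catch-and-recover pattern: here `UnavailableError`
stands in for `tf.errors.UnavailableError`, and `train_step`/`reconnect` are
hypothetical placeholders, not TensorFlow APIs:

```python
class UnavailableError(Exception):
    """Stand-in for tf.errors.UnavailableError (remote task unreachable)."""

def run_training(train_step, reconnect, max_retries=3):
    """Retry a training step on connection errors instead of hanging."""
    retries = 0
    while True:
        try:
            return train_step()
        except UnavailableError:
            if retries >= max_retries:
                raise            # give up after repeated failures
            retries += 1
            reconnect()          # e.g., re-establish the cluster connection
```

With the fail-fast default, the loop receives an error promptly and can decide
to reconnect or abort, rather than blocking forever on a dead worker.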

Certain users might receive "false alarms" if there are transient connection
errors to the remote servers. We expect this to happen very rarely, since gRPC
(built on top of the HTTP/2 and TCP protocols) should already handle packet
drops and network flakiness in most cases, and only report errors when there are
real network or server failures. However, if this does happen, please set
`GRPC_FAIL_FAST=false` to override the default value and revert to the previous
behavior. Please also file an issue to inform the TensorFlow Runtime team.
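
For example, the override can be applied in the shell before launching a job:

```shell
# Revert to the pre-change blocking behavior for everything launched from
# this shell; an explicit setting overrides the proposed default.
export GRPC_FAIL_FAST=false
echo "GRPC_FAIL_FAST=$GRPC_FAIL_FAST"
```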
