This repository was archived by the owner on Jul 10, 2025. It is now read-only.

@haoyuz haoyuz commented Feb 19, 2021

The feedback phase will be open for two weeks, until 2021-03-18.

| Status | Approved |
| :-- | :-- |
| RFC # | 355 |
| Author(s) | Haoyu Zhang ([email protected]) |
| Sponsor | Bramandia Ramadhana ([email protected]) |
| Updated | 2021-03-04 |

Objective

We propose setting the default value of the GRPC_FAIL_FAST environment variable to use_caller. This change prevents TensorFlow distributed jobs from hanging indefinitely when a task fails, and lets users and TF libraries (e.g., distribution strategies) handle connection errors, which improves recovery from failures and preemptions.
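For jobs running on a TensorFlow build that predates this change, the same setting can presumably be applied by hand through the environment. The snippet below is a minimal sketch, not part of the RFC: the helper name `configure_grpc_fail_fast` and its `override` parameter are hypothetical, and it assumes the variable is read when TensorFlow sets up its gRPC channels, so it should be set before that happens.

```python
import os

# GRPC_FAIL_FAST and the value "use_caller" come from the RFC text;
# the helper name and its parameters are hypothetical.
def configure_grpc_fail_fast(value="use_caller", override=False):
    """Set GRPC_FAIL_FAST for this process and return the effective value."""
    # Respect a value the user already exported unless override is requested.
    if override or "GRPC_FAIL_FAST" not in os.environ:
        os.environ["GRPC_FAIL_FAST"] = value
    return os.environ["GRPC_FAIL_FAST"]

# Assumption: call this before TensorFlow creates any gRPC channels,
# e.g. at the very top of the training script.
effective = configure_grpc_fail_fast()
print(f"GRPC_FAIL_FAST={effective}")
```

By default the helper leaves an already exported value untouched, so an operator can still choose a different setting from the shell without editing the script.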

@ematejska ematejska added the RFC: Proposed RFC Design Document label Mar 4, 2021
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Mar 19, 2021
For details: tensorflow/community#355

PiperOrigin-RevId: 363809453
Change-Id: I336219c0ce36bb4e45ded8836cf6c15d306c4db2
@ematejska

This has been approved.

@ematejska ematejska added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Mar 22, 2021
@ematejska ematejska merged commit 3725c74 into tensorflow:master Mar 22, 2021