Conversation

@wypoon (Contributor) commented Jul 22, 2020

What changes were proposed in this pull request?

If an executor is lost, the DAGScheduler handles the executor loss by removing the executor, but it does not unregister the executor's outputs if the external shuffle service is in use. However, if the node on which the executor ran is lost, the shuffle service may no longer be able to serve the shuffle files.
In such a case, when fetches from the executor's outputs fail in the same stage, the DAGScheduler again removes the executor and, by rights, should unregister its outputs. It does not, because the epoch used to track the executor failure has not increased.

We now track a separate epoch for failed executors whose file output is lost, so that we can unregister the outputs in this scenario. The idea of tracking a second epoch is due to Attila Zsolt Piros.
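For readers unfamiliar with the scheduler internals, here is a minimal, self-contained Scala sketch of the two-epoch idea. The names (`EpochTracker`, `executorFailureEpoch`, `shuffleFileLostEpoch`, `removeExecutorAndMaybeUnregisterOutputs`) are illustrative only and do not match the actual `DAGScheduler` code:

```scala
import scala.collection.mutable

// Illustrative sketch; not the actual DAGScheduler implementation.
class EpochTracker {
  // Epoch at which each executor was last handled as failed.
  private val executorFailureEpoch = mutable.HashMap.empty[String, Long]
  // Epoch at which each executor's shuffle file output was last unregistered.
  // Tracked separately: with the external shuffle service, an executor can be
  // removed without unregistering its outputs, and a later fetch failure at
  // the same epoch must still be able to trigger the unregistration.
  private val shuffleFileLostEpoch = mutable.HashMap.empty[String, Long]

  def removeExecutorAndMaybeUnregisterOutputs(
      execId: String,
      fileLost: Boolean,
      currentEpoch: Long): Unit = {
    if (!executorFailureEpoch.contains(execId) ||
        executorFailureEpoch(execId) < currentEpoch) {
      executorFailureEpoch(execId) = currentEpoch
      println(s"removing executor $execId (epoch $currentEpoch)")
      // ... remove the executor from internal bookkeeping ...
    }
    if (fileLost &&
        (!shuffleFileLostEpoch.contains(execId) ||
         shuffleFileLostEpoch(execId) < currentEpoch)) {
      shuffleFileLostEpoch(execId) = currentEpoch
      println(s"unregistering outputs of executor $execId (epoch $currentEpoch)")
      // ... unregister all map outputs recorded for this executor ...
    }
  }
}
```

Because the file-lost check no longer piggybacks on the executor-failure check, a fetch failure arriving at an epoch where the executor was already marked failed can still unregister the outputs.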

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit test. This test fails without the change and passes with it.
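As a rough illustration of the scenario the test exercises (the real test drives the `DAGScheduler` directly; the executor id and epoch values here are made up), run against the sketch above:

```scala
object FetchFailureAfterExecutorLoss extends App {
  val tracker = new EpochTracker

  // 1. Executor "exec-1" is lost. With the external shuffle service, its
  //    shuffle files are assumed to still be servable, so fileLost = false
  //    and the outputs stay registered.
  tracker.removeExecutorAndMaybeUnregisterOutputs("exec-1", fileLost = false, currentEpoch = 1)

  // 2. The node itself was lost, so a fetch from exec-1's outputs fails in
  //    the same stage, at the same epoch. The executor-failure epoch has not
  //    advanced, so the removal branch is skipped, but the separate
  //    file-lost epoch still allows the outputs to be unregistered.
  tracker.removeExecutorAndMaybeUnregisterOutputs("exec-1", fileLost = true, currentEpoch = 1)
}
```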

…ister outputs for executor on fetch failure after executor is lost
@wypoon (Contributor, Author) commented Jul 22, 2020

This is a backport of #28848 to branch-2.4.
Two files were changed. The backport of DAGScheduler.scala was straightforward, and the diffs are essentially the same. The backport of DAGSchedulerSuite.scala needed some adjustments, as some method signatures changed between 2.4 and 3.0.

@SparkQA commented Jul 22, 2020

Test build #126299 has finished for PR 29182 at commit 580beda.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito (Contributor) commented Jul 22, 2020

Jenkins, retest this please

@SparkQA commented Jul 22, 2020

Test build #126341 has finished for PR 29182 at commit 580beda.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wypoon (Contributor, Author) commented Jul 22, 2020

retest this please

@SparkQA commented Jul 22, 2020

Test build #126349 has finished for PR 29182 at commit 580beda.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented:

This is blocked by #29193.

@wypoon (Contributor, Author) commented Jul 24, 2020

@dongjoon-hyun this backport has a clean build in the most recent retry. This can be merged independently of the branch-3.0 backport.

@dongjoon-hyun (Member) commented Jul 25, 2020

@wypoon, please see my comment. I didn't say this is blocked by Jenkins; it is blocked by the Apache Spark backporting policy. To prevent regressions in higher versions, we always make sure that backporting occurs in the order master -> branch-3.0 -> branch-2.4 -> ...

@dongjoon-hyun (Member) commented Jul 25, 2020

> This can be merged independently of the branch-3.0 backport.

There is no independent backporting. We do sometimes forward-port, but that is not a recommended way in Apache Spark; it is allowed only exceptionally.

@wypoon (Contributor, Author) commented Jul 25, 2020

> @wypoon, please see my comment. I didn't say this is blocked by Jenkins; it is blocked by the Apache Spark backporting policy. To prevent regressions in higher versions, we always make sure that backporting occurs in the order master -> branch-3.0 -> branch-2.4 -> ...

@dongjoon-hyun I wasn't aware of the policy. It makes sense. Thank you for explaining it to me.

@dongjoon-hyun (Member) commented Jul 26, 2020

Thanks, @wypoon. Since all tests passed on both branch-3.0 and branch-2.4, please ping the original merger once more on both your PRs, to be safe. I believe he is the best person for this backporting.

@SparkQA commented Aug 3, 2020

Test build #5055 has finished for PR 29182 at commit 580beda.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wypoon (Contributor, Author) commented Aug 3, 2020

retest this please

@SparkQA commented Aug 3, 2020

Test build #127002 has finished for PR 29182 at commit 580beda.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Aug 4, 2020
…ister outputs for executor on fetch failure after executor is lost

### What changes were proposed in this pull request?

If an executor is lost, the `DAGScheduler` handles the executor loss by removing the executor, but it does not unregister the executor's outputs if the external shuffle service is in use. However, if the node on which the executor ran is lost, the shuffle service may no longer be able to serve the shuffle files.
In such a case, when fetches from the executor's outputs fail in the same stage, the `DAGScheduler` again removes the executor and, by rights, should unregister its outputs. It does not, because the epoch used to track the executor failure has not increased.

We now track a separate epoch for failed executors whose file output is lost, so that we can unregister the outputs in this scenario. The idea of tracking a second epoch is due to Attila Zsolt Piros.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test. This test fails without the change and passes with it.

Closes #29182 from wypoon/SPARK-32003-2.4.

Authored-by: Wing Yew Poon <[email protected]>
Signed-off-by: Imran Rashid <[email protected]>
@squito (Contributor) commented Aug 4, 2020

merged to 2.4, thanks @wypoon

wypoon closed this Aug 4, 2020