Conversation

@wypoon (Contributor) commented Jul 22, 2020

What changes were proposed in this pull request?

If an executor is lost, the DAGScheduler handles the executor loss by removing the executor, but it does not unregister the executor's outputs if the external shuffle service is in use. However, if the node on which the executor ran is lost, the shuffle service may no longer be able to serve the shuffle files.
In such a case, when fetches from the executor's outputs fail in the same stage, the DAGScheduler again removes the executor and, by rights, should unregister its outputs. It does not, because the epoch used to track the executor failure has not increased.

We now track a separate epoch for failed executors whose file output is lost, so that we can unregister the outputs in this scenario. The idea of tracking a second epoch is due to Attila Zsolt Piros.
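For readers unfamiliar with the scheduler internals, here is a minimal, self-contained Scala sketch of the two-epoch idea. The names (`EpochTracker`, `executorFailureEpoch`, `shuffleFileLostEpoch`, `removeExecutorAndMaybeUnregisterOutputs`) are illustrative only and do not match the actual `DAGScheduler` code:

```scala
import scala.collection.mutable

// Illustrative sketch; not the actual DAGScheduler implementation.
class EpochTracker {
  // Epoch at which each executor was last handled as failed.
  private val executorFailureEpoch = mutable.HashMap.empty[String, Long]
  // Epoch at which each executor's shuffle file output was last unregistered.
  // Tracked separately: with the external shuffle service, an executor can be
  // removed without unregistering its outputs, and a later fetch failure at
  // the same epoch must still be able to trigger the unregistration.
  private val shuffleFileLostEpoch = mutable.HashMap.empty[String, Long]

  def removeExecutorAndMaybeUnregisterOutputs(
      execId: String,
      fileLost: Boolean,
      currentEpoch: Long): Unit = {
    if (!executorFailureEpoch.contains(execId) ||
        executorFailureEpoch(execId) < currentEpoch) {
      executorFailureEpoch(execId) = currentEpoch
      println(s"removing executor $execId (epoch $currentEpoch)")
      // ... remove the executor from internal bookkeeping ...
    }
    if (fileLost &&
        (!shuffleFileLostEpoch.contains(execId) ||
         shuffleFileLostEpoch(execId) < currentEpoch)) {
      shuffleFileLostEpoch(execId) = currentEpoch
      println(s"unregistering outputs of executor $execId (epoch $currentEpoch)")
      // ... unregister all map outputs recorded for this executor ...
    }
  }
}
```

Because the file-lost check no longer piggybacks on the executor-failure check, a fetch failure arriving at an epoch where the executor was already marked failed can still unregister the outputs.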

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit test. This test fails without the change and passes with it.
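As a rough illustration of the scenario the test exercises (the real test drives the `DAGScheduler` directly; the executor id and epoch values here are made up), run against the sketch above:

```scala
object FetchFailureAfterExecutorLoss extends App {
  val tracker = new EpochTracker

  // 1. Executor "exec-1" is lost. With the external shuffle service, its
  //    shuffle files are assumed to still be servable, so fileLost = false
  //    and the outputs stay registered.
  tracker.removeExecutorAndMaybeUnregisterOutputs("exec-1", fileLost = false, currentEpoch = 1)

  // 2. The node itself was lost, so a fetch from exec-1's outputs fails in
  //    the same stage, at the same epoch. The executor-failure epoch has not
  //    advanced, so the removal branch is skipped, but the separate
  //    file-lost epoch still allows the outputs to be unregistered.
  tracker.removeExecutorAndMaybeUnregisterOutputs("exec-1", fileLost = true, currentEpoch = 1)
}
```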

…ister outputs for executor on fetch failure after executor is lost
@wypoon (Contributor, Author) commented Jul 22, 2020

This is a backport of #28848 to branch-2.4.
Two files were changed. The backport of DAGScheduler.scala was straightforward, and the diffs are essentially the same. The backport of DAGSchedulerSuite.scala needed some adjustments, as some method signatures changed between 2.4 and 3.0.

@SparkQA commented Jul 22, 2020

Test build #126299 has finished for PR 29182 at commit 580beda.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito (Contributor) commented Jul 22, 2020

Jenkins, retest this please

@SparkQA commented Jul 22, 2020

Test build #126341 has finished for PR 29182 at commit 580beda.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wypoon (Contributor, Author) commented Jul 22, 2020

retest this please

@SparkQA commented Jul 22, 2020

Test build #126349 has finished for PR 29182 at commit 580beda.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented:

This is blocked by #29193.

@wypoon (Contributor, Author) commented Jul 24, 2020

@dongjoon-hyun this backport has a clean build in the most recent retry. This can be merged independently of the branch-3.0 backport.

@dongjoon-hyun (Member) commented Jul 25, 2020

@wypoon, please see my comment. I didn't say this is blocked by Jenkins; it is blocked by the Apache Spark backporting policy. To prevent regressions in higher versions, we always make sure that backporting occurs in the order master -> branch-3.0 -> branch-2.4 -> ...

@dongjoon-hyun (Member) commented Jul 25, 2020

> This can be merged independently of the branch-3.0 backport.

There is no independent backporting. We do sometimes forward-port, but that is not a recommended way in Apache Spark; it is allowed only exceptionally.

@wypoon (Contributor, Author) commented Jul 25, 2020

> @wypoon, please see my comment. I didn't say this is blocked by Jenkins; it is blocked by the Apache Spark backporting policy. To prevent regressions in higher versions, we always make sure that backporting occurs in the order master -> branch-3.0 -> branch-2.4 -> ...

@dongjoon-hyun I wasn't aware of the policy. It makes sense. Thank you for explaining it to me.

@dongjoon-hyun (Member) commented Jul 26, 2020

Thanks, @wypoon. Since all tests passed on both branch-3.0 and branch-2.4, please ping the original merger once more on both your PRs, to be safe. I believe he is the best person for this backporting.

@SparkQA commented Aug 3, 2020

Test build #5055 has finished for PR 29182 at commit 580beda.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wypoon (Contributor, Author) commented Aug 3, 2020

retest this please

@SparkQA commented Aug 3, 2020

Test build #127002 has finished for PR 29182 at commit 580beda.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Aug 4, 2020
…ister outputs for executor on fetch failure after executor is lost

### What changes were proposed in this pull request?

If an executor is lost, the `DAGScheduler` handles the executor loss by removing the executor, but it does not unregister the executor's outputs if the external shuffle service is in use. However, if the node on which the executor ran is lost, the shuffle service may no longer be able to serve the shuffle files.
In such a case, when fetches from the executor's outputs fail in the same stage, the `DAGScheduler` again removes the executor and, by rights, should unregister its outputs. It does not, because the epoch used to track the executor failure has not increased.

We now track a separate epoch for failed executors whose file output is lost, so that we can unregister the outputs in this scenario. The idea of tracking a second epoch is due to Attila Zsolt Piros.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test. This test fails without the change and passes with it.

Closes #29182 from wypoon/SPARK-32003-2.4.

Authored-by: Wing Yew Poon <[email protected]>
Signed-off-by: Imran Rashid <[email protected]>
@squito (Contributor) commented Aug 4, 2020

merged to 2.4, thanks @wypoon

wypoon closed this Aug 4, 2020