[SPARK-32003][CORE][2.4] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost #29182
Conversation
…ister outputs for executor on fetch failure after executor is lost
This is a backport of #28848 to branch-2.4.
Test build #126299 has finished for PR 29182 at commit
Jenkins, retest this please
Test build #126341 has finished for PR 29182 at commit
retest this please
Test build #126349 has finished for PR 29182 at commit
This is blocked by #29193.
@dongjoon-hyun this backport has a clean build in the most recent retry. This can be merged independently of the branch-3.0 backport.
@wypoon. Please see my comment. I didn't say this is blocked by Jenkins. This is blocked by the Apache Spark backporting policy. To prevent a regression at higher versions, we always make sure that backporting occurs in order from the higher branch to the lower one.
There is no
@dongjoon-hyun I wasn't aware of the policy. It makes sense. Thank you for explaining it to me.
Thanks, @wypoon. Since all tests passed both
Test build #5055 has finished for PR 29182 at commit
retest this please
Test build #127002 has finished for PR 29182 at commit
…ister outputs for executor on fetch failure after executor is lost

Closes #29182 from wypoon/SPARK-32003-2.4.

Authored-by: Wing Yew Poon <[email protected]>
Signed-off-by: Imran Rashid <[email protected]>
merged to 2.4, thanks @wypoon
What changes were proposed in this pull request?
If an executor is lost, the `DAGScheduler` handles the executor loss by removing the executor but does not unregister its outputs if the external shuffle service is used. However, if the node on which the executor runs is lost, the shuffle service may not be able to serve the shuffle files.

In such a case, when fetches from the executor's outputs fail in the same stage, the `DAGScheduler` again removes the executor and, by right, should unregister its outputs. It doesn't, because the epoch used to track the executor failure has not increased.

We track the epoch for failed executors that result in lost file output separately, so we can unregister the outputs in this scenario. The idea to track a second epoch is due to Attila Zsolt Piros.
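To make the two-epoch idea concrete, here is a minimal, self-contained Scala sketch. It is not the actual `DAGScheduler` code; the names (`DualEpochSketch`, `executorFailureEpoch`, `fileLostEpoch`, `handleExecutorLost`, `fileLost`) are hypothetical stand-ins. It only illustrates the behavior described above: removing an executor and unregistering its shuffle outputs are gated on two separate epoch maps, so a later fetch failure can still unregister the outputs even though the executor-failure epoch has not advanced.

```scala
import scala.collection.mutable

// Minimal sketch of the two-epoch idea (hypothetical names, not Spark's code).
object DualEpochSketch {
  // Epoch at which each executor was last treated as failed and removed.
  val executorFailureEpoch = mutable.Map.empty[String, Long]
  // Epoch at which each executor's shuffle files were last treated as lost.
  // Tracked separately so that output unregistration is not skipped merely
  // because the executor-failure epoch did not advance.
  val fileLostEpoch = mutable.Map.empty[String, Long]

  // Stand-ins for the real bookkeeping side effects.
  private def removeExecutor(execId: String): Unit =
    println(s"removing executor $execId")
  private def unregisterOutputs(execId: String): Unit =
    println(s"unregistering shuffle outputs of executor $execId")

  // Handle an executor loss observed at `epoch`. `fileLost` is false when an
  // external shuffle service is expected to keep serving the executor's files,
  // and true when the files must be considered gone (e.g. the host is lost, or
  // a fetch failure points at this executor).
  def handleExecutorLost(execId: String, epoch: Long, fileLost: Boolean): Unit = {
    if (!executorFailureEpoch.get(execId).exists(_ >= epoch)) {
      executorFailureEpoch(execId) = epoch
      removeExecutor(execId)
    }
    if (fileLost && !fileLostEpoch.get(execId).exists(_ >= epoch)) {
      fileLostEpoch(execId) = epoch
      unregisterOutputs(execId)
    }
  }
}
```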
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New unit test. This test fails without the change and passes with it.
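The actual test lives in Spark's test suite; purely as an illustration, the scenario it covers can be replayed against the hypothetical sketch above:

```scala
object DualEpochScenario {
  def main(args: Array[String]): Unit = {
    import DualEpochSketch._

    // Executor lost while the external shuffle service still serves its files:
    // the executor is removed, but its shuffle outputs stay registered.
    handleExecutorLost("exec-1", epoch = 1L, fileLost = false)
    assert(!fileLostEpoch.contains("exec-1"))

    // A fetch failure in the same stage later shows the files are unreachable.
    // The separate file-lost epoch lets the outputs be unregistered even though
    // the executor-failure epoch for exec-1 has not advanced.
    handleExecutorLost("exec-1", epoch = 1L, fileLost = true)
    assert(fileLostEpoch.contains("exec-1"))
  }
}
```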