-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-29152][CORE][2.4] Executor Plugin shutdown when dynamic allocation is enabled #26901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
ok to test |
|
Thank you, @iRakson ! |
|
Test build #115373 has finished for PR 26901 at commit
|
|
Retest this please |
|
Test build #115383 has finished for PR 26901 at commit
|
|
I will check the issue. |
|
Thank you, @iRakson ! |
|
There are a bunch of OOM errors in the test output. They all seem to be on the driver side, which is not touched by this PR, so it's unclear to me how the change here could cause those failures. I can't reproduce the OOM locally, but if someone can and is able to attach a debugger or get a heap dump, it could help. |
|
Retest this please. |
|
Test build #115521 has finished for PR 26901 at commit
|
|
According to the failure history, this PR always fails with Maven at the following. And, mostly, it causes non-OOM failures. |
|
That's what is reported on the UI. But if you look at the actual unit-tests.log file, there are a bunch of OOM errors. From the latest timed out run: |
|
Oh, got it, @vanzin . |
|
Ping, @iRakson . |
|
Retest this please. |
|
Retest this please |
|
Test build #117970 has finished for PR 26901 at commit
|
|
Retest this please. |
|
Test build #118006 has finished for PR 26901 at commit
|
|
@dongjoon-hyun @vanzin Instead of adding a shutdown hook, now i am passing I think this approach is better than forcing a shutdown hook. |
|
Test build #121006 has finished for PR 26901 at commit
|
|
retest this please. |
|
Retest this please |
|
Test build #121038 has finished for PR 26901 at commit
|
|
gentle ping @dongjoon-hyun @vanzin |
|
If this one is better, shall we update |
Yes. Actually that is what I was thinking as well. We should update it in master and 3.0 as well. I have one doubt though. |
|
Since 3.0 is not released yet, you can create a follow-up PR like |
@dongjoon-hyun |
|
Test build #123827 has finished for PR 26901 at commit
|
|
retest this please |
|
Test build #123832 has finished for PR 26901 at commit
|
|
retest this please |
|
Test build #123844 has finished for PR 26901 at commit
|
|
retest this please |
|
@dongjoon-hyun @vanzin I have tested this a couple of times, older approach of adding shutdownhook is working fine. Last failure is unrelated to this. Also about the other approach, I think Vanzin's comment is valid one. Covering all the executor exit scenarios can be difficult. This seems a better fix to our problem. |
|
Test build #123847 has finished for PR 26901 at commit
|
|
cc @srowen |
|
@dongjoon-hyun @vanzin Can we get this in 2.4 ? |
|
#28254 has been closed. |
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…tion is enabled ### What changes were proposed in this pull request? Added a Shutdown Hook in `executor.scala` which will ensure that executor's `stop()` method is always called. ### Why are the changes needed? In case executors are not going down gracefully, their `stop()` is not called. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually Closes #26901 from iRakson/SPARK-29152_2.4. Authored-by: iRakson <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
|
Unfortunately, this seems to break all
I'm still monitoring the Apache Spark Jenkins farm. |
|
The situation is the same. There are OOM errors. |
|
Sorry again, @iRakson . I fully understand you spent lots of time to make this contribution and the failed test cases passed individually and locally. However, we cannot keep this patch in I admit that this is my bad because I reviewed and merged the same PR twice and reverted. With two tries, we proved that the original patch has some issue with For now, I'll revert this commit to recover |
|
@dongjoon-hyun Its behaviour is pretty confusing. But yeah, if this is breaking branch again then we should not keep it. Yes, this patch failed twice so we must move on. Thank you for actively monitoring this patch. :) :) |
|
I know why this caused OOM. Here is the reason and the fix: #32748 |
What changes were proposed in this pull request?
Added a Shutdown Hook in
executor.scalawhich will ensure that executor'sstop()method is always called.Why are the changes needed?
In case executors are not going down gracefully, their
stop()is not called.Does this PR introduce any user-facing change?
No.
How was this patch tested?
Manually