[SPARK-25139][SPARK-18406][CORE][BRANCH-2.3] Avoid NonFatals to kill the Executor in PythonRunner #24670
… in PythonRunner

## What changes were proposed in this pull request?

Python uses a prefetch approach to read results from upstream and serve them in another thread. If the child operators do not consume all the data, Task cleanup may therefore happen before the Python-side read process finishes. This creates a race condition: the block read locks are freed during Task cleanup, and when the reader then tries to release the read lock it holds, it finds the lock has already been released and hits an AssertionError.

We catch the AssertionError in PythonRunner so that it does not kill the Executor.

## How was this patch tested?

Hard to write a unit test for this case; manually verified with a failed job.

Closes apache#24542 from jiangxb1987/pyError.

Authored-by: Xingbo Jiang <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit e63fbfc)
I created this PR per @HyukjinKwon's comment on the JIRA. The cherry-pick to 2.3 was clean and compiled without errors. Core unit tests are passing so far, but they are still running locally.
ok to test
Test build #105652 has finished for PR 24670 at commit
Test build #105654 has finished for PR 24670 at commit
Oops, I meant cc @JoshRosen and @jiangxb1987
dongjoon-hyun
left a comment
+1, LGTM. This is a clean cherry-pick. And the logic is valid in branch-2.3, too.
Thank you, @rezasafi and @HyukjinKwon.
Merged to branch-2.3.
…the Executor in PythonRunner

## What changes were proposed in this pull request?

Python uses a prefetch approach to read results from upstream and serve them in another thread. If the child operators do not consume all the data, Task cleanup may therefore happen before the Python-side read process finishes. This creates a race condition: the block read locks are freed during Task cleanup, and when the reader then tries to release the read lock it holds, it finds the lock has already been released and hits an AssertionError.

We catch the AssertionError in PythonRunner so that it does not kill the Executor.

## How was this patch tested?

Hard to write a unit test for this case; manually verified with a failed job.

Closes #24670 from rezasafi/branch-2.3.

Authored-by: Xingbo Jiang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
Python uses a prefetch approach to read results from upstream and serve them in another thread. If the child operators do not consume all the data, Task cleanup may therefore happen before the Python-side read process finishes. This creates a race condition: the block read locks are freed during Task cleanup, and when the reader then tries to release the read lock it holds, it finds the lock has already been released and hits an AssertionError.
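To make the ordering concrete, here is a minimal Java sketch that replays the problematic interleaving sequentially. It is an illustration only: the `ReadLockRace` class, its method names, and the block id `rdd_0_0` are hypothetical stand-ins, not Spark's actual lock manager or its API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy registry of read locks per task; NOT Spark's implementation.
public class ReadLockRace {
    private final Map<Long, Set<String>> readLocksByTask = new HashMap<>();

    // The reader thread takes a read lock on a block on behalf of a task.
    public synchronized void lockForReading(long taskId, String blockId) {
        readLocksByTask.computeIfAbsent(taskId, k -> new HashSet<>()).add(blockId);
    }

    // Task cleanup: drop every read lock the task still holds.
    public synchronized void releaseAllLocksForTask(long taskId) {
        readLocksByTask.remove(taskId);
    }

    // Single release: fails loudly if the lock is no longer held.
    public synchronized void unlock(long taskId, String blockId) {
        Set<String> held = readLocksByTask.get(taskId);
        if (held == null || !held.remove(blockId)) {
            throw new AssertionError(
                "Task " + taskId + " does not hold a read lock on " + blockId);
        }
    }

    // Replays the bad ordering and reports whether the late release failed.
    public static boolean lateReleaseFails() {
        ReadLockRace locks = new ReadLockRace();
        locks.lockForReading(1L, "rdd_0_0");  // reader starts consuming a block
        locks.releaseAllLocksForTask(1L);     // task ends early; cleanup frees its locks
        try {
            locks.unlock(1L, "rdd_0_0");      // reader finishes afterwards
            return false;
        } catch (AssertionError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("late release failed: " + lateReleaseFails());
    }
}
```

In the real scenario the second and third steps run on different threads, so the failure only appears when cleanup happens to win the race.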
We catch the AssertionError in PythonRunner so that it does not kill the Executor.
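The general idea can be sketched in Java as follows. This is not the patch itself, which is not shown in this conversation; `GuardedWriter`, `runGuarded`, and `isFatal` are hypothetical names, and the set of errors treated as fatal here is a simplifying assumption. The point it illustrates is that `AssertionError` is a `java.lang.Error`, so a handler limited to `Exception` would miss it, while a handler for non-fatal `Throwable`s records it instead of letting it escape the thread.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of guarding a worker thread against non-fatal Throwables.
public class GuardedWriter {
    // Assumption: only JVM-level failures are treated as fatal in this sketch.
    static boolean isFatal(Throwable t) {
        return t instanceof VirtualMachineError || t instanceof LinkageError;
    }

    // Runs body on a new thread; returns the non-fatal Throwable it raised, or null.
    public static Throwable runGuarded(Runnable body) {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread writer = new Thread(() -> {
            try {
                body.run();
            } catch (Throwable t) {
                if (isFatal(t)) {
                    throw (Error) t;  // both fatal kinds above are Errors
                }
                failure.set(t);       // record it rather than let it escape the thread
            }
        }, "writer-thread-sketch");
        writer.start();
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt status
        }
        return failure.get();
    }

    public static void main(String[] args) {
        Throwable t = runGuarded(() -> {
            throw new AssertionError("read lock already released");
        });
        System.out.println("captured: " + t);
    }
}
```

In Scala code, the extractor `scala.util.control.NonFatal` expresses a similar distinction, and it does match `AssertionError`.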
How was this patch tested?
Hard to write a unit test for this case; manually verified with a failed job.