Skip to content

Conversation

@rezasafi
Copy link
Contributor

@rezasafi rezasafi commented May 22, 2019

What changes were proposed in this pull request?

Python uses a prefetch approach to read the result from upstream and serve them in another thread, thus it's possible that if the children operator doesn't consume all the data then the Task cleanup may happen before Python side read process finishes, this in turn create a race condition that the block read locks are freed during Task cleanup and then the reader try to release the read lock it holds and find it has been released, in this case we shall hit a AssertionError.

We shall catch the AssertionError in PythonRunner and prevent this kill the Executor.

How was this patch tested?

Hard to write a unit test case for this case, manually verified with failed job.

… in PythonRunner

## What changes were proposed in this pull request?

Python uses a prefetch approach to read the result from upstream and serve them in another thread, thus it's possible that if the children operator doesn't consume all the data then the Task cleanup may happen before Python side read process finishes, this in turn create a race condition that the block read locks are freed during Task cleanup and then the reader try to release the read lock it holds and find it has been released, in this case we shall hit a AssertionError.

We shall catch the AssertionError in PythonRunner and prevent this kill the Executor.

## How was this patch tested?

Hard to write a unit test case for this case, manually verified with failed job.

Closes apache#24542 from jiangxb1987/pyError.

Authored-by: Xingbo Jiang <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit e63fbfc)
@rezasafi
Copy link
Contributor Author

rezasafi commented May 22, 2019

I created this pr per @HyukjinKwon comment on the jira, but the cherry-pick to 2.3 was clean and without compile error. Core unit tests are also passing, but still running locally.

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented May 22, 2019

Test build #105652 has finished for PR 24670 at commit cb4dfa3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 22, 2019

Test build #105654 has finished for PR 24670 at commit cb4dfa3.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Oops, I meant cc @JoshRosen and @jiangxb1987

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. This is a clean cherry-pick. And the logic is valid in branch-2.3, too.
Thank you, @rezasafi and @HyukjinKwon .

Merged to branch-2.3.

dongjoon-hyun pushed a commit that referenced this pull request May 22, 2019
…the Executor in PythonRunner

## What changes were proposed in this pull request?

Python uses a prefetch approach to read the result from upstream and serve them in another thread, thus it's possible that if the children operator doesn't consume all the data then the Task cleanup may happen before Python side read process finishes, this in turn create a race condition that the block read locks are freed during Task cleanup and then the reader try to release the read lock it holds and find it has been released, in this case we shall hit a AssertionError.

We shall catch the AssertionError in PythonRunner and prevent this kill the Executor.

## How was this patch tested?

Hard to write a unit test case for this case, manually verified with failed job.

Closes #24670 from rezasafi/branch-2.3.

Authored-by: Xingbo Jiang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants