-
Couldn't load subscription status.
- Fork 28.9k
[SPARK-29104][CORE][TESTS] Fix PipedRDDSuite to use eventually to check thread termination
#25808
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…checking thread termination
| val result = piped.mapPartitions(_ => Array.emptyIntArray.iterator) | ||
|
|
||
| assert(result.collect().length === 0) | ||
| sc.listenerBus.waitUntilEmpty(10000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the fix and the remaining is typo fix from err to in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Could you elaborate how it helps to resolve flakiness? I'm seeing that |
|
Test build #110672 has finished for PR 25808 at commit
|
|
@HeartSaVioR . What do you mean by the following?
Listener is added like the following. 1. PipedRDD context.addTaskCompletionListener[Unit] { _ =>
if (proc.isAlive) {
proc.destroy()
}
if (stdinWriterThread.isAlive) {
stdinWriterThread.interrupt()
}
if (stderrReaderThread.isAlive) {
stderrReaderThread.interrupt()
}
}2. TaskContext def addTaskCompletionListener[U](f: (TaskContext) => U): TaskContext = {
// Note that due to this scala bug: https://github.com/scala/bug/issues/11016, we need to make
// this function polymorphic for every scala version >= 2.12, otherwise an overloaded method
// resolution error occurs at compile time.
addTaskCompletionListener(new TaskCompletionListener {
override def onTaskCompletion(context: TaskContext): Unit = f(context)
})
}3. TaskContextImpl override def addTaskCompletionListener(listener: TaskCompletionListener)
: this.type = synchronized {
if (completed) {
listener.onTaskCompletion(this)
} else {
onCompleteCallbacks += listener
}
this
}And, that will be invoked like the following. private[spark] override def markTaskCompleted(error: Option[Throwable]): Unit = synchronized {
if (completed) return
completed = true
invokeListeners(onCompleteCallbacks, "TaskCompletionListener", error) {
_.onTaskCompletion(this)
}
} |
|
Thank you for review and approval, @HyukjinKwon . |
|
I meant the flow doesn't seem to deal with listenerBus in SparkContext. These listeners are registered to TaskContext instead of listenerBus, and Task will directly update the status to listeners. spark/core/src/main/scala/org/apache/spark/scheduler/Task.scala Lines 126 to 167 in 95073fb
Here both |
|
Oh, let me take a look at that once more. |
|
Got it. I was confused during following the code. After |
|
Yeah, actually I also don't get the reason why such thread is shown afterwards, as |
|
It seems that it will take some time. I'll close this PR first to avoid further confusion. Thank you for feedbacks, @HyukjinKwon and @HeartSaVioR ! |
|
It seems that there exists another possibility. I'll test and reopen this PR. |
eventually to check thread termination
|
Hi, @HyukjinKwon and @HeartSaVioR . This is the second try to fix the |
|
|
||
| assert(stderrWriterThread.isEmpty) | ||
| // SPARK-29104 PipedRDD will invoke `stdinWriterThread.interrupt()` at task completion, | ||
| // and `obj.wait` will get InterruptedException. However, there exists a possibility |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updated patch. Much better with artificially reproducing the issue.
I'm trying to understand the situation - for the sake of understanding, interrupting thread would throw InterruptedException when thread is waiting via obj.wait() and stdinWriterThread will catch it, and run() will finish.
I'm trying to understand the problematic situation as well - if Thread.interrupt() occurs before obj.wait(), the thread will keep running until there's some method understanding Thread's interrupted state (like Object.wait() which will throw InterruptedException even it was thrown earlier) so there's possible delay for thread to be finished. Could you confirm my understanding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second flow is correct. However, the first one is slightly differently described. stdinWriterThread is not catching the exception. The task thread is just marked as interrupted status and restarts from the after obj.wait bytecode with InterruptedException. However, this Task thread can be slow and the test code grabs the Thread.getAllStackTraces before the Task thread escaped from the run function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I played with debugger for the first case as second case cannot be reproduced easily but first case would be always reproduced. It's slightly different though no big deal.
With debugger, for normal case with current master branch, stdinWriterThread catches the InterruptedException and escape run() function after executing catch/finally statements. stderrReaderThread doesn't catch the InterruptedException as it's already destroyed when we try to interrupt.
We don't join these threads for task completion anyway so it seems to be also possible even for first case to let test code verify earlier than let stdinWriterThread be destroyed. Maybe you've already explained same and I misunderstood ;( If that's the case, I'm sorry to bother you.
|
Test build #110737 has finished for PR 25808 at commit
|
|
retest this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Test build #110747 has finished for PR 25808 at commit
|
|
Merged to master |
|
Thank you so much for helping me in this PR, @HyukjinKwon and @HeartSaVioR ! |
…heck thread termination ### What changes were proposed in this pull request? `PipedRDD` will invoke `stdinWriterThread.interrupt()` at task completion, and `obj.wait` will get `InterruptedException`. However, there exists a possibility which the thread termination gets delayed because the thread starts from `obj.wait()` with that exception. To prevent test flakiness, we need to use `eventually`. Also, This PR fixes the typo in code comment and variable name. ### Why are the changes needed? ``` - stdin writer thread should be exited when task is finished *** FAILED *** Some(Thread[stdin writer for List(cat),5,]) was not empty (PipedRDDSuite.scala:107) ``` - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6867/testReport/junit/org.apache.spark.rdd/PipedRDDSuite/stdin_writer_thread_should_be_exited_when_task_is_finished/ ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manual. We can reproduce the same failure like Jenkins if we catch `InterruptedException` and sleep longer than the `eventually` timeout inside the test code. The following is the example to reproduce it. ```scala val nums = sc.makeRDD(Array(1, 2, 3, 4), 1).map { x => try { obj.synchronized { obj.wait() // make the thread waits here. } } catch { case ie: InterruptedException => Thread.sleep(15000) throw ie } x } ``` Closes #25808 from dongjoon-hyun/SPARK-29104. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: HyukjinKwon <[email protected]> (cherry picked from commit 34915b2) Signed-off-by: Dongjoon Hyun <[email protected]>
|
This is backported to |
What changes were proposed in this pull request?
PipedRDDwill invokestdinWriterThread.interrupt()at task completion, andobj.waitwill getInterruptedException. However, there exists a possibility which the thread termination gets delayed because the thread starts fromobj.wait()with that exception. To prevent test flakiness, we need to useeventually. Also, This PR fixes the typo in code comment and variable name.Why are the changes needed?
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Manual.
We can reproduce the same failure like Jenkins if we catch
InterruptedExceptionand sleep longer than theeventuallytimeout inside the test code. The following is the example to reproduce it.