-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-23197][STREAMING][TESTS] Fix ReceiverSuite."receiver_life_cycle" to not rely on timing #25862
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ot rely on timing
|
Test build #111033 has finished for PR 25862 at commit
|
|
Test build #111035 has finished for PR 25862 at commit
|
|
Test build #111039 has finished for PR 25862 at commit
|
| executor.callsRecorder.reset() | ||
| receiver.callsRecorder.reset() | ||
| receiver.restart("restarting", null, 100) | ||
| eventually(timeout(10.seconds), interval(10.milliseconds)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, 10.seconds is enough? Or, do you need to re-trigger this PR to validate more?
BTW, thank you so much for taking care of this case! This is really an long standing issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that was actually 1.3 seconds (300ms + 1s) and it hasn't been failing for high probability so it should be pretty enough.
|
cc. @tdas @sameeragarwal as they've authored and reviewed 15adcc8 |
| assert(executor.isReceiverStarted) | ||
| executor.callsRecorder.reset() | ||
| receiver.callsRecorder.reset() | ||
| receiver.restart("restarting", null, 100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This goes down from 600 to 100?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes we no longer need so long delay as we don't rely on timing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also verified this PR with the given test method. New test suite is logically robust and doesn't fail.
+1, LGTM. Merged to master.
Thank you so much, @HeartSaVioR !
|
Could you make a backporting PR, @HeartSaVioR ? |
|
Thanks for reviewing and merging, @dongjoon-hyun ! I'll submit a PR for branch-2.4 as well. |
…e" to not rely on timing This patch changes ReceiverSuite."receiver_life_cycle" to record actual calls with timestamp in FakeReceiver/FakeReceiverSupervisor, which doesn't rely on timing of stopping and starting receiver in restarting receiver. It enables us to give enough huge timeout on verification of restart as we can verify both stopping and starting together. The test is flaky without this patch. We increased timeout to fix flakyness of this test (apache@15adcc8) but even with longer timeout it has been still failing intermittently. No I've reproduced test failure artificially via below diff: ``` diff --git a/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala b/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala index faf6db8..d8977543c0 100644 --- a/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala +++ b/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala -191,9 +191,11 private[streaming] abstract class ReceiverSupervisor( // thread pool. logWarning("Restarting receiver with delay " + delay + " ms: " + message, error.getOrElse(null)) + Thread.sleep(1000) stopReceiver("Restarting receiver with delay " + delay + "ms: " + message, error) logDebug("Sleeping for " + delay) Thread.sleep(delay) + Thread.sleep(1000) logInfo("Starting receiver again") startReceiver() logInfo("Receiver started again") ``` and confirmed this patch doesn't fail with the change. Closes apache#25862 from HeartSaVioR/SPARK-23197-v2. Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
|
Thanks! |
What changes were proposed in this pull request?
This patch changes ReceiverSuite."receiver_life_cycle" to record actual calls with timestamp in FakeReceiver/FakeReceiverSupervisor, which doesn't rely on timing of stopping and starting receiver in restarting receiver. It enables us to give enough huge timeout on verification of restart as we can verify both stopping and starting together.
Why are the changes needed?
The test is flaky without this patch. We increased timeout to fix flakyness of this test (15adcc8) but even with longer timeout it has been still failing intermittently.
Does this PR introduce any user-facing change?
No
How was this patch tested?
I've reproduced test failure artificially via below diff:
and confirmed this patch doesn't fail with the change.