-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[WIP] Remove many uses of Thread.sleep() from streaming tests #3687
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…est" This reverts commit f8f6c93. Conflicts: streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala
|
Test build #24424 has started for PR 3687 at commit
|
|
I expect this to fail due to https://issues.apache.org/jira/browse/SPARK-4835. |
|
Test build #24424 has finished for PR 3687 at commit
|
|
Test FAILed. |
|
/cc @tdas. This is a work-in-progress towards removing most of the Thread.sleep() calls. I'm making slow-but-steady progress; would love your feedback + any suggestions for other types of test restructuring / cleanup to improve stability. |
|
Test build #24434 has started for PR 3687 at commit
|
|
Test build #24435 has started for PR 3687 at commit
|
|
Test build #24434 has finished for PR 3687 at commit
|
|
Test FAILed. |
|
@JoshRosen This is a wonderful and much-required refactoring. I havent been able to see it in detail yet, will do when I reach home. |
|
Test build #24435 has finished for PR 3687 at commit
|
|
Test FAILed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little uneasy with this approach because its not clean. Ideally nothing in Spark should refer to the requirements of higher level libraries like Spark Streaming.
|
This latest round of test failures was due to the flaky |
|
This latest run failed three tests:
Here are the failed assertions:
It's possible that this could have been a side-effect of the |
|
It turns out that my current use of I wish there was a better way to mock the filesystem in these tests. In the meantime, though, I could stick with batch intervals that are multiples of seconds (since we only need to do this in two tests). |
|
Test build #24523 has started for PR 3687 at commit
|
|
Alright, I think I fixed up a few problems in my first version of the SPARK-1600 fix, so hopefully it works now. For those who are interested in the details:
|
|
Test build #24524 has started for PR 3687 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this and the tests seem to pass, but now I'm wondering whether I've inadvertently introduced a new source of flakiness...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this attempt to remove all sleep (that is, removing all waiting related logic), it is a good idea that we remove this and fix all tests that uses set this to true.
|
Test build #24523 has finished for PR 3687 at commit
|
|
Test PASSed. |
|
Test build #24524 has finished for PR 3687 at commit
|
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I approve this change. But probably should be a different PR that just touches this input stream and its tests.
|
At a high-level i think, lets spilt this PR into two PRs
This would allow the discussion of one to proceed independently of the other. Frankly, given the current state in this PR, (3) and (2) is closer to being merged than (1). |
|
EDIT: see following comment instead I agree that it's a good idea to split this up. For starters, I'm going to try splitting off only the fix for the FileInputDStream test (SPARK-1600), since that's one of the flakiest tests, has some somewhat-unique code changes (the file modification timestamp stuff) and should be a relatively small PR to review by itself. I'll introduce the StreamingTestWaiter class in that PR. Once we're done reviewing and merging that, I'll move onto a PR to clean up all of the remaining uses of Let's chat offline about the |
|
Just realized that my last comment was a bit confusing, since SPARK-1600 is not related to the FileInputStream ManualClock fix. I'll file a new improvement JIRA to cover replacing our uses of SystemClock in tests. |
|
I'm going to close this for now; I'll open separate, smaller PRs for the remaining test cleanup. |
Update: we have decided to split this into a series of smaller PRs: #3801 and #3832.
This is a work-in-progress PR towards removing many of the
Thread.sleep()calls from Spark Streaming's test suite, since I think that these calls make these tests race-prone and flaky.Running list of the JIRAs that this patch addresses / intends to address:
TODOS:
Thread.sleepcalls from therecovery with file input streamtest. There are several uses ofThread.sleephere which seem to serve several different purposes, so this will require some care.InputStreamSuite. Most of these cases look simple to replace with the new StreamingTestWaiter, but thetestFileStreamcase looks like it could be tricky. Going to defer removing the sleeps frommulti-thread receiversince that looks like a bit of work and we can do it later if we observe flakiness.ReceiverSuite. There are a fewThread.sleep(0)cases which seem confusing; if they're still necessary, they should be commented.MasterFailureTest, since these seem to be blocking on events like the StreamingContext starting and we have waiters for this.