@JoshRosen
Contributor

Update: we have decided to split this into a series of smaller PRs: #3801 and #3832.

This is a work-in-progress PR towards removing many of the Thread.sleep() calls from Spark Streaming's test suite, since I think that these calls make these tests race-prone and flaky.

Running list of the JIRAs that this patch addresses / intends to address:

  • SPARK-4835: "Streaming saveAs*HadoopFiles() methods may throw FileAlreadyExistsException during checkpoint recovery"
  • SPARK-1600: "Flaky "recovery with file input stream" test in streaming.CheckpointSuite"
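
As a generic illustration of the failure mode named in SPARK-4835, here is a minimal Java sketch in which a batch's output is written twice, as might happen when a batch is replayed after recovery. It uses `java.nio.file` rather than the Hadoop output APIs, and `SketchBatchOutput`, its methods, and the file names are all hypothetical. It is not the fix proposed in this PR.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch using java.nio.file; not the Hadoop output API from SPARK-4835.
class SketchBatchOutput {
    /** Writes the batch output and fails if a file already exists at that path. */
    static void writeStrict(Path out, String data) throws IOException {
        Files.write(out, data.getBytes(StandardCharsets.UTF_8), StandardOpenOption.CREATE_NEW);
    }

    /** Removes output left by an earlier attempt at the same batch, then writes. */
    static void writeReplacing(Path out, String data) throws IOException {
        Files.deleteIfExists(out);
        writeStrict(out, data);
    }

    /** Re-runs the same batch with the strict writer; returns true if the re-run is rejected. */
    static boolean strictRerunIsRejected() {
        try {
            Path out = Files.createTempDirectory("batch-sketch").resolve("batch-1000.txt");
            writeStrict(out, "first attempt");
            try {
                writeStrict(out, "replayed after recovery");
                return false;
            } catch (FileAlreadyExistsException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Re-runs the same batch with the replacing writer; returns the final file content. */
    static String replacingRerunContent() {
        try {
            Path out = Files.createTempDirectory("batch-sketch").resolve("batch-1000.txt");
            writeReplacing(out, "first attempt");
            writeReplacing(out, "replayed after recovery");
            return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Replacing existing output is only one possible behavior; the thread leaves open how SPARK-4835 should actually be fixed.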

TODOS:

  • Fix SPARK-1600 by removing the Thread.sleep calls from the recovery with file input stream test. There are several uses of Thread.sleep here which seem to serve several different purposes, so this will require some care.
  • Remove the sleep calls in InputStreamSuite. Most of these cases look simple to replace with the new StreamingTestWaiter, but the testFileStream case looks like it could be tricky. Going to defer removing the sleeps from multi-thread receiver since that looks like a bit of work and we can do it later if we observe flakiness.
  • Investigate the uses in ReceiverSuite. There are a few Thread.sleep(0) cases which seem confusing; if they're still necessary, they should be commented.
  • Might be able to remove one or two low-hanging fruit calls in MasterFailureTest, since these seem to be blocking on events like the StreamingContext starting and we have waiters for this.
  • Determine whether the fix for SPARK-4835 should be done differently (see discussion on JIRA).
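
The general idea behind replacing `Thread.sleep()` with a waiter can be sketched as follows: the test blocks on a condition that the system under test signals, with a timeout as a safety net, instead of sleeping for a fixed time and hoping the work has finished. This is a minimal Java sketch under that assumption. `SketchBatchWaiter` and its methods are hypothetical and are not the `StreamingTestWaiter` class from this PR.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical waiter; not the StreamingTestWaiter class from this PR. */
class SketchBatchWaiter {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition batchCompleted = lock.newCondition();
    private int completedBatches = 0;

    /** Called by the system under test, for example from a listener callback. */
    void onBatchCompleted() {
        lock.lock();
        try {
            completedBatches++;
            batchCompleted.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /**
     * Blocks until at least {@code target} batches have completed or the timeout
     * expires. Returns true if the target was reached.
     */
    boolean waitForBatches(int target, long timeoutMs) {
        long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        lock.lock();
        try {
            // Loop to guard against spurious wakeups and partial progress.
            while (completedBatches < target) {
                if (remainingNanos <= 0L) {
                    return false;
                }
                remainingNanos = batchCompleted.awaitNanos(remainingNanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }
}
```

A test would call `waitForBatches(n, timeoutMs)` and assert on the result. It returns as soon as the condition holds, so a generous timeout does not slow down the passing case.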

@SparkQA

SparkQA commented Dec 13, 2014

Test build #24424 has started for PR 3687 at commit ad0056b.

  • This patch merges cleanly.

@JoshRosen
Contributor Author

I expect this to fail due to https://issues.apache.org/jira/browse/SPARK-4835.

@SparkQA

SparkQA commented Dec 13, 2014

Test build #24424 has finished for PR 3687 at commit ad0056b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StreamingTestWaiter(ssc: StreamingContext)

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24424/
Test FAILed.

@JoshRosen changed the title from "Remove many uses of Thread.sleep() from streaming tests" to "[WIP] Remove many uses of Thread.sleep() from streaming tests" on Dec 13, 2014
@JoshRosen
Contributor Author

/cc @tdas. This is a work-in-progress towards removing most of the Thread.sleep() calls. I'm making slow-but-steady progress; would love your feedback + any suggestions for other types of test restructuring / cleanup to improve stability.

@SparkQA

SparkQA commented Dec 13, 2014

Test build #24434 has started for PR 3687 at commit 3db335f.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 13, 2014

Test build #24435 has started for PR 3687 at commit 12635b4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 13, 2014

Test build #24434 has finished for PR 3687 at commit 3db335f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24434/
Test FAILed.

@tdas
Contributor

tdas commented Dec 13, 2014

@JoshRosen This is a wonderful and much-needed refactoring. I haven't been able to look at it in detail yet, but I will when I get home.

@SparkQA

SparkQA commented Dec 13, 2014

Test build #24435 has finished for PR 3687 at commit 12635b4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24435/
Test FAILed.

Contributor

I am a little uneasy with this approach because it's not clean. Ideally, nothing in Spark should refer to the requirements of higher-level libraries like Spark Streaming.

@JoshRosen
Contributor Author

This latest round of test failures was due to the flaky WriteAheadLogBackedBlockRDDSuite tests (SPARK-4826). I'm going to work on investigating that test now, but I'll probably fix that in a separate PR since I think that the root cause there is not Thread.sleep calls (and that fix needs to be targeted differently for backports, since that feature is new in 1.2.0).

@JoshRosen
Contributor Author

This latest run failed three tests:

  • org.apache.spark.streaming.CheckpointSuite.recovery with file input stream
  • org.apache.spark.streaming.FailureSuite.multiple failures with map
  • org.apache.spark.streaming.FailureSuite.multiple failures with updateStateByKey

Here are the failed assertions:

It's possible that this could have been a side-effect of the Charsets.UTF_8 change, but I don't see how that could have caused problems. I'll investigate.

@JoshRosen
Contributor Author

It turns out that my current use of file.setLastModified is brittle since the underlying filesystem may only support one-second resolution:

     * <p> All platforms support file-modification times to the nearest second,
     * but some provide more precision.  The argument will be truncated to fit
     * the supported precision.  If the operation succeeds and no intervening
     * operations on the file take place, then the next invocation of the
     * <code>{@link #lastModified}</code> method will return the (possibly
     * truncated) <code>time</code> argument that was passed to this method.

I wish there was a better way to mock the filesystem in these tests. In the meantime, though, I could stick with batch intervals that are multiples of seconds (since we only need to do this in two tests).
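
One way to sidestep the precision issue, sketched below in Java, is to round every timestamp down to a whole second before passing it to `File.setLastModified`, so that truncation by the filesystem cannot change the value. `SketchFileTimes` and its helpers are hypothetical names for illustration, and the sketch relies on the Javadoc's statement that all platforms support one-second resolution.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper; assumes one-second resolution is supported, per the Javadoc.
class SketchFileTimes {
    /** Rounds a non-negative timestamp in milliseconds down to a whole second. */
    static long truncateToSeconds(long timeMs) {
        return (timeMs / 1000L) * 1000L;
    }

    /**
     * Sets the file's modification time to a whole-second value and returns the
     * value that was actually requested.
     */
    static long setModTime(File file, long timeMs) {
        long truncated = truncateToSeconds(timeMs);
        if (!file.setLastModified(truncated)) {
            throw new IllegalStateException("could not set modification time on " + file);
        }
        return truncated;
    }

    /** Creates an empty temporary file that is deleted when the JVM exits. */
    static File newTempFile() {
        try {
            File file = File.createTempFile("modtime-sketch", ".txt");
            file.deleteOnExit();
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With whole-second timestamps, a test can compare `file.lastModified()` against the requested value directly, which matches the suggestion of using batch intervals that are multiples of one second.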

@SparkQA

SparkQA commented Dec 17, 2014

Test build #24523 has started for PR 3687 at commit 520bade.

  • This patch merges cleanly.

@JoshRosen
Contributor Author

Alright, I think I fixed up a few problems in my first version of the SPARK-1600 fix, so hopefully it works now.

For those who are interested in the details:

  • We had a "time-travel" issue when restarting from the checkpoint because I didn't copy over the old ManualClock value to the new StreamingContext.
  • My modified FileInputDStream didn't handle the clock properly during recovery (the field shouldn't have been @transient, so I made it into a def instead).
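
Both problems can be illustrated with a small Java analogue. A `transient` field plays the role of Scala's `@transient`, and an accessor method plays the role of the `def`. All class and method names here are hypothetical, and Java serialization stands in for checkpointing. This is a sketch of the pitfalls, not the code from this PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical test clock: time moves only when the test advances it. */
class SketchManualClock {
    private long timeMs;
    SketchManualClock(long startMs) { this.timeMs = startMs; }
    synchronized long currentTime() { return timeMs; }
    synchronized void advance(long deltaMs) { timeMs += deltaMs; }
}

/** Hypothetical stand-in for a context that is checkpointed and later restored. */
class SketchContext implements Serializable {
    private static final long serialVersionUID = 1L;

    // Not serialized: after recovery this field is null until clock() rebuilds it.
    private transient SketchManualClock clock;
    // Serialized: the clock value recorded at checkpoint time.
    private long checkpointedTimeMs;

    SketchContext(long startMs) {
        this.clock = new SketchManualClock(startMs);
        this.checkpointedTimeMs = startMs;
    }

    /** Accessor used instead of reading the field directly, so recovery cannot see null. */
    synchronized SketchManualClock clock() {
        if (clock == null) {
            // Resume from the checkpointed time rather than from zero.
            clock = new SketchManualClock(checkpointedTimeMs);
        }
        return clock;
    }

    /** Records the current clock value so that it survives serialization. */
    synchronized void checkpoint() {
        checkpointedTimeMs = clock().currentTime();
    }

    /** Serializes and deserializes the context, as a checkpoint restore would. */
    static SketchContext restore(SketchContext ctx) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(ctx);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SketchContext) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

If the restored clock started from zero instead of `checkpointedTimeMs`, batches after recovery would appear to run earlier than batches before the failure, which is the kind of "time-travel" described above.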

@SparkQA

SparkQA commented Dec 17, 2014

Test build #24524 has started for PR 3687 at commit 1304776.

  • This patch merges cleanly.

Contributor Author

I removed this and the tests seem to pass, but now I'm wondering whether I've inadvertently introduced a new source of flakiness...

Contributor

In this attempt to remove all sleeps (that is, all waiting-related logic), it is a good idea to remove this and fix all tests that set this to true.

@SparkQA

SparkQA commented Dec 17, 2014

Test build #24523 has finished for PR 3687 at commit 520bade.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24523/
Test PASSed.

@SparkQA

SparkQA commented Dec 17, 2014

Test build #24524 has finished for PR 3687 at commit 1304776.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24524/
Test PASSed.

Contributor

I approve this change, but it should probably be a separate PR that touches only this input stream and its tests.

@tdas
Contributor

tdas commented Dec 25, 2014

At a high level, I think we should split this PR into three PRs:

  1. Fixing saveAsHadoopFiles
  2. Fixing ManualClock and its uses
  3. Removing thread sleeps

This would allow the discussion of each to proceed independently of the others. Frankly, given the current state of this PR, (3) and (2) are closer to being merged than (1).

@JoshRosen
Contributor Author

EDIT: see following comment instead

I agree that it's a good idea to split this up.

For starters, I'm going to try splitting off only the fix for the FileInputDStream test (SPARK-1600), since that's one of the flakiest tests, has some somewhat-unique code changes (the file modification timestamp stuff) and should be a relatively small PR to review by itself. I'll introduce the StreamingTestWaiter class in that PR. Once we're done reviewing and merging that, I'll move onto a PR to clean up all of the remaining uses of Thread.sleep(). Some of those uses have not led to flaky tests, though, so I think that splitting the change up and prioritizing based on the tests that are known to be flaky will be a good way to reduce the review burden here.

Let's chat offline about the saveAsHadoopFiles fix.

@JoshRosen
Contributor Author

Just realized that my last comment was a bit confusing, since SPARK-1600 is not related to the FileInputStream ManualClock fix. I'll file a new improvement JIRA to cover replacing our uses of SystemClock in tests.

@JoshRosen
Contributor Author

I'm going to close this for now; I'll open separate, smaller PRs for the remaining test cleanup.

@JoshRosen closed this on Jan 7, 2015