
Conversation

@jkbradley
Member

Recently, PySpark ML streaming tests have been flaky, most likely because batches are not being processed in time. Proposal: replace the use of _ssc_wait (which waits for a fixed amount of time) with a method that waits up to a fixed amount of time but can terminate early based on a termination condition method. With this, we can extend the waiting period (making tests less flaky) while also stopping early when possible (making tests faster on average, which I verified locally).

CC: @mengxr @tdas @freeman-lab
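
For illustration, here is a minimal sketch of such a helper, assuming a simple polling loop (wait_for and poll_interval are placeholder names, not necessarily what this patch uses):

import time

def wait_for(condition, timeout, poll_interval=0.01):
    # Poll until condition() returns True, giving up after `timeout` seconds.
    start = time.time()
    while time.time() - start < timeout:
        if condition():
            return True
        time.sleep(poll_interval)
    return condition()  # one final check at the deadline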

Contributor

There is still a slight possibility that between the last time term_check() is called inside _ssc_wait_checked and the next time it is called in this method, another batch may have been processed, which would fail the test unnecessarily. A better approach would be for _ssc_wait_checked to return True if term_check() has succeeded within the timeout and False otherwise. Then there is no need to check term_check() again.

Member Author

These tests should pass whenever all batches have been processed, so the current setup should be safe. I'm actually thinking of copying the checks so that assertions print more useful error messages. (I don't see a great way to avoid copying the checks if I want both early stopping and useful error messages.)

@SparkQA

SparkQA commented Aug 11, 2015

Test build #40354 has finished for PR 8087 at commit 3fb7c0c.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 11, 2015

Test build #40357 has finished for PR 8087 at commit ef49b2b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member Author

Jenkins test this please

@SparkQA

SparkQA commented Aug 11, 2015

Test build #40495 has finished for PR 8087 at commit ff1ee1b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 11, 2015

Test build #40502 has finished for PR 8087 at commit afbe8b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member Author

Yay it passed! If this looks reasonable, I'll make similar changes for the other streaming ML pyspark tests.

@freeman-lab
Contributor

Nice! I think this is a solid strategy. Maybe in the next round of changes, make that 20.0 (which will presumably be used throughout) a variable shared by all the tests?

@tdas
Contributor

tdas commented Aug 11, 2015

I think you can make a generic equivalent of ScalaTest's eventually in Python. That takes care of failing on timeout and providing a meaningful last error message.

def eventually(timeout, condition, errorMessage):
    # condition: a function that must return a boolean
    # errorMessage: a string, or a function that returns a string; it is
    # invoked if there is a timeout

Then that solves the problem I alluded to earlier about a possible race condition.

@jkbradley
Member Author

@tdas Sure, I can do that. I don't think the race condition matters for ML tests (or if it does, then the test was written incorrectly), but that does clarify semantics. I guess I'll have to duplicate the check code no matter what to get nice error messages.

@jkbradley
Member Author

Actually, I'm going to switch the design to instead:

  • accept a single check method which will use assertions
  • catch AssertionErrors when deciding whether we can terminate
  • throw the last caught AssertionError upon timeout

That will allow us to (a) avoid copying the set of checks and (b) take advantage of the many assertion variants, including approximate equality.

AFAIK, the overhead in catching errors should be negligible compared to the time for the tests. (Correct me if I'm wrong here.)
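
A minimal sketch of this assertion-catching design, assuming a single check function (names here are placeholders, not necessarily the code that ends up in the patch):

import time

def eventually(check, timeout=20.0, poll_interval=0.01):
    # Retry `check` (which raises AssertionError on failure) until it passes
    # or the timeout expires; on timeout, re-raise the last assertion failure.
    start = time.time()
    last_error = None
    while time.time() - start < timeout:
        try:
            check()
            return  # all assertions passed; stop early
        except AssertionError as e:
            last_error = e
        time.sleep(poll_interval)
    if last_error is not None:
        raise last_error
    raise AssertionError("Timed out after %g sec before the check could run" % timeout)

This way the checks are written once, and the assertion message from the last failed attempt surfaces on timeout.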

@jkbradley jkbradley changed the title [WIP] [SPARK-9805] [MLLIB] [PYTHON] [STREAMING] Added _ssc_wait_checked for ml streaming pyspark tests [SPARK-9805] [MLLIB] [PYTHON] [STREAMING] Added _ssc_wait_checked for ml streaming pyspark tests Aug 12, 2015
@SparkQA

SparkQA commented Aug 12, 2015

Test build #40578 has finished for PR 8087 at commit 48f43c8.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member Author

Jenkins test this please

@mengxr
Contributor

mengxr commented Aug 12, 2015

What if the condition requires at least one batch to work correctly? This is not the case for the streaming ML algorithms, but I'm not sure about other streaming unit tests.

@jkbradley
Member Author

Yeah, I should document that. I made sure condition() works for those cases (e.g., by checking the result array's length instead of its values, which might not exist yet).
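
For example, a defensively written condition might guard on the amount of output before inspecting values (results, expected_num_batches, and expected_final_value are hypothetical names for this sketch):

results = []  # appended to by the streaming job as each batch completes
expected_num_batches = 10
expected_final_value = 1.0

def condition():
    # Check the length first; later values may not exist until more
    # batches have been processed.
    if len(results) < expected_num_batches:
        return False
    return abs(results[-1] - expected_final_value) < 0.1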

@SparkQA

SparkQA commented Aug 12, 2015

Test build #40598 has finished for PR 8087 at commit 5e49327.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 12, 2015

Test build #1474 has finished for PR 8087 at commit 3717fc4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member Author

Working on improvements...

@jkbradley
Member Author

OK everyone, I think that should fix things...but we'll wait and see. I changed the logic of eventually to support the two types of tests: those with a simple condition to check that cannot stop early, and those that can stop early once all batches have been processed.
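
Roughly, the two usages look like this, written as they might appear inside a unittest.TestCase method (a sketch only; the flag name catch_assertions and the defaults are illustrative, and models, input_batches, and expected_weights are hypothetical test state):

# Type 1: a plain boolean condition, polled until it returns True or we time out.
def condition():
    return len(models) == len(input_batches)

_eventually(condition, timeout=30.0)

# Type 2: an assertion-based check; AssertionErrors are caught and retried,
# and the last one is re-raised if the timeout is reached.
def condition():
    self.assertAlmostEqual(model.weights[0], expected_weights[0], 1)
    return True

_eventually(condition, timeout=30.0, catch_assertions=True)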

@SparkQA

SparkQA commented Aug 12, 2015

Test build #40678 has finished for PR 8087 at commit 002e838.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 12, 2015

Test build #40688 has finished for PR 8087 at commit 2897833.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Aug 13, 2015

LGTM. @tdas Do you want to make a final pass?

@jkbradley jkbradley changed the title [SPARK-9805] [MLLIB] [PYTHON] [STREAMING] Added _ssc_wait_checked for ml streaming pyspark tests [SPARK-9805] [MLLIB] [PYTHON] [STREAMING] Added _eventually for ml streaming pyspark tests Aug 13, 2015
@jkbradley
Member Author

Increasing the timeouts in the spirit of robustness...and testing again for fun.

@jkbradley
Member Author

But yeah @tdas I'll wait for your final OK

@SparkQA

SparkQA commented Aug 13, 2015

Test build #40816 has finished for PR 8087 at commit a4c3f1e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor

tdas commented Aug 14, 2015

LGTM!

@jkbradley
Member Author

OK, I'll merge this with master and branch-1.5 then. Thanks for reviewing, everyone!

asfgit pushed a commit that referenced this pull request Aug 16, 2015
[SPARK-9805] [MLLIB] [PYTHON] [STREAMING] Added _eventually for ml streaming pyspark tests

Recently, PySpark ML streaming tests have been flaky, most likely because of the batches not being processed in time.  Proposal: Replace the use of _ssc_wait (which waits for a fixed amount of time) with a method which waits for a fixed amount of time but can terminate early based on a termination condition method.  With this, we can extend the waiting period (to make tests less flaky) but also stop early when possible (making tests faster on average, which I verified locally).

CC: mengxr tdas freeman-lab

Author: Joseph K. Bradley <[email protected]>

Closes #8087 from jkbradley/streaming-ml-tests.

(cherry picked from commit 1db7179)
Signed-off-by: Joseph K. Bradley <[email protected]>
@asfgit asfgit closed this in 1db7179 Aug 16, 2015
@jkbradley jkbradley deleted the streaming-ml-tests branch August 16, 2015 01:53
CodingCat pushed a commit to CodingCat/spark that referenced this pull request Aug 17, 2015
[SPARK-9805] [MLLIB] [PYTHON] [STREAMING] Added _eventually for ml streaming pyspark tests

Closes apache#8087 from jkbradley/streaming-ml-tests.