[SPARK-22882][ML][TESTS] ML test for structured streaming: ml.classification #20121

WeichenXu123 · 2017-12-30T07:25:30Z

What changes were proposed in this pull request?

Adds Structured Streaming tests for all Models/Transformers in spark.ml.classification.

How was this patch tested?

N/A

SparkQA · 2017-12-30T07:36:04Z

Test build #85541 has finished for PR 20121 at commit dbb04f6.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class DecisionTreeClassifierSuite extends MLTest with DefaultReadWriteTest
class GBTClassifierSuite extends MLTest with DefaultReadWriteTest
class LinearSVCSuite extends MLTest with DefaultReadWriteTest
class LogisticRegressionSuite extends MLTest with DefaultReadWriteTest
class MultilayerPerceptronClassifierSuite extends MLTest with DefaultReadWriteTest
class NaiveBayesSuite extends MLTest with DefaultReadWriteTest
class OneVsRestSuite extends MLTest with DefaultReadWriteTest
class RandomForestClassifierSuite extends MLTest with DefaultReadWriteTest

SparkQA · 2017-12-30T09:12:14Z

Test build #85543 has finished for PR 20121 at commit daedd8b.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class DecisionTreeClassifierSuite extends MLTest with DefaultReadWriteTest
class GBTClassifierSuite extends MLTest with DefaultReadWriteTest
class LinearSVCSuite extends MLTest with DefaultReadWriteTest
class LogisticRegressionSuite extends MLTest with DefaultReadWriteTest
class MultilayerPerceptronClassifierSuite extends MLTest with DefaultReadWriteTest
class NaiveBayesSuite extends MLTest with DefaultReadWriteTest
class OneVsRestSuite extends MLTest with DefaultReadWriteTest
class RandomForestClassifierSuite extends MLTest with DefaultReadWriteTest

SparkQA · 2018-01-12T02:11:38Z

Test build #86005 has finished for PR 20121 at commit f9125a9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

smurakozi

I think this change is ok except for a couple of nits.

However, I found two places where transform is used that could potentially be changed to test streaming too:
1. https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala#L2565
   Here one model is used to initialize the other, and their results are zipped and compared. It seems to me that converting this to be streaming-friendly would need somewhat more complicated logic.
2. https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/ProbabilisticClassifierSuite.scala#L122
   This is called from many other suites.

Were they skipped intentionally?

smurakozi · 2018-01-17T12:33:05Z

mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala

Why were these transformations and checks removed?

This code path is already covered by ProbabilisticClassifierSuite.testPredictMethods.

smurakozi · 2018-01-17T13:39:07Z

.../src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala

nit: unused import, could be removed

smurakozi · 2018-01-17T13:40:25Z

mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala

nit: unused import, could be removed, just like SparkFunSuite a couple lines above

smurakozi · 2018-01-17T13:47:00Z

.../src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala

dataset is always a DataFrame in this suite. If it were declared as such, there would be no need to call toDF() every time.

Is the type of dataset Dataset[_]?

@transient var dataset: Dataset[_] = _

Yes. Dataset[_] does not match the testTransformer parameter, which requires a DataFrame.
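The typing issue in the exchange above can be illustrated with a minimal sketch. It assumes only the public Spark SQL API; countRows is a hypothetical stand-in for any helper whose parameter is typed as DataFrame, and the column names are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object DatasetVsDataFrameSketch {
  // Hypothetical helper: like the test helper discussed above, it accepts only a DataFrame.
  def countRows(df: DataFrame): Long = df.count()

  def run(): Long = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("dataset-vs-dataframe")
      .getOrCreate()
    import spark.implicits._
    try {
      // Declared with an existential type, as the suite declares its field.
      val dataset: Dataset[_] = Seq((0.0, 1.0), (1.0, 0.0)).toDF("label", "x")

      // countRows(dataset)  // does not compile: Dataset[_] is not Dataset[Row]
      countRows(dataset.toDF()) // toDF() converts back to a DataFrame
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Declaring the field as DataFrame instead would make the toDF() call unnecessary, which is the point of the nit.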

smurakozi · 2018-01-17T14:06:39Z

mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala

nit: import org.apache.spark.mllib.util.MLlibTestSparkContext is unused at line 32

WeichenXu123 · 2018-03-02T11:51:56Z

@smurakozi Addressed your comments. Thanks!

SparkQA · 2018-03-02T12:57:34Z

Test build #87884 has finished for PR 20121 at commit 6d59c5b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley

I made a review pass. Just 2 small comments.

jkbradley · 2018-03-02T19:57:03Z

mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala

-      assert(p1 === p2)
+    val binaryExpected = model1.transform(smallBinaryDataset).select("prediction").collect()
+      .map(_.getDouble(0))
+    for (model <- Seq(model1, model2)) {


Why test model1 against itself?

The line val binaryExpected = model1.transform(smallBinaryDataset).select("prediction").collect().map(_.getDouble(0)) generates the expected predictions.
The test then runs model1 and model2 through both df.transform and streamDF.transform and asserts that each result equals binaryExpected.
Otherwise we would need to hardcode the expected predictions in the test.

My thought is that testing binaryExpected (from model1) against model2 would already test the 2 things we care about: batch vs streaming prediction, and the initial model.

I'll just merge this though since it's not a big deal (just a bit longer testing time).
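For readers unfamiliar with the batch vs streaming comparison discussed in this thread, here is a rough, self-contained sketch of the idea. It is not the code from this PR: it assumes Spark 2.3 to 3.x with spark-mllib on the classpath, uses MemoryStream and the in-memory sink directly rather than the MLTest helpers, fits a single model on made-up data, and the query name streamed_predictions is arbitrary.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream

object BatchVsStreamingSketch {
  def run(): Boolean = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("batch-vs-streaming")
      .getOrCreate()
    import spark.implicits._
    // MemoryStream needs an implicit SQLContext (assumption: Spark 2.3 to 3.x API).
    implicit val sqlContext: SQLContext = spark.sqlContext

    try {
      // Made-up training data, for illustration only.
      val rows: Seq[(Double, Vector)] = Seq(
        (0.0, Vectors.dense(0.0, 1.0)),
        (1.0, Vectors.dense(1.0, 0.0)),
        (0.0, Vectors.dense(0.1, 0.9)),
        (1.0, Vectors.dense(0.9, 0.2)))
      val df = rows.toDF("label", "features")
      val model = new LogisticRegression().setMaxIter(10).setRegParam(0.1).fit(df)

      // Expected predictions come from the ordinary batch transform.
      val expected = model.transform(df).select("prediction")
        .collect().map(_.getDouble(0)).sorted

      // Feed the same rows through a streaming DataFrame and the same model.
      val input = MemoryStream[(Double, Vector)]
      val streamDF = input.toDS().toDF("label", "features")
      val query = model.transform(streamDF).select("prediction")
        .writeStream
        .format("memory")
        .queryName("streamed_predictions")
        .outputMode("append")
        .start()
      input.addData(rows)
      query.processAllAvailable()
      val streamed = spark.table("streamed_predictions")
        .collect().map(_.getDouble(0)).sorted
      query.stop()

      // Compared after sorting, since row order across partitions is not relied on here.
      expected.sameElements(streamed)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Computing the expected values from a batch transform, as in this sketch, avoids hardcoding predictions in the test, which is the trade-off mentioned in the reply above.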

jkbradley · 2018-03-02T19:57:27Z

mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala

-      assert(p1 === p2)
+    val multinomialExpected = model3.transform(smallMultinomialDataset).select("prediction")
+      .collect().map(_.getDouble(0))
+    for (model <- Seq(model3, model4)) {


ditto: why test model3 against itself?

Same reason as above.

jkbradley · 2018-03-05T18:49:20Z

LGTM
Merging with master and branch-2.3
Thanks!

…ication ## What changes were proposed in this pull request? adding Structured Streaming tests for all Models/Transformers in spark.ml.classification ## How was this patch tested? N/A Author: WeichenXu <[email protected]> Closes #20121 from WeichenXu123/ml_stream_test_classification. (cherry picked from commit 98a5c0a) Signed-off-by: Joseph K. Bradley <[email protected]>


WeichenXu123 changed the title ~~[SPARK-22927][ML][TESTS] ML test for structured streaming: ml.classification~~ [SPARK-22882][ML][TESTS] ML test for structured streaming: ml.classification Dec 30, 2017

WeichenXu123 force-pushed the ml_stream_test_classification branch from dbb04f6 to daedd8b Compare December 30, 2017 08:06

smurakozi reviewed Jan 17, 2018


WeichenXu123 added 5 commits March 2, 2018 17:02

init pr (9dbb1b1)
update ovs test (8de207d)
address smurakozi comments (fb474c0)
add stream test for testPredictMethods (1335a91)
add streaming test for lor set initial model (6d59c5b)

WeichenXu123 force-pushed the ml_stream_test_classification branch from f9125a9 to 6d59c5b Compare March 2, 2018 11:51

jkbradley reviewed Mar 2, 2018


asfgit closed this in 98a5c0a Mar 5, 2018

WeichenXu123 deleted the ml_stream_test_classification branch March 6, 2018 03:11

WeichenXu123 mentioned this pull request Mar 6, 2018

[SPARK-22915][MLlib] Streaming tests for spark.ml.feature, from N to Z #20686

Closed
