@WeichenXu123
Contributor

What changes were proposed in this pull request?

Add Structured Streaming tests for all Models/Transformers in spark.ml.classification.

How was this patch tested?

N/A

@WeichenXu123 changed the title from "[SPARK-22927][ML][TESTS] ML test for structured streaming: ml.classification" to "[SPARK-22882][ML][TESTS] ML test for structured streaming: ml.classification" on Dec 30, 2017
@SparkQA

SparkQA commented Dec 30, 2017

Test build #85541 has finished for PR 20121 at commit dbb04f6.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class DecisionTreeClassifierSuite extends MLTest with DefaultReadWriteTest
      • class GBTClassifierSuite extends MLTest with DefaultReadWriteTest
      • class LinearSVCSuite extends MLTest with DefaultReadWriteTest
      • class LogisticRegressionSuite extends MLTest with DefaultReadWriteTest
      • class MultilayerPerceptronClassifierSuite extends MLTest with DefaultReadWriteTest
      • class NaiveBayesSuite extends MLTest with DefaultReadWriteTest
      • class OneVsRestSuite extends MLTest with DefaultReadWriteTest
      • class RandomForestClassifierSuite extends MLTest with DefaultReadWriteTest

@WeichenXu123 force-pushed the ml_stream_test_classification branch from dbb04f6 to daedd8b on December 30, 2017 08:06
@SparkQA

SparkQA commented Dec 30, 2017

Test build #85543 has finished for PR 20121 at commit daedd8b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class DecisionTreeClassifierSuite extends MLTest with DefaultReadWriteTest
      • class GBTClassifierSuite extends MLTest with DefaultReadWriteTest
      • class LinearSVCSuite extends MLTest with DefaultReadWriteTest
      • class LogisticRegressionSuite extends MLTest with DefaultReadWriteTest
      • class MultilayerPerceptronClassifierSuite extends MLTest with DefaultReadWriteTest
      • class NaiveBayesSuite extends MLTest with DefaultReadWriteTest
      • class OneVsRestSuite extends MLTest with DefaultReadWriteTest
      • class RandomForestClassifierSuite extends MLTest with DefaultReadWriteTest

@SparkQA

SparkQA commented Jan 12, 2018

Test build #86005 has finished for PR 20121 at commit f9125a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@smurakozi left a comment

I think this change is ok except for a couple of nits.

However, I found two places where transform is still used and that could potentially be changed to test streaming as well:
1. https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala#L2565
   Here one model is used to initialize the other, and their results are zipped and compared. It seems to me that converting this to be streaming-friendly would need somewhat more complicated logic.
2. https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/ProbabilisticClassifierSuite.scala#L122
   This is called from many other suites.

Were they skipped intentionally?

Contributor

Why were these transformations and checks removed?

Contributor Author

This test code path is already covered by ProbabilisticClassifierSuite.testPredictMethods.

Contributor

nit: unused import, could be removed

Contributor

nit: unused import, could be removed, just like SparkFunSuite a couple lines above

Contributor

dataset is always a DataFrame in this suite. If it were declared as such, there would be no need to call toDF() every time.

Contributor Author

Isn't the type of dataset Dataset[_]?

Contributor

@transient var dataset: Dataset[_] = _

Contributor Author

Yes. A Dataset[_] cannot be passed as the testTransformer parameter; it requires a DataFrame.
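
The type mismatch discussed in this thread can be illustrated without Spark. The following is a minimal sketch under stated assumptions: Row, Dataset, and countRows are toy stand-ins invented for illustration, not Spark's classes, and only the alias DataFrame = Dataset[Row] mirrors how Spark defines the type.

```scala
// Toy model of the type relationship; these classes are NOT Spark's.
object DatasetTypes {
  final case class Row(values: Seq[Any])

  class Dataset[T](val rows: Seq[T]) {
    // Loosely mirrors the idea of Dataset.toDF(): view every element as a Row.
    def toDF(): Dataset[Row] = new Dataset(rows.map {
      case r: Row => r
      case other  => Row(Seq(other))
    })
  }

  // Spark defines DataFrame as an alias of Dataset[Row]; the toy does the same.
  type DataFrame = Dataset[Row]

  // Hypothetical helper that, like testTransformer, insists on a DataFrame.
  def countRows(df: DataFrame): Int = df.rows.size

  def main(args: Array[String]): Unit = {
    val dataset: Dataset[_] =
      new Dataset(Seq(Row(Seq(1.0, "a")), Row(Seq(0.0, "b"))))
    // countRows(dataset)  // would not compile: Dataset[_] is not Dataset[Row]
    println(countRows(dataset.toDF()))
  }
}
```

Declaring the field as a DataFrame from the start would make the conversion unnecessary at each call site, which is the point raised in the review.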

Contributor

nit: import org.apache.spark.mllib.util.MLlibTestSparkContext is unused at line 32

@WeichenXu123 force-pushed the ml_stream_test_classification branch from f9125a9 to 6d59c5b on March 2, 2018 11:51
@WeichenXu123
Contributor Author

@smurakozi I have addressed your comments. Thanks!

@SparkQA

SparkQA commented Mar 2, 2018

Test build #87884 has finished for PR 20121 at commit 6d59c5b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@jkbradley left a comment

I made a review pass. Just 2 small comments.

assert(p1 === p2)
val binaryExpected = model1.transform(smallBinaryDataset).select("prediction").collect()
  .map(_.getDouble(0))
for (model <- Seq(model1, model2)) {
Member

Why test model1 against itself?

Contributor Author

The line val binaryExpected = model1.transform(smallBinaryDataset).select("prediction").collect().map(_.getDouble(0)) is used to generate the expected predictions, binaryExpected.
The test then runs model1 and model2 through both df.transform and streamDF.transform and asserts that each result equals binaryExpected.
Otherwise we would need to hardcode the binaryExpected values in the test.

Member

My thought is that testing binaryExpected (from model1) against model2 would already test the 2 things we care about:

  • batch vs streaming prediction
  • initial model

I'll just merge this though since it's not a big deal (just a bit longer testing time).
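
The two variants discussed in this thread can be sketched without Spark. This is a toy illustration under stated assumptions, not the code in the PR: a "model" is reduced to a plain function, and batchPredict and streamPredict are hypothetical stand-ins for df.transform and streamDF.transform, with streaming simulated as fixed-size micro-batches.

```scala
// Spark-free sketch; every name here is hypothetical and stands in for Spark types.
object InitialModelCheck {
  // A "model" is reduced to a function from a feature value to a predicted label.
  type Model = Double => Double

  // Stand-in for batch scoring (model.transform on a static DataFrame).
  def batchPredict(model: Model, data: Seq[Double]): Seq[Double] =
    data.map(model)

  // Stand-in for streaming scoring: rows arrive in micro-batches of a fixed size.
  def streamPredict(model: Model, data: Seq[Double], batchSize: Int): Seq[Double] =
    data.grouped(batchSize).flatMap(_.map(model)).toVector

  def main(args: Array[String]): Unit = {
    val data = Seq(-2.0, -0.5, 0.3, 1.7)
    val model1: Model = x => if (x > 0.0) 1.0 else 0.0
    // model2 plays the role of a model fitted with model1 as its initial model.
    val model2: Model = x => if (x > 0.0) 1.0 else 0.0

    // Expected predictions come from model1 in batch mode, as in the test.
    val expected = batchPredict(model1, data)

    // Variant described by the author: both models, both modes, against `expected`.
    for (model <- Seq(model1, model2)) {
      assert(batchPredict(model, data) == expected)
      assert(streamPredict(model, data, batchSize = 2) == expected)
    }

    // Variant suggested in review: checking model2 alone against `expected`.
    assert(streamPredict(model2, data, batchSize = 2) == expected)
    println("all checks passed")
  }
}
```

In this toy, the model1 iteration of the loop compares model1's batch output with itself, which is the redundancy the reviewer points out; the streaming check for model1 still adds a little coverage at the cost of extra run time.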

assert(p1 === p2)
val multinomialExpected = model3.transform(smallMultinomialDataset).select("prediction")
  .collect().map(_.getDouble(0))
for (model <- Seq(model3, model4)) {
Member

ditto: why test model3 against itself?

Contributor Author

Same reason as above.

@jkbradley
Member

LGTM
Merging with master and branch-2.3
Thanks!

@asfgit asfgit closed this in 98a5c0a Mar 5, 2018
asfgit pushed a commit that referenced this pull request Mar 5, 2018
…ication

## What changes were proposed in this pull request?

adding Structured Streaming tests for all Models/Transformers in spark.ml.classification

## How was this patch tested?

N/A

Author: WeichenXu

Closes #20121 from WeichenXu123/ml_stream_test_classification.

(cherry picked from commit 98a5c0a)
Signed-off-by: Joseph K. Bradley
mgaido91 pushed a commit to mgaido91/spark that referenced this pull request Mar 5, 2018
@WeichenXu123 WeichenXu123 deleted the ml_stream_test_classification branch March 6, 2018 03:11
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018