-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-22882][ML][TESTS] ML test for structured streaming: ml.classification #20121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-22882][ML][TESTS] ML test for structured streaming: ml.classification #20121
Conversation
|
Test build #85541 has finished for PR 20121 at commit
|
dbb04f6 to
daedd8b
Compare
|
Test build #85543 has finished for PR 20121 at commit
|
|
Test build #86005 has finished for PR 20121 at commit
|
smurakozi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this change is ok except for a couple of nits.
However, I found two places where transform was used and which could be potentially changed to test streaming too:
1.
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala#L2565
Here one model is used to initialize the other and their results are zipped and compared - it seems to me that it would need a bit more complicated logic to convert to be streaming-friendly.
2.
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/ProbabilisticClassifierSuite.scala#L122
This is called from many other suites.
Were they skipped intentionally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why were these transformations and checks removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These testing code path has been covered by ProbabilisticClassifierSuite.testPredictMethods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: unused import, could be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: unused import, could be removed, just like SparkFunSuite a couple lines above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dataSet is always a dataFrame in this suite. If it was declared as such there would be no need to always call toDF()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type of dataset is Dataset[_] ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@transient var dataset: Dataset[_] = _
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Dataset[_] cannot match the testTransformer param. It require DataFrame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: import org.apache.spark.mllib.util.MLlibTestSparkContext
is unused at line 32
f9125a9 to
6d59c5b
Compare
|
@smurakozi Address your comments. Thanks! |
|
Test build #87884 has finished for PR 20121 at commit
|
jkbradley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a review pass. Just 2 small comments.
| assert(p1 === p2) | ||
| val binaryExpected = model1.transform(smallBinaryDataset).select("prediction").collect() | ||
| .map(_.getDouble(0)) | ||
| for (model <- Seq(model1, model2)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why test model1 against itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line code val binaryExpected = model1.transform(smallBinaryDataset).select("prediction").collect().map(_.getDouble(0)) used to generate the binaryExpected dataset.
And then test model1/model2 on both df.transform and streamDF.transform and compare result to binaryExpected (assert equal).
Otherwise we need to hardcoding the binaryExpected dataset in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thought is that testing binaryExpected (from model1) against model2 would already test the 2 things we care about:
- batch vs streaming prediction
- initial model
I'll just merge this though since it's not a big deal (just a bit longer testing time).
| assert(p1 === p2) | ||
| val multinomialExpected = model3.transform(smallMultinomialDataset).select("prediction") | ||
| .collect().map(_.getDouble(0)) | ||
| for (model <- Seq(model3, model4)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto: why test model3 against itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same reason above.
|
LGTM |
…ication ## What changes were proposed in this pull request? adding Structured Streaming tests for all Models/Transformers in spark.ml.classification ## How was this patch tested? N/A Author: WeichenXu <[email protected]> Closes #20121 from WeichenXu123/ml_stream_test_classification. (cherry picked from commit 98a5c0a) Signed-off-by: Joseph K. Bradley <[email protected]>
…ication ## What changes were proposed in this pull request? adding Structured Streaming tests for all Models/Transformers in spark.ml.classification ## How was this patch tested? N/A Author: WeichenXu <[email protected]> Closes apache#20121 from WeichenXu123/ml_stream_test_classification.
…ication ## What changes were proposed in this pull request? adding Structured Streaming tests for all Models/Transformers in spark.ml.classification ## How was this patch tested? N/A Author: WeichenXu <[email protected]> Closes apache#20121 from WeichenXu123/ml_stream_test_classification. (cherry picked from commit 98a5c0a) Signed-off-by: Joseph K. Bradley <[email protected]>
What changes were proposed in this pull request?
adding Structured Streaming tests for all Models/Transformers in spark.ml.classification
How was this patch tested?
N/A