-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKM #15777
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
python/pyspark/ml/classification.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before, this would throw a Py4JJavaError. I think it's slightly better to throw a RuntimeError here as is done in Scala.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think thats generally a good improvement, the Py4J errors are often confusing to end users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this change, we should always throw an exception easy to understand by users.
python/pyspark/ml/tests.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should test that hasSummary returns False when there is no summary available, and that summary throws an exception. The problem I'm having is that I'm not sure how to create this test case. The only way to get a model is by calling fit, which will produce a model with a summary. Calling model._call_java("setSummary", None) doesn't work either. Is there some way that I'm missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might make sense to update setSummary to treat null as empty (e.g. Option instead of Some)) for easy testing.
|
Test build #68167 has finished for PR 15777 at commit
|
|
Test build #68216 has finished for PR 15777 at commit
|
|
Thanks for working on this @sethah - more work towards increased parity is good :) |
|
ping @yanboliang |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per @holdenk's suggestion, I changed the setSummary to use Option.apply which treats null as None. This allows us to exercise the test case for hasSummary when summary == None. This was never tested in Scala either so I added unit tests for both Scala and Python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd more prefer to make the argument as Option[BisectingKMeansSummary] like:
private[clustering] def setSummary(summary: Option[BisectingKMeansSummary]): this.type = {
this.trainingSummary = summary
this
}
And test summary == None by:
model.setSummary(None)
assert(!model.hasSummary)
Since I think setSummary(null) and test whether it existing is very tricky. The type of summary is Option[BisectingKMeansSummary] and with None as default value, so setSummary(None) should make more sense for the scenario that the model does not have summary.
I saw the reason for make this change is that you want to call setSummary at Python side, and Python None would be converted to null in Scala. But I think this is private function, we don't need to run test across Scala and Python, since private function should not be called by users.
|
Test build #68624 has finished for PR 15777 at commit
|
python/pyspark/ml/clustering.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we dont need to expose ClusteringSummary, for in the scala side ClusteringSummary is private in [clustering].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
@sethah I will take a look in a few days. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd more prefer to make the argument as Option[BisectingKMeansSummary] like:
private[clustering] def setSummary(summary: Option[BisectingKMeansSummary]): this.type = {
this.trainingSummary = summary
this
}
And test summary == None by:
model.setSummary(None)
assert(!model.hasSummary)
Since I think setSummary(null) and test whether it existing is very tricky. The type of summary is Option[BisectingKMeansSummary] and with None as default value, so setSummary(None) should make more sense for the scenario that the model does not have summary.
I saw the reason for make this change is that you want to call setSummary at Python side, and Python None would be converted to null in Scala. But I think this is private function, we don't need to run test across Scala and Python, since private function should not be called by users.
python/pyspark/ml/clustering.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
python/pyspark/ml/classification.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this change, we should always throw an exception easy to understand by users.
python/pyspark/ml/clustering.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, should be BisectingKMeansSummary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, good catch!
python/pyspark/ml/tests.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we test this? Actually setSummary is private and it should not generate a model w/o summary in ordinary case. I think we should only test the public API for Python. Further more, if we want to test model w/o summary, we need to write a dummy Scala model w/o summary, and check hasSummary directly at Python side. I think if the Scala function is not public, we may not confirm Scala/Python compatibility, so testing is also not very make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This came out of a suggestion in the other PR to add KMeansSummary. Adding a hasSummary method implies that models can both have and not have summaries. How can we test that hasSummary works properly when we can only exercise one of its test cases?
Edit: hasSummary will return false if the model is saved and then loaded back since the summary is not saved with the model, so it doesn't always return true. We could test the hasSummary method by calling save/load but that seems expensive just to test a simple function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, after loading a saved model, the summary should be None and hasSummary return false. I think this is the correct test route, although with some extra cost. What about add hasSummary test at save/load doc test? Then it should not need extra cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do it, though typically the doc tests are for things that we want to test that also illustrate functionality to the users. And @jkbradley seemed against adding it as a doc test here. I'll add it for now and we can revert it if we decide that's best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it makes sense to add summary related doc tests for algorithms to illustrate the output of summary. So one more line to check hasSummary does not seam to have much impact. @jkbradley What's your opinion? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yanboliang I switched it up. Let me know what you think
|
Test build #68722 has finished for PR 15777 at commit
|
|
Test build #68789 has finished for PR 15777 at commit
|
b6062b9 to
428348d
Compare
|
Test build #68794 has finished for PR 15777 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sethah Only last comment, otherwise, LGTM. I'd like to get this in before 2.1. Thanks.
| val copied = copyValues(new BisectingKMeansModel(uid, parentModel), extra) | ||
| if (trainingSummary.isDefined) copied.setSummary(trainingSummary.get) | ||
| copied.setParent(this.parent) | ||
| copied.setSummary(trainingSummary).setParent(this.parent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks better. Could you make the change for Scala LiR, LoR, GLM and KMeans as well? I think they should be consistent. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. I also added tests. Thanks for reviewing!
|
Test build #68919 has finished for PR 15777 at commit
|
|
LGTM, merged into master and branch-2.1. Thanks! |
…d BKM ## What changes were proposed in this pull request? Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in pyspark. ## How was this patch tested? Unit tests. Author: sethah <[email protected]> Closes #15777 from sethah/pyspark_cluster_summaries. (cherry picked from commit e811fbf) Signed-off-by: Yanbo Liang <[email protected]>
|
@yanboliang I noticed the JIRA is still pending. Are there follow-up tasks? |
|
No follow-up, forgetting to close it. Thanks for reminding. |
…d BKM ## What changes were proposed in this pull request? Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in pyspark. ## How was this patch tested? Unit tests. Author: sethah <[email protected]> Closes apache#15777 from sethah/pyspark_cluster_summaries.
What changes were proposed in this pull request?
Add model summary APIs for
GaussianMixtureModelandBisectingKMeansModelin pyspark.How was this patch tested?
Unit tests.