-
Notifications
You must be signed in to change notification settings - Fork 28.9k
SPARK-5018 [MLlib] [WIP] Make MultivariateGaussian public #3923
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
MultivariateGaussian.scala - Made class public and exposed public methods leveraging MLlib vectors and matrices. Added logpdf method providing log-density calculation. MultivariateGaussianSuite.scala - Test are now performed through the public methods.
|
Test build #25135 has started for PR 3923 at commit
|
|
Test build #25135 has finished for PR 3923 at commit
|
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add a check here to make sure that the dimensions of mu, sigma match up.
|
@tgaloppo It's looking good. Mostly small comments, except for the first 2. Your call about naming mu,sigma vs. mean,covariance. |
Modified calculations to avoid log(pow(x,y)) calculations
|
Test build #25187 has started for PR 3923 at commit
|
|
Test build #25190 has started for PR 3923 at commit
|
|
@jkbradley Thanks! I have made the requested changes. Are there any other public methods that you think would be useful to add at this time? |
|
Test build #25187 has finished for PR 3923 at commit
|
|
Test PASSed. |
|
Test build #25190 has finished for PR 3923 at commit
|
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine, but you can also use require for checking input arguments (since that throws IllegalArgumentException and is a bit shorter).
|
@tgaloppo Two small comments, but they are just superficial. After those, I think it will be ready. I don't see a need to add other methods for now. Thanks! |
|
@jkbradley Changes made. Thanks! |
|
Test build #25261 has started for PR 3923 at commit
|
|
Test build #25261 has finished for PR 3923 at commit
|
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove space after { and before }.
Moved MultivariateGaussian (and test suite) from stat.impl to stat.distribution (required updates in GaussianMixture{EM,Model}.scala)
Marked MultivariateGaussian as @DeveloperAPI
Fixed style error
|
Test build #25339 has started for PR 3923 at commit
|
|
Test build #25339 has finished for PR 3923 at commit
|
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sort import alphabetically
|
@tgaloppo Besides inline comments, please resolve conflicts with the master branch. The patch does not merge cleanly. |
Conflicts: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureEM.scala
|
Test build #25360 has started for PR 3923 at commit
|
|
I have made the requested changes and resolved the merge conflicts. Question: MutlivariateGuassian now keeps a private Breeze version of the mean vector rather than convert the MLlib version to a Breeze vector with each call to {log}pdf(). Is this worthwhile? Or is the .toBreeze.toDenseVector sequence negligible enough that this is unnecessary? CC: @mengxr @jkbradley |
|
Test build #25360 has finished for PR 3923 at commit
|
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Organize imports (scala/java come first)
|
Thanks for the updates! LGTM except for organizing the imports in my 1 comment. As far as I understand, the conversions to and from Breeze should be efficient since they mainly involve copying references, not the underlying data (as long as the representations remain the same: dense to dense, or sparse to sparse). |
|
Test build #25367 has started for PR 3923 at commit
|
|
Thanks, @jkbradley |
|
Test build #25367 has finished for PR 3923 at commit
|
|
Test PASSed. |
|
Merged into master. Thanks! |
Moving MutlivariateGaussian from private[mllib] to public. The class uses Breeze vectors internally, so this involves creating a public interface using MLlib vectors and matrices.
This initial commit provides public construction, accessors for mean/covariance, density and log-density.
Other potential methods include entropy and sample generation.