-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-17748][ML] Minor cleanups to one-pass linear regression with elastic net #15779
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@sethah @yanboliang I just saw your PRs for SPARK-17748. Awesome change. I just saw a few nits along the way. The only major item is making SingularMatrixException private ml. This would be the first public Exception type we have in MLlib, I believe. If it is to be public, I'd prefer it live in spark.ml.linalg, so we could move it there instead of making it private. Also, I'm not a big fan of using Exceptions to handle logic. I'd prefer to change the use of CholeskySolver to pass the failure code, rather than relying on exceptions. If that sounds good to you, then I'll make a follow up JIRA for it. Thanks! |
|
Test build #68174 has finished for PR 15779 at commit
|
|
+1 on removing the use of exceptions. I thought it was a bit of an awkward solution to begin with. Thanks a lot for this pr, I will take a look soon. |
sethah
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM other than the two comments.
| * (singular). | ||
| */ | ||
| class SingularMatrixException(message: String, cause: Throwable) | ||
| private[ml] class SingularMatrixException(message: String, cause: Throwable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private[spark] as Jenkins was nice enough to point out
| * The default value is "auto" which means that the solver algorithm is | ||
| * selected automatically. | ||
| * | ||
| * The Normal Equation solver is limited to a few thousand features; when needed, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a bit misleading? If I set solver = "normal" and my dataset contains >4096 features, an exception is thrown instead of automatically falling back to iterative methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 @sethah It will only fall back when solver = "auto", we should clarify it more clear.
|
@jkbradley @sethah The idea of using failure code rather than exception sounds good to me. I thought users could get the exception and infer the cause of failure from it. After rethinking it, I think it's an expert exception that few users would like to intercept it, so using failure code should make more sense. Thanks for the follow-up work. |
|
I agree that exceptions can become, quite surprisingly, part of a public API. It has to be done with care. I am not sure I agree that failure codes are better; they're exceptions that don't use a language API and are even more ignorable. If it's simply status, that makes sense; if it's indicating failure, you want it to be hard-er to ignore. |
|
I should have clarified: I meant for us to use the failure code internally. If the linear algebra ops become more public, then we may need to consider better ways to return result codes. I made a JIRA: [https://issues.apache.org/jira/browse/SPARK-18341] |
|
Test build #68312 has finished for PR 15779 at commit
|
yanboliang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only the last nit, otherwise, LGTM. Thanks.
| * features to at most this number. | ||
| */ | ||
| @Since("2.1.0") | ||
| val MAX_FEATURES_FOR_NORMAL_SOLVER: Int = WeightedLeastSquares.MAX_NUM_FEATURES |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MAX_NUM_FEATURES_FOR_NORMAL_SOLVER should be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer the current name. MAX_NUM_.... is longer but not any clearer to me. I don't feel too strongly about it though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I left out "NUM_" for conciseness.
|
|
||
| /** | ||
| * When using [[LinearRegression.solver]] == "normal", the solver must limit the number of | ||
| * features to at most this number. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor suggestion: perhaps an explanation as to why might be useful. "must limit the number of features to at most this number, because the entire covariance matrix X^T^X will be collected on the driver. This limit helps prevent memory overflow errors."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
|
LGTM |
|
Test build #68353 has finished for PR 15779 at commit
|
|
Thanks! I'll merge this with master and branch-2.1 |
…lastic net ## What changes were proposed in this pull request? * Made SingularMatrixException private ml * WeightedLeastSquares: Changed to allow tol >= 0 instead of only tol > 0 ## How was this patch tested? existing tests Author: Joseph K. Bradley <[email protected]> Closes apache#15779 from jkbradley/wls-cleanups. (cherry picked from commit 26e1c53) Signed-off-by: Joseph K. Bradley <[email protected]>
|
Linking commit for branch-2.1 since it hasn't linked automatically: [https://github.com/apache/spark/commit/21bbf94b41fbd193e370a3820131e449aaf0e3db] |
…lastic net ## What changes were proposed in this pull request? * Made SingularMatrixException private ml * WeightedLeastSquares: Changed to allow tol >= 0 instead of only tol > 0 ## How was this patch tested? existing tests Author: Joseph K. Bradley <[email protected]> Closes apache#15779 from jkbradley/wls-cleanups.
What changes were proposed in this pull request?
How was this patch tested?
existing tests