[SPARK-29144][ML] Binarizer handle sparse vectors incorrectly with negative threshold #25829
|
Test build #110902 has finished for PR 25829 at commit
|
|
Just noticed that existing ML algorithms deal with sparse datasets differently from scikit-learn. I will follow ML’s way and update this PR tomorrow. |
|
I think the right answer is to return 1 for all of the implicit 0 entries when the threshold is < 0. Yes it makes it dense, but it's the right answer. |
|
Test build #110962 has finished for PR 25829 at commit
|
|
retest this please |
|
Test build #110971 has finished for PR 25829 at commit
|
      Vectors.sparse(data.size, indices.result(), values.result()).compressed
    case _: VectorUDT if td < 0 =>
      this.logWarning(s"Binarization operations on sparse dataset with negative threshold " +
I think this is OK. It will almost always be dense but not always. The warning is spurious if the input is dense already, but, a negative threshold is rare... I think. I'm trying to recall whether this is ever applied to outputs of classifiers like SVMs that output [-1, 1].
|
|
|
I have been ignoring those. @dongjoon-hyun, is the GitHub action for JDK 11 above supposed to be working? I've seen it always fail with something like |
|
@zhengruifeng and @srowen . |
|
I tested this PR manually on JDK11. There is no problem for this PR~ We can ignore the above failure. |
|
Merged to master |
|
Thank you @srowen @dongjoon-hyun ! |
What changes were proposed in this pull request?
If threshold < 0, convert implicit 0 entries to 1, although this will break sparsity.
Why are the changes needed?
If threshold < 0, the current implementation handles sparse vectors incorrectly. See JIRA SPARK-29144 and scikit-learn's Binarizer ('Threshold may not be less than 0 for operations on sparse matrices.') for details.
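To illustrate the intended semantics, here is a minimal Spark-free sketch. It is not the code from this PR: `SparseVec` and `BinarizeSketch` are hypothetical names made up for the example, and it assumes the usual Binarizer rule that values strictly greater than the threshold map to 1.0 and all others to 0.0.

```scala
// Hypothetical stand-in for a sparse vector: only non-zero entries are stored.
final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double])

object BinarizeSketch {
  // Binarize every position, including the implicit zeros that are not stored.
  def binarize(v: SparseVec, threshold: Double): Array[Double] = {
    // An implicit 0.0 maps to 1.0 exactly when 0.0 > threshold, i.e. threshold < 0.
    val fill = if (0.0 > threshold) 1.0 else 0.0
    val out = Array.fill(v.size)(fill)
    // Explicitly stored entries are compared against the threshold one by one.
    v.indices.zip(v.values).foreach { case (i, x) =>
      out(i) = if (x > threshold) 1.0 else 0.0
    }
    out
  }
}
```

For example, `BinarizeSketch.binarize(SparseVec(4, Array(1, 3), Array(-2.0, 3.0)), -1.0)` yields `Array(1.0, 0.0, 1.0, 1.0)`: the two implicit zeros become 1.0, so the result is mostly dense. Looking only at the stored entries would leave those positions at 0.0, which is the incorrect behavior described in the JIRA.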
Does this PR introduce any user-facing change?
no
How was this patch tested?
Added a test suite.