[SPARK-29144][ML] Binarizer handle sparse vectors incorrectly with negative threshold #25829

zhengruifeng · 2019-09-18T10:32:59Z

What changes were proposed in this pull request?

If threshold < 0, convert each implicit 0 to 1, although this will break sparsity.

Why are the changes needed?

If threshold < 0, the current implementation handles sparse vectors incorrectly.
See JIRA SPARK-29144 and Scikit-Learn's Binarizer ('Threshold may not be less than 0 for operations on sparse matrices.') for details.
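
To illustrate the problem, here is a minimal sketch in plain Scala. It does not use Spark's classes: `Sparse`, `binarizeDense` and `binarizeStoredOnly` are hypothetical names made up for this example, and the "stored only" path is an assumption about how a sparse code path can go wrong, not a copy of the Binarizer source. It assumes the usual rule that values greater than the threshold become 1.0.

```scala
// Hypothetical stand-in for a sparse vector: only the listed entries are stored,
// every other position is an implicit 0.0. Not Spark's SparseVector.
final case class Sparse(size: Int, indices: Array[Int], values: Array[Double])

object NegativeThresholdDemo {
  // Reference semantics: every entry, stored or implicit, is compared to the threshold.
  def binarizeDense(v: Sparse, threshold: Double): Array[Double] = {
    val dense = new Array[Double](v.size) // implicit entries start as 0.0
    v.indices.zip(v.values).foreach { case (i, x) => dense(i) = x }
    dense.map(x => if (x > threshold) 1.0 else 0.0)
  }

  // Assumed faulty path: only stored entries are visited, so implicit zeros stay 0.0.
  def binarizeStoredOnly(v: Sparse, threshold: Double): Array[Double] = {
    val out = new Array[Double](v.size)
    v.indices.zip(v.values).foreach { case (i, x) =>
      if (x > threshold) out(i) = 1.0
    }
    out
  }

  def main(args: Array[String]): Unit = {
    val v = Sparse(4, Array(1, 3), Array(-2.0, 3.0)) // logically [0.0, -2.0, 0.0, 3.0]
    println(binarizeDense(v, -0.5).mkString(","))      // prints 1.0,0.0,1.0,1.0
    println(binarizeStoredOnly(v, -0.5).mkString(",")) // prints 0.0,0.0,0.0,1.0
  }
}
```

With a threshold of -0.5 the two results differ at positions 0 and 2, which are exactly the implicit zeros. With a non-negative threshold the two paths agree, which is why the issue only shows up for negative thresholds.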

Does this PR introduce any user-facing change?

no

How was this patch tested?

added testsuite

SparkQA · 2019-09-18T11:50:35Z

Test build #110902 has finished for PR 25829 at commit cb52a09.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2019-09-18T12:10:39Z

Just noticed that existing ML algorithms deal with sparse datasets differently from scikit-learn:
scikit-learn refuses to break the data sparsity and throws an exception, while ML converts the sparse vector to a dense one.

I will follow ML's way and update this PR tomorrow.

srowen · 2019-09-18T14:38:38Z

I think the right answer is to return 1 for all of the implicit 0 entries when the threshold is < 0. Yes it makes it dense, but it's the right answer.
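
A minimal sketch of that behavior, in plain Scala rather than Spark's vector classes. `SparseVec` and `binarize` are hypothetical names for illustration only, and this is not the code merged in this PR; it only shows the idea of pre-filling the output with the value an implicit 0 should map to.

```scala
// Hypothetical stand-in for a sparse vector; not Spark's SparseVector class.
final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double])

object BinarizeWithNegativeThreshold {
  // Returns a dense array of 0.0/1.0, mapping x to 1.0 when x > threshold.
  def binarize(v: SparseVec, threshold: Double): Array[Double] = {
    // An implicit 0.0 maps to 1.0 exactly when 0.0 > threshold,
    // i.e. when the threshold is negative.
    val fill = if (0.0 > threshold) 1.0 else 0.0
    val out = Array.fill(v.size)(fill)
    // Overwrite the positions that carry an explicitly stored value.
    var k = 0
    while (k < v.indices.length) {
      out(v.indices(k)) = if (v.values(k) > threshold) 1.0 else 0.0
      k += 1
    }
    out
  }
}
```

For the logical vector [0.0, -2.0, 0.0, 3.0] and a threshold of -0.5, `binarize` yields [1.0, 0.0, 1.0, 1.0]: mostly ones, so the result is effectively dense, as noted above. A real implementation would presumably still pick the more compact of the dense and sparse representations for the output.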

SparkQA · 2019-09-19T04:11:52Z

Test build #110962 has finished for PR 25829 at commit 190c3b8.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2019-09-19T05:02:52Z

retest this please

SparkQA · 2019-09-19T06:25:45Z

Test build #110971 has finished for PR 25829 at commit 190c3b8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-09-19T11:48:48Z

mllib/src/main/scala/org/apache/spark/ml/feature/Binarizer.scala


-      Vectors.sparse(data.size, indices.result(), values.result()).compressed
+      case _: VectorUDT if td < 0 =>
+        this.logWarning(s"Binarization operations on sparse dataset with negative threshold " +


I think this is OK. It will almost always be dense but not always. The warning is spurious if the input is dense already, but, a negative threshold is rare... I think. I'm trying to recall whether this is ever applied to outputs of classifiers like SVMs that output [-1, 1].

zhengruifeng · 2019-09-20T07:17:59Z

The check "master / Build Spark with JDK 11 and hadoop-3.2 (pull_request)" fails. Does it matter?

srowen · 2019-09-20T14:17:13Z

I have been ignoring those. @dongjoon-hyun is the GitHub Action for JDK 11 above supposed to be working? I've seen it always fail with something like

[ERROR] Plugin org.codehaus.mojo:build-helper-maven-plugin:3.0.0 or one of its dependencies could not be resolved: Failed to read artifact descriptor for org.codehaus.mojo:build-helper-maven-plugin:jar:3.0.0: Could not transfer artifact org.codehaus.mojo:build-helper-maven-plugin:pom:3.0.0 from/to central (https://repo.maven.apache.org/maven2): Connection timed out (Read failed) -> [Help 1]

dongjoon-hyun · 2019-09-20T19:18:49Z

@zhengruifeng and @srowen.
The GitHub Action failure seems to be due to the Maven artifact download.
On top of that, there is a GitHub Actions bug where it sometimes doesn't allow a re-trigger because it thinks the test is still running.

dongjoon-hyun · 2019-09-20T19:32:58Z

I tested this PR manually on JDK 11. There is no problem with this PR. We can ignore the above failure.

srowen · 2019-09-21T00:22:55Z

Merged to master

zhengruifeng · 2019-09-23T06:50:44Z

Thank you @srowen @dongjoon-hyun !

zhengruifeng added 2 commits September 18, 2019 18:26

create pr

e3a5ff4

nit

cb52a09

zhengruifeng changed the title ~~[SPARK-29144][ML] Binarizer handel sparse vector incorrectly with negative threshold~~ [SPARK-29144][ML] Binarizer handle sparse vectors incorrectly with negative threshold Sep 18, 2019

zhengruifeng added the ML label Sep 18, 2019

zhengruifeng mentioned this pull request Sep 18, 2019

[SPARK-23578][ML] Add multicolumn support for Binarizer #20732

Closed

swith to ML fashion

190c3b8

srowen approved these changes Sep 19, 2019

srowen closed this in c764dd6 Sep 21, 2019

zhengruifeng deleted the binarizer_throw_exception_sparse_vector branch September 23, 2019 06:50
