Skip to content

Conversation

@zhengruifeng
Copy link
Contributor

@zhengruifeng zhengruifeng commented Sep 18, 2019

What changes were proposed in this pull request?

if threshold<0, convert implict 0 to 1, althought this will break sparsity

Why are the changes needed?

if threshold<0, current impl deal with sparse vector incorrectly.
See JIRA SPARK-29144 and Scikit-Learn's Binarizer ('Threshold may not be less than 0 for operations on sparse matrices.') for details.

Does this PR introduce any user-facing change?

no

How was this patch tested?

added testsuite

@zhengruifeng zhengruifeng changed the title [SPARK-29144][ML] Binarizer handel sparse vector incorrectly with negative threshold [SPARK-29144][ML] Binarizer handle sparse vectors incorrectly with negative threshold Sep 18, 2019
@SparkQA
Copy link

SparkQA commented Sep 18, 2019

Test build #110902 has finished for PR 25829 at commit cb52a09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

zhengruifeng commented Sep 18, 2019

Just notice that existing ML algs deal with sparse dataset in a different way from scikit-learn:
scikit-learn refuse to break the data sparsity, and will throw an exception; while ML will convert sparse vector to dense one.

I will follow ML’s way and update this PR tomorrow

@srowen
Copy link
Member

srowen commented Sep 18, 2019

I think the right answer is to return 1 for all of the implicit 0 entries when the threshold is < 0. Yes it makes it dense, but it's the right answer.

@SparkQA
Copy link

SparkQA commented Sep 19, 2019

Test build #110962 has finished for PR 25829 at commit 190c3b8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Sep 19, 2019

Test build #110971 has finished for PR 25829 at commit 190c3b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


Vectors.sparse(data.size, indices.result(), values.result()).compressed
case _: VectorUDT if td < 0 =>
this.logWarning(s"Binarization operations on sparse dataset with negative threshold " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is OK. It will almost always be dense but not always. The warning is spurious if the input is dense already, but, a negative threshold is rare... I think. I'm trying to recall whether this is ever applied to outputs of classifiers like SVMs that output [-1, 1].

@zhengruifeng
Copy link
Contributor Author

master / Build Spark with JDK 11 and hadoop-3.2 (pull_request) Fai this check fails, do it matter?

@srowen
Copy link
Member

srowen commented Sep 20, 2019

I have been ignoring those. @dongjoon-hyun is the Github action for JDK 11 above supposed to be working? I've seen it always fail with something like

[ERROR] Plugin org.codehaus.mojo:build-helper-maven-plugin:3.0.0 or one of its dependencies could not be resolved: Failed to read artifact descriptor for org.codehaus.mojo:build-helper-maven-plugin:jar:3.0.0: Could not transfer artifact org.codehaus.mojo:build-helper-maven-plugin:pom:3.0.0 from/to central (https://repo.maven.apache.org/maven2): Connection timed out (Read failed) -> [Help 1]

@dongjoon-hyun
Copy link
Member

@zhengruifeng and @srowen .
The failure of GitHub Action seems due to the maven artifacts download.
On top of that, there is a GitHub Action bug which sometimes it doesn't allow re-trigger because it thinks the test is still running.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Sep 20, 2019

I tested this PR manually on JDK11. There is no problem for this PR~ We can ignore the above failure.

@srowen srowen closed this in c764dd6 Sep 21, 2019
@srowen
Copy link
Member

srowen commented Sep 21, 2019

Merged to master

@zhengruifeng zhengruifeng deleted the binarizer_throw_exception_sparse_vector branch September 23, 2019 06:50
@zhengruifeng
Copy link
Contributor Author

Thank you @srowen @dongjoon-hyun !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants