Skip to content

Conversation

@freeman-lab
Copy link
Contributor

This is a small change addressing a potentially significant bug in how PySpark + MLlib handles non-float64 numpy arrays. The automatic conversion to DenseVector that occurs when passing RDDs to MLlib algorithms in PySpark should automatically upcast to float64s, but currently this wasn't actually happening. As a result, non-float64 would be silently parsed inappropriately during SerDe, yielding erroneous results when running, for example, KMeans.

The PR includes the fix, as well as a new test for the correct conversion behavior.

@davies

@SparkQA
Copy link

SparkQA commented Jan 5, 2015

Test build #25058 has started for PR 3902 at commit 764db47.

  • This patch merges cleanly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

@davies
Copy link
Contributor

davies commented Jan 5, 2015

LGTM, thanks!

asfgit pushed a commit that referenced this pull request Jan 5, 2015
This is a small change addressing a potentially significant bug in how PySpark + MLlib handles non-float64 numpy arrays. The automatic conversion to `DenseVector` that occurs when passing RDDs to MLlib algorithms in PySpark should automatically upcast to float64s, but currently this wasn't actually happening. As a result, non-float64 would be silently parsed inappropriately during SerDe, yielding erroneous results when running, for example, KMeans.

The PR includes the fix, as well as a new test for the correct conversion behavior.

davies

Author: freeman <[email protected]>

Closes #3902 from freeman-lab/fix-vector-convert and squashes the following commits:

764db47 [freeman] Add a test for proper conversion behavior
704f97e [freeman] Return array after changing type

(cherry picked from commit 6c6f325)
Signed-off-by: Xiangrui Meng <[email protected]>
@asfgit asfgit closed this in 6c6f325 Jan 5, 2015
@SparkQA
Copy link

SparkQA commented Jan 5, 2015

Test build #25058 has finished for PR 3902 at commit 764db47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25058/
Test PASSed.

@mengxr
Copy link
Contributor

mengxr commented Jan 5, 2015

Merged into master and branch-1.2 Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants