[SPARK-5089][PYSPARK][MLLIB] Fix vector convert #3902

freeman-lab · 2015-01-05T20:06:41Z

This is a small change addressing a potentially significant bug in how PySpark + MLlib handles non-float64 numpy arrays. The automatic conversion to DenseVector that occurs when passing RDDs to MLlib algorithms in PySpark should upcast to float64, but this was not actually happening. As a result, non-float64 arrays were silently misparsed during SerDe, yielding erroneous results when running, for example, KMeans.

The PR includes the fix, as well as a new test for the correct conversion behavior.

@davies
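To illustrate the failure mode described above, here is a minimal sketch using only numpy. The helper names `to_float64_buggy` and `to_float64_fixed` are hypothetical and are not the functions in `python/pyspark/mllib/linalg.py`. The sketch assumes the bug amounted to discarding the result of `astype` (suggested by the commit title "Return array after changing type"), and it stands in for SerDe by reinterpreting the raw bytes as float64.

```python
import numpy as np


def to_float64_buggy(ar):
    """Hypothetical buggy conversion: the upcast result is thrown away."""
    ar = np.asarray(ar)
    if ar.dtype != np.float64:
        ar.astype(np.float64)  # astype returns a new array; nothing is assigned
    return ar


def to_float64_fixed(ar):
    """Hypothetical fixed conversion: keep the array returned by astype."""
    ar = np.asarray(ar)
    if ar.dtype != np.float64:
        ar = ar.astype(np.float64)
    return ar


data = np.array([1, 2, 3], dtype=np.int64)

buggy = to_float64_buggy(data)   # still int64
fixed = to_float64_fixed(data)   # float64

# Stand-in for serialization (an assumption, not the real SerDe path):
# the receiving side reads the raw bytes as float64 regardless of dtype.
misread = np.frombuffer(buggy.tobytes(), dtype=np.float64)
roundtrip = np.frombuffer(fixed.tobytes(), dtype=np.float64)

print(buggy.dtype, fixed.dtype)
print(np.array_equal(misread, [1.0, 2.0, 3.0]))    # values are corrupted
print(np.array_equal(roundtrip, [1.0, 2.0, 3.0]))  # values survive
```

Nothing in the buggy path raises an error, which is why the wrong values would only show up later as bad results from an algorithm such as KMeans.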

SparkQA · 2015-01-05T20:07:37Z

Test build #25058 has started for PR 3902 at commit 764db47.

This patch merges cleanly.

davies · 2015-01-05T20:11:20Z

python/pyspark/mllib/linalg.py

Good catch!

davies · 2015-01-05T20:12:41Z

LGTM, thanks!

This is a small change addressing a potentially significant bug in how PySpark + MLlib handles non-float64 numpy arrays. The automatic conversion to `DenseVector` that occurs when passing RDDs to MLlib algorithms in PySpark should automatically upcast to float64s, but currently this wasn't actually happening. As a result, non-float64 would be silently parsed inappropriately during SerDe, yielding erroneous results when running, for example, KMeans. The PR includes the fix, as well as a new test for the correct conversion behavior.

davies

Author: freeman <[email protected]>

Closes #3902 from freeman-lab/fix-vector-convert and squashes the following commits:

764db47 [freeman] Add a test for proper conversion behavior
704f97e [freeman] Return array after changing type

(cherry picked from commit 6c6f325)
Signed-off-by: Xiangrui Meng <[email protected]>

SparkQA · 2015-01-05T21:12:38Z

Test build #25058 has finished for PR 3902 at commit 764db47.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-05T21:12:42Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25058/
Test PASSed.

mengxr · 2015-01-05T21:12:44Z

Merged into master and branch-1.2 Thanks!

freeman-lab added 2 commits January 5, 2015 14:02

Return array after changing type

704f97e

Add a test for proper conversion behavior

764db47

asfgit closed this in 6c6f325 Jan 5, 2015
