[SPARK-14615][ML] Use the new ML Vector and Matrix in the ML pipeline based algorithms #12627

dbtsai · 2016-04-22T23:32:40Z

What changes were proposed in this pull request?

Once SPARK-14487 and SPARK-14549 are merged, we will migrate the new ML pipeline-based APIs to use the new vector and matrix types.

How was this patch tested?

Unit tests

dbtsai · 2016-04-22T23:33:07Z

Waiting for #12259 to be merged.

SparkQA · 2016-04-22T23:38:51Z

Test build #56754 has finished for PR 12627 at commit 3944f56.

This patch fails Scala style tests.
This patch does not merge cleanly.
This patch adds no public classes.

mengxr · 2016-04-29T16:47:20Z

@dbtsai #12259 was merged. Could you update this PR?

dbtsai · 2016-04-29T16:56:56Z

@mengxr working on this now. Thanks.

SparkQA · 2016-05-03T07:23:57Z

Test build #57609 has finished for PR 12627 at commit 93a1c20.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-05-04T00:03:53Z

Test build #57692 has finished for PR 12627 at commit 8346987.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-05-05T02:23:44Z

Test build #57828 has finished for PR 12627 at commit 09a3dd8.

This patch fails MiMa tests.
This patch does not merge cleanly.
This patch adds no public classes.

mengxr · 2016-05-05T21:20:10Z

@dbtsai Would it help to use implicit conversions?

dbtsai · 2016-05-05T21:36:20Z

@mengxr That can work, but we would need to import them everywhere. I can give it a shot.

mengxr · 2016-05-05T22:45:50Z

@dbtsai Please just try it with one algorithm and see which one is cleaner.

SparkQA · 2016-05-05T23:02:12Z

Test build #57922 has finished for PR 12627 at commit e4265ab.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-05-06T01:40:03Z

Test build #57930 has finished for PR 12627 at commit 1602f6f.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

dbtsai · 2016-05-06T23:01:32Z

mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala

    SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
-    val data = dataset.select(col($(featuresCol))).rdd.map { case Row(point: Vector) => point }
+    val data = dataset.select(col($(featuresCol))).rdd.map { case Row(point: Vector) =>
+      OldVectors.fromML(point)


@mengxr Implicit conversions don't work for cases like this one; we still need to convert manually. But I agree that some of the code can be simplified with implicits, which I will push in the next commit.

SparkQA · 2016-05-07T01:18:31Z

Test build #58040 has finished for PR 12627 at commit 82c7750.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-05-10T01:04:52Z

Test build #58181 has finished for PR 12627 at commit c16d1ea.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-05-10T01:58:54Z

Test build #58187 has finished for PR 12627 at commit 126e6f2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-05-10T08:36:30Z

Test build #58221 has finished for PR 12627 at commit 6faec8a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-05-10T08:41:23Z

Test build #58222 has finished for PR 12627 at commit 283b04a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2016-05-17T14:52:03Z

I'm making a pass.

SparkQA · 2016-05-17T15:32:40Z

Test build #58692 has finished for PR 12627 at commit 9d25eba.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2016-05-17T16:55:47Z

examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala

 import org.apache.spark.examples.mllib.AbstractParams
-import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.ml.linalg.Vector
+import org.apache.spark.mllib.linalg.VectorImplicits._


VectorImplicits shouldn't appear in example code. I created https://issues.apache.org/jira/browse/SPARK-15363 to track it.

SparkQA · 2016-05-17T19:47:31Z

Test build #58708 has finished for PR 12627 at commit 953eea7.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class LabeledPoint(@Since("2.0.0") label: Double, @Since("2.0.0") features: Vector)

mengxr · 2016-05-17T19:55:14Z

LGTM. Merged into master and branch-2.0. This should complete the major MLlib API changes in 2.0. Thanks!

In retrospect, I think we underestimated the amount of work required and didn't allocate enough time to make the changes before the feature freeze deadline. We should discuss the design and scope the work earlier next time.

… based algorithms

## What changes were proposed in this pull request?

Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis.

## How was this patch tested?

Unit tests

Author: DB Tsai
Author: Liang-Chi Hsieh
Author: Xiangrui Meng

Closes #12627 from dbtsai/SPARK-14615-NewML.

(cherry picked from commit e2efe05)
Signed-off-by: Xiangrui Meng

dbtsai · 2016-05-18T17:39:59Z

Thank you to everyone who was involved in this work. I agree that the amount of work was underestimated, and some of it was genuinely hard to estimate because issues popped up during the implementation. However, we should take on major changes like this at the beginning of a release cycle to ensure that we have enough time to address unexpected issues. Thanks again!

HyukjinKwon · 2016-05-29T09:30:32Z

Hi @dbtsai, I just happened to run some Python tests for ML and noticed that some examples related to this PR are failing:

examples/src/main/python/ml/aft_survival_regression.py
examples/src/main/python/ml/chisq_selector_example.py
examples/src/main/python/ml/dct_example.py
examples/src/main/python/ml/elementwise_product_example.py
examples/src/main/python/ml/estimator_transformer_param_example.py
examples/src/main/python/ml/pca_example.py
examples/src/main/python/ml/polynomial_expansion_example.py
examples/src/main/python/ml/simple_params_example.py
examples/src/main/python/ml/vector_assembler_example.py
examples/src/main/python/ml/vector_slicer_example.py

I see some Scala and Java examples were fixed here, so I made a rough PR for the Python examples. However, I am a bit hesitant to submit it because I am not familiar with this part (though I could do it based on your PR), and I suspect you already know that there are Python examples to fix.

May I ask whether they were simply missed by mistake?

viirya · 2016-05-29T09:44:58Z

@HyukjinKwon Thanks for reporting this! I think we missed the Python examples in this change. If you can submit your PR, that would be good. If not, or if you are hesitant about it, I can submit a PR to fix it.

HyukjinKwon · 2016-05-29T10:20:37Z

@viirya Ah, thank you so much. Since I already have the changes locally, I will create a follow-up!

…tor and Matrix APIs in the ML pipeline based algorithms

## What changes were proposed in this pull request?

This PR fixes Python examples to use the new ML Vector and Matrix APIs in the ML pipeline based algorithms.

I first ran the shell command `grep -r "from pyspark.mllib" .` and then executed every example it listed. Some of the tests in `ml` produced the error message below:

```
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Input type must be VectorUDT but got org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.'
```

So I fixed them to use the new classes, in the same way that some Python tests were fixed in #12627.

## How was this patch tested?

Manually tested for all the examples listed by `grep -r "from pyspark.mllib" .`.

Author: hyukjinkwon

Closes #13393 from HyukjinKwon/SPARK-14615.

(cherry picked from commit 99f3c82)
Signed-off-by: Joseph K. Bradley
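The `grep -r "from pyspark.mllib" .` audit mentioned in the follow-up commit message can be approximated in plain Python. This is only a rough sketch, not the script the author used: the helper name `find_mllib_imports` is hypothetical, and unlike the original `grep` it looks only at `*.py` files and only at lines that begin with the import. Each hit is a candidate for review rather than a confirmed bug, since examples under `mllib` legitimately import from `pyspark.mllib`.

```python
import re
import tempfile
from pathlib import Path

# Matches lines such as "from pyspark.mllib.linalg import Vectors",
# but not "from pyspark.ml.linalg import Vectors".
OLD_IMPORT = re.compile(r"^\s*from pyspark\.mllib\b")


def find_mllib_imports(root):
    """Return (relative_path, line_number, line) for each line under `root`
    that starts a `from pyspark.mllib...` import. Hypothetical helper."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if OLD_IMPORT.match(line):
                rel = path.relative_to(root).as_posix()
                hits.append((rel, number, line.strip()))
    return hits


if __name__ == "__main__":
    # Made-up demo tree; file names and contents are illustrative only.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "ml"
        demo.mkdir()
        (demo / "old_example.py").write_text(
            "from pyspark.mllib.linalg import Vectors\n", encoding="utf-8"
        )
        (demo / "new_example.py").write_text(
            "from pyspark.ml.linalg import Vectors\n", encoding="utf-8"
        )
        for rel, number, line in find_mllib_imports(tmp):
            print(f"{rel}:{number}: {line}")
```

As in the follow-up PR, the files reported this way would still need to be run individually to see which ones actually fail with the `VectorUDT` type error.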

[SPARK-14615](https://issues.apache.org/jira/browse/SPARK-14615) and #12627 changed `spark.ml` pipelines to use the new `ml.linalg` classes for `Vector`/`Matrix`. Some `Since` annotations for public methods/vals have not been updated accordingly to be `2.0.0`. This PR updates them.

## How was this patch tested?

Existing unit tests.

Author: Nick Pentreath

Closes #13840 from MLnick/SPARK-16127-ml-linalg-since.

(cherry picked from commit 18faa58)
Signed-off-by: Xiangrui Meng

first commit

93a1c20

dbtsai force-pushed the SPARK-14615-NewML branch from 3944f56 to 93a1c20 Compare May 3, 2016 07:21

some importing ordering

8346987

wow, finally can compile..

09a3dd8

Updated MiMaExcludes

e4265ab

Implicit conversion

1602f6f

dbtsai reviewed May 6, 2016

use some implicit

82c7750

DB Tsai added 2 commits May 9, 2016 16:12

Merge branch 'master' into SPARK-14615-NewML

c16d1ea

Added tests for implict conversion

126e6f2

DB Tsai added 2 commits May 10, 2016 00:11

Fix some tests

6faec8a

more test fix

283b04a

fix more tests

064c7da

mengxr reviewed May 17, 2016

mengxr added 2 commits May 17, 2016 10:39

remove ml.LabeledPoint from PySpark and annotate ml.LabeledPoint in P…

f385367

…ython

Merge pull request #2 from mengxr/SPARK-14615

953eea7

remove ml.LabeledPoint from PySpark and annotate ml.LabeledPoint

asfgit closed this in e2efe05 May 17, 2016

dbtsai deleted the SPARK-14615-NewML branch May 19, 2016 18:07

HyukjinKwon mentioned this pull request May 29, 2016

[SPARK-14615][ML][FOLLOWUP] Fix Python examples to use the new ML Vector and Matrix APIs in the ML pipeline based algorithms #13393

Closed

This was referenced Jun 22, 2016

[SPARK-16127][ML][PYPSARK] Audit @Since annotations related to ml.linalg #13840

Closed

[SPARK-10258][DOC][ML] Add @Since annotations to ml.feature #13641

Closed
