-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-14615][ML] Use the new ML Vector and Matrix in the ML pipeline based algorithms #12627
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Waiting #12259 to be merged. |
|
Test build #56754 has finished for PR 12627 at commit
|
|
@mengxr working on this now. Thanks. |
|
Test build #57609 has finished for PR 12627 at commit
|
|
Test build #57692 has finished for PR 12627 at commit
|
|
Test build #57828 has finished for PR 12627 at commit
|
|
@dbtsai Would it help using implicit conversions? |
|
@mengxr That can work, but need to import everywhere. I can give it a shot. |
|
@dbtsai Please just try it with one algorithm and see which one is cleaner. |
|
Test build #57922 has finished for PR 12627 at commit
|
|
Test build #57930 has finished for PR 12627 at commit
|
| SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT) | ||
| val data = dataset.select(col($(featuresCol))).rdd.map { case Row(point: Vector) => point } | ||
| val data = dataset.select(col($(featuresCol))).rdd.map { case Row(point: Vector) => | ||
| OldVectors.fromML(point) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mengxr Implicit conversion doesn't work things like those. We still need manually convert them. But I agree that some of the code can be simplified by implicit which I will push in the next commit.
|
Test build #58040 has finished for PR 12627 at commit
|
|
Test build #58181 has finished for PR 12627 at commit
|
|
Test build #58187 has finished for PR 12627 at commit
|
|
Test build #58221 has finished for PR 12627 at commit
|
|
Test build #58222 has finished for PR 12627 at commit
|
|
I'm making a pass. |
|
Test build #58692 has finished for PR 12627 at commit
|
| import org.apache.spark.examples.mllib.AbstractParams | ||
| import org.apache.spark.mllib.linalg.Vector | ||
| import org.apache.spark.ml.linalg.Vector | ||
| import org.apache.spark.mllib.linalg.VectorImplicits._ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VectorImplicits shouldn't appear in example code. I created https://issues.apache.org/jira/browse/SPARK-15363 to track it.
remove ml.LabeledPoint from PySpark and annotate ml.LabeledPoint
|
Test build #58708 has finished for PR 12627 at commit
|
|
LGTM. Merged into master and branch-2.0. This should complete the major MLlib API changes in 2.0. Thanks! On retrospective, I think we under-estimated the amount of work required and didn't allocate enough time to make the changes before the feature freeze deadline. We should discuss the design and scope the work earlier next time. |
… based algorithms ## What changes were proposed in this pull request? Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis. ## How was this patch tested? Unit tests Author: DB Tsai <[email protected]> Author: Liang-Chi Hsieh <[email protected]> Author: Xiangrui Meng <[email protected]> Closes #12627 from dbtsai/SPARK-14615-NewML. (cherry picked from commit e2efe05) Signed-off-by: Xiangrui Meng <[email protected]>
|
Thank you for everyone who involved in this work. I agree that the amount of work was underestimated, and some of them were actually hard to estimate given the issues were popped up durning the implementation. However, we should work on this kind of major changes in the beginning of release to ensure that we have enough time to address unexpected issues. Thanks again! |
|
Hi @dbtsai I just happened to run some Python tests for ML and I noticed some examples related with this PR are failed: I see some Scala and Java examples were fixed here. So, I made a rough PR for Python examples. However, I feel a bit hesitated to submit this because I am not used to this part (but could do this based on your PR) and I feel like you know there are Python examples to fix already. Do you mind if I ask that they were just mistakenly missed? |
|
@HyukjinKwon Thanks for reporting this! I think we missed python example in this change. If you can submit your PR, that is good. If not or you feel hesitated about this, I can submit a PR to fix it. |
|
@viirya Ah, thank you so much. Since I already have it on my local, I will create a followup! |
…tor and Matrix APIs in the ML pipeline based algorithms ## What changes were proposed in this pull request? This PR fixes Python examples to use the new ML Vector and Matrix APIs in the ML pipeline based algorithms. I firstly executed this shell command, `grep -r "from pyspark.mllib" .` and then executed them all. Some of tests in `ml` produced the error messages as below: ``` pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Input type must be VectorUDT but got org.apache.spark.mllib.linalg.VectorUDTf71b0bce.' ``` So, I fixed them to use new ones just identically with some Python tests fixed in #12627 ## How was this patch tested? Manually tested for all the examples listed by `grep -r "from pyspark.mllib" .`. Author: hyukjinkwon <[email protected]> Closes #13393 from HyukjinKwon/SPARK-14615. (cherry picked from commit 99f3c82) Signed-off-by: Joseph K. Bradley <[email protected]>
[SPARK-14615](https://issues.apache.org/jira/browse/SPARK-14615) and #12627 changed `spark.ml` pipelines to use the new `ml.linalg` classes for `Vector`/`Matrix`. Some `Since` annotations for public methods/vals have not been updated accordingly to be `2.0.0`. This PR updates them. ## How was this patch tested? Existing unit tests. Author: Nick Pentreath <[email protected]> Closes #13840 from MLnick/SPARK-16127-ml-linalg-since.
[SPARK-14615](https://issues.apache.org/jira/browse/SPARK-14615) and #12627 changed `spark.ml` pipelines to use the new `ml.linalg` classes for `Vector`/`Matrix`. Some `Since` annotations for public methods/vals have not been updated accordingly to be `2.0.0`. This PR updates them. ## How was this patch tested? Existing unit tests. Author: Nick Pentreath <[email protected]> Closes #13840 from MLnick/SPARK-16127-ml-linalg-since. (cherry picked from commit 18faa58) Signed-off-by: Xiangrui Meng <[email protected]>
What changes were proposed in this pull request?
Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis.
How was this patch tested?
Unit tests