[SPARK-9230] [ML] Support StringType features in RFormula #7574
Conflicts:
mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala
|
Test build #37977 has finished for PR 7574 at commit
|
|
Test build #37982 has finished for PR 7574 at commit
|
It might be simpler to construct the entire preprocessing pipeline in fit, which includes StringIndexers, OneHotEncoder, and VectorAssembler. Then call fit on the pipeline and pass the PipelineModel to RFormulaModel. We might add StringVectorizer to combine StringIndexer and OneHotEncoder in the future.
I'm a little worried about the generated feature names. But we could address this issue separately.
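A minimal sketch of the suggestion above, not the code that was merged: one `StringIndexer` and one `OneHotEncoder` per string column, followed by a single `VectorAssembler`, all wrapped in a `Pipeline`. The object name `PreprocessingSketch`, the helper `buildPipeline`, and the `_idx` / `_vec` column suffixes are assumptions made for illustration. `OneHotEncoder` is a plain transformer in older Spark releases and an estimator in newer ones; either way it can be used as a pipeline stage.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import scala.collection.mutable.ArrayBuffer

object PreprocessingSketch {
  // Hypothetical helper: the name and the column-name suffixes are illustrative only.
  def buildPipeline(
      stringCols: Seq[String],
      numericCols: Seq[String],
      featuresCol: String): Pipeline = {
    val stages = ArrayBuffer[PipelineStage]()

    // For each string column: index its levels, then one-hot encode the index.
    val encodedCols = stringCols.map { col =>
      val indexed = col + "_idx"
      val encoded = col + "_vec"
      stages += new StringIndexer().setInputCol(col).setOutputCol(indexed)
      stages += new OneHotEncoder().setInputCol(indexed).setOutputCol(encoded)
      encoded
    }

    // Combine numeric columns and encoded vectors into one feature vector.
    stages += new VectorAssembler()
      .setInputCols((numericCols ++ encodedCols).toArray)
      .setOutputCol(featuresCol)

    new Pipeline().setStages(stages.toArray)
  }
}
```

Calling `fit` on the returned `Pipeline` with the training DataFrame yields a `PipelineModel`, which is the object the comment proposes handing to `RFormulaModel`.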
Done
|
@ericl I think it is simpler to construct a |
|
@mengxr to clarify, not calling |
|
Hmm, I guess that is pretty harmless though. Will do. |
|
You can construct a |
|
Test build #38316 has finished for PR 7574 at commit
|
|
Test build #38410 has finished for PR 7574 at commit
|
|
ptal |
|
test this please |
minor: Seq could be replaced by ArrayBuffer to avoid creating temp sequences. Then :+= below becomes +=, slightly simpler to read.
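To illustrate the difference this comment points at, here is a small standalone sketch. It does not reproduce the PR's code; the object and method names and the `_idx` suffix are hypothetical. With an immutable `Seq` held in a `var`, `:+=` builds a new sequence on every append and reassigns the variable, while `+=` on an `ArrayBuffer` appends in place.

```scala
import scala.collection.mutable.ArrayBuffer

object AppendSketch {
  // Immutable Seq in a var: each :+= creates a new sequence and reassigns `out`.
  def withSeq(cols: Seq[String]): Seq[String] = {
    var out = Seq[String]()
    for (c <- cols) out :+= c + "_idx"
    out
  }

  // Mutable ArrayBuffer in a val: += appends to the same buffer.
  def withBuffer(cols: Seq[String]): Seq[String] = {
    val out = ArrayBuffer[String]()
    for (c <- cols) out += c + "_idx"
    out.toSeq
  }
}
```

Both methods return the same elements in the same order; only the intermediate allocations differ.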
Done
Array(...) is not necessary. Vectors.dense takes varargs.
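For reference, a short sketch of the two call forms. `Vectors.dense` has both an `Array[Double]` overload and a varargs overload, so the `Array(...)` wrapper can be dropped. The values and the object name are illustrative, and the import assumes the `org.apache.spark.mllib.linalg` package.

```scala
import org.apache.spark.mllib.linalg.Vectors

object DenseVarargsSketch {
  // Array overload: works, but the wrapper adds noise.
  val viaArray = Vectors.dense(Array(1.0, 2.0, 3.0))

  // Varargs overload: same vector, shorter call.
  val viaVarargs = Vectors.dense(1.0, 2.0, 3.0)
}
```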
Done
|
test this please |
|
LGTM pending Jenkins. |
|
Test build #38602 has finished for PR 7574 at commit
|
|
Merged into master. Thanks! |
|
Test build #38597 has finished for PR 7574 at commit
|
This adds StringType feature support via OneHotEncoder. As part of this task it was necessary to change RFormula to an Estimator, so that factor levels could be determined from the training dataset.
I'm not sure I am using uids correctly here; reviewer help on that would be appreciated.
cc @mengxr
Umbrella design doc: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit#
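As a rough usage sketch of what the change to an Estimator means for callers: `fit` learns the factor levels from the training data, and the resulting model applies the same encoding to other data. The formula string, the column names `y`, `a`, `b`, and the object and method names below are assumptions for illustration, not taken from the PR.

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.DataFrame

object RFormulaSketch {
  // Hypothetical formula: `y` is the label, `a` is numeric, `b` is a string column.
  val formula: RFormula = new RFormula()
    .setFormula("y ~ a + b")
    .setFeaturesCol("features")
    .setLabelCol("label")

  // Fit on the training data, then transform another dataset with the fitted model.
  def featurize(train: DataFrame, other: DataFrame): DataFrame = {
    val model = formula.fit(train)
    model.transform(other)
  }
}
```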