Skip to content

Conversation

@ericl
Copy link
Contributor

@ericl ericl commented Jul 21, 2015

This adds StringType feature support via OneHotEncoder. As part of this task it was necessary to change RFormula to an Estimator, so that factor levels could be determined from the training dataset.

Not sure if I am using uids correctly here, would be good to get reviewer help on that.
cc @mengxr

Umbrella design doc: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit#

@SparkQA
Copy link

SparkQA commented Jul 21, 2015

Test build #37977 has finished for PR 7574 at commit 169a085.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 21, 2015

Test build #37982 has finished for PR 7574 at commit 4d79193.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be simpler to construct the entire preprocessing pipeline in fit, which includes StringIndexers, OneHotEncoder, and VectorAssembler. Then call fit on the pipeline and pass the PipelineModel to RFormulaModel. We might add StringVectorizer to combine StringIndexer and OneHotEncoder in the future.

I'm a little worried about the generated feature names. But we could address this issue separately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@mengxr
Copy link
Contributor

mengxr commented Jul 23, 2015

@ericl I think it is simpler to construct a pipeline in RFormula.fit without calling StringIndexer.fit explicitly. That leaves space for pipeline.fit optimization. Then RFormulaModel takes the PipelineModel object directly, which does most of the job.

@ericl
Copy link
Contributor Author

ericl commented Jul 23, 2015

@mengxr to clarify, not calling StringIndexer.fit in RFormula.fit means RFormulaModel will have a reference to the original training dataset, correct?

@ericl
Copy link
Contributor Author

ericl commented Jul 23, 2015

Hmm, I guess that is pretty harmless though. Will do.

@mengxr
Copy link
Contributor

mengxr commented Jul 23, 2015

You can construct a Pipeline object in RFormula.fit, which contains all StringIndexer, OneHotEncoder, etc. Then call Pipeline.fit in RFormula.fit and get the fitted PipelineModel. Pass it to RFormulaModel. RFormulaModel becomes a simple wrapper over the fitted pipeline.

@SparkQA
Copy link

SparkQA commented Jul 24, 2015

Test build #38316 has finished for PR 7574 at commit c302a2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RFormula(override val uid: String) extends Estimator[RFormulaModel] with RFormulaBase

@SparkQA
Copy link

SparkQA commented Jul 25, 2015

Test build #38410 has finished for PR 7574 at commit 0bf3c26.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RFormula(override val uid: String) extends Estimator[RFormulaModel] with RFormulaBase

@ericl
Copy link
Contributor Author

ericl commented Jul 25, 2015

ptal

@mengxr
Copy link
Contributor

mengxr commented Jul 27, 2015

test this please

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: Seq could be replaced by ArrayBuffer to avoid creating temp sequences. Then :+= below becomes +=, slightly simpler to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Array(...) is not necessary. Vectors.dense takes varargs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@mengxr
Copy link
Contributor

mengxr commented Jul 27, 2015

test this please

@mengxr
Copy link
Contributor

mengxr commented Jul 27, 2015

LGTM pending Jenkins.

@SparkQA
Copy link

SparkQA commented Jul 27, 2015

Test build #38602 has finished for PR 7574 at commit f99131a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RFormula(override val uid: String) extends Estimator[RFormulaModel] with RFormulaBase

@mengxr
Copy link
Contributor

mengxr commented Jul 28, 2015

Merged into master. Thanks!

@asfgit asfgit closed this in 8ddfa52 Jul 28, 2015
@SparkQA
Copy link

SparkQA commented Jul 28, 2015

Test build #38597 has finished for PR 7574 at commit 0bf3c26.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RFormula(override val uid: String) extends Estimator[RFormulaModel] with RFormulaBase

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants