Skip to content

Conversation

@ericl
Copy link
Contributor

@ericl ericl commented Aug 6, 2015

This is a pre-req for supporting the ":" operator in the RFormula feature transformer.

Design doc from umbrella task: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit

@mengxr

@SparkQA
Copy link

SparkQA commented Aug 6, 2015

Test build #40009 has finished for PR 7987 at commit 386881b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RInteraction(override val uid: String) extends Estimator[PipelineModel]

@SparkQA
Copy link

SparkQA commented Aug 6, 2015

Test build #40011 has finished for PR 7987 at commit 4c11a77.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Aug 6, 2015

@ericl Shall we split this PR into two?

  1. Add Interaction as a transformer (SPARK-9698).
  2. Support feature interaction in RFormula.

After 1) is merged, people can start working on the Python API, without being blocked by 2).

@ericl ericl changed the title [SPARK-9681] [ML] Support R feature interactions in RFormula [SPARK-9698] [ML] Add RInteraction transformer for supporting R-style feature interactions Aug 7, 2015
@ericl
Copy link
Contributor Author

ericl commented Aug 7, 2015

@mengxr done, this PR now just has the RInteraction changes.

@SparkQA
Copy link

SparkQA commented Aug 7, 2015

Test build #40114 has finished for PR 7987 at commit 303b8d7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 7, 2015

Test build #40115 has finished for PR 7987 at commit 26b6925.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RInteraction(override val uid: String) extends Estimator[PipelineModel]

@dputler
Copy link

dputler commented Aug 20, 2015

I'm not clear as to how the order operation is determined. Looking at the tests, in the case of a categorical interaction it appears that it is based on the order in which unique category values are encountered for a categorical variable. Specifically, for the numeric/categorical interaction, the last category encountered ("baz") provides the first values of the interaction values, and the first category encountered ("foo") provides the last values of the interaction. In contrast, for the interaction between two categorical variables, the column order is set by the first category of the second underlying categorical variable (the value zq) is primary in column ordering (with zq-bar being the first column), so order is used again, but it runs in opposite direction for the two variables. This structure will actually work fine for model training, however, things get more complicated for predicting new data with this model. The approach is basically the same approach as MS/Revolution uses in their Revo ScaleR package (i.e., the order of the categories depends on when they are first encountered in the data). However, this turns out to greatly complicate predicting new data with a Revo ScaleR model in practice. Open source R works by first determining all the category labels for each categorical variable, alphabetically sorts the unique label for each categorical variable, and then basis the new feature order on the alphabetical sort of category labels, so the order in which a category label is encountered does not matter. This turns out to make dealing with predicting new data with an existing model much easier. The cost is the data needs to be passed over twice, with the first determining the set of unique category labels.

@mengxr
Copy link
Contributor

mengxr commented Aug 20, 2015

@dputler Under distributed setting, we need to make at least one pass to collect all categories. The ordering is not alphabetical but by frequency (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L86). The most frequent category gets index 0, and one-hot encoder by default drops the last category, i.e., the least frequent one.

@dputler
Copy link

dputler commented Aug 20, 2015

That actually doesn't deal with the scoring issue. What happens when new data to be predicted from an existing model has a more frequent category in a categorical variable than was the case in the training data? What happens if this is included in a Spark Streaming scoring process when the batch size might be one? As before, the frequency base indexing works for estimation, but will cause heartburn in many cases when trying to predict new data with an existing model.

@ericl
Copy link
Contributor Author

ericl commented Aug 20, 2015

If if I understand correctly, the concern is that the category to index assignment when predicting data will be different from that used when fitting the model. This should be OK here since StringIndexer retains a mapping from category to indices, which is reused when calling predict() on the model later.

It is true that it would be nice to have a more predictable ordering (such as alphabetic) for some tasks like comparing coefficients between different models, but I think that could be a feature of StringIndexer and is not really related to this PR.

@ericl
Copy link
Contributor Author

ericl commented Sep 16, 2015

@mengxr I did the refactoring as suggested

@SparkQA
Copy link

SparkQA commented Sep 16, 2015

Test build #42524 has finished for PR 7987 at commit 92c8287.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Interaction(override val uid: String) extends Transformer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove R-style because this is a common feature transformation

@SparkQA
Copy link

SparkQA commented Sep 16, 2015

Test build #42542 has finished for PR 7987 at commit 92c8287.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Interaction(override val uid: String) extends Transformer

@ericl
Copy link
Contributor Author

ericl commented Sep 17, 2015

@mengxr I made the requested changes. I found it simpler to keep numFeatures in combination with an array of offsets instead of just the cumulative count though.

clean up

validate params
@SparkQA
Copy link

SparkQA commented Sep 17, 2015

Test build #42588 has finished for PR 7987 at commit 09cba2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Interaction(override val uid: String) extends Transformer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrayBuffer is not used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@mengxr
Copy link
Contributor

mengxr commented Sep 17, 2015

LGTM except minor comments.

@SparkQA
Copy link

SparkQA commented Sep 17, 2015

Test build #42615 has finished for PR 7987 at commit 1ae9ef0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Interaction(override val uid: String) extends Transformer

@mengxr
Copy link
Contributor

mengxr commented Sep 17, 2015

Merged into master. Thanks!

@asfgit asfgit closed this in 4fbf332 Sep 17, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants