[SPARK-9698] [ML] Add RInteraction transformer for supporting R-style feature interactions #7987

ericl · 2015-08-06T07:17:42Z

This is a pre-req for supporting the ":" operator in the RFormula feature transformer.

Design doc from umbrella task: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit

@mengxr
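
For readers unfamiliar with the ":" operator, here is a minimal plain-Python sketch of what a feature interaction computes: the product of every combination of one entry from each input column. The function name `interact` and the output ordering (last column varies fastest) are assumptions made for illustration; this is not the transformer's actual implementation or its column order.

```python
from itertools import product
from math import prod


def interact(*columns):
    """Return the flattened products of one entry from each input column.

    Hypothetical helper: the ordering follows itertools.product
    (last column varies fastest), which is an assumption of this sketch.
    """
    return [prod(combo) for combo in product(*columns)]


# A numeric feature interacted with a one-hot encoded 3-level category.
numeric = [2.0]
one_hot = [0.0, 1.0, 0.0]
numeric_by_category = interact(numeric, one_hot)

# Two one-hot encoded categories (2 levels and 3 levels) give 2 * 3 columns.
category_by_category = interact([1.0, 0.0], [0.0, 1.0, 0.0])
```

With one-hot inputs, exactly one output column is non-zero, which is why the order of the output columns is worth pinning down.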

SparkQA · 2015-08-06T07:58:35Z

Test build #40009 has finished for PR 7987 at commit 386881b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class RInteraction(override val uid: String) extends Estimator[PipelineModel]

SparkQA · 2015-08-06T08:04:53Z

Test build #40011 has finished for PR 7987 at commit 4c11a77.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2015-08-06T18:44:04Z

@ericl Shall we split this PR into two?

1) Add Interaction as a transformer (SPARK-9698).
2) Support feature interaction in RFormula.

After 1) is merged, people can start working on the Python API, without being blocked by 2).

ericl · 2015-08-07T00:34:48Z

@mengxr done, this PR now just has the RInteraction changes.

SparkQA · 2015-08-07T01:09:38Z

Test build #40114 has finished for PR 7987 at commit 303b8d7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-08-07T01:21:08Z

Test build #40115 has finished for PR 7987 at commit 26b6925.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class RInteraction(override val uid: String) extends Estimator[PipelineModel]

dputler · 2015-08-20T06:45:13Z

I'm not clear on how the column order is determined. Looking at the tests, for a categorical interaction it appears to be based on the order in which unique category values are first encountered. Specifically, for the numeric/categorical interaction, the last category encountered ("baz") provides the first values of the interaction, and the first category encountered ("foo") provides the last values. In contrast, for the interaction between two categorical variables, the first category of the second underlying categorical variable (the value zq) is primary in the column ordering (with zq-bar being the first column). So encounter order is used again, but it runs in the opposite direction for the two variables.

This structure will work fine for model training; however, things get more complicated when predicting new data with the model. It is basically the same approach MS/Revolution uses in their Revo ScaleR package (i.e., the order of the categories depends on when they are first encountered in the data), and in practice it greatly complicates predicting new data with a Revo ScaleR model.

Open source R works by first determining all the category labels for each categorical variable, sorting the unique labels alphabetically, and then basing the new feature order on that sort, so the order in which a category label is encountered does not matter. This makes predicting new data with an existing model much easier. The cost is that the data needs to be passed over twice, with the first pass determining the set of unique category labels.
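
The difference between the two ordering schemes described above can be sketched in plain Python. Both helper names are hypothetical, and the label batches are made up for illustration; neither function is taken from R, Revo ScaleR, or Spark.

```python
def first_seen_index(labels):
    """Assign indices in the order labels are first encountered."""
    index = {}
    for label in labels:
        # setdefault only inserts when the label has not been seen yet.
        index.setdefault(label, len(index))
    return index


def alphabetical_index(labels):
    """Assign indices by sorting the unique labels (needs a full pass first)."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


# Two batches holding the same categories in a different row order.
batch_a = ["foo", "bar", "baz", "bar"]
batch_b = ["baz", "foo", "bar"]

# Encounter order depends on the rows, so the two mappings disagree.
encounter_a = first_seen_index(batch_a)
encounter_b = first_seen_index(batch_b)

# Alphabetical order depends only on the label set, so they agree.
alpha_a = alphabetical_index(batch_a)
alpha_b = alphabetical_index(batch_b)
```

The alphabetical mapping is stable across batches because it is a function of the set of labels only, at the cost of the extra pass needed to collect that set.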

mengxr · 2015-08-20T17:31:28Z

@dputler In a distributed setting, we need to make at least one pass to collect all categories. The ordering is not alphabetical but by frequency (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L86). The most frequent category gets index 0, and the one-hot encoder by default drops the last category, i.e., the least frequent one.
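
A rough plain-Python sketch of the frequency-based ordering described above, followed by one-hot encoding that drops the last category. The function names are hypothetical, and breaking frequency ties alphabetically is an assumption of this sketch, not a statement about StringIndexer.

```python
from collections import Counter


def fit_frequency_index(labels):
    """Map each label to an index, most frequent label first.

    Ties are broken alphabetically here; that tie-break is an assumption.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot_drop_last(label, index):
    """One-hot encode a label, leaving out the column of the last index."""
    vec = [0.0] * (len(index) - 1)
    i = index[label]
    if i < len(vec):
        vec[i] = 1.0
    return vec


labels = ["foo", "bar", "bar", "baz", "bar", "foo"]
index = fit_frequency_index(labels)          # bar: 3 rows, foo: 2, baz: 1
most_frequent = one_hot_drop_last("bar", index)
least_frequent = one_hot_drop_last("baz", index)
```

The least frequent category ends up as the all-zeros vector, so it acts as the reference level.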

dputler · 2015-08-20T18:40:35Z

That actually doesn't deal with the scoring issue. What happens when new data to be predicted from an existing model has a more frequent category in a categorical variable than was the case in the training data? What happens if this is included in a Spark Streaming scoring process where the batch size might be one? As before, the frequency-based indexing works for estimation, but it will cause heartburn in many cases when trying to predict new data with an existing model.

ericl · 2015-08-20T19:15:13Z

If I understand correctly, the concern is that the category-to-index assignment when predicting data will be different from that used when fitting the model. This should be OK here since StringIndexer retains a mapping from category to indices, which is reused when calling predict() on the model later.

It is true that it would be nice to have a more predictable ordering (such as alphabetic) for some tasks like comparing coefficients between different models, but I think that could be a feature of StringIndexer and is not really related to this PR.
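
The reuse of the fitted mapping can be illustrated with a small plain-Python sketch. `FittedIndexer` is a hypothetical stand-in, not Spark's StringIndexer model; the alphabetical tie-break and the behaviour on unseen labels (a plain `KeyError`) are assumptions of the sketch.

```python
from collections import Counter


class FittedIndexer:
    """Hypothetical stand-in for a fitted indexer that stores its mapping."""

    def __init__(self, training_labels):
        counts = Counter(training_labels)
        # Frequency order is computed once, from the training data only.
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        self.index = {label: i for i, label in enumerate(ordered)}

    def transform(self, labels):
        # Scoring only looks labels up; it never recounts frequencies.
        # An unseen label raises KeyError in this sketch.
        return [self.index[label] for label in labels]


model = FittedIndexer(["foo", "bar", "bar", "baz", "bar", "foo"])

# New data where "baz" is now the most frequent category.
skewed_batch = model.transform(["baz", "baz", "baz", "foo"])

# A streaming-style batch of size one.
single_row = model.transform(["baz"])
```

Because the indices come from the stored mapping, the category frequencies in the new data, or a batch size of one, do not change the assignment.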

ericl · 2015-09-16T03:19:52Z

@mengxr I did the refactoring as suggested

SparkQA · 2015-09-16T03:31:53Z

Test build #42524 has finished for PR 7987 at commit 92c8287.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Interaction(override val uid: String) extends Transformer

mengxr · 2015-09-16T12:17:02Z

mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala

remove R-style because this is a common feature transformation

SparkQA · 2015-09-16T20:37:48Z

Test build #42542 has finished for PR 7987 at commit 92c8287.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Interaction(override val uid: String) extends Transformer

ericl · 2015-09-17T08:20:11Z

@mengxr I made the requested changes. I found it simpler to keep numFeatures in combination with an array of offsets instead of just the cumulative count though.

clean up validate params

SparkQA · 2015-09-17T09:16:46Z

Test build #42588 has finished for PR 7987 at commit 09cba2c.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Interaction(override val uid: String) extends Transformer

mengxr · 2015-09-17T20:10:56Z

mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala

ArrayBuffer is not used.

mengxr · 2015-09-17T20:12:18Z

LGTM except minor comments.

SparkQA · 2015-09-17T21:06:29Z

Test build #42615 has finished for PR 7987 at commit 1ae9ef0.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Interaction(override val uid: String) extends Transformer

mengxr · 2015-09-17T21:09:29Z

Merged into master. Thanks!

ericl added 11 commits August 3, 2015 18:32

first pass

429cb52

compiles now

0ece16c

combiner

ab2a347

attribute generation

a3623aa

Wed Aug 5 15:58:14 PDT 2015

a12e58e

Wed Aug 5 19:59:50 PDT 2015

dc8801a

fix parser

2957cb6

Merge branch 'master' into interaction

11bb70f

add rformula test

478ee8f

docs

3ad5464

Wed Aug 5 23:15:14 PDT 2015

5f7cb9b

ericl force-pushed the interaction branch from e44dd83 to 386881b Compare August 6, 2015 07:19

small nits

4c11a77

ericl force-pushed the interaction branch from 386881b to 4c11a77 Compare August 6, 2015 07:32

ericl added 2 commits August 6, 2015 17:21

tests and attribute refactorign

e5099f6

Merge branch 'master' into interaction

3816477

ericl changed the title ~~[SPARK-9681] [ML] Support R feature interactions in RFormula~~ [SPARK-9698] [ML] Add RInteraction transformer for supporting R-style feature interactions Aug 7, 2015

ericl force-pushed the interaction branch from 303b8d7 to 7a21488 Compare August 7, 2015 00:31

Revert user-facing R changes

26b6925

ericl force-pushed the interaction branch from 7a21488 to 26b6925 Compare August 7, 2015 00:36

revert attributes change

92c8287

mengxr reviewed Sep 16, 2015

first pass

09cba2c

clean up validate params

ericl force-pushed the interaction branch from c72b11f to 09cba2c Compare September 17, 2015 08:20

mengxr reviewed Sep 17, 2015

comments 2

1ae9ef0

asfgit closed this in 4fbf332 Sep 17, 2015
