[SPARK-9698] [ML] Add RInteraction transformer for supporting R-style feature interactions #7987
Test build #40009 has finished for PR 7987 at commit
|
|
Test build #40011 has finished for PR 7987 at commit
|
|
@ericl Shall we split this PR into two?
After 1) is merged, people can start working on the Python API, without being blocked by 2). |
|
@mengxr done, this PR now just has the RInteraction changes. |
|
Test build #40114 has finished for PR 7987 at commit
|
|
Test build #40115 has finished for PR 7987 at commit
|
|
I'm not clear on how the ordering is determined. Looking at the tests, for a categorical interaction it appears to be based on the order in which the unique category values of a categorical variable are encountered. Specifically, for the numeric/categorical interaction, the last category encountered ("baz") provides the first values of the interaction, and the first category encountered ("foo") provides the last values. In contrast, for the interaction between two categorical variables, the first category of the second underlying categorical variable (the value zq) is primary in the column ordering (with zq-bar being the first column), so encounter order is used again, but it runs in the opposite direction for the two variables. This structure will work fine for model training; however, things get more complicated when predicting new data with the model. The approach is basically the same one MS/Revolution uses in their Revo ScaleR package (i.e., the order of the categories depends on when they are first encountered in the data), and in practice it greatly complicates predicting new data with a Revo ScaleR model. Open source R instead first determines all the category labels for each categorical variable, sorts the unique labels alphabetically, and then bases the new feature order on that sort, so the order in which a category label is encountered does not matter. This makes predicting new data with an existing model much easier. The cost is that the data needs to be passed over twice, with the first pass determining the set of unique category labels. |
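To make the contrast in the comment above concrete, here is a minimal plain-Python sketch (not Spark or R code) of the two indexing schemes it describes. The function names and the sample labels are hypothetical, chosen only for illustration.

```python
def index_by_first_seen(values):
    """Assign indices in the order labels are first encountered (single pass)."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return mapping


def index_alphabetically(values):
    """Collect the unique labels first, then index them in sorted order."""
    return {label: i for i, label in enumerate(sorted(set(values)))}


# Hypothetical training data and a later batch with the same labels
# arriving in a different order.
train = ["foo", "bar", "baz", "foo"]
new_batch = ["baz", "foo", "bar"]

first_seen_train = index_by_first_seen(train)
first_seen_new = index_by_first_seen(new_batch)
alpha_train = index_alphabetically(train)
alpha_new = index_alphabetically(new_batch)

print(first_seen_train, first_seen_new)  # mappings differ between the two datasets
print(alpha_train, alpha_new)            # mappings agree
```

If the mapping is recomputed per dataset, only the sorted scheme yields the same column layout for both inputs, which is the property the comment is asking for.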
|
@dputler In a distributed setting, we need to make at least one pass to collect all categories. The ordering is not alphabetical but by frequency (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L86). The most frequent category gets index
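A rough plain-Python sketch of frequency-based indexing follows. It is not the `StringIndexer` implementation: `fit_frequency_index` is a hypothetical helper, and both the choice to give lower indices to more frequent labels and the alphabetical tie-break are assumptions of this sketch; the linked source is the authority on Spark's actual convention.

```python
from collections import Counter


def fit_frequency_index(values):
    """Index labels by descending frequency (one counting pass over the data).

    Assumptions of this sketch: lower index means more frequent, and ties
    are broken alphabetically so the result is deterministic.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


labels = ["a", "b", "c", "a", "a", "c"]  # hypothetical column values
mapping = fit_frequency_index(labels)
print(mapping)
```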
|
That actually doesn't deal with the scoring issue. What happens when new data to be predicted from an existing model has a more frequent category in a categorical variable than was the case in the training data? What happens if this is included in a Spark Streaming scoring process when the batch size might be one? As before, the frequency-based indexing works for estimation, but will cause heartburn in many cases when trying to predict new data with an existing model. |
|
If I understand correctly, the concern is that the category-to-index assignment when predicting data will be different from that used when fitting the model. This should be OK here since It is true that it would be nice to have a more predictable ordering (such as alphabetic) for some tasks like comparing coefficients between different models, but I think that could be a feature of
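One way to picture the scoring concern is the sketch below, which assumes the category-to-index mapping is computed once at fit time and stored with the fitted object, so prediction never recomputes frequencies. `FittedIndexer` and `fit` are hypothetical names for illustration, not Spark classes.

```python
from collections import Counter


class FittedIndexer:
    """Holds a label-to-index mapping that was fixed at fit time."""

    def __init__(self, mapping):
        self.mapping = dict(mapping)

    def transform(self, values):
        # Look up the stored mapping only; frequencies in the new data are ignored.
        return [self.mapping[v] for v in values]


def fit(values):
    """Build a frequency-ordered mapping from training data (sketch-level convention)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return FittedIndexer({label: i for i, label in enumerate(ordered)})


model = fit(["foo", "foo", "bar", "baz", "foo", "bar"])

# New data where "baz" is now the most frequent label.
scored = model.transform(["baz", "baz", "baz", "foo"])

# A streaming-style batch of size one.
single = model.transform(["bar"])

print(scored, single)
```

Under that assumption, the indices depend only on the training data, so a shift in category frequencies or a batch of one row does not change the assignment. A label never seen during fitting would raise a `KeyError` in this sketch; how to handle that case is a separate design question.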
|
@mengxr I did the refactoring as suggested |
|
Test build #42524 has finished for PR 7987 at commit
|
remove R-style because this is a common feature transformation
|
Test build #42542 has finished for PR 7987 at commit
|
|
@mengxr I made the requested changes. I found it simpler to keep |
clean up validate params
|
Test build #42588 has finished for PR 7987 at commit
|
ArrayBuffer is not used.
Done
|
LGTM except minor comments. |
|
Test build #42615 has finished for PR 7987 at commit
|
|
Merged into master. Thanks! |
This is a pre-req for supporting the ":" operator in the RFormula feature transformer.
Design doc from umbrella task: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit
@mengxr
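For readers unfamiliar with the ":" interaction mentioned in the description, the following plain-Python sketch shows the general idea as a flattened outer product of the input feature vectors, with categorical inputs one-hot encoded. The helper names are hypothetical, the column ordering is a choice of this sketch rather than a statement about the merged transformer, and R-specific details such as dropping a reference level are not modeled.

```python
def one_hot(index, size):
    """Encode a category index as a one-hot vector of the given size."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def interact(*vectors):
    """Flattened outer product of the inputs (first input varies slowest)."""
    result = [1.0]
    for vec in vectors:
        result = [a * b for a in result for b in vec]
    return result


# Hypothetical label-to-index mapping for a three-level categorical column.
mapping = {"bar": 0, "baz": 1, "foo": 2}

# Numeric x categorical: the numeric value lands in the active category's slot.
num_by_cat = interact([2.5], one_hot(mapping["foo"], 3))

# Categorical x categorical: one column per pair of levels (2 * 3 = 6 columns).
cat_by_cat = interact(one_hot(1, 2), one_hot(2, 3))

print(num_by_cat)
print(cat_by_cat)
```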