-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-11215][ML] Add multiple columns support to StringIndexer #19621
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-11215][ML] Add multiple columns support to StringIndexer #19621
Conversation
|
Test build #83263 has finished for PR 19621 at commit
|
faa8390 to
fa0be31
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This gets the values for each input column sequentially. Can we get the values for all input columns at one run?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you're right, this is what I am going to do.
|
@viirya Code updated. Thanks! |
|
Jenkins, test this please. |
97a2948 to
8e71b45
Compare
|
@WeichenXu123 I will try to look into this today. |
|
Test build #83872 has finished for PR 19621 at commit
|
|
Test build #83878 has finished for PR 19621 at commit
|
|
I want to ask, for option |
|
Seems in the frequency-based string orders, the order of labels with same frequency is non-deterministic. |
| ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.feature.StringIndexerModel.this"), | ||
| ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasOutputCols.outputCols"), | ||
| ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasOutputCols.getOutputCols"), | ||
| ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasOutputCols.org$apache$spark$ml$param$shared$HasOutputCols$_setter_$outputCols_=") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can those cause binary incompatibility issue in user application?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to keep binary compatibility for validateAndTransformSchema ? Will user extend this class and override this method ?
Others relates to outputCols parameter.
| case StringIndexer.frequencyDesc => countByValue.toSeq.sortBy(-_._2).map(_._1).toArray | ||
| case StringIndexer.frequencyAsc => countByValue.toSeq.sortBy(_._2).map(_._1).toArray | ||
| case StringIndexer.alphabetDesc => countByValue.toSeq.map(_._1).sortWith(_ > _).toArray | ||
| case StringIndexer.alphabetAsc => countByValue.toSeq.map(_._1).sortWith(_ < _).toArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For alphabetAsc and alphabetDesc, seems we don't need to aggregate to count by value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but will aggregate count bring apparent overhead ? I don't want the code including too many if ..else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the dataset is large, it might be. We can leave it as it is. If we find it is a bottleneck, we still can revisit it.
| * Model fitted by [[StringIndexer]]. | ||
| * | ||
| * @param labels Ordered list of labels, corresponding to indices to be assigned. | ||
| * @param labelsArray Array of Ordered list of labels, corresponding to indices to be assigned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ordered -> ordered.
| "Skip StringIndexerModel.") | ||
| return dataset.toDF | ||
| } | ||
| transformSchema(dataset.schema, logging = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can skip StringIndexerModel too if all input columns don't exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
| } | ||
| } | ||
| filteredDataset.withColumns(outputColNames.filter(_ != null), | ||
| outputColumns.filter(_ != null)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case outputColNames and outputColNames are empty, withColumns might return an empty dataset, not original dataset.
|
Test build #84066 has finished for PR 19621 at commit
|
|
Test build #84093 has finished for PR 19621 at commit
|
|
Jenkins retest this please. |
|
@WeichenXu123 with reference to #19621 (comment) - the sort is stable with respect to the input collection. So as long as the result of the "count by value" aggregation is deterministic so will the sort order be in the case of equalities. |
|
@MLnick Will RDD "count by value" aggregation be deterministic ? e.g., 2 RDD with the same elements, but with different element order and different partition number, will |
|
It won't be deterministic in the case of different RDDs / partitions / shuffle etc. For a given input RDD it should be deterministic? But perhaps we could ensure it by first sorting alphabetically and then by frequency? |
|
Test build #84101 has finished for PR 19621 at commit
|
|
@MLnick How about this way: |
|
The first case you mention wouldn’t actually end up sorting by freq, no? It
would have to be the other way around?
For second case, yes equality must mean it is the same string / key so
shouldn’t actually occur
…On Wed, 22 Nov 2017 at 16:01, WeichenXu ***@***.***> wrote:
@MLnick <https://github.com/mlnick> How about this way:
The case "fequencyAsc/Desc", sort first by frequency and then by alphabet,
The case "alphabetAsc/Desc", sort by alphabet (and if alphabetically
equal, the two label should be the same ?)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#19621 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AA_SB78zMKYwfyjaNMedcxN9GIFdJclNks5s5ClBgaJpZM4QM3Ni>
.
|
|
@MLnick Ah, I don't express it exactly, the first case, what I mean is, sort by frequency, but if the case frequency equal, sort by alphabet. |
|
|
||
| val countByValueArray = dataset.na.drop(inputCols) | ||
| .select(inputCols.map(col(_).cast(StringType)): _*) | ||
| .rdd.aggregate(zeroState)( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is treeAggregate better? I think it should be faster?
|
Test build #84125 has finished for PR 19621 at commit
|
|
I checked the failed tests in sparkR. There's some trouble in the failed |
|
Any one can provide some suggestion ? for fixing sparkR glm test failure here. (Only this one, and other failures are minor issue and easy to fix) |
|
stringindexer is set automatically for index column. are we having breaking API change here? |
|
@felixcheung There is no breaking change. But, we meet some trouble thing about indeterministic behavior. When frequency equal, the indexer result is indeterministic(if use default string order). I already fix those in RFormula test. But, I don't know how to fix the sparkR |
|
I think I understand what you are saying but the latest test failure I see it from spark.mlp instead and be results are different from the existing ones.
|
|
@felixcheung Yes, the spark.mlp test result changed because of indexer order changed. That's because, StringIndexer when item frequency equal, there's no definite rule for index order. And, in this PR, I change the code logic in StringIndexer, but it cannot make sure to generate exactly the same indexer order(this is uncontrollable), because when item frequency equal, there's no definite rule for index order, if I add some additional rule to make the indexer order stable, than the result is different from current result. So it make me very trouble. |
|
maybe we could also change the test itself to make it more deterministic? @actuaryzhang do you have any thought on this? |
|
@felixcheung "iris" is a built-in dataset in R, used in many algo testing, so is it proper to change it ? |
|
|
||
| require((isSet(inputCol) && isSet(outputCol) && !isSet(inputCols) && !isSet(outputCols)) || | ||
| (!isSet(inputCol) && !isSet(outputCol) && isSet(inputCols) && isSet(outputCols)), | ||
| "Only allow to set either inputCol/outputCol, or inputCols/outputCols" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe match the language for the exception message here?
StringIndexer only supports setting either inputCol/outputCol or inputCols/outputCols
| if (isSet(inputCol)) { | ||
| (Array($(inputCol)), Array($(outputCol))) | ||
| } else { | ||
| require($(inputCols).length == $(outputCols).length, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should add a test case for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test added.
|
Test build #84955 has finished for PR 19621 at commit
|
|
You can change the dataset used in testing.
Will be good if you could test with the same data before and after your change to make sure that’s not broken.
|
|
@felixcheung Another failed testcase, spark.mlp in sparkR, it also use It can not set the string order and the default Now I make the |
|
I think we need to address that too. Sounds to me these tests aren’t stable before.
|
|
I am too busy recently to fix those failed R tests. Anyone who has spare time can take over this PR and I will help review. Thanks! |
## What changes were proposed in this pull request? This takes over #19621 to add multi-column support to StringIndexer: 1. Supports encoding multiple columns. 2. Previously, when specifying `frequencyDesc` or `frequencyAsc` as `stringOrderType` param in `StringIndexer`, in case of equal frequency, the order of strings is undefined. After this change, the strings with equal frequency are further sorted alphabetically. ## How was this patch tested? Added tests. Closes #20146 from viirya/SPARK-11215. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request? This takes over apache#19621 to add multi-column support to StringIndexer: 1. Supports encoding multiple columns. 2. Previously, when specifying `frequencyDesc` or `frequencyAsc` as `stringOrderType` param in `StringIndexer`, in case of equal frequency, the order of strings is undefined. After this change, the strings with equal frequency are further sorted alphabetically. ## How was this patch tested? Added tests. Closes apache#20146 from viirya/SPARK-11215. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: Sean Owen <[email protected]>
What changes were proposed in this pull request?
Add multiple columns support to StringIndexer.
How was this patch tested?
UT added.