Conversation

@WeichenXu123
Contributor

@WeichenXu123 WeichenXu123 commented Oct 31, 2017

What changes were proposed in this pull request?

Add multiple columns support to StringIndexer.

How was this patch tested?

UT added.

@WeichenXu123 WeichenXu123 changed the title [SPARK-11215][ml] Add multiple columns support to StringIndexer [SPARK-11215][ML] Add multiple columns support to StringIndexer Oct 31, 2017
@SparkQA

SparkQA commented Oct 31, 2017

Test build #83263 has finished for PR 19621 at commit faa8390.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 WeichenXu123 force-pushed the multi-col-string-indexer branch from faa8390 to fa0be31 Compare October 31, 2017 15:51
Member

This gets the values for each input column sequentially. Can we get the values for all input columns at one run?

Contributor Author

Yes you're right, this is what I am going to do.

@WeichenXu123
Contributor Author

@viirya Code updated. Thanks!

@WeichenXu123
Contributor Author

Jenkins, test this please.

@WeichenXu123 WeichenXu123 force-pushed the multi-col-string-indexer branch from 97a2948 to 8e71b45 Compare November 3, 2017 10:16
@WeichenXu123
Contributor Author

@viirya @MLnick Thanks!

@viirya
Member

viirya commented Nov 15, 2017

@WeichenXu123 I will try to look into this today.

@SparkQA

SparkQA commented Nov 15, 2017

Test build #83872 has finished for PR 19621 at commit b0b14b0.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 15, 2017

Test build #83878 has finished for PR 19621 at commit 77bea32.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

A question about the StringIndexer.frequencyDesc option: when two labels have the same frequency, which one is put first?
If this is not specified, we should modify some test cases in RFormula, which will otherwise produce nondeterministic results.

@viirya
Member

viirya commented Nov 16, 2017

It seems that with the frequency-based string orderings, the order of labels with the same frequency is non-deterministic.

ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.feature.StringIndexerModel.this"),
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasOutputCols.outputCols"),
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasOutputCols.getOutputCols"),
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasOutputCols.org$apache$spark$ml$param$shared$HasOutputCols$_setter_$outputCols_=")
Member

Can those cause binary incompatibility issues in user applications?

Contributor Author

Do we need to keep binary compatibility for validateAndTransformSchema? Would users extend this class and override that method?
The others relate to the outputCols parameter.

case StringIndexer.frequencyDesc => countByValue.toSeq.sortBy(-_._2).map(_._1).toArray
case StringIndexer.frequencyAsc => countByValue.toSeq.sortBy(_._2).map(_._1).toArray
case StringIndexer.alphabetDesc => countByValue.toSeq.map(_._1).sortWith(_ > _).toArray
case StringIndexer.alphabetAsc => countByValue.toSeq.map(_._1).sortWith(_ < _).toArray
Member

For alphabetAsc and alphabetDesc, it seems we don't need the count-by-value aggregation.

Contributor Author

Yes, but does the count aggregation add noticeable overhead? I'd rather not clutter the code with too many if/else branches.

Member

If the dataset is large, it might. We can leave it as is and revisit if it turns out to be a bottleneck.
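The four orderings in the snippet above can be mirrored in plain Python to make the discussion concrete (a hypothetical sketch, not the actual Scala implementation; `ordered_labels` is an illustrative name):

```python
from collections import Counter

def ordered_labels(values, order_type):
    # Count occurrences of each distinct value, mirroring countByValue.
    counts = Counter(values)
    if order_type == "frequencyDesc":
        # Most frequent label first.
        return [k for k, _ in sorted(counts.items(), key=lambda kv: -kv[1])]
    elif order_type == "frequencyAsc":
        # Least frequent label first.
        return [k for k, _ in sorted(counts.items(), key=lambda kv: kv[1])]
    elif order_type == "alphabetDesc":
        return sorted(counts, reverse=True)
    elif order_type == "alphabetAsc":
        return sorted(counts)
    raise ValueError("unknown stringOrderType: " + order_type)

data = ["b", "a", "a", "c", "a", "b"]
print(ordered_labels(data, "frequencyDesc"))  # ['a', 'b', 'c']
print(ordered_labels(data, "alphabetDesc"))   # ['c', 'b', 'a']
```

As the review notes, the two alphabet branches only need the distinct values; the counts are computed but unused there.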

* Model fitted by [[StringIndexer]].
*
* @param labels Ordered list of labels, corresponding to indices to be assigned.
* @param labelsArray Array of Ordered list of labels, corresponding to indices to be assigned
Member

Ordered -> ordered.

"Skip StringIndexerModel.")
return dataset.toDF
}
transformSchema(dataset.schema, logging = true)
Member

Can we skip StringIndexerModel too if none of the input columns exist?

Contributor Author

updated.

}
}
filteredDataset.withColumns(outputColNames.filter(_ != null),
outputColumns.filter(_ != null))
Member

If outputColNames and outputColumns are empty, withColumns might return an empty dataset rather than the original dataset.

@SparkQA

SparkQA commented Nov 21, 2017

Test build #84066 has finished for PR 19621 at commit e5db190.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 22, 2017

Test build #84093 has finished for PR 19621 at commit 031f53f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

Jenkins retest this please.

@MLnick
Contributor

MLnick commented Nov 22, 2017

@WeichenXu123 with reference to #19621 (comment): the sort is stable with respect to the input collection, so as long as the result of the "count by value" aggregation is deterministic, the sort order will be deterministic in the case of ties.
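A quick plain-Python illustration of the stability point (hypothetical data; Python's `sorted`, like Scala's `sortBy`, is guaranteed stable):

```python
# Two labels tie at frequency 2; a stable sort keeps their input order.
pairs = [("b", 2), ("a", 2), ("c", 3)]
by_freq = sorted(pairs, key=lambda kv: -kv[1])
print(by_freq)  # [('c', 3), ('b', 2), ('a', 2)] — 'b' stays ahead of 'a'
```

So ties are resolved by whatever order the count-by-value step happened to produce, which is exactly why that step must be deterministic.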

@WeichenXu123
Contributor Author

@MLnick Will the RDD "count by value" aggregation be deterministic? E.g., for two RDDs with the same elements but a different element order and a different number of partitions, will rdd.countByValue().toSeq stay deterministic? The shuffle inside countByValue also seems liable to break determinism.

@MLnick
Contributor

MLnick commented Nov 22, 2017

It won't be deterministic across different RDDs / partitionings / shuffles, etc. For a given input RDD it should be deterministic?

But perhaps we could ensure it by first sorting alphabetically and then by frequency?

@SparkQA

SparkQA commented Nov 22, 2017

Test build #84101 has finished for PR 19621 at commit 031f53f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

@MLnick How about this:
For frequencyAsc/Desc, sort first by frequency and then alphabetically.
For alphabetAsc/Desc, sort alphabetically (and if two labels compare equal alphabetically, they must be the same label?)

@MLnick
Contributor

MLnick commented Nov 22, 2017 via email

@WeichenXu123
Contributor Author

WeichenXu123 commented Nov 23, 2017

@MLnick Ah, I didn't express it precisely. In the first case, what I mean is: sort by frequency, and when frequencies are equal, break ties alphabetically.
That seems equivalent to what you said, "we could ensure it by first sorting alphabetically and then by frequency" (given a stable sort).
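The two formulations can be checked against each other in a small plain-Python sketch (hypothetical labels, not Spark code):

```python
from collections import Counter

values = ["banana", "apple", "cherry", "apple", "banana", "date"]
counts = Counter(values)

# Formulation A: stable-sort alphabetically first, then by descending frequency.
labels_a = sorted(counts)                              # alphabetical pre-sort
labels_a = sorted(labels_a, key=lambda k: -counts[k])  # stable sort by frequency

# Formulation B: a single sort with a composite key (frequency desc, label asc).
labels_b = sorted(counts, key=lambda k: (-counts[k], k))

print(labels_a == labels_b)  # True — the two formulations agree
```

Because the second sort in formulation A is stable, labels with equal frequency keep their alphabetical pre-sort order, which is exactly what the composite key in formulation B produces.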


val countByValueArray = dataset.na.drop(inputCols)
.select(inputCols.map(col(_).cast(StringType)): _*)
.rdd.aggregate(zeroState)(
Member

Is treeAggregate better? I think it should be faster?

@WeichenXu123
Contributor Author

@viirya @MLnick Code updated. Thanks!

@SparkQA

SparkQA commented Nov 23, 2017

Test build #84125 has finished for PR 19621 at commit 66d054a.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

I checked the failed SparkR tests. There's some trouble with the failing glm tests.
These tests compare SparkR glm and R's glm results on the "iris" dataset, but what is the string-indexing order used by R's glm? Looking at "iris", the "Species" column has three values, "setosa", "versicolor", and "virginica", each with frequency 50, and only when RFormula indexes them as "setosa" -> 2, "versicolor" -> 0, "virginica" -> 1 does the result match R's glm. That is a strange indexing order.
How can we set the string-indexing order to match R's glm?

@WeichenXu123
Contributor Author

WeichenXu123 commented Dec 1, 2017

Can anyone provide a suggestion for fixing the SparkR glm test failure here? (Only this one; the other failures are minor and easy to fix.)
@felixcheung

@felixcheung
Member

StringIndexer is set automatically for the index column. Do we have a breaking API change here?
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala#L216

@WeichenXu123
Contributor Author

WeichenXu123 commented Dec 5, 2017

@felixcheung There is no breaking change, but we've run into trouble with nondeterministic behavior: when frequencies are equal, the indexing result is nondeterministic (with the default string order). I already fixed those cases in the RFormula tests, but I don't know how to fix the SparkR glm test, because it depends on the R glm library and I don't know how to set the indexing order for it. Do you know anyone who is familiar with this?

@felixcheung
Member

felixcheung commented Dec 6, 2017 via email

@WeichenXu123
Contributor Author

WeichenXu123 commented Dec 7, 2017

@felixcheung Yes, the spark.mlp test result changed because the indexer order changed. When item frequencies are equal, StringIndexer has no definite rule for the index order. This PR changes the logic in StringIndexer, but it cannot guarantee exactly the same indexer order as before (that is uncontrollable): with equal frequencies there is no definite ordering rule, and if I add an extra rule to make the order stable, the result differs from the current one. That's what's troubling me.
Looking at the spark.mlp test data, "iris": the "Species" column has three values, "setosa", "versicolor", and "virginica", all with the same frequency.

@felixcheung
Member

Maybe we could also change the test itself to make it more deterministic?
We could first create a new test dataset that avoids ties in frequency, run it through the original implementation, then run it through this change, and make sure the results match.

@actuaryzhang do you have any thought on this?

@WeichenXu123
Contributor Author

@felixcheung "iris" is a built-in dataset in R used to test many algorithms, so is it appropriate to change it?


require((isSet(inputCol) && isSet(outputCol) && !isSet(inputCols) && !isSet(outputCols)) ||
(!isSet(inputCol) && !isSet(outputCol) && isSet(inputCols) && isSet(outputCols)),
"Only allow to set either inputCol/outputCol, or inputCols/outputCols"
Contributor

Maybe match the wording of the exception message here?

StringIndexer only supports setting either inputCol/outputCol or inputCols/outputCols

if (isSet(inputCol)) {
(Array($(inputCol)), Array($(outputCol)))
} else {
require($(inputCols).length == $(outputCols).length,
Contributor

Should add a test case for this

Contributor Author

test added.
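The two checks discussed in this thread — mutual exclusivity of single/multi column params, plus matching lengths — could be sketched in plain Python (a hypothetical helper for illustration; the real implementation is the Scala `require` above):

```python
def validate_params(input_col=None, output_col=None,
                    input_cols=None, output_cols=None):
    # Either the single-column pair or the multi-column pair must be set,
    # never a mix of the two.
    single = (input_col is not None and output_col is not None
              and input_cols is None and output_cols is None)
    multi = (input_col is None and output_col is None
             and input_cols is not None and output_cols is not None)
    if not (single or multi):
        raise ValueError("StringIndexer only supports setting either "
                         "inputCol/outputCol or inputCols/outputCols")
    # In the multi-column case, the two lists must line up one-to-one.
    if multi and len(input_cols) != len(output_cols):
        raise ValueError("inputCols and outputCols must have the same length")

validate_params(input_col="label", output_col="labelIndex")       # OK
validate_params(input_cols=["a", "b"], output_cols=["ai", "bi"])  # OK
```

Mixing the two styles, or passing lists of different lengths, raises a ValueError.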

@SparkQA

SparkQA commented Dec 15, 2017

Test build #84955 has finished for PR 19621 at commit bb209c8.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

felixcheung commented Dec 15, 2017 via email

@WeichenXu123
Contributor Author

WeichenXu123 commented Dec 18, 2017

@felixcheung Another failing test case: spark.mlp in SparkR. It also uses RFormula and will also generate nondeterministic results; see class MultilayerPerceptronClassifierWrapper line 78:

val rFormula = new RFormula()
      .setFormula(formula)
      .setForceIndexLabel(true)
      .setHandleInvalid(handleInvalid)

It cannot set the string order, and the default frequencyDesc order will produce nondeterministic results.

Now I've made StringIndexer, when frequencyDesc is specified, sort by (frequency, alphabet), so the sort result is stable. This changes the spark.mlp test case result, and I will update the expected result there. Do you agree with this?
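A plain-Python sketch (hypothetical, not Spark code) of this tie-break — frequencyDesc falling back to alphabetical order — applied to the iris Species values, where every label has frequency 50:

```python
from collections import Counter

# All three iris Species values occur 50 times each.
species = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
counts = Counter(species)

# frequencyDesc with an alphabetical tie-break: with every frequency tied,
# the ordering reduces to plain alphabetical order.
labels = sorted(counts, key=lambda k: (-counts[k], k))
index = {label: i for i, label in enumerate(labels)}
print(index)  # {'setosa': 0, 'versicolor': 1, 'virginica': 2}
```

With all frequencies tied, the order is deterministic regardless of input order or partitioning — though, as discussed above, it need not match the ordering R's glm happens to use.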

@felixcheung
Member

felixcheung commented Dec 19, 2017 via email

@WeichenXu123
Contributor Author

I've been too busy recently to fix the failing R tests. Anyone with spare time is welcome to take over this PR, and I will help review. Thanks!

@WeichenXu123 WeichenXu123 deleted the multi-col-string-indexer branch January 11, 2018 18:43
srowen pushed a commit that referenced this pull request Jan 29, 2019
## What changes were proposed in this pull request?

This takes over #19621 to add multi-column support to StringIndexer:

1. Supports encoding multiple columns.
2. Previously, when specifying `frequencyDesc` or `frequencyAsc` as `stringOrderType` param in `StringIndexer`, in case of equal frequency, the order of strings is undefined. After this change, the strings with equal frequency are further sorted alphabetically.

## How was this patch tested?

Added tests.

Closes #20146 from viirya/SPARK-11215.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019