Conversation

@WeichenXu123
Contributor

@WeichenXu123 WeichenXu123 commented Oct 31, 2017

What changes were proposed in this pull request?

Add multiple columns support to StringIndexer.

How was this patch tested?

UT added.

@WeichenXu123 WeichenXu123 changed the title [SPARK-11215][ml] Add multiple columns support to StringIndexer [SPARK-11215][ML] Add multiple columns support to StringIndexer Oct 31, 2017
@SparkQA

SparkQA commented Oct 31, 2017

Test build #83263 has finished for PR 19621 at commit faa8390.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 WeichenXu123 force-pushed the multi-col-string-indexer branch from faa8390 to fa0be31 Compare October 31, 2017 15:51
Member

This gets the values for each input column sequentially. Can we get the values for all input columns at one run?

Contributor Author

Yes you're right, this is what I am going to do.

@WeichenXu123
Contributor Author

@viirya Code updated. Thanks!

@WeichenXu123
Contributor Author

Jenkins, test this please.

@WeichenXu123 WeichenXu123 force-pushed the multi-col-string-indexer branch from 97a2948 to 8e71b45 Compare November 3, 2017 10:16
@WeichenXu123
Contributor Author

@viirya @MLnick Thanks!

@viirya
Member

viirya commented Nov 15, 2017

@WeichenXu123 I will try to look into this today.

@SparkQA

SparkQA commented Nov 15, 2017

Test build #83872 has finished for PR 19621 at commit b0b14b0.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 15, 2017

Test build #83878 has finished for PR 19621 at commit 77bea32.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

A question about the StringIndexer.frequencyDesc option: when two labels have the same frequency, which one is put first?
If this is not specified, we should modify some test cases in RFormula, which will otherwise produce nondeterministic results.

@viirya
Member

viirya commented Nov 16, 2017

It seems that with the frequency-based string orderings, the order of labels with the same frequency is non-deterministic.

ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.feature.StringIndexerModel.this"),
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasOutputCols.outputCols"),
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasOutputCols.getOutputCols"),
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasOutputCols.org$apache$spark$ml$param$shared$HasOutputCols$_setter_$outputCols_=")
Member

Can those cause binary incompatibility issues in user applications?

Contributor Author

Do we need to keep binary compatibility for validateAndTransformSchema? Would users extend this class and override that method?
The others relate to the outputCols parameter.

case StringIndexer.frequencyDesc => countByValue.toSeq.sortBy(-_._2).map(_._1).toArray
case StringIndexer.frequencyAsc => countByValue.toSeq.sortBy(_._2).map(_._1).toArray
case StringIndexer.alphabetDesc => countByValue.toSeq.map(_._1).sortWith(_ > _).toArray
case StringIndexer.alphabetAsc => countByValue.toSeq.map(_._1).sortWith(_ < _).toArray
Member

For alphabetAsc and alphabetDesc, it seems we don't need the count-by-value aggregation.

Contributor Author

Yes, but does the count aggregation add noticeable overhead? I'd rather not clutter the code with too many if/else branches.

Member

If the dataset is large, it might. We can leave it as is and revisit if it turns out to be a bottleneck.
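The four orderings in the snippet above can be mirrored in plain Python to make the discussion concrete (a hypothetical sketch, not the actual Scala implementation; `ordered_labels` is an illustrative name):

```python
from collections import Counter

def ordered_labels(values, order_type):
    # Count occurrences of each distinct value, mirroring countByValue.
    counts = Counter(values)
    if order_type == "frequencyDesc":
        # Most frequent label first.
        return [k for k, _ in sorted(counts.items(), key=lambda kv: -kv[1])]
    elif order_type == "frequencyAsc":
        # Least frequent label first.
        return [k for k, _ in sorted(counts.items(), key=lambda kv: kv[1])]
    elif order_type == "alphabetDesc":
        return sorted(counts, reverse=True)
    elif order_type == "alphabetAsc":
        return sorted(counts)
    raise ValueError("unknown stringOrderType: " + order_type)

data = ["b", "a", "a", "c", "a", "b"]
print(ordered_labels(data, "frequencyDesc"))  # ['a', 'b', 'c']
print(ordered_labels(data, "alphabetDesc"))   # ['c', 'b', 'a']
```

As the review notes, the two alphabet branches only need the distinct values; the counts are computed but unused there.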

* Model fitted by [[StringIndexer]].
*
* @param labels Ordered list of labels, corresponding to indices to be assigned.
* @param labelsArray Array of Ordered list of labels, corresponding to indices to be assigned
Member

Ordered -> ordered.

"Skip StringIndexerModel.")
return dataset.toDF
}
transformSchema(dataset.schema, logging = true)
Member

Can we skip StringIndexerModel too if none of the input columns exist?

Contributor Author

updated.

}
}
filteredDataset.withColumns(outputColNames.filter(_ != null),
outputColumns.filter(_ != null))
Member

If outputColNames and outputColumns are empty, withColumns might return an empty dataset rather than the original dataset.

@SparkQA

SparkQA commented Nov 21, 2017

Test build #84066 has finished for PR 19621 at commit e5db190.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 22, 2017

Test build #84093 has finished for PR 19621 at commit 031f53f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

Jenkins retest this please.

@MLnick
Contributor

MLnick commented Nov 22, 2017

@WeichenXu123 with reference to #19621 (comment): the sort is stable with respect to the input collection, so as long as the result of the "count by value" aggregation is deterministic, the sort order will be deterministic in the case of ties.
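A quick plain-Python illustration of the stability point (hypothetical data; Python's `sorted`, like Scala's `sortBy`, is guaranteed stable):

```python
# Two labels tie at frequency 2; a stable sort keeps their input order.
pairs = [("b", 2), ("a", 2), ("c", 3)]
by_freq = sorted(pairs, key=lambda kv: -kv[1])
print(by_freq)  # [('c', 3), ('b', 2), ('a', 2)] — 'b' stays ahead of 'a'
```

So ties are resolved by whatever order the count-by-value step happened to produce, which is exactly why that step must be deterministic.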

@WeichenXu123
Contributor Author

@MLnick Will the RDD "count by value" aggregation be deterministic? E.g., for two RDDs with the same elements but a different element order and a different number of partitions, will rdd.countByValue().toSeq stay deterministic? The shuffle inside countByValue also seems liable to break determinism.

@MLnick
Contributor

MLnick commented Nov 22, 2017

It won't be deterministic across different RDDs / partitionings / shuffles, etc. For a given input RDD it should be deterministic?

But perhaps we could ensure it by first sorting alphabetically and then by frequency?

@SparkQA

SparkQA commented Nov 22, 2017

Test build #84101 has finished for PR 19621 at commit 031f53f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

@MLnick How about this:
For frequencyAsc/Desc, sort first by frequency and then alphabetically.
For alphabetAsc/Desc, sort alphabetically (and if two labels compare equal alphabetically, they must be the same label?)

@MLnick
Contributor

MLnick commented Nov 22, 2017 via email

@WeichenXu123
Contributor Author

WeichenXu123 commented Nov 23, 2017

@MLnick Ah, I didn't express it precisely. In the first case, what I mean is: sort by frequency, and when frequencies are equal, break ties alphabetically.
That seems equivalent to what you said, "we could ensure it by first sorting alphabetically and then by frequency" (given a stable sort).
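The two formulations can be checked against each other in a small plain-Python sketch (hypothetical labels, not Spark code):

```python
from collections import Counter

values = ["banana", "apple", "cherry", "apple", "banana", "date"]
counts = Counter(values)

# Formulation A: stable-sort alphabetically first, then by descending frequency.
labels_a = sorted(counts)                              # alphabetical pre-sort
labels_a = sorted(labels_a, key=lambda k: -counts[k])  # stable sort by frequency

# Formulation B: a single sort with a composite key (frequency desc, label asc).
labels_b = sorted(counts, key=lambda k: (-counts[k], k))

print(labels_a == labels_b)  # True — the two formulations agree
```

Because the second sort in formulation A is stable, labels with equal frequency keep their alphabetical pre-sort order, which is exactly what the composite key in formulation B produces.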


val countByValueArray = dataset.na.drop(inputCols)
.select(inputCols.map(col(_).cast(StringType)): _*)
.rdd.aggregate(zeroState)(
Member

Is treeAggregate better? I think it should be faster?

@WeichenXu123
Contributor Author

@viirya @MLnick Code updated. Thanks!

@SparkQA

SparkQA commented Nov 23, 2017

Test build #84125 has finished for PR 19621 at commit 66d054a.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

I checked the failed SparkR tests. There's some trouble with the failing glm tests.
These tests compare SparkR glm and R's glm results on the "iris" dataset, but what is the string-indexing order used by R's glm? Looking at "iris", the "Species" column has three values, "setosa", "versicolor", and "virginica", each with frequency 50, and only when RFormula indexes them as "setosa" -> 2, "versicolor" -> 0, "virginica" -> 1 does the result match R's glm. That is a strange indexing order.
How can we set the string-indexing order to match R's glm?

@WeichenXu123
Contributor Author

WeichenXu123 commented Dec 1, 2017

Can anyone provide a suggestion for fixing the SparkR glm test failure here? (Only this one; the other failures are minor and easy to fix.)
@felixcheung

@felixcheung
Member

StringIndexer is set automatically for the index column. Do we have a breaking API change here?
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala#L216

@WeichenXu123
Contributor Author

WeichenXu123 commented Dec 5, 2017

@felixcheung There is no breaking change, but we've run into trouble with nondeterministic behavior: when frequencies are equal, the indexing result is nondeterministic (with the default string order). I already fixed those cases in the RFormula tests, but I don't know how to fix the SparkR glm test, because it depends on the R glm library and I don't know how to set the indexing order for it. Do you know anyone who is familiar with this?

@felixcheung
Member

felixcheung commented Dec 6, 2017 via email

@WeichenXu123
Contributor Author

WeichenXu123 commented Dec 7, 2017

@felixcheung Yes, the spark.mlp test result changed because the indexer order changed. When item frequencies are equal, StringIndexer has no definite rule for the index order. This PR changes the logic in StringIndexer, but it cannot guarantee exactly the same indexer order as before (that is uncontrollable): with equal frequencies there is no definite ordering rule, and if I add an extra rule to make the order stable, the result differs from the current one. That's what's troubling me.
Looking at the spark.mlp test data, "iris": the "Species" column has three values, "setosa", "versicolor", and "virginica", all with the same frequency.

@felixcheung
Member

Maybe we could also change the test itself to make it more deterministic?
We could first create a new test dataset that avoids ties in frequency, run it through the original implementation, then run it through this change, and make sure the results match.

@actuaryzhang do you have any thought on this?

@WeichenXu123
Contributor Author

@felixcheung "iris" is a built-in dataset in R used to test many algorithms, so is it appropriate to change it?


require((isSet(inputCol) && isSet(outputCol) && !isSet(inputCols) && !isSet(outputCols)) ||
(!isSet(inputCol) && !isSet(outputCol) && isSet(inputCols) && isSet(outputCols)),
"Only allow to set either inputCol/outputCol, or inputCols/outputCols"
Contributor

Maybe match the wording of the exception message here?

StringIndexer only supports setting either inputCol/outputCol or inputCols/outputCols

if (isSet(inputCol)) {
(Array($(inputCol)), Array($(outputCol)))
} else {
require($(inputCols).length == $(outputCols).length,
Contributor

Should add a test case for this

Contributor Author

test added.
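The two checks discussed in this thread — mutual exclusivity of single/multi column params, plus matching lengths — could be sketched in plain Python (a hypothetical helper for illustration; the real implementation is the Scala `require` above):

```python
def validate_params(input_col=None, output_col=None,
                    input_cols=None, output_cols=None):
    # Either the single-column pair or the multi-column pair must be set,
    # never a mix of the two.
    single = (input_col is not None and output_col is not None
              and input_cols is None and output_cols is None)
    multi = (input_col is None and output_col is None
             and input_cols is not None and output_cols is not None)
    if not (single or multi):
        raise ValueError("StringIndexer only supports setting either "
                         "inputCol/outputCol or inputCols/outputCols")
    # In the multi-column case, the two lists must line up one-to-one.
    if multi and len(input_cols) != len(output_cols):
        raise ValueError("inputCols and outputCols must have the same length")

validate_params(input_col="label", output_col="labelIndex")       # OK
validate_params(input_cols=["a", "b"], output_cols=["ai", "bi"])  # OK
```

Mixing the two styles, or passing lists of different lengths, raises a ValueError.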

@SparkQA

SparkQA commented Dec 15, 2017

Test build #84955 has finished for PR 19621 at commit bb209c8.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

felixcheung commented Dec 15, 2017 via email

@WeichenXu123
Contributor Author

WeichenXu123 commented Dec 18, 2017

@felixcheung Another failing test case: spark.mlp in SparkR. It also uses RFormula and will also generate nondeterministic results; see class MultilayerPerceptronClassifierWrapper line 78:

val rFormula = new RFormula()
      .setFormula(formula)
      .setForceIndexLabel(true)
      .setHandleInvalid(handleInvalid)

It cannot set the string order, and the default frequencyDesc order will produce nondeterministic results.

Now I've made StringIndexer, when frequencyDesc is specified, sort by (frequency, alphabet), so the sort result is stable. This changes the spark.mlp test case result, and I will update the expected result there. Do you agree with this?
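A plain-Python sketch (hypothetical, not Spark code) of this tie-break — frequencyDesc falling back to alphabetical order — applied to the iris Species values, where every label has frequency 50:

```python
from collections import Counter

# All three iris Species values occur 50 times each.
species = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
counts = Counter(species)

# frequencyDesc with an alphabetical tie-break: with every frequency tied,
# the ordering reduces to plain alphabetical order.
labels = sorted(counts, key=lambda k: (-counts[k], k))
index = {label: i for i, label in enumerate(labels)}
print(index)  # {'setosa': 0, 'versicolor': 1, 'virginica': 2}
```

With all frequencies tied, the order is deterministic regardless of input order or partitioning — though, as discussed above, it need not match the ordering R's glm happens to use.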

@felixcheung
Member

felixcheung commented Dec 19, 2017 via email

@WeichenXu123
Contributor Author

I've been too busy recently to fix the failing R tests. Anyone with spare time is welcome to take over this PR, and I will help review. Thanks!

@WeichenXu123 WeichenXu123 deleted the multi-col-string-indexer branch January 11, 2018 18:43
srowen pushed a commit that referenced this pull request Jan 29, 2019
## What changes were proposed in this pull request?

This takes over #19621 to add multi-column support to StringIndexer:

1. Supports encoding multiple columns.
2. Previously, when specifying `frequencyDesc` or `frequencyAsc` as `stringOrderType` param in `StringIndexer`, in case of equal frequency, the order of strings is undefined. After this change, the strings with equal frequency are further sorted alphabetically.

## How was this patch tested?

Added tests.

Closes #20146 from viirya/SPARK-11215.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019