[SPARK-11569][ML] Fix StringIndexer to handle null value properly #17233

crackcell · 2017-03-10T05:06:37Z

What changes were proposed in this pull request?

This PR enhances StringIndexer to handle NULL values.

Before this PR, StringIndexer throws an exception when it encounters NULL values.
With this PR:

handleInvalid=error: Throw an exception as before
handleInvalid=skip: Skip null values as well as unseen labels
handleInvalid=keep: Give null values an additional index as well as unseen labels
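
To make the three modes concrete, here is a minimal Spark-free sketch of the intended semantics. It is a simplified model, not the code in this PR: `HandleInvalidSketch` and `indexLabels` are hypothetical names, and the real transformer operates on DataFrame columns and throws `SparkException` rather than `IllegalArgumentException`.

```scala
// Simplified model of the three handleInvalid modes (assumption: not Spark code).
object HandleInvalidSketch {
  def indexLabels(
      labels: Array[String],
      values: Seq[String],
      handleInvalid: String): Seq[Double] = {
    val labelToIndex = labels.zipWithIndex.toMap
    // A value is invalid if it is null or was not seen during fitting.
    def isInvalid(v: String): Boolean = v == null || !labelToIndex.contains(v)

    handleInvalid match {
      case "error" =>
        // Fail on the first invalid value.
        values.map { v =>
          if (isInvalid(v)) throw new IllegalArgumentException(s"Invalid value: $v")
          labelToIndex(v).toDouble
        }
      case "skip" =>
        // Drop invalid values entirely.
        values.filterNot(isInvalid).map(v => labelToIndex(v).toDouble)
      case "keep" =>
        // Map every invalid value to one extra bucket at index numLabels.
        values.map { v =>
          if (isInvalid(v)) labels.length.toDouble else labelToIndex(v).toDouble
        }
      case other =>
        throw new IllegalArgumentException(s"Unknown handleInvalid: $other")
    }
  }
}
```

With labels `Array("b", "a")` and input `Seq("a", null, "c", "b")`, "skip" drops both the null and the unseen "c", while "keep" assigns both of them index 2.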

BTW, I noticed someone was trying to solve the same problem ( #9920 ), but it seems to have had no progress or response for a long time. Would you mind giving me a chance to solve it? I'm eager to help. :-)

How was this patch tested?

new unit tests

merge master to my repo

AmplabJenkins · 2017-03-10T05:07:12Z

Can one of the admins verify this patch?

crackcell · 2017-03-10T15:56:52Z

cc @srowen @cloud-fan @MLnick

jkbradley · 2017-03-13T17:02:19Z

I'll take a look

jkbradley

Thanks for the PR! Just a few comments

jkbradley · 2017-03-13T16:32:37Z

mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala

+   * Param for how to handle invalid data (unseen labels or NULL values).
+   * Options are 'skip' (filter out rows with invalid data),
+   * 'error' (throw an error), or 'keep' (put invalid data in a special additional
   * bucket, at index numLabels.


Add ")" at end: "numLabels)."

jkbradley · 2017-03-13T17:20:44Z

mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala

+      .setInputCol("label")
+      .setOutputCol("labelIndex")
+
+    withClue("StringIndexer should throw error when setHandleValid=error when given NULL values") {


setHandleValid -> setHandleInvalid

jkbradley · 2017-03-13T17:22:55Z

mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala

+    assert(attrSkip.values.get === Array("b", "a"))
+    assert(transformedSkip.select("labelIndex").rdd.map { r =>
+      r.getDouble(0)
+    }.collect() === expectedSkip)


Don't assume that the order of Rows of a collected DataFrame/Dataset will be the same each time. Add an ID column so that you can collect things as Sets for comparison to make this robust; check out the unit test above for missing labels for an example.
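
As a rough illustration of this suggestion (not the actual test code), the sketch below compares collected results as a Set of (id, index) pairs so that row order does not matter. `OrderInsensitiveCheck`, the `id` values, and the sample data are hypothetical; in a real test the pairs would come from selecting the ID column together with "labelIndex" and collecting.

```scala
// Sketch: compare collected rows as a Set keyed by an ID column,
// so the assertion does not depend on the order rows come back in.
object OrderInsensitiveCheck {
  // Hypothetical expected output: (id, labelIndex) pairs.
  val expected: Set[(Int, Double)] = Set((0, 1.0), (1, 0.0), (2, 2.0))

  // `collected` stands in for rows collected from the transformed dataset.
  def matchesExpected(collected: Seq[(Int, Double)]): Boolean =
    collected.toSet == expected
}
```

Because each pair carries its ID, the check still verifies that every row received the right index, which comparing bare index values as a Set would not.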

jkbradley · 2017-03-13T17:24:06Z

mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala

+    }.collect() === expectedKeep)
+  }
+
+  test("StringIndexer with a numeric input column with NULLs") {


Do you need to test numerics separately? If there's a reason to, then can you please refactor these 2 tests to eliminate duplicated code?

OK, I'll remove the numeric test.

jkbradley · 2017-03-13T17:29:46Z

mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala

-        labelToIndex(label)
-      } else if (keepInvalid) {
-        labels.length
+    val indexer = udf { row: Row =>


No need to use a Row; just take a String:

    val indexer = udf { label: String =>
      if (label == null) {
        if (keepInvalid) {
          labels.length
        } else {
          throw new SparkException("StringIndexer encountered NULL value. To handle or skip " +
            "NULLS, try setting StringIndexer.handleInvalid.")
        }
      } else {
        if (labelToIndex.contains(label)) {
          labelToIndex(label)
        } else if (keepInvalid) {
          labels.length
        } else {
          throw new SparkException(s"Unseen label: $label. To handle unseen labels, " +
            s"set Param handleInvalid to ${StringIndexer.KEEP_INVALID}.")
        }
      }
    }

crackcell · 2017-03-14T03:40:16Z

@jkbradley Hi, I have made some updates according to your comments. Please review it again. :-)

jkbradley · 2017-03-14T14:44:54Z

LGTM
Merging with master
Thanks for the improvement!

Jacquelin803 · 2020-12-22T11:17:10Z

How did you solve the "getting no progress" problem?
I hit this on the iris data, which has only 100 samples. If I don't use StringIndexer, I can get the PMML model file in 1 minute.
I set StringIndexer().setHandleInvalid("keep"), but it did not help.
Can anyone help me? Thanks a lot!

crackcell and others added 3 commits March 8, 2017 11:50

Merge pull request #1 from apache/master

75e3975

merge master to my repo

Enhance StringIndexer with NULL values

79d7060

filter out NULLs when transform dataset

0cb121c

jkbradley reviewed Mar 13, 2017

Menglong TAN added 2 commits March 14, 2017 11:20

improve code and unit tests

e80a158

remove unused import

2a0a756

asfgit closed this in 85941ec Mar 14, 2017

crackcell deleted the 11569_StringIndexer_NULL branch March 14, 2017 15:03

yanboliang mentioned this pull request May 18, 2017

[SPARK-20506][DOCS] 2.2 migration guide #17996

Closed
