@crackcell crackcell commented Mar 10, 2017

What changes were proposed in this pull request?

This PR enhances StringIndexer to handle NULL values.

Before this PR, StringIndexer threw an exception whenever it encountered a NULL value.
With this PR:

  • handleInvalid=error: Throw an exception as before
  • handleInvalid=skip: Skip rows with null values, just like rows with unseen labels
  • handleInvalid=keep: Assign null values, like unseen labels, an additional index
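
The three modes above can be sketched in plain Scala. This is only an illustrative model of the described behavior, not Spark's implementation: the object `HandleInvalidSketch`, the helper `index`, and the `(id, label)` row shape are hypothetical names chosen for this example, and the fitted labels are assumed to be given as an array whose positions are the indices.

```scala
object HandleInvalidSketch {
  // Hypothetical helper, not Spark's code: maps each (id, label) row to (id, index).
  // `fitted` holds the labels seen at fit time; a label's index is its position.
  def index(
      fitted: Array[String],
      rows: Seq[(Int, String)],
      handleInvalid: String): Seq[(Int, Double)] = {
    val labelToIndex: Map[String, Int] = fitted.zipWithIndex.toMap
    rows.flatMap { case (id, label) =>
      // Option(label) turns a null label into None, so nulls and unseen
      // labels both fall through to the "invalid" branch below.
      Option(label).flatMap(labelToIndex.get) match {
        case Some(i) => Some(id -> i.toDouble)
        case None =>
          handleInvalid match {
            case "keep"  => Some(id -> fitted.length.toDouble) // extra bucket at numLabels
            case "skip"  => None                               // drop the row
            case "error" => throw new IllegalArgumentException(s"Invalid label: $label")
            case other   => throw new IllegalArgumentException(s"Unknown handleInvalid: $other")
          }
      }
    }
  }
}
```

With fitted labels `Array("b", "a")`, a null label and an unseen label such as `"z"` would both map to index 2.0 under "keep", be dropped under "skip", and raise an exception under "error".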

BTW, I noticed someone was trying to solve the same problem ( #9920 ), but it seems to have had no progress or response for a long time. Would you mind giving me a chance to solve it? I'm eager to help. :-)

How was this patch tested?

new unit tests

@AmplabJenkins

Can one of the admins verify this patch?

@crackcell
Author

cc @srowen @cloud-fan @MLnick

@jkbradley
Member

I'll take a look

Member

@jkbradley jkbradley left a comment

Thanks for the PR! Just a few comments

* Param for how to handle invalid data (unseen labels or NULL values).
* Options are 'skip' (filter out rows with invalid data),
* 'error' (throw an error), or 'keep' (put invalid data in a special additional
* bucket, at index numLabels.
Member

Add ")" at end: "numLabels)."

.setInputCol("label")
.setOutputCol("labelIndex")

withClue("StringIndexer should throw error when setHandleValid=error when given NULL values") {
Member

setHandleValid -> setHandleInvalid

assert(attrSkip.values.get === Array("b", "a"))
assert(transformedSkip.select("labelIndex").rdd.map { r =>
r.getDouble(0)
}.collect() === expectedSkip)
Member

Don't assume that the order of Rows of a collected DataFrame/Dataset will be the same each time. Add an ID column so that you can collect things as Sets for comparison to make this robust; check out the unit test above for missing labels for an example.
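
A minimal sketch of that idea in plain Scala, with hard-coded arrays standing in for two collects of the same data. The object name and the sample values are hypothetical. In a Spark test, the pairs would instead come from selecting the ID column together with "labelIndex" and collecting the result.

```scala
object OrderInsensitiveCheck {
  // Stand-ins for (id, labelIndex) pairs collected from a DataFrame.
  // runB holds the same rows as runA, returned in a different order.
  val runA: Array[(Int, Double)] = Array((0, 1.0), (1, 0.0), (2, 0.0))
  val runB: Array[(Int, Double)] = Array((2, 0.0), (0, 1.0), (1, 0.0))

  val expected: Set[(Int, Double)] = Set((0, 1.0), (1, 0.0), (2, 0.0))

  def main(args: Array[String]): Unit = {
    // A positional comparison depends on row order, so it is brittle.
    assert(!runA.sameElements(runB))
    // Comparing as Sets ignores order. The id keeps rows with equal
    // indices (ids 1 and 2 here) from collapsing into one Set element.
    assert(runA.toSet == expected)
    assert(runB.toSet == expected)
  }
}
```

Without the ID column, the two rows with index 0.0 would become a single Set element, and a missing or duplicated row could go unnoticed.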

Author

roger

}.collect() === expectedKeep)
}

test("StringIndexer with a numeric input column with NULLs") {
Member

Do you need to test numerics separately? If there's a reason to, then can you please refactor these 2 tests to eliminate duplicated code?

Author

OK, I'll remove the numeric test.

labelToIndex(label)
} else if (keepInvalid) {
labels.length
val indexer = udf { row: Row =>
Member

No need to use a Row; just take a String:

    val indexer = udf { label: String =>
      if (label == null) {
        if (keepInvalid) {
          labels.length
        } else {
          throw new SparkException("StringIndexer encountered NULL value. To handle or skip " +
            "NULLS, try setting StringIndexer.handleInvalid.")
        }
      } else {
        if (labelToIndex.contains(label)) {
          labelToIndex(label)
        } else if (keepInvalid) {
          labels.length
        } else {
          throw new SparkException(s"Unseen label: $label.  To handle unseen labels, " +
            s"set Param handleInvalid to ${StringIndexer.KEEP_INVALID}.")
        }
      }
    }

Author

got it

@crackcell
Author

@jkbradley Hi, I have made some updates according to your comments. Please review it again. :-)

@jkbradley
Member

LGTM
Merging with master
Thanks for the improvement!

@asfgit asfgit closed this in 85941ec Mar 14, 2017
@crackcell crackcell deleted the 11569_StringIndexer_NULL branch March 14, 2017 15:03
@Jacquelin803

How did you solve this problem, the one about "getting no progress"?
I hit this on the iris data, which has only 100 samples. If I don't use StringIndexer, I can get the PMML model file in 1 minute.
I set StringIndexer().setHandleInvalid("keep"), but it didn't help.
Can anyone help me? Thanks a lot~~~
