Skip to content

Conversation

@WeichenXu123
Copy link
Contributor

What changes were proposed in this pull request?

Add python api for VectorIndexerModel support handle unseen categories via handleInvalid.

How was this patch tested?

doctest added.

@SparkQA
Copy link

SparkQA commented Nov 15, 2017

Test build #83881 has finished for PR 19753 at commit 108ce2b.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid, JavaMLReadable,

@SparkQA
Copy link

SparkQA commented Nov 15, 2017

Test build #83884 has finished for PR 19753 at commit f684cd0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@smurching
Copy link
Contributor

Looking at this now, thanks @WeichenXu123!

Copy link
Contributor

@smurching smurching left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one question, do we want to add a Python setter/getter for handleInvalid? Otherwise this LGTM.

@WeichenXu123
Copy link
Contributor Author

@smurching The getter/setter is included in the super class HasHandleInvalid. I can add test for it.

@SparkQA
Copy link

SparkQA commented Nov 16, 2017

Test build #83917 has finished for PR 19753 at commit 311da8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@smurching smurching left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks for clarifying :)
Just had a few more thoughts, nice work!

@keyword_only
@since("1.4.0")
def setParams(self, maxCategories=20, inputCol=None, outputCol=None):
def setParams(self, maxCategories=20, inputCol=None, outputCol=None, handleInvalid="error"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another Q: I see there's a pattern of setParams using None as a default value for all/most of its arguments in other featurizers, perhaps we should do the same (i.e. have a default argument of handleValid=None here)? IMO specifying the default parameter value in one place is preferable to duplicating it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same goes for the constructor (IMO we should default to handleInvalid=None there too), but open to hearing your thoughts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, but, unfortunately, I think you're wrong. The inputCol=None represent, if user do not specify the inputCol, there is no default value, and exception will be thrown.
Duplicating default params is an issue, but already exists in all the pyspark.ml estimator/models.
e.g., you can check StringIndexer in pyspark, it also has handleInvalid param.

Copy link
Contributor Author

@WeichenXu123 WeichenXu123 Nov 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also check Params._set method in pyspark, you will find, it skips input params which value is None

Copy link
Contributor

@smurching smurching Nov 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, that makes sense!

@smurching
Copy link
Contributor

This LGTM, @jkbradley would you be able to give this a look?

@jkbradley
Copy link
Member

I'll try to take a look but am pretty swamped currently. CC @yanboliang @MLnick @dbtsai @holdenk might you have time?

JavaMLWritable):
"""
Class for indexing categorical feature columns in a dataset of `Vector`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a TODO in the doc of VectorIndexer: Add option for allowing unknown categories.. I think we can remove it?

"How to handle invalid data (unseen labels or NULL values). " +
"Options are 'skip' (filter out rows with invalid data), 'error' (throw an error), " +
"or 'keep' (put invalid data in a special additional bucket, at index numLabels).",
"or 'keep' (put invalid data in a special additional bucket, at index numCategories).",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can numCategories be confused for users with a defined constant? How about more verbose one: at index of the number of categories of the feature?

@viirya
Copy link
Member

viirya commented Nov 21, 2017

LGTM with two minor comments.

@holdenk
Copy link
Contributor

holdenk commented Nov 21, 2017

I can take a look tomorrow, been traveling but just got back.

@WeichenXu123
Copy link
Contributor Author

Thanks @holdenk

@SparkQA
Copy link

SparkQA commented Nov 21, 2017

Test build #84067 has finished for PR 19753 at commit c302d59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@asfgit asfgit closed this in 2d868d9 Nov 21, 2017
@WeichenXu123 WeichenXu123 deleted the vector_indexer_invalid_py branch April 24, 2019 21:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants