[SPARK-20736][Python] PySpark StringIndexer supports StringOrderType #17978
|
Test build #76914 has finished for PR 17978 at commit
|
|
Test build #76916 has finished for PR 17978 at commit
|
|
Test build #76917 has finished for PR 17978 at commit
|
|
@viirya @MLnick @BryanCutler @yinxusen @brkyvz @HyukjinKwon @srowen |
| stringOrderType="frequencyDesc"): | ||
| """ | ||
| __init__(self, inputCol=None, outputCol=None, handleInvalid="error") | ||
| __init__(self, inputCol=None, outputCol=None, handleInvalid="error", \ |
I guess we need at least a doctest.
@HyukjinKwon Thank you. Added tests.
|
(I am not used to ML. I just left a trivial comment for Python.) |
|
Code changes look good, but we need to add a test for this. |
python/pyspark/ml/feature.py
Outdated
| stringOrderType = Param(Params._dummy(), "stringOrderType", | ||
| "How to order labels of string column. The first label after " + | ||
| "ordering is assigned an index of 0. Supported options: " + | ||
| "frequencyDesc, frequencyAsc, alphabetDsec, alphabetAsc.", |
alphabetDsec -> alphabetDesc
|
Test build #76928 has finished for PR 17978 at commit
|
|
Test build #76929 has finished for PR 17978 at commit
|
|
@viirya Thanks much for your review. I corrected the typo and added some tests. |
|
@felixcheung Could you take a look? Thanks. |
| stringOrderType = Param(Params._dummy(), "stringOrderType", | ||
| "How to order labels of string column. The first label after " + | ||
| "ordering is assigned an index of 0. Supported options: " + | ||
| "frequencyDesc, frequencyAsc, alphabetDesc, alphabetAsc.", |
I think this should be generated instead of hardcoded. You can find an example in the Python code.
@felixcheung stringOrderType is not a shared trait on the Scala side, and I thought only the shared traits should be automatically generated.
I have looked at the code for other ML transformers, and many of them are hardcoded, for example, Imputer and OneHotEncoder.
Please let me know if I'm wrong, and a reference to example would be greatly appreciated. Thanks much!
Hmm, OK, I see a few examples that are not generated.
I know we're mixed on doing this, but I like including the default value in the docstring. It makes the documentation closer to the Scala doc and easier to read without having to refer to the ScalaDoc.
python/pyspark/ml/feature.py
Outdated
| So the most frequent label gets index 0. | ||
| The indices are in [0, numLabels). By default, this is ordered by label frequencies | ||
| so the most frequent label gets index 0. The ordering behavior is controlled by | ||
| setting stringOrderType. |
I think you need to backtick and add a tag for the attribute
Added tag. Thanks
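For readers following the docstring discussion above, here is a minimal pure-Python sketch of what the four `stringOrderType` options mean for index assignment. It is not the Spark implementation: the helper names `order_labels` and `index_values` are hypothetical, indices are plain ints rather than the doubles Spark emits, and breaking frequency ties alphabetically is an assumption made only to keep the sketch deterministic.

```python
from collections import Counter


def order_labels(values, string_order_type="frequencyDesc"):
    """Return the distinct labels in the order implied by string_order_type.

    The position of a label in the returned list is its index, so the
    first label after ordering gets index 0.
    """
    counts = Counter(values)
    if string_order_type == "frequencyDesc":
        # Most frequent first; alphabetical tie-break is an assumption of this sketch.
        return sorted(counts, key=lambda label: (-counts[label], label))
    if string_order_type == "frequencyAsc":
        # Least frequent first; same assumed tie-break.
        return sorted(counts, key=lambda label: (counts[label], label))
    if string_order_type == "alphabetDesc":
        return sorted(counts, reverse=True)
    if string_order_type == "alphabetAsc":
        return sorted(counts)
    raise ValueError("Unsupported stringOrderType: %r" % string_order_type)


def index_values(values, string_order_type="frequencyDesc"):
    """Map each value to the index of its label under the chosen ordering."""
    labels = order_labels(values, string_order_type)
    mapping = {label: index for index, label in enumerate(labels)}
    return [mapping[value] for value in values]


labels_column = ["a", "b", "c", "a", "a", "c"]
print(order_labels(labels_column, "frequencyDesc"))  # ['a', 'c', 'b']
print(index_values(labels_column, "alphabetDesc"))   # [2, 1, 0, 2, 2, 0]
```

In this column "a" appears three times, "c" twice, and "b" once, so the default ordering gives "a" index 0, while `alphabetDesc` gives index 0 to "c" regardless of frequency.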
python/pyspark/ml/feature.py
Outdated
| @since("2.3.0") | ||
| def getStringOrderType(self): | ||
| """ | ||
| Gets the value of :py:attr:`stringOrderType` or its default value. |
should it say what the default value is?
OK, added default value.
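As a rough illustration of the "value or its default" behavior discussed for `getStringOrderType`, the sketch below uses a hypothetical stand-in class, `MiniParams`, instead of the real `pyspark.ml.param.Params`. The class names, the `get_or_default` and `_set_default` helpers, and the constructor argument are all assumptions made for this example. Only the getter name and the `frequencyDesc` default come from the discussion above.

```python
class MiniParams(object):
    """Hypothetical, simplified stand-in for a params container (not PySpark's)."""

    def __init__(self):
        self._defaults = {}  # param name -> default value
        self._values = {}    # param name -> explicitly set value

    def _set_default(self, **kwargs):
        self._defaults.update(kwargs)

    def set(self, name, value):
        self._values[name] = value
        return self

    def get_or_default(self, name):
        # An explicitly set value wins; otherwise fall back to the default.
        if name in self._values:
            return self._values[name]
        return self._defaults[name]


class MiniStringIndexer(MiniParams):
    """Hypothetical indexer that only models the stringOrderType param."""

    def __init__(self, string_order_type=None):
        super(MiniStringIndexer, self).__init__()
        self._set_default(stringOrderType="frequencyDesc")
        if string_order_type is not None:
            self.set("stringOrderType", string_order_type)

    def getStringOrderType(self):
        """Gets the value of stringOrderType or its default value 'frequencyDesc'."""
        return self.get_or_default("stringOrderType")


print(MiniStringIndexer().getStringOrderType())               # frequencyDesc
print(MiniStringIndexer("alphabetAsc").getStringOrderType())  # alphabetAsc
```

Stating the default in the getter's docstring, as suggested in the review, lets a reader see the fallback without opening the param definition.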
|
@felixcheung Thanks so much for the review. I addressed most of the comments except auto generating code for defining Please let me know if I'm wrong, and a reference to example would be greatly appreciated. |
|
Test build #76967 has finished for PR 17978 at commit
|
|
@viirya @felixcheung Any additional changes needed for this one? |
|
LGTM, ping @holdenk @jkbradley if they are interested |
|
LGTM |
|
@felixcheung Would you help merge this? Thanks. |
|
I'd hold this for another 3-4 days just in case.. |
|
LGTM @felixcheung |
|
One minor optional comment, but not a blocker so LGTM (although if you decide to update the docstring LGTM pending tests). |
|
@holdenk Thanks for the comment. Added default value in docstring. |
|
Test build #77131 has finished for PR 17978 at commit
|
|
Test build #77132 has finished for PR 17978 at commit
|
felixcheung
left a comment
LGTM
|
merged to master. thanks everyone! |
## What changes were proposed in this pull request? PySpark StringIndexer supports StringOrderType added in apache#17879. Author: Wayne Zhang <[email protected]> Closes apache#17978 from actuaryzha
What changes were proposed in this pull request?
PySpark StringIndexer supports StringOrderType added in #17879.
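A minimal usage sketch of the new parameter from Python follows. It assumes a PySpark version that includes this change and an existing `SparkSession`. The helper name `index_with_order`, the column names, and the sample rows in the usage note are illustrative and not taken from the PR.

```python
def index_with_order(spark, rows, string_order_type="frequencyDesc"):
    """Fit a StringIndexer on (id, label) rows and return the indexed DataFrame.

    `spark` is assumed to be an active SparkSession. The import is kept
    inside the function so that defining the helper does not require PySpark.
    """
    from pyspark.ml.feature import StringIndexer

    df = spark.createDataFrame(rows, ["id", "label"])
    indexer = StringIndexer(
        inputCol="label",
        outputCol="indexed",
        stringOrderType=string_order_type,
    )
    # fit() learns the label ordering; transform() adds the index column.
    return indexer.fit(df).transform(df)
```

Calling `index_with_order(spark, [(0, "a"), (1, "b"), (2, "c")], "alphabetDesc")` should assign index 0 to "c", since that label comes first in descending alphabetical order.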