Skip to content

Conversation

@actuaryzhang
Copy link
Contributor

What changes were proposed in this pull request?

PySpark StringIndexer supports StringOrderType added in #17879.

@actuaryzhang
Copy link
Contributor Author

@felixcheung

@actuaryzhang actuaryzhang changed the title [SPARK-20736] PySpark StringIndexer supports StringOrderType [SPARK-20736][Python] PySpark StringIndexer supports StringOrderType May 14, 2017
@SparkQA
Copy link

SparkQA commented May 14, 2017

Test build #76914 has finished for PR 17978 at commit e5c8dcf.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 14, 2017

Test build #76916 has finished for PR 17978 at commit bd80b37.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 15, 2017

Test build #76917 has finished for PR 17978 at commit 1f336ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Copy link
Contributor Author

@viirya @MLnick @BryanCutler @yinxusen @brkyvz @HyukjinKwon @srowen
Ping for reviews or comments. Thanks much.

stringOrderType="frequencyDesc"):
"""
__init__(self, inputCol=None, outputCol=None, handleInvalid="error")
__init__(self, inputCol=None, outputCol=None, handleInvalid="error", \
Copy link
Member

@HyukjinKwon HyukjinKwon May 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we need at least a doctest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon Thank you. Added tests.

@HyukjinKwon
Copy link
Member

(I am not used to ML. I just left a trivial comment for Python.)

@viirya
Copy link
Member

viirya commented May 15, 2017

Code changes looks good. But we need to add test for this.

stringOrderType = Param(Params._dummy(), "stringOrderType",
"How to order labels of string column. The first label after " +
"ordering is assigned an index of 0. Supported options: " +
"frequencyDesc, frequencyAsc, alphabetDsec, alphabetAsc.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alphabetDsec -> alphabetDesc

@SparkQA
Copy link

SparkQA commented May 15, 2017

Test build #76928 has finished for PR 17978 at commit 44f0a36.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 15, 2017

Test build #76929 has finished for PR 17978 at commit f66a445.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Copy link
Contributor Author

@viirya Thanks much for your review. I corrected the typo and added some tests.

@actuaryzhang
Copy link
Contributor Author

@felixcheung Could you take a look? Thanks.

stringOrderType = Param(Params._dummy(), "stringOrderType",
"How to order labels of string column. The first label after " +
"ordering is assigned an index of 0. Supported options: " +
"frequencyDesc, frequencyAsc, alphabetDesc, alphabetAsc.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be generated instead of hardcoded - you can find a example on python..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felixcheung stringOrderType is not a shared trait on the Scala side, and I thought only the shared traits should be automatically generated.
I have looked at the code for other ML transformers, and many of them hard coded, for example, Imputer and OneHotEncoder.
Please let me know if I'm wrong, and a reference to example would be greatly appreciated. Thanks much!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, ok, I see a few examples that they are not generated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know were mixed on doing this, but I like including the default value in the docstring, makes the documentation closer to the Scala doc and makes it easier to read without having to refer to the ScalaDoc.

So the most frequent label gets index 0.
The indices are in [0, numLabels). By default, this is ordered by label frequencies
so the most frequent label gets index 0. The ordering behavior is controlled by
setting stringOrderType.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to backtick and add a tag for the attribute

Copy link
Contributor Author

@actuaryzhang actuaryzhang May 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added tag. Thanks

@since("2.3.0")
def getStringOrderType(self):
"""
Gets the value of :py:attr:`stringOrderType` or its default value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it say what the default value is?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, added default value.

@actuaryzhang
Copy link
Contributor Author

@felixcheung Thanks so much for the review. I addressed most of the comments except auto generating code for defining stringOrderType. This parameter is not a shared trait on the Scala side, and I thought only the shared traits should be automatically generated. I have looked at the code for other ML transformers, and many of them hard coded, for example, Imputer and OneHotEncoder.

Please let me know if I'm wrong, and a reference to example would be greatly appreciated.
Thanks much!

@SparkQA
Copy link

SparkQA commented May 16, 2017

Test build #76967 has finished for PR 17978 at commit 6acabc2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Copy link
Contributor Author

@viirya @felixcheung Any additional changes needed for this one?

@felixcheung
Copy link
Member

LGTM, ping @holdenk @jkbradley if they are interested

@viirya
Copy link
Member

viirya commented May 16, 2017

LGTM

@actuaryzhang
Copy link
Contributor Author

@felixcheung Would you help merge this? Thanks.

@felixcheung
Copy link
Member

I'd hold this for another 3-4 days just in case..

@MLnick
Copy link
Contributor

MLnick commented May 20, 2017

LGTM @felixcheung

@holdenk
Copy link
Contributor

holdenk commented May 20, 2017

One minor optional comment, but not a blocker so LGTM (although if you decide to update the docstring LGTM pending tests).

@actuaryzhang
Copy link
Contributor Author

@holdenk Thanks for the comment. Added default value in docstring.
@felixcheung Please let me know if there is anything else needed for this PR.
Thanks everyone for the review and comments!

@SparkQA
Copy link

SparkQA commented May 20, 2017

Test build #77131 has finished for PR 17978 at commit 2fe9432.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 20, 2017

Test build #77132 has finished for PR 17978 at commit 5bfa4dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@felixcheung
Copy link
Member

merged to master. thanks everyone!

@asfgit asfgit closed this in 0f2f56c May 21, 2017
@actuaryzhang actuaryzhang deleted the PythonStringIndexer branch May 21, 2017 23:58
liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017
## What changes were proposed in this pull request?
PySpark StringIndexer supports StringOrderType added in apache#17879.

Author: Wayne Zhang <[email protected]>

Closes apache#17978 from actuaryzha