[SPARK-20736][Python] PySpark StringIndexer supports StringOrderType #17978

actuaryzhang · 2017-05-14T21:02:50Z

What changes were proposed in this pull request?

PySpark StringIndexer supports StringOrderType added in #17879.
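
For readers unfamiliar with the four ordering options, here is a minimal pure-Python sketch of what each one means for index assignment. It is not Spark's implementation: `order_labels` and `index_labels` are hypothetical helper names, and the alphabetical tie-breaking for equal frequencies is an assumption of this sketch only.

```python
from collections import Counter

# Option names as listed in the param doc of this PR.
SUPPORTED_ORDER_TYPES = ("frequencyDesc", "frequencyAsc", "alphabetDesc", "alphabetAsc")


def order_labels(values, string_order_type="frequencyDesc"):
    """Return the distinct labels in indexing order (the first label gets index 0)."""
    if string_order_type not in SUPPORTED_ORDER_TYPES:
        raise ValueError("Unsupported string_order_type: %r" % (string_order_type,))
    counts = Counter(values)
    if string_order_type == "frequencyDesc":
        # Most frequent first; equal counts fall back to alphabetical order
        # (an assumption of this sketch, not a statement about Spark).
        return sorted(counts, key=lambda label: (-counts[label], label))
    if string_order_type == "frequencyAsc":
        # Least frequent first, same tie-breaking assumption.
        return sorted(counts, key=lambda label: (counts[label], label))
    if string_order_type == "alphabetDesc":
        return sorted(counts, reverse=True)
    # Remaining case: alphabetAsc.
    return sorted(counts)


def index_labels(values, string_order_type="frequencyDesc"):
    """Map each value to its label index, as a float like StringIndexer output."""
    labels = order_labels(values, string_order_type)
    mapping = {label: float(i) for i, label in enumerate(labels)}
    return [mapping[v] for v in values]
```

With the input `["a", "b", "c", "a", "a", "c"]`, the default ordering gives `"a"` index 0 because it is the most frequent label, while `alphabetDesc` gives `"c"` index 0.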

actuaryzhang · 2017-05-14T21:03:01Z

@felixcheung

SparkQA · 2017-05-14T21:08:50Z

Test build #76914 has finished for PR 17978 at commit e5c8dcf.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-14T23:34:39Z

Test build #76916 has finished for PR 17978 at commit bd80b37.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-15T00:14:15Z

Test build #76917 has finished for PR 17978 at commit 1f336ab.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

actuaryzhang · 2017-05-15T00:21:47Z

@viirya @MLnick @BryanCutler @yinxusen @brkyvz @HyukjinKwon @srowen
Ping for reviews or comments. Thanks much.

HyukjinKwon · 2017-05-15T00:49:48Z

python/pyspark/ml/feature.py

+                 stringOrderType="frequencyDesc"):
        """
-        __init__(self, inputCol=None, outputCol=None, handleInvalid="error")
+        __init__(self, inputCol=None, outputCol=None, handleInvalid="error", \


I guess we need at least a doctest.

@HyukjinKwon Thank you. Added tests.
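
To illustrate the kind of doctest being asked for, here is a small standalone sketch using only the standard `doctest` module. It does not reproduce the doctest added to `feature.py`; `most_frequent_first` and `run_doctests` are hypothetical names, and the function only mimics the default frequency-descending ordering in plain Python.

```python
import doctest


def most_frequent_first(values):
    """Return the distinct labels ordered by descending frequency.

    >>> most_frequent_first(["a", "b", "c", "a", "a", "c"])
    ['a', 'c', 'b']
    """
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    # Ties are broken alphabetically (an assumption of this sketch).
    return sorted(counts, key=lambda label: (-counts[label], label))


def run_doctests():
    """Run the doctest embedded above and return (failed, attempted)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    tests = finder.find(
        most_frequent_first,
        name="most_frequent_first",
        globs={"most_frequent_first": most_frequent_first},
    )
    for test in tests:
        runner.run(test)
    results = runner.summarize(verbose=False)
    return results.failed, results.attempted
```

The example in the docstring doubles as documentation and as a regression check, which is why a doctest is a reasonable minimum for a new Python parameter.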

HyukjinKwon · 2017-05-15T00:51:07Z

(I am not used to ML. I just left a trivial comment for Python.)

viirya · 2017-05-15T02:22:15Z

Code changes look good, but we need to add a test for this.

viirya · 2017-05-15T02:25:40Z

python/pyspark/ml/feature.py

+    stringOrderType = Param(Params._dummy(), "stringOrderType",
+                            "How to order labels of string column. The first label after " +
+                            "ordering is assigned an index of 0. Supported options: " +
+                            "frequencyDesc, frequencyAsc, alphabetDsec, alphabetAsc.",


alphabetDsec -> alphabetDesc

SparkQA · 2017-05-15T04:02:38Z

Test build #76928 has finished for PR 17978 at commit 44f0a36.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-15T04:39:23Z

Test build #76929 has finished for PR 17978 at commit f66a445.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

actuaryzhang · 2017-05-15T05:11:02Z

@viirya Thanks much for your review. I corrected the typo and added some tests.

actuaryzhang · 2017-05-15T16:34:36Z

@felixcheung Could you take a look? Thanks.

felixcheung · 2017-05-16T04:15:29Z

python/pyspark/ml/feature.py

+    stringOrderType = Param(Params._dummy(), "stringOrderType",
+                            "How to order labels of string column. The first label after " +
+                            "ordering is assigned an index of 0. Supported options: " +
+                            "frequencyDesc, frequencyAsc, alphabetDesc, alphabetAsc.",


I think this should be generated instead of hardcoded. You can find an example on the Python side.

@felixcheung stringOrderType is not a shared trait on the Scala side, and I thought only the shared traits should be automatically generated.
I have looked at the code for other ML transformers, and many of them are hard-coded, for example, Imputer and OneHotEncoder.
Please let me know if I'm wrong, and a reference to example would be greatly appreciated. Thanks much!

hmm, ok, I see a few examples that they are not generated

I know we're mixed on doing this, but I like including the default value in the docstring. It makes the documentation closer to the ScalaDoc and makes it easier to read without having to refer to the ScalaDoc.
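
To make the "generated versus hard-coded" distinction concrete, here is a rough sketch of template-based generation of a param mixin. It is not PySpark's generator and does not use PySpark's `Param` class: `TEMPLATE`, `generate_mixin_source`, and the shape of the generated class are all hypothetical and only loosely modelled on shared mixins such as `HasInputCol`.

```python
# Hypothetical template for a generated mixin; real generated code differs.
TEMPLATE = '''class Has{Name}(object):
    """Mixin for param {name}: {doc}"""

    def __init__(self):
        self._{name} = {default!r}

    def get{Name}(self):
        """Gets the value of {name} or its default value."""
        return self._{name}
'''


def generate_mixin_source(name, doc, default):
    """Render the source text of a mixin class for one parameter."""
    capitalized = name[0].upper() + name[1:]
    return TEMPLATE.format(name=name, Name=capitalized, doc=doc, default=default)


# Generate the source once, then execute it to obtain the class object.
source = generate_mixin_source(
    "stringOrderType",
    "How to order labels of string column.",
    "frequencyDesc",
)
namespace = {}
exec(source, namespace)
HasStringOrderType = namespace["HasStringOrderType"]
```

Generation pays off when the same param is shared by many estimators. For a param used by a single class, as argued in this thread for `stringOrderType`, writing the definition by hand is simpler.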

felixcheung · 2017-05-16T04:16:15Z

python/pyspark/ml/feature.py

-    So the most frequent label gets index 0.
+    The indices are in [0, numLabels). By default, this is ordered by label frequencies
+    so the most frequent label gets index 0. The ordering behavior is controlled by
+    setting stringOrderType.


I think you need to backtick and add a tag for the attribute

Added tag. Thanks

felixcheung · 2017-05-16T04:16:46Z

python/pyspark/ml/feature.py

+    @since("2.3.0")
+    def getStringOrderType(self):
+        """
+        Gets the value of :py:attr:`stringOrderType` or its default value.


should it say what the default value is?

OK, added default value.
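
As an illustration of a getter that falls back to a documented default, here is a toy sketch in plain Python. `ToyParams` and `ToyStringIndexer` are hypothetical classes that only imitate the set-value-or-default lookup; they are not the PySpark classes, and the default `"frequencyDesc"` is taken from the `__init__` signature shown in this PR.

```python
class ToyParams(object):
    """Toy stand-in for a params container with explicit and default values."""

    def __init__(self):
        self._paramMap = {}         # values set explicitly by the user
        self._defaultParamMap = {}  # defaults declared by the class

    def _setDefault(self, **kwargs):
        self._defaultParamMap.update(kwargs)
        return self

    def _set(self, **kwargs):
        self._paramMap.update(kwargs)
        return self

    def getOrDefault(self, name):
        """Return the user-supplied value if present, else the default."""
        if name in self._paramMap:
            return self._paramMap[name]
        return self._defaultParamMap[name]


class ToyStringIndexer(ToyParams):
    def __init__(self, stringOrderType=None):
        super(ToyStringIndexer, self).__init__()
        self._setDefault(stringOrderType="frequencyDesc")
        if stringOrderType is not None:
            self._set(stringOrderType=stringOrderType)

    def getStringOrderType(self):
        """Gets the value of stringOrderType or its default value 'frequencyDesc'."""
        return self.getOrDefault("stringOrderType")
```

Stating the default in the getter's docstring, as requested in the review, lets a reader see the fallback without opening the ScalaDoc.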

actuaryzhang · 2017-05-16T05:00:49Z

@felixcheung Thanks so much for the review. I addressed most of the comments except auto-generating the code that defines stringOrderType. This parameter is not a shared trait on the Scala side, and I thought only the shared traits should be automatically generated. I have looked at the code for other ML transformers, and many of them are hard-coded, for example, Imputer and OneHotEncoder.

Please let me know if I'm wrong, and a reference to example would be greatly appreciated.
Thanks much!

SparkQA · 2017-05-16T15:39:28Z

Test build #76967 has finished for PR 17978 at commit 6acabc2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

actuaryzhang · 2017-05-16T16:33:45Z

@viirya @felixcheung Any additional changes needed for this one?

felixcheung · 2017-05-16T20:59:34Z

LGTM, ping @holdenk @jkbradley if they are interested

viirya · 2017-05-16T23:21:56Z

LGTM

actuaryzhang · 2017-05-19T20:50:27Z

@felixcheung Would you help merge this? Thanks.

felixcheung · 2017-05-20T05:03:42Z

I'd hold this for another 3-4 days just in case..

MLnick · 2017-05-20T11:08:54Z

LGTM @felixcheung

holdenk · 2017-05-20T16:30:16Z

One minor optional comment, but not a blocker so LGTM (although if you decide to update the docstring LGTM pending tests).

actuaryzhang · 2017-05-20T20:44:03Z

@holdenk Thanks for the comment. Added default value in docstring.
@felixcheung Please let me know if there is anything else needed for this PR.
Thanks everyone for the review and comments!

SparkQA · 2017-05-20T21:01:46Z

Test build #77131 has finished for PR 17978 at commit 2fe9432.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-20T21:29:42Z

Test build #77132 has finished for PR 17978 at commit 5bfa4dc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung

LGTM

felixcheung · 2017-05-21T23:52:11Z

merged to master. thanks everyone!

## What changes were proposed in this pull request? PySpark StringIndexer supports StringOrderType added in apache#17879. Author: Wayne Zhang <[email protected]> Closes apache#17978 from actuaryzha

Wayne Zhang added 3 commits May 13, 2017 17:41

Python API to StringOrderType in StringIndexer

ddf34a5

fix typo

c1966bb

fix typo

e5c8dcf

actuaryzhang changed the title ~~[SPARK-20736] PySpark StringIndexer supports StringOrderType~~ [SPARK-20736][Python] PySpark StringIndexer supports StringOrderType May 14, 2017

fix style

bd80b37

fix style

1f336ab

HyukjinKwon reviewed May 15, 2017

viirya reviewed May 15, 2017

add tests

44f0a36

fix test error

f66a445

felixcheung reviewed May 16, 2017

address comments

36006bf

minor style fix

6acabc2

add default value for stringOrderType in docstring

2fe9432

fix example error

5bfa4dc

felixcheung approved these changes May 21, 2017

asfgit closed this in 0f2f56c May 21, 2017

actuaryzhang deleted the PythonStringIndexer branch May 21, 2017 23:58
