[SPARK-14104][PYSPARK][ML] All Python param setters should use the `_set` method #11939

sethah · 2016-03-24T17:45:40Z

What changes were proposed in this pull request?

Param setters in Python previously updated values by writing to `_paramMap` directly. The `_set` method now implements type checking, so all parameters should be updated through it. This PR eliminates every direct access to `_paramMap` except the one inside `_set`, so that type checking always happens.
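The difference between the two setter styles can be illustrated with a minimal, self-contained sketch. Everything here (`ToyEstimator`, `to_int`, `setMaxIterDirect`) is a hypothetical stand-in written for this example, not the actual PySpark code; it only assumes that `_set` runs a per-param converter before storing the value.

```python
class Param:
    """Toy stand-in for a param descriptor (not the real pyspark class)."""

    def __init__(self, name, doc, typeConverter=None):
        self.name = name
        self.doc = doc
        # Fall back to an identity converter when none is given.
        self.typeConverter = typeConverter or (lambda value: value)


def to_int(value):
    """Hypothetical converter: accept ints and whole-number floats only."""
    if isinstance(value, bool):
        raise TypeError("Could not convert %r to int" % (value,))
    if isinstance(value, int):
        return value
    if isinstance(value, float) and value.is_integer():
        return int(value)
    raise TypeError("Could not convert %r to int" % (value,))


class ToyEstimator:
    maxIter = Param("maxIter", "max number of iterations (>= 0)",
                    typeConverter=to_int)

    def __init__(self):
        self._paramMap = {}

    def _set(self, **kwargs):
        # The single place that writes to _paramMap, so every value is converted.
        for name, value in kwargs.items():
            param = getattr(type(self), name)
            self._paramMap[param] = param.typeConverter(value)
        return self

    def setMaxIter(self, value):
        # Pattern adopted by the PR: delegate to _set.
        return self._set(maxIter=value)

    def setMaxIterDirect(self, value):
        # Pattern the PR removes: a direct write that skips conversion.
        self._paramMap[ToyEstimator.maxIter] = value
        return self
```

In this sketch, `setMaxIter("ten")` raises a `TypeError`, while `setMaxIterDirect("ten")` silently stores the bad value, which is the failure mode that routing all setters through `_set` is meant to rule out.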

Additional changes:

- SPARK-13068 missed adding type converters in evaluation.py, so those are added here.
- A previous PR used an incorrect toBoolean type converter for the StringIndexer handleInvalid param. This is fixed here.

How was this patch tested?

Existing unit tests verify that parameters are still set properly. No new functionality is actually added in this PR.

SparkQA · 2016-03-24T18:04:40Z

Test build #54062 has finished for PR 11939 at commit d8d97e8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-03-24T18:41:01Z

python/pyspark/ml/param/__init__.py

In a previous PR a parameter was given an incorrect type converter, and the tests did not catch it. Making _setDefault apply the param's type converter ensures that no param with a default value can be given an incompatible type converter.
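A rough sketch of why converting defaults catches this class of bug follows. The class `ToyIndexer`, the converters, and the default value `"error"` are all assumptions made for illustration; the real `_setDefault` in PySpark may differ in detail.

```python
def to_boolean(value):
    """Hypothetical converter that only accepts real booleans."""
    if isinstance(value, bool):
        return value
    raise TypeError("Boolean param requires a bool, found %r" % (value,))


def to_string(value):
    """Hypothetical converter that only accepts strings."""
    if isinstance(value, str):
        return value
    raise TypeError("String param requires a str, found %r" % (value,))


class Param:
    """Toy param descriptor holding a name and its converter."""

    def __init__(self, name, typeConverter):
        self.name = name
        self.typeConverter = typeConverter


class ToyIndexer:
    def __init__(self, handle_invalid_param):
        self._defaultParamMap = {}
        self.handleInvalid = handle_invalid_param
        # Assumed default value, chosen only for illustration.
        self._setDefault(handleInvalid="error")

    def _setDefault(self, **kwargs):
        for name, value in kwargs.items():
            param = getattr(self, name)
            # Converting the default surfaces a mismatched converter immediately.
            self._defaultParamMap[param] = param.typeConverter(value)
        return self
```

With a string converter, `ToyIndexer(Param("handleInvalid", to_string))` constructs normally. With the mismatched boolean converter, construction itself raises a `TypeError`, so any test that merely instantiates the class would expose the mistake.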

SparkQA · 2016-03-24T18:51:37Z

Test build #54068 has finished for PR 11939 at commit 793ba7c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-03-24T19:05:04Z

python/pyspark/ml/feature.py

With the change to _setDefault, I had to change this default to be a list instead of a JavaObject. The other option would be to have type converters do nothing when they encounter JavaObjects. It is nice to leave the stop words as a JavaObject if they are never accessed explicitly on the Python side. I would appreciate thoughts on this problem.

I don't think we ever explicitly access them on the Python side, although a user's application might attempt to do that and append stop words to the existing list, in which case having it as a list is maybe good. One could get a similar effect by changing getStopWords, without having to round-trip the list in cases where it is never accessed on the Python side.

It is simple to make this change, so I think it's a good idea. This will help in the future for similar cases or if the list of stopwords grows even larger. I changed getStopWords to return a list always which is better for users, I think. Thanks for the suggestion!
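The approach discussed in this thread could look roughly like the sketch below. `FakeJavaList` is a hypothetical stand-in for a JVM-side handle (it is not py4j's `JavaObject`), and `ToyStopWordsRemover` is a simplified model, not the actual `StopWordsRemover` implementation.

```python
class FakeJavaList:
    """Hypothetical stand-in for a JVM-side list handle (not py4j's JavaObject)."""

    def __init__(self, items):
        self._items = tuple(items)
        self.fetches = 0  # counts simulated JVM-to-Python transfers

    def to_python(self):
        self.fetches += 1
        return list(self._items)


def to_list_of_strings(value):
    """Hypothetical converter for Python-side lists of strings."""
    if isinstance(value, (list, tuple)) and all(isinstance(v, str) for v in value):
        return list(value)
    raise TypeError("Could not convert %r to a list of strings" % (value,))


class ToyStopWordsRemover:
    def __init__(self, default_stop_words):
        self._stop_words = None
        self._setDefault(default_stop_words)

    def _setDefault(self, value):
        # Leave Java-side handles untouched; convert everything else.
        if not isinstance(value, FakeJavaList):
            value = to_list_of_strings(value)
        self._stop_words = value

    def getStopWords(self):
        # Always hand the caller a Python list, converting lazily if needed.
        value = self._stop_words
        if isinstance(value, FakeJavaList):
            return value.to_python()
        return list(value)
```

In this model the default is never copied to Python unless `getStopWords` is called, yet callers always receive an ordinary list they can extend.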

sethah · 2016-04-01T15:50:22Z

cc @holdenk @jkbradley Could you take a look whenever you get a chance?

SparkQA · 2016-04-01T18:51:52Z

Test build #54711 has finished for PR 11939 at commit 3b0f89b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2016-04-11T20:08:14Z

It would be good to update to master so we can run the latest tests, but at its current point it seems to have caught all of the direct paramMap sets (although there may be more in master now). It might make sense to also add a clearParam function, and then add a note that no one (including developers) should access the param map directly but should instead use one of the access functions?

sethah · 2016-04-14T00:30:26Z

@holdenk I added a _clearParam function. I am open to adding a note, but I'm not sure where to put it that would make it most effective. It seems a bit awkward for the note to go into the API docs, since it's more for developers. You were right, more direct param sets were added in Generalized Linear Regression, so I've removed them.
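As a rough illustration of the idea, a clear helper keeps removal behind an access function in the same way `_set` does for writes. The class below is a hypothetical model; the name `_clearParam` comes from the comment above, but its signature and behavior here are assumptions.

```python
class ToyParams:
    """Simplified model where only access functions touch _paramMap."""

    def __init__(self):
        self._paramMap = {}

    def _set(self, name, value):
        # Single write path for explicitly set values.
        self._paramMap[name] = value
        return self

    def _clearParam(self, name):
        # Remove an explicitly set value; do nothing if it was never set.
        self._paramMap.pop(name, None)
        return self

    def isSet(self, name):
        return name in self._paramMap
```

Callers then write `params._clearParam("weightCol")` rather than `del params._paramMap["weightCol"]`, so the map's internals can change without touching every call site.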

SparkQA · 2016-04-14T00:42:17Z

Test build #55767 has finished for PR 11939 at commit 8079c11.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-04-14T20:57:08Z

I'll take a look now

jkbradley · 2016-04-15T03:12:13Z

python/pyspark/ml/evaluation.py

    metricName = Param(Params._dummy(), "metricName",
-                       "metric name in evaluation (areaUnderROC|areaUnderPR)")
+                       "metric name in evaluation (areaUnderROC|areaUnderPR)",
+                       TypeConverters.toString)


Specify typeConverter as a keyword arg (here and elsewhere)
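For illustration, the two call styles look like this on a toy `Param` class written for this example. It mirrors the `(parent, name, doc, typeConverter=None)` argument order seen in the diff; `to_string` is a hypothetical converter.

```python
class Param:
    """Toy Param taking (parent, name, doc, typeConverter=None)."""

    def __init__(self, parent, name, doc, typeConverter=None):
        self.parent = parent
        self.name = name
        self.doc = doc
        self.typeConverter = typeConverter


def to_string(value):
    """Hypothetical converter that only accepts strings."""
    if isinstance(value, str):
        return value
    raise TypeError("Could not convert %r to string" % (value,))


# Positional form, as in the diff under review.
positional = Param(None, "metricName",
                   "metric name in evaluation (areaUnderROC|areaUnderPR)",
                   to_string)

# Keyword form, as requested in the review comment.
keyword = Param(None, "metricName",
                "metric name in evaluation (areaUnderROC|areaUnderPR)",
                typeConverter=to_string)
```

Both calls build equivalent objects; the keyword form states what the fourth argument is and keeps working if optional arguments are later added or reordered.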

SparkQA · 2016-04-15T15:01:28Z

Test build #55930 has finished for PR 11939 at commit 37b9ac5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-04-15T18:32:26Z

Thanks for the updates. I just sent #12422. Could you please take a look at it?

SparkQA · 2016-04-15T19:13:00Z

Test build #2792 has finished for PR 11939 at commit 37b9ac5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-04-15T19:14:17Z

LGTM
Merging with master
Thanks for the PR!

…set` method

## What changes were proposed in this pull request?

Param setters in python previously accessed the _paramMap directly to update values. The `_set` method now implements type checking, so it should be used to update all parameters. This PR eliminates all direct accesses to `_paramMap` besides the one in the `_set` method to ensure type checking happens.

Additional changes:

* [SPARK-13068](apache#11663) missed adding type converters in evaluation.py so those are done here
* An incorrect `toBoolean` type converter was used for StringIndexer `handleInvalid` param in previous PR. This is fixed here.

## How was this patch tested?

Existing unit tests verify that parameters are still set properly. No new functionality is actually added in this PR.

Author: sethah <[email protected]>

Closes apache#11939 from sethah/SPARK-14104.

… method

## What changes were proposed in this pull request?

#11939 made Python param setters use the `_set` method. This PR fixes the ones that were missed.

## How was this patch tested?

Existing tests.

cc jkbradley sethah

Author: Yanbo Liang <[email protected]>

Closes #12531 from yanboliang/setters-omissive.

sethah reviewed Mar 24, 2016

sethah added 6 commits April 13, 2016 17:06

using _set in all params

3598c9f

cleaning up

436745f

style fix

ea02225

_setDefault uses typeConverter

0c0fc63

set default ignores java objects

c864597

updating with master

8079c11

sethah force-pushed the SPARK-14104 branch from 3b0f89b to 8079c11 Compare April 14, 2016 00:27

jkbradley mentioned this pull request Apr 14, 2016

[SPARK-7861][ML] PySpark OneVsRest #12124

Closed

jkbradley reviewed Apr 15, 2016

code review

37b9ac5

holdenk mentioned this pull request Apr 15, 2016

[SPARK-14665][ML][PYTHON] Fixed bug with StopWordsRemover default stopwords #12422

Closed

asfgit closed this in 129f2f4 Apr 15, 2016

holdenk mentioned this pull request Apr 18, 2016

[Spark-14564] [ML] [MLlib] [PySpark] Python Word2Vec missing setWindowSize method #12428

Closed

This was referenced Apr 20, 2016

[Minor] [ML] [PySpark] Fix omissive param setters which should use _set method #12531

Closed

[SPARK-11940][PYSPARK][ML] Python API for ml.clustering.LDA #10242

Closed
