[SPARK-13068][PYSPARK][ML] Type conversion for Pyspark params #11663
Conversation
Test build #52960 has finished for PR 11663 at commit

Test build #53076 has finished for PR 11663 at commit

Test build #53077 has finished for PR 11663 at commit
python/pyspark/ml/param/__init__.py
"use typeConverter instead, as a keyword argument"
Also, I'd put this same message in the docstring too.
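A minimal sketch of how such a deprecation path might look (the parameter names match the PR's discussion, but this is illustrative, not Spark's actual code):

```python
import warnings

class Param(object):
    """Sketch only: a Param constructor accepting both the deprecated
    expectedType argument and its replacement, typeConverter."""

    def __init__(self, parent, name, doc, expectedType=None, typeConverter=None):
        self.parent, self.name, self.doc = parent, name, doc
        if expectedType is not None:
            # Deprecated path: warn with the suggested message, keep working.
            warnings.warn("expectedType is deprecated and will be removed. "
                          "Use typeConverter instead, as a keyword argument.",
                          DeprecationWarning)
        self.typeConverter = typeConverter
```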
Made an initial pass. I like this update--thanks!
Test build #53438 has finished for PR 11663 at commit

Jenkins retest this please

Test build #53446 has finished for PR 11663 at commit

Test build #2654 has finished for PR 11663 at commit
python/pyspark/ml/feature.py
We can remove this restriction in the doc now.
I made another pass. I only had minor comments.

Test build #53816 has finished for PR 11663 at commit

Test build #53821 has finished for PR 11663 at commit
python/pyspark/ml/param/__init__.py
I changed this to do "safe" unicode to str conversions. The way it was previously, a user could provide non-ascii characters in a string param and get a somewhat mysterious UnicodeEncodeError. This way, they should at least get an error message consistent with other TypeConverters. I appreciate feedback on this.
I hadn't thought about this before, but we actually should support unicode. The main use case is StringIndexer, which might be used to index unicode. For that, we'd want to pass an array of unicode and probably avoid converting it to str types.
Java/Scala should already handle this since java.lang.String handles unicode.
I changed the string conversions to handle unicode.
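The resulting behavior can be sketched in a few lines (Python 3 shown, where all `str` is text; this is a sketch of the idea, not Spark's exact converter, which also had to handle Python 2's separate `unicode` type):

```python
def to_string(value):
    """Sketch of a unicode-preserving string converter (not Spark's code)."""
    if isinstance(value, str):
        # Already text: pass through unchanged so non-ASCII characters
        # survive instead of triggering a mysterious UnicodeEncodeError.
        return value
    if isinstance(value, (bool, int, float)):
        # Simple scalars convert unambiguously.
        return str(value)
    raise TypeError("Could not convert %s to string" % type(value))

print(to_string("café"))  # non-ASCII text passes through intact
```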
@sethah @jkbradley sorry for a question on an old bit of code, but hope this is a quick one:
These converters don't allow None. Some parameters can be set to None. Should they actually all just return None if their value is None? Any harm in that?
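One hypothetical way to get that behavior (not something this PR implements) is a small wrapper that lets None pass through any converter unchanged:

```python
def allow_none(converter):
    """Hypothetical wrapper: params that accept None skip conversion."""
    def wrapped(value):
        return None if value is None else converter(value)
    return wrapped

# Example with a plain int conversion standing in for a TypeConverter:
to_int_or_none = allow_none(int)
print(to_int_or_none(None))   # None is preserved
print(to_int_or_none("42"))   # other values are still converted
```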
Test build #53827 has finished for PR 11663 at commit

Test build #53832 has finished for PR 11663 at commit
python/pyspark/ml/param/__init__.py
```python
if TypeConverters._can_convert_to_string(value):
    return str(value)
else:
    raise TypeError("Could not convert value of type %s to string" % type(value).__name__)
```
I actually like not having `__name__` since it's nice to see the module name as well. I guess you could write `Could not convert %s to string` to avoid having "type" appear twice.
Same for other uses of `__name__`.
Done.
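The difference between the two message styles can be seen directly (Python 3 output shown):

```python
value = [1, 2]

# With __name__, only the bare class name appears:
print("Could not convert value of type %s to string" % type(value).__name__)
# → Could not convert value of type list to string

# Without __name__, the type's repr carries more context:
print("Could not convert %s to string" % type(value))
# → Could not convert <class 'list'> to string
```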
I also commented on 1 earlier item (about the conversions to str)

Test build #53835 has finished for PR 11663 at commit

Test build #53943 has finished for PR 11663 at commit

LGTM
…set` method

## What changes were proposed in this pull request?

Param setters in python previously accessed the `_paramMap` directly to update values. The `_set` method now implements type checking, so it should be used to update all parameters. This PR eliminates all direct accesses to `_paramMap` besides the one in the `_set` method to ensure type checking happens.

Additional changes:

* [SPARK-13068](apache#11663) missed adding type converters in evaluation.py, so those are added here.
* An incorrect `toBoolean` type converter was used for the StringIndexer `handleInvalid` param in a previous PR. This is fixed here.

## How was this patch tested?

Existing unit tests verify that parameters are still set properly. No new functionality is actually added in this PR.

Author: sethah <[email protected]>

Closes apache#11939 from sethah/SPARK-14104.
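The `_set` pattern described above can be sketched in miniature (simplified, illustrative classes, not the real PySpark ones):

```python
class Params(object):
    """Sketch: routing all writes through _set means every assignment
    is type-checked/converted in exactly one place."""

    _converters = {}

    def __init__(self):
        self._paramMap = {}

    def _set(self, **kwargs):
        for name, value in kwargs.items():
            converter = self._converters.get(name, lambda v: v)
            # Conversion (and any TypeError) happens here, at set time.
            self._paramMap[name] = converter(value)

class LogisticRegressionLike(Params):
    _converters = {"maxIter": int, "tol": float}

    def setMaxIter(self, value):
        # Goes through _set rather than touching _paramMap directly.
        self._set(maxIter=value)
        return self

m = LogisticRegressionLike().setMaxIter("10")
print(m._paramMap["maxIter"])  # converted to int: 10
```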
## What changes were proposed in this pull request?

#11663 adds type conversion functionality for parameters in Pyspark. This PR finds the omitted `Param`s that did not pass a corresponding `typeConverter` argument and fixes them. After this PR, all params in pyspark/ml/ use a `TypeConverter`.

## How was this patch tested?

Existing tests.

cc jkbradley sethah

Author: Yanbo Liang <[email protected]>

Closes #12529 from yanboliang/typeConverter.
What changes were proposed in this pull request?

This patch adds type conversion functionality for parameters in Pyspark. A `typeConverter` field is added to the constructor of the `Param` class. This argument is a function which converts values passed to this param to the appropriate type, if possible. This is beneficial so that the params can fail at set time if they are given inappropriate values, but even more so because coherent error messages are now provided when Py4J cannot cast the python type to the appropriate Java type.

This patch also adds a `TypeConverters` class with factory methods for common type conversions. Most of the changes involve adding these factory type converters to existing params. The previous solution to this issue, `expectedType`, is deprecated and can be removed in 2.1.0, as discussed on the Jira.

How was this patch tested?

Unit tests were added in python/pyspark/ml/tests.py to test parameter type conversion. These tests check that values that should be convertible are converted correctly, and that the appropriate errors are thrown when invalid values are provided.
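Putting the pieces together, a minimal sketch of the mechanism the PR describes (illustrative names and simplified converters; the real API lives in `pyspark.ml.param`):

```python
class TypeConverters(object):
    """Sketch of factory converters, loosely mirroring the PR's design."""

    @staticmethod
    def toFloat(value):
        # bool is excluded so True/False don't silently become 1.0/0.0.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return float(value)
        raise TypeError("Could not convert %s to float" % type(value))

    @staticmethod
    def toListFloat(value):
        if isinstance(value, (list, tuple)):
            return [TypeConverters.toFloat(v) for v in value]
        raise TypeError("Could not convert %s to list of floats" % type(value))

class Param(object):
    """Sketch of a Param carrying a typeConverter, applied at set time."""

    def __init__(self, name, doc, typeConverter=None):
        self.name = name
        self.doc = doc
        self.typeConverter = typeConverter if typeConverter else lambda v: v

threshold = Param("threshold", "prediction threshold",
                  typeConverter=TypeConverters.toFloat)
print(threshold.typeConverter(1))  # ints become floats: 1.0
```

The payoff is that a bad value fails immediately with a readable TypeError instead of surfacing later as an opaque Py4J cast error.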