
Conversation

@sethah
Contributor

@sethah sethah commented Mar 11, 2016

What changes were proposed in this pull request?

This patch adds type conversion functionality for parameters in PySpark. A typeConverter argument is added to the constructor of the Param class. This argument is a function that converts values passed to the param to the appropriate type, if possible. This is beneficial because params can now fail at set time when given inappropriate values, and, more importantly, because coherent error messages are provided when Py4J cannot cast the Python type to the appropriate Java type.

This patch also adds a TypeConverters class with factory methods for common type conversions. Most of the changes involve adding these factory type converters to existing params. The previous solution to this issue, expectedType, is deprecated and can be removed in 2.1.0, as discussed on the JIRA.
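A minimal sketch of the pattern this patch introduces (the names follow pyspark.ml.param, but this is a standalone illustration, not the actual Spark source):

```python
def toFloat(value):
    """Convert a value to float, failing with a coherent error at set time."""
    if isinstance(value, (int, float)):
        return float(value)
    raise TypeError("Could not convert %r to float" % (value,))

class Param(object):
    def __init__(self, parent, name, doc, typeConverter=None):
        self.parent = parent
        self.name = name
        self.doc = doc
        # Default to the identity function so params without a converter
        # behave as before.
        self.typeConverter = typeConverter if typeConverter is not None else lambda v: v

param = Param(None, "threshold", "decision threshold", typeConverter=toFloat)
print(param.typeConverter(1))      # 1.0 -- ints are widened to float
try:
    param.typeConverter("high")    # fails at set time, not inside Py4J
except TypeError as e:
    print(e)
```

The payoff is that a bad value produces a Python-side TypeError with a readable message, instead of an opaque Py4J casting error later.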

How was this patch tested?

Unit tests were added in python/pyspark/ml/tests.py to test parameter type conversion. These tests check that values that should be convertible are converted correctly, and that the appropriate errors are thrown when invalid values are provided.
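The tests follow the usual unittest shape; roughly like this (the converter here is a hypothetical stand-in mirroring TypeConverters.toListFloat, not the Spark implementation):

```python
import unittest

def toListFloat(value):
    """Hypothetical converter mirroring TypeConverters.toListFloat."""
    if isinstance(value, (list, tuple)) and all(isinstance(v, (int, float)) for v in value):
        return [float(v) for v in value]
    raise TypeError("Could not convert %r to list of floats" % (value,))

class ParamTypeConversionTests(unittest.TestCase):
    def test_convertible_values_are_converted(self):
        self.assertEqual(toListFloat([1, 2]), [1.0, 2.0])

    def test_invalid_values_raise(self):
        self.assertRaises(TypeError, toListFloat, "not a list")

# run with: python -m unittest <module>
```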

@SparkQA

SparkQA commented Mar 11, 2016

Test build #52960 has finished for PR 11663 at commit 40b00dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 14, 2016

Test build #53076 has finished for PR 11663 at commit 34e9925.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 14, 2016

Test build #53077 has finished for PR 11663 at commit 40b00dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

"use typeConverter instead, as a keyword argument"

Also, I'd put this same message in the docstring too.

@jkbradley
Member

Made an initial pass. I like this update--thanks!

@SparkQA

SparkQA commented Mar 17, 2016

Test build #53438 has finished for PR 11663 at commit c483da8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • probabilityCol = Param(Params._dummy(), "probabilityCol", "Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.", typeConverter=TypeConverters.toString)
    • thresholds = Param(Params._dummy(), "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class' threshold.", typeConverter=TypeConverters.toListFloat)

@sethah
Contributor Author

sethah commented Mar 17, 2016

Jenkins retest this please

@SparkQA

SparkQA commented Mar 17, 2016

Test build #53446 has finished for PR 11663 at commit c483da8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • probabilityCol = Param(Params._dummy(), "probabilityCol", "Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.", typeConverter=TypeConverters.toString)
    • thresholds = Param(Params._dummy(), "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class' threshold.", typeConverter=TypeConverters.toListFloat)

@SparkQA

SparkQA commented Mar 21, 2016

Test build #2654 has finished for PR 11663 at commit c483da8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • probabilityCol = Param(Params._dummy(), "probabilityCol", "Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.", typeConverter=TypeConverters.toString)
    • thresholds = Param(Params._dummy(), "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class' threshold.", typeConverter=TypeConverters.toListFloat)

Member

We can remove this restriction in the doc now.

@jkbradley
Member

I made another pass. I only had minor comments.

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53816 has finished for PR 11663 at commit ff7f94d.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53821 has finished for PR 11663 at commit e1e4643.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

Contributor Author

I changed this to do "safe" unicode to str conversions. Previously, a user could provide non-ASCII characters in a string param and get a somewhat mysterious UnicodeEncodeError. This way, they at least get an error message consistent with the other TypeConverters. I would appreciate feedback on this.

Member

I hadn't thought about this before, but we actually should support unicode. The main use case is StringIndexer, which might be used to index unicode. For that, we'd want to pass an array of unicode and probably avoid converting it to str types.

Java/Scala should already handle this since java.lang.String handles unicode.

Contributor Author

I changed the string conversions to handle unicode.
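The shape of a unicode-aware string converter along these lines might look like the following. This is an assumed sketch, not the Spark source: the key point is that unicode values pass through unchanged rather than being forced through an ASCII str() call, since java.lang.String on the JVM side handles unicode.

```python
import sys

if sys.version_info[0] < 3:
    string_types = (str, unicode)  # noqa: F821 -- Python 2 only
else:
    string_types = (str,)

def toString(value):
    """Accept str and unicode as-is; reject everything else coherently."""
    if isinstance(value, string_types):
        return value  # no encoding step, so no UnicodeEncodeError
    raise TypeError("Could not convert %r to string" % (value,))
```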

Member

@sethah @jkbradley sorry for a question on an old bit of code, but hope this is a quick one:
These converters don't allow None, yet some parameters can be set to None. Should they actually all just return None if their value is None? Any harm in that?
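One way to realize that suggestion (an illustration of the question, not Spark code) is a wrapper that makes any converter pass None through unchanged, on the view that a param explicitly set to None means "unset":

```python
def none_aware(converter):
    """Wrap a converter so that None passes through untouched."""
    def wrapped(value):
        if value is None:
            return None
        return converter(value)
    return wrapped

to_int = none_aware(int)
print(to_int(None))   # None passes through
print(to_int("3"))    # 3
```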

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53827 has finished for PR 11663 at commit 99eed51.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53832 has finished for PR 11663 at commit 2c46076.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if TypeConverters._can_convert_to_string(value):
    return str(value)
else:
    raise TypeError("Could not convert value of type %s to string" % type(value).__name__)
Member

I actually like not having __name__ since it's nice to see the module name as well. I guess you could write "Could not convert %s to string" to avoid having "type" appear twice.

Same for other uses of __name__
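For illustration, formatting the type object itself (rather than type(value).__name__) keeps the module/class context visible in the message:

```python
value = u"abc"
# %s on type(value) renders something like "<class 'str'>" on Python 3
# or "<type 'unicode'>" on Python 2, so the reader sees the full context.
msg = "Could not convert %s to string" % type(value)
print(msg)
```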

Contributor Author

Done.

@jkbradley
Member

I also commented on 1 earlier item (about the conversions to str)

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53835 has finished for PR 11663 at commit 6a8d5f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53943 has finished for PR 11663 at commit 3e55f83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

LGTM
Thanks for a great PR!
Merging with master

@asfgit asfgit closed this in 30bdb5c Mar 23, 2016
lw-lin pushed a commit to lw-lin/spark that referenced this pull request Apr 20, 2016
…set` method

## What changes were proposed in this pull request?

Param setters in Python previously accessed `_paramMap` directly to update values. The `_set` method now implements type checking, so it should be used to update all parameters. This PR eliminates all direct accesses to `_paramMap`, besides the one in the `_set` method, to ensure type checking happens.

Additional changes:
* [SPARK-13068](apache#11663) missed adding type converters in evaluation.py, so those are added here
* An incorrect `toBoolean` type converter was used for the StringIndexer `handleInvalid` param in a previous PR; this is fixed here.
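The idea of funneling every update through `_set` can be sketched as follows (an assumed minimal shape, not the actual pyspark.ml source):

```python
class Param(object):
    def __init__(self, name, typeConverter=lambda v: v):
        self.name = name
        self.typeConverter = typeConverter

class Params(object):
    def __init__(self):
        self._paramMap = {}

    def _set(self, **kwargs):
        # Every update goes through the param's typeConverter, so setters
        # cannot bypass type checking by writing to _paramMap directly.
        for name, value in kwargs.items():
            param = getattr(self, name)
            try:
                value = param.typeConverter(value)
            except (TypeError, ValueError) as e:
                raise TypeError('Invalid param value given for param "%s". %s'
                                % (name, e))
            self._paramMap[param] = value
        return self

class LogisticRegression(Params):
    maxIter = Param("maxIter", typeConverter=int)

lr = LogisticRegression()._set(maxIter=10)
print(lr._paramMap[LogisticRegression.maxIter])  # 10
```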

## How was this patch tested?

Existing unit tests verify that parameters are still set properly. No new functionality is actually added in this PR.

Author: sethah <[email protected]>

Closes apache#11939 from sethah/SPARK-14104.
asfgit pushed a commit that referenced this pull request Apr 20, 2016
## What changes were proposed in this pull request?
#11663 adds type conversion functionality for parameters in PySpark. This PR finds the remaining ```Param```s that did not pass a corresponding ```TypeConverter``` argument and fixes them. After this PR, all params in pyspark/ml/ use ```TypeConverter```.

## How was this patch tested?
Existing tests.

cc jkbradley sethah

Author: Yanbo Liang <[email protected]>

Closes #12529 from yanboliang/typeConverter.