[SPARK-13068][PYSPARK][ML] Type conversion for Pyspark params #11663
Conversation
Test build #52960 has finished for PR 11663 at commit

Test build #53076 has finished for PR 11663 at commit

Test build #53077 has finished for PR 11663 at commit
python/pyspark/ml/param/__init__.py
"use typeConverter instead, as a keyword argument"
Also, I'd put this same message in the docstring too.
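A minimal sketch of how such a deprecation path might look (the parameter names match the PR's discussion, but this is illustrative, not Spark's actual code):

```python
import warnings

class Param(object):
    """Sketch only: a Param constructor accepting both the deprecated
    expectedType argument and its replacement, typeConverter."""

    def __init__(self, parent, name, doc, expectedType=None, typeConverter=None):
        self.parent, self.name, self.doc = parent, name, doc
        if expectedType is not None:
            # Deprecated path: warn with the suggested message, keep working.
            warnings.warn("expectedType is deprecated and will be removed. "
                          "Use typeConverter instead, as a keyword argument.",
                          DeprecationWarning)
        self.typeConverter = typeConverter
```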
Made an initial pass. I like this update--thanks!
Test build #53438 has finished for PR 11663 at commit

Jenkins retest this please

Test build #53446 has finished for PR 11663 at commit

Test build #2654 has finished for PR 11663 at commit
python/pyspark/ml/feature.py
We can remove this restriction in the doc now.
I made another pass. I only had minor comments.

Test build #53816 has finished for PR 11663 at commit

Test build #53821 has finished for PR 11663 at commit
python/pyspark/ml/param/__init__.py
I changed this to do "safe" unicode to str conversions. The way it was previously, a user could provide non-ascii characters in a string param and get a somewhat mysterious UnicodeEncodeError. This way, they should at least get an error message consistent with other TypeConverters. I appreciate feedback on this.
I hadn't thought about this before, but we actually should support unicode. The main use case is StringIndexer, which might be used to index unicode. For that, we'd want to pass an array of unicode and probably avoid converting it to str types.
Java/Scala should already handle this since java.lang.String handles unicode.
I changed the string conversions to handle unicode.
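The resulting behavior can be sketched in a few lines (Python 3 shown, where all `str` is text; this is a sketch of the idea, not Spark's exact converter, which also had to handle Python 2's separate `unicode` type):

```python
def to_string(value):
    """Sketch of a unicode-preserving string converter (not Spark's code)."""
    if isinstance(value, str):
        # Already text: pass through unchanged so non-ASCII characters
        # survive instead of triggering a mysterious UnicodeEncodeError.
        return value
    if isinstance(value, (bool, int, float)):
        # Simple scalars convert unambiguously.
        return str(value)
    raise TypeError("Could not convert %s to string" % type(value))

print(to_string("café"))  # non-ASCII text passes through intact
```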
@sethah @jkbradley sorry for a question on an old bit of code, but hope this is a quick one:
These converters don't allow None. Some parameters can be set to None. Should they actually all just return None if their value is None? Any harm in that?
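One hypothetical way to get that behavior (not something this PR implements) is a small wrapper that lets None pass through any converter unchanged:

```python
def allow_none(converter):
    """Hypothetical wrapper: params that accept None skip conversion."""
    def wrapped(value):
        return None if value is None else converter(value)
    return wrapped

# Example with a plain int conversion standing in for a TypeConverter:
to_int_or_none = allow_none(int)
print(to_int_or_none(None))   # None is preserved
print(to_int_or_none("42"))   # other values are still converted
```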
Test build #53827 has finished for PR 11663 at commit

Test build #53832 has finished for PR 11663 at commit
python/pyspark/ml/param/__init__.py
```python
if TypeConverters._can_convert_to_string(value):
    return str(value)
else:
    raise TypeError("Could not convert value of type %s to string" % type(value).__name__)
```
I actually like not having `__name__` since it's nice to see the module name as well. I guess you could write `Could not convert %s to string` to avoid having "type" appear twice.
Same for other uses of `__name__`.
Done.
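The difference between the two message styles can be seen directly (Python 3 output shown):

```python
value = [1, 2]

# With __name__, only the bare class name appears:
print("Could not convert value of type %s to string" % type(value).__name__)
# → Could not convert value of type list to string

# Without __name__, the type's repr carries more context:
print("Could not convert %s to string" % type(value))
# → Could not convert <class 'list'> to string
```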
I also commented on 1 earlier item (about the conversions to str)

Test build #53835 has finished for PR 11663 at commit

Test build #53943 has finished for PR 11663 at commit

LGTM
…set` method

## What changes were proposed in this pull request?

Param setters in python previously accessed the `_paramMap` directly to update values. The `_set` method now implements type checking, so it should be used to update all parameters. This PR eliminates all direct accesses to `_paramMap` besides the one in the `_set` method to ensure type checking happens.

Additional changes:

* [SPARK-13068](apache#11663) missed adding type converters in evaluation.py, so those are added here.
* An incorrect `toBoolean` type converter was used for the StringIndexer `handleInvalid` param in a previous PR. This is fixed here.

## How was this patch tested?

Existing unit tests verify that parameters are still set properly. No new functionality is actually added in this PR.

Author: sethah <[email protected]>

Closes apache#11939 from sethah/SPARK-14104.
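The `_set` pattern described above can be sketched in miniature (simplified, illustrative classes, not the real PySpark ones):

```python
class Params(object):
    """Sketch: routing all writes through _set means every assignment
    is type-checked/converted in exactly one place."""

    _converters = {}

    def __init__(self):
        self._paramMap = {}

    def _set(self, **kwargs):
        for name, value in kwargs.items():
            converter = self._converters.get(name, lambda v: v)
            # Conversion (and any TypeError) happens here, at set time.
            self._paramMap[name] = converter(value)

class LogisticRegressionLike(Params):
    _converters = {"maxIter": int, "tol": float}

    def setMaxIter(self, value):
        # Goes through _set rather than touching _paramMap directly.
        self._set(maxIter=value)
        return self

m = LogisticRegressionLike().setMaxIter("10")
print(m._paramMap["maxIter"])  # converted to int: 10
```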
## What changes were proposed in this pull request?

#11663 adds type conversion functionality for parameters in Pyspark. This PR finds the omitted `Param`s that did not pass a corresponding `typeConverter` argument and fixes them. After this PR, all params in pyspark/ml/ use a `TypeConverter`.

## How was this patch tested?

Existing tests.

cc jkbradley sethah

Author: Yanbo Liang <[email protected]>

Closes #12529 from yanboliang/typeConverter.
What changes were proposed in this pull request?

This patch adds type conversion functionality for parameters in Pyspark. A `typeConverter` field is added to the constructor of the `Param` class. This argument is a function which converts values passed to this param to the appropriate type, if possible. This is beneficial so that the params can fail at set time if they are given inappropriate values, but even more so because coherent error messages are now provided when Py4J cannot cast the python type to the appropriate Java type.

This patch also adds a `TypeConverters` class with factory methods for common type conversions. Most of the changes involve adding these factory type converters to existing params. The previous solution to this issue, `expectedType`, is deprecated and can be removed in 2.1.0, as discussed on the Jira.

How was this patch tested?

Unit tests were added in python/pyspark/ml/tests.py to test parameter type conversion. These tests check that values that should be convertible are converted correctly, and that the appropriate errors are thrown when invalid values are provided.
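Putting the pieces together, a minimal sketch of the mechanism the PR describes (illustrative names and simplified converters; the real API lives in `pyspark.ml.param`):

```python
class TypeConverters(object):
    """Sketch of factory converters, loosely mirroring the PR's design."""

    @staticmethod
    def toFloat(value):
        # bool is excluded so True/False don't silently become 1.0/0.0.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return float(value)
        raise TypeError("Could not convert %s to float" % type(value))

    @staticmethod
    def toListFloat(value):
        if isinstance(value, (list, tuple)):
            return [TypeConverters.toFloat(v) for v in value]
        raise TypeError("Could not convert %s to list of floats" % type(value))

class Param(object):
    """Sketch of a Param carrying a typeConverter, applied at set time."""

    def __init__(self, name, doc, typeConverter=None):
        self.name = name
        self.doc = doc
        self.typeConverter = typeConverter if typeConverter else lambda v: v

threshold = Param("threshold", "prediction threshold",
                  typeConverter=TypeConverters.toFloat)
print(threshold.typeConverter(1))  # ints become floats: 1.0
```

The payoff is that a bad value fails immediately with a readable TypeError instead of surfacing later as an opaque Py4J cast error.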