[SPARK-19871] [PySpark][SQL] Improve error message in verify_type to indicate the field #17213
Conversation
Can one of the admins verify this patch?
```diff
-def _verify_type(obj, dataType, nullable=True):
+def _verify_type(obj, dataType, nullable=True, name=None):
```
I think we need a doctest below.
I think we should run …

cc @dgingrich who I guess is the reporter of SPARK-19507 - what do you think about this?

Ah yes, it seems …

Will add doctest and check linting.
dgingrich left a comment:
Looks good, and basically the same as what I did. I had some additional handling for nested structs and map/array elements, which I think would be good to add. Here's my PR (incomplete, tests not done: #17227). I'm fine with either one being merged as long as I get better debug messages in the next release :) Let me know what you want me to do.
```diff
 if isinstance(obj, dict):
     for f in dataType.fields:
-        _verify_type(obj.get(f.name), f.dataType, f.nullable)
+        _verify_type(obj.get(f.name), f.dataType, f.nullable, f.name)
```
This doesn't work that well for nested structs:

```python
MySubType = StructType([StructField('value', StringType(), nullable=False)])
MyType = StructType([
    StructField('one', MySubType),
    StructField('two', MySubType)])
_verify_type({'one': {'value': 'good'}, 'two': {'value': None}}, MyType)
# "This field (value, of type StringType) is not nullable, but got None"
# But is it one.value or two.value?
```

```diff
-_verify_type(k, dataType.keyType, False)
-_verify_type(v, dataType.valueType, dataType.valueContainsNull)
+_verify_type(k, dataType.keyType, False, name)
+_verify_type(v, dataType.valueType, dataType.valueContainsNull, name)
```
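One way to resolve the `one.value` vs `two.value` ambiguity (a sketch of the idea, not the patch's actual code) is to thread the parent field name down through the struct recursion as a dotted path. Here `verify_struct` and its `(field_name, nullable, sub_fields)` schema tuples are hypothetical simplifications of `_verify_type` and `StructType`:

```python
def verify_struct(obj, fields, name=None):
    # Sketch: each entry is (field_name, nullable, sub_fields), where
    # sub_fields is None for a leaf or a nested list of entries for a struct.
    # Names compose as "parent.child" so errors pinpoint the exact field.
    for field_name, nullable, sub_fields in fields:
        path = field_name if name is None else "%s.%s" % (name, field_name)
        value = obj.get(field_name)
        if sub_fields is not None:
            verify_struct(value, sub_fields, name=path)
        elif value is None and not nullable:
            raise ValueError(
                "This field (%s) is not nullable, but got None" % path)

sub = [('value', False, None)]
schema = [('one', True, sub), ('two', True, sub)]
# verify_struct({'one': {'value': 'good'}, 'two': {'value': None}}, schema)
# -> ValueError: This field (two.value) is not nullable, but got None
```

With the path accumulated at each level, the message distinguishes `two.value` from `one.value` even though both fields share the same sub-schema.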
Might also want to flag individual array/map elements.
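For example (a hypothetical naming scheme, not from either patch), array elements could be reported as `field[i]` and map values as `field[key]`, so the error points at the exact offending element:

```python
def verify_array(arr, element_nullable, name):
    # Sketch: flag the offending array element by its index.
    for i, element in enumerate(arr):
        if element is None and not element_nullable:
            raise ValueError(
                "This field (%s[%d]) is not nullable, but got None" % (name, i))

def verify_map(mapping, value_contains_null, name):
    # Sketch: flag the offending map value by its key.
    for k, v in mapping.items():
        if v is None and not value_contains_null:
            raise ValueError(
                "This field (%s[%r]) is not nullable, but got None" % (name, k))
```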
Yours looks more complete, so I'm willing to close this if you finish it.

Let me then resolve this JIRA as a duplicate and reopen @dgingrich's one as soon as this one is closed.
https://issues.apache.org/jira/browse/SPARK-19871
What changes were proposed in this pull request?
Improve the error message in `_verify_type` to indicate the field responsible for the verification error. This is incredibly useful for tracking down type/nullability errors.
Sample changes:
Before:

```
This field is not nullable, but got None
```

After:

```
This field (my_column, of type BooleanType) is not nullable, but got None
```

Before:

```
FloatType can not accept object True in type bool
```

After:

```
FloatType can not accept object True in type bool for field my_column
```

How was this patch tested?
Unit tests pass
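The sample messages above amount to appending the field name when one is known. A minimal sketch of that construction (the function name `type_error_message` is hypothetical; the wording is copied from the samples):

```python
def type_error_message(data_type, obj, name=None):
    # Build the base message, then append "for field <name>" when a name
    # was passed down from the enclosing schema; otherwise keep the old text.
    base = "%s can not accept object %r in type %s" % (
        data_type, obj, type(obj).__name__)
    if name is not None:
        return base + " for field %s" % name
    return base

# type_error_message('FloatType', True, name='my_column')
# -> "FloatType can not accept object True in type bool for field my_column"
```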