[SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for Rows #24448
|
If this can get merged then it would be great to also backport the patch to the currently supported stable version(s) of Spark (2.4 and 2.3?). What do I need to do to make that happen? |
|
ok to test |
python/pyspark/sql/types.py
I applied the suggested change.
test non-ascii compatible characters
I have added a test with unicode values. Of course that breaks in Python 2.7 because %s is used in repr. I think it would be reasonable to change the use of %s to %r for representing the individual tuple values, just as was suggested originally in the JIRA ticket. I've made that change and adapted the doctest accordingly. What do you think about this?
Indeed, this form of creating Row objects is probably rarely used and should at least be documented. Maybe it wasn't even intended to be used in the way I did. Please have a look at the docstring change I've added and let me know if this is appropriate.
|
Test build #104890 has finished for PR 24448 at commit
|
Row's __repr__ assumes the data are strings when column names are missing.
Examples:
>>> Row("Alice", "11")
<Row(Alice, 11)>
>>> Row(name="Alice", age=11)
Row(age=11, name='Alice')
>>> Row("Alice", 11)
<snip stack trace>
TypeError: sequence item 1: expected string, int found
This is because Row(), when called without column names, assumes
every value is a string.
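The failure described in this commit message can be reproduced without Spark. The sketch below is a simplified stand-in, not PySpark's actual source: `OldStyleRow` is a hypothetical tuple subclass whose `__repr__` joins its values directly, on the assumption that this mirrors the behaviour reported here.

```python
class OldStyleRow(tuple):
    """Hypothetical stand-in for a Row built from positional values only."""

    def __new__(cls, *values):
        return super().__new__(cls, values)

    def __repr__(self):
        # str.join accepts only strings, so any non-string value raises TypeError.
        return "<Row(%s)>" % ", ".join(self)


# All values are strings, so the join succeeds.
print(repr(OldStyleRow("Alice", "11")))

# A mixed row fails as soon as join reaches the integer.
try:
    repr(OldStyleRow("Alice", 11))
except TypeError as exc:
    print("TypeError:", exc)
```

The error text quoted in the commit message is the Python 2 wording; on Python 3 the same failure reads "expected str instance, int found".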
Change the representation of tuple value Row objects: use repr instead of str for the individual tuple values for unicode compatibility.
- change Row.__repr__ to use %r instead of %s
- adapt Row's doctest
- add test for unicode tuple values
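The effect of switching from %s to %r can be shown with a small stand-in class. This is a sketch under assumptions: `FixedRow` is a hypothetical name and the method body illustrates the idea of the change, not necessarily the exact code that was merged.

```python
class FixedRow(tuple):
    """Hypothetical stand-in for a positional Row after the %r change."""

    def __new__(cls, *values):
        return super().__new__(cls, values)

    def __repr__(self):
        # Format each value with %r so non-string values no longer break the join.
        # The one-element tuple (value,) keeps tuple values from being unpacked by %.
        return "<Row(%s)>" % ", ".join("%r" % (value,) for value in self)


print(repr(FixedRow("Alice", 11)))    # <Row('Alice', 11)>
print(repr(FixedRow("Alice", "11")))  # <Row('Alice', '11')>
```

String values are now printed with quotes, which is why the doctest had to be adapted and why existing doctests that rely on the old output could break.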
|
Test build #104949 has finished for PR 24448 at commit
|
holdenk left a comment
LGTM, thanks for the change. For backporting, you can open a PR against branch-2.4 and branch-2.3. We could also backport during the merge, but I think it makes sense to discuss the backport separately since it may break folks' doctests.
|
Merged to master. Thanks for making your first pull request to Spark @tbcs :) Do you have a JIRA username I can assign the issue to now that it's fixed for tracking? |
|
I also wouldn't mind back-porting this. I used the varargs Row constructor, which seems normal, and hit this too. But I get that it may break doctests so, hm, unclear. |
This PR is meant to replace #20503, which lay dormant for a while. The solution in the original PR is still valid, so this is just that patch rebased onto the current master.
Original summary follows.
What changes were proposed in this pull request?
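Fix __repr__ behaviour for Rows.
Row's __repr__ assumes data is a string when the column name is missing.
Examples:
>>> Row("Alice", "11")
<Row(Alice, 11)>
>>> Row(name="Alice", age=11)
Row(age=11, name='Alice')
>>> Row("Alice", 11)
<snip stack trace>
TypeError: sequence item 1: expected string, int found
This is because Row(), when called without column names, assumes everything is a string.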
How was this patch tested?
Manually tested, and a unit test was added to python/pyspark/sql/tests/test_types.py.