Skip to content

Conversation

@tbcs
Copy link
Contributor

@tbcs tbcs commented Apr 24, 2019

This is PR is meant to replace #20503, which lay dormant for a while. The solution in the original PR is still valid, so this is just that patch rebased onto the current master.

Original summary follows.

What changes were proposed in this pull request?

Fix __repr__ behaviour for Rows.

Rows __repr__ assumes data is a string when column name is missing.
Examples,

>>> from pyspark.sql.types import Row
>>> Row ("Alice", "11")
<Row(Alice, 11)>

>>> Row (name="Alice", age=11)
Row(age=11, name='Alice')

>>> Row ("Alice", 11)
<snip stack trace>
TypeError: sequence item 1: expected string, int found

This is because Row () when called without column names assumes everything is a string.

How was this patch tested?

Manually tested and a unit test was added to python/pyspark/sql/tests/test_types.py.

@tbcs
Copy link
Contributor Author

tbcs commented Apr 24, 2019

If this can get merged then it would be great to also backport the patch to the currently supported stable version(s) of Spark (2.4 and 2.3?). What do I need to do to make that happen?

@felixcheung
Copy link
Member

@HyukjinKwon

@HyukjinKwon
Copy link
Member

ok to test

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I applied the suggested change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test non-ascii compatible characters

I have added a test with unicode values. Of course that breaks in Python 2.7 because %s is used in repr. I think it would be reasonable to change the use of %s to %r for representing the individual tuple values, just as was suggested originally in the JIRA ticket. I've made that change and adapted the doctest accordingly. What do you think about this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, this form of creating Row objects is probably rarely used and should at least be documented. Maybe it wasn't even intended to be used in the way I did. Please have a look at the docstring change I've added and let me know if this is appropriate.

@SparkQA
Copy link

SparkQA commented Apr 25, 2019

Test build #104890 has finished for PR 24448 at commit 8c448dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ashashwat and others added 3 commits April 25, 2019 11:18
Rows __repr__ assumes data is strings when column name is missing.

Examples,
>>> Row ("Alice", "11")
<Row(Alice, 11)>

>>> Row (name="Alice", age=11)
Row(age=11, name='Alice')

>>> Row ("Alice", 11)
<snip stack trace>
TypeError: sequence item 1: expected string, int found

This is because Row () when called without column names assumes
everything is string.
Change the representation of tuple value Row objects: use repr instead of str
for the individual tuple values for unicode compatibility.

- change Row.__repr__ to use %r instead of %s
- adapt Row's doctest
- add test for unicode tuple values
@SparkQA
Copy link

SparkQA commented Apr 26, 2019

Test build #104949 has finished for PR 24448 at commit ae011c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for the change. For back porting you can open a PR to the branch-2.4 and branch-2.3. We could also backport during the merge, but I think it makes sense to have a discussion around the backporting separately since it may break folks doctests.

@asfgit asfgit closed this in eec1a3c May 6, 2019
@holdenk
Copy link
Contributor

holdenk commented May 6, 2019

Merged to master. Thanks for making your first pull request to Spark @tbcs :) Do you have a JIRA username I can assign the issue to now that it's fixed for tracking?

@tbcs
Copy link
Contributor Author

tbcs commented May 13, 2019

I'll open a PR for the stable branches when I get a chance to do so.

@holdenk: I created a user on Jira: tbcs

@srowen
Copy link
Member

srowen commented Sep 3, 2019

I also wouldn't mind back-porting this. I used the varargs Row constructor, which seems normal, and hit this too. But I get that it may break doctests so, hm, unclear.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants