[SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for Rows #24448

tbcs · 2019-04-24T12:13:22Z

This PR is meant to replace #20503, which lay dormant for a while. The solution in the original PR is still valid, so this is just that patch rebased onto the current master.

Original summary follows.

What changes were proposed in this pull request?

Fix __repr__ behaviour for Rows.

Row's __repr__ assumes that every value is a string when column names are missing.
Examples:

>>> from pyspark.sql.types import Row
>>> Row("Alice", "11")
<Row(Alice, 11)>

>>> Row(name="Alice", age=11)
Row(age=11, name='Alice')

>>> Row("Alice", 11)
<snip stack trace>
TypeError: sequence item 1: expected string, int found

This is because Row(), when called without column names, assumes that every value is a string.
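The failure can be reproduced without Spark. The sketch below uses a hypothetical tuple subclass, PositionalRow, as a stand-in for a Row built from positional values only. It assumes the mechanism is a plain str.join over the raw values and is not a copy of the PySpark source.

```python
class PositionalRow(tuple):
    """Hypothetical stand-in for a Row created without column names."""

    def __new__(cls, *values):
        return super().__new__(cls, values)

    def __repr__(self):
        # str.join accepts only strings, so a non-string value raises TypeError.
        return "<Row(%s)>" % ", ".join(self)


# All values are strings, so the join succeeds.
print(repr(PositionalRow("Alice", "11")))  # <Row(Alice, 11)>

# An int among the values makes the join fail.
try:
    repr(PositionalRow("Alice", 11))
except TypeError as exc:
    print("TypeError:", exc)
```

Note that the successful case also hides the quotes, so the string "11" is indistinguishable from a number in the output.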

How was this patch tested?

Manually tested and a unit test was added to python/pyspark/sql/tests/test_types.py.

tbcs · 2019-04-24T12:17:13Z

If this can get merged then it would be great to also backport the patch to the currently supported stable version(s) of Spark (2.4 and 2.3?). What do I need to do to make that happen?

felixcheung · 2019-04-25T04:01:13Z

@HyukjinKwon

HyukjinKwon · 2019-04-25T04:06:13Z

ok to test

HyukjinKwon · 2019-04-25T04:07:42Z

python/pyspark/sql/types.py

#20503 (comment)

I applied the suggested change.

HyukjinKwon · 2019-04-25T04:07:45Z

python/pyspark/sql/tests/test_types.py

#20503 (comment)

test non-ascii compatible characters

I have added a test with unicode values. Of course that breaks in Python 2.7 because %s is used in repr. I think it would be reasonable to change the use of %s to %r for representing the individual tuple values, just as was suggested originally in the JIRA ticket. I've made that change and adapted the doctest accordingly. What do you think about this?

HyukjinKwon · 2019-04-25T04:07:53Z

python/pyspark/sql/tests/test_types.py

#20503 (comment)

Indeed, this form of creating Row objects is probably rarely used and should at least be documented. Maybe it wasn't even intended to be used in the way I did. Please have a look at the docstring change I've added and let me know if this is appropriate.

SparkQA · 2019-04-25T04:38:31Z

Test build #104890 has finished for PR 24448 at commit 8c448dc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Change the representation of tuple value Row objects: use repr instead of str for the individual tuple values for unicode compatibility.

- change Row.__repr__ to use %r instead of %s
- adapt Row's doctest
- add test for unicode tuple values
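As a rough illustration of what switching from %s to %r means for positional values, here is a minimal sketch in plain Python 3. The helper name format_positional is hypothetical and only mirrors the idea described in the commit message; it is not the code from python/pyspark/sql/types.py.

```python
def format_positional(values):
    """Hypothetical helper: render positional values the way the commit describes."""
    # %r calls repr() on each value, so non-strings are accepted and
    # strings keep their quotes. The one-element tuple avoids surprises
    # if a value is itself a tuple.
    return "<Row(%s)>" % ", ".join("%r" % (v,) for v in values)


print(format_positional(("Alice", 11)))    # <Row('Alice', 11)>
print(format_positional(("Alice", "11")))  # <Row('Alice', '11')>
```

With %r the integer 11 and the string "11" are rendered differently, which the old %s output could not show.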

SparkQA · 2019-04-26T23:02:15Z

Test build #104949 has finished for PR 24448 at commit ae011c0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk

LGTM, thanks for the change. For backporting you can open a PR against branch-2.4 and branch-2.3. We could also backport during the merge, but I think it makes sense to have a discussion around the backporting separately since it may break folks' doctests.
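To make the doctest concern concrete, the sketch below runs a hypothetical user doctest that still expects the old unquoted output against a %r-style rendering. The function old_style_example and its docstring are invented for illustration; only the standard-library doctest module is used.

```python
import doctest


def old_style_example():
    """
    Hypothetical user doctest written against the old, unquoted output.

    >>> print("<Row(%s)>" % ", ".join("%r" % (v,) for v in ("Alice", "11")))
    <Row(Alice, 11)>
    """


parser = doctest.DocTestParser()
test = parser.get_doctest(
    old_style_example.__doc__, {}, "old_style_example", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
# Discard the failure report; only the counts matter here.
results = runner.run(test, out=lambda text: None)
print(results.failed, "of", results.attempted, "examples failed")
```

The single example fails because the new rendering is <Row('Alice', '11')>, which is the kind of breakage a backport to a stable branch could cause for users.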

holdenk · 2019-05-06T17:05:40Z

Merged to master. Thanks for making your first pull request to Spark @tbcs :) Do you have a JIRA username I can assign the issue to now that it's fixed for tracking?

tbcs · 2019-05-13T09:04:27Z

I'll open a PR for the stable branches when I get a chance to do so.

@holdenk: I created a user on Jira: tbcs

srowen · 2019-09-03T14:07:01Z

I also wouldn't mind back-porting this. I used the varargs Row constructor, which seems normal, and hit this too. But I get that it may break doctests so, hm, unclear.

tbcs mentioned this pull request Apr 24, 2019

[SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for Rows. #20503

Closed

HyukjinKwon reviewed Apr 25, 2019

ashashwat and others added 3 commits April 25, 2019 11:18

[SPARK-23299][SQL][PYSPARK] Document the use of Row as a tuple value

735d68b

tbcs force-pushed the SPARK-23299 branch from 8c448dc to ae011c0 Compare April 26, 2019 22:26

holdenk approved these changes May 6, 2019

asfgit closed this in eec1a3c May 6, 2019
