[SPARK-23299][SQL][PYSPARK] Fix repr behaviour for Rows. #20503

ashashwat · 2018-02-04T11:33:42Z

What changes were proposed in this pull request?

Fix __repr__ behaviour for Rows.

Rows __repr__ assumes data is a string when column name is missing.
Examples,

>>> from pyspark.sql.types import Row
>>> Row ("Alice", "11")
<Row(Alice, 11)>

>>> Row (name="Alice", age=11)
Row(age=11, name='Alice')

>>> Row ("Alice", 11)
<snip stack trace>
TypeError: sequence item 1: expected string, int found

This is because Row () when called without column names assumes
everything is a string.

How was this patch tested?

Manually tested and unittest was added in python/pyspark/sql/tests.py.

Rows __repr__ assumes data is strings when column name is missing. Examples, >>> Row ("Alice", "11") <Row(Alice, 11)> >>> Row (name="Alice", age=11) Row(age=11, name='Alice') >>> Row ("Alice", 11) <snip stack trace> TypeError: sequence item 1: expected string, int found This is because Row () when called without column names assumes everything is string.

HyukjinKwon · 2018-02-04T12:06:47Z

BTW, does non-string field names work in this namedtuple way?

ashashwat · 2018-02-04T12:16:52Z

@HyukjinKwon Do you mean something like Row (a=1, b=2, c=3) or Row (1="Alice", 2=11)? Former works fine, latter fails with SyntaxError: keyword can't be an expression.

HyukjinKwon · 2018-02-04T12:46:35Z

I meant things like this:

>>> from pyspark.sql import Row
>>> RowClass = Row(1)
>>> RowClass("a")
Row(1='a')

>>> spark.createDataFrame([RowClass("a")])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/session.py", line 686, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/.../spark/python/pyspark/sql/session.py", line 410, in _createFromLocal
    struct = self._inferSchemaFromList(data, names=schema)
  File "/.../spark/python/pyspark/sql/session.py", line 342, in _inferSchemaFromList
    schema = reduce(_merge_type, (_infer_schema(row, names) for row in data))
  File "/.../spark/python/pyspark/sql/session.py", line 342, in <genexpr>
    schema = reduce(_merge_type, (_infer_schema(row, names) for row in data))
  File "/.../spark/python/pyspark/sql/types.py", line 1099, in _infer_schema
    fields = [StructField(k, _infer_type(v), True) for k, v in items]
  File "/.../spark/python/pyspark/sql/types.py", line 407, in __init__
    assert isinstance(name, basestring), "field name should be string"
AssertionError: field name should be string

The reason I initially didn't suggest to use str is, it breaks unicode in Python 2 IIRC. For example,

str(u"아")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\uc544' in position 0: ordinal not in range(128)

HyukjinKwon · 2018-02-04T12:48:22Z

I think it still makes sense to produce a repr anyway because we successfully can create the instance for now but .. let me take a closer look within few days for sure.

ashashwat · 2018-02-04T13:35:27Z

@HyukjinKwon Here is what I tried:

# Code: return "<Row(%s)>" % ", ".join(fields.encode("utf8") for fields in self)
>>> Row (u"아", "11")
<Row(아, 11)>
# Fails for integer fields.

# Code: return "<Row(%s)>" % ", ".join(str(fields) for fields in self)
>>> Row (u"아", "11")
UnicodeEncodeError: 'ascii' codec can't encode character u'\uc544' in position 0: ordinal not in range(128)

# Code: return "<Row(%s)>" % ", ".join(repr(fields) for fields in self)
>>> Row (u"아", 11)
<Row(u'\uc544', 11)>

# Code: return "<Row(%s)>" % ", ".join(unicode(fields).encode("utf8") for fields in self)
>>> Row (u"아", 11)
<Row(아, 11)>

repr is definitely a better option than str. But why not unicode?

HyukjinKwon · 2018-02-04T13:41:47Z

unicode(fields).encode("utf8"): in this case, we will try to decode it by system default encoding first and then encode it by udf-8 if the input is str (bytes). So, for example, I think unicode("아") case won't work.

HyukjinKwon · 2018-02-04T13:44:46Z

Check if it's unicode and convert, etc. might also work ..

ashashwat · 2018-02-04T13:48:14Z

@HyukjinKwon return "<Row(%s)>" % ", ".join("%s" % (fields) for fields in self) takes care of everything.


>>> Row ("aa", 11)
<Row(aa, 11)>

>>> Row (u"아", 11)
<Row(아, 11)>

>>> Row ("아", 11)
<Row(아, 11)>

felixcheung · 2018-02-07T05:18:11Z

Jenkins, ok to test

SparkQA · 2018-02-07T05:54:52Z

Test build #87143 has finished for PR 20503 at commit 890aa65.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ashashwat · 2018-02-08T14:42:53Z

@HyukjinKwon Should I add more tests covering Unicode?

felixcheung · 2018-06-09T03:54:28Z

we still need to fix this, right?

HyukjinKwon · 2018-06-09T03:57:02Z

ok to test

SparkQA · 2018-06-09T04:29:04Z

Test build #91594 has finished for PR 20503 at commit 890aa65.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2018-09-27T14:48:23Z

Sure let's add a test with a unicode string to it if there's concern about that and make sure the existing repr with named fields is covered the same test case since I don't see an existing explicit test for that (although it's probably covered implicitly elsewhere).

holdenk · 2018-09-27T14:49:19Z

I think this could be good to backport into 2.4 assuming the current RC fails if @ashashwat has the chance to update it and no one sees any issues with including this in a backport to that branch.

holdenk · 2018-10-04T14:50:44Z

Gentle ping

holdenk · 2018-10-12T17:43:14Z

Gentle ping again to @ashashwat . Also @HyukjinKwon what are your opinions on the test coverage?

HyukjinKwon · 2018-10-17T09:18:11Z

python/pyspark/sql/tests.py

        self.assertEqual(len(row), 0)

+    def test_row_without_column_name(self):
+        row = Row("Alice", 11)


Can we add a doctest for this usage (Row as objects not as a namedtuple class), and documentation in Row at types.py?

HyukjinKwon · 2018-10-17T09:22:43Z

python/pyspark/sql/tests.py


+    def test_row_without_column_name(self):
+        row = Row("Alice", 11)
+        self.assertEqual(row.__repr__(), "<Row(Alice, 11)>")


I would test non-ascii compatible characters as well

HyukjinKwon · 2018-10-17T09:23:33Z

python/pyspark/sql/types.py

                                         for k, v in zip(self.__fields__, tuple(self)))
        else:
-            return "<Row(%s)>" % ", ".join(self)
+            return "<Row(%s)>" % ", ".join("%s" % (fields) for fields in self)


nit fields => field

"%s" % (fields) for fields in self -> "%s" % field for field in self

HyukjinKwon · 2018-10-17T09:25:38Z

Looks good otherwise.

holdenk · 2018-10-26T16:28:54Z

Jenkins ok to test.
Gentle ping again to @ashashwat - are you still interested in this PR?

ashashwat · 2018-10-26T16:36:53Z

@holdenk I am on it.

HyukjinKwon · 2018-10-27T00:34:14Z

ok to test

SparkQA · 2018-10-27T01:10:56Z

Test build #98110 has finished for PR 20503 at commit 890aa65.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2018-11-02T15:56:04Z

Awesome, thanks. Let me know if I can help :)

HyukjinKwon · 2018-11-11T03:41:08Z

ping @ashashwat to update

AmplabJenkins · 2019-02-25T12:21:11Z

Can one of the admins verify this patch?

srowen · 2019-02-25T15:50:43Z

@ashashwat are you able to update this?

tbcs · 2019-04-17T19:23:40Z

By the way, this is not just an annoyance for interactive use: I bumped into this issue while trying to understand failing tests (run via pytest). Having a broken __repr__ is rather annoying in such a situation.

output for a failing test with broken __repr__:

E       assert [<Row(foo)>] == [<[TypeError("sequence ...ect at 0x7fa1d24a4240>]
E         (pytest_assertion plugin: representation of details failed.  Probably an object has a faulty __repr__.)
E         /home/user/projects/foobar/venv/lib/python3.6/site-packages/pyspark/sql/types.py:1552: TypeError: sequence item 0: expected str instance, NoneType found

output for a failing test with fixed __repr__:

E       assert [<Row(foo)>] == [<Row(None)>]
E         At index 0 diff: <Row(foo)> != <Row(None)>
E         Full diff:
E         - [<Row(foo)>]
E         ?       ^ ^
E         + [<Row(None)>]
E         ?

holdenk · 2019-04-19T23:29:32Z

@tbcs would you want to take over updating this PR? You could make a fork?

tbcs · 2019-04-24T12:14:19Z

@holdenk OK, please see #24448.

This is PR is meant to replace #20503, which lay dormant for a while. The solution in the original PR is still valid, so this is just that patch rebased onto the current master. Original summary follows. ## What changes were proposed in this pull request? Fix `__repr__` behaviour for Rows. Rows `__repr__` assumes data is a string when column name is missing. Examples, ``` >>> from pyspark.sql.types import Row >>> Row ("Alice", "11") <Row(Alice, 11)> >>> Row (name="Alice", age=11) Row(age=11, name='Alice') >>> Row ("Alice", 11) <snip stack trace> TypeError: sequence item 1: expected string, int found ``` This is because Row () when called without column names assumes everything is a string. ## How was this patch tested? Manually tested and a unit test was added to `python/pyspark/sql/tests/test_types.py`. Closes #24448 from tbcs/SPARK-23299. Lead-authored-by: Tibor Csögör <[email protected]> Co-authored-by: Shashwat Anand <[email protected]> Signed-off-by: Holden Karau <[email protected]>

Unicode fix.

890aa65

HyukjinKwon reviewed Oct 17, 2018

View reviewed changes

tbcs mentioned this pull request Apr 24, 2019

[SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for Rows #24448

Closed

srowen closed this Apr 24, 2019

[SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for Rows. #20503

[SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for Rows. #20503

Uh oh!

Conversation

ashashwat commented Feb 4, 2018

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

HyukjinKwon commented Feb 4, 2018

Uh oh!

ashashwat commented Feb 4, 2018

Uh oh!

HyukjinKwon commented Feb 4, 2018

Uh oh!

HyukjinKwon commented Feb 4, 2018

Uh oh!

ashashwat commented Feb 4, 2018

Uh oh!

HyukjinKwon commented Feb 4, 2018

Uh oh!

HyukjinKwon commented Feb 4, 2018

Uh oh!

ashashwat commented Feb 4, 2018

Uh oh!

felixcheung commented Feb 7, 2018

Uh oh!

SparkQA commented Feb 7, 2018

Uh oh!

ashashwat commented Feb 8, 2018

Uh oh!

felixcheung commented Jun 9, 2018

Uh oh!

HyukjinKwon commented Jun 9, 2018

Uh oh!

SparkQA commented Jun 9, 2018

Uh oh!

holdenk commented Sep 27, 2018

Uh oh!

holdenk commented Sep 27, 2018

Uh oh!

holdenk commented Oct 4, 2018

Uh oh!

holdenk commented Oct 12, 2018

Uh oh!

HyukjinKwon Oct 17, 2018

Choose a reason for hiding this comment

Uh oh!

HyukjinKwon Oct 17, 2018

Choose a reason for hiding this comment

Uh oh!

HyukjinKwon Oct 17, 2018

Choose a reason for hiding this comment

Uh oh!

HyukjinKwon Oct 17, 2018

Choose a reason for hiding this comment

Uh oh!

HyukjinKwon commented Oct 17, 2018

Uh oh!

holdenk commented Oct 26, 2018

Uh oh!

ashashwat commented Oct 26, 2018

Uh oh!

HyukjinKwon commented Oct 27, 2018

Uh oh!

SparkQA commented Oct 27, 2018

Uh oh!

holdenk commented Nov 2, 2018

Uh oh!

HyukjinKwon commented Nov 11, 2018

Uh oh!

AmplabJenkins commented Feb 25, 2019

Uh oh!

srowen commented Feb 25, 2019

Uh oh!

tbcs commented Apr 17, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

holdenk commented Apr 19, 2019

Uh oh!

tbcs commented Apr 24, 2019

[SPARK-23299][SQL][PYSPARK] Fix repr behaviour for Rows. #20503

[SPARK-23299][SQL][PYSPARK] Fix repr behaviour for Rows. #20503

tbcs commented Apr 17, 2019 •

edited

Loading