
Conversation

@ashashwat
Contributor

What changes were proposed in this pull request?

Fix __repr__ behaviour for Rows.

Row's __repr__ assumes the data is all strings when column names are missing.
Examples:

>>> from pyspark.sql.types import Row
>>> Row ("Alice", "11")
<Row(Alice, 11)>

>>> Row (name="Alice", age=11)
Row(age=11, name='Alice')

>>> Row ("Alice", 11)
<snip stack trace>
TypeError: sequence item 1: expected string, int found

This is because Row(), when called without column names, assumes every field is a string.
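To make the failure mode concrete, here is a minimal, self-contained sketch; FakeRow below is a toy stand-in, not the actual pyspark.sql.Row:

```python
class FakeRow(tuple):
    """Toy stand-in for pyspark.sql.Row holding positional values only."""

    def broken_repr(self):
        # Mirrors the current behaviour: str.join() requires every element
        # to already be a string, so an int field raises TypeError.
        return "<Row(%s)>" % ", ".join(self)

    def fixed_repr(self):
        # Formatting each field first works for ints, None, unicode, etc.
        return "<Row(%s)>" % ", ".join("%s" % field for field in self)


row = FakeRow(("Alice", 11))
try:
    print(row.broken_repr())
except TypeError as e:
    print("broken: %s" % e)              # e.g. sequence item 1: expected string, int found
print("fixed: %s" % row.fixed_repr())    # fixed: <Row(Alice, 11)>
```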

How was this patch tested?

Manually tested, and a unit test was added in python/pyspark/sql/tests.py.

@HyukjinKwon
Member

BTW, do non-string field names work in this namedtuple way?

@ashashwat
Contributor Author

@HyukjinKwon Do you mean something like Row(a=1, b=2, c=3) or Row(1="Alice", 2=11)? The former works fine; the latter fails with SyntaxError: keyword can't be an expression.

@HyukjinKwon
Member

I meant things like this:

>>> from pyspark.sql import Row
>>> RowClass = Row(1)
>>> RowClass("a")
Row(1='a')
>>> spark.createDataFrame([RowClass("a")])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/session.py", line 686, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/.../spark/python/pyspark/sql/session.py", line 410, in _createFromLocal
    struct = self._inferSchemaFromList(data, names=schema)
  File "/.../spark/python/pyspark/sql/session.py", line 342, in _inferSchemaFromList
    schema = reduce(_merge_type, (_infer_schema(row, names) for row in data))
  File "/.../spark/python/pyspark/sql/session.py", line 342, in <genexpr>
    schema = reduce(_merge_type, (_infer_schema(row, names) for row in data))
  File "/.../spark/python/pyspark/sql/types.py", line 1099, in _infer_schema
    fields = [StructField(k, _infer_type(v), True) for k, v in items]
  File "/.../spark/python/pyspark/sql/types.py", line 407, in __init__
    assert isinstance(name, basestring), "field name should be string"
AssertionError: field name should be string

The reason I initially didn't suggest using str is that it breaks unicode in Python 2, IIRC. For example:

str(u"아")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\uc544' in position 0: ordinal not in range(128)

@HyukjinKwon
Member

I think it still makes sense to produce a repr anyway, because we can successfully create the instance for now, but let me take a closer look within a few days for sure.

@ashashwat
Contributor Author

@HyukjinKwon Here is what I tried:

# Code: return "<Row(%s)>" % ", ".join(fields.encode("utf8") for fields in self)
>>> Row (u"아", "11")
<Row(아, 11)>
# Fails for integer fields.

# Code: return "<Row(%s)>" % ", ".join(str(fields) for fields in self)
>>> Row (u"아", "11")
UnicodeEncodeError: 'ascii' codec can't encode character u'\uc544' in position 0: ordinal not in range(128)

# Code: return "<Row(%s)>" % ", ".join(repr(fields) for fields in self)
>>> Row (u"아", 11)
<Row(u'\uc544', 11)>

# Code: return "<Row(%s)>" % ", ".join(unicode(fields).encode("utf8") for fields in self)
>>> Row (u"아", 11)
<Row(아, 11)>

repr is definitely a better option than str. But why not unicode?

@HyukjinKwon
Member

unicode(fields).encode("utf8"): in this case, we will try to decode it with the system default encoding first and then encode it as utf-8 if the input is str (bytes). So, for example, I think the unicode("아") case won't work.
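For context, a minimal Python 2-only illustration of that point (assuming the default ASCII codec and a UTF-8 encoded source file):

```python
# -*- coding: utf-8 -*-
# Python 2 only: unicode(bytes) first decodes with the system default codec
# (usually ASCII), so it fails on non-ASCII byte strings before
# .encode("utf8") is ever reached.
s = "아"  # a UTF-8 encoded byte string literal
try:
    unicode(s).encode("utf8")
except UnicodeDecodeError as e:
    print("implicit decode failed: %s" % e)

# Passing the input encoding explicitly avoids the implicit ASCII decode.
print(unicode(s, "utf8").encode("utf8"))
```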

@HyukjinKwon
Member

Checking whether it's unicode and converting, etc., might also work.

@ashashwat
Contributor Author

@HyukjinKwon return "<Row(%s)>" % ", ".join("%s" % (fields) for fields in self) takes care of everything.


>>> Row ("aa", 11)
<Row(aa, 11)>

>>> Row (u"아", 11)
<Row(아, 11)>

>>> Row ("아", 11)
<Row(아, 11)>
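Put together, the method in python/pyspark/sql/types.py would then look roughly like this (the named-field branch is shown only for context and is untouched by this change; this is a paraphrase, not the exact committed code):

```python
def __repr__(self):
    """Printable representation of Row used in a Python REPL."""
    if hasattr(self, "__fields__"):
        # Row created with column names, e.g. Row(age=11, name='Alice')
        return "Row(%s)" % ", ".join(
            "%s=%r" % (k, v) for k, v in zip(self.__fields__, tuple(self)))
    else:
        # Row created without column names, e.g. <Row(Alice, 11)>
        return "<Row(%s)>" % ", ".join("%s" % field for field in self)
```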

@felixcheung
Member

Jenkins, ok to test

@SparkQA

SparkQA commented Feb 7, 2018

Test build #87143 has finished for PR 20503 at commit 890aa65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ashashwat
Contributor Author

@HyukjinKwon Should I add more tests covering Unicode?

@felixcheung
Member

we still need to fix this, right?

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Jun 9, 2018

Test build #91594 has finished for PR 20503 at commit 890aa65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Sep 27, 2018

Sure, let's add a test with a unicode string if there's concern about that, and make sure the existing repr with named fields is covered in the same test case, since I don't see an existing explicit test for that (although it's probably covered implicitly elsewhere).

@holdenk
Contributor

holdenk commented Sep 27, 2018

I think this could be good to backport into 2.4, assuming the current RC fails, if @ashashwat has the chance to update it and no one sees any issues with including this in a backport to that branch.

@holdenk
Contributor

holdenk commented Oct 4, 2018

Gentle ping

@holdenk
Contributor

holdenk commented Oct 12, 2018

Gentle ping again to @ashashwat. Also, @HyukjinKwon, what is your opinion on the test coverage?

self.assertEqual(len(row), 0)

def test_row_without_column_name(self):
row = Row("Alice", 11)
Member

Can we add a doctest for this usage (Row as objects not as a namedtuple class), and documentation in Row at types.py?
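A possible shape for such a doctest (illustrative only; it would slot into the Row docstring in types.py alongside the existing examples):

```python
>>> from pyspark.sql import Row
>>> row = Row("Alice", 11)   # Row used as a plain row object, without field names
>>> row
<Row(Alice, 11)>
>>> row[0], row[1]
('Alice', 11)
```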


def test_row_without_column_name(self):
row = Row("Alice", 11)
self.assertEqual(row.__repr__(), "<Row(Alice, 11)>")
Member

I would test non-ascii compatible characters as well
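A sketch of that extra coverage (the method name and first assertion follow the existing test in tests.py; the non-ASCII assertion is the addition being suggested):

```python
def test_row_without_column_name(self):
    row = Row("Alice", 11)
    self.assertEqual(row.__repr__(), "<Row(Alice, 11)>")

    # Non-ASCII data should also produce a usable repr instead of raising.
    unicode_row = Row(u"아", 11)
    self.assertEqual(unicode_row.__repr__(), u"<Row(아, 11)>")
```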

for k, v in zip(self.__fields__, tuple(self)))
else:
return "<Row(%s)>" % ", ".join(self)
return "<Row(%s)>" % ", ".join("%s" % (fields) for fields in self)
Member

nit fields => field

Member

"%s" % (fields) for fields in self -> "%s" % field for field in self

@HyukjinKwon
Member

Looks good otherwise.

@holdenk
Contributor

holdenk commented Oct 26, 2018

Jenkins ok to test.
Gentle ping again to @ashashwat - are you still interested in this PR?

@ashashwat
Contributor Author

@holdenk I am on it.

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Oct 27, 2018

Test build #98110 has finished for PR 20503 at commit 890aa65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Nov 2, 2018

Awesome, thanks. Let me know if I can help :)

@HyukjinKwon
Member

ping @ashashwat to update

@AmplabJenkins

Can one of the admins verify this patch?

@srowen
Member

srowen commented Feb 25, 2019

@ashashwat are you able to update this?

@tbcs
Contributor

tbcs commented Apr 17, 2019

By the way, this is not just an annoyance for interactive use: I bumped into this issue while trying to understand failing tests (run via pytest). Having a broken __repr__ is rather annoying in such a situation.

output for a failing test with broken __repr__:

E       assert [<Row(foo)>] == [<[TypeError("sequence ...ect at 0x7fa1d24a4240>]
E         (pytest_assertion plugin: representation of details failed.  Probably an object has a faulty __repr__.)
E         /home/user/projects/foobar/venv/lib/python3.6/site-packages/pyspark/sql/types.py:1552: TypeError: sequence item 0: expected str instance, NoneType found

output for a failing test with fixed __repr__:

E       assert [<Row(foo)>] == [<Row(None)>]
E         At index 0 diff: <Row(foo)> != <Row(None)>
E         Full diff:
E         - [<Row(foo)>]
E         ?       ^ ^
E         + [<Row(None)>]
E         ?     

@holdenk
Contributor

holdenk commented Apr 19, 2019

@tbcs would you want to take over updating this PR? You could make a fork?

@tbcs
Contributor

tbcs commented Apr 24, 2019

@holdenk OK, please see #24448.

@srowen srowen closed this Apr 24, 2019
asfgit pushed a commit that referenced this pull request May 6, 2019
This PR is meant to replace #20503, which lay dormant for a while. The solution in the original PR is still valid, so this is just that patch rebased onto the current master.

Original summary follows.

## What changes were proposed in this pull request?

Fix `__repr__` behaviour for Rows.

Row's `__repr__` assumes the data is all strings when column names are missing.
Examples:

```
>>> from pyspark.sql.types import Row
>>> Row ("Alice", "11")
<Row(Alice, 11)>

>>> Row (name="Alice", age=11)
Row(age=11, name='Alice')

>>> Row ("Alice", 11)
<snip stack trace>
TypeError: sequence item 1: expected string, int found
```

This is because `Row()`, when called without column names, assumes every field is a string.

## How was this patch tested?

Manually tested and a unit test was added to `python/pyspark/sql/tests/test_types.py`.

Closes #24448 from tbcs/SPARK-23299.

Lead-authored-by: Tibor Csögör <[email protected]>
Co-authored-by: Shashwat Anand <[email protected]>
Signed-off-by: Holden Karau <[email protected]>