Commit eec1a3c
[SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for Rows
This PR is meant to replace #20503, which lay dormant for a while. The solution in the original PR is still valid, so this is just that patch rebased onto the current master. Original summary follows.

## What changes were proposed in this pull request?

Fix `__repr__` behaviour for Rows. Row's `__repr__` assumes the data is a string when the column name is missing. Examples:

```
>>> from pyspark.sql.types import Row
>>> Row("Alice", "11")
<Row(Alice, 11)>
>>> Row(name="Alice", age=11)
Row(age=11, name='Alice')
>>> Row("Alice", 11)
<snip stack trace>
TypeError: sequence item 1: expected string, int found
```

This is because `Row()`, when called without column names, assumes everything is a string.

## How was this patch tested?

Manually tested, and a unit test was added to `python/pyspark/sql/tests/test_types.py`.

Closes #24448 from tbcs/SPARK-23299.

Lead-authored-by: Tibor Csögör <[email protected]>
Co-authored-by: Shashwat Anand <[email protected]>
Signed-off-by: Holden Karau <[email protected]>
1 parent 6ef4530 commit eec1a3c
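The bug and the fix can be sketched with a minimal stand-alone class. Note `MiniRow` below is a hypothetical stand-in for illustration, not the actual `pyspark.sql.types.Row` (which has much more machinery); it only mirrors the two `__repr__` branches the patch touches:

```python
class MiniRow(tuple):
    """Hypothetical stand-in for pyspark.sql.types.Row (illustration only)."""

    def __new__(cls, *args, **kwargs):
        if kwargs:
            # Keyword form: Row stores values sorted by field name.
            names = sorted(kwargs)
            row = tuple.__new__(cls, [kwargs[k] for k in names])
            row.__fields__ = names
            return row
        # Positional form: values only, no field names.
        return tuple.__new__(cls, args)

    def __repr__(self):
        if hasattr(self, "__fields__"):
            return "Row(%s)" % ", ".join("%s=%r" % (k, v)
                                         for k, v in zip(self.__fields__, tuple(self)))
        # Before the patch this branch was ", ".join(self), which raises
        # TypeError for non-string fields; using %r handles any value.
        return "<Row(%s)>" % ", ".join("%r" % field for field in self)


print(repr(MiniRow("Alice", 11)))           # <Row('Alice', 11)>
print(repr(MiniRow(name="Alice", age=11)))  # Row(age=11, name='Alice')
```

With the old `", ".join(self)` branch, the first call would have raised `TypeError: sequence item 1: expected string, int found`, exactly as in the report above.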

File tree

2 files changed: +25 −2 lines changed

python/pyspark/sql/tests/test_types.py

Lines changed: 12 additions & 0 deletions

```diff
@@ -1,3 +1,4 @@
+# -*- encoding: utf-8 -*-
 #
 # Licensed to the Apache Software Foundation (ASF) under one or more
 # contributor license agreements.  See the NOTICE file distributed with
@@ -739,6 +740,17 @@ def test_timestamp_microsecond(self):
         tst = TimestampType()
         self.assertEqual(tst.toInternal(datetime.datetime.max) % 1000000, 999999)
 
+    # regression test for SPARK-23299
+    def test_row_without_column_name(self):
+        row = Row("Alice", 11)
+        self.assertEqual(repr(row), "<Row('Alice', 11)>")
+
+        # test __repr__ with unicode values
+        if sys.version_info.major >= 3:
+            self.assertEqual(repr(Row("数", "量")), "<Row('数', '量')>")
+        else:
+            self.assertEqual(repr(Row(u"数", u"量")), r"<Row(u'\u6570', u'\u91cf')>")
+
     def test_empty_row(self):
         row = Row()
         self.assertEqual(len(row), 0)
```

python/pyspark/sql/types.py

Lines changed: 13 additions & 2 deletions

```diff
@@ -1435,13 +1435,24 @@ class Row(tuple):
 
     >>> Person = Row("name", "age")
     >>> Person
-    <Row(name, age)>
+    <Row('name', 'age')>
     >>> 'name' in Person
     True
     >>> 'wrong_key' in Person
     False
     >>> Person("Alice", 11)
     Row(name='Alice', age=11)
+
+    This form can also be used to create rows as tuple values, i.e. with unnamed
+    fields. Beware that such Row objects have different equality semantics:
+
+    >>> row1 = Row("Alice", 11)
+    >>> row2 = Row(name="Alice", age=11)
+    >>> row1 == row2
+    False
+    >>> row3 = Row(a="Alice", b=11)
+    >>> row1 == row3
+    True
     """
 
     def __new__(self, *args, **kwargs):
@@ -1549,7 +1560,7 @@ def __repr__(self):
             return "Row(%s)" % ", ".join("%s=%r" % (k, v)
                                          for k, v in zip(self.__fields__, tuple(self)))
         else:
-            return "<Row(%s)>" % ", ".join(self)
+            return "<Row(%s)>" % ", ".join("%r" % field for field in self)
```
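The equality caveat in the new docstring follows from `Row` being a `tuple` subclass: `==` compares positional values only and ignores field names, and keyword-constructed Rows store their values sorted by field name. A plain-tuple illustration of that behaviour (a sketch, not using PySpark itself):

```python
# Row subclasses tuple, so == compares element values positionally and
# never looks at field names. Keyword-built Rows sort fields by name,
# which is what makes the docstring's comparisons come out as they do.

row1 = ("Alice", 11)   # Row("Alice", 11): values kept in call order
row2 = (11, "Alice")   # Row(name="Alice", age=11): sorted keys -> (age, name)
row3 = ("Alice", 11)   # Row(a="Alice", b=11): sorted keys -> (a, b)

print(row1 == row2)  # False: kwarg sorting reordered the values
print(row1 == row3)  # True: same values in the same positions
```

This is why mixing positional and keyword `Row` construction in the same comparison is easy to get wrong: field names `a`/`b` versus `name`/`age` are irrelevant, only the resulting value order matters.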
