[SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for Rows. #20503
Conversation
Row's __repr__ assumes every field is a string when column names are missing.
Examples:
>>> Row ("Alice", "11")
<Row(Alice, 11)>
>>> Row (name="Alice", age=11)
Row(age=11, name='Alice')
>>> Row ("Alice", 11)
<snip stack trace>
TypeError: sequence item 1: expected string, int found
This is because Row(), when called without column names, assumes
every field is a string.
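The failure can be reproduced without Spark. The sketch below uses a hypothetical stand-in class, `PlainRow`, which is not part of PySpark. It assumes only that the unnamed-field branch of `__repr__` passes the row straight to `str.join`, as in the line this PR changes. The fixed variant uses `str(field)` for brevity, whereas the patch itself uses `%s` formatting.

```python
class PlainRow(tuple):
    """Hypothetical stand-in for a Row created without column names."""

    def __new__(cls, *args):
        return super().__new__(cls, args)

    def broken_repr(self):
        # str.join accepts only strings, so a non-string field raises TypeError.
        return "<Row(%s)>" % ", ".join(self)

    def fixed_repr(self):
        # Convert each field to a string before joining.
        return "<Row(%s)>" % ", ".join(str(field) for field in self)


print(PlainRow("Alice", "11").broken_repr())  # all fields are strings: works
try:
    PlainRow("Alice", 11).broken_repr()
except TypeError as exc:
    print("TypeError:", exc)
print(PlainRow("Alice", 11).fixed_repr())
```

The exact wording of the `TypeError` message differs between Python 2 and Python 3, but the cause is the same in both.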
|
BTW, do non-string field names work in this namedtuple way? |
|
@HyukjinKwon Do you mean something like |
|
I meant things like this:
>>> from pyspark.sql import Row
>>> RowClass = Row(1)
>>> RowClass("a")
Row(1='a')
>>> spark.createDataFrame([RowClass("a")])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/session.py", line 686, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/.../spark/python/pyspark/sql/session.py", line 410, in _createFromLocal
    struct = self._inferSchemaFromList(data, names=schema)
  File "/.../spark/python/pyspark/sql/session.py", line 342, in _inferSchemaFromList
    schema = reduce(_merge_type, (_infer_schema(row, names) for row in data))
  File "/.../spark/python/pyspark/sql/session.py", line 342, in <genexpr>
    schema = reduce(_merge_type, (_infer_schema(row, names) for row in data))
  File "/.../spark/python/pyspark/sql/types.py", line 1099, in _infer_schema
    fields = [StructField(k, _infer_type(v), True) for k, v in items]
  File "/.../spark/python/pyspark/sql/types.py", line 407, in __init__
    assert isinstance(name, basestring), "field name should be string"
AssertionError: field name should be string
The reason I initially didn't suggest to use |
|
I think it still makes sense to produce a repr anyway, because we can successfully create the instance for now, but .. let me take a closer look within a few days for sure. |
|
@HyukjinKwon Here is what I tried: repr is definitely a better option than str. But why not unicode? |
|
|
|
Check if it's |
|
@HyukjinKwon |
|
Jenkins, ok to test |
|
Test build #87143 has finished for PR 20503 at commit
|
|
@HyukjinKwon Should I add more tests covering Unicode? |
|
we still need to fix this, right? |
|
ok to test |
|
Test build #91594 has finished for PR 20503 at commit
|
|
Sure, let's add a test with a unicode string if there's concern about that, and make sure the existing repr with named fields is covered in the same test case, since I don't see an existing explicit test for that (although it's probably covered implicitly elsewhere). |
|
I think this could be good to backport into 2.4 assuming the current RC fails if @ashashwat has the chance to update it and no one sees any issues with including this in a backport to that branch. |
|
Gentle ping |
|
Gentle ping again to @ashashwat . Also @HyukjinKwon what are your opinions on the test coverage? |
| self.assertEqual(len(row), 0) | ||
|
|
||
| def test_row_without_column_name(self): | ||
| row = Row("Alice", 11) |
Can we add a doctest for this usage (Row as objects not as a namedtuple class), and documentation in Row at types.py?
|
|
||
| def test_row_without_column_name(self): | ||
| row = Row("Alice", 11) | ||
| self.assertEqual(row.__repr__(), "<Row(Alice, 11)>") |
I would test non-ascii compatible characters as well
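A test along these lines might look like the sketch below. It runs against a hypothetical stand-in class, `UnnamedRow`, rather than PySpark's `Row`, and it assumes Python 3, where `str` is already Unicode; the class name, test names, and the non-ASCII sample value are all illustrative.

```python
import unittest


class UnnamedRow(tuple):
    """Hypothetical stand-in, not PySpark's Row."""

    def __new__(cls, *args):
        return super().__new__(cls, args)

    def __repr__(self):
        # Stringify every field so that non-string values do not break join.
        return "<Row(%s)>" % ", ".join(str(field) for field in self)


class UnnamedRowReprTest(unittest.TestCase):
    def test_mixed_types(self):
        self.assertEqual(repr(UnnamedRow("Alice", 11)), "<Row(Alice, 11)>")

    def test_non_ascii(self):
        self.assertEqual(repr(UnnamedRow("Алиса", 11)), "<Row(Алиса, 11)>")


# Run the suite without calling sys.exit, so the script can continue afterwards.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(UnnamedRowReprTest)
result = unittest.TextTestRunner().run(suite)
```

On Python 2, which this PR also had to support, the non-ASCII case would additionally need to distinguish `str` from `unicode`; that is not covered here.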
| for k, v in zip(self.__fields__, tuple(self))) | ||
| else: | ||
| return "<Row(%s)>" % ", ".join(self) | ||
| return "<Row(%s)>" % ", ".join("%s" % (fields) for fields in self) |
nit fields => field
"%s" % (fields) for fields in self -> "%s" % field for field in self
|
Looks good otherwise. |
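A side note that was not raised in the review: in `"%s" % (fields)` the parentheses do not create a tuple, so both spellings above behave identically, and both fail when the field is itself a tuple (a nested Row is a tuple subclass). The sketch below illustrates this with two hypothetical helpers, `format_bare` and `format_wrapped`; neither name comes from the patch.

```python
def format_bare(field):
    # Same as "%s" % (field): a tuple operand is unpacked as the argument list.
    return "%s" % field


def format_wrapped(field):
    # A one-element tuple guarantees the field is treated as a single value.
    return "%s" % (field,)


print(format_bare(11))                # scalar fields are fine
print(format_wrapped(("a", 1)))       # tuple field formatted as one value
try:
    format_bare(("a", 1))
except TypeError as exc:
    print("TypeError:", exc)
```

Whether this matters for the patch depends on whether unnamed Rows are expected to hold tuple-valued fields; `str(field)` would sidestep the question entirely.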
|
Jenkins ok to test. |
|
@holdenk I am on it. |
|
ok to test |
|
Test build #98110 has finished for PR 20503 at commit
|
|
Awesome, thanks. Let me know if I can help :) |
|
ping @ashashwat to update |
|
Can one of the admins verify this patch? |
|
@ashashwat are you able to update this? |
|
By the way, this is not just an annoyance for interactive use: I bumped into this issue while trying to understand failing tests (run via pytest). Having a broken output for a failing test with fixed |
|
@tbcs would you want to take over updating this PR? You could make a fork? |
This PR is meant to replace #20503, which lay dormant for a while. The solution in the original PR is still valid, so this is just that patch rebased onto the current master. Original summary follows.

## What changes were proposed in this pull request?

Fix `__repr__` behaviour for Rows.

Rows `__repr__` assumes data is a string when column name is missing.

Examples,
```
>>> from pyspark.sql.types import Row
>>> Row ("Alice", "11")
<Row(Alice, 11)>
>>> Row (name="Alice", age=11)
Row(age=11, name='Alice')
>>> Row ("Alice", 11)
<snip stack trace>
TypeError: sequence item 1: expected string, int found
```

This is because Row () when called without column names assumes everything is a string.

## How was this patch tested?

Manually tested and a unit test was added to `python/pyspark/sql/tests/test_types.py`.

Closes #24448 from tbcs/SPARK-23299.

Lead-authored-by: Tibor Csögör <[email protected]>
Co-authored-by: Shashwat Anand <[email protected]>
Signed-off-by: Holden Karau <[email protected]>
What changes were proposed in this pull request?
Fix __repr__ behaviour for Rows.
Rows __repr__ assumes data is a string when column name is missing.
Examples,
This is because Row () when called without column names assumes
everything is a string.
How was this patch tested?
Manually tested, and a unit test was added in
python/pyspark/sql/tests.py.