[SPARK-33415][PYTHON][SQL] Don't encode JVM response in Column.__repr__ #30322

zero323 · 2020-11-10T23:11:34Z

What changes were proposed in this pull request?

Removes encoding of the JVM response in pyspark.sql.column.Column.__repr__.

Why are the changes needed?

API consistency and improved readability of the expressions.

Does this PR introduce any user-facing change?

Before this change

col("abc")
col("wąż")

result in

Column<b'abc'>
Column<b'w\xc4\x85\xc5\xbc'>

After this change we'll get

Column<'abc'>
Column<'wąż'>
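The difference can be reproduced without Spark. This is a minimal sketch, assuming Python 3; `old_repr` and `new_repr` are hypothetical helper names that only mirror the two format expressions from the diff, with a plain string standing in for the value returned by the JVM.

```python
def old_repr(jvm_string: str) -> str:
    # Pre-change formatting: the string is encoded to bytes first, so %s
    # renders the bytes literal, including the b prefix and hex escapes.
    return 'Column<%s>' % jvm_string.encode('utf8')


def new_repr(jvm_string: str) -> str:
    # Post-change formatting: the string is interpolated directly and quoted.
    return "Column<'%s'>" % jvm_string


for name in ("abc", "wąż"):
    print(old_repr(name), "->", new_repr(name))
```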

How was this patch tested?

Existing tests and manual inspection.

zero323 · 2020-11-10T23:13:49Z

cc @davies (the current behavior was introduced in #4645, so I'd appreciate it if you could take a look and let me know if you have any feedback. TIA)

maropu · 2020-11-10T23:28:36Z

python/pyspark/sql/column.py


    def __repr__(self):
-        return 'Column<%s>' % self._jc.toString().encode('utf8')
+        return "Column<'%s'>" % self._jc.toString()


python3 uses utf8 for strings by default, so this change seems fine. cc: @HyukjinKwon @viirya @srowen

srowen · 2020-11-11

Do we have any more instances of decode()?

HyukjinKwon · 2020-11-11

Yeah, the change looks good, and yeah, are there more instances like this?

viirya · 2020-11-11

Seems fine. Is it originally for non printable characters in unicode?

zero323 · 2020-11-11

> Do we have any more instances of decode()?

We do a bit of encoding / decoding when we communicate with JVM, but purpose there is clear.

The only other place where we encode strings intended for user consumption is RDD.toDebugString. It is also something that could be fixed, as it messes with the output a bit (print won't respect line breaks).
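The line-break problem can be shown with plain Python. This is a rough sketch, assuming the debug string reaches the user as UTF-8 encoded bytes as described above; the `lineage` text is a made-up placeholder, not real toDebugString output.

```python
# Hypothetical two-line lineage description, used only for illustration.
lineage = "(2) child RDD\n |  parent RDD"

# Encoded form: printing bytes shows the literal, so the newline is
# rendered as an escape sequence and everything stays on one line.
encoded = lineage.encode("utf-8")
print(encoded)

# Decoded form: printing the str honors the line break.
decoded = encoded.decode("utf-8")
print(decoded)
```

With the encoded value, the caller has to decode before printing to get readable multi-line output.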

zero323 · 2020-11-11

> Seems fine. Is it originally for non printable characters in unicode?

I believe the point was to have a str object as the output instead of unicode. In Python 2, py4j returns JVM Strings as unicode, so the whole expression would evaluate to unicode, and, if I recall correctly, unicode returned from __repr__ wasn't handled correctly.

For example in IPython

>>> import sys
>>> sys.version_info
sys.version_info(major=2, minor=7, micro=15, releaselevel='final', serial=0)
>>> class Foo:
...     def __repr__(self):
...         return u"œ"
...
...
>>> Foo().__repr__()
u'\u0153'
>>> Foo()
Traceback (most recent call last):
  File "/path/to/lib/python2.7/site-packages/IPython/core/formatters.py", line 686, in __call__
    return repr(obj)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0153' in position 0: ordinal not in range(128)
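For contrast, here is a minimal sketch of the same class under Python 3, where str is already a Unicode type and `repr()` accepts a non-ASCII return value. `Foo` is the same throwaway class as in the session above.

```python
class Foo:
    def __repr__(self):
        # In Python 3 this is a plain str, so no encoding step is needed.
        return "œ"


shown = repr(Foo())
print(shown)
```

This is why the encode call is no longer needed once only Python 3 is supported.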

maropu · 2020-11-10T23:28:54Z

Why WIP? btw, could you add tests?

SparkQA · 2020-11-10T23:36:09Z

Test build #130892 has finished for PR 30322 at commit 49543e8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-10T23:45:12Z

Test build #130893 has finished for PR 30322 at commit 88f0c50.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-10T23:55:37Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35498/

zero323 · 2020-11-11T00:02:45Z

> Why WIP?

Just out of habit after working on docs.

> btw, could you add tests?

Of course, done.

SparkQA · 2020-11-11T00:19:46Z

Test build #130894 has finished for PR 30322 at commit ec72805.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-11T00:23:08Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35498/

SparkQA · 2020-11-11T00:24:55Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35499/

SparkQA · 2020-11-11T00:46:31Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35500/

SparkQA · 2020-11-11T00:49:36Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35499/

SparkQA · 2020-11-11T01:09:45Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35500/

HyukjinKwon · 2020-11-11T15:12:44Z

Let me merge this in for now, as this PR fixes what it aims to.

Merged to master.

zero323 · 2020-11-11T20:22:45Z

Thanks everyone!

Don't encode JVM response in Column.__repr__

49543e8

github-actions bot added CORE PYTHON SQL labels Nov 10, 2020

maropu reviewed Nov 10, 2020

Add test

88f0c50

Drop blank line

ec72805

zero323 marked this pull request as ready for review November 11, 2020 00:02

HyukjinKwon approved these changes Nov 11, 2020

HyukjinKwon closed this in 4b76a74 Nov 11, 2020

zero323 deleted the SPARK-33415 branch November 11, 2020 20:22
