@advancedxy
Contributor

@advancedxy advancedxy commented Sep 19, 2019

What changes were proposed in this pull request?

This PR allows a non-ASCII string as an exception message in Python 2 by explicitly encoding/decoding the message when it is a str in Python 2.
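
A minimal sketch of that explicit encoding/decoding step, not the exact patch. The helper name to_utf8_bytes is hypothetical, and the sketch assumes that any byte string it receives is meant to be UTF-8, replacing bytes that are not valid:

```python
import traceback


def to_utf8_bytes(text):
    """Return `text` as UTF-8 bytes without raising on non-ASCII input.

    Hypothetical helper for illustration; not part of PySpark.
    """
    if isinstance(text, bytes):
        # Python 2 `str` (or Python 3 `bytes`): the encoding is unknown, so
        # decode leniently and re-encode to guarantee valid UTF-8.
        return text.decode("utf-8", "replace").encode("utf-8")
    # Python 2 `unicode` or Python 3 `str`: a plain encode is enough.
    return text.encode("utf-8")


try:
    raise Exception(u"exception with \u4e2d")
except Exception:
    # The formatted traceback is now safe to write to a byte stream.
    payload = to_utf8_bytes(traceback.format_exc())
```

Checking the type with isinstance rather than the interpreter version keeps the same code path usable on both Python 2 and Python 3.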

Why are the changes needed?

Previously, PySpark would hang when a UnicodeDecodeError occurred, and the real exception could not be passed to the JVM side.

See the reproducer below:

# -*- coding: utf-8 -*-
from pyspark.sql import SparkSession

def f():
    raise Exception("中")
spark = SparkSession.builder.master('local').getOrCreate()
spark.sparkContext.parallelize([1]).map(lambda x: f()).count()
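
The underlying failure can be illustrated without Spark. The sketch below is an assumption-based illustration, not code from the patch: it uses Python 3 bytes to stand in for a Python 2 str, and shows that decoding UTF-8 bytes as ASCII (which Python 2 does implicitly when a byte string is mixed with unicode) raises UnicodeDecodeError:

```python
# Python 3 `bytes` stands in for a Python 2 `str` holding UTF-8 data.
message = u"\u4e2d".encode("utf-8")  # the UTF-8 bytes of "中"

try:
    # Python 2 performs this ASCII decode implicitly when a `str` is
    # combined with `unicode` or encoded directly.
    message.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Decoding with the matching codec succeeds.
recovered = message.decode("utf-8")
```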

Does this PR introduce any user-facing change?

Users should no longer observe hangs in similar cases.

How was this patch tested?

Added a new test and checked manually.

This PR is based on #18324; credit should also go to @dataknocker.
To make lint-python pass under Python 3, it also includes a follow-up fix for #25814.

@advancedxy
Contributor Author

cc @HyukjinKwon, @ueshin and @cloud-fan

self.assertIsInstance(t.exception, Py4JJavaError)
if sys.version_info.major < 3:
    # we have to use unicode here to avoid UnicodeDecodeError
    self.assertRegexpMatches(unicode(t.exception).encode("utf-8"), "exception with 中")
Member

Yes, calling str on a Py4J exception doesn't properly handle non-ASCII characters (py4j/py4j#308).

@HyukjinKwon
Member

ok to test

except Exception:
    try:
        exc_info = traceback.format_exc()
        if sys.version_info.major < 3:
Member

Likewise, let's drop this right after we drop Python 2, which I will do right after Spark 3.

Member

@HyukjinKwon HyukjinKwon left a comment

Looks good otherwise.

@HyukjinKwon changed the title from "[SPARK-21045][PYSPARK] Defensive check for exception info thrown by user" to "[SPARK-21045][PYTHON] Allow non-ascii string as an exception message in Python 2" on Sep 19, 2019
@SparkQA

SparkQA commented Sep 19, 2019

Test build #110987 has finished for PR 25847 at commit 90559c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment

Out of curiosity, when does an exception have non-ASCII chars? When it reports a table name or input value from the user app?

@HyukjinKwon
Member

@srowen, for instance, users could manually raise an exception inside native Python function execution, such as in a UDF or an RDD operation.

@SparkQA

SparkQA commented Sep 19, 2019

Test build #110997 has finished for PR 25847 at commit fb72447.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 19, 2019

Test build #111002 has finished for PR 25847 at commit ff7f248.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon changed the title from "[SPARK-21045][PYTHON] Allow non-ascii string as an exception message in Python 2" to "[SPARK-21045][PYTHON] Allow non-ascii string as an exception message from python execution in Python 2" on Sep 19, 2019

if sys.version >= '3':
    basestring = str
    unicode = str
Contributor Author

This is necessary, see #25814 (comment)

Member

Then I think we don't need the condition at line 603.
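
For context, the aliasing pattern discussed in this thread can be sketched as follows. This is a standalone illustration rather than the PySpark source; the helper is_text is hypothetical, and the version check uses sys.version_info instead of the string comparison shown in the diff:

```python
import sys

if sys.version_info[0] >= 3:
    # Python 3 has no `basestring` or `unicode`; alias them so shared code
    # (and linters run under Python 3) can still resolve the names.
    basestring = str
    unicode = str


def is_text(value):
    """Hypothetical helper: True for any string type on either version."""
    return isinstance(value, basestring)
```

With the aliases defined at module level, code that references unicode no longer needs its own name-resolution workaround under Python 3, though it may still need a version check if its behavior must differ between versions.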

@SparkQA

SparkQA commented Sep 19, 2019

Test build #111007 has finished for PR 25847 at commit 0652966.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 19, 2019

Test build #111009 has finished for PR 25847 at commit ffb4d29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@viirya viirya left a comment

Looks good except for a question.

@SparkQA

SparkQA commented Sep 20, 2019

Test build #111077 has finished for PR 25847 at commit d6ec7ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.
