- 
                Notifications
    You must be signed in to change notification settings 
- Fork 28.9k
[SPARK-21045][PYTHON] Allow non-ascii string as an exception message from python execution in Python 2 #25847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-21045][PYTHON] Allow non-ascii string as an exception message from python execution in Python 2 #25847
Conversation
| cc @HyukjinKwon, @ueshin and @cloud-fan | 
        
          
                python/pyspark/tests/test_worker.py
              
                Outdated
          
        
      | self.assertIsInstance(t.exception, Py4JJavaError) | ||
| if sys.version_info.major < 3: | ||
| # we have to use unicode here to avoid UnicodeDecodeError | ||
| self.assertRegexpMatches(unicode(t.exception).encode("utf-8"), "exception with 中") | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, str against Py4j exception doesn't properly handle non-ascii codes (py4j/py4j#308)
| ok to test | 
        
          
                python/pyspark/worker.py
              
                Outdated
          
        
      | except Exception: | ||
| try: | ||
| exc_info = traceback.format_exc() | ||
| if sys.version_info.major < 3: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise, let's drop this right after we drop Python 2, which I will do right after Spark 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good otherwise.
| Test build #110987 has finished for PR 25847 at commit  
 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, when does an exception have non-ASCII chars? when it reports a table name or input value from the user app?
| @srowen, for instance, users could manually throw an exception with python native function execution like udf or rdd. | 
| Test build #110997 has finished for PR 25847 at commit  
 | 
| Test build #111002 has finished for PR 25847 at commit  
 | 
        
          
                python/pyspark/worker.py
              
                Outdated
          
        
      |  | ||
| if sys.version >= '3': | ||
| basestring = str | ||
| unicode = str | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is necessary, see #25814 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then I think we don't need the comditionat 603 line
| Test build #111007 has finished for PR 25847 at commit  
 | 
| Test build #111009 has finished for PR 25847 at commit  
 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good except for a question.
| Test build #111077 has finished for PR 25847 at commit  
 | 
| Merged to master. | 
What changes were proposed in this pull request?
This PR allows non-ascii string as an exception message in Python 2 by explicitly en/decoding in case of
strin Python 2.Why are the changes needed?
Previously PySpark will hang when the
UnicodeDecodeErroroccurs and the real exception cannot be passed to the JVM side.See the reproducer as below:
Does this PR introduce any user-facing change?
User may not observe hanging for the similar cases.
How was this patch tested?
Added a new test and manually checking.
This pr is based on #18324, credits should also go to @dataknocker.
To make lint-python happy for python3, it also includes a followup fix for #25814