[SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to handle str type properly in Python 2. #20507

ueshin · 2018-02-05T10:18:48Z

What changes were proposed in this pull request?

In Python 2, when a pandas_udf returns string values that were created in the UDF as plain str objects (for example, with a ".." literal), the execution fails. E.g.,

from pyspark.sql.functions import pandas_udf, col
import pandas as pd

df = spark.range(10)
str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
df.select(str_f(col('id'))).show()

raises the following exception:

...

java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: expected StringType, got BinaryType
	at scala.Predef$.assert(Predef.scala:170)
	at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:93)

...

It seems that pyarrow ignores the type parameter of pa.Array.from_pandas() and treats the data as binary type when the requested type is string and the values are str rather than unicode in Python 2.

This PR adds a workaround for this case.

How was this patch tested?

Added a test and existing tests.

ueshin · 2018-02-05T10:25:16Z

cc @BryanCutler @icexelloss @HyukjinKwon
Could you help me double-check this?
Since this seems to happen only in a Python 2 environment, Jenkins will skip the tests.
And let me know if you know of a better workaround.

SparkQA · 2018-02-05T10:56:26Z

Test build #87063 has finished for PR 20507 at commit 47b8873.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon

LGTM. I don't have a better idea. Just two nits I found while double checking.

HyukjinKwon · 2018-02-05T13:19:34Z

python/pyspark/sql/tests.py

+        import pandas as pd
+        df = self.spark.range(10)
+        str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), StringType())
+        res = df.select(str_f(col('id')))


How about variable names 'expected' and 'actual'?

Sure, I'll update it.

HyukjinKwon · 2018-02-05T13:33:54Z

python/pyspark/sql/tests.py

+        from pyspark.sql.functions import pandas_udf, col
+        import pandas as pd
+        df = self.spark.range(10)
+        str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), StringType())


Not a big deal. How about pd.Series(map(str, x))?

Sounds good. I'll take it.

SparkQA · 2018-02-05T14:46:31Z

Test build #87069 has finished for PR 20507 at commit 06ae568.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-02-05T16:03:36Z

python/pyspark/serializers.py

            return pa.Array.from_pandas(s, mask=mask).cast(t, safe=False)
+        elif t is not None and pa.types.is_string(t) and sys.version < '3':
+            # TODO: need decode before converting to Arrow in Python 2
+            return pa.Array.from_pandas(s.str.decode('utf-8'), mask=mask, type=t)


@ueshin, actually, how about s.apply(lambda v: v.decode("utf-8") if isinstance(v, str) else v), so that unicode values that cannot be encoded as ASCII, such as u"아", are allowed too? I was worried about performance, but I ran a simple perf test against s.str.decode('utf-8') to be sure, and it seems fine.

Good catch! I'll take it. Thanks!
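
The element-wise decode suggested in this thread can be sketched without pandas or pyarrow. This is a minimal illustration, not the code merged in the PR. It assumes Python 3, where bytes stands in for the Python 2 str type and str stands in for Python 2 unicode. The helper names decode_if_bytes and decode_all are hypothetical.

```python
def decode_if_bytes(value, encoding="utf-8"):
    """Decode byte strings to text; pass every other value through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def decode_all(values):
    """Apply decode_if_bytes to each element (a plain-list stand-in for Series.apply)."""
    return [decode_if_bytes(v) for v in values]


# A mix of byte strings, already-decoded text (including non-ASCII), and a null.
mixed = [b"0", "1", "아".encode("utf-8"), "아", None]
decoded = decode_all(mixed)
print(decoded)
```

The isinstance check is the point of the suggestion: values that are already text are left alone instead of being passed through decode, so non-ASCII text and byte strings can coexist in the same column.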

ueshin · 2018-02-06T02:03:07Z

It seems that pyarrow ignores the type parameter of pa.Array.from_pandas() and treats the data as binary type when the requested type is string and the values are str rather than unicode in Python 2.

@BryanCutler By the way, do you think this is a bug in pyarrow on Python 2?

SparkQA · 2018-02-06T02:31:17Z

Test build #87083 has finished for PR 20507 at commit b3d5209.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2018-02-06T05:26:16Z

also cc @cloud-fan @gatorsmile @sameeragarwal

BryanCutler · 2018-02-06T08:55:06Z

Sorry I've been travelling, but I'll try to look into this soon on the Arrow side to see if it is a bug in pyarrow. The workaround here seems fine to me.

HyukjinKwon · 2018-02-06T09:30:46Z

Merged to master and branch-2.3.

…() to handle str type properly in Python 2.

## What changes were proposed in this pull request?

In Python 2, when `pandas_udf` tries to return string type value created in the udf with `".."`, the execution fails. E.g.,

```python
from pyspark.sql.functions import pandas_udf, col
import pandas as pd

df = spark.range(10)
str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
df.select(str_f(col('id'))).show()
```

raises the following exception:

```
...

java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: expected StringType, got BinaryType
	at scala.Predef$.assert(Predef.scala:170)
	at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:93)

...
```

Seems like pyarrow ignores `type` parameter for `pa.Array.from_pandas()` and consider it as binary type when the type is string type and the string values are `str` instead of `unicode` in Python 2.

This pr adds a workaround for the case.

## How was this patch tested?

Added a test and existing tests.

Author: Takuya UESHIN <[email protected]>

Closes #20507 from ueshin/issues/SPARK-23334.

(cherry picked from commit 63c5bf1)
Signed-off-by: hyukjinkwon <[email protected]>

ueshin · 2018-02-06T09:37:58Z

Thanks! @HyukjinKwon @BryanCutler

BryanCutler · 2018-02-06T19:44:41Z

I made https://issues.apache.org/jira/browse/ARROW-2101 to track the issue in Arrow

Fix pandas_udf with return type StringType() to handle str type properly.

47b8873

HyukjinKwon approved these changes Feb 5, 2018

Address comments.

06ae568

HyukjinKwon reviewed Feb 5, 2018

Address a comment.

b3d5209

asfgit closed this in 63c5bf1 Feb 6, 2018

HyukjinKwon mentioned this pull request Feb 7, 2018

[SPARK-23352][PYTHON] Explicitly specify supported types in Pandas UDFs #20531

Closed
