Skip to content

Commit c9fcfd5

Browse files
viiryaJackey Lee
authored andcommitted
[SPARK-25461][PYSPARK][SQL] Add document for mismatch between return type of Pandas.Series and return type of pandas udf
## What changes were proposed in this pull request? For Pandas UDFs, we get arrow type from defined Catalyst return data type of UDFs. We use this arrow type to do serialization of data. If the defined return data type doesn't match with actual return type of Pandas.Series returned by Pandas UDFs, it has a risk to return incorrect data from Python side. Currently we don't have reliable approach to check if the data conversion is safe or not. We leave some document to notify this to users for now. When there is next upgrade of PyArrow available we can use to check it, we should add the option to check it. ## How was this patch tested? Only document change. Closes apache#22610 from viirya/SPARK-25461. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: hyukjinkwon <[email protected]>
1 parent 35e2c01 commit c9fcfd5

File tree

1 file changed

+6
-0
lines changed

1 file changed

+6
-0
lines changed

python/pyspark/sql/functions.py

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2948,6 +2948,12 @@ def pandas_udf(f=None, returnType=None, functionType=None):
29482948
can fail on special rows, the workaround is to incorporate the condition into the functions.
29492949
29502950
.. note:: The user-defined functions do not take keyword arguments on the calling side.
2951+
2952+
.. note:: The data type of returned `pandas.Series` from the user-defined functions should be
2953+
matched with defined returnType (see :meth:`types.to_arrow_type` and
2954+
:meth:`types.from_arrow_type`). When there is mismatch between them, Spark might do
2955+
conversion on returned data. The conversion is not guaranteed to be correct and results
2956+
should be checked for accuracy by users.
29512957
"""
29522958
# decorator @pandas_udf(returnType, functionType)
29532959
is_decorator = f is None or isinstance(f, (str, DataType))

0 commit comments

Comments
 (0)