
Conversation


xinrong-meng (Member) commented on Apr 10, 2023

What changes were proposed in this pull request?

Implement Arrow-optimized Python UDFs in Spark Connect.

Please see #39384 for motivation and performance improvements of Arrow-optimized Python UDFs.

Why are the changes needed?

Parity with vanilla PySpark.

Does this PR introduce any user-facing change?

Yes. In the Spark Connect Python Client, users can:

  1. Set the useArrow parameter to True to enable Arrow optimization for a specific Python UDF.
>>> df = spark.range(2)
>>> df.select(udf(lambda x : x + 1, useArrow=True)('id')).show()
+------------+                                                                  
|<lambda>(id)|
+------------+
|           1|
|           2|
+------------+

# ArrowEvalPython indicates Arrow optimization
>>> df.select(udf(lambda x : x + 1, useArrow=True)('id')).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#18 AS <lambda>(id)#16]
+- ArrowEvalPython [<lambda>(id#14L)#15], [pythonUDF0#18], 200
   +- *(1) Range (0, 2, step=1, splits=1)
  2. Enable the spark.sql.execution.pythonUDF.arrow.enabled Spark conf to make all Python UDFs Arrow-optimized.
>>> spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", True)
>>> df.select(udf(lambda x : x + 1)('id')).show()
+------------+                                                                  
|<lambda>(id)|
+------------+
|           1|
|           2|
+------------+

# ArrowEvalPython indicates Arrow optimization
>>> df.select(udf(lambda x : x + 1)('id')).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#30 AS <lambda>(id)#28]
+- ArrowEvalPython [<lambda>(id#26L)#27], [pythonUDF0#30], 200
   +- *(1) Range (0, 2, step=1, splits=1)

How was this patch tested?

Parity unit tests.
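
For context, a Connect parity test generally reuses the vanilla test mixin and runs it through a remote (Spark Connect) session rather than re-implementing the cases. A minimal, hedged sketch of that pattern follows; the mixin module and class names are assumptions for illustration, not necessarily the ones added in this PR:

# Hedged sketch of a Connect parity test: the vanilla Arrow-optimized Python
# UDF test cases are pulled in as a mixin and executed through the Connect client.
# Assumptions: the mixin module/class names below; ReusedConnectTestCase is the
# base class that provides the remote-session fixture.
import unittest

from pyspark.sql.tests.test_arrow_python_udf import PythonUDFArrowTestsMixin  # assumed name
from pyspark.testing.connectutils import ReusedConnectTestCase


class ArrowPythonUDFParityTests(PythonUDFArrowTestsMixin, ReusedConnectTestCase):
    """Runs the vanilla PySpark Arrow UDF test cases against a Connect session."""


if __name__ == "__main__":
    unittest.main()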

SPARK-40307

xinrong-meng (Member Author)


Ignoring the type annotations of _create_arrow_py_udf because it is shared between vanilla PySpark and Spark Connect Python Client.

xinrong-meng (Member Author)


The function is only an extraction of original code L142 - L179 for code reuse.

xinrong-meng (Member Author)


The _create_py_udf logic is duplicated between the Spark Connect Python Client and vanilla PySpark, except for how the active SparkSession is fetched.
However, to keep a clear separation and abstraction between the two code paths, I decided not to refactor it for now.
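
For illustration only, here is a hedged sketch of the decision the two code paths duplicate. The helper name choose_udf_kind is hypothetical; the conf key and the useArrow override are the ones described in this PR:

from typing import Optional


def choose_udf_kind(session, use_arrow: Optional[bool] = None) -> str:
    """Illustrative only: which kind of Python UDF would be created.

    Both clients share this decision; they differ only in how `session` is
    obtained (vanilla PySpark can use SparkSession.getActiveSession(), while
    the Connect client has its own session lookup).
    """
    if use_arrow is None:
        arrow_enabled = (
            session is not None
            and session.conf.get("spark.sql.execution.pythonUDF.arrow.enabled") == "true"
        )
    else:
        arrow_enabled = use_arrow
    return "arrow-optimized" if arrow_enabled else "regular"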

xinrong-meng (Member Author)

CI failed because of:

Run echo "APACHE_SPARK_REF=$(git rev-parse HEAD)" >> $GITHUB_ENV
fatal: detected dubious ownership in repository at '/__w/spark/spark'
To add an exception for this directory, call:

	git config --global --add safe.directory /__w/spark/spark
fatal: detected dubious ownership in repository at '/__w/spark/spark'
To add an exception for this directory, call:

	git config --global --add safe.directory /__w/spark/spark
Error: Process completed with exit code 128.

xinrong-meng force-pushed the connect_arrow_py_udf branch from 95cad25 to f6fc6e1 on April 17, 2023 at 20:56
xinrong-meng (Member Author)

@HyukjinKwon @zhengruifeng Would you please take a look? Thank you!

HyukjinKwon (Member)

cc @ueshin FYI

import pandas as pd
from pyspark.sql.pandas.functions import _create_pandas_udf

return_type = regular_udf.returnType
Contributor


It seems that regular_udf is only used to pass the returnType and evalType?

xinrong-meng (Member Author)


And regular_udf.func as well, based on the updated code.
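
Concretely, these are the only pieces of regular_udf the Arrow-optimized path reads, per the snippet and the discussion above; the wrapper function below is hypothetical and for illustration only:

def arrow_udf_inputs(regular_udf):
    # Illustrative only: the three attributes of the regular Python UDF that
    # the Arrow-optimized path consumes, per the review discussion above.
    return regular_udf.func, regular_udf.returnType, regular_udf.evalType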

zhengruifeng changed the title [SPARK-43082][Connect][PYTHON] Arrow-optimized Python UDFs in Spark Connect → [SPARK-43082][CONNECT][PYTHON] Arrow-optimized Python UDFs in Spark Connect on Apr 20, 2023
HyukjinKwon (Member)

Merged to master.
