[SPARK-44876][PYTHON] Fix Arrow-optimized Python UDF on Spark Connect #42568

ueshin · 2023-08-18T20:33:08Z

What changes were proposed in this pull request?

Fixes Arrow-optimized Python UDF on Spark Connect.

Also enables the missing test pyspark.sql.tests.connect.test_parity_arrow_python_udf.
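For readers unfamiliar with the feature, the following is a minimal sketch of what an Arrow-optimized Python UDF looks like from the client side. It assumes PySpark 3.5 or later and a Spark Connect server reachable at the given address; the function name `run_arrow_udf_example`, the URL default, and the sample data are illustrative and are not taken from this PR.

```python
def run_arrow_udf_example(remote_url: str = "sc://localhost"):
    """Apply an Arrow-optimized Python UDF through a Spark Connect session."""
    # Imported lazily so this sketch can be loaded without PySpark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    # Assumption: a Spark Connect server is listening at remote_url.
    spark = SparkSession.builder.remote(remote_url).getOrCreate()

    # useArrow=True asks PySpark to use Arrow for (de)serializing UDF data.
    @udf(returnType=IntegerType(), useArrow=True)
    def str_len(s):
        return len(s)

    df = spark.createDataFrame([("spark",), ("connect",)], ["word"])
    return df.select(str_len("word").alias("n")).collect()
```

Calling `run_arrow_udf_example()` against a running Connect server exercises the code path this PR fixes; the sketch does not reproduce the bug itself.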

Why are the changes needed?

pyspark.sql.tests.connect.test_parity_arrow_python_udf is not listed in dev/sparktestsupport/modules.py, and it fails when running manually.

======================================================================
ERROR [0.072s]: test_register (pyspark.sql.tests.connect.test_parity_arrow_python_udf.ArrowPythonUDFParityTests)
----------------------------------------------------------------------
Traceback (most recent call last):
...
pyspark.errors.exceptions.base.PySparkRuntimeError: [SCHEMA_MISMATCH_FOR_PANDAS_UDF] Result vector from pandas_udf was not the required length: expected 1, got 38.
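The error message reflects a general invariant of vectorized UDFs: the function must return exactly one result per input row. The sketch below illustrates that check with plain Python sequences instead of pandas Series. `apply_batch` and `ResultLengthMismatch` are hypothetical names, and the example does not claim to reproduce the actual cause of the failure above.

```python
from typing import Callable, List, Sequence


class ResultLengthMismatch(RuntimeError):
    """Raised when a batch function returns the wrong number of results."""


def apply_batch(func: Callable[[Sequence], Sequence], batch: Sequence) -> List:
    """Run func on a whole batch and verify there is one output per input row."""
    result = func(batch)
    if len(result) != len(batch):
        raise ResultLengthMismatch(f"expected {len(batch)}, got {len(result)}")
    return list(result)


# A well-behaved batch function: one output per input.
ok = apply_batch(lambda rows: [r * 2 for r in rows], [1, 2, 3])

# A misbehaving one: it returns a single string, whose length is unrelated
# to the number of rows in the batch.
try:
    apply_batch(lambda rows: "not-a-vector", [1])
    mismatch = None
except ResultLengthMismatch as exc:
    mismatch = str(exc)
```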

The failure went undetected because the test is missing from the modules.py file, so CI never ran it.
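A gap like this can be caught mechanically by comparing the test files on disk with the registered list. The following is a rough sketch of such a check, not Spark's actual tooling: `find_unregistered_tests`, the `pkg.tests` package name, and the file names are all made up for illustration.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def find_unregistered_tests(
    test_dir: Path, package: str, registered: Iterable[str]
) -> List[str]:
    """Return dotted names of test_*.py files in test_dir that are not registered."""
    known = set(registered)
    found = (f"{package}.{path.stem}" for path in sorted(test_dir.glob("test_*.py")))
    return [name for name in found if name not in known]


# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    tests = Path(tmp)
    (tests / "test_alpha.py").write_text("")
    (tests / "test_beta.py").write_text("")
    missing = find_unregistered_tests(tests, "pkg.tests", ["pkg.tests.test_alpha"])
```

Here `missing` lists the one module that exists on disk but was never registered, which is the situation described above.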

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

ueshin · 2023-08-18T20:33:30Z

cc @xinrong-meng @HyukjinKwon

python/pyspark/sql/udf.py

python/pyspark/sql/connect/udf.py

xinrong-meng · 2023-08-18T21:19:59Z

LGTM, pending test

dongjoon-hyun

Sorry, but the PR title looks misleading to me. This PR technically implements a UDF feature rather than simply enabling UDF test code. Can we have a more intuitive PR title?

dongjoon-hyun · 2023-08-19T04:54:06Z

To the reviewers,

This is filed as a blocker issue for Apache Spark 3.5.0.
However, this looks like a missing major feature that had not been implemented before the feature freeze.

I don't disagree with backporting this PR, but I believe we need a correct PR title rather than one that only mentions missing test coverage.

ueshin · 2023-08-19T05:36:43Z

@dongjoon-hyun Thanks for reviewing this!

Sure, I'll update the title and description, but as for the backport, actually this is already implemented in 3.5 at #39384 and #40725. Unfortunately it had a bug and we didn't notice it because the test had not been activated for CI.

HyukjinKwon · 2023-08-21T00:20:07Z

Merged to master and branch-3.5

### What changes were proposed in this pull request?

Fixes Arrow-optimized Python UDF on Spark Connect.

Also enables the missing test `pyspark.sql.tests.connect.test_parity_arrow_python_udf`.

### Why are the changes needed?

`pyspark.sql.tests.connect.test_parity_arrow_python_udf` is not listed in `dev/sparktestsupport/modules.py`, and it fails when running manually.

```
======================================================================
ERROR [0.072s]: test_register (pyspark.sql.tests.connect.test_parity_arrow_python_udf.ArrowPythonUDFParityTests)
----------------------------------------------------------------------
Traceback (most recent call last):
...
pyspark.errors.exceptions.base.PySparkRuntimeError: [SCHEMA_MISMATCH_FOR_PANDAS_UDF] Result vector from pandas_udf was not the required length: expected 1, got 38.
```

The failure had not been captured because the test is missing in the `module.py` file.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #42568 from ueshin/issues/SPARK-44876/test_parity_arrow_python_udf.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 75c0b8b)
Signed-off-by: Hyukjin Kwon <[email protected]>

Enable and fix test_parity_arrow_python_udf.

5526efc

github-actions bot added SQL BUILD CORE PYTHON CONNECT labels Aug 18, 2023

ueshin commented Aug 18, 2023

python/pyspark/sql/udf.py (resolved)

xinrong-meng approved these changes Aug 18, 2023

xinrong-meng reviewed Aug 18, 2023

python/pyspark/sql/connect/udf.py (outdated, resolved)

Fix.

5e2e341

dongjoon-hyun reviewed Aug 19, 2023

ueshin changed the title ~~[SPARK-44876][PYTHON] Enable and fix test_parity_arrow_python_udf.~~ [SPARK-44876][PYTHON] Fix Arrow-optimized Python UDF on Spark Connect Aug 19, 2023

HyukjinKwon approved these changes Aug 21, 2023

HyukjinKwon closed this in 75c0b8b Aug 21, 2023
