[SPARK-44876][PYTHON] Fix Arrow-optimized Python UDF on Spark Connect #42568
LGTM, pending test
Sorry, but the PR title looks misleading to me. This PR technically implements a UDF feature rather than simply enabling UDF test code. Can we have a more intuitive PR title?
To the reviewers,
I don't disagree with backporting this PR, but I believe we need to give it a correct title instead of describing it as missing test coverage.
@dongjoon-hyun Thanks for reviewing this! Sure, I'll update the title and description. As for the backport, this feature was already implemented in 3.5 by #39384 and #40725. Unfortunately, it had a bug that we didn't notice because the test had not been enabled in CI.
Merged to master and branch-3.5
### What changes were proposed in this pull request?

Fixes Arrow-optimized Python UDF on Spark Connect.

Also enables the missing test `pyspark.sql.tests.connect.test_parity_arrow_python_udf`.

### Why are the changes needed?

`pyspark.sql.tests.connect.test_parity_arrow_python_udf` is not listed in `dev/sparktestsupport/modules.py`, and it fails when run manually.

```
======================================================================
ERROR [0.072s]: test_register (pyspark.sql.tests.connect.test_parity_arrow_python_udf.ArrowPythonUDFParityTests)
----------------------------------------------------------------------
Traceback (most recent call last):
...
pyspark.errors.exceptions.base.PySparkRuntimeError: [SCHEMA_MISMATCH_FOR_PANDAS_UDF] Result vector from pandas_udf was not the required length: expected 1, got 38.
```

The failure had not been captured because the test is missing from the `modules.py` file.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #42568 from ueshin/issues/SPARK-44876/test_parity_arrow_python_udf.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 75c0b8b)
Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?
Fixes Arrow-optimized Python UDF on Spark Connect.
Also enables the missing test `pyspark.sql.tests.connect.test_parity_arrow_python_udf`.
Why are the changes needed?
`pyspark.sql.tests.connect.test_parity_arrow_python_udf` is not listed in `dev/sparktestsupport/modules.py`, and it fails with a `SCHEMA_MISMATCH_FOR_PANDAS_UDF` error when run manually.
The failure had not been captured because the test is missing from the `modules.py` file.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests.
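
The root cause described above, a test module that exists on disk but is never listed for CI, can be guarded against with a simple consistency check. The following is only a minimal sketch of that idea and not how `dev/sparktestsupport/modules.py` works: the helper names `discover_test_modules`, `find_unregistered`, and `demo` are hypothetical, and the directory layout and registered set are made up for illustration.

```python
import tempfile
from pathlib import Path


def discover_test_modules(root: Path, package_prefix: str) -> set[str]:
    """Return dotted module names for every test_*.py file under root."""
    found = set()
    for path in root.rglob("test_*.py"):
        # e.g. connect/test_foo.py -> ("connect", "test_foo")
        relative = path.relative_to(root).with_suffix("")
        found.add(".".join((package_prefix, *relative.parts)))
    return found


def find_unregistered(root: Path, package_prefix: str, registered: set[str]) -> list[str]:
    """Return test modules found on disk that are absent from the registered set."""
    return sorted(discover_test_modules(root, package_prefix) - registered)


def demo() -> list[str]:
    # Build a throwaway directory tree standing in for a tests package.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "connect").mkdir()
        (root / "connect" / "test_parity_udf.py").touch()
        (root / "connect" / "test_parity_arrow_python_udf.py").touch()
        # Hypothetical registry: only one of the two modules is listed for CI.
        registered = {"pyspark.sql.tests.connect.test_parity_udf"}
        return find_unregistered(root, "pyspark.sql.tests", registered)


if __name__ == "__main__":
    print(demo())
```

Running a check like this in CI would surface an unlisted module at the time it is added, rather than when someone happens to run it manually.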