@ueshin
Member

@ueshin ueshin commented Aug 18, 2023

What changes were proposed in this pull request?

Fixes Arrow-optimized Python UDF on Spark Connect.

Also enables the missing test pyspark.sql.tests.connect.test_parity_arrow_python_udf.

Why are the changes needed?

pyspark.sql.tests.connect.test_parity_arrow_python_udf is not listed in dev/sparktestsupport/modules.py, and it fails when run manually.

======================================================================
ERROR [0.072s]: test_register (pyspark.sql.tests.connect.test_parity_arrow_python_udf.ArrowPythonUDFParityTests)
----------------------------------------------------------------------
Traceback (most recent call last):
...
pyspark.errors.exceptions.base.PySparkRuntimeError: [SCHEMA_MISMATCH_FOR_PANDAS_UDF] Result vector from pandas_udf was not the required length: expected 1, got 38.

The failure was not caught because the test is missing from dev/sparktestsupport/modules.py, so CI never ran it.
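
To illustrate how a gap like this could be detected, here is a minimal sketch that compares a list of test module names against the string literals found in a modules file. It is not part of this patch. The helper names (`registered_test_goals`, `unregistered_tests`), the `SAMPLE_MODULES_PY` snippet, and the registered suite name in it are all hypothetical; the sketch only assumes that test goals appear as dotted `pyspark.` string literals in the file.

```python
import ast


def registered_test_goals(modules_py_source: str) -> set:
    """Collect every string literal that looks like a dotted pyspark module name."""
    tree = ast.parse(modules_py_source)
    return {
        node.value
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant)
        and isinstance(node.value, str)
        and node.value.startswith("pyspark.")
    }


def unregistered_tests(test_modules, modules_py_source: str) -> list:
    """Return the test modules that never appear in the modules file, sorted."""
    registered = registered_test_goals(modules_py_source)
    return sorted(m for m in test_modules if m not in registered)


# Hypothetical stand-in for a modules file; the real file is much larger.
SAMPLE_MODULES_PY = '''
pyspark_connect = Module(
    name="pyspark-connect",
    python_test_goals=[
        "pyspark.sql.tests.connect.test_some_registered_suite",  # made-up name
    ],
)
'''

# Test modules assumed to exist on disk (the second one is from this PR).
tests_on_disk = [
    "pyspark.sql.tests.connect.test_some_registered_suite",
    "pyspark.sql.tests.connect.test_parity_arrow_python_udf",
]

missing = unregistered_tests(tests_on_disk, SAMPLE_MODULES_PY)
print(missing)
```

Parsing with `ast` rather than importing the file avoids executing it, so the check needs nothing beyond the source text.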

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

@ueshin
Member Author

ueshin commented Aug 18, 2023

cc @xinrong-meng @HyukjinKwon

@xinrong-meng
Member

LGTM, pending test

Member

@dongjoon-hyun dongjoon-hyun left a comment

Sorry, but the PR title looks misleading to me. This PR technically implements a UDF feature rather than simply enabling UDF test code. Can we have a more descriptive PR title?

@dongjoon-hyun
Member

To the reviewers,

  • This is filed as a blocker issue for Apache Spark 3.5.0.
  • However, this looks like a missing major feature that had not been implemented before the feature freeze.

I don't disagree with backporting this PR, but I believe we need a correct PR title instead of one that describes missing test coverage.

@ueshin
Member Author

ueshin commented Aug 19, 2023

@dongjoon-hyun Thanks for reviewing this!

Sure, I'll update the title and description. As for the backport, this feature is already implemented in 3.5 by #39384 and #40725. Unfortunately, it had a bug that we didn't notice because the test had not been enabled in CI.

@ueshin ueshin changed the title from "[SPARK-44876][PYTHON] Enable and fix test_parity_arrow_python_udf." to "[SPARK-44876][PYTHON] Fix Arrow-optimized Python UDF on Spark Connect" on Aug 19, 2023

@HyukjinKwon
Member

Merged to master and branch-3.5

HyukjinKwon pushed a commit that referenced this pull request Aug 21, 2023
Closes #42568 from ueshin/issues/SPARK-44876/test_parity_arrow_python_udf.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 75c0b8b)
Signed-off-by: Hyukjin Kwon <[email protected]>
valentinp17 pushed a commit to valentinp17/spark that referenced this pull request Aug 24, 2023
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024