[SPARK-52976][PYTHON] Fix Python UDF not accepting collated string as input param/return type #51688

ilicmarkodb · 2025-07-28T14:07:02Z

What changes were proposed in this pull request?

Fix Python UDF not accepting collated strings as input param/return type.

Why are the changes needed?

Bug fix.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

No.

python/pyspark/sql/tests/test_udf.py

python/pyspark/sql/tests/test_udtf.py

python/pyspark/core/context.py

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala

…ypes

### What changes were proposed in this pull request?

Changing the behavior of collated string types to return their collation in the `toJson` methods and to still keep backwards compatibility with older engine versions reading tables with collations by propagating this fix upstream in `StructField` where the collation will be removed from the type but still kept in the metadata.

### Why are the changes needed?

Old way of handling `toJson` meant that collated string types will not be able to be serialized and deserialized correctly unless they are a part of `StructField`. Initially, we thought that this is not a big deal, but then later we faced some issues regarding this, especially in pyspark which uses json primarily to parse types back and forth. This could avoid hacky changes in future like the one in #51688 without changing any behavior for how tables/schemas work.

### Does this PR introduce _any_ user-facing change?

Technically yes, but it is a small change that should not impact any queries, just how StringType is represented when not in a StructField object.

### How was this patch tested?

New and existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #51850 from stefankandic/fixStringJson.

Authored-by: Stefan Kandic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
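The serialization split described in this commit message can be illustrated with a toy model. This is only a sketch under stated assumptions, not Spark's implementation: the helper functions, the `"string collate <name>"` text form, and the `COLLATION_KEY` metadata key are all hypothetical names chosen for illustration. It shows a standalone string type carrying its collation in its own JSON form, while a struct field keeps the plain `"string"` type and stores the collation in metadata so that older readers still see a type they understand.

```python
import json

# Hypothetical metadata key and text form, chosen only for this sketch.
COLLATION_KEY = "collations"
TYPE_PREFIX = "string collate "


def type_to_json(collation=None):
    """Serialize a standalone string type; keep a non-default collation."""
    return "string" if collation is None else TYPE_PREFIX + collation


def type_from_json(value):
    """Parse a standalone string type back into its collation (or None)."""
    if value == "string":
        return None
    if value.startswith(TYPE_PREFIX):
        return value[len(TYPE_PREFIX):]
    raise ValueError(f"not a string type: {value!r}")


def field_to_json(name, collation=None):
    """Serialize a struct field: plain type, collation kept in metadata."""
    metadata = {} if collation is None else {COLLATION_KEY: {name: collation}}
    return {"name": name, "type": "string", "metadata": metadata}


def field_from_json(obj):
    """Recover (name, collation) from a serialized struct field."""
    collation = obj["metadata"].get(COLLATION_KEY, {}).get(obj["name"])
    return obj["name"], collation


# Round-trip both forms through real JSON text.
standalone = type_from_json(json.loads(json.dumps(type_to_json("UTF8_LCASE"))))
field_json = json.loads(json.dumps(field_to_json("col1", "UTF8_LCASE")))
print(standalone, field_json["type"], field_from_json(field_json))
```

In this model the collation survives both round trips, yet a reader that only inspects the field's `type` entry still sees a plain string type.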

…ypes (same commit message as above; cherry picked from commit 19ea6ff) Signed-off-by: Wenchen Fan <[email protected]>

stefankandic

LGTM!

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala

sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala

cloud-fan · 2025-08-08T02:23:10Z

thanks, merging to master!

cloud-fan · 2025-08-08T02:26:19Z

@ilicmarkodb can you open a backport PR against branch-4.0?

zhengruifeng · 2025-08-27T11:59:07Z

python/pyspark/sql/tests/test_udtf.py

                    udtf(TestUDTF, returnType=ret_type)().collect()


+def test_udtf_with_collated_string_types(self):


@ilicmarkodb the indentation here is wrong, so this test is actually skipped. It should be put into a Mixin class like `BaseUDTFTestsMixin`.

#52001

I opened a PR to fix this. I’ll finish it and tag you for review once the CI is green.
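Why a mis-indented test is silently skipped can be shown with a small standalone sketch. It assumes the over-indented `def` ended up nested inside the preceding test method. The class and method names below are stand-ins for illustration, not the actual contents of `test_udtf.py`.

```python
import unittest


class BaseUDTFTestsMixin:
    def test_discovered(self):
        self.assertTrue(True)

    def test_outer(self):
        # Over-indented by mistake: this def is a local function inside
        # test_outer, not a method of the class, so unittest never sees it.
        def test_nested_by_mistake(self):
            raise AssertionError("never runs")


class UDTFTests(BaseUDTFTestsMixin, unittest.TestCase):
    pass


def discovered_test_names(case_cls):
    """Return the test method names unittest's loader finds on case_cls."""
    return list(unittest.TestLoader().getTestCaseNames(case_cls))


names = discovered_test_names(UDTFTests)
print(names)
```

The nested function is never collected, so a failing assertion inside it cannot fail the suite. Moving the `def` to class level in the mixin makes the loader pick it up.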

zhengruifeng · 2025-08-27T12:14:51Z

python/pyspark/sql/tests/test_udtf.py

+    )
+    df = self.spark.createDataFrame([("hello",) * 4], schema=schema)
+
+    df_out = df.select(MyUDTF(df.col1, df.col2, df.col3, df.col4).alias("out"))


does this query work? I guess it should be a lateralJoin?

It doesn’t. I just didn’t realize that, since the test wasn't executed.

PTAL #51688

…ypes (same commit message as above, referencing apache#51688 and closing apache#51850; cherry picked from commit e62106d) Signed-off-by: Wenchen Fan <[email protected]>

github-actions bot added SQL PYTHON labels Jul 28, 2025

ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch from 2584dab to 2f1bee5 Compare July 28, 2025 14:35

stefankandic reviewed Jul 28, 2025

python/pyspark/sql/tests/test_udf.py (outdated)

ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch from 2f1bee5 to 0f47248 Compare July 28, 2025 17:26

ilicmarkodb requested a review from stefankandic July 28, 2025 17:29

ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch from 0f47248 to 94be795 Compare July 28, 2025 17:33

allisonwang-db reviewed Jul 28, 2025

python/pyspark/sql/tests/test_udtf.py (outdated)

ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch 2 times, most recently from cc8c888 to b2761fe Compare July 29, 2025 10:23

ilicmarkodb requested a review from allisonwang-db July 29, 2025 10:25

ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch from b2761fe to a1f4c93 Compare July 29, 2025 12:06

HyukjinKwon changed the title ~~[SPARK-52976][Python] Fix Python UDF not accepting collated string as input param~~ [SPARK-52976][PYTHON] Fix Python UDF not accepting collated string as input param Jul 29, 2025

ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch 11 times, most recently from 01ce67e to e1309ad Compare August 1, 2025 22:13

ilicmarkodb changed the title ~~[SPARK-52976][PYTHON] Fix Python UDF not accepting collated string as input param~~ [SPARK-52976][PYTHON] Fix Python UDF not accepting collated string as input param/return type Aug 2, 2025

ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch 4 times, most recently from 90d978d to 4e73996 Compare August 4, 2025 11:42

github-actions bot added the CONNECT label Aug 4, 2025

stefankandic reviewed Aug 5, 2025

python/pyspark/core/context.py (outdated)

stefankandic reviewed Aug 5, 2025

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala (outdated)

ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch 4 times, most recently from b594689 to 72237fa Compare August 5, 2025 13:05

stefankandic mentioned this pull request Aug 5, 2025

[SPARK-53130][SQL][PYTHON] Fix toJson behavior of collated string types #51850

Closed

ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch 2 times, most recently from a7a2161 to 3aeac3b Compare August 6, 2025 21:13

cloud-fan approved these changes Aug 7, 2025

stefankandic approved these changes Aug 7, 2025

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala (outdated)

ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch from 3aeac3b to 767264d Compare August 7, 2025 09:11

stefankandic reviewed Aug 7, 2025

sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala (outdated)

stefankandic reviewed Aug 7, 2025

sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala (outdated)

ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch 4 times, most recently from 7fec8b9 to b178220 Compare August 7, 2025 13:22

temp

41eb246

ilicmarkodb force-pushed the fix_collated_string_as_input_of_python_udf branch from b178220 to 41eb246 Compare August 7, 2025 14:15

allisonwang-db approved these changes Aug 7, 2025

cloud-fan closed this in 6b1f1a6 Aug 8, 2025

zhengruifeng reviewed Aug 27, 2025

5 participants
