-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-52976][PYTHON] Fix Python UDF not accepting collated string as input param/return type #51688
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-52976][PYTHON] Fix Python UDF not accepting collated string as input param/return type #51688
Conversation
2584dab to
2f1bee5
Compare
2f1bee5 to
0f47248
Compare
0f47248 to
94be795
Compare
cc8c888 to
b2761fe
Compare
b2761fe to
a1f4c93
Compare
01ce67e to
e1309ad
Compare
90d978d to
4e73996
Compare
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala
Outdated
Show resolved
Hide resolved
b594689 to
72237fa
Compare
…ypes ### What changes were proposed in this pull request? Changing the behavior of collated string types to return their collation in the `toJson` methods and to still keep backwards compatibility with older engine versions reading tables with collations by propagating this fix upstream in `StructField` where the collation will be removed from the type but still kept in the metadata. ### Why are the changes needed? Old way of handling `toJson` meant that collated string types will not be able to be serialized and deserialized correctly unless they are a part of `StructField`. Initially, we thought that this is not a big deal, but then later we faced some issues regarding this, especially in pyspark which uses json primarily to parse types back and forth. This could avoid hacky changes in future like the one in #51688 without changing any behavior for how tables/schemas work. ### Does this PR introduce _any_ user-facing change? Technically yes, but it is a small change that should not impact any queries, just how StringType is represented when not in a StructField object. ### How was this patch tested? New and existing unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #51850 from stefankandic/fixStringJson. Authored-by: Stefan Kandic <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…ypes ### What changes were proposed in this pull request? Changing the behavior of collated string types to return their collation in the `toJson` methods and to still keep backwards compatibility with older engine versions reading tables with collations by propagating this fix upstream in `StructField` where the collation will be removed from the type but still kept in the metadata. ### Why are the changes needed? Old way of handling `toJson` meant that collated string types will not be able to be serialized and deserialized correctly unless they are a part of `StructField`. Initially, we thought that this is not a big deal, but then later we faced some issues regarding this, especially in pyspark which uses json primarily to parse types back and forth. This could avoid hacky changes in future like the one in #51688 without changing any behavior for how tables/schemas work. ### Does this PR introduce _any_ user-facing change? Technically yes, but it is a small change that should not impact any queries, just how StringType is represented when not in a StructField object. ### How was this patch tested? New and existing unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #51850 from stefankandic/fixStringJson. Authored-by: Stefan Kandic <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 19ea6ff) Signed-off-by: Wenchen Fan <[email protected]>
a7a2161 to
3aeac3b
Compare
stefankandic
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala
Outdated
Show resolved
Hide resolved
3aeac3b to
767264d
Compare
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
Outdated
Show resolved
Hide resolved
7fec8b9 to
b178220
Compare
b178220 to
41eb246
Compare
|
thanks, merging to master! |
|
@ilicmarkodb can you open a backport PR against branch-4.0? |
| udtf(TestUDTF, returnType=ret_type)().collect() | ||
|
|
||
|
|
||
| def test_udtf_with_collated_string_types(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ilicmarkodb the indent here is wrong, this test is actually skipped. It should be put into a Mixin class like BaseUDTFTestsMixin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened a PR to fix this. I’ll finish it and tag you for review once the CI is green.
| ) | ||
| df = self.spark.createDataFrame([("hello",) * 4], schema=schema) | ||
|
|
||
| df_out = df.select(MyUDTF(df.col1, df.col2, df.col3, df.col4).alias("out")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this query work? I guess it should be a lateralJoin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn’t. I just didn’t realize that, since the test wasn't executed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL #51688
…ypes ### What changes were proposed in this pull request? Changing the behavior of collated string types to return their collation in the `toJson` methods and to still keep backwards compatibility with older engine versions reading tables with collations by propagating this fix upstream in `StructField` where the collation will be removed from the type but still kept in the metadata. ### Why are the changes needed? Old way of handling `toJson` meant that collated string types will not be able to be serialized and deserialized correctly unless they are a part of `StructField`. Initially, we thought that this is not a big deal, but then later we faced some issues regarding this, especially in pyspark which uses json primarily to parse types back and forth. This could avoid hacky changes in future like the one in apache#51688 without changing any behavior for how tables/schemas work. ### Does this PR introduce _any_ user-facing change? Technically yes, but it is a small change that should not impact any queries, just how StringType is represented when not in a StructField object. ### How was this patch tested? New and existing unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#51850 from stefankandic/fixStringJson. Authored-by: Stefan Kandic <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit e62106d) Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
Fix Python UDF not accepting collated strings as input param/return type.
Why are the changes needed?
Bug fix.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New tests.
Was this patch authored or co-authored using generative AI tooling?
No.