-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-8283][SQL] Resolve udf_struct test failure in HiveCompatibilitySuite #6828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… name for non-NamedExpression children
|
There are 2 Struct UDFs in Hive: For the previous one, we will give the default names for its fields, but for the later, we will create the struct type by given names. We need to implement both of them, but this PR seems only for solve the bug of the former one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about StructField(s"col${idx + 1}", child.dataType, child.nullable, Metadata.empty) as Hive does (GenericUDFStruct)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CreateStruct here was also used by two org.apache.spark.sql.functions.struct and I did this above to avoid change the semantic of these two struct method. Maybe it's better to add another two class to act as GenericUDFStruct & GenericUDFNamedStruct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will not argue that they have some share logic, however, the difference between GenericUDFStruct and GenericUDFNamedStruct is, the former gives names for output fields as (col1, col2 ...), but the later requires specifying the output fields names from its arguments.
Probably we'd better create another expression class in Spark SQL for GenericUDFNamedStruct, mixing the functionality in a single class probably makes people confusing.
It will be great if you can create an issue under the expression umbrella task https://issues.apache.org/jira/browse/SPARK-8159
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, I would create a new ticket for named_struct in JIRA and take all discussions we had above into consideration. Thanks.
|
BTW that's would be great if you can add the |
|
@chenghao-intel, yes, I've noticed The previous test failure was because there exists a |
|
|
|
ok to test |
|
Test build #35051 has finished for PR 6828 at commit
|
|
Thanks. Merging in master. |
|
Thanks. |
…ySuite This PR aimed to resolve udf_struct test failure in HiveCompatibilitySuite. Currently, this is done by loosening CreateStruct's children type from NamedExpression to Expression and automatically generating StructField name for non-NamedExpression children. The naming convention for unnamed children follows the udf's counterpart in Hive: `col1, col2, col3, ...` Author: Yijie Shen <[email protected]> Closes apache#6828 from yijieshen/SPARK-8283 and squashes the following commits: 6052b73 [Yijie Shen] Doc fix 677e0b7 [Yijie Shen] Resolve udf_struct test failure by automatically generate structField name for non-NamedExpression children
This is a follow up of [SPARK-8283](https://issues.apache.org/jira/browse/SPARK-8283) ([PR-6828](#6828)), to support both `struct` and `named_struct` in Spark SQL. After [#6725](#6828), the semantic of [`CreateStruct`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypes.scala#L56) methods have changed a little and do not limited to cols of `NamedExpressions`, it will name non-NamedExpression fields following the hive convention, col1, col2 ... This PR would both loosen [`struct`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L723) to take children of `Expression` type and add `named_struct` support. Author: Yijie Shen <[email protected]> Closes #6874 from yijieshen/SPARK-8283 and squashes the following commits: 4cd3375 [Yijie Shen] change struct documentation d599d0b [Yijie Shen] rebase code 9a7039e [Yijie Shen] fix reviews and regenerate golden answers b487354 [Yijie Shen] replace assert using checkAnswer f07e114 [Yijie Shen] tiny fix 9613be9 [Yijie Shen] review fix 7fef712 [Yijie Shen] Fix checkInputTypes' implementation using foldable and nullable 60812a7 [Yijie Shen] Fix type check 828d694 [Yijie Shen] remove unnecessary resolved assertion inside dataType method fd3cd8e [Yijie Shen] remove type check from eval 7a71255 [Yijie Shen] tiny fix ccbbd86 [Yijie Shen] Fix reviews 47da332 [Yijie Shen] remove nameStruct API from DataFrame 917e680 [Yijie Shen] Fix reviews 4bd75ad [Yijie Shen] loosen struct method in functions.scala to take Expression children 0acb7be [Yijie Shen] Add CreateNamedStruct in both DataFrame function API and FunctionRegistery
This PR aimed to resolve udf_struct test failure in HiveCompatibilitySuite.
Currently, this is done by loosening CreateStruct's children type from NamedExpression to Expression and automatically generating StructField name for non-NamedExpression children.
The naming convention for unnamed children follows the udf's counterpart in Hive:
col1, col2, col3, ...