Skip to content

Conversation

@yjshen
Copy link
Member

@yjshen yjshen commented Jun 15, 2015

This PR aimed to resolve udf_struct test failure in HiveCompatibilitySuite.

Currently, this is done by loosening CreateStruct's children type from NamedExpression to Expression and automatically generating StructField name for non-NamedExpression children.

The naming convention for unnamed children follows the udf's counterpart in Hive:
col1, col2, col3, ...

@chenghao-intel
Copy link
Contributor

There are 2 Struct UDFs in Hive:
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFStruct.java (struct)
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFNamedStruct.java (named_struct)

For the previous one, we will give the default names for its fields, but for the later, we will create the struct type by given names. We need to implement both of them, but this PR seems only for solve the bug of the former one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about StructField(s"col${idx + 1}", child.dataType, child.nullable, Metadata.empty) as Hive does (GenericUDFStruct)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CreateStruct here was also used by two org.apache.spark.sql.functions.struct and I did this above to avoid change the semantic of these two struct method. Maybe it's better to add another two class to act as GenericUDFStruct & GenericUDFNamedStruct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will not argue that they have some share logic, however, the difference between GenericUDFStruct and GenericUDFNamedStruct is, the former gives names for output fields as (col1, col2 ...), but the later requires specifying the output fields names from its arguments.
Probably we'd better create another expression class in Spark SQL for GenericUDFNamedStruct, mixing the functionality in a single class probably makes people confusing.

It will be great if you can create an issue under the expression umbrella task https://issues.apache.org/jira/browse/SPARK-8159

What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, I would create a new ticket for named_struct in JIRA and take all discussions we had above into consideration. Thanks.

@chenghao-intel
Copy link
Contributor

BTW that's would be great if you can add the NamedStruct support.

@yjshen
Copy link
Member Author

yjshen commented Jun 17, 2015

@chenghao-intel, yes, I've noticed struct & named_struct when I was implementing this and just wondering whether we should implement both.

The previous test failure was because there exists a struct in FunctionRegistry and happens to mask its counterpart GenericUDFStruct in Hive, and named_struct was delegate to GenericUDFNamedStruct and behaves well.

@chenghao-intel
Copy link
Contributor

named_struct is not implemented in Spark SQL right? At least, it should be registered in FunctionRegistry. Of course we can leave it for the future.

@marmbrus
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented Jun 17, 2015

Test build #35051 has finished for PR 6828 at commit 6052b73.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateStruct(children: Seq[Expression]) extends Expression

@rxin
Copy link
Contributor

rxin commented Jun 18, 2015

Thanks. Merging in master.

@yjshen
Copy link
Member Author

yjshen commented Jun 18, 2015

Thanks.

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
…ySuite

This PR aimed to resolve udf_struct test failure in HiveCompatibilitySuite.

Currently, this is done by loosening CreateStruct's children type from NamedExpression to Expression and automatically generating StructField name for non-NamedExpression children.

The naming convention for unnamed children follows the udf's counterpart in Hive:
`col1, col2, col3, ...`

Author: Yijie Shen <[email protected]>

Closes apache#6828 from yijieshen/SPARK-8283 and squashes the following commits:

6052b73 [Yijie Shen] Doc fix
677e0b7 [Yijie Shen] Resolve udf_struct test failure by automatically generate structField name for non-NamedExpression children
asfgit pushed a commit that referenced this pull request Jul 2, 2015
This is a follow up of [SPARK-8283](https://issues.apache.org/jira/browse/SPARK-8283) ([PR-6828](#6828)), to support both `struct` and `named_struct` in Spark SQL.

After [#6725](#6828), the semantic of [`CreateStruct`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypes.scala#L56) methods have changed a little and do not limited to cols of `NamedExpressions`, it will name non-NamedExpression fields following the hive convention, col1, col2 ...

This PR would both loosen [`struct`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L723) to take children of `Expression` type and add `named_struct` support.

Author: Yijie Shen <[email protected]>

Closes #6874 from yijieshen/SPARK-8283 and squashes the following commits:

4cd3375 [Yijie Shen] change struct documentation
d599d0b [Yijie Shen] rebase code
9a7039e [Yijie Shen] fix reviews and regenerate golden answers
b487354 [Yijie Shen] replace assert using checkAnswer
f07e114 [Yijie Shen] tiny fix
9613be9 [Yijie Shen] review fix
7fef712 [Yijie Shen] Fix checkInputTypes' implementation using foldable and nullable
60812a7 [Yijie Shen] Fix type check
828d694 [Yijie Shen] remove unnecessary resolved assertion inside dataType method
fd3cd8e [Yijie Shen] remove type check from eval
7a71255 [Yijie Shen] tiny fix
ccbbd86 [Yijie Shen] Fix reviews
47da332 [Yijie Shen] remove nameStruct API from DataFrame
917e680 [Yijie Shen] Fix reviews
4bd75ad [Yijie Shen] loosen struct method in functions.scala to take Expression children
0acb7be [Yijie Shen] Add CreateNamedStruct in both DataFrame function API and FunctionRegistery
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants