-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-13320] [SQL] Support Star in CreateStruct/CreateArray and Error Handling when DataFrame/DataSet Functions using Star #11208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| case s: Star => s.expand(child, resolver) | ||
| case o => o :: Nil | ||
| }) | ||
| case c: CreateStruct if containsStar(c.children) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we have the other functions that can accept star as an input parameter. If so, I think we need to create a trait for all these case classes. Then, we can remove the duplicate code. Any better idea? Thanks! : )
|
Test build #51321 has finished for PR 11208 at commit
|
|
cc @cloud-fan |
| case s: Star => s.expand(child, resolver) | ||
| case o => o :: Nil | ||
| }) | ||
| case c: CreateStructUnsafe if containsStar(c.children) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CreateStructUnsafe only appears after unsafe projection, so I think we don't need to handle it in Analyzer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw it is being used in two parts in Analyzer. Will remove them. Thanks!
|
The PR title looks confusing, |
|
So far, Spark SQL does not handle star expansion when we use Actually, I am not sure if |
|
Actually we do handle stars in One of my concern is: sometimes we check stars under |
|
uh, I see. The code you posted above is for Yeah, we need a clean and complete fix for resolving star. Let me check if can move these into |
|
@cloud-fan The latest commit separates star resolution from the reference resolution, since |
|
Test build #51512 has finished for PR 11208 at commit
|
| """.stripMargin).select($"r.*"), | ||
| Row(3, 2) :: Nil) | ||
|
|
||
| assert(structDf.groupBy($"a").agg(min(struct($"record.*"))).first() == Row(3, Row(3, 1))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should write a new test case to test * in CreateStruct and CreateArray, not just put in existing ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will do. Thanks!
|
Overall LGTM except some comments about tests, thanks for working on it! |
|
Test build #51536 has finished for PR 11208 at commit
|
|
retest this please |
| val f = udf((a: String) => a) | ||
| val df = sparkContext.parallelize(Seq((1, 1))).toDF("a", "b") | ||
| df.select(struct($"a").as("s")).select(f($"s.a")).collect() | ||
| df.select(struct($"*").as("s")).select(f($"s.a")).collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed?
|
Test build #51701 has finished for PR 11208 at commit
|
|
Test build #51729 has finished for PR 11208 at commit
|
| } | ||
| ) | ||
| case g: Generate if containsStar(g.generator.children) => | ||
| failAnalysis("Cannot explode *, explode can only be applied on a specific column.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just realized the error message is not clear enough, Generate is not always "explode"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have a test for this error message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. I moved this from another rule. I will check the coverage of test cases. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have a test case: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala#L181-L182
How about changing the message to Invalid usage of '*' in explode/json_tuple/UDTF? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explode/json_tuple/UDTF LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Let me change it now.
|
retest this please |
|
Test build #53008 has finished for PR 11208 at commit
|
|
retest this please |
|
Test build #53376 has finished for PR 11208 at commit
|
|
cc @yhuai |
|
retest this please |
|
Test build #53620 has finished for PR 11208 at commit
|
|
cc @yhuai |
| case o => o :: Nil | ||
| }) | ||
| // count(*) has been replaced by count(1) | ||
| case o if containsStar(o.children) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can have a method:
private def mayContainsStar(expr: Expression): Boolean = expr.isInstnaceOf[UnresolvedFunction] || expr.isInstnaceOf[CreateStruct]...
then we can simplify this to:
expr.transformUp {
case e if mayContainsStar(e) =>
e.copy(children = ...)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a great idea! : )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried it, but copy is unable to use here. When the type is Expression (abstract type), we are unable to use the copy function to change the children. In addition, withNewChildren requires the same number of children. Do you have any idea how to fix it? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh i see, I don't have a better idea, let's just keep it this way.
|
Sorry for putting it here for such a long time, overall LGTM, will merge it after you address the new comments, thanks! |
|
@cloud-fan Thank you for your detailed reviews! I know all of you are very busy. Let me know if anything needs a change. Thanks again! |
| UnresolvedAlias(child = expandStarExpression(ua.child, p.child)) :: Nil | ||
| case a @ Alias(_: UnresolvedFunction | _: CreateArray | _: CreateStruct, _) => | ||
| Alias(child = expandStarExpression(a.child, p.child), a.name)( | ||
| isGenerated = a.isGenerated) :: Nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will lose qualifier here, how about a.withNewChildren(expandStarExpression(a.child, p.child) :: Nil)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, a good catch!
|
Test build #53655 has finished for PR 11208 at commit
|
| } | ||
| } | ||
|
|
||
| test("Star Expansion - CreateStruct and CreateArray") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we put these tests in SQLQuerySuite? It looks like they are mostly testing DF APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, let me move them to DataFrameSuite. Thanks!
|
Test build #53661 has finished for PR 11208 at commit
|
|
Test build #53685 has finished for PR 11208 at commit
|
|
thanks! merging to master! |
… Handling when DataFrame/DataSet Functions using Star
This PR resolves two issues:
First, expanding * inside aggregate functions of structs when using Dataframe/Dataset APIs. For example,
```scala
structDf.groupBy($"a").agg(min(struct($"record.*")))
```
Second, it improves the error messages when having invalid star usage when using Dataframe/Dataset APIs. For example,
```scala
pagecounts4PartitionsDS
.map(line => (line._1, line._3))
.toDF()
.groupBy($"_1")
.agg(sum("*") as "sumOccurances")
```
Before the fix, the invalid usage will issue a confusing error message, like:
```
org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns _1, _2;
```
After the fix, the message is like:
```
org.apache.spark.sql.AnalysisException: Invalid usage of '*' in function 'sum'
```
cc: rxin nongli cloud-fan
Author: gatorsmile <[email protected]>
Closes apache#11208 from gatorsmile/sumDataSetResolution.
|
@gatorsmile @cloud-fan This PR revert the change in #3674, unfortunately the unit test in AnalysisSuite. This test break once we enforce max-iteration check in tests, see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54090/testReport/org.apache.spark.sql.catalyst.analysis/AnalysisSuite/union_project__/ |
|
@gatorsmile This PR can't be easily reverted, so could you send a PR to fix it? |
|
I will fix this in #11828 |
This PR resolves two issues:
First, expanding * inside aggregate functions of structs when using Dataframe/Dataset APIs. For example,
Second, it improves the error messages when having invalid star usage when using Dataframe/Dataset APIs. For example,
Before the fix, the invalid usage will issue a confusing error message, like:
After the fix, the message is like:
cc: @rxin @nongli @cloud-fan