[SPARK-13320] [SQL] Support Star in CreateStruct/CreateArray and Error Handling when DataFrame/DataSet Functions using Star #11208

gatorsmile · 2016-02-15T18:07:05Z

This PR resolves two issues:

First, it expands * inside struct arguments of aggregate functions when using the DataFrame/Dataset APIs. For example,

structDf.groupBy($"a").agg(min(struct($"record.*")))

Second, it improves the error message for invalid star usage in the DataFrame/Dataset APIs. For example,

pagecounts4PartitionsDS
  .map(line => (line._1, line._3))
  .toDF()
  .groupBy($"_1")
  .agg(sum("*") as "sumOccurances")

Before the fix, this invalid usage produced a confusing error message such as:

org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns _1, _2;

After the fix, the message reads:

org.apache.spark.sql.AnalysisException: Invalid usage of '*' in function 'sum'

cc: @rxin @nongli @cloud-fan
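The two behaviours in this description can be illustrated with a toy model. This is only a sketch, not Catalyst code: `Expr`, `Col`, `Star`, `CreateStruct`, `Func` and `StarExpansion.expand` are hypothetical names invented for the example, and the rule "any function other than a struct rejects a direct star" is a simplifying assumption.

```scala
// Toy expression tree. These classes are hypothetical and only mimic the
// shape of the problem; they are not Spark's Catalyst expressions.
sealed trait Expr
case class Col(name: String) extends Expr
case object Star extends Expr
case class CreateStruct(children: Seq[Expr]) extends Expr
case class Func(name: String, children: Seq[Expr]) extends Expr

object StarExpansion {
  // Expand Star inside CreateStruct into one Col per input column, and
  // reject a Star passed directly to any other function.
  def expand(e: Expr, input: Seq[String]): Expr = e match {
    case CreateStruct(cs) =>
      val cols: Seq[Expr] = input.map(n => Col(n))
      CreateStruct(cs.flatMap {
        case Star  => cols
        case other => Seq(expand(other, input))
      })
    case Func(name, cs) if cs.contains(Star) =>
      // Assumed message format, copied from the description above.
      throw new IllegalArgumentException(
        s"Invalid usage of '*' in function '$name'")
    case Func(name, cs) => Func(name, cs.map(expand(_, input)))
    case other => other
  }
}
```

In this sketch, `min(struct(*))` expands because the star sits inside the struct, while `sum(*)` fails with the clearer message.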

gatorsmile · 2016-02-15T18:08:42Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

                case s: Star => s.expand(child, resolver)
                case o => o :: Nil
              })
+            case c: CreateStruct if containsStar(c.children) =>


I am not sure whether other functions can also accept a star as an input parameter. If so, I think we should create a trait for all these case classes so that we can remove the duplicate code. Any better ideas? Thanks! : )

SparkQA · 2016-02-15T19:49:19Z

Test build #51321 has finished for PR 11208 at commit 3b2b448.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-02-15T22:20:03Z

cc @cloud-fan

cloud-fan · 2016-02-16T01:02:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+                case s: Star => s.expand(child, resolver)
+                case o => o :: Nil
+              })
+            case c: CreateStructUnsafe if containsStar(c.children) =>


CreateStructUnsafe only appears after unsafe projection, so I think we don't need to handle it in Analyzer

I saw it is used in two places in Analyzer. I will remove them. Thanks!

cloud-fan · 2016-02-16T01:16:46Z

The PR title looks confusing. Star expansion is already done; what this PR does is fix the missing CreateStruct case when handling stars and add a better error message. @gatorsmile could you improve the title to make that clearer?

gatorsmile · 2016-02-16T01:26:18Z

Star expansion only works when the star is inside an UnresolvedFunction.

So far, Spark SQL does not handle star expansion when we use star in the DataFrame or DataSet functions. That is the reason I chose this title. Let me change it.

Actually, I am not sure if CreateStruct and Count are the only two functions that can accept star. Could you help me confirm it? Thanks!

cloud-fan · 2016-02-16T01:34:15Z

Actually we do handle stars in CreateArray and CreateStruct: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L440-L458, so what you are fixing is the nested CreateStruct. I think we should also add CreateArray.

One of my concerns is that we sometimes check for stars under UnresolvedAlias and sometimes under Alias. It would be good if you could figure this out and make sure no case is missing.

gatorsmile · 2016-02-16T01:44:15Z

uh, I see. The code you posted above is for Project. The error message in the original JIRA is for having star used in Aggregate.

Yeah, we need a clean and complete fix for resolving star. Let me check if I can move these into expandStarExpressions.

gatorsmile · 2016-02-19T02:48:05Z

@cloud-fan The latest commit separates star resolution from the reference resolution, since ResolveReferences becomes pretty long now. Could you help me check if the new changes cover all the cases that can accept star? Thank you! : )

SparkQA · 2016-02-19T04:27:44Z

Test build #51512 has finished for PR 11208 at commit ac71f39.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-02-19T05:49:18Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

      """.stripMargin).select($"r.*"),
      Row(3, 2) :: Nil)

+    assert(structDf.groupBy($"a").agg(min(struct($"record.*"))).first() == Row(3, Row(3, 1)))


We should write a new test case to test * in CreateStruct and CreateArray, not just add to existing ones.

Sure, will do. Thanks!

cloud-fan · 2016-02-19T05:57:34Z

Overall LGTM except some comments about tests, thanks for working on it!

SparkQA · 2016-02-19T09:11:27Z

Test build #51536 has finished for PR 11208 at commit 2c72edf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-02-22T23:39:39Z

retest this please

cloud-fan · 2016-02-23T01:06:51Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameComplexTypeSuite.scala

    val f = udf((a: String) => a)
    val df = sparkContext.parallelize(Seq((1, 1))).toDF("a", "b")
    df.select(struct($"a").as("s")).select(f($"s.a")).collect()
+    df.select(struct($"*").as("s")).select(f($"s.a")).collect()


not needed?

SparkQA · 2016-02-23T01:26:47Z

Test build #51701 has finished for PR 11208 at commit 2c72edf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-23T06:13:01Z

Test build #51729 has finished for PR 11208 at commit 6b2d609.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-02-24T03:37:27Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+          }
+        )
+      case g: Generate if containsStar(g.generator.children) =>
+        failAnalysis("Cannot explode *, explode can only be applied on a specific column.")


just realized the error message is not clear enough, Generate is not always "explode"

do we have a test for this error message?

True. I moved this from another rule. I will check the coverage of test cases. Thanks!

We already have a test case: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala#L181-L182

How about changing the message to Invalid usage of '*' in explode/json_tuple/UDTF? Thanks!

explode/json_tuple/UDTF LGTM

Thanks! Let me change it now.

gatorsmile · 2016-03-12T17:26:08Z

retest this please

SparkQA · 2016-03-12T19:09:59Z

Test build #53008 has finished for PR 11208 at commit e060dea.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-03-16T23:34:06Z

retest this please

SparkQA · 2016-03-17T01:15:12Z

Test build #53376 has finished for PR 11208 at commit e060dea.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-03-17T01:21:25Z

cc @yhuai

gatorsmile · 2016-03-19T18:00:07Z

retest this please

SparkQA · 2016-03-19T19:38:41Z

Test build #53620 has finished for PR 11208 at commit e060dea.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-03-19T22:01:26Z

cc @yhuai

cloud-fan · 2016-03-21T02:48:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+            case o => o :: Nil
+          })
+        // count(*) has been replaced by count(1)
+        case o if containsStar(o.children) =>


We can have a method:

private def mayContainsStar(expr: Expression): Boolean = expr.isInstanceOf[UnresolvedFunction] || expr.isInstanceOf[CreateStruct]...

then we can simplify this to:

expr.transformUp { case e if mayContainsStar(e) => e.copy(children = ...) }

That is a great idea! : )

Tried it, but copy cannot be used here. When the static type is Expression (an abstract type), we cannot call copy to change the children. In addition, withNewChildren requires the same number of children. Do you have any idea how to fix it? Thanks!

oh i see, I don't have a better idea, let's just keep it this way.
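The `copy` limitation discussed above can be reproduced outside Spark. The sketch below uses a hypothetical toy tree (`Node`, `Leaf`, `StructNode`, `ArrayNode`, `Rebuild.replaceChildren` are invented names, not Catalyst classes). Scala generates `copy` only on each concrete case class, so a value typed as the abstract parent has no `copy` method and the rebuild has to match on every concrete class, which is roughly why the per-class cases were kept.

```scala
// Hypothetical toy tree, used only to illustrate the language-level issue.
sealed abstract class Node { def children: Seq[Node] }
case class Leaf(name: String) extends Node { val children: Seq[Node] = Nil }
case class StructNode(children: Seq[Node]) extends Node
case class ArrayNode(children: Seq[Node]) extends Node

object Rebuild {
  // `n.copy(children = ...)` would not compile for `n: Node`, because `copy`
  // is generated per case class and is not a member of the abstract parent.
  // Matching on the concrete class recovers a `copy` with the right shape,
  // and unlike an arity-checked replacement it may change the child count.
  def replaceChildren(n: Node, newChildren: Seq[Node]): Node = n match {
    case s: StructNode => s.copy(children = newChildren)
    case a: ArrayNode  => a.copy(children = newChildren)
    case leaf          => leaf // leaves have no children to replace
  }
}
```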

cloud-fan · 2016-03-21T02:50:37Z

Sorry for leaving this here for such a long time. Overall LGTM, I will merge it after you address the new comments, thanks!

gatorsmile · 2016-03-21T05:04:36Z

@cloud-fan Thank you for your detailed reviews! I know all of you are very busy. Let me know if anything needs a change. Thanks again!

cloud-fan · 2016-03-21T05:30:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+            UnresolvedAlias(child = expandStarExpression(ua.child, p.child)) :: Nil
+          case a @ Alias(_: UnresolvedFunction | _: CreateArray | _: CreateStruct, _) =>
+            Alias(child = expandStarExpression(a.child, p.child), a.name)(
+              isGenerated = a.isGenerated) :: Nil


We will lose qualifier here, how about a.withNewChildren(expandStarExpression(a.child, p.child) :: Nil)?

Yeah, a good catch!
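A small sketch of why rebuilding an alias by hand can drop fields. `ToyAlias` and `AliasRebuild` are hypothetical stand-ins, not Catalyst's `Alias`; the only assumption carried over is that some fields live in a second parameter list, which the caller must remember to pass explicitly.

```scala
// Toy stand-in for an alias node; names and fields are hypothetical.
case class ToyAlias(child: String, name: String)(
    val qualifier: Option[String],
    val isGenerated: Boolean) {
  // Rebuild with a new child while carrying over every second-list field,
  // in the spirit of replacing children on the existing node.
  def withNewChild(newChild: String): ToyAlias =
    ToyAlias(newChild, name)(qualifier, isGenerated)
}

object AliasRebuild {
  val original = ToyAlias("struct(*)", "s")(Some("t"), isGenerated = false)

  // Hand-built replacement that forwards isGenerated but forgets the qualifier.
  val lossy = ToyAlias("struct(a, b)", original.name)(None, original.isGenerated)

  // Replacement derived from the existing node keeps the qualifier.
  val preserved = original.withNewChild("struct(a, b)")
}
```

Deriving the new node from the old one means any field added to the second parameter list later is carried over in one place instead of at every construction site.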

SparkQA · 2016-03-21T07:00:02Z

Test build #53655 has finished for PR 11208 at commit ba3fe7c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-03-21T07:53:58Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

    }
  }

+  test("Star Expansion - CreateStruct and CreateArray") {


Why do we put these tests in SQLQuerySuite? It looks like they are mostly testing DF APIs.

True, let me move them to DataFrameSuite. Thanks!

SparkQA · 2016-03-21T08:46:51Z

Test build #53661 has finished for PR 11208 at commit 0fce075.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-21T18:41:47Z

Test build #53685 has finished for PR 11208 at commit 50abeec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-03-22T00:23:17Z

thanks! merging to master!

Author: gatorsmile
Closes apache#11208 from gatorsmile/sumDataSetResolution.

davies · 2016-03-25T03:38:49Z

@gatorsmile @cloud-fan This PR reverts the change in #3674, which unfortunately affects the unit test in AnalysisSuite. The test breaks once we enforce the max-iteration check in tests, see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54090/testReport/org.apache.spark.sql.catalyst.analysis/AnalysisSuite/union_project__/

davies · 2016-03-25T03:46:07Z

@gatorsmile This PR can't be easily reverted, so could you send a PR to fix it?

davies · 2016-03-25T03:52:27Z

I will fix this in #11828

structStarExpansion

3b2b448

gatorsmile reviewed Feb 15, 2016

cloud-fan reviewed Feb 16, 2016

gatorsmile changed the title ~~[SPARK-13320] [SQL] Star Expansion for Dataframe/Dataset Functions~~ [SPARK-13320] [SQL] Support Star in CreateStruct and Error Handling when DataFrame/DataSet Functions using Star Feb 16, 2016

add a new rule for resolving star.

ac71f39

cloud-fan reviewed Feb 19, 2016

gatorsmile added 2 commits February 18, 2016 23:30

address comments.

8d809bc

address comments.

2c72edf

gatorsmile changed the title ~~[SPARK-13320] [SQL] Support Star in CreateStruct and Error Handling when DataFrame/DataSet Functions using Star~~ [SPARK-13320] [SQL] Support Star in CreateStruct/CreateArray and Error Handling when DataFrame/DataSet Functions using Star Feb 20, 2016

cloud-fan reviewed Feb 23, 2016

address comments.

6b2d609

cloud-fan reviewed Feb 24, 2016

Merge remote-tracking branch 'upstream/master' into sumDataSetResolution

e47f141

Merge remote-tracking branch 'upstream/master' into sumDataSetResolution

99f5312

gatorsmile mentioned this pull request Mar 19, 2016

[SPARK-12789]Support order by index and group by index #10731

Closed

cloud-fan reviewed Mar 21, 2016

added test cases.

ba3fe7c

cloud-fan reviewed Mar 21, 2016

address comments.

0fce075

cloud-fan reviewed Mar 21, 2016

address comments.

50abeec

asfgit closed this in 3f49e07 Mar 22, 2016
