[SPARK-24163][SPARK-24164][SQL] Support column list as the pivot column in Pivot #21720

maryannxue · 2018-07-05T22:24:17Z

What changes were proposed in this pull request?

Extend the Parser to enable parsing a column list as the pivot column.
Extend the Parser and the Pivot node to enable parsing complex expressions with aliases as the pivot value.
Add type check and constant check in Analyzer for Pivot node.

How was this patch tested?

Add tests in pivot.sql

gatorsmile · 2018-07-05T23:49:19Z

sql/core/src/test/resources/sql-tests/inputs/pivot.sql

 PIVOT (
  sum(e) s, avg(e) a
-  FOR y IN (2012, 2013)
+  FOR y IN (2012 as firstYear, 2013 secondYear)


Can we keep the original query and add a new one for this?

SparkQA · 2018-07-06T01:11:07Z

Test build #92656 has finished for PR 21720 at commit 942a30d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-07-08T05:50:22Z

sql/core/src/test/resources/sql-tests/results/pivot.sql.out

+struct<>
+-- !query 20 output
+org.apache.spark.SparkException
+Exception thrown in awaitResult:


gatorsmile · 2018-07-08T06:35:32Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

        aggregates.foreach { e =>
          if (!isAggregateExpression(e)) {
              throw new AnalysisException(
                s"Aggregate expression required for pivot, found '$e'")


Add a test case for this exception?

SELECT * FROM (
  SELECT year, course, earnings FROM courseSales
)
PIVOT (
  sum(earnings), year
  FOR course IN ('dotNET', 'Java')
)

gatorsmile · 2018-07-08T06:44:10Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+        val evalPivotValues = pivotValues.map { value =>
+          if (!Cast.canCast(value.dataType, pivotColumn.dataType)) {
+            throw new AnalysisException(s"Invalid pivot value '$value': " +
+              s"value data type ${value.dataType.simpleString} does not match " +


simpleString -> catalogString

gatorsmile · 2018-07-08T07:19:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+          try {
+            Cast(value, pivotColumn.dataType).eval(EmptyRow)
+          } catch {
+            case _: UnsupportedOperationException =>


Do not use try catch for these cases.

          if (value.foldable) {
            Cast(value, pivotColumn.dataType).eval(EmptyRow)
          } else {
            throw new AnalysisException(
              s"Literal expressions required for pivot values, found '$value'")
          }

We should check whether the value is foldable before checking whether its type is castable.
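The ordering suggested above (foldability first, castability second, no try/catch) can be sketched in isolation. This is a minimal sketch, not the code in Analyzer.scala: `Expr`, `Lit`, `ColumnRef`, the three data types, and the `canCast` table are simplified stand-ins invented for illustration, and the function returns an `Either` rather than throwing `AnalysisException`.

```scala
// Simplified stand-ins for illustration only; these are NOT Catalyst classes.
sealed trait DataType
case object IntType extends DataType
case object StringType extends DataType
case object StructType extends DataType

sealed trait Expr { def dataType: DataType; def foldable: Boolean }
case class Lit(value: Any, dataType: DataType) extends Expr { val foldable = true }
case class ColumnRef(name: String, dataType: DataType) extends Expr { val foldable = false }

object PivotValueCheck {
  // Hypothetical castability table: anything casts to itself, and the two
  // atomic types cast to each other; struct casts only to struct.
  def canCast(from: DataType, to: DataType): Boolean =
    from == to || (from != StructType && to != StructType)

  // Foldability is checked first, then castability; errors are returned, not thrown.
  def validate(value: Expr, pivotColumnType: DataType): Either[String, Expr] =
    if (!value.foldable) {
      Left(s"Literal expressions required for pivot values, found '$value'")
    } else if (!canCast(value.dataType, pivotColumnType)) {
      Left(s"Invalid pivot value '$value'")
    } else {
      Right(value)
    }
}
```

Checking foldability up front means a non-constant value is rejected with the "Literal expressions required" message regardless of its type, and evaluation is only attempted on values known to be constant.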

gatorsmile · 2018-07-08T07:25:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

-            def ifExpr(expr: Expression) = {
-              If(EqualNullSafe(pivotColumn, value), expr, Literal(null))
+            def ifExpr(e: Expression) = {
+              If(EqualNullSafe(pivotColumn, Cast(value, pivotColumn.dataType)), e, Literal(null))


We need to consider the time zone: Cast(value, pivotColumn.dataType, Some(conf.sessionLocalTimeZone))

Is it required in the other Cast(value, pivotColumn.dataType) above?
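To see why the time zone matters for such a cast, here is a small sketch using only `java.time` rather than Spark. It assumes the pivot value is a zone-less timestamp string; the helper name `toInstant` and the sample zones are illustrative and do not come from the patch.

```scala
import java.time.{Instant, LocalDateTime, ZoneId}

object TimeZoneCastSketch {
  // Interpret a zone-less timestamp string in the given zone, which is what a
  // session-local cast from string to timestamp has to do.
  def toInstant(text: String, zone: String): Instant =
    LocalDateTime.parse(text).atZone(ZoneId.of(zone)).toInstant

  // True when the same text denotes the same instant in both zones.
  def sameInstant(text: String, zoneA: String, zoneB: String): Boolean =
    toInstant(text, zoneA) == toInstant(text, zoneB)
}
```

The same literal maps to different instants in different zones, so an equality comparison against a timestamp pivot column could match or miss depending on which zone the cast uses.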

MaxGekk

Could you add tests for nested columns like the ones here: https://github.com/apache/spark/pull/21699/files#diff-cef44d3b766a4ea0a9a52cf864c66f03R258

MaxGekk · 2018-07-08T11:08:24Z

sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4

+
+pivotColumn
+    : identifiers+=identifier
+    | '(' identifiers+=identifier (',' identifiers+=identifier)* ')'


Are there any specific reasons to restrict the pivotColumn to identifiers? Are there any cases where expressions are still not supported properly with your changes?

The main reason is that I implemented this pivot SQL support based on the Oracle grammar. Please take a look at https://docs.oracle.com/database/121/SQLRF/img_text/pivot_for_clause.htm. Note that a "column" here is different from an "expression" (for reference: https://docs.oracle.com/cd/B28359_01/server.111/b28286/expressions002.htm#SQLRF52047).
Another reason is that relaxing it to an "expr" would require many more tests and handling of special cases.

MaxGekk · 2018-07-08T11:12:38Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

      case p: Pivot if !p.childrenResolved || !p.aggregates.forall(_.resolved)
        || (p.groupByExprsOpt.isDefined && !p.groupByExprsOpt.get.forall(_.resolved))
-        || !p.pivotColumn.resolved => p
+        || !p.pivotColumn.resolved || !p.pivotValues.forall(_.resolved) => p


Which test covers this change?

Before this PR, pivot values could only be single literals (no struct), so they were converted to Literals in AstBuilder. Now they are "expressions" and are handled in this Analyzer rule.

MaxGekk · 2018-07-08T11:22:29Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

 */
 case class Pivot(
    groupByExprsOpt: Option[Seq[NamedExpression]],
    pivotColumn: Expression,


I am asking just for my understanding. If you support multiple pivot columns, why is it not declared here explicitly as pivotColumns: Seq[Expression], like pivotValues?

No. The pivot column is one "expression", which can be either 1) a single column reference or 2) a struct of multiple columns. Either way, the list of pivot values is a many-to-one mapping for the pivot column.
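A rough model of that idea in plain Scala, with no Spark dependency: the pivot column stays a single expression, and in the multi-column case that expression is a pair standing in for a struct. The `Sale` row type and the helper names are assumptions made for this sketch; only the column names echo the courseSales query quoted earlier in the review.

```scala
object StructPivotSketch {
  // Hypothetical input row; field names mirror the courseSales test table.
  case class Sale(course: String, year: Int, earnings: Long)

  // The pivot column is ONE expression: here a (course, year) pair plays the struct.
  def pivotKey(s: Sale): (String, Int) = (s.course, s.year)

  // Every pivot value is compared against that single key, producing one
  // aggregated output per pivot value (None when no row matches).
  def pivotSum(rows: Seq[Sale], pivotValues: Seq[(String, Int)]): Seq[Option[Long]] =
    pivotValues.map { v =>
      val matched = rows.filter(r => pivotKey(r) == v).map(_.earnings)
      if (matched.isEmpty) None else Some(matched.sum)
    }
}
```

Because the key is one value, a single-column pivot is just the special case where the key is not wrapped, which is why a `Seq[Expression]` field is not needed for the pivot column.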

MaxGekk · 2018-07-08T11:25:59Z

sql/core/src/test/resources/sql-tests/results/pivot.sql.out

+struct<>
+-- !query 19 output
+org.apache.spark.SparkException
+Job 17 cancelled because SparkContext was shut down


Is this the expected output?

No... sorry about this. There must have been a mistake. I'll commit this file again.

MaxGekk · 2018-07-08T11:38:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

      .map(typedVisit[Expression])
-    val pivotColumn = UnresolvedAttribute.quoted(ctx.pivotColumn.getText)
-    val pivotValues = ctx.pivotValues.asScala.map(typedVisit[Expression]).map(Literal.apply)
+    val pivotColumn = if (ctx.pivotColumn.identifiers.size == 1) {


Are there any reasons to handle one pivot column separately? And what happens if size == 0?

It cannot be 0, as required by the parser rule. If size == 1, it is a single column as before; otherwise it is a construct.
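The branching described in this reply can be sketched as follows. This is a toy model, not the AstBuilder code: `PivotColumnExpr`, `SingleColumn`, and `ColumnStruct` are hypothetical types, and the multi-column case is modelled on the "struct of multiple columns" wording used elsewhere in this thread.

```scala
// Hypothetical result types for illustration; not Catalyst expressions.
sealed trait PivotColumnExpr
case class SingleColumn(name: String) extends PivotColumnExpr
case class ColumnStruct(names: Seq[String]) extends PivotColumnExpr

object PivotColumnBuilder {
  // The grammar rule always yields at least one identifier, so an empty list
  // is treated as a caller error rather than a third case.
  def build(identifiers: Seq[String]): PivotColumnExpr = {
    require(identifiers.nonEmpty, "pivotColumn rule always yields at least one identifier")
    if (identifiers.size == 1) SingleColumn(identifiers.head)
    else ColumnStruct(identifiers)
  }
}
```

Keeping the single-identifier case separate preserves the pre-existing behaviour for `FOR y IN (...)`, while the parenthesized list takes the new path.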

MaxGekk · 2018-07-08T11:52:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+          } catch {
+            case _: UnsupportedOperationException =>
+              throw new AnalysisException(
+                s"Literal expressions required for pivot values, found '$value'")


Is UnsupportedOperationException raised only when the value is not a literal? You could probably check that it is a literal earlier.

Yes, you are right. Please refer to @gatorsmile's comment.

SparkQA · 2018-07-10T07:05:01Z

Test build #92792 has finished for PR 21720 at commit d468821.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maryannxue · 2018-07-10T07:07:22Z

retest please

maryannxue · 2018-07-10T17:22:33Z

retest this please

SparkQA · 2018-07-10T21:20:57Z

Test build #92823 has finished for PR 21720 at commit d468821.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-07-13T16:14:50Z

ping @maryannxue Resolve the conflicts? Will review it again after that.

SparkQA · 2018-07-14T02:09:19Z

Test build #92993 has finished for PR 21720 at commit b27245e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-07-16T21:40:19Z

retest this please

SparkQA · 2018-07-16T23:16:58Z

Test build #93138 has finished for PR 21720 at commit b27245e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maryannxue · 2018-07-17T05:00:49Z

retest this please

SparkQA · 2018-07-17T07:05:01Z

Test build #93152 has finished for PR 21720 at commit b27245e.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maryannxue · 2018-07-17T16:44:15Z

retest this please

SparkQA · 2018-07-17T20:45:49Z

Test build #93182 has finished for PR 21720 at commit b27245e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-07-18T20:32:42Z

LGTM

Thanks! Merged to master.

MaxGekk · 2018-07-18T20:54:09Z

@gatorsmile @maryannxue Can we move forward with this PR: #21699 ?

patricker · 2020-06-30T17:15:19Z

@maryannxue I know this is an old PR, but it doesn't actually include SPARK-24163. Can the Jira ticket be re-opened for SPARK-24163?

maryannxue added 2 commits July 3, 2018 13:54

spark-24164

fd23502

revert accidental changes

942a30d

gatorsmile reviewed Jul 5, 2018

gatorsmile reviewed Jul 8, 2018

MaxGekk reviewed Jul 8, 2018

maryannxue added 2 commits July 9, 2018 18:17

fix ref file

58e99ab

address review comments

d468821

Resolve conflicts

b27245e

asfgit closed this in cd203e0 Jul 18, 2018
