-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-24163][SPARK-24164][SQL] Support column list as the pivot column in Pivot #21720
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| PIVOT ( | ||
| sum(e) s, avg(e) a | ||
| FOR y IN (2012, 2013) | ||
| FOR y IN (2012 as firstYear, 2013 secondYear) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we keep the original query? add a new one for this?
|
Test build #92656 has finished for PR 21720 at commit
|
| struct<> | ||
| -- !query 20 output | ||
| org.apache.spark.SparkException | ||
| Exception thrown in awaitResult: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
| aggregates.foreach { e => | ||
| if (!isAggregateExpression(e)) { | ||
| throw new AnalysisException( | ||
| s"Aggregate expression required for pivot, found '$e'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a test case for this exception?
SELECT * FROM (
SELECT year, course, earnings FROM courseSales
)
PIVOT (
sum(earnings), year
FOR course IN ('dotNET', 'Java')
)| val evalPivotValues = pivotValues.map { value => | ||
| if (!Cast.canCast(value.dataType, pivotColumn.dataType)) { | ||
| throw new AnalysisException(s"Invalid pivot value '$value': " + | ||
| s"value data type ${value.dataType.simpleString} does not match " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simpleString -> catalogString
| try { | ||
| Cast(value, pivotColumn.dataType).eval(EmptyRow) | ||
| } catch { | ||
| case _: UnsupportedOperationException => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not use try catch for these cases.
if (value.foldable) {
Cast(value, pivotColumn.dataType).eval(EmptyRow)
} else {
throw new AnalysisException(
s"Literal expressions required for pivot values, found '$value'")
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should check if the value is foldable before the type is castable
| def ifExpr(expr: Expression) = { | ||
| If(EqualNullSafe(pivotColumn, value), expr, Literal(null)) | ||
| def ifExpr(e: Expression) = { | ||
| If(EqualNullSafe(pivotColumn, Cast(value, pivotColumn.dataType)), e, Literal(null)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to consider timezone. Cast(value, pivotColumn.dataType, Some(conf.sessionLocalTimeZone))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it required in the other Cast(value, pivotColumn.dataType) above?
MaxGekk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add tests for nested columns like there: https://github.com/apache/spark/pull/21699/files#diff-cef44d3b766a4ea0a9a52cf864c66f03R258
|
|
||
| pivotColumn | ||
| : identifiers+=identifier | ||
| | '(' identifiers+=identifier (',' identifiers+=identifier)* ')' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any specific reasons to restrict the pivotColumn by identifier? Are there any cases when expressions still don't supported properly with your changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason was that I implemented this pivot SQL support based on ORACLE grammar. Please take a look at https://docs.oracle.com/database/121/SQLRF/img_text/pivot_for_clause.htm. Note that the "column" here is different from "expression" (take this for reference: https://docs.oracle.com/cd/B28359_01/server.111/b28286/expressions002.htm#SQLRF52047).
Another reason was that relaxing it to an "expr" would require a lot more tests and handling of special cases.
| case p: Pivot if !p.childrenResolved || !p.aggregates.forall(_.resolved) | ||
| || (p.groupByExprsOpt.isDefined && !p.groupByExprsOpt.get.forall(_.resolved)) | ||
| || !p.pivotColumn.resolved => p | ||
| || !p.pivotColumn.resolved || !p.pivotValues.forall(_.resolved) => p |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By which test is the change covered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before this PR, pivot values can only be single literals (no struct) so they have been converted to Literals in ASTBuilder. Now they are "expressions" and will be handled in this Analyzer rule.
| */ | ||
| case class Pivot( | ||
| groupByExprsOpt: Option[Seq[NamedExpression]], | ||
| pivotColumn: Expression, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am asking just for my understanding. If you support multiple pivot columns, why it is not declared here explicitly: pivotColumns: Seq[Expression] like for pivotValues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. Pivot column is one "expression" which can be either 1) a single column reference or 2) a struct of multiple columns. Either way the list of pivot values are many-to-one mapping for the pivot column.
| struct<> | ||
| -- !query 19 output | ||
| org.apache.spark.SparkException | ||
| Job 17 cancelled because SparkContext was shut down |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it expected output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No... sorry about this. There must have been a mistake. I'll commit this file again.
| .map(typedVisit[Expression]) | ||
| val pivotColumn = UnresolvedAttribute.quoted(ctx.pivotColumn.getText) | ||
| val pivotValues = ctx.pivotValues.asScala.map(typedVisit[Expression]).map(Literal.apply) | ||
| val pivotColumn = if (ctx.pivotColumn.identifiers.size == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any reasons to handle one pivot column separately? And what happens if size == 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cannot be "0" as required by the parser rule. if size == 1, then it's single column as before, otherwise it's a construct.
| } catch { | ||
| case _: UnsupportedOperationException => | ||
| throw new AnalysisException( | ||
| s"Literal expressions required for pivot values, found '$value'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is UnsupportedOperationException raised only in the case if value is not a literal. Probably you can check that it is a literal earlier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right. Please refer to @gatorsmile's comment.
|
Test build #92792 has finished for PR 21720 at commit
|
|
retest please |
|
retest this please |
|
Test build #92823 has finished for PR 21720 at commit
|
|
ping @maryannxue Resolve the conflicts? Will review it again after that. |
|
Test build #92993 has finished for PR 21720 at commit
|
|
retest this please |
|
Test build #93138 has finished for PR 21720 at commit
|
|
retest this please |
|
Test build #93152 has finished for PR 21720 at commit
|
|
retest this please |
|
Test build #93182 has finished for PR 21720 at commit
|
|
LGTM Thanks! Merged to master. |
|
@gatorsmile @maryannxue Can we move forward with this PR: #21699 ? |
|
@maryannxue I know this is an old PR, but it doesn't actually include SPARK-24163. Can the Jira ticket be re-opened for SPARK-24163? |
What changes were proposed in this pull request?
How was this patch tested?
Add tests in pivot.sql