-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-9251][SQL] do not order by expressions which still need evaluation #7593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -79,6 +79,7 @@ class Analyzer( | |
| ExtractWindowExpressions :: | ||
| GlobalAggregates :: | ||
| UnresolvedHavingClauseAttributes :: | ||
| RemoveEvaluationFromSort :: | ||
| HiveTypeCoercion.typeCoercionRules ++ | ||
| extendedResolutionRules : _*), | ||
| Batch("Nondeterministic", Once, | ||
|
|
@@ -947,6 +948,63 @@ class Analyzer( | |
| Project(p.output, newPlan.withNewChildren(newChild :: Nil)) | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Removes all still-need-evaluate ordering expressions from sort and use an inner project to | ||
| * materialize them, finally use a outer project to project them away to keep the result same. | ||
| * Then we can make sure we only sort by [[AttributeReference]]s. | ||
| * | ||
| * As an example, | ||
| * {{{ | ||
| * Sort('a, 'b + 1, | ||
| * Relation('a, 'b)) | ||
| * }}} | ||
| * will be turned into: | ||
| * {{{ | ||
| * Project('a, 'b, | ||
| * Sort('a, '_sortCondition, | ||
| * Project('a, 'b, ('b + 1).as("_sortCondition"), | ||
| * Relation('a, 'b)))) | ||
| * }}} | ||
| */ | ||
| object RemoveEvaluationFromSort extends Rule[LogicalPlan] { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seems like an optimization not something to be done in analysis. We should always ask "Can the query be executed correctly without this rule?" to decide if it belongs in the Analyzer.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, after talking to @rxin I see he does plan to make this a requirement for execution.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. And you even made it a condition of |
||
| private def hasAlias(expr: Expression) = { | ||
| expr.find { | ||
| case a: Alias => true | ||
| case _ => false | ||
| }.isDefined | ||
| } | ||
|
|
||
| override def apply(plan: LogicalPlan): LogicalPlan = plan transform { | ||
| // The ordering expressions have no effect to the output schema of `Sort`, | ||
| // so `Alias`s in ordering expressions are unnecessary and we should remove them. | ||
| case s @ Sort(ordering, _, _) if ordering.exists(hasAlias) => | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This check is probably more expensive than just doing the transformation always. If its a noop we will detect that through reference equality.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe also add a test to make sure we don't project unnecessarily when there is an alias? |
||
| val newOrdering = ordering.map(_.transformUp { | ||
| case Alias(child, _) => child | ||
| }.asInstanceOf[SortOrder]) | ||
| s.copy(order = newOrdering) | ||
|
|
||
| case s @ Sort(ordering, global, child) | ||
| if s.expressions.forall(_.resolved) && s.childrenResolved && !s.hasNoEvaluation => | ||
|
|
||
| val (ref, needEval) = ordering.partition(_.child.isInstanceOf[AttributeReference]) | ||
|
|
||
| val namedExpr = needEval.map(_.child match { | ||
| case n: NamedExpression => n | ||
| case e => Alias(e, "_sortCondition")() | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is it possible that we will have multiple conditions needed to alias?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it's definitely possible, but the alias name here doesn't matter, we'll call |
||
| }) | ||
|
|
||
| val newOrdering = ref ++ needEval.zip(namedExpr).map { case (order, ne) => | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think there is a bug here. We need to consider the original ordering of the expressions or we might sort differently. Consider: Seq((1,2,3,4)).toDF("a", "b", "c", "d").registerTempTable("test")
scala> sqlContext.sql("SELECT * FROM test ORDER BY a + 1, b").queryExecution
== Parsed Logical Plan ==
'Sort [('a + 1) ASC,'b ASC], true
'Project [unresolvedalias(*)]
'UnresolvedRelation [test], None
== Optimized Logical Plan ==
Project [a#4,b#5,c#6,d#7]
Sort [b#5 ASC,_sortCondition#9 ASC], true
LocalRelation [a#4,b#5,c#6,d#7,_sortCondition#9], [[1,2,3,4,2]]Notice how we are now sorting by
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have created https://issues.apache.org/jira/browse/SPARK-9512 to bug fix work.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry I was terribly busy these days... thanks for the JIRA! I'll work on it later.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. o, it ha not been fixed yet. Will fix it before 1.5 release. From: Wenchen Fan [email protected] In sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: > + val newOrdering = ordering.map(.transformUp {> + case Alias(child, ) => child> + }.asInstanceOf[SortOrder])> + s.copy(order = newOrdering)> +> + case s @ Sort(ordering, global, child)> + if s.expressions.forall(.resolved) && s.childrenResolved && !s.hasNoEvaluation =>> +> + val (ref, needEval) = ordering.partition(.child.isInstanceOf[AttributeReference])> +> + val namedExpr = needEval.map(_.child match {> + case n: NamedExpression => n> + case e => Alias(e, "_sortCondition")()> + })> +> + val newOrdering = ref ++ needEval.zip(namedExpr).map { case (order, ne) => Sorry I was terribly busy these days... thanks for the fix! — |
||
| order.copy(child = ne.toAttribute) | ||
| } | ||
|
|
||
| // Add still-need-evaluate ordering expressions into inner project and then project | ||
| // them away after the sort. | ||
| Project(child.output, | ||
| Sort(newOrdering, global, | ||
| Project(child.output ++ namedExpr, child))) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| /** | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -66,7 +66,7 @@ case class Rand(seed: Long) extends RDG { | |
| val rngTerm = ctx.freshName("rng") | ||
| val className = classOf[XORShiftRandom].getName | ||
| ctx.addMutableState(className, rngTerm, | ||
| s"$rngTerm = new $className($seed + org.apache.spark.TaskContext.getPartitionId());") | ||
| s"$rngTerm = new $className(${seed}L + org.apache.spark.TaskContext.getPartitionId());") | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is an existing and small bug, sometimes |
||
| ev.isNull = "false" | ||
| s""" | ||
| final ${ctx.javaType(dataType)} ${ev.primitive} = $rngTerm.nextDouble(); | ||
|
|
@@ -89,7 +89,7 @@ case class Randn(seed: Long) extends RDG { | |
| val rngTerm = ctx.freshName("rng") | ||
| val className = classOf[XORShiftRandom].getName | ||
| ctx.addMutableState(className, rngTerm, | ||
| s"$rngTerm = new $className($seed + org.apache.spark.TaskContext.getPartitionId());") | ||
| s"$rngTerm = new $className(${seed}L + org.apache.spark.TaskContext.getPartitionId());") | ||
| ev.isNull = "false" | ||
| s""" | ||
| final ${ctx.javaType(dataType)} ${ev.primitive} = $rngTerm.nextGaussian(); | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This even introduce complicity.
I'm wondering what's the reason we should do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The origin motivation is adding a project to materialize nondeterministic expressions in ORDER BY to avoid extra evaluation that lead to wrong answer, see JIRA. In an offline discussion we decided to apply this rule for all still-need-evaluate expressions. But now I think it maybe overkill. @rxin What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most optimal way is we have a perfect cost model that can predict what we are trading off (network vs cpu). Minus that, I think just always projecting is the approach that makes more sense in most common cases, because:
The alternative, which is probably even better, is for the sorter itself to always project out the sort key. It might make more sense there, but is slightly more complicated to write I think.