[SPARK-9251][SQL] do not order by expressions which still need evaluation #7593

cloud-fan · 2015-07-22T09:04:11Z

Following an offline discussion with @rxin: it's odd to be computing expressions while sorting, so during execution we should only order by bound references.
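To make the proposal concrete, here is a minimal sketch of the rewrite on a toy plan model. Everything in it (Expr, Plan, rewrite, the _sortKeyN names) is hypothetical and far simpler than Catalyst's real classes. It only illustrates the idea: evaluate the sort expression in a Project below the Sort, order by the resulting attribute, and put a Project on top to restore the original output.

```scala
object SortProjectionSketch {
  // Toy stand-ins for expressions; hypothetical, not Spark's classes.
  sealed trait Expr
  case class Attr(name: String) extends Expr            // a column that is already available
  case class Lit(value: Int) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr  // still needs evaluation

  // Toy logical plans; `output` lists the column names a plan produces.
  sealed trait Plan { def output: Seq[String] }
  case class Relation(output: Seq[String]) extends Plan
  case class Project(columns: Seq[(String, Expr)], child: Plan) extends Plan {
    def output: Seq[String] = columns.map(_._1)
  }
  case class Sort(order: Seq[Expr], child: Plan) extends Plan {
    def output: Seq[String] = child.output
  }

  // Rewrites Sort(expr) into Project(original columns, Sort(attr, Project(columns + expr))).
  // A Sort that already orders by plain attributes is returned unchanged.
  def rewrite(plan: Plan): Plan = plan match {
    case Sort(order, child) if order.exists(!_.isInstanceOf[Attr]) =>
      var extra = Vector.empty[(String, Expr)]
      val newOrder = order.map {
        case a: Attr => a
        case e =>
          // The generated name is a placeholder chosen for this sketch.
          val name = s"_sortKey${extra.size}"
          extra :+= (name -> e)
          Attr(name)
      }
      val passThrough = child.output.map(n => n -> (Attr(n): Expr))
      val inner = Project(passThrough ++ extra, child)
      // The outer Project drops the helper columns again.
      Project(passThrough, Sort(newOrder, inner))
    case other => other
  }
}
```

With this model, sorting a relation with columns a and b by a + 1 yields a plan whose output is still a and b, while the Sort node itself only references the helper attribute.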

cloud-fan · 2015-07-22T09:05:12Z

cc @rxin @yhuai

cloud-fan · 2015-07-22T09:06:46Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/random.scala

This is a small existing bug: the seed is sometimes too large to be represented as an int literal, so append an L to make it a long literal.
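As an illustration of the literal issue, here is a small sketch of rendering a seed into generated Java source. The helper names and the generated statement are hypothetical placeholders, not the code in random.scala. The point is only that a value outside the Int range is not a valid Java literal unless it carries the L suffix.

```scala
object SeedLiteralSketch {
  // Hypothetical helper: renders a seed for inclusion in generated Java source.
  // The trailing L makes it a long literal, so values beyond Int range still compile.
  def seedLiteral(seed: Long): String = s"${seed}L"

  // Hypothetical generated statement; the variable name and Random class are placeholders.
  def initStatement(seed: Long): String =
    s"java.util.Random rng = new java.util.Random(${seedLiteral(seed)});"
}
```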

SparkQA · 2015-07-22T09:11:21Z

Test build #38064 has finished for PR 7593 at commit 0f9b6da.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2015-07-22T09:24:41Z

retest this please.

SparkQA · 2015-07-22T09:26:44Z

Test build #56 has finished for PR 7593 at commit 0f9b6da.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class FormatString(children: Expression*) extends Expression with ImplicitCastInputTypes

SparkQA · 2015-07-22T09:30:53Z

Test build #38065 has finished for PR 7593 at commit 0f9b6da.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2015-07-22T14:33:57Z

retest this please.

SparkQA · 2015-07-22T14:37:02Z

Test build #60 has finished for PR 7593 at commit 0f9b6da.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-22T14:37:35Z

Test build #38079 has finished for PR 7593 at commit 0f9b6da.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2015-07-22T18:02:15Z

cc @yhuai can you review this?

SparkQA · 2015-07-22T18:14:09Z

Test build #1165 has finished for PR 7593 at commit 0f9b6da.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2015-07-22T19:16:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

Let's add a comment here explaining that we need a Project at the top to get the expected output attributes.

SparkQA · 2015-07-23T04:17:11Z

Test build #38156 has finished for PR 7593 at commit 9e2c1f6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2015-07-23T04:25:58Z

retest this please

SparkQA · 2015-07-23T05:57:09Z

Test build #38166 has finished for PR 7593 at commit 9e2c1f6.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class TrainValidationSplit(override val uid: String) extends Estimator[TrainValidationSplitModel]
- case class UnresolvedFunction(
- case class Average(child: Expression) extends AlgebraicAggregate
- case class Count(child: Expression) extends AlgebraicAggregate
- case class First(child: Expression) extends AlgebraicAggregate
- case class Last(child: Expression) extends AlgebraicAggregate
- case class Max(child: Expression) extends AlgebraicAggregate
- case class Min(child: Expression) extends AlgebraicAggregate
- case class Sum(child: Expression) extends AlgebraicAggregate
- abstract class AlgebraicAggregate extends AggregateFunction2 with Serializable
- implicit class RichAttribute(a: AttributeReference)
- trait AggregateExpression1 extends AggregateExpression
- trait PartialAggregate1 extends AggregateExpression1
- case class Min(child: Expression) extends UnaryExpression with PartialAggregate1
- case class MinFunction(expr: Expression, base: AggregateExpression1) extends AggregateFunction1
- case class Max(child: Expression) extends UnaryExpression with PartialAggregate1
- case class MaxFunction(expr: Expression, base: AggregateExpression1) extends AggregateFunction1
- case class Count(child: Expression) extends UnaryExpression with PartialAggregate1
- case class CountFunction(expr: Expression, base: AggregateExpression1) extends AggregateFunction1
- case class CountDistinct(expressions: Seq[Expression]) extends PartialAggregate1
- case class CollectHashSet(expressions: Seq[Expression]) extends AggregateExpression1
- case class CombineSetsAndCount(inputSet: Expression) extends AggregateExpression1
- case class Average(child: Expression) extends UnaryExpression with PartialAggregate1
- case class AverageFunction(expr: Expression, base: AggregateExpression1)
- case class Sum(child: Expression) extends UnaryExpression with PartialAggregate1
- case class SumFunction(expr: Expression, base: AggregateExpression1) extends AggregateFunction1
- case class CombineSum(child: Expression) extends AggregateExpression1
- case class CombineSumFunction(expr: Expression, base: AggregateExpression1)
- case class SumDistinct(child: Expression) extends UnaryExpression with PartialAggregate1
- case class SumDistinctFunction(expr: Expression, base: AggregateExpression1)
- case class CombineSetsAndSum(inputSet: Expression, base: Expression) extends AggregateExpression1
- case class First(child: Expression) extends UnaryExpression with PartialAggregate1
- case class FirstFunction(expr: Expression, base: AggregateExpression1) extends AggregateFunction1
- case class Last(child: Expression) extends UnaryExpression with PartialAggregate1
- case class LastFunction(expr: Expression, base: AggregateExpression1) extends AggregateFunction1
- case class CreateArray(children: Seq[Expression]) extends Expression
- case class CreateStruct(children: Seq[Expression]) extends Expression
- case class CreateNamedStruct(children: Seq[Expression]) extends Expression
- case class Aggregate2Sort(
- case class FinalAndCompleteAggregate2Sort(
- class GroupingIterator(
- class PartialSortAggregationIterator(
- class PartialMergeSortAggregationIterator(
- class FinalSortAggregationIterator(
- class FinalAndCompleteSortAggregationIterator(
- abstract class UserDefinedAggregateFunction extends Serializable
- case class ScalaUDAF(

SparkQA · 2015-07-23T06:00:47Z

Test build #74 has finished for PR 7593 at commit 9e2c1f6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2015-07-23T08:08:12Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

Is it possible that we will have multiple conditions that need an alias?

It's definitely possible, but the alias name doesn't matter here: we call toAttribute later, which binds the attribute by expression id.
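A minimal sketch of why a shared alias name is harmless, assuming attributes are identified by a unique expression id rather than by name. The classes below (ExprId, AttributeRef, Alias) and the _sortKey name are hypothetical simplifications written for this example, not Catalyst's definitions.

```scala
object AliasIdSketch {
  import java.util.concurrent.atomic.AtomicLong

  final case class ExprId(id: Long)
  final case class AttributeRef(name: String, exprId: ExprId)

  // `childSql` stands in for the aliased expression.
  final case class Alias(childSql: String, name: String, exprId: ExprId) {
    // The attribute carries the alias's id, so references resolve by id, not by name.
    def toAttribute: AttributeRef = AttributeRef(name, exprId)
  }

  private val counter = new AtomicLong(0L)

  // Every alias gets a fresh id, even when the name repeats.
  def alias(childSql: String, name: String): Alias =
    Alias(childSql, name, ExprId(counter.getAndIncrement()))
}
```

Two aliases created with the same name still produce distinct attributes, so several aliased sort conditions can coexist in one Project.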

cloud-fan · 2015-07-23T10:50:48Z

@rxin, I'm wondering whether we should do this for all kinds of expressions. We copy rows before sorting; with this change, sorting by a + 1 adds an extra column, which increases the data size for the sort and may add I/O pressure for external sort.

davies · 2015-07-23T15:46:51Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

This even introduces complexity.

I'm wondering what the reason for doing this is.

The original motivation was to add a Project that materializes nondeterministic expressions in ORDER BY, avoiding extra evaluations that lead to wrong answers (see the JIRA). In an offline discussion we decided to apply this rule to all expressions that still need evaluation, but now I think it may be overkill. @rxin, what do you think?

The optimal approach would be a perfect cost model that can predict what we are trading off (network vs. CPU). Absent that, I think always projecting is the approach that makes the most sense in the common cases, because:

- It is hard to quantify the difference.
- I/O (network, disk) is rarely the bottleneck here, especially with more SSDs and 10Gbps networks.
- Most of the time ORDER BY just orders by a field, and this won't hurt that case.
- If there is a complex expression, evaluating it many times during sorting is bad.

The alternative, which is probably even better, is for the sorter itself to always project out the sort key. It might make more sense there, but I think it is slightly more complicated to write.
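The CPU side of this trade-off can be sketched by counting key evaluations. The example below is a toy on plain integers with hypothetical helper names. It does not model Spark's sorter, and the extra column's memory and I/O cost is not measured. It only shows that recomputing the key inside the comparator evaluates it more often than projecting it once per row.

```scala
object SortKeyEvalSketch {
  // Recomputes the key on every comparison; returns (sorted rows, key evaluations).
  def sortRecomputing(rows: Seq[Int], key: Int => Int): (Seq[Int], Int) = {
    var evals = 0
    def k(r: Int): Int = { evals += 1; key(r) }
    val sorted = rows.sortWith((x, y) => k(x) < k(y))
    (sorted, evals)
  }

  // Projects the key once per row, sorts on the stored key, then drops it again.
  def sortProjected(rows: Seq[Int], key: Int => Int): (Seq[Int], Int) = {
    var evals = 0
    val withKey = rows.map { r => evals += 1; (key(r), r) }
    val sorted = withKey.sortWith(_._1 < _._1).map(_._2)
    (sorted, evals)
  }
}
```

Both variants return the same ordering. The projected variant evaluates the key exactly once per row, at the price of carrying one extra value per row through the sort.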

rxin · 2015-07-25T08:20:59Z

@cloud-fan it would be great to add a unit test for this analysis rule too.

SparkQA · 2015-07-25T16:02:58Z

Test build #38434 has finished for PR 7593 at commit ab811b7.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-25T16:38:05Z

Test build #38436 has finished for PR 7593 at commit b2a2c8c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-25T18:18:05Z

Test build #38435 has finished for PR 7593 at commit d9f0b6e.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2015-07-25T18:36:42Z

Test build #38439 has finished for PR 7593 at commit caa7dfd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2015-07-25T21:02:28Z

cc @yhuai for review

SparkQA · 2015-07-26T04:15:00Z

Test build #38446 has finished for PR 7593 at commit 80029ac.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2015-07-26T16:34:19Z

In another PR, I improved the newly added PullOutNondeterministic rule so that it also works for Sort. As a result, this PR no longer makes Sort correct; it is more of an optimization. Should we put it in the Optimizer? cc @rxin

yhuai · 2015-07-29T05:48:18Z

LGTM. Will merge it once it passes the test.

SparkQA · 2015-07-29T07:05:42Z

Test build #38803 has finished for PR 7593 at commit 7b1bef7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2015-07-29T07:08:38Z

Thanks - I've merged this.

marmbrus · 2015-07-29T18:30:09Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

This check is probably more expensive than just doing the transformation always. If it's a no-op, we will detect that through reference equality.

Maybe also add a test to make sure we don't project unnecessarily when there is an alias?
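A sketch of the general idea behind detecting a no-op through reference equality, on a hypothetical toy tree (Node, Leaf, Branch, and transform are made up for this example and are not Catalyst's TreeNode API). The transform hands back the very same instance when the rule changes nothing, so the caller can compare with eq instead of running a separate check up front.

```scala
object NoopDetectionSketch {
  sealed trait Node
  case class Leaf(name: String) extends Node
  case class Branch(children: List[Node]) extends Node

  // Applies `rule` bottom-up. Returns the original instance when nothing changed.
  def transform(node: Node)(rule: PartialFunction[Node, Node]): Node = {
    val withNewChildren: Node = node match {
      case b @ Branch(children) =>
        val newChildren = children.map(c => transform(c)(rule))
        // Reuse `b` when every child came back as the same instance.
        val unchanged = newChildren.zip(children).forall { case (n, o) => n eq o }
        if (unchanged) b else Branch(newChildren)
      case leaf => leaf
    }
    // Fall back to the node itself when the rule does not apply.
    rule.applyOrElse(withNewChildren, (n: Node) => n)
  }
}
```

A caller can then write transform(tree)(rule) eq tree to learn whether the rule did anything, without inspecting the tree beforehand.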

cloud-fan reviewed Jul 22, 2015

yhuai reviewed Jul 22, 2015

cloud-fan force-pushed the sort branch from 0f9b6da to 9e2c1f6 Compare July 23, 2015 03:35

viirya reviewed Jul 23, 2015

davies reviewed Jul 23, 2015

cloud-fan force-pushed the sort branch 2 times, most recently from d9f0b6e to b2a2c8c Compare July 25, 2015 16:16

cloud-fan force-pushed the sort branch from b2a2c8c to caa7dfd Compare July 25, 2015 16:56

cloud-fan force-pushed the sort branch from caa7dfd to 80029ac Compare July 26, 2015 02:41

cloud-fan mentioned this pull request Jul 26, 2015

[SPARK-8608][SPARK-8609][SPARK-9083][SQL] reset mutable states of nondeterministic expression before evaluation and fix PullOutNondeterministic #7674

Closed

cloud-fan added 2 commits July 29, 2015 12:49

do not order by expressions which still need evaluation

289bee0

add more comments

daf206d

cloud-fan force-pushed the sort branch from 80029ac to 7b1bef7 Compare July 29, 2015 05:22

add test

7b1bef7

cloud-fan changed the title ~~[SPARK-9251][SPARK-9083][SQL] do not order by expressions which still need evaluation~~ [SPARK-9251][SQL] do not order by expressions which still need evaluation Jul 29, 2015

asfgit closed this in 708794e Jul 29, 2015

cloud-fan deleted the sort branch July 29, 2015 07:35

marmbrus reviewed Jul 29, 2015
