Conversation

@cloud-fan
Contributor

Per an offline discussion with @rxin: it is odd to be evaluating expressions while sorting. During execution we should only order by bound references.
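To illustrate the idea (this is not code from the patch), the sketch below first projects the sort expression into each row and then sorts by reading a plain ordinal, so nothing is evaluated during the sort itself. Modelling rows as tuples and the helper names `project_sort_key` and `sort_by_ordinal` are assumptions made for the example.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, ...]


def project_sort_key(rows: List[Row], expr: Callable[[Row], int]) -> List[Row]:
    """Append the evaluated sort expression to each row as an extra column."""
    return [row + (expr(row),) for row in rows]


def sort_by_ordinal(rows: List[Row], ordinal: int) -> List[Row]:
    """Sort by reading an already-computed column; no expression is evaluated."""
    return sorted(rows, key=lambda row: row[ordinal])


rows = [(3,), (1,), (2,)]
# Hypothetical sort expression "a + 1", where a is column 0.
projected = project_sort_key(rows, lambda row: row[0] + 1)
result = sort_by_ordinal(projected, ordinal=1)
```

The sort step only needs to know an ordinal, which is roughly what ordering by a bound reference means in this toy model.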

@cloud-fan
Contributor Author

cc @rxin @yhuai

Contributor Author

This fixes a small existing bug: the seed is sometimes too large to be represented as an int literal, so append an L to make it a long literal.
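A minimal sketch of the literal problem, assuming the seed is spliced as text into generated JVM source (the thread does not show the actual code). The helper names `fits_int_literal` and `as_long_literal` are hypothetical.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1


def fits_int_literal(seed: int) -> bool:
    """True if the value can be written as a 32-bit int literal on the JVM."""
    return INT_MIN <= seed <= INT_MAX


def as_long_literal(seed: int) -> str:
    """Render the seed as a long literal by appending the L suffix."""
    if not LONG_MIN <= seed <= LONG_MAX:
        raise ValueError("seed does not fit in a 64-bit long")
    return f"{seed}L"


big_seed = 2**40  # too large for an int literal
literal = as_long_literal(big_seed)
```

Appending the suffix unconditionally is harmless for small seeds and avoids a compile error for large ones.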

@SparkQA

SparkQA commented Jul 22, 2015

Test build #38064 has finished for PR 7593 at commit 0f9b6da.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please.

@SparkQA

SparkQA commented Jul 22, 2015

Test build #56 has finished for PR 7593 at commit 0f9b6da.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class FormatString(children: Expression*) extends Expression with ImplicitCastInputTypes

@SparkQA

SparkQA commented Jul 22, 2015

Test build #38065 has finished for PR 7593 at commit 0f9b6da.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please.

@SparkQA

SparkQA commented Jul 22, 2015

Test build #60 has finished for PR 7593 at commit 0f9b6da.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 22, 2015

Test build #38079 has finished for PR 7593 at commit 0f9b6da.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jul 22, 2015

cc @yhuai can you review this?

@SparkQA

SparkQA commented Jul 22, 2015

Test build #1165 has finished for PR 7593 at commit 0f9b6da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Let's add a comment here explaining that we need a Project at the top to get the expected output attributes.
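A small sketch of why the top-level projection is needed, under the assumption that the sort key is materialized as an extra hidden column. Rows are modelled as dicts, and the column name `_key` and the function `sort_with_hidden_key` are made up for the example.

```python
from typing import Dict, List

Row = Dict[str, int]


def sort_with_hidden_key(rows: List[Row], output: List[str]) -> List[Row]:
    # 1. Inner projection: add the evaluated sort expression (a + 1) as a hidden column.
    widened = [{**row, "_key": row["a"] + 1} for row in rows]
    # 2. Sort by the precomputed column only.
    ordered = sorted(widened, key=lambda row: row["_key"])
    # 3. Top projection: keep only the columns the caller originally expected.
    return [{name: row[name] for name in output} for row in ordered]


result = sort_with_hidden_key(
    [{"a": 2, "b": 20}, {"a": 1, "b": 10}],
    output=["a", "b"],
)
```

Without step 3 the hidden column would leak into the output schema.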

@SparkQA

SparkQA commented Jul 23, 2015

Test build #38156 has finished for PR 7593 at commit 9e2c1f6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Jul 23, 2015

Test build #38166 has finished for PR 7593 at commit 9e2c1f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TrainValidationSplit(override val uid: String) extends Estimator[TrainValidationSplitModel]
    • case class UnresolvedFunction(
    • case class Average(child: Expression) extends AlgebraicAggregate
    • case class Count(child: Expression) extends AlgebraicAggregate
    • case class First(child: Expression) extends AlgebraicAggregate
    • case class Last(child: Expression) extends AlgebraicAggregate
    • case class Max(child: Expression) extends AlgebraicAggregate
    • case class Min(child: Expression) extends AlgebraicAggregate
    • case class Sum(child: Expression) extends AlgebraicAggregate
    • abstract class AlgebraicAggregate extends AggregateFunction2 with Serializable
    • implicit class RichAttribute(a: AttributeReference)
    • trait AggregateExpression1 extends AggregateExpression
    • trait PartialAggregate1 extends AggregateExpression1
    • case class Min(child: Expression) extends UnaryExpression with PartialAggregate1
    • case class MinFunction(expr: Expression, base: AggregateExpression1) extends AggregateFunction1
    • case class Max(child: Expression) extends UnaryExpression with PartialAggregate1
    • case class MaxFunction(expr: Expression, base: AggregateExpression1) extends AggregateFunction1
    • case class Count(child: Expression) extends UnaryExpression with PartialAggregate1
    • case class CountFunction(expr: Expression, base: AggregateExpression1) extends AggregateFunction1
    • case class CountDistinct(expressions: Seq[Expression]) extends PartialAggregate1
    • case class CollectHashSet(expressions: Seq[Expression]) extends AggregateExpression1
    • case class CombineSetsAndCount(inputSet: Expression) extends AggregateExpression1
    • case class Average(child: Expression) extends UnaryExpression with PartialAggregate1
    • case class AverageFunction(expr: Expression, base: AggregateExpression1)
    • case class Sum(child: Expression) extends UnaryExpression with PartialAggregate1
    • case class SumFunction(expr: Expression, base: AggregateExpression1) extends AggregateFunction1
    • case class CombineSum(child: Expression) extends AggregateExpression1
    • case class CombineSumFunction(expr: Expression, base: AggregateExpression1)
    • case class SumDistinct(child: Expression) extends UnaryExpression with PartialAggregate1
    • case class SumDistinctFunction(expr: Expression, base: AggregateExpression1)
    • case class CombineSetsAndSum(inputSet: Expression, base: Expression) extends AggregateExpression1
    • case class First(child: Expression) extends UnaryExpression with PartialAggregate1
    • case class FirstFunction(expr: Expression, base: AggregateExpression1) extends AggregateFunction1
    • case class Last(child: Expression) extends UnaryExpression with PartialAggregate1
    • case class LastFunction(expr: Expression, base: AggregateExpression1) extends AggregateFunction1
    • case class CreateArray(children: Seq[Expression]) extends Expression
    • case class CreateStruct(children: Seq[Expression]) extends Expression
    • case class CreateNamedStruct(children: Seq[Expression]) extends Expression
    • case class Aggregate2Sort(
    • case class FinalAndCompleteAggregate2Sort(
    • class GroupingIterator(
    • class PartialSortAggregationIterator(
    • class PartialMergeSortAggregationIterator(
    • class FinalSortAggregationIterator(
    • class FinalAndCompleteSortAggregationIterator(
    • abstract class UserDefinedAggregateFunction extends Serializable
    • case class ScalaUDAF(

@SparkQA

SparkQA commented Jul 23, 2015

Test build #74 has finished for PR 7593 at commit 9e2c1f6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Is it possible that multiple conditions will need to be aliased?

Contributor Author

It's definitely possible, but the alias name doesn't matter here: we call toAttribute later, so the reference is bound by expression ID rather than by name.
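The point about names can be sketched with a toy model (not Catalyst's real classes): two aliases share a name but carry distinct IDs, and the attributes derived from them stay distinguishable. `Alias`, `Attribute`, `make_alias` and the name `_sortCondition` are all illustrative assumptions.

```python
import itertools
from dataclasses import dataclass

_next_id = itertools.count()


@dataclass(frozen=True)
class Attribute:
    name: str
    expr_id: int


@dataclass(frozen=True)
class Alias:
    child: str  # the aliased expression, kept as text for this sketch
    name: str
    expr_id: int

    def to_attribute(self) -> Attribute:
        """The derived attribute carries the alias's ID, not just its name."""
        return Attribute(self.name, self.expr_id)


def make_alias(child: str, name: str) -> Alias:
    """Create an alias with a fresh, unique expression ID."""
    return Alias(child, name, next(_next_id))


first = make_alias("a + 1", "_sortCondition")
second = make_alias("b * 2", "_sortCondition")
attrs = [first.to_attribute(), second.to_attribute()]
```

Because lookups go through the ID, reusing one placeholder name for several sort conditions is unambiguous in this model.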

@cloud-fan
Contributor Author

@rxin, I'm wondering whether we should do this for all kinds of expressions. We copy rows before sorting; with this change, sorting by a + 1 adds an extra column, which increases the data size for the sort and may add I/O pressure for external sort.

Contributor

This even introduces complexity.

I'm wondering what the reason for doing this is?

Contributor Author

The original motivation was to add a Project that materializes nondeterministic expressions in ORDER BY, avoiding the extra evaluations that lead to wrong answers (see the JIRA). In an offline discussion we decided to apply this rule to all expressions that still need evaluation. But now I think it may be overkill. @rxin, what do you think?
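A toy illustration of the correctness issue, assuming the sort key is something like a random value. Re-drawing the key on every comparison gives an order that corresponds to no per-row value, while materializing it once per row gives a consistent order. The function names are hypothetical.

```python
import functools
import random
from typing import List, Tuple


def sort_reevaluating(values: List[str], rng: random.Random) -> List[str]:
    """Anti-pattern: a fresh random key is drawn on every comparison."""
    def compare(_left: str, _right: str) -> int:
        return -1 if rng.random() < rng.random() else 1
    return sorted(values, key=functools.cmp_to_key(compare))


def materialize_then_sort(values: List[str], rng: random.Random) -> List[Tuple[str, float]]:
    """Evaluate the nondeterministic key once per row, then sort by the stored value."""
    keyed = [(value, rng.random()) for value in values]
    return sorted(keyed, key=lambda pair: pair[1])


ordered = materialize_then_sort(["x", "y", "z"], random.Random(0))
keys = [key for _, key in ordered]
shuffled = sort_reevaluating(["x", "y", "z"], random.Random(1))
```

In the materialized version the output order can be checked against the stored keys; in the re-evaluating version there is nothing to check it against.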

Contributor

Ideally we would have a perfect cost model that can predict what we are trading off (network vs. CPU). Lacking that, I think always projecting is the approach that makes the most sense in the common cases, because:

  1. It is hard to quantify the difference.
  2. I/O (network, disk) is rarely the bottleneck here, especially with more SSDs and 10 Gbps networks.
  3. Most of the time order by is just ordering by a field, and this won't hurt that case.
  4. If there is a complex expression, doing the eval many times during sorting is bad.

The alternative, which is probably even better, is for the sorter itself to always project out the sort key. It might make more sense there, but is slightly more complicated to write I think.
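Point 4 can be made concrete with a rough sketch that counts how often a sort expression is evaluated. It compares a comparator that evaluates the expression on both sides of every comparison with a sort that projects the key once per row. `CountingExpr` and the two sort helpers are assumptions for illustration, not Spark code.

```python
import functools
from typing import Callable, List


class CountingExpr:
    """A stand-in for a 'complex' expression that records how often it runs."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, a: int) -> int:
        self.calls += 1
        return a * a + 1


def sort_evaluating_in_comparator(values: List[int], expr: Callable[[int], int]) -> List[int]:
    def compare(left: int, right: int) -> int:
        l, r = expr(left), expr(right)  # evaluated on every comparison
        return (l > r) - (l < r)
    return sorted(values, key=functools.cmp_to_key(compare))


def sort_after_projection(values: List[int], expr: Callable[[int], int]) -> List[int]:
    projected = [(expr(v), v) for v in values]  # evaluated once per row
    projected.sort(key=lambda pair: pair[0])
    return [v for _, v in projected]


values = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
in_comparator, projected_once = CountingExpr(), CountingExpr()
a = sort_evaluating_in_comparator(values, in_comparator)
b = sort_after_projection(values, projected_once)
```

The projected variant evaluates the expression exactly once per row; the comparator variant evaluates it twice per comparison, and a comparison sort needs at least n - 1 comparisons.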

@rxin
Contributor

rxin commented Jul 25, 2015

@cloud-fan it would be great to add a unit test for this analysis rule too.

@SparkQA

SparkQA commented Jul 25, 2015

Test build #38434 has finished for PR 7593 at commit ab811b7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan force-pushed the sort branch 2 times, most recently from d9f0b6e to b2a2c8c, on July 25, 2015 16:16

@SparkQA

SparkQA commented Jul 25, 2015

Test build #38436 has finished for PR 7593 at commit b2a2c8c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 25, 2015

Test build #38435 has finished for PR 7593 at commit d9f0b6e.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 25, 2015

Test build #38439 has finished for PR 7593 at commit caa7dfd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jul 25, 2015

cc @yhuai for review

@SparkQA

SparkQA commented Jul 26, 2015

Test build #38446 has finished for PR 7593 at commit 80029ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

In another PR, I improved the newly added PullOutNondeterministic rule so that it also works for Sort. So this PR no longer fixes the correctness of Sort; it is more of an optimization. Should we move it into the Optimizer? cc @rxin

@yhuai
Contributor

yhuai commented Jul 29, 2015

LGTM. Will merge it once it passes the test.

@cloud-fan changed the title from "[SPARK-9251][SPARK-9083][SQL] do not order by expressions which still need evaluation" to "[SPARK-9251][SQL] do not order by expressions which still need evaluation" on Jul 29, 2015

@SparkQA

SparkQA commented Jul 29, 2015

Test build #38803 has finished for PR 7593 at commit 7b1bef7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jul 29, 2015

Thanks - I've merged this.

@asfgit closed this in 708794e on Jul 29, 2015

@cloud-fan deleted the sort branch on July 29, 2015 07:35

Contributor

This check is probably more expensive than just always doing the transformation. If it's a no-op, we will detect that through reference equality.
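A hedged sketch of the reference-equality idea on a toy tree (not Catalyst's TreeNode): the transform returns the very same object when no rule fires, so the caller can detect a no-op with an identity check instead of pre-scanning the tree. `Node` and `transform` are illustrative names.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Node:
    name: str
    children: Tuple["Node", ...] = ()


def transform(node: Node, rule: Callable[[Node], Node]) -> Node:
    """Apply rule bottom-up, returning the same object when nothing changes."""
    new_children = tuple(transform(child, rule) for child in node.children)
    unchanged = all(new is old for new, old in zip(new_children, node.children))
    current = node if unchanged else Node(node.name, new_children)
    return rule(current)


plan = Node("Sort", (Node("Relation"),))
same = transform(plan, lambda n: n)  # a rule that matches nothing
renamed = transform(plan, lambda n: Node("Scan") if n.name == "Relation" else n)
```

Here `same is plan` holds, so a caller can skip follow-up work cheaply, while `renamed` is a new tree.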

Contributor

Maybe also add a test to make sure we don't project unnecessarily when there is an alias?

7 participants