[SPARK-7289] handle project -> limit -> sort efficiently #6780
Conversation
Test build #34755 has finished for PR 6780 at commit
LimitPushDown should run after ColumnPruning. For something like Limit(Project(Sort(...))), we should try to push down Project through Sort first.
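To make the ordering point concrete, here is a toy sketch (plain Scala, not Catalyst code; all type and rule names are made up for illustration): pruning first lets the Project sink below the Sort, after which the Limit sits directly on top of the Sort and can be recognized as a top-n.

```scala
// Toy plan algebra, only to show the rewrite order being discussed.
sealed trait Plan
case class Scan(cols: Seq[String]) extends Plan
case class Project(cols: Seq[String], child: Plan) extends Plan
case class Sort(by: String, child: Plan) extends Plan
case class Limit(n: Int, child: Plan) extends Plan

object RuleOrderDemo extends App {
  // Push a column-pruning Project below a Sort when the sort key survives.
  def pushProjectThroughSort(p: Plan): Plan = p match {
    case Project(cols, Sort(by, child)) if cols.contains(by) =>
      Sort(by, Project(cols, child))
    case other => other
  }

  val plan = Limit(10, Project(Seq("a"), Sort("a", Scan(Seq("a", "b")))))
  val pruned = plan match {
    case Limit(n, child) => Limit(n, pushProjectThroughSort(child))
    case other           => other
  }
  // Prints: Limit(10,Sort(a,Project(List(a),Scan(List(a, b)))))
  println(pruned)
}
```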
Rules within a single batch shouldn't be sensitive to execution order. Especially for FixedPoint batches.
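For reference, a minimal sketch of what a FixedPoint batch looks like in a Catalyst `RuleExecutor` (the rule bodies below are no-op stand-ins, not real optimizer rules): the executor keeps re-running every rule in the batch until the plan stops changing, which is why the listed order should not affect the final plan.

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor}

// No-op stand-ins for real rules such as ColumnPruning.
object RuleA extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan }
object RuleB extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan }

object SketchOptimizer extends RuleExecutor[LogicalPlan] {
  // FixedPoint(100): rerun RuleA and RuleB up to 100 times, stopping early
  // once an iteration leaves the plan unchanged.
  val batches = Batch("Sketch", FixedPoint(100), RuleA, RuleB) :: Nil
}
```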
Test build #34759 has finished for PR 6780 at commit
What's the relationship between this PR and SPARK-7289?
This is for SPARK-7289. As we discussed in #5821, we need an optimizer rule like this.
cc @adrian-wang
Test build #34890 has finished for PR 6780 at commit

Test build #34892 has finished for PR 6780 at commit

Test build #34901 has finished for PR 6780 at commit

Test build #34904 has finished for PR 6780 at commit
I'm confused that we have to add @transient here to avoid NotSerializableException. It looks to me like TakeOrderedAndProject should run on the driver side, and it doesn't pass its reference to RDD functions. Sorry if I missed something here.
How did you see errors here? I removed @transient and `sbt sql/test` still works for me.
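For readers following the thread, here is a self-contained sketch (plain Scala with Java serialization; the class names are made up, this is not Spark code) of what @transient changes: a transient field is skipped during serialization and comes back as null, which is why it can paper over a NotSerializableException.

```scala
import java.io._

// Stands in for a field whose type cannot be serialized.
class NonSerializableThing

// Without @transient on `helper`, writeObject below would throw
// NotSerializableException; with it, the field is skipped during
// serialization and restored as null on the other side.
class Operator(@transient val helper: NonSerializableThing, val limit: Int)
  extends Serializable

object TransientDemo extends App {
  val out = new ByteArrayOutputStream()
  new ObjectOutputStream(out).writeObject(new Operator(new NonSerializableThing, 10))

  val in = new ObjectInputStream(new ByteArrayInputStream(out.toByteArray))
  val copy = in.readObject().asInstanceOf[Operator]
  println(copy.helper) // null: the transient field was not serialized
  println(copy.limit)  // 10
}
```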
How about the …

Hmm...this PR doesn't change the handling of …
Test build #35230 has finished for PR 6780 at commit
cc @marmbrus, …
Test build #35395 has finished for PR 6780 at commit

Test build #35400 has finished for PR 6780 at commit
retest this please.
Test build #35406 has finished for PR 6780 at commit

Test build #35408 has finished for PR 6780 at commit

Test build #35410 has finished for PR 6780 at commit
I made these changes to clean the closure and avoid referencing `$out`, so that we don't need to add a lot of @transient annotations. However, it fails the test `CliSuite.Commands using SerDe provided in --jars` with a timeout. Does anybody know the reason?
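A minimal sketch of the closure-cleaning pattern being described (the class is hypothetical; assumes Spark on the classpath): referencing a field inside an RDD closure captures the whole enclosing object, while copying the field to a local val first keeps the closure small and serializable.

```scala
import org.apache.spark.rdd.RDD

class SomeOperator(val limit: Int) {
  // Captures `this`: the closure reads the field through the enclosing
  // object, so the whole SomeOperator must be serializable.
  def capturesThis(rdd: RDD[Int]): RDD[Int] = rdd.map(_ % limit)

  // Captures only an Int: the local copy breaks the reference to `this`.
  def capturesLocal(rdd: RDD[Int]): RDD[Int] = {
    val localLimit = limit
    rdd.map(_ % localLimit)
  }
}
```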
It appears the change is breaking something in the serialization debugger.
```
15/06/22 22:25:41.020 main WARN SerializationDebugger: Exception in serialization debugger
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:248)
at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:158)
at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:107)
at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:166)
at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:107)
at org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:66)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:319)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:312)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:139)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:114)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:186)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:125)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:263)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:89)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:89)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:88)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:986)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:986)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:143)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:127)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:50)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:786)
at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:61)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:283)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:218)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:621)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
at java.io.ObjectStreamClass$FieldReflector.getObjFieldValues(ObjectStreamClass.java:2050)
at java.io.ObjectStreamClass.getObjFieldValues(ObjectStreamClass.java:1252)
```
That said, it would be good if you could limit the changes in a given PR to the task at hand, and do refactorings in a separate PR. It's okay when they are very minor (e.g., fixing spelling), but grouping them together makes the PR harder to review, harder to backport, and causes it to get blocked on unrelated problems (such as this serialization issue).
Test build #35558 has finished for PR 6780 at commit

Test build #35634 has finished for PR 6780 at commit
retest this please.

Test build #35652 has finished for PR 6780 at commit
Thanks! Merged to master.
… branch 1.4

The bug fixed by SPARK-7289 is a pretty serious one (Spark SQL generates wrong results). We should backport the fix to branch 1.4 (#6780). Also, we need to backport the fix of `TakeOrderedAndProject` as well (#8179).

Author: Wenchen Fan <[email protected]>
Author: Yin Huai <[email protected]>

Closes #8252 from yhuai/backport7289And9949.
Make the `TakeOrdered` strategy and operator more general, such that it can optionally handle a projection when necessary.
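Conceptually (a hedged sketch, not the operator's actual implementation; the function name and signature below are made up), the generalized operator amounts to fetching the top-n rows on the driver and then projecting only those rows, instead of projecting or sorting the full dataset:

```scala
import org.apache.spark.rdd.RDD

// Top-n by the implicit Ordering, then project just those n rows on the
// driver. When there is no projection, `identity` can serve as `project`.
def takeOrderedAndProject[T: Ordering, U](
    rdd: RDD[T],
    limit: Int,
    project: T => U): Seq[U] =
  rdd.takeOrdered(limit).toSeq.map(project)
```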