Conversation

@scwf
Contributor

@scwf scwf commented Apr 30, 2015

Optimize the following SQL:

select key from (select * from testData order by value) t limit 5

Before this PR:

== Parsed Logical Plan ==
'Limit 5
 'Project ['key]
  'Subquery t
   'Sort ['value ASC], true
    'Project [*]
     'UnresolvedRelation [testData], None

== Analyzed Logical Plan ==
Limit 5
 Project [key#0]
  Subquery t
   Sort [value#0 ASC], true
    Project [key#0,value#1]
     Subquery testData
      LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 

== Optimized Logical Plan ==
Limit 5
 Project [key#0]
  Sort [value#0 ASC], true
   LogicalRDD [key#0,value#1], MapPartitionsRDD[1]

== Physical Plan ==
Limit 5
 Project [key#0]
  Sort [value#0 ASC], true
   Exchange (RangePartitioning [value#0 ASC], 5), []
    PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] 

After this PR:

== Parsed Logical Plan ==
'Limit 5
 'Project ['key]
  'Subquery t
   'Sort ['value ASC], true
    'Project [*]
     'UnresolvedRelation [testData], None

== Analyzed Logical Plan ==
Limit 5
 Project [key#0]
  Subquery t
   Sort [value#0 ASC], true
    Project [key#0,value#1]
     Subquery testData
      LogicalRDD [key#0,value#1], MapPartitionsRDD[1]

== Optimized Logical Plan ==
Project [key#0]
 Limit 5
  Sort [value#0 ASC], true
   LogicalRDD [key#0,value#1], MapPartitionsRDD[1] 

== Physical Plan ==
Project [key#0]
 TakeOrdered 5, [value#0 ASC]
  PhysicalRDD [key#0,value#1], MapPartitionsRDD[1]

With this rule we combine Limit and Sort, so the query is planned as TakeOrdered, which avoids a total ordering of the data.
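A minimal sketch of what such a rule can look like, assuming the Spark 1.4-era Catalyst operators Limit, Project, and Sort (this is not the exact code in this patch): pull a column-pruning Project above the Limit so that Limit sits directly on a global Sort, which the planner already knows how to plan as TakeOrdered.

```scala
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{Limit, LogicalPlan, Project, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch only, not the patch itself.
// Rewrites Limit(Project(Sort)) into Project(Limit(Sort)) when the Project
// is pure column pruning, so that Limit sits directly on a global Sort and
// the planner can pick TakeOrdered (a distributed top-k) instead of a
// full range-partitioned sort followed by a Limit.
object CombineLimitSortSketch extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Limit(limitExpr, Project(projectList, sort @ Sort(_, true, _)))
        if projectList.forall(_.isInstanceOf[Attribute]) =>
      Project(projectList, Limit(limitExpr, sort))
  }
}
```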

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Apr 30, 2015

Test build #31474 has started for PR 5821 at commit b9769df.

@SparkQA

SparkQA commented May 1, 2015

Test build #31474 has finished for PR 5821 at commit b9769df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • "public class " + className + extendsText + " implements java.io.Serializable
    • class KMeansModel (
    • trait PMMLExportable
    • public static final class UNSAFE
  • This patch adds the following new dependencies:
    • RoaringBitmap-0.4.5.jar
    • akka-actor_2.10-2.3.4-spark.jar
    • akka-remote_2.10-2.3.4-spark.jar
    • akka-slf4j_2.10-2.3.4-spark.jar
    • aopalliance-1.0.jar
    • arpack_combined_all-0.1.jar
    • avro-1.7.7.jar
    • breeze-macros_2.10-0.11.2.jar
    • breeze_2.10-0.11.2.jar
    • chill-java-0.5.0.jar
    • chill_2.10-0.5.0.jar
    • commons-beanutils-1.7.0.jar
    • commons-beanutils-core-1.8.0.jar
    • commons-cli-1.2.jar
    • commons-codec-1.10.jar
    • commons-collections-3.2.1.jar
    • commons-compress-1.4.1.jar
    • commons-configuration-1.6.jar
    • commons-digester-1.8.jar
    • commons-httpclient-3.1.jar
    • commons-io-2.1.jar
    • commons-lang-2.5.jar
    • commons-lang3-3.3.2.jar
    • commons-math-2.1.jar
    • commons-math3-3.4.1.jar
    • commons-net-2.2.jar
    • compress-lzf-1.0.0.jar
    • config-1.2.1.jar
    • core-1.1.2.jar
    • curator-client-2.4.0.jar
    • curator-framework-2.4.0.jar
    • curator-recipes-2.4.0.jar
    • gmbal-api-only-3.0.0-b023.jar
    • grizzly-framework-2.1.2.jar
    • grizzly-http-2.1.2.jar
    • grizzly-http-server-2.1.2.jar
    • grizzly-http-servlet-2.1.2.jar
    • grizzly-rcm-2.1.2.jar
    • groovy-all-2.3.7.jar
    • guava-14.0.1.jar
    • guice-3.0.jar
    • hadoop-annotations-2.2.0.jar
    • hadoop-auth-2.2.0.jar
    • hadoop-client-2.2.0.jar
    • hadoop-common-2.2.0.jar
    • hadoop-hdfs-2.2.0.jar
    • hadoop-mapreduce-client-app-2.2.0.jar
    • hadoop-mapreduce-client-common-2.2.0.jar
    • hadoop-mapreduce-client-core-2.2.0.jar
    • hadoop-mapreduce-client-jobclient-2.2.0.jar
    • hadoop-mapreduce-client-shuffle-2.2.0.jar
    • hadoop-yarn-api-2.2.0.jar
    • hadoop-yarn-client-2.2.0.jar
    • hadoop-yarn-common-2.2.0.jar
    • hadoop-yarn-server-common-2.2.0.jar
    • ivy-2.4.0.jar
    • jackson-annotations-2.4.0.jar
    • jackson-core-2.4.4.jar
    • jackson-core-asl-1.8.8.jar
    • jackson-databind-2.4.4.jar
    • jackson-jaxrs-1.8.8.jar
    • jackson-mapper-asl-1.8.8.jar
    • jackson-module-scala_2.10-2.4.4.jar
    • jackson-xc-1.8.8.jar
    • jansi-1.4.jar
    • javax.inject-1.jar
    • javax.servlet-3.0.0.v201112011016.jar
    • javax.servlet-3.1.jar
    • javax.servlet-api-3.0.1.jar
    • jaxb-api-2.2.7.jar
    • jaxb-core-2.2.7.jar
    • jaxb-impl-2.2.7.jar
    • jcl-over-slf4j-1.7.10.jar
    • jersey-client-1.9.jar
    • jersey-core-1.9.jar
    • jersey-grizzly2-1.9.jar
    • jersey-guice-1.9.jar
    • jersey-json-1.9.jar
    • jersey-server-1.9.jar
    • jersey-test-framework-core-1.9.jar
    • jersey-test-framework-grizzly2-1.9.jar
    • jets3t-0.7.1.jar
    • jettison-1.1.jar
    • jetty-util-6.1.26.jar
    • jline-0.9.94.jar
    • jline-2.10.4.jar
    • jodd-core-3.6.3.jar
    • json4s-ast_2.10-3.2.10.jar
    • json4s-core_2.10-3.2.10.jar
    • json4s-jackson_2.10-3.2.10.jar
    • jsr305-1.3.9.jar
    • jtransforms-2.4.0.jar
    • jul-to-slf4j-1.7.10.jar
    • kryo-2.21.jar
    • log4j-1.2.17.jar
    • lz4-1.2.0.jar
    • management-api-3.0.0-b012.jar
    • mesos-0.21.0-shaded-protobuf.jar
    • metrics-core-3.1.0.jar
    • metrics-graphite-3.1.0.jar
    • metrics-json-3.1.0.jar
    • metrics-jvm-3.1.0.jar
    • minlog-1.2.jar
    • netty-3.8.0.Final.jar
    • netty-all-4.0.23.Final.jar
    • objenesis-1.2.jar
    • opencsv-2.3.jar
    • oro-2.0.8.jar
    • paranamer-2.6.jar
    • parquet-column-1.6.0rc3.jar
    • parquet-common-1.6.0rc3.jar
    • parquet-encoding-1.6.0rc3.jar
    • parquet-format-2.2.0-rc1.jar
    • parquet-generator-1.6.0rc3.jar
    • parquet-hadoop-1.6.0rc3.jar
    • parquet-jackson-1.6.0rc3.jar
    • pmml-agent-1.1.15.jar
    • pmml-model-1.1.15.jar
    • pmml-schema-1.1.15.jar
    • protobuf-java-2.4.1.jar
    • protobuf-java-2.5.0-spark.jar
    • py4j-0.8.2.1.jar
    • pyrolite-2.0.1.jar
    • quasiquotes_2.10-2.0.1.jar
    • reflectasm-1.07-shaded.jar
    • scala-compiler-2.10.4.jar
    • scala-library-2.10.4.jar
    • scala-reflect-2.10.4.jar
    • scalap-2.10.4.jar
    • scalatest_2.10-2.2.1.jar
    • slf4j-api-1.7.10.jar
    • slf4j-log4j12-1.7.10.jar
    • snappy-java-1.1.1.7.jar
    • spark-bagel_2.10-1.4.0-SNAPSHOT.jar
    • spark-catalyst_2.10-1.4.0-SNAPSHOT.jar
    • spark-core_2.10-1.4.0-SNAPSHOT.jar
    • spark-graphx_2.10-1.4.0-SNAPSHOT.jar
    • spark-launcher_2.10-1.4.0-SNAPSHOT.jar
    • spark-mllib_2.10-1.4.0-SNAPSHOT.jar
    • spark-network-common_2.10-1.4.0-SNAPSHOT.jar
    • spark-network-shuffle_2.10-1.4.0-SNAPSHOT.jar
    • spark-repl_2.10-1.4.0-SNAPSHOT.jar
    • spark-sql_2.10-1.4.0-SNAPSHOT.jar
    • spark-streaming_2.10-1.4.0-SNAPSHOT.jar
    • spark-unsafe_2.10-1.4.0-SNAPSHOT.jar
    • spire-macros_2.10-0.7.4.jar
    • spire_2.10-0.7.4.jar
    • stax-api-1.0.1.jar
    • stream-2.7.0.jar
    • tachyon-0.6.4.jar
    • tachyon-client-0.6.4.jar
    • uncommons-maths-1.2.2a.jar
    • unused-1.0.0.jar
    • xmlenc-0.52.jar
    • xz-1.0.jar
    • zookeeper-3.4.5.jar

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31474/
Test PASSed.

@scwf
Contributor Author

scwf commented May 1, 2015

/cc @rxin can you take a look at this?

Contributor

Maybe add some explanation saying it is more efficient, because it is expensive to do a total ordering in a distributed setting.
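For illustration only (not part of this PR), the same contrast exists at the RDD level: sortBy shuffles and sorts the whole dataset before the first k rows are taken, while takeOrdered keeps only k elements per partition and merges those small buffers on the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TopKIllustration {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("top-k-illustration").setMaster("local[*]"))
    val data = sc.parallelize(Seq(5, 3, 9, 1, 7, 2, 8), numSlices = 4)

    // Total ordering: range-partitions and sorts every element, then throws
    // away everything after the first five.
    val viaSort = data.sortBy(x => x).take(5)

    // Top-k: each partition keeps at most five elements in a bounded
    // priority queue; only those small buffers reach the driver.
    val viaTakeOrdered = data.takeOrdered(5)

    println(viaSort.mkString(", "))         // 1, 2, 3, 5, 7
    println(viaTakeOrdered.mkString(", "))  // 1, 2, 3, 5, 7
    sc.stop()
  }
}
```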

Contributor

Also, this rule doesn't belong in CombineLimits, does it?

Contributor Author

My bad. After thinking more about this, I don't think we can add that rule, because pushing down the sort forces a sort over the global data.
But there is another case we can optimize; I am updating this.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 1, 2015

Test build #31554 has started for PR 5821 at commit bc264dc.

@scwf scwf changed the title [SPARK-7289] [SQL] CombineLimits improvement: push down sort when it's child is Limit [SPARK-7289] [SQL] Combines Limit and Sort to avoid total ordering May 1, 2015
@scwf scwf changed the title [SPARK-7289] [SQL] Combines Limit and Sort to avoid total ordering [SPARK-7289] [SQL] Combine Limit and Sort to avoid total ordering May 1, 2015
@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 1, 2015

Test build #31557 has started for PR 5821 at commit 34c0514.

@AmplabJenkins

Build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31556/
Test FAILed.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 1, 2015

Test build #31563 has started for PR 5821 at commit 8560d13.

@SparkQA

SparkQA commented May 1, 2015

Test build #31554 has finished for PR 5821 at commit bc264dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31554/
Test PASSed.

@SparkQA

SparkQA commented May 1, 2015

Test build #31557 has finished for PR 5821 at commit 34c0514.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31557/
Test PASSed.

@SparkQA

SparkQA commented May 1, 2015

Test build #31563 has finished for PR 5821 at commit 8560d13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31563/
Test PASSed.

@scwf
Contributor Author

scwf commented May 2, 2015

@rxin, I have updated the PR title and description; please take a look.

@rxin
Contributor

rxin commented May 5, 2015

cc @yhuai can you take a look at this?

Contributor

Would LimitPushdown be a better name?

Contributor Author

I think CombineLimitSort is more descriptive for now, since this is a single, special case of combining Limit and Sort.
We can rename it in the future when we add more push-down cases here.

@scwf
Contributor Author

scwf commented May 8, 2015

ping @yhuai, is this OK to go?

@SparkQA

SparkQA commented May 8, 2015

Test build #786 has started for PR 5821 at commit 8560d13.

@SparkQA

SparkQA commented May 8, 2015

Test build #786 has finished for PR 5821 at commit 8560d13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

marmbrus commented May 8, 2015

I'm not sold on this optimization. It's pretty weird to put an ORDER BY in a subquery, no?

@scwf
Contributor Author

scwf commented May 9, 2015

Yes, but we cannot stop users from writing SQL like this.
If you think this is not necessary, I will close it.

@scwf
Contributor Author

scwf commented May 14, 2015

OK to close this.

@scwf scwf closed this May 14, 2015
@rxin
Contributor

rxin commented Jun 11, 2015

cc @marmbrus, actually I think we need a rule like this. @mengxr just constructed a case:

== Parsed Logical Plan ==
Aggregate [brand#1902], [brand#1902,COUNT(1) AS count#2769L]
 Limit 10
  Sort [-num_reviews#2161L ASC], true
   Aggregate [brand#1902], [brand#1902,COUNT(1) AS num_reviews#2161L]
    Project [asin#1892,helpful#1893,overall#1894,reviewText#1895,reviewTime#1896,reviewerID#1897,reviewerName#1898,summary#1899,unixReviewTime#1900L,brand#1902,categories#1903,imUrl#1904,price#1905,related#1906,salesRank#1907,title#1908]
     Join Inner, Some((asin#1892 = asin#1901))
      Relation[asin#1892,helpful#1893,overall#1894,reviewText#1895,reviewTime#1896,reviewerID#1897,reviewerName#1898,summary#1899,unixReviewTime#1900L] org.apache.spark.sql.parquet.ParquetRelation2@61c77b27
      Relation[asin#1901,brand#1902,categories#1903,imUrl#1904,price#1905,related#1906,salesRank#1907,title#1908] org.apache.spark.sql.parquet.ParquetRelation2@ab4decb4

== Analyzed Logical Plan ==
brand: string, count: bigint
Aggregate [brand#1902], [brand#1902,COUNT(1) AS count#2769L]
 Limit 10
  Sort [-num_reviews#2161L ASC], true
   Aggregate [brand#1902], [brand#1902,COUNT(1) AS num_reviews#2161L]
    Project [asin#1892,helpful#1893,overall#1894,reviewText#1895,reviewTime#1896,reviewerID#1897,reviewerName#1898,summary#1899,unixReviewTime#1900L,brand#1902,categories#1903,imUrl#1904,price#1905,related#1906,salesRank#1907,title#1908]
     Join Inner, Some((asin#1892 = asin#1901))
      Relation[asin#1892,helpful#1893,overall#1894,reviewText#1895,reviewTime#1896,reviewerID#1897,reviewerName#1898,summary#1899,unixReviewTime#1900L] org.apache.spark.sql.parquet.ParquetRelation2@61c77b27
      Relation[asin#1901,brand#1902,categories#1903,imUrl#1904,price#1905,related#1906,salesRank#1907,title#1908] org.apache.spark.sql.parquet.ParquetRelation2@ab4decb4

== Optimized Logical Plan ==
Aggregate [brand#1902], [brand#1902,COUNT(1) AS count#2769L]
 Limit 10
  Project [brand#1902]
   Sort [-num_reviews#2161L ASC], true
    Aggregate [brand#1902], [brand#1902,COUNT(1) AS num_reviews#2161L]
     Project [brand#1902]
      Join Inner, Some((asin#1892 = asin#1901))
       Project [asin#1892]
        Relation[asin#1892,helpful#1893,overall#1894,reviewText#1895,reviewTime#1896,reviewerID#1897,reviewerName#1898,summary#1899,unixReviewTime#1900L] org.apache.spark.sql.parquet.ParquetRelation2@61c77b27
       Project [brand#1902,asin#1901]
        Relation[asin#1901,brand#1902,categories#1903,imUrl#1904,price#1905,related#1906,salesRank#1907,title#1908] org.apache.spark.sql.parquet.ParquetRelation2@ab4decb4

== Physical Plan ==
Aggregate false, [brand#1902], [brand#1902,Coalesce(SUM(PartialCount#2772L),0) AS count#2769L]
 Aggregate true, [brand#1902], [brand#1902,COUNT(1) AS PartialCount#2772L]
  Limit 10
   Project [brand#1902]
    Sort [-num_reviews#2161L ASC], true
     Exchange (RangePartitioning 200)
      Aggregate false, [brand#1902], [brand#1902,Coalesce(SUM(PartialCount#2774L),0) AS num_reviews#2161L]
       Exchange (HashPartitioning 200)
        Aggregate true, [brand#1902], [brand#1902,COUNT(1) AS PartialCount#2774L]
         Project [brand#1902]
          ShuffledHashJoin [asin#1892], [asin#1901], BuildRight
           Exchange (HashPartitioning 200)
            PhysicalRDD [asin#1892], parquet MapPartitionsRDD[775] at
           Exchange (HashPartitioning 200)
            PhysicalRDD [brand#1902,asin#1901], parquet MapPartitionsRDD[777] at

Code Generation: false
== RDD ==

@rxin
Contributor

rxin commented Jun 11, 2015

@marmbrus
Contributor

@rxin, I think it would be better to make the planning of TakeOrdered more general so that it can handle internal projections. Either that, or we could add a rule that pushes projections beneath Sort.
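A minimal sketch of the second option (pushing a projection beneath Sort), assuming the Spark 1.4-era Catalyst operators and not actual Spark code: it is only safe when the projection keeps every column the sort refers to, which the plan above violates (the Project keeps only brand while the Sort needs num_reviews), so generalizing the planning of TakeOrdered is likely the cleaner fix.

```scala
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch only. Swaps Project(Sort(child)) into Sort(Project(child)) when
// every attribute referenced by the sort order is still produced by the
// projection. Aliases are ignored for simplicity.
object PushProjectThroughSortSketch extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Project(projectList, Sort(order, global, child))
        if order.flatMap(_.references).forall(projectList.contains) =>
      Sort(order, global, Project(projectList, child))
  }
}
```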
