[SPARK-15752] [SQL] Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators. #13494

lianhuiwang · 2016-06-03T07:45:37Z

What changes were proposed in this pull request?

When a query only uses metadata (for example, partition keys), it can return results based on that metadata without scanning files. Hive did this in HIVE-1003.
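To make the idea concrete, here is a minimal sketch in plain Scala, not taken from this patch. It assumes a Hive-style directory layout in which partition values appear in file paths as `key=value` segments; the object name, helper names, and paths are hypothetical. It answers a `SELECT DISTINCT dt`-style question from paths alone, without opening any file.

```scala
// Hypothetical sketch: these names and paths are illustrative, not Spark APIs.
object MetadataOnlySketch {
  // Extract the value of a Hive-style partition segment such as "dt=2016-06-03".
  def partitionValue(path: String, key: String): Option[String] =
    path.split('/').collectFirst {
      case segment if segment.startsWith(key + "=") => segment.drop(key.length + 1)
    }

  // Answer "SELECT DISTINCT <key>" using only file paths (metadata),
  // never the contents of the files themselves.
  def distinctPartitionValues(paths: Seq[String], key: String): Seq[String] =
    paths.flatMap(p => partitionValue(p, key)).distinct.sorted
}
```

Two files in the same partition directory contribute a single distinct value, which is why this kind of query does not need the data files at all.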

How was this patch tested?

Added unit tests.

SparkQA · 2016-06-03T09:19:10Z

Test build #59925 has finished for PR 13494 at commit 2ca2c38.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-03T14:27:26Z

Test build #59929 has finished for PR 13494 at commit edea710.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-03T14:51:27Z

Test build #59930 has finished for PR 13494 at commit 8426522.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-03T18:53:37Z

Test build #59940 has finished for PR 13494 at commit 153293e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-06-04T05:40:14Z

Can you try to write a design doc on this? Would be great to discuss the reasons why we might want this, the kind of queries that can be answered, corner cases, and how it should be implemented. Thanks.

lianhuiwang · 2016-06-04T09:19:53Z

@rxin I have written a design doc: https://docs.google.com/document/d/1Bmi4-PkTaBQ0HVaGjIqa3eA12toKX52QaiUyhb6WQiM/edit?usp=sharing.
Glad to get your comments. Thanks.

cloud-fan · 2016-06-23T07:19:25Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala

+        val partitionSchema = files.partitionSchema.toAttributes
+        lazy val converter = GenerateUnsafeProjection.generate(partitionSchema, partitionSchema)
+        val partitionValues = selectedPartitions.map(_.values)
+        files.sqlContext.sparkContext.parallelize(partitionValues, 1).map(converter(_))


What if this partition has more than one data file?

In this PR, the default of spark.sql.optimizer.metadataOnly is false, so a user who needs this feature should set spark.sql.optimizer.metadataOnly=true.

I think the optimizer should never affect the correctness of the query result. If this optimization is too hard to implement with the current code base, we should improve the code base first, instead of rushing in a partial implementation.

Yes, I will think about it more and then add a metadataOnly optimizer to the optimizer list. Thanks.

cloud-fan · 2016-06-23T07:30:38Z

Hi @lianhuiwang, thanks for working on it!

The overall idea LGTM: we should eliminate unnecessary file scans if only partition columns are read. However, the current implementation does not look correct, because we should also consider the number of rows. I also took a look at the Hive path: it only optimizes partition columns used as aggregation keys, where the number of duplicated rows doesn't matter.

I think we should either narrow down the scope of this PR and focus on aggregation queries, or spend some more time on a more general design.
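The row-count concern can be illustrated with a small, self-contained sketch. It is a toy model, not code from this PR: `Partition`, `scannedValues`, and `metadataValues` are hypothetical names, and each partition is assumed to carry a known number of rows.

```scala
// Hypothetical model of the correctness concern; not Spark code.
object RowCountSketch {
  // One entry per partition: its partition-column value and the number of
  // rows stored in its data files.
  final case class Partition(value: Int, rowCount: Int)

  // What a full scan would see: the partition value repeated once per row.
  def scannedValues(parts: Seq[Partition]): Seq[Int] =
    parts.flatMap(p => Seq.fill(p.rowCount)(p.value))

  // What a metadata-only relation sees: each partition value exactly once.
  def metadataValues(parts: Seq[Partition]): Seq[Int] =
    parts.map(_.value)
}
```

With partitions `Partition(1, 3)` and `Partition(2, 2)`, `max` and `distinct` agree between the two views, while `size` (a stand-in for COUNT) and `sum` do not. In this model a partition with `rowCount = 0` would make even `max` disagree, since a scan would never see its value.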

cc @yhuai @liancheng

lianhuiwang · 2016-06-23T08:02:07Z

@cloud-fan Yes, I think what you said is right. As in Hive/Prestodb, if a query applies certain functions (for example, MIN/MAX) or distinct aggregates to partition columns, and the config 'spark.sql.optimizer.metadataOnly' is true, then we can use the metadata-only optimization.
I will add a metadataOnly optimizer to the optimizer list. Thanks.
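The condition described in this comment can be sketched as a small predicate. This is an illustration of the stated criteria only, not the rule that was eventually merged; `AggCall`, `canUseMetadataOnly`, and the set of duplicate-insensitive functions are assumptions made for the example.

```scala
// Hypothetical eligibility check; names are illustrative, not Spark APIs.
object EligibilitySketch {
  // One aggregate call, e.g. MAX(dt) or COUNT(DISTINCT dt).
  final case class AggCall(function: String, column: String, isDistinct: Boolean)

  // Aggregates whose result does not change when input rows are duplicated.
  private val duplicateInsensitive = Set("min", "max")

  def canUseMetadataOnly(
      enabled: Boolean,
      partitionColumns: Set[String],
      groupingColumns: Seq[String],
      aggregates: Seq[AggCall]): Boolean = {
    // Every referenced column must be a partition column.
    val onlyPartitionColumns =
      (groupingColumns ++ aggregates.map(_.column)).forall(c => partitionColumns.contains(c))
    // Every aggregate must be DISTINCT or insensitive to duplicate rows.
    val rowCountIrrelevant = aggregates.forall { a =>
      a.isDistinct || duplicateInsensitive.contains(a.function.toLowerCase)
    }
    enabled && onlyPartitionColumns && rowCountIrrelevant
  }
}
```

Under these assumptions `MAX(dt)` and `COUNT(DISTINCT dt)` on a partition column `dt` qualify, while a plain `COUNT(dt)` or any aggregate over a non-partition column does not.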

This reverts commit 153293e.

This reverts commit edea710.

SparkQA · 2016-06-24T07:55:19Z

Test build #61161 has finished for PR 13494 at commit 7d7ece0.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

lianhuiwang · 2016-06-24T08:04:35Z

@cloud-fan I have now added extendedHiveOptimizerRules, which includes the MetadataOnly optimization for the Hive optimizer.
First, the MetadataOnly optimization should be in the Hive module, because MetastoreRelation can only be used in Hive for now.
Second, the MetadataOnly optimization should run between the Analyzer and RewriteDistinctAggregates.
In the future, we can add ParquetConversions/OrcConversions and other optimizations to extendedHiveOptimizerRules.

rxin · 2016-06-24T08:07:40Z

Why is this rule Hive specific?

lianhuiwang · 2016-06-24T08:19:31Z

@rxin Good point. MetastoreRelation is currently defined only in Hive, and to make it use the MetadataOnly optimization, this PR puts the optimization in the Hive component.
Otherwise, we need to divide the MetadataOnly optimization into two parts, one for common SQL and the other for HiveQL.
I will think more about it and try my best to resolve it. Thanks.

cloud-fan · 2016-06-24T09:32:01Z

sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+  val OPTIMIZER_METADATA_ONLY = SQLConfigBuilder("spark.sql.optimizer.metadataOnly")
+    .doc("When true, enable the metadata-only query optimization.")
+    .booleanConf
+    .createWithDefault(false)


Can we turn it on by default?

SparkQA · 2016-06-24T09:38:20Z

Test build #61162 has finished for PR 13494 at commit 2e55a9d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-06-24T09:40:30Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala

+          if files.partitionSchema.nonEmpty =>
+          (Some(relation), Seq.empty[Expression])
+
+        case relation: MetastoreRelation if relation.partitionKeys.nonEmpty =>


MetastoreRelation extends CatalogRelation, so I think we can put this rule in sql core instead of the hive module.

SparkQA · 2016-06-24T09:46:53Z

Test build #61163 has finished for PR 13494 at commit b2b6eba.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-24T09:49:49Z

Test build #61164 has finished for PR 13494 at commit c5a291e.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class JavaPackage
- case class StreamingRelationExec(sourceName: String, output: Seq[Attribute]) extends LeafExecNode

lianhuiwang · 2016-07-11T18:22:12Z

@hvanhovell I have addressed some of your comments. Thanks. Could you take another look?

hvanhovell · 2016-07-11T18:58:18Z

sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala

+    /**
+     * Returns the partition attributes of the table relation plan.
+     */
+    def getPartitionAttrs(partitionColumnNames: Seq[String], relation: LogicalPlan)


Nit: Style.

    def getPartitionAttrs(
        partitionColumnNames: Seq[String],
        relation: LogicalPlan): Seq[Attribute] = { ...

While you are at it, change the return type to AttributeSet.

Got it. Thanks.

SparkQA · 2016-07-11T20:11:04Z

Test build #62110 has finished for PR 13494 at commit d888c85.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-07-12T00:40:40Z

sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala

+      case plan if plan eq relation =>
+        relation match {
+          case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _) =>
+            val partAttrs = PartitionedRelation.getPartitionAttrs(


Does getPartitionAttrs need to be a method in PartitionedRelation? I think it can just be a private method in the parent class.

Thanks. Because object PartitionedRelation also uses getPartitionAttrs, I have defined it in PartitionedRelation for now. If it is defined as a private method in class OptimizeMetadataOnlyQuery, there would be two identical getPartitionAttrs() functions, one in PartitionedRelation and one in OptimizeMetadataOnlyQuery.
How about defining two identical getPartitionAttrs() functions? Or is there another way?

@cloud-fan I will define two functions for getPartitionAttrs(). In the future, I think we can put getPartitionAttrs() into the relation plan. If I have missed something, please tell me. Thanks.

lianhuiwang · 2016-07-12T02:59:50Z

@cloud-fan @hvanhovell About getPartitionAttrs(): one possible improvement is to define it in the relation node, but the relation node does not have this function yet. How about adding it in follow-up PRs? Thanks.

SparkQA · 2016-07-12T04:55:04Z

Test build #62137 has finished for PR 13494 at commit ff16509.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-07-12T05:49:56Z

sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala

+    /**
+     * Returns the partition attributes of the table relation plan.
+     */
+    private def getPartitionAttrs(


IIRC, an inner class can access private members of its outer class, so we don't need to duplicate the method in the inner class.

Yes, thanks.
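The Scala visibility rule mentioned here can be checked with a minimal sketch. `OuterRule`, `InnerExtractor`, and `partitionAttrs` are hypothetical stand-ins for the rule class, its nested object, and the helper; they are not the actual definitions in this PR, and plain strings stand in for attributes.

```scala
// Hypothetical stand-ins: a class with a private helper and a nested object.
class OuterRule {
  // Private to OuterRule, yet visible to anything nested inside it.
  private def partitionAttrs(names: Seq[String], output: Seq[String]): Seq[String] =
    output.filter(o => names.contains(o))

  object InnerExtractor {
    // Calls the enclosing class's private method directly; no duplicate needed.
    def attrsOf(names: Seq[String], output: Seq[String]): Seq[String] =
      partitionAttrs(names, output)
  }
}
```

Because the nested object is a member of each `OuterRule` instance, it is reached through an instance, for example `(new OuterRule).InnerExtractor.attrsOf(Seq("dt"), Seq("id", "dt"))`.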

lianhuiwang · 2016-07-12T09:49:05Z

@cloud-fan I have addressed your latest comments. Thanks.

SparkQA · 2016-07-12T11:42:57Z

Test build #62156 has finished for PR 13494 at commit 030776a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2016-07-12T16:51:23Z

LGTM - Merging to master. Thanks!

lianhuiwang · 2016-07-12T17:16:50Z

Thank you for reviewing and merging, @rxin @hvanhovell @cloud-fan.

init commit

2ca2c38

lianhuiwang added 2 commits June 3, 2016 20:58

fix unit test

edea710

Merge branch 'apache-master' into metadata-only

8426522

fix unit test

153293e

cloud-fan reviewed Jun 23, 2016

lianhuiwang added 6 commits June 24, 2016 14:29

update

7dfb743

Revert "fix unit test"

68e6d6d

This reverts commit 153293e.

Revert "fix unit test"

595ef36

This reverts commit edea710.

Merge branch 'apache-master' into metadata-only

7d7ece0

Merge branch 'apache-master' into metadata-only

2e55a9d

update

b2b6eba

Merge branch 'apache-master' into metadata-only

c5a291e

cloud-fan reviewed Jun 24, 2016

hvanhovell reviewed Jul 11, 2016

cloud-fan reviewed Jul 12, 2016

update

ff16509

cloud-fan reviewed Jul 12, 2016

lianhuiwang added 2 commits July 12, 2016 17:28

remove duplicate code

358ad13

fix minor

030776a

asfgit closed this in 5ad68ba Jul 12, 2016

gengliangwang mentioned this pull request Jan 24, 2019

[SPARK-26709][SQL] OptimizeMetadataOnlyQuery does not handle empty records correctly #23635

Closed

gengliangwang mentioned this pull request Jan 25, 2019

[SPARK-26709][SQL][BRANCH-2.3] OptimizeMetadataOnlyQuery does not handle empty records correctly #23648

Closed
