
Conversation

@lianhuiwang
Contributor

What changes were proposed in this pull request?

When a query uses only metadata (for example, partition keys), it can return results based on that metadata without scanning any files. Hive implemented this in HIVE-1003.
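For illustration, these are the kinds of queries this targets. A minimal sketch, assuming a Hive-enabled SparkSession and a hypothetical table `events` partitioned by `dt`:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Both queries touch only the partition column `dt`, so their results can
// be computed from the catalog's partition metadata without reading files.
spark.sql("SELECT DISTINCT dt FROM events").show()
spark.sql("SELECT MAX(dt) FROM events").show()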

How was this patch tested?

Added unit tests.

@SparkQA

SparkQA commented Jun 3, 2016

Test build #59925 has finished for PR 13494 at commit 2ca2c38.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 3, 2016

Test build #59929 has finished for PR 13494 at commit edea710.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 3, 2016

Test build #59930 has finished for PR 13494 at commit 8426522.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 3, 2016

Test build #59940 has finished for PR 13494 at commit 153293e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jun 4, 2016

Can you try to write a design doc on this? Would be great to discuss the reasons why we might want this, the kind of queries that can be answered, corner cases, and how it should be implemented. Thanks.

@lianhuiwang
Contributor Author

lianhuiwang commented Jun 4, 2016

@rxin I have written a design doc: https://docs.google.com/document/d/1Bmi4-PkTaBQ0HVaGjIqa3eA12toKX52QaiUyhb6WQiM/edit?usp=sharing.
I'd be glad to get your comments. Thanks.

val partitionSchema = files.partitionSchema.toAttributes
// Projection that converts the catalog's partition values into UnsafeRows.
lazy val converter = GenerateUnsafeProjection.generate(partitionSchema, partitionSchema)
// One row per selected partition, taken from the metastore with no file I/O.
val partitionValues = selectedPartitions.map(_.values)
files.sqlContext.sparkContext.parallelize(partitionValues, 1).map(converter(_))
Contributor

What if this partition has more than one data file?

Contributor Author

In this PR the default of spark.sql.optimizer.metadataOnly is false, so a user who needs this feature should set spark.sql.optimizer.metadataOnly=true.
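For illustration, enabling the flag could look like this (assuming an existing SparkSession named `spark`):

// Opt in; the flag defaults to false in this PR.
spark.conf.set("spark.sql.optimizer.metadataOnly", "true")
// Or equivalently from SQL:
spark.sql("SET spark.sql.optimizer.metadataOnly=true")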

Contributor

I think the optimizer should never affect the correctness of the query result. If this optimization is too hard to implement with the current code base, we should improve the code base first instead of rushing in a partial implementation.

Contributor Author

Yes, after rethinking this, I will add a metadata-only optimizer rule to the optimizer list. Thanks.

@cloud-fan
Contributor

Hi @lianhuiwang, thanks for working on it!

The overall idea LGTM: we should eliminate unnecessary file scans when only partition columns are read. However, the current implementation doesn't look correct; we also need to consider the number of rows. I took a look at the Hive path: it only optimizes partition columns used as aggregation keys, where the number of duplicate rows doesn't matter.

I think we should either narrow the scope of this PR to focus on aggregation queries, or spend more time on a more general design.

cc @yhuai @liancheng
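To make the row-count issue concrete, here is a sketch with a hypothetical table `t` partitioned by `dt`, where partition dt=2016-01-01 holds 3 rows and dt=2016-01-02 holds 5 rows (assuming a SparkSession named `spark`):

// Safe to answer from metadata: duplicate rows within a partition
// cannot change the result.
spark.sql("SELECT MAX(dt) FROM t")           // one row either way
spark.sql("SELECT dt FROM t GROUP BY dt")    // one row per partition value

// NOT safe from metadata alone: the result depends on how many rows
// each partition holds, which the catalog does not track.
spark.sql("SELECT dt FROM t")                // returns 8 rows, not 2
spark.sql("SELECT COUNT(dt) FROM t")         // counts rows, not partitions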

@lianhuiwang
Contributor Author

lianhuiwang commented Jun 23, 2016

@cloud-fan Yes, I think you're right. As in Hive/Presto, when a query applies certain functions (for example MIN/MAX) or distinct aggregates to partition columns and 'spark.sql.optimizer.metadataOnly' is true, we can use the metadata-only optimization.
I will add a metadata-only optimizer rule to the optimizer list. Thanks.

@SparkQA

SparkQA commented Jun 24, 2016

Test build #61161 has finished for PR 13494 at commit 7d7ece0.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lianhuiwang
Contributor Author

@cloud-fan I have now added an extendedHiveOptimizerRules that includes the MetadataOnly optimization for the Hive optimizer.
First, the MetadataOnly optimization has to live in the Hive module for now, because MetastoreRelation is only defined in Hive.
Second, the MetadataOnly optimization should run between the Analyzer and RewriteDistinctAggregates.
In the future, we can add ParquetConversions/OrcConversions and other optimizations to extendedHiveOptimizerRules.
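The PR wires the rule into the Hive optimizer internally. For comparison only, Spark also exposes a public hook for attaching extra optimizer rules to a session; a minimal sketch with a hypothetical no-op rule:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical placeholder rule, just to show the extension point.
object NoOpRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

val spark = SparkSession.builder().getOrCreate()
spark.experimental.extraOptimizations = Seq(NoOpRule)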

@rxin
Contributor

rxin commented Jun 24, 2016

Why is this rule Hive specific?

@lianhuiwang
Contributor Author

lianhuiwang commented Jun 24, 2016

@rxin Good point. Right now MetastoreRelation is only defined in Hive, so placing the rule there, as this PR does, lets the Hive component use the MetadataOnly optimization.
Otherwise, we would need to split the MetadataOnly optimization into two parts: one for common SQL and one for HiveQL.
I will think more about it and try my best to resolve it. Thanks.

val OPTIMIZER_METADATA_ONLY = SQLConfigBuilder("spark.sql.optimizer.metadataOnly")
.doc("When true, enable the metadata-only query optimization.")
.booleanConf
.createWithDefault(false)
Contributor

can we turn it on by default?
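For context, a rule typically consults such a flag before doing any work. A minimal sketch (not the PR's actual rule body), assuming the entry is defined on SQLConf:

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.internal.SQLConf

// Sketch: leave the plan untouched unless the flag is enabled.
case class MetadataOnlySketch(conf: SQLConf) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    if (!conf.getConf(SQLConf.OPTIMIZER_METADATA_ONLY)) {
      plan  // optimization disabled
    } else {
      plan  // enabled: the real rule would rewrite metadata-only aggregates here
    }
  }
}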

@SparkQA

SparkQA commented Jun 24, 2016

Test build #61162 has finished for PR 13494 at commit 2e55a9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if files.partitionSchema.nonEmpty =>
(Some(relation), Seq.empty[Expression])

case relation: MetastoreRelation if relation.partitionKeys.nonEmpty =>
Contributor

MetastoreRelation extends CatalogRelation, so I think we can put this rule in sql core instead of the hive module.

@SparkQA

SparkQA commented Jun 24, 2016

Test build #61163 has finished for PR 13494 at commit b2b6eba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 24, 2016

Test build #61164 has finished for PR 13494 at commit c5a291e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class JavaPackage
    • case class StreamingRelationExec(sourceName: String, output: Seq[Attribute]) extends LeafExecNode

@lianhuiwang
Contributor Author

@hvanhovell I have addressed some of your comments. Thanks. Could you take another look?

/**
* Returns the partition attributes of the table relation plan.
*/
def getPartitionAttrs(partitionColumnNames: Seq[String], relation: LogicalPlan)
Contributor

Nit: Style.

def getPartitionAttrs(
    partitionColumnNames: Seq[String],
    relation: LogicalPlan): Seq[Attribute] = { ...

While you are at it, change the return type to AttributeSet.
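A sketch of what the helper could look like with that style and return type, resolving the partition column names against the relation's output (names follow the PR, but the body is illustrative):

import org.apache.spark.sql.catalyst.expressions.AttributeSet
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

def getPartitionAttrs(
    partitionColumnNames: Seq[String],
    relation: LogicalPlan): AttributeSet = {
  val partCols = partitionColumnNames.map(_.toLowerCase).toSet
  AttributeSet(relation.output.filter(a => partCols.contains(a.name.toLowerCase)))
}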

Contributor Author

Got it. Thanks.

@SparkQA

SparkQA commented Jul 11, 2016

Test build #62110 has finished for PR 13494 at commit d888c85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case plan if plan eq relation =>
relation match {
case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _) =>
val partAttrs = PartitionedRelation.getPartitionAttrs(
Contributor

Does getPartitionAttrs need to be a method in PartitionedRelation? I think it can just be a private method in the parent class.

Contributor Author

lianhuiwang commented Jul 12, 2016

Thanks. Because the object PartitionedRelation also uses getPartitionAttrs, for now I just define it in PartitionedRelation. If it were defined as a private method in the class OptimizeMetadataOnlyQuery, there would be two identical getPartitionAttrs() functions, one in PartitionedRelation and one in OptimizeMetadataOnlyQuery.
How about defining two identical getPartitionAttrs() functions? Or is there another way?

Contributor Author

@cloud-fan I will define two functions for getPartitionAttrs(). In the future, I think we can move getPartitionAttrs() into the relation plan itself. If there is any problem with this, please tell me. Thanks.

@lianhuiwang
Contributor Author

@cloud-fan @hvanhovell About getPartitionAttrs(): one improvement would be to define it on the relation node, but the relation node doesn't have this function yet. How about adding it in a follow-up PR? Thanks.

@SparkQA

SparkQA commented Jul 12, 2016

Test build #62137 has finished for PR 13494 at commit ff16509.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
* Returns the partition attributes of the table relation plan.
*/
private def getPartitionAttrs(
Contributor

IIRC, an inner class can access private members of the outer class, so we don't need to duplicate the method in the inner class.
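A minimal illustration of that point (hypothetical names): an object nested in a class can call the enclosing class's private methods directly, so the helper need not be duplicated:

class OuterRule {
  private def getPartitionAttrs(names: Seq[String]): Seq[String] = names

  object PartitionedRelation {
    // Resolves via the outer class's private helper.
    def partitionAttrsFor(names: Seq[String]): Seq[String] = getPartitionAttrs(names)
  }
}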

Contributor Author

Yes, thanks.

@lianhuiwang
Contributor Author

@cloud-fan I have addressed your latest comments. Thanks.

@SparkQA

SparkQA commented Jul 12, 2016

Test build #62156 has finished for PR 13494 at commit 030776a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

LGTM - Merging to master. Thanks!

@asfgit asfgit closed this in 5ad68ba Jul 12, 2016
@lianhuiwang
Contributor Author

Thank you for reviewing and merging, @rxin @hvanhovell @cloud-fan.
