[SPARK-15752] [SQL] Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators. #13494
|
Test build #59925 has finished for PR 13494 at commit
|
|
Test build #59929 has finished for PR 13494 at commit
|
|
Test build #59930 has finished for PR 13494 at commit
|
|
Test build #59940 has finished for PR 13494 at commit
|
|
Can you try to write a design doc on this? Would be great to discuss the reasons why we might want this, the kind of queries that can be answered, corner cases, and how it should be implemented. Thanks. |
|
@rxin I have written a design doc: https://docs.google.com/document/d/1Bmi4-PkTaBQ0HVaGjIqa3eA12toKX52QaiUyhb6WQiM/edit?usp=sharing. |
| val partitionSchema = files.partitionSchema.toAttributes | ||
| lazy val converter = GenerateUnsafeProjection.generate(partitionSchema, partitionSchema) | ||
| val partitionValues = selectedPartitions.map(_.values) | ||
| files.sqlContext.sparkContext.parallelize(partitionValues, 1).map(converter(_)) |
What if this partition has more than one data file?
In this PR, spark.sql.optimizer.metadataOnly defaults to false, so users who need this feature must set spark.sql.optimizer.metadataOnly=true.
I think the optimizer should never affect the correctness of the query result. If this optimization is too hard to implement with the current code base, we should improve the code base first instead of rushing in a partial implementation.
Yes. I will rethink this and then add a metadataOnly rule to the optimizer's rule list. Thanks.
|
Hi @lianhuiwang, thanks for working on it! The overall idea LGTM: we should eliminate unnecessary file scans if only partition columns are read. However, the current implementation does not look correct; we should also consider the number of rows. I also took a look at the Hive path: it only optimizes partition columns used as aggregation keys, where the number of duplicated rows doesn't matter. I think we should either narrow down the scope of this PR and focus on aggregation queries, or spend some more time on a more general design. cc @yhuai @liancheng |
|
@cloud-fan Yes, I think you are right. As in Hive/Prestodb, if a query applies certain aggregate functions (for example MIN/MAX) or distinct aggregates to partition columns, and the config 'spark.sql.optimizer.metadataOnly' is true, then we can use the metadata-only optimization. |
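The point about row counts can be illustrated without Spark. The following is a minimal sketch in plain Scala collections, not the PR's implementation: `MetadataOnlySketch`, `Partition`, `scanRows`, and `metadataRows` are hypothetical names, and the sketch assumes every partition holds at least one row. It models a full scan as the partition value repeated once per row, and a metadata-only plan as one value per partition.

```scala
object MetadataOnlySketch {
  // Hypothetical model: one entry per partition, holding the partition-column
  // value and the number of rows stored in that partition's files.
  // Assumption: rowCount >= 1 for every partition.
  final case class Partition(value: Int, rowCount: Int)

  // What a full file scan would produce for the partition column:
  // the partition value repeated once per row.
  def scanRows(parts: Seq[Partition]): Seq[Int] =
    parts.flatMap(p => Seq.fill(p.rowCount)(p.value))

  // What a metadata-only plan would produce: one value per partition,
  // regardless of how many rows or files the partition contains.
  def metadataRows(parts: Seq[Partition]): Seq[Int] =
    parts.map(_.value)
}
```

Under these assumptions, MIN, MAX, and DISTINCT give the same answer on both inputs because duplicates do not change them, while COUNT and SUM do not, which is why the scope was narrowed to duplicate-insensitive aggregates on partition columns.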
|
Test build #61161 has finished for PR 13494 at commit
|
|
@cloud-fan I have now added extendedHiveOptimizerRules, which includes the MetadataOnly optimization for the Hive optimizer. |
|
Why is this rule Hive specific? |
|
@rxin Good point. MetastoreRelation is currently defined only in the Hive module, so to make it use the MetadataOnly optimization, this PR applies the optimization in the Hive component. |
| val OPTIMIZER_METADATA_ONLY = SQLConfigBuilder("spark.sql.optimizer.metadataOnly") | ||
| .doc("When true, enable the metadata-only query optimization.") | ||
| .booleanConf | ||
| .createWithDefault(false) |
Can we turn it on by default?
|
Test build #61162 has finished for PR 13494 at commit
|
| if files.partitionSchema.nonEmpty => | ||
| (Some(relation), Seq.empty[Expression]) | ||
|
|
||
| case relation: MetastoreRelation if relation.partitionKeys.nonEmpty => |
MetastoreRelation extends CatalogRelation, so I think we can put this rule in sql core instead of the hive module.
|
Test build #61163 has finished for PR 13494 at commit
|
|
Test build #61164 has finished for PR 13494 at commit
|
|
@hvanhovell I have addressed some of your comments. Thanks. Could you take another look? |
| /** | ||
| * Returns the partition attributes of the table relation plan. | ||
| */ | ||
| def getPartitionAttrs(partitionColumnNames: Seq[String], relation: LogicalPlan) |
Nit: style.
  def getPartitionAttrs(
      partitionColumnNames: Seq[String],
      relation: LogicalPlan): Seq[Attribute] = { ...
While you are at it, change the return type to AttributeSet.
Got it. Thanks.
|
Test build #62110 has finished for PR 13494 at commit
|
| case plan if plan eq relation => | ||
| relation match { | ||
| case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _) => | ||
| val partAttrs = PartitionedRelation.getPartitionAttrs( |
Does getPartitionAttrs need to be a method in PartitionedRelation? I think it can just be a private method in the parent class.
Thanks. Because object PartitionedRelation also uses getPartitionAttrs, I have defined it in PartitionedRelation for now. If it were defined as a private method in class OptimizeMetadataOnlyQuery, there would be two identical getPartitionAttrs() functions, one in PartitionedRelation and one in OptimizeMetadataOnlyQuery.
How about defining two identical getPartitionAttrs() functions? Or is there another way?
@cloud-fan I will define two getPartitionAttrs() functions. In the future, I think we can move getPartitionAttrs() into the relation plan. If there is any problem with this, please tell me. Thanks.
|
@cloud-fan @hvanhovell One possible improvement for getPartitionAttrs() is to define it on the relation node, but the relation node does not have this function yet. How about adding it in a follow-up PR? Thanks. |
|
Test build #62137 has finished for PR 13494 at commit
|
| /** | ||
| * Returns the partition attributes of the table relation plan. | ||
| */ | ||
| private def getPartitionAttrs( |
IIRC, an inner class can access private members of its outer class, so we don't need to duplicate the method in the inner class.
Yes, thanks.
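For reference, the Scala access rule mentioned above can be shown in isolation. This is a minimal sketch, not the code from this PR: `OuterRuleSketch`, `InnerExtractorSketch`, and the string-based signature are hypothetical stand-ins for OptimizeMetadataOnlyQuery, PartitionedRelation, and the real attribute types.

```scala
// Hypothetical stand-in for the rule class; attributes are modeled as plain strings.
class OuterRuleSketch {
  // Private to OuterRuleSketch, yet visible to anything nested inside its body.
  private def getPartitionAttrs(
      partitionColumnNames: Seq[String],
      allColumns: Seq[String]): Seq[String] =
    allColumns.filter(c => partitionColumnNames.contains(c))

  // Nested object (stand-in for the inner extractor): it calls the outer
  // private method directly, so no duplicate definition is needed here.
  object InnerExtractorSketch {
    def partitionColumnsOf(allColumns: Seq[String]): Seq[String] =
      getPartitionAttrs(Seq("year", "month"), allColumns)
  }
}
```

Calling `new OuterRuleSketch().InnerExtractorSketch.partitionColumnsOf(...)` compiles because Scala's `private` covers nested classes and objects of the enclosing class.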
|
@cloud-fan I have addressed your latest comments. thanks. |
|
Test build #62156 has finished for PR 13494 at commit
|
|
LGTM - Merging to master. Thanks! |
|
Thank you for reviewing and merging, @rxin @hvanhovell @cloud-fan. |
What changes were proposed in this pull request?
When a query only uses metadata (for example, partition keys), it can return results based on that metadata without scanning files. Hive did this in HIVE-1003.
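As a rough illustration of why no file scan is needed, the sketch below derives distinct partition values from Hive-style `column=value` directory paths alone. It is plain Scala under stated assumptions, not how Spark resolves partitions: `PartitionPathSketch`, `partitionValues`, `distinctValues`, and the sample paths are all hypothetical.

```scala
object PartitionPathSketch {
  // Parse a Hive-style path such as "year=2016/month=01/part-0.parquet"
  // into Map("year" -> "2016", "month" -> "01"). Segments without '=' are ignored.
  def partitionValues(path: String): Map[String, String] =
    path.split("/").toSeq
      .filter(_.contains("="))
      .map { segment =>
        val Array(key, value) = segment.split("=", 2)
        key -> value
      }
      .toMap

  // Answer "SELECT DISTINCT column" for a partition column from paths only,
  // without opening any data file.
  def distinctValues(paths: Seq[String], column: String): Seq[String] =
    paths.flatMap(p => partitionValues(p).get(column)).distinct.sorted
}
```

Several files under the same partition directory collapse to a single value here, which is harmless for DISTINCT but would be wrong for a row count.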
How was this patch tested?
Added unit tests.