-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-27342][SQL] Optimize Limit 0 queries #24271
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Seems reasonable, but @gengliangwang do you have an opinion? |
|
Test build #4680 has finished for PR 24271 at commit
|
| // changes up the Logical Plan. | ||
| // | ||
| // Replace Global Limit 0 nodes with empty Local Relation | ||
| case p @ GlobalLimit(IntegerLiteral(limit), _) if limit == 0 => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's better to do a new rule, to make it more modular.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, we can also create a new Batch for the new rule, since it requires only one-time transformation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rxin @gengliangwang Hi!
We are relying on the other case rules in PropagateEmptyRelation to propagate the effects of substituting Limit 0 with empty LocalRelation through the Logical Plan - such as in case of joins, project, etc.
So, should we handle this case also part of this rule, and then propagation of empty relation happens only as part of this rule? Or do you recommend doing it as a separate rule?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The optimization in this PR is not directly related to "propagate". We can create a separate rule and batch , e.g.
Batch("OptimizeLimitZero", Once, OptimizeLimitZero) :+
Batch("LocalRelation", fixedPoint,
ConvertToLocalRelation,
PropagateEmptyRelation
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rxin @gengliangwang @srowen I have made the recommended changes. Please have a look. Thanks!
attilapiros
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just minor stuff and a question. Otherwise LGTM.
| val testRelation1 = LocalRelation.fromExternalRows(Seq('a.int), data = Seq(Row(1))) | ||
| val testRelation2 = LocalRelation.fromExternalRows(Seq('b.int), data = Seq(Row(1))) | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super Nit: too much empty lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| comparePlans(optimized, correctAnswer) | ||
| } | ||
|
|
||
| test("Limit 0: Joins") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You know you can create easily separated test for each join types, like this:
Seq(
(Inner, LocalRelation('a.int, 'b.int)),
(LeftOuter, Project(Seq('a, Literal(null).cast(IntegerType).as('b)), testRelation1).analyze),
(RightOuter, LocalRelation('a.int, 'b.int)),
(FullOuter, Project(Seq('a, Literal(null).cast(IntegerType).as('b)), testRelation1).analyze)
).foreach { case (jt, answer) =>
test(s"Limit 0: for join type $jt") {
val query = testRelation1
.join(testRelation2.limit(0), jt, condition = Some('a.attr == 'b.attr))
val optimized = Optimize.execute(query.analyze)
val correctAnswer = answer
comparePlans(optimized, correctAnswer)
}
}So this way the suite execution will contain sth like:
[info] - Limit 0: for join type Inner (30 milliseconds)
[info] - Limit 0: for join type LeftOuter (21 milliseconds)
[info] - Limit 0: for join type RightOuter (13 milliseconds)
[info] - Limit 0: for join type FullOuter (12 milliseconds)
It has only one little advantage if more than one would fail execution would not stop at the first assert.
But your current solution is also very fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@attilapiros thanks for the input. I made the required changes.
| // For example, a query such as Filter(LocalRelation) would go through all the heavy | ||
| // optimizer rules that are triggered when there is a filter | ||
| // (e.g. InferFiltersFromConstraints). If we run this batch earlier, the query becomes just | ||
| // LocalRelation and does not trigger many rules. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment should be before the batch LocalRelation early.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gengliangwang, fixed this.
| class OptimizeLimitZeroSuite extends PlanTest { | ||
| object Optimize extends RuleExecutor[LogicalPlan] { | ||
| val batches = | ||
| Batch("OptimizeLimitZero", Once, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can have two batches here, just the same as Optimizer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gengliangwang Is there a specific reason for having it that way? I modelled the test suite based on PropagateEmptyRelation, wherein also they have all rules as part of the same batch..
gengliangwang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except my last comment in the test suite.
| .join(testRelation2.limit(0), joinType = jt, condition = Some('a.attr == 'b.attr)) | ||
|
|
||
| val optimized = Optimize.execute(query.analyze) | ||
| val correctAnswer = answer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
redundant var ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
|
||
| /** | ||
| * Replaces GlobalLimit 0 and LocalLimit 0 nodes (subtree) with empty Local Relation, as they don't | ||
| * return any rows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix the alignment here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| // changes up the Logical Plan. | ||
| // | ||
| // Replace Global Limit 0 nodes with empty Local Relation | ||
| case gl @ GlobalLimit(IntegerLiteral(limit), _) if limit == 0 => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case gl @ GlobalLimit(IntegerLiteral(0), _) => ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dilipbiswal good catch! Made this change.
| // then the following rule will handle that case as well. | ||
| // | ||
| // Replace Local Limit 0 nodes with empty Local Relation | ||
| case ll @ LocalLimit(IntegerLiteral(limit), _) if limit == 0 => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
retest this please |
|
@srowen Hi Sean.. for some reason, its not responding to "retest" ? |
|
Jenkins add to whitelist |
|
Test build #104310 has finished for PR 24271 at commit
|
|
LGTM |
|
Test build #4690 has finished for PR 24271 at commit
|
gatorsmile
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks! Merged to master.
What changes were proposed in this pull request?
With this change, unnecessary file scans are avoided in case of Limit 0 queries.
I added a case (rule) to
PropagateEmptyRelationto replaceGlobalLimit 0andLocalLimit 0nodes with an emptyLocalRelation. This prunes the subtree under the Limit 0 node and further allows other rules ofPropagateEmptyRelationto optimize the Logical Plan - while remaining semantically consistent with the Limit 0 query.For instance:
Query:
SELECT * FROM table1 INNER JOIN (SELECT * FROM table2 LIMIT 0) AS table2 ON table1.id = table2.idOptimized Plan without fix:
Optimized Plan with fix:
LocalRelation <empty>, [id#75, num1#76, id#77, num2#78]How was this patch tested?
Added unit tests to verify Limit 0 optimization for: