
Conversation

@jiangxb1987
Contributor

What changes were proposed in this pull request?

Added a Scanner operator to wrap the optimized plan directly in the planner; it holds the project list as well as the filter predicates.
Updated the related Analyzer and Optimizer rules.
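
A minimal sketch of what such a wrapper node could look like (the shape, the field names, and the UnaryNode base are assumptions for illustration, not this PR's actual code):

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UnaryNode}

// Hypothetical shape of the proposed wrapper node: it pins the project list
// and the pushed-down filter predicates directly onto the node being scanned.
case class Scanner(
    projectList: Seq[NamedExpression], // columns to read from the relation
    filters: Seq[Expression],          // predicates pushed down to the scan
    child: LogicalPlan)                // the underlying relation
  extends UnaryNode {
  override def output: Seq[Attribute] = projectList.map(_.toAttribute)
}
```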

How was this patch tested?

Existing test cases.

@SparkQA

SparkQA commented Aug 12, 2016

Test build #63679 has finished for PR 14619 at commit b1e224c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

Could you elaborate on what you are trying to do here? I am missing the context. What is the advantage of doing this?

@jiangxb1987
Contributor Author

@hvanhovell This idea was inspired by @cloud-fan; as he stated in his comment, we'd better have a wrapper node for the scan, so that the planner can match the wrapper node directly instead of resolving the whole plan using PhysicalOperation.

@SparkQA

SparkQA commented Aug 12, 2016

Test build #63696 has finished for PR 14619 at commit 0fc7d00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Aug 12, 2016

I'm pretty confused by this as well. Is this just collapsing Filter, Project, and an arbitrary node into a single logical node?

@cloud-fan
Contributor

see discussion here: #13893 (comment)

Currently we collect the projects and filters on the scan node in the planner via PhysicalOperation.unapply. PhysicalOperation.unapply mostly duplicates the logic of the column-pruning and filter-pushdown rules in the optimizer, but doesn't handle non-deterministic expressions well. By adding a wrapper node on the scan, we can push projects and filters down to the scan node during the optimizer phase and reuse the existing rules. Thus we can eliminate the duplicated code and handle non-deterministic expressions.
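
For reference, this is roughly how a planner strategy destructures a plan with that extractor today (a simplified sketch, not code from this PR; collectScan is a made-up name):

```scala
import org.apache.spark.sql.catalyst.planning.PhysicalOperation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Simplified sketch: the PhysicalOperation extractor peels Project/Filter
// nodes off the top of a scan, re-deriving what the optimizer's
// column-pruning and predicate-pushdown rules have already computed.
def collectScan(plan: LogicalPlan): Unit = plan match {
  case PhysicalOperation(projects, filters, relation) =>
    // projects: Seq[NamedExpression] gathered from Project nodes above the scan
    // filters:  Seq[Expression] gathered from Filter nodes above the scan
    println(s"scan ${relation.nodeName}: ${projects.size} projects, ${filters.size} filters")
  case _ => // not a simple project/filter-over-scan shape
}
```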

Contributor

This doesn't reuse the existing ColumnPruning and PushDownPredicate rules. I'd like to add the wrapper at the very beginning, e.g. in SessionCatalog.lookupRelation.
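
A rough illustration of that suggestion (hypothetical: resolveTable stands in for the real catalog lookup, and Scanner is the wrapper sketch from the description above):

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical sketch: wrap every relation in a Scanner as soon as it is
// resolved from the catalog, so the optimizer's existing pushdown rules
// can later fold projects and filters into it.
def resolveTable(name: TableIdentifier): LogicalPlan = ??? // stand-in for catalog access

def lookupRelation(name: TableIdentifier): LogicalPlan = {
  val relation = resolveTable(name)
  Scanner(relation.output, Nil, relation) // identity project list, no filters yet
}
```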

Contributor Author

Roger that, will change it and retest.

@SparkQA

SparkQA commented Aug 27, 2016

Test build #64533 has finished for PR 14619 at commit 6eb5bd7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 28, 2016

Test build #64544 has finished for PR 14619 at commit 1d502b6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor Author

jiangxb1987 commented Aug 28, 2016

@cloud-fan I've moved the InsertRelationScanner rule into the Analyzer, after relations and expressions are resolved. To reuse the analyzer and optimizer rules, I updated related rules such as CleanupAliases, ColumnPruning, PushDownPredicate, InferFiltersFromConstraints, ConvertToLocalRelation, and PropagateEmptyRelation, and I also added new rules to combine and prune Scanner operators (a sketch of a combining rule follows below). Besides, I made some changes to the subquery-related rules, and recently found that they have been refactored.
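
For illustration, here is a minimal sketch of what a rule that combines stacked Scanner operators might look like. This is hypothetical, not the PR's actual rule: it keeps the outer project list and concatenates the filters, while a real version would also have to substitute aliases introduced by the inner project list.

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical sketch of a "combine Scanner operators" rule: two stacked
// Scanners collapse into one, keeping the outer (narrower) project list
// and concatenating the pushed-down filter predicates.
object CombineScanners extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    case Scanner(outerProjects, outerFilters, Scanner(_, innerFilters, child)) =>
      Scanner(outerProjects, outerFilters ++ innerFilters, child)
  }
}
```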
Now only a few test cases are still failing, and those should be easy to fix. But I've realized that adding a wrapper node over every relation may not be an ideal approach, for the following reasons:

Firstly, scanning a relation is not one of the basic operators in the SQL language; when we declare a relation, we imply that it should be scanned, so it seems semantically redundant to declare a Scanner node over a relation or to call relation.scanner(). Besides, to add this wrapper node we have to make a new assumption that no other operators may be inserted between a Scanner and its corresponding relation, which brings in more complexity.

Secondly, a wrapper node should contain the output, the predicates usable for partition pruning, and the relation to be scanned. But this causes complications in some cases. For example, in InferFiltersFromConstraints we have to convert the expressions in the filters to their alias names when we collect valid constraints, because the output may be an alias while the filters must use the child's expressions; this behavior is not needed in other operators.
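
A contrived illustration of that aliasing problem, using Catalyst's expression classes (hypothetical values, not code from the PR):

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.IntegerType

// Suppose a Scanner projects `a AS x`: its output is the alias `x`, but its
// pushed-down filter still references the child attribute `a`. A constraint
// like `a > 0` derived from the filter must be rewritten in terms of `x`
// before it can be propagated to parent operators.
val a = AttributeReference("a", IntegerType)()
val x = Alias(a, "x")()
val constraint = GreaterThan(a, Literal(0)) // stated over the child attribute
val rewritten = constraint transform {
  case attr: AttributeReference if attr.exprId == a.exprId => x.toAttribute
} // now stated over the output alias `x`
```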

Lastly, I feel that adding such an operator has caused too many changes; perhaps we should instead make incremental improvements to PhysicalOperation until we figure out an approach that is comprehensively better than the current method.

In any case, I'm enthusiastic about this improvement and will try my best to contribute. Please correct me if I'm wrong, thank you!

@SparkQA

SparkQA commented Sep 7, 2016

Test build #65033 has finished for PR 14619 at commit 957d784.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 7, 2016

Test build #65044 has finished for PR 14619 at commit 230d65d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
