
Conversation

@jiangxb1987
Contributor

What changes were proposed in this pull request?

Added a Scanner operator to wrap the optimized plan directly in the planner; it holds the project list as well as the filter predicates.
Updated the related Analyzer and Optimizer rules.
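
A minimal sketch of what such a wrapper node could look like (the shape, the field names, and the UnaryNode base are assumptions for illustration, not this PR's actual code):

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UnaryNode}

// Hypothetical shape of the proposed wrapper node: it pins the project list
// and the pushed-down filter predicates directly onto the node being scanned.
case class Scanner(
    projectList: Seq[NamedExpression], // columns to read from the relation
    filters: Seq[Expression],          // predicates pushed down to the scan
    child: LogicalPlan)                // the underlying relation
  extends UnaryNode {
  override def output: Seq[Attribute] = projectList.map(_.toAttribute)
}
```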

How was this patch tested?

Existing test cases.

@SparkQA

SparkQA commented Aug 12, 2016

Test build #63679 has finished for PR 14619 at commit b1e224c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

Could you elaborate on what you are trying to do here? I am missing the context. What is the advantage of doing this?

@jiangxb1987
Contributor Author

@hvanhovell This idea was inspired by @cloud-fan; as he stated in his comment, we'd better have a wrapper node for the scan, so that the planner can match the wrapper node directly instead of resolving the whole plan using PhysicalOperation.

@SparkQA

SparkQA commented Aug 12, 2016

Test build #63696 has finished for PR 14619 at commit 0fc7d00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Aug 12, 2016

I'm pretty confused by this as well. Is this just collapsing Filter, Project, and an arbitrary node into a single logical node?

@cloud-fan
Contributor

see discussion here: #13893 (comment)

Currently we collect the projects and filters on the scan node in the planner via PhysicalOperation.unapply. PhysicalOperation.unapply mostly duplicates the logic of the column-pruning and filter-pushdown rules in the optimizer, but doesn't handle non-deterministic expressions well. By adding a wrapper node on the scan, we can push projects and filters down to the scan node during the optimizer phase and reuse the existing rules. Thus we can eliminate the duplicated code and handle non-deterministic expressions.
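
For reference, this is roughly how a planner strategy destructures a plan with that extractor today (a simplified sketch, not code from this PR; collectScan is a made-up name):

```scala
import org.apache.spark.sql.catalyst.planning.PhysicalOperation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Simplified sketch: the PhysicalOperation extractor peels Project/Filter
// nodes off the top of a scan, re-deriving what the optimizer's
// column-pruning and predicate-pushdown rules have already computed.
def collectScan(plan: LogicalPlan): Unit = plan match {
  case PhysicalOperation(projects, filters, relation) =>
    // projects: Seq[NamedExpression] gathered from Project nodes above the scan
    // filters:  Seq[Expression] gathered from Filter nodes above the scan
    println(s"scan ${relation.nodeName}: ${projects.size} projects, ${filters.size} filters")
  case _ => // not a simple project/filter-over-scan shape
}
```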

Contributor

This doesn't reuse the existing ColumnPruning and PushDownPredicate rules. I'd like to add the wrapper at the very beginning, e.g. in SessionCatalog.lookupRelation.
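
A rough illustration of that suggestion (hypothetical: resolveTable stands in for the real catalog lookup, and Scanner is the wrapper sketch from the description above):

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical sketch: wrap every relation in a Scanner as soon as it is
// resolved from the catalog, so the optimizer's existing pushdown rules
// can later fold projects and filters into it.
def resolveTable(name: TableIdentifier): LogicalPlan = ??? // stand-in for catalog access

def lookupRelation(name: TableIdentifier): LogicalPlan = {
  val relation = resolveTable(name)
  Scanner(relation.output, Nil, relation) // identity project list, no filters yet
}
```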

Contributor Author

Roger that, will change it and retest.

@SparkQA

SparkQA commented Aug 27, 2016

Test build #64533 has finished for PR 14619 at commit 6eb5bd7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 28, 2016

Test build #64544 has finished for PR 14619 at commit 1d502b6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor Author

jiangxb1987 commented Aug 28, 2016

@cloud-fan I've moved the InsertRelationScanner rule into the Analyzer, after relations and expressions are resolved. To reuse the analyzer and optimizer rules, I updated related rules such as CleanupAliases, ColumnPruning, PushDownPredicate, InferFiltersFromConstraints, ConvertToLocalRelation, and PropagateEmptyRelation, and I also added new rules to combine and prune Scanner operators (a sketch of a combining rule follows below). Besides, I made some changes to the subquery-related rules, and recently found that they have been refactored.
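
For illustration, here is a minimal sketch of what a rule that combines stacked Scanner operators might look like. This is hypothetical, not the PR's actual rule: it keeps the outer project list and concatenates the filters, while a real version would also have to substitute aliases introduced by the inner project list.

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical sketch of a "combine Scanner operators" rule: two stacked
// Scanners collapse into one, keeping the outer (narrower) project list
// and concatenating the pushed-down filter predicates.
object CombineScanners extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    case Scanner(outerProjects, outerFilters, Scanner(_, innerFilters, child)) =>
      Scanner(outerProjects, outerFilters ++ innerFilters, child)
  }
}
```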
Now only a few test cases are still failing, and those should be easy to fix. But I've realized that adding a wrapper node over every relation may not be an ideal approach, for the following reasons:

Firstly, scanning a relation is not one of the basic operators in the SQL language; when we declare a relation, we imply that it should be scanned, so it seems semantically redundant to declare a Scanner node over a relation or to call relation.scanner(). Besides, to add this wrapper node we have to make a new assumption that no other operators may be inserted between a Scanner and its corresponding relation, which brings in more complexity.

Secondly, a wrapper node should contain the output, the predicates usable for partition pruning, and the relation to be scanned. But this causes complications in some cases. For example, in InferFiltersFromConstraints we have to convert the expressions in the filters to their alias names when we collect valid constraints, because the output may be an alias while the filters must use the child's expressions; this behavior is not needed in other operators.
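
A contrived illustration of that aliasing problem, using Catalyst's expression classes (hypothetical values, not code from the PR):

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.IntegerType

// Suppose a Scanner projects `a AS x`: its output is the alias `x`, but its
// pushed-down filter still references the child attribute `a`. A constraint
// like `a > 0` derived from the filter must be rewritten in terms of `x`
// before it can be propagated to parent operators.
val a = AttributeReference("a", IntegerType)()
val x = Alias(a, "x")()
val constraint = GreaterThan(a, Literal(0)) // stated over the child attribute
val rewritten = constraint transform {
  case attr: AttributeReference if attr.exprId == a.exprId => x.toAttribute
} // now stated over the output alias `x`
```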

Lastly, I feel that adding such an operator has caused too many changes; perhaps we should instead make incremental improvements to PhysicalOperation until we figure out an approach that is comprehensively better than the current method.

In any case, I'm enthusiastic about this improvement and will try my best to contribute. Please correct me if I'm wrong, thank you!

@SparkQA

SparkQA commented Sep 7, 2016

Test build #65033 has finished for PR 14619 at commit 957d784.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 7, 2016

Test build #65044 has finished for PR 14619 at commit 230d65d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
