[SPARK-2205] [SPARK-7871] [SPARK-9372] [SQL] [WIP] Improving SQL query planner #7685

yhuai · 2015-07-27T05:18:24Z

This PR introduces three improvements to the SQL planner.

First, it adds an optimization rule, FilterNullsInJoinKey, which inserts a Filter before join operators to drop rows that have null values for join keys. Such rows will not be joined in an equi-join, so they can be removed early.
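To illustrate why filtering null join keys early is safe for an inner equi-join, here is a minimal sketch in plain Python. It is a toy model, not Spark code: `inner_equi_join` and `filter_null_keys` are hypothetical helpers, and `None` stands in for SQL NULL, which never compares equal under ordinary `=`.

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[int], str]  # (join key, payload); None models SQL NULL


def inner_equi_join(left: List[Row], right: List[Row]) -> List[Tuple[int, str, str]]:
    """Nested-loop inner join on the key with SQL semantics: NULL = NULL is not true."""
    out = []
    for lk, lv in left:
        for rk, rv in right:
            if lk is not None and rk is not None and lk == rk:
                out.append((lk, lv, rv))
    return out


def filter_null_keys(rows: List[Row]) -> List[Row]:
    """Toy counterpart of the Filter added in front of the join: drop null-key rows."""
    return [row for row in rows if row[0] is not None]


t1 = [(1, "a"), (None, "b"), (2, "c"), (None, "d")]
t2 = [(1, "x"), (None, "y"), (3, "z")]

unfiltered = inner_equi_join(t1, t2)
filtered = inner_equi_join(filter_null_keys(t1), filter_null_keys(t2))

# The join result is unchanged, but fewer rows reach the join (and any shuffle before it).
assert unfiltered == filtered
print(filtered)
```

In this toy case the join input shrinks from 4 and 3 rows to 2 and 2 rows while the result stays the same.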

Second, it adds NullUnsafeClusteredDistribution and NullUnsafeHashPartitioning, which can be used to distribute rows having null values for join keys evenly. NullUnsafeClusteredDistribution is the same as ClusteredDistribution (now renamed to NullSafeClusteredDistribution), except that it does not require rows having null values for join keys to be clustered.
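The difference between the two distributions can be sketched with a toy partitioner in plain Python. This is not the PR's implementation: the function names are hypothetical, `key % num_partitions` stands in for a real hash, and spreading null keys round-robin is an assumption made only for illustration (the description above says only that null-key rows need not be clustered).

```python
from collections import Counter
from itertools import count
from typing import List, Optional


def null_safe_partition(keys: List[Optional[int]], num_partitions: int) -> List[int]:
    """Clusters every key, including NULL: all null keys land in partition 0."""
    return [0 if k is None else k % num_partitions for k in keys]


def null_unsafe_partition(keys: List[Optional[int]], num_partitions: int) -> List[int]:
    """Clusters non-null keys only; null keys are spread round-robin (assumed policy)."""
    round_robin = count()
    return [
        next(round_robin) % num_partitions if k is None else k % num_partitions
        for k in keys
    ]


# A skewed input: eight rows with a null join key and one row for each key 0..3.
keys = [None] * 8 + [0, 1, 2, 3]

safe_sizes = Counter(null_safe_partition(keys, 4))
unsafe_sizes = Counter(null_unsafe_partition(keys, 4))

print(sorted(safe_sizes.items()))
print(sorted(unsafe_sizes.items()))
```

With the null-safe scheme one partition receives nine of the twelve rows; with the null-unsafe scheme each partition receives three, while rows with equal non-null keys still end up in the same partition.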

Third, it adds PartitioningCollection, which is used to represent the outputPartitioning for SparkPlans with multiple children (e.g. ShuffledHashJoin). With it, a SparkPlan can have multiple descriptions of its partitioning schemes. Taking ShuffledHashJoin as an example, it has two descriptions of its partitioning schemes, i.e. left.outputPartitioning and right.outputPartitioning. So a query like select * from t1 join t2 on (t1.x = t2.x) join t3 on (t2.x = t3.x) will only have three Exchange operators (when shuffled joins are needed) instead of four.
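The exchange count for that query can be reproduced with a small toy planner in plain Python. It is a sketch under stated assumptions, not Spark's planner: `Plan`, `ensure`, and `shuffled_join` are hypothetical names, and the "before" behavior is modeled by assuming the join reports only its left child's partitioning.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Plan:
    partitionings: FrozenSet[str]  # keys the output is known to be hash-partitioned on
    exchanges: int                 # number of Exchange operators in this subtree


def scan() -> Plan:
    """A base table with no known partitioning."""
    return Plan(frozenset(), 0)


def ensure(child: Plan, key: str) -> Plan:
    """Add an Exchange on `key` unless the child is already partitioned on it."""
    if key in child.partitionings:
        return child
    return Plan(frozenset({key}), child.exchanges + 1)


def shuffled_join(left: Plan, right: Plan, left_key: str, right_key: str,
                  use_collection: bool) -> Plan:
    l = ensure(left, left_key)
    r = ensure(right, right_key)
    # With a collection the join advertises both children's partitionings;
    # without it (assumed old behavior) only the left child's.
    out = l.partitionings | r.partitionings if use_collection else l.partitionings
    return Plan(out, l.exchanges + r.exchanges)


def count_exchanges(use_collection: bool) -> int:
    """t1 join t2 on (t1.x = t2.x) join t3 on (t2.x = t3.x)."""
    j1 = shuffled_join(scan(), scan(), "t1.x", "t2.x", use_collection)
    j2 = shuffled_join(j1, scan(), "t2.x", "t3.x", use_collection)
    return j2.exchanges


print(count_exchanges(use_collection=False), count_exchanges(use_collection=True))
```

In the toy model the second join needs its left input partitioned on t2.x. When the first join advertises both t1.x and t2.x, that requirement is already met, so only t3 needs a new Exchange.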

Optimizations in the first and second improvement are guarded by spark.sql.advancedOptimization.

I will add more comments/docs and tests later.

SparkQA · 2015-07-27T05:25:35Z

Test build #38507 has finished for PR 7685 at commit e66d5a9.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class AtLeastNNulls(n: Int, children: Seq[Expression]) extends Predicate
- case class AtLeastNNonNullNans(n: Int, children: Seq[Expression]) extends Predicate
- class DefaultOptimizer extends Optimizer
- case class NullSafeClusteredDistribution(clustering: Seq[Expression]) extends Distribution
- case class NullUnsafeClusteredDistribution(clustering: Seq[Expression]) extends Distribution
- case class NullSafeHashPartitioning(expressions: Seq[Expression], numPartitions: Int)
- case class NullUnsafeHashPartitioning(expressions: Seq[Expression], numPartitions: Int)
- case class PartitioningCollection(partitionings: Seq[Partitioning])
- case class FilterNullsInJoinKey(

SparkQA · 2015-07-27T05:50:18Z

Test build #38511 has finished for PR 7685 at commit 13d1c9e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-27T21:39:02Z

Test build #38588 has finished for PR 7685 at commit d3d2e64.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-27T23:59:48Z

Test build #38606 has finished for PR 7685 at commit c57a954.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class AtLeastNNulls(n: Int, children: Seq[Expression]) extends Predicate
- case class AtLeastNNonNullNans(n: Int, children: Seq[Expression]) extends Predicate
- class DefaultOptimizer extends Optimizer
- case class ClusteredDistribution(
- case class HashPartitioning(expressions: Seq[Expression], numPartitions: Int, nullSafe: Boolean)
- case class PartitioningCollection(partitionings: Seq[Partitioning])
- case class FilterNullsInJoinKey(

JoshRosen · 2015-07-28T00:09:43Z

@yhuai, do you think that we should carve this up into multiple PRs to ease reviews? It looks like FilterNullsInJoinKey should be easy to split out.

yhuai · 2015-07-28T00:35:20Z

@JoshRosen yeah. Let me split it.

yhuai · 2015-07-28T00:37:01Z

btw, FilterNullsInJoinKeySuite is not deterministic. I will change it later.

…joins

This PR adds `PartitioningCollection`, which is used to represent the `outputPartitioning` for SparkPlans with multiple children (e.g. `ShuffledHashJoin`). So, a `SparkPlan` can have multiple descriptions of its partitioning schemes. Taking `ShuffledHashJoin` as an example, it has two descriptions of its partitioning schemes, i.e. `left.outputPartitioning` and `right.outputPartitioning`. So a query like `select * from t1 join t2 on (t1.x = t2.x) join t3 on (t2.x = t3.x)` will only have three Exchange operators (when shuffled joins are needed) instead of four.

The code in this PR was authored by yhuai; I'm opening this PR to factor out this change from #7685, a larger pull request which contains two other optimizations.

Author: Yin Huai <[email protected]>
Author: Josh Rosen <[email protected]>

Closes #7773 from JoshRosen/multi-way-join-planning-improvements and squashes the following commits:

- 5c45924 [Josh Rosen] Merge remote-tracking branch 'origin/master' into multi-way-join-planning-improvements
- cd8269b [Josh Rosen] Refactor test to use SQLTestUtils
- 2963857 [Yin Huai] Revert unnecessary SqlConf change.
- 73913f7 [Yin Huai] Add comments and test. Also, revert the change in ShuffledHashOuterJoin for now.
- 4a99204 [Josh Rosen] Delete unrelated expression change
- 884ab95 [Josh Rosen] Carve out only SPARK-2205 changes.
- 247e5fa [Josh Rosen] Merge remote-tracking branch 'origin/master' into multi-way-join-planning-improvements
- c57a954 [Yin Huai] Bug fix.
- d3d2e64 [Yin Huai] First round of cleanup.
- f9516b0 [Yin Huai] Style
- c6667e7 [Yin Huai] Add PartitioningCollection.
- e616d3b [Yin Huai] wip
- 7c2d2d8 [Yin Huai] Bug fix and refactoring.
- 69bb072 [Yin Huai] Introduce NullSafeHashPartitioning and NullUnsafePartitioning.
- d5b84c3 [Yin Huai] Do not add unnessary filters.
- 2201129 [Yin Huai] Filter out rows that will not be joined in equal joins early.

This PR adds an optimization rule, `FilterNullsInJoinKey`, to add `Filter` before join operators to filter out rows having null values for join keys. This optimization is guarded by a new SQL conf, `spark.sql.advancedOptimization`.

The code in this PR was authored by yhuai; I'm opening this PR to factor out this change from #7685, a larger pull request which contains two other optimizations.

Author: Yin Huai <[email protected]>
Author: Josh Rosen <[email protected]>

Closes #7768 from JoshRosen/filter-nulls-in-join-key and squashes the following commits:

- c02fc3f [Yin Huai] Address Josh's comments.
- 0a8e096 [Yin Huai] Update comments.
- ea7d5a6 [Yin Huai] Make sure we do not keep adding filters.
- be88760 [Yin Huai] Make it clear that FilterNullsInJoinKeySuite.scala is used to test FilterNullsInJoinKey.
- 8bb39ad [Yin Huai] Fix non-deterministic tests.
- 303236b [Josh Rosen] Revert changes that are unrelated to null join key filtering
- 40eeece [Josh Rosen] Merge remote-tracking branch 'origin/master' into filter-nulls-in-join-key
- c57a954 [Yin Huai] Bug fix.
- d3d2e64 [Yin Huai] First round of cleanup.
- f9516b0 [Yin Huai] Style
- c6667e7 [Yin Huai] Add PartitioningCollection.
- e616d3b [Yin Huai] wip
- 7c2d2d8 [Yin Huai] Bug fix and refactoring.
- 69bb072 [Yin Huai] Introduce NullSafeHashPartitioning and NullUnsafePartitioning.
- d5b84c3 [Yin Huai] Do not add unnessary filters.
- 2201129 [Yin Huai] Filter out rows that will not be joined in equal joins early.

yhuai · 2015-08-03T06:43:51Z

I am closing it.

yhuai changed the title ~~[SPARK-2205] [SPARK-7871] [SPARK-9372] [SQL] [WIP] Three SQL optimziations~~ [SPARK-2205] [SPARK-7871] [SPARK-9372] [SQL] [WIP] Improving SQL query planner Jul 27, 2015

yhuai added 8 commits July 27, 2015 14:15

Filter out rows that will not be joined in equal joins early.

2201129

Do not add unnessary filters.

d5b84c3

Introduce NullSafeHashPartitioning and NullUnsafePartitioning.

69bb072

Bug fix and refactoring.

7c2d2d8

wip

e616d3b

Add PartitioningCollection.

c6667e7

Style

f9516b0

First round of cleanup.

d3d2e64

Bug fix.

c57a954

This was referenced Jul 30, 2015

[SPARK-9372] [SQL] Filter nulls in join keys #7768

Closed

[SPARK-2205] [SQL] Avoid unnecessary exchange operators in multi-way joins #7773

Closed

[SPARK-7871][SQL]Improve the outputPartitioning for HashOuterJoin #6413

Closed

yhuai mentioned this pull request Aug 3, 2015

[SPARK-7871] [SQL] Improve the outputPartitioning for outer joins. #7886

Closed

yhuai closed this Aug 3, 2015
