[SPARK-7871] [SQL] Improve the outputPartitioning for outer joins. #7886

yhuai · 2015-08-03T06:16:05Z

https://issues.apache.org/jira/browse/SPARK-7871

This PR adds the concept of nullSafe to ClusteredDistribution and HashPartitioning. For a ClusteredDistribution, if its nullSafe field is false, it does not require that rows whose clustering expressions contain nulls be clustered together. For a HashPartitioning, if its nullSafe field is false, it does not guarantee that rows whose clustering expressions contain nulls are clustered together.
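To make the requirement side concrete, here is a minimal sketch in plain Python, not Spark code. It assumes a single clustering key per row, with None standing in for a SQL NULL. The function name satisfies_clustering and the list-of-lists layout are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def satisfies_clustering(partitions, null_safe):
    """Return True if every key value lives in exactly one partition.

    partitions: list of partitions, each a list of keys (None models NULL).
    null_safe:  when False, None keys are exempt from the check.
    """
    seen = defaultdict(set)  # key -> indices of the partitions that hold it
    for idx, part in enumerate(partitions):
        for key in part:
            if key is None and not null_safe:
                continue  # rows with a null key need not be co-located
            seen[key].add(idx)
    return all(len(indices) == 1 for indices in seen.values())


# Non-null keys are clustered, but null keys sit in two different partitions.
layout = [[1, 1, None], [2, None], [3, 3]]
print(satisfies_clustering(layout, null_safe=False))  # True
print(satisfies_clustering(layout, null_safe=True))   # False
```

The same layout satisfies the relaxed (nullSafe = false) requirement but not the strict one, which is the distinction the new field is meant to capture.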

This concept can be used with equi-joins. A shuffled equi-join operator (ShuffledHashJoin, ShuffledHashOuterJoin, and SortMergeJoin) can use ClusteredDistributions with nullSafe = false. By adding this concept, we can avoid shuffling data when we have outer joins. For example, we only need three Exchange operators instead of four for a query like SELECT ... A LEFT OUTER JOIN B ON (A.key = B.key) LEFT OUTER JOIN C ON (B.key = C.key).
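The three-Exchange claim can be checked with a toy model, sketched below in plain Python rather than Spark. It rests on the SQL rule that a NULL join key never matches under plain equality. The helpers hash_partition and left_outer_join are hypothetical stand-ins for an Exchange and a per-partition join. The inputs are assumed to hold small non-negative integer keys, so nulls arise only from outer-join padding.

```python
def hash_partition(rows, key_col, n):
    """Stand-in for an Exchange: put each row in partition hash(key) % n."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key_col]) % n].append(row)
    return parts


def left_outer_join(left, right, left_col, right_col, right_width):
    """Left outer equi-join of two lists of tuples."""
    out = []
    for l in left:
        key = l[left_col]
        # SQL equality: a NULL key matches nothing, not even another NULL.
        matches = [r for r in right if key is not None and key == r[right_col]]
        if matches:
            out.extend(l + r for r in matches)
        else:
            out.append(l + (None,) * right_width)  # pad the unmatched row
    return out


N = 3
A = [(k,) for k in range(6)]
B = [(0,), (2,), (4,)]
C = [(0,), (4,), (7,)]

a_parts = hash_partition(A, 0, N)  # Exchange 1
b_parts = hash_partition(B, 0, N)  # Exchange 2
c_parts = hash_partition(C, 0, N)  # Exchange 3

# First join, partition by partition. Its output is NOT re-shuffled.
ab_parts = [left_outer_join(a, b, 0, 0, 1) for a, b in zip(a_parts, b_parts)]
# Second join on B.key = C.key, reusing the first join's partitioning.
abc_parts = [left_outer_join(ab, c, 1, 0, 1) for ab, c in zip(ab_parts, c_parts)]

partitioned = {row for part in abc_parts for row in part}
reference = set(left_outer_join(left_outer_join(A, B, 0, 0, 1), C, 1, 0, 1))
null_partitions = {i for i, part in enumerate(ab_parts)
                   for row in part if row[1] is None}

print(partitioned == reference)  # True
print(len(null_partitions))      # 3
```

After the first join, rows padded with a null B.key are spread over all three partitions, yet the partitioned result equals the unpartitioned reference. Under these assumptions the fourth Exchange on the first join's output is unnecessary, because those scattered rows can never match a row of C.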

BTW, this PR does not shuffle rows with null partition keys randomly. #7685 has that part, and we can add it later.

SparkQA · 2015-08-03T07:50:45Z

Test build #39520 has finished for PR 7886 at commit 2bc9be3.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class NaiveBayes(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasProbabilityCol,
- class RowOrdering(ordering: Seq[SortOrder]) extends Ordering[InternalRow]
- case class ClusteredDistribution(
- case class HashPartitioning(
- case class PartitioningCollection(partitionings: Seq[Partitioning])

SparkQA · 2015-08-03T17:29:59Z

Test build #39553 has finished for PR 7886 at commit a1d417b.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class ClusteredDistribution(
- case class HashPartitioning(

JoshRosen · 2015-08-03T20:45:50Z

I'd like to try to review this now since I think it's going to conflict with the SMJ outer join patch.

JoshRosen · 2015-08-03T20:50:33Z

One high-level comment: unless I've overlooked it, there doesn't seem to be any documentation in the code to explain what the nullSafe concept means here, although maybe the meaning is clear from usage and context. One potential area of naming confusion is that we already use "null safety" when talking about whether expression evaluation methods can expect to receive null values or not. Our usage here seems almost backwards, though, since this PR seems to say that a null-safe partitioning means that the nulls will be shuffled, whereas the unsafe version drops the nulls. Am I overlooking something or is this potentially confusing?

JoshRosen · 2015-08-03T20:52:49Z

Expression's use of nullSafe seems to be "safe due to absence of nulls", whereas this patch seems to use it as "safe to receive nulls / shuffle nulls."

JoshRosen · 2015-08-03T20:55:43Z

sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala

Why did you use a while loop here instead of a for comprehension or pair of nested for loops?

JoshRosen · 2015-08-03T21:02:19Z

Actually I'm going to drop review of this for now and focus on pulling in SMJ first. That will conflict with this patch but we can remember to update SMJ's OutputPartitioning as well.

chenghao-intel · 2015-08-07T08:34:32Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashOuterJoin.scala

Need to override PartitioningCollection.nullSafe.

yhuai · 2015-09-03T21:19:54Z

I am closing it for now. Will reopen it when I get a chance to work on it.

yhuai added 2 commits August 3, 2015 08:33

Add the concept of nullSafe to ClusteredDistribution and Partitioning.

fce9053

Always use nullSafe version partitioning when creating Exchanges.

a1d417b

JoshRosen reviewed Aug 3, 2015

chenghao-intel reviewed Aug 7, 2015

yhuai closed this Sep 3, 2015

4 participants
