[SPARK-9372] [SQL] Filter nulls in Inner joins (null-skew) #9451

vidma · 2015-11-03T23:13:07Z

Do not merge yet: Work in progress / waiting for comments

Draft of a first step in optimizing skew in joins. Skew in data is quite common, and (for us) lots of nulls on either side of a join is quite common too, especially with left joins, say when joining dimensions to fact tables.

feel free to propose a better approach / add commits.

any ideas for an easy way to check whether the rule was already applied? After adding an isNotNull filter, someAttribute.nullable still returns true. I couldn't come up with anything better than simply running the rule in a separate batch of 1 iteration.
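To make the intent of the rule concrete, here is a minimal plain-Scala sketch (not Catalyst code) of why dropping null join keys before an inner equi-join is safe: a null key can never match, so filtering it out changes nothing except the number of rows that get shuffled. The names `NullFilterSketch`, `innerJoin` and `dropNullKeys` are hypothetical, and `None` stands in for a SQL NULL.

```scala
object NullFilterSketch {
  // A row is (join key, payload); None models a SQL NULL key.
  type Row = (Option[Int], String)

  // Inner equi-join with SQL `=` semantics: a NULL key never matches anything.
  def innerJoin(left: Seq[Row], right: Seq[Row]): Seq[(Int, String, String)] =
    for {
      (lKey, lVal) <- left
      (rKey, rVal) <- right
      k <- lKey.toList // skips rows whose left key is NULL
      if rKey.contains(k) // false when the right key is NULL
    } yield (k, lVal, rVal)

  // The filter the rule would add: keep only rows with a non-null key.
  def dropNullKeys(rows: Seq[Row]): Seq[Row] = rows.filter(_._1.isDefined)

  val left: Seq[Row] = Seq(None -> "l1", None -> "l2", Some(1) -> "l3")
  val right: Seq[Row] = Seq(Some(1) -> "r1", None -> "r2")

  val unfiltered = innerJoin(left, right)
  val filtered = innerJoin(dropNullKeys(left), dropNullKeys(right))
}
```

Both joins return the same rows, but the filtered inputs no longer carry the null-keyed rows that would otherwise all hash to the same partition.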

@marmbrus (as discussed at Spark Summit EU)

the next more serious step will be to fight skew in left join, where most helpers of this PR will be reused.

here is a rather simple implementation with DataFrames; it solves the null skew and doesn't seem to add much overhead (though I tried it only on a subset of our joins, the ones that use another abstraction of ours).

however this, so far, seems harder to express in optimizer rules:

need to add "fake" columns. no idea yet how to do this so that the added column can be referred to in join conditions

val leftNullsSprayValue = CaseWhen(
      Seq(
        nullableJoinKeys(left).map(IsNull).reduceLeft(Or), // if any join keys are null
        Cast(Multiply(new Rand(), Literal(100000)), IntegerType),
        Literal(0) // otherwise
      ))
// but how to add this column to left & right relations?
// e.g. this fails, saying it's not `resolved`
Alias(leftNullsSprayValue)("leftNullsSprayKey")()
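The snippet above leaves open how to express this in optimizer rules, and the following does not answer that. It is only a plain-Scala sketch of what the spray key is meant to achieve: rows with a null join key get a random extra key so they spread over partitions, while rows with a real key keep the constant 0. Everything here is hypothetical (`NullSpraySketch`, `sprayKey`, the toy `partition` function, the 8 partitions); Spark's real hash partitioning is different.

```scala
import scala.util.Random

object NullSpraySketch {
  // Stand-in for the CaseWhen expression: random bucket for NULL keys, 0 otherwise.
  def sprayKey(key: Option[Int], rng: Random, buckets: Int = 100000): Int =
    if (key.isEmpty) rng.nextInt(buckets) else 0

  // Toy partitioner over (join key, spray key); NOT Spark's actual hashing.
  def partition(key: Option[Int], spray: Int, numPartitions: Int): Int =
    Math.floorMod(key.getOrElse(0) * 31 + spray, numPartitions)

  val numPartitions = 8
  val rng = new Random(42) // fixed seed so the sketch is repeatable

  // 1000 rows with a NULL key plus a few rows with real keys.
  val leftKeys: Seq[Option[Int]] =
    Seq.fill[Option[Int]](1000)(None) ++ (1 to 10).map(i => Some(i))

  // Without spraying, every NULL key lands in the same partition.
  val nullPartitionsPlain: Set[Int] =
    leftKeys.filter(_.isEmpty).map(k => partition(k, 0, numPartitions)).toSet

  // With spraying, NULL keys are distributed; non-null keys keep spray key 0.
  val sprayed: Seq[(Option[Int], Int)] = leftKeys.map(k => (k, sprayKey(k, rng)))
  val nullPartitionsSprayed: Set[Int] =
    sprayed.collect { case (None, s) => partition(None, s, numPartitions) }.toSet
}
```

Because a null key never satisfies the original equality anyway, the extra key only changes where those rows are sent, not which rows match.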

yhuai · 2015-11-04T00:04:58Z

ok to test

yhuai · 2015-11-04T00:05:43Z

How about we update the title to include the jira? Is https://issues.apache.org/jira/browse/SPARK-9372 the right one?

yhuai · 2015-11-04T00:06:03Z

Regarding the format of the title, we can do [SPARK-xxxxx] [SQL] ...

SparkQA · 2015-11-04T00:12:53Z

Test build #44974 has finished for PR 9451 at commit 9a6d9dc.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-11-04T11:06:22Z

Test build #45010 has finished for PR 9451 at commit 4490d9d.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

vidma · 2015-11-07T18:05:49Z

so any comments, guys?
@marmbrus ?

vidma · 2015-11-07T18:34:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

any ideas on a better/simpler way to extract the left/right join key columns?

maybe:

joinConditionsOnBothRelations.map {
  case EqualTo(leftColumn, rightColumn) =>
    // check columns on both sides of join condition,
    // and take the one which refers to the required join side
    Seq(leftColumn, rightColumn)
      .filter(canEvaluate(_, leftOrRight))
      .filter(_.nullable)
}

is there a big difference between checking nullability on one side of the EqualTo() predicate vs magically extracting the equivalent attribute from the left/right LogicalPlans?

so is catalyst.planning.patterns.ExtractEquiJoinKeys the right way to go?
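As an illustration of the extraction being discussed, here is a self-contained sketch with tiny stand-in types. `Attr`, `EqualTo`, `Side` and `nullableJoinKeys` below are hypothetical models written for this example, not the Catalyst classes of the same names, and the `side` field replaces the `canEvaluate` check from the comment above.

```scala
object JoinKeySketch {
  // Minimal stand-ins; the real Catalyst expressions carry far more information.
  sealed trait Side
  case object LeftSide extends Side
  case object RightSide extends Side
  final case class Attr(name: String, side: Side, nullable: Boolean)
  final case class EqualTo(a: Attr, b: Attr)

  // For each equality, take the operand that belongs to `side`,
  // then keep only the ones that can actually be null.
  def nullableJoinKeys(conditions: Seq[EqualTo], side: Side): Seq[Attr] =
    conditions
      .flatMap(c => Seq(c.a, c.b))
      .filter(_.side == side)
      .filter(_.nullable)

  val conditions: Seq[EqualTo] = Seq(
    EqualTo(Attr("l.id", LeftSide, nullable = true), Attr("r.id", RightSide, nullable = false)),
    // Operands written right-to-left: extraction must not depend on their order.
    EqualTo(Attr("r.day", RightSide, nullable = true), Attr("l.day", LeftSide, nullable = true))
  )

  val leftKeys: Seq[String] = nullableJoinKeys(conditions, LeftSide).map(_.name)
  val rightKeys: Seq[String] = nullableJoinKeys(conditions, RightSide).map(_.name)
}
```

Here `leftKeys` contains `l.id` and `l.day`, while `rightKeys` contains only `r.day`, since `r.id` is declared non-nullable and needs no filter.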

SparkQA · 2015-11-07T21:09:34Z

Test build #45291 has finished for PR 9451 at commit 70d1fad.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

vidma · 2015-11-08T08:59:25Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

in the Inner | Semi join case, the null filter could be added to joinCondition (instead of to the left/right relations), assuming that it'll be pushed down by subsequent optimizer rules.
which do you prefer?

SparkQA · 2015-11-08T15:00:16Z

Test build #45301 has finished for PR 9451 at commit 0fa27c4.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

marmbrus · 2015-11-09T23:29:45Z

Hey, thanks for working on this! I probably won't have time to look at this in depth until after the Spark 1.6 release (early December).

vidma · 2016-01-04T18:10:50Z

@marmbrus ping ;)

SparkQA · 2016-01-04T22:34:14Z

Test build #48673 has finished for PR 9451 at commit d05a63d.

This patch fails from timeout after a configured wait of 250m.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-05T03:03:05Z

Test build #48694 has finished for PR 9451 at commit 1bcf9aa.

This patch fails from timeout after a configured wait of 250m.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-05T10:13:03Z

Test build #48754 has finished for PR 9451 at commit cd8ca34.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

vidma · 2016-04-01T18:00:35Z

@marmbrus pinging you to tag a Catalyst plan-rewriting guru.

For the current inner join PR, there's a flaky test in Python; I couldn't track it down yet.

For the more generic case (next PR), it doesn't seem easy to keep table aliases intact while adding a randomized spraying column as an extra left join key.

P.S. I'm in SF bay until Fri 8 Apr (better before Thu 9), so I could come over to chat to you guys live.
Cheers.

marmbrus · 2016-04-01T19:52:31Z

@sameeragarwal

davies · 2016-05-03T18:34:04Z

@vidma I think this is already fixed in master (joins now carry constraints, the constraints are turned into predicates, and the predicates are pushed down). Would you mind closing this PR?

vidma force-pushed the feature/fight-skew-in-inner-join branch 2 times, most recently from 9acab52 to 9a6d9dc Compare November 3, 2015 23:31

vidma changed the title ~~WIP: Optimize Inner joins with skewed null values~~ [SPARK-9372] [SQL] Filter nulls in Inner joins (null-skew) Nov 4, 2015

vidma reviewed Nov 7, 2015

vidma reviewed Nov 8, 2015

vidma force-pushed the feature/fight-skew-in-inner-join branch from 0fa27c4 to d05a63d Compare January 4, 2016 18:09

vidma force-pushed the feature/fight-skew-in-inner-join branch from d05a63d to 1bcf9aa Compare January 4, 2016 22:41

vidmantas zemleris added 5 commits January 5, 2016 10:39

Optimize Inner joins with skewed null values

11de68d

Add null filter only for EqualTo join conditions

c31a290

i.e. should not rewrite <=> or comparison, where null semantics are more subtle

Fix scalaStyle

8f047f2

Refactor using ExtractEquiJoinKeys

eaa12bc

Refactor to add null filter to joinConditions

cd8ca34

it will be pushed down by other rules, such as PushPredicateThroughJoin
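The two commit notes above (restrict the rewrite to EqualTo, and put the null filter into the join condition) can be illustrated with a small plain-Scala model. It is a sketch under assumptions, not Spark code: `None` models NULL, and `sqlEq`, `nullSafeEq`, `withNotNull` and `join` are hypothetical helpers.

```scala
object NullSemanticsSketch {
  type Key = Option[Int] // None models a SQL NULL

  // SQL `=`: a comparison involving NULL is unknown, which a join treats as no match.
  def sqlEq(a: Key, b: Key): Boolean = a.isDefined && a == b

  // SQL `<=>` (null-safe equality): NULL <=> NULL is true.
  def nullSafeEq(a: Key, b: Key): Boolean = a == b

  // Adds the IS NOT NULL conjuncts to the join condition itself.
  def withNotNull(cond: (Key, Key) => Boolean): (Key, Key) => Boolean =
    (a, b) => a.isDefined && b.isDefined && cond(a, b)

  // Nested-loop inner join over bare keys, parameterised by the join condition.
  def join(l: Seq[Key], r: Seq[Key])(cond: (Key, Key) => Boolean): Seq[(Key, Key)] =
    for {
      a <- l
      b <- r
      if cond(a, b)
    } yield (a, b)

  val left: Seq[Key] = Seq(Some(1), None)
  val right: Seq[Key] = Seq(Some(1), None)

  val eqPlain = join(left, right)(sqlEq)
  val eqFiltered = join(left, right)(withNotNull(sqlEq))
  val nsPlain = join(left, right)(nullSafeEq)
  val nsFiltered = join(left, right)(withNotNull(nullSafeEq))
}
```

With `=` the extra conjuncts are redundant, so `eqPlain` and `eqFiltered` are identical. With `<=>` the unfiltered join keeps the NULL-to-NULL pair and the filtered one drops it, which is why the rewrite is only safe for EqualTo.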

vidma force-pushed the feature/fight-skew-in-inner-join branch from 1bcf9aa to cd8ca34 Compare January 5, 2016 08:39

srowen mentioned this pull request May 11, 2016

[BUILD] Test closing stale PRs #13052

Closed

asfgit closed this in 5bb62b8 May 12, 2016
