Skip to content

Conversation

@vidma
Copy link

@vidma vidma commented Nov 3, 2015

Do not merge yet: Work in progress / waiting for comments

Draft of first step in optimizing skew in joins (it is quite common to have skew in data, and lots of nulls on either side of join is quite common (for us), especially with left join, say when joining dimensions to fact tables)

feel free to propose a better approach / add commits.

any ideas for an easy way to check if the rule was already applied? After adding a isNotNull filter someAttribute.nullable still returns true. I couldn't come up with anything better than simply doing a separate batch of 1 iteration.

@marmbrus (as discussed at Spark Summit EU)


the next more serious step will be to fight skew in left join, where most helpers of this PR will be reused.

here is a rather simple implementation with DataFrames, solves the null skew, and don't seem to add lots of overhead (though tried only on subset of all our joins which used another abstraction of ours).

however this, so far, seems harder to express in optimizer rules:

  • need to add "fake" colums. no idea yet how to do this to be able to refer to the added column in join conditions
val leftNullsSprayValue = CaseWhen(
      Seq(
        nullableJoinKeys(left).map(IsNull).reduceLeft(Or), // if any join keys are null
        Cast(Multiply(new Rand(), Literal(100000)), IntegerType),
        Literal(0) // otherwise
      ))
// but how to add this column to left & right relations?
// e.g. this fails, saying it's not `resolved`
Alias(leftNullsSprayValue)("leftNullsSprayKey")()

@vidma vidma force-pushed the feature/fight-skew-in-inner-join branch 2 times, most recently from 9acab52 to 9a6d9dc Compare November 3, 2015 23:31
@yhuai
Copy link
Contributor

yhuai commented Nov 4, 2015

ok to test

@yhuai
Copy link
Contributor

yhuai commented Nov 4, 2015

How about we update the title to include the jira? Is https://issues.apache.org/jira/browse/SPARK-9372 the right one?

@yhuai
Copy link
Contributor

yhuai commented Nov 4, 2015

Regarding the format of the title, we can do [SPARK-xxxxx] [SQL] ...

@SparkQA
Copy link

SparkQA commented Nov 4, 2015

Test build #44974 has finished for PR 9451 at commit 9a6d9dc.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 4, 2015

Test build #45010 has finished for PR 9451 at commit 4490d9d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vidma vidma changed the title WIP: Optimize Inner joins with skewed null values [SPARK-9372] [SQL] Filter nulls in Inner joins (null-skew) Nov 4, 2015
@vidma
Copy link
Author

vidma commented Nov 7, 2015

so any comments, guys?
@marmbrus ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideas on better/simpler way to extract left/right join key columns ?

maybe:

  joinConditionsOnBothRelations.map { case EqualTo(leftColumn, rightColumn) =>
    // check columns on both sides of join condition, 
    // and take the one which refers to the required join side
    Seq(leftColumn, rightColumn)
      .filter(canEvaluate(_, leftOrRight))
      .filter(_.nullable)
  }

is there a big difference between checking for nullability one side of EqualTo() predicate vs magically extracting the equivalent attribute from left/right LogicalPlans'?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so is catalyst.planning.patterns.ExtractEquiJoinKeys is the right way to go (?)

@SparkQA
Copy link

SparkQA commented Nov 7, 2015

Test build #45291 has finished for PR 9451 at commit 70d1fad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in Inner | Semi join case, the null filter could be added to joinCondition (instead of left/right relations), assuming that I'll be pushed down by subsequent optimizer rules.
which do you prefer?

@SparkQA
Copy link

SparkQA commented Nov 8, 2015

Test build #45301 has finished for PR 9451 at commit 0fa27c4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Copy link
Contributor

marmbrus commented Nov 9, 2015

Hey, thanks for working on this! I probably won't have time to look at this in depth until after the Spark 1.6 release (early december).

@vidma vidma force-pushed the feature/fight-skew-in-inner-join branch from 0fa27c4 to d05a63d Compare January 4, 2016 18:09
@vidma
Copy link
Author

vidma commented Jan 4, 2016

@marmbrus ping ;)

@SparkQA
Copy link

SparkQA commented Jan 4, 2016

Test build #48673 has finished for PR 9451 at commit d05a63d.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vidma vidma force-pushed the feature/fight-skew-in-inner-join branch from d05a63d to 1bcf9aa Compare January 4, 2016 22:41
@SparkQA
Copy link

SparkQA commented Jan 5, 2016

Test build #48694 has finished for PR 9451 at commit 1bcf9aa.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

vidmantas zemleris added 5 commits January 5, 2016 10:39
i.e. should not rewrite <=> or comparison, where null semantics are more subtle
it will be pushed down by other rules, such as PushPredicateThroughJoin
@vidma vidma force-pushed the feature/fight-skew-in-inner-join branch from 1bcf9aa to cd8ca34 Compare January 5, 2016 08:39
@SparkQA
Copy link

SparkQA commented Jan 5, 2016

Test build #48754 has finished for PR 9451 at commit cd8ca34.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vidma
Copy link
Author

vidma commented Apr 1, 2016

@marmbrus pinging you to tag Catalyst plan rewriting guru.

For current inner join PR, there's a flanky test in python, couldn't track it down yet.

For more generic case (next PR), It doesn't seem to be easy not to loose table aliases, and add a randomized spraying column as extra left join key.

P.S. I'm in SF bay until Fri 8 Apr (better before Thu 9), so I could come over to chat to you guys live.
Cheers.

@marmbrus
Copy link
Contributor

marmbrus commented Apr 1, 2016

@sameeragarwal

@davies
Copy link
Contributor

davies commented May 3, 2016

@vidma I think this is already fixed in master (having constraints for join and turn constraints into predicate, push down the predicates), do you mind to close this PR?

@asfgit asfgit closed this in 5bb62b8 May 12, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants