[SPARK-9066][SQL] Improve cartesian performance #7417
Conversation
Test build #37347 has finished for PR 7417 at commit
Test build #37350 has finished for PR 7417 at commit
Jenkins, retest this please.
Test build #23 has finished for PR 7417 at commit
Test build #37354 has finished for PR 7417 at commit
Quick question. Why not use sizeInBytes? I assume we want to move as little data as possible? Using sizeInBytes would be a bit more involved, since this would involve the planner, and (probably) adding a BuildSide parameter to CartesianProduct...
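To make that suggestion concrete, here is a minimal sketch of a size-based choice. BuildSide/BuildLeft/BuildRight are defined locally so the snippet stands alone; they mirror the trait Spark's broadcast joins already use, and the comparison itself is the only assumed logic:

    // Hedged sketch only: pick the smaller side as the build side based on
    // the planner's sizeInBytes estimate, analogous to broadcast-join planning.
    sealed trait BuildSide
    case object BuildLeft extends BuildSide
    case object BuildRight extends BuildSide

    def chooseBuildSide(leftSizeInBytes: BigInt, rightSizeInBytes: BigInt): BuildSide =
      if (leftSizeInBytes <= rightSizeInBytes) BuildLeft else BuildRight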
Yes, using the number of partitions here is not accurate. Consider an RDD with 100 partitions where each partition holds one record, versus an RDD with 10 partitions where each partition holds 100 million records: the method above would cause more scans from HDFS.
@hvanhovell Yes, using sizeInBytes is better, but it also has a problem: if leftResults has only one record and that record is large, while rightResults has many records whose total size is small, then this scenario will perform worse. The best way would be to check the total record count for each partition, but we cannot get that at the moment.
Test build #37467 has finished for PR 7417 at commit
Test build #37471 has finished for PR 7417 at commit
Test build #37495 has finished for PR 7417 at commit
Test build #37557 has finished for PR 7417 at commit
Test build #37566 has finished for PR 7417 at commit
Test build #37807 has finished for PR 7417 at commit
Do you have any benchmarking results for this? Would be great to see how much this improves the current situation.
How is this different from a BroadcastNestedLoopJoin?
BroadcastNestedLoopJoin is only used for outer joins, right? But this is used for cartesian products.
The inner join variant with the (degenerate) condition 1 = 1 would do the same.
All I am saying is that this is also a way to get a broadcasting cartesian join going, and it saves some lines of code.
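For reference, a sketch of that equivalence (the table names a and b are made up for the example, and an existing sqlContext with both tables registered is assumed):

    // An inner join whose condition is always true yields exactly the
    // cartesian product of the two tables.
    sqlContext.sql("SELECT * FROM a JOIN b ON 1 = 1")
    sqlContext.sql("SELECT * FROM a, b")  // same result set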
Test build #38024 has finished for PR 7417 at commit
@hvanhovell I used TPC-DS to test. For the SQL clause below, with this patch the query runs in 1h55min; without this patch, half of the tasks took 16.7h.
This code is almost the same as the code above. I would put it in a method, i.e. createCartesianProduct, and wrap the result in a Filter operator.
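A possible shape for that helper, sketched against Spark 1.5-era physical operators. The name createCartesianProduct comes from the comment above; the signature and the Option-typed condition are assumptions about how it would slot into the planner:

    import org.apache.spark.sql.catalyst.expressions.Expression
    import org.apache.spark.sql.execution.{Filter, SparkPlan}
    import org.apache.spark.sql.execution.joins.CartesianProduct

    // Build the cartesian product once; if a join condition exists, wrap the
    // result in a Filter operator instead of duplicating the planning code.
    def createCartesianProduct(
        left: SparkPlan,
        right: SparkPlan,
        condition: Option[Expression]): SparkPlan = {
      val cartesian = CartesianProduct(left, right)
      condition.map(Filter(_, cartesian)).getOrElse(cartesian)
    }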
@Sephiroth-Lin can you rebase this?
Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/CartesianProduct.scala
Test build #42133 has finished for PR 7417 at commit
I think BroadcastNestedLoopJoin can support the condition, and pushing the filter down into the operator can also reduce the memory overhead, as BroadcastNestedLoopJoin puts all of the valid tuples into a CompactBuffer.
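To illustrate the memory point with plain Scala collections (a toy, not Spark's BroadcastNestedLoopJoin code): applying the predicate while streaming means rejected pairs are never materialized, whereas filtering afterwards buffers every pair first.

    // Buffers all |l| * |r| pairs before filtering.
    def joinThenFilter[A, B](l: Seq[A], r: Seq[B], p: (A, B) => Boolean): Seq[(A, B)] =
      (for (a <- l; b <- r) yield (a, b)).filter(p.tupled)

    // Applies the predicate inside the loop, so only matching pairs are kept.
    def filterWhileJoining[A, B](l: Seq[A], r: Seq[B], p: (A, B) => Boolean): Seq[(A, B)] =
      for (a <- l; b <- r; if p(a, b)) yield (a, b)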
Test build #42193 has finished for PR 7417 at commit
BTW, can you add some unit tests like what I did in #8652?
Test build #43092 has finished for PR 7417 at commit
@zsxwing the RDDs' order does matter for the for-comprehension. Anyway, this PR LGTM.
Instead of passing a BuildSide to CartesianProduct, why not just change the parameter order according to the data size? Like:

    if (left < right) {
      CartesianProduct(left, right)
    } else {
      CartesianProduct(right, left)
    }
Actually I am a little concerned about the side switch based on the statistics, as I commented previously. And also, as @cloud-fan pointed out:

    for (x <- rdd1.iterator(currSplit.s1, context); y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)

What we actually care about is the average number of records in each partition on both sides, and I don't think we can say that the side with the bigger file size in the statistics will also have more records per partition on average (most likely the average number of records per partition will be the same).
Probably we'd better add more statistics, e.g. the partition number of the logical plan or the average file size of each partition, and in order not to cause confusion for further improvements, I think we'd better remove this optimization rule for cartesian joins. And that's why I didn't do it in #8652.
What do you think?
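A toy demonstration of the cost argument above (plain Scala, standing in for CartesianRDD's compute): the right side is re-scanned once per left record, so the left side's record count, not its byte size, determines the number of rescans.

    var rightScans = 0
    def rightIterator(): Iterator[Int] = { rightScans += 1; Iterator(1, 2, 3) }

    val leftPartition = Seq.fill(1000)(0)
    val pairs = (for (x <- leftPartition.iterator; y <- rightIterator()) yield (x, y)).toList
    // rightScans == 1000: one full pass over the right partition per left record.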
Good point! This optimization should depend on record counts, not data size.
Hi @Sephiroth-Lin, according to the previous discussion, I think we should NOT optimize according to data size. Do you mind closing this PR and helping us review #8652? It contains the part of your optimization which is still valid.
@cloud-fan OK.
See JIRA: https://issues.apache.org/jira/browse/SPARK-9066
Tested with TPC-DS; for the SQL clause below, with this patch the run takes 1h55min, while without this patch half of the tasks took 16.7h.
@scwf