[SPARK-11982] [SQL] improve performance of cartesian product #9969

davies · 2015-11-25T09:03:22Z

This PR improves the performance of CartesianProduct by caching the result of the right plan.

After this patch, the query time of TPC-DS Q65 goes down from 28 minutes to 4 seconds (420X faster).
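
To illustrate why caching the right side helps, here is a minimal sketch, not the code from this patch. It compares a nested loop that recomputes the right side for every left row with one that buffers the right side once and replays it. The class name `CachedCartesianSketch`, the `computeRight` helper, and the in-memory `ArrayList` buffer are all assumptions made for illustration; the review thread suggests the patch buffers rows in `UnsafeExternalSorter`, which can spill to disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class CachedCartesianSketch {
    // Counts how many times the (hypothetical) right side is computed.
    static int rightComputations = 0;

    // Stand-in for an expensive right plan: each call "rebuilds" the partition.
    static Iterator<Integer> computeRight() {
        rightComputations++;
        return Arrays.asList(10, 20, 30).iterator();
    }

    // Naive nested loop: the right side is recomputed for every left row.
    static List<int[]> naive(List<Integer> left) {
        List<int[]> out = new ArrayList<>();
        for (int l : left) {
            Iterator<Integer> right = computeRight();
            while (right.hasNext()) {
                out.add(new int[] {l, right.next()});
            }
        }
        return out;
    }

    // Cached variant: compute the right side once, buffer it, replay the buffer.
    static List<int[]> cached(List<Integer> left) {
        List<Integer> buffer = new ArrayList<>();
        computeRight().forEachRemaining(buffer::add);
        List<int[]> out = new ArrayList<>();
        for (int l : left) {
            for (int r : buffer) {
                out.add(new int[] {l, r});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> left = Arrays.asList(1, 2, 3, 4);

        rightComputations = 0;
        int naiveRows = naive(left).size();
        System.out.println("naive:  " + naiveRows + " rows, right computed " + rightComputations + " times");

        rightComputations = 0;
        int cachedRows = cached(left).size();
        System.out.println("cached: " + cachedRows + " rows, right computed " + rightComputations + " times");
    }
}
```

Both variants produce the same number of output rows; only the number of right-side computations differs, which is where the savings would come from when the right plan is expensive.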

cc @nongli

SparkQA · 2015-11-25T09:18:08Z

Test build #46683 has finished for PR 9969 at commit 162268c.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * class ChainedIterator extends UnsafeSorterIterator
 * class UnsafeCartesianRDD(rdd1 : RDD[UnsafeRow], rdd2 : RDD[UnsafeRow])

rxin · 2015-11-25T19:44:21Z

core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java

Can you document what the difference is between this iterator and the sorted iterator? Is it simply that one is sorted and the other is not?

@davies are you trying to save an in-memory sort here?

SparkQA · 2015-11-25T22:07:50Z

Test build #46702 has finished for PR 9969 at commit a94204b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * class ChainedIterator extends UnsafeSorterIterator
 * class UnsafeCartesianRDD(rdd1 : RDD[UnsafeRow], rdd2 : RDD[UnsafeRow])

cloud-fan · 2015-11-26T01:49:01Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/CartesianProduct.scala

Does the UnsafeExternalSorter preserve record order if it spills?

We may also need to update the CartesianProduct strategy to put the smaller child on the right side.

As we discussed in #7417, it's not yet clear which metric could be used as the size of a table; that could be another story.

Even if the right table is larger than the left, this approach is still much better than the current one (building a partition is usually much more expensive than loading it from memory or disk). It also fixes another problem: the right table could be nondeterministic.

@cloud-fan For the first question, yes.
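
The point about a nondeterministic right table can be illustrated with a small sketch. This is a hypothetical example, not code from the patch: `evaluateRight` stands in for a right side whose rows change on every evaluation, and the class and method names are assumptions. Recomputing per left row pairs each left row with a different right side, while buffering once pairs every left row with the same rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NondeterministicRightSketch {
    // Number of times the right side has been evaluated so far.
    static int evaluations = 0;

    // Hypothetical nondeterministic right side: each evaluation yields different rows.
    static List<Integer> evaluateRight() {
        evaluations++;
        return Arrays.asList(evaluations * 100, evaluations * 100 + 1);
    }

    // Re-evaluates the right side for every left row, so rows are inconsistent.
    static List<String> recomputePerLeftRow(List<String> left) {
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (int r : evaluateRight()) {
                out.add(l + ":" + r);
            }
        }
        return out;
    }

    // Evaluates the right side once and replays the buffered rows.
    static List<String> bufferOnce(List<String> left) {
        List<Integer> buffered = evaluateRight();
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (int r : buffered) {
                out.add(l + ":" + r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("a", "b");

        evaluations = 0;
        System.out.println(recomputePerLeftRow(left)); // prints [a:100, a:101, b:200, b:201]

        evaluations = 0;
        System.out.println(bufferOnce(left)); // prints [a:100, a:101, b:100, b:101]
    }
}
```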

SparkQA · 2015-11-26T10:33:21Z

Test build #2117 has finished for PR 9969 at commit d3edd4f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-11-26T11:31:54Z

Test build #46755 has finished for PR 9969 at commit 074f2a7.

This patch fails from timeout after a configured wait of 250m.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * class ChainedIterator extends UnsafeSorterIterator
 * class UnsafeCartesianRDD(left : RDD[UnsafeRow], right : RDD[UnsafeRow], numFieldsOfRight: Int)

nongli · 2015-11-30T18:01:12Z

core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java

This doesn't work if iterators contains empty iterators. Fix it or assert that it can't happen.

It checks that iterators is not empty.

That's not what I mean.

Suppose iterators contains an empty one, so iterators is:
(1, 2) : empty : (3, 4)

When you move to the second iterator (current is empty), you will stop and never iterate over the iterator containing (3, 4).

Oh, I see, thanks, will fix it.

For UnsafeExternalSorter, it's not possible to have an empty iterator in the middle, since they are spilled files. It's still good to be defensive about that.

Yea. I figured it would not be empty but I agree about being defensive. If the implementation of UnsafeExternalSorter changes, we don't want to debug this.
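
The defensive behavior discussed in this thread can be sketched as follows. This is a simplified illustration over `java.util.Iterator`, not the actual `ChainedIterator`, which extends `UnsafeSorterIterator` and has a different interface. The class name `ChainedIteratorSketch` and the `drain` helper are assumptions. The key point is that advancing to the next iterator happens in a loop, so an empty iterator in the middle of the chain is skipped instead of ending iteration early.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Queue;

public class ChainedIteratorSketch {
    // Iterates over several iterators in order, tolerating empty ones anywhere.
    static final class Chained<T> implements Iterator<T> {
        private final Queue<Iterator<T>> pending;
        private Iterator<T> current = Collections.emptyIterator();

        Chained(List<Iterator<T>> iterators) {
            this.pending = new ArrayDeque<>(iterators);
        }

        @Override
        public boolean hasNext() {
            // A loop rather than a single `if`: keep advancing past empty iterators.
            while (!current.hasNext() && !pending.isEmpty()) {
                current = pending.poll();
            }
            return current.hasNext();
        }

        @Override
        public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return current.next();
        }
    }

    // Collects every element of the chain into a list.
    static <T> List<T> drain(List<Iterator<T>> iterators) {
        List<T> out = new ArrayList<>();
        new Chained<>(iterators).forEachRemaining(out::add);
        return out;
    }

    public static void main(String[] args) {
        // The case from the review: (1, 2) : empty : (3, 4)
        List<Iterator<Integer>> parts = Arrays.asList(
            Arrays.asList(1, 2).iterator(),
            Collections.<Integer>emptyIterator(),
            Arrays.asList(3, 4).iterator());
        System.out.println(drain(parts)); // prints [1, 2, 3, 4]
    }
}
```

With a single `if` in place of the `while`, the same input would stop after (1, 2), which is the bug described above.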


nongli · 2015-11-30T19:34:46Z

LGTM

SparkQA · 2015-11-30T21:18:28Z

Test build #46898 has finished for PR 9969 at commit 91c7824.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * class ChainedIterator extends UnsafeSorterIterator
 * case class Count(children: Seq[Expression]) extends DeclarativeAggregate
 * class UnsafeCartesianRDD(left : RDD[UnsafeRow], right : RDD[UnsafeRow], numFieldsOfRight: Int)

SparkQA · 2015-11-30T21:29:55Z

Test build #2131 has finished for PR 9969 at commit d3edd4f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dujunling · 2016-05-05T06:06:04Z

"After this patch, the query time of TPC-DS Q65 go down to 4 seconds from 28 minutes (420X faster)."
@davies, how much data did you use?

davies · 2016-05-05T06:40:24Z

Scale factors 1 and 10 (1 GB and 10 GB).

Davies Liu added 2 commits November 24, 2015 21:43

push filter through aggregation with alias and literals

2fb7a1c

improve performance of cartesian product

162268c

Davies Liu added 4 commits November 25, 2015 10:46

address comments

0f5d7ba

fix tests

951fe7a

Merge branch 'master' of github.com:apache/spark into improve_cartesian

3a66c89

fix build

a94204b

rxin reviewed Nov 25, 2015

cloud-fan reviewed Nov 26, 2015

Davies Liu added 2 commits November 25, 2015 22:18

address comments

37b3088

add comments

074f2a7

davies force-pushed the improve_cartesian branch from a3b3957 to 074f2a7 Compare November 26, 2015 07:08

Davies Liu added 3 commits November 25, 2015 23:25

fix test

99bb8ef

Merge branch 'master' of github.com:apache/spark into improve_cartesian

d3edd4f

Merge branch 'push_filter2' into improve_cartesian

d88fa69

nongli reviewed Nov 30, 2015

Davies Liu added 2 commits November 30, 2015 11:07

defend empty iterator

fbd7dfd

Merge branch 'master' of github.com:apache/spark into improve_cartesian

91c7824

Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala

asfgit closed this in 8df584b Nov 30, 2015

ConeyLiu mentioned this pull request May 19, 2017

[SPARK-20638][Core]Optimize the CartesianRDD to reduce repeatedly data fetching #17936

Closed
