Conversation

@jtengyp

@jtengyp jtengyp commented May 8, 2017

In compute, group each iterator into multiple groups, reducing repeated data fetching.

What changes were proposed in this pull request?

In compute, group each iterator into multiple groups. The data behind the second (inner) iterator is then fetched (num of data) / groupSize times rather than (num of data) times.
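
For context, here is a minimal sketch of the idea. The current compute body is from CartesianRDD.scala; the grouped variant mirrors the diff under review below, with the inner loop variables i and j assumed from the review discussion rather than copied verbatim from the patch:

  // Current behavior: rdd2's partition iterator is re-created for every single
  // element of rdd1, so rdd2's data is fetched once per rdd1 element.
  override def compute(split: Partition, context: TaskContext): Iterator[(T, U)] = {
    val currSplit = split.asInstanceOf[CartesianPartition]
    for (x <- rdd1.iterator(currSplit.s1, context);
         y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
  }

  // Grouped variant (sketch): iterate over groups, so rdd2's iterator is
  // re-created once per group of rdd1 elements instead of once per element,
  // then flatten the groups back into element pairs.
  override def compute(split: Partition, context: TaskContext): Iterator[(T, U)] = {
    val currSplit = split.asInstanceOf[CartesianPartition]
    val groupSize = 500
    for (x <- rdd1.iterator(currSplit.s1, context).grouped(groupSize);
         y <- rdd2.iterator(currSplit.s2, context).grouped(groupSize);
         i <- x; j <- y) yield (i, j)
  }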

How was this patch tested?

The existing unit tests.

@srowen
Member

srowen commented May 8, 2017

Please read http://spark.apache.org/contributing.html
For example, "Update X" is never sufficient as a title.
How does this avoid fetching? How much difference does it make?

@jtengyp jtengyp changed the title Update CartesianRDD.scala Optimize the CartesianRDD May 8, 2017
Diff under review in CartesianRDD.scala:

-         y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
+    val groupSize = 500;
+    for (x <- rdd1.iterator(currSplit.s1, context).grouped(groupSize);
+         y <- rdd2.iterator(currSplit.s2, context).grouped(groupSize);
Member

One disadvantage I can think of now is the longer waiting time for the first element.

Author

This is indeed a disadvantage.

Member

Pardon, doesn't this change the type of the result? You're iterating over groupings, not elements, and emitting pairs of groups. As below, but maybe I'm missing something.

scala> val foo = List(1,2,3)
foo: List[Int] = List(1, 2, 3)

scala> val bar = List(4,5,6)
bar: List[Int] = List(4, 5, 6)

scala> for (x <- foo; y <- bar) yield (x, y)
res0: List[(Int, Int)] = List((1,4), (1,5), (1,6), (2,4), (2,5), (2,6), (3,4), (3,5), (3,6))

scala> (for (x <- foo.grouped(2); y <- bar.grouped(2)) yield (x, y)).foreach(println)
(List(1, 2),List(4, 5))
(List(1, 2),List(6))
(List(3),List(4, 5))
(List(3),List(6))

Contributor

The actual yield is on (i, j) and not (x, y) - the next line adds the iteration over the groupings :-)
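
To make that concrete, extending the REPL example above with the inner iteration (an illustrative check, not code from this PR) shows that plain element pairs come back, just in a different order:

scala> (for (x <- foo.grouped(2); y <- bar.grouped(2); i <- x; j <- y) yield (i, j)).toList
res1: List[(Int, Int)] = List((1,4), (1,5), (2,4), (2,5), (1,6), (2,6), (3,4), (3,5), (3,6))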

Contributor

I agree with @viirya - there is also an implicit assumption about size here: the batch will get deserialized into memory.
By default, we have kept the iterator model in Spark without needing to buffer (IIRC).

Contributor

@ConeyLiu ConeyLiu May 8, 2017


I'm working on this too, but my optimization may be similar to the PR @viirya opened before: caching the second iterator locally. The code is ready; I may open a PR soon. For this patch, I worry about whether we can accurately control the size of the buffer. Should we cache it with BlockManager or MemoryConsumer?

Member

Oh haha right. Hm, but isn't this better solved 'upstream' by buffering an iterator somewhere? Or just buffering the iterator right here?
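
As an illustration of the "buffer the iterator right here" idea, here is a minimal sketch (not code from this PR or from the related discussion; it assumes the whole rdd2 partition fits in memory and ignores spilling or size control):

  // Sketch: materialize rdd2's partition once, then iterate over the in-memory
  // copy, so rdd2 is fetched a single time per partition pair instead of once
  // per rdd1 element. toBuffer avoids needing a ClassTag for U.
  override def compute(split: Partition, context: TaskContext): Iterator[(T, U)] = {
    val currSplit = split.asInstanceOf[CartesianPartition]
    val buffered = rdd2.iterator(currSplit.s2, context).toBuffer
    for (x <- rdd1.iterator(currSplit.s1, context); y <- buffered) yield (x, y)
  }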

@viirya
Member

viirya commented May 8, 2017

Maybe create a JIRA and update the title per the Spark PR convention. Since this should be a performance improvement, benchmark numbers showing the difference are expected.

@jtengyp jtengyp changed the title Optimize the CartesianRDD Optimize the CartesianRDD to reduce repeatedly data fetching May 8, 2017
@jtengyp
Author

jtengyp commented May 8, 2017

Here is my test:
Environment: 3 workers, each with 10 cores, 30G memory, and 1 executor.
Test data: users: 480,189 rows, each a 10-dim vector; items: 17,770 rows, each a 10-dim vector.
With the default CartesianRDD, the cartesian time is 2420.7s.
With this proposal, the cartesian time is 45.3s, about 50x faster than the original method.

val groupSize = 1000
for (x <- rdd1.iterator(currSplit.s1, context).grouped(groupSize);
     y <- rdd2.iterator(currSplit.s2, context);
     i <- x) yield (i, y)
I found that the above change, which buffers only the first RDD, gives even better performance: rdd2's iterator is re-created once per group of rdd1 elements rather than once per rdd1 element, and each emitted pair still keeps the original element types.
With this PR the cartesian time was 41.05s, and with this change it drops to 27.35s, about 88x faster than the original method.
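
For a quick local sanity check of this variant (illustrative only, reusing foo and bar from the earlier REPL example), grouping only the first collection also yields plain element pairs, again in a different order:

scala> (for (x <- foo.grouped(2); y <- bar; i <- x) yield (i, y)).toList
res2: List[(Int, Int)] = List((1,4), (2,4), (1,5), (2,5), (1,6), (2,6), (3,4), (3,5), (3,6))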

@SparkQA

SparkQA commented May 8, 2017

Test build #3697 has finished for PR 17898 at commit d2cbcdd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jtengyp jtengyp changed the title Optimize the CartesianRDD to reduce repeatedly data fetching [SPARK-20638][Core]Optimize the CartesianRDD to reduce repeatedly data fetching May 9, 2017
@srowen
Member

srowen commented May 10, 2017

@jtengyp I think we won't proceed with this version, so this can be closed, but see the discussion at #17936

@jtengyp jtengyp closed this May 15, 2017