SPARK-1868: Users should be allowed to cogroup at least 4 RDDs #813
Conversation
Can one of the admins verify this patch?

Thanks for submitting this. Instead of allowing 4 (and maybe 5), users can certainly use the CoGroupedRDD constructor to build cogroups of arbitrary RDDs. If that is inconvenient, perhaps we should think about a cogroup interface that either takes varargs, or just a sequence/list of RDDs?

Yes, the user can instantiate the RDD, and yes, this is inconvenient. An interface to do this would be no less inconvenient if it has the same drawbacks (you still need to explicitly convert the resulting sequences back to their original types). Limiting the user to cogrouping 3 RDDs is much like limiting tuples to 3 elements: there may be technical reasons for the limit, but it isn't reasonable for practical purposes. You can't just say "if you need a tuple with more than 3 elements, use lists instead." For tuples the current limit is 22, which is "enough for everyone". For cogroups the limit should be lower, but certainly above 3.

Isn't it possible to just [...]? We have many functions like this, including [...].
It isn't just about lines of code, it is about pollution of the code with `asInstanceOf` casts. Compare this almost-real code using the typed `cogroup`:

```scala
val userHistories = parsedViews.cogroup(parsedBuyOrders, parsedShoppingCarts, parsedSentMails, partitioner = context.partitioner)
  .map(values => {
    val (key, events) = values
    val (groupedViews, groupedBuyOrders, groupedShoppingCarts, groupedSentMails) = events
    val sentMailsProducts = groupedSentMails.flatMap(_.products)
    val validViews = groupedViews.filter(v => !sentMailsProducts.contains(v.productId))
    key -> UserHistory(validViews, groupedBuyOrders, groupedShoppingCarts, groupedSentMails)
  })
```

With this, using `CoGroupedRDD` directly:

```scala
// Perhaps there is some mistake here, an RDD may be missing
val userHistories = new CoGroupedRDD(Seq(parsedViews, parsedBuyOrders, parsedShoppingCarts, parsedSentMails), part = context.partitioner)
  .map(values => {
    val (key, events) = values
    // Or the match is wrong here
    val Seq(_groupedViews, _groupedBuyOrders, _groupedShoppingCarts, _groupedSentMails) = events
    // Or here we are casting with the wrong type. We'll find out at runtime
    val groupedViews = _groupedViews.asInstanceOf[Seq[UHView]]
    val groupedBuyOrders = _groupedBuyOrders.asInstanceOf[Seq[UHBuyOrder]]
    val groupedShoppingCarts = _groupedShoppingCarts.asInstanceOf[Seq[UHShoppingCartLog]]
    val groupedSentMails = _groupedSentMails.asInstanceOf[Seq[UHSentMail]]
    val sentMailsProducts = groupedSentMails.flatMap(_.products)
    val validViews = groupedViews.filter(v => !sentMailsProducts.contains(v.productId))
    key -> UserHistory(validViews, groupedBuyOrders, groupedShoppingCarts, groupedSentMails)
  })
```

The second example is clearly more verbose and error-prone. Comparing with `union` isn't quite fair; a more fair comparison would be with [...]. But I agree that we should really discuss this. If such operations won't get into mainline Spark, then external libraries (using implicits) will be created to handle these cases. I think it would be better if Spark could handle them without making the user deal with boilerplate or resort to external libraries.
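As a reference point for the two snippets above, here is a minimal plain-Python model (no Spark; all names are illustrative, not pyspark API) of what a 4-way cogroup computes: one entry per key seen in any input, with one bucket of values per input.

```python
from collections import defaultdict

def cogroup4(a, b, c, d):
    """Group four lists of (key, value) pairs by key.

    Returns {key: ([a-values], [b-values], [c-values], [d-values])},
    one entry per key present in any input -- the semantics a 4-way
    cogroup provides, modeled on plain lists instead of RDDs.
    """
    buckets = defaultdict(lambda: ([], [], [], []))
    for slot, pairs in enumerate((a, b, c, d)):
        for key, value in pairs:
            buckets[key][slot].append(value)
    return dict(buckets)

# Toy data echoing the user-history example (all names hypothetical)
views = [(1, "v1"), (2, "v2")]
buys  = [(1, 9.99)]
carts = [(2, "cart")]
mails = []

grouped = cogroup4(views, buys, carts, mails)
# grouped[1] -> (["v1"], [9.99], [], [])
```

Keys missing from an input simply get an empty bucket for that slot, which is why downstream code can destructure every entry uniformly.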
To throw another wrench into the union analogy, there is also the little-used SparkContext#union, which has signatures for both Seq[RDD[T]] and varargs RDD[T].
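The dual-signature pattern mentioned above can be sketched in plain Python, where a single function can accept both calling shapes. This is an illustrative model, not Spark's actual SparkContext#union code, and the heuristic it needs also hints at why Scala uses two distinct overloads instead of one entry point.

```python
def union_all(first, *rest):
    """Concatenate inputs, accepting either:
      union_all([xs, ys, zs])  -- one list of sequences (like Seq[RDD[T]])
      union_all(xs, ys, zs)    -- varargs (like first: RDD[T], rest: RDD[T]*)
    """
    if not rest and first and all(isinstance(x, (list, tuple)) for x in first):
        seqs = first            # list-of-sequences form
    else:
        seqs = (first, *rest)   # varargs form
    return [item for seq in seqs for item in seq]
```

Note the runtime sniffing in the first branch: a dynamically typed language has to guess which shape it was given (and can guess wrong for a list of lists), whereas Scala's two statically typed signatures disambiguate at compile time.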
Hey @douglaz, thanks for the explanation. This makes a lot of sense: the issue is about compile-time type checking, because varargs drops the value type (I hadn't realized that). This will need to exist somewhere; I think it could be merged into Spark core, or it could live in user libraries. Let me ask around the committers a bit and try to get a consensus.
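To illustrate the point about varargs dropping the value type, here is a hedged plain-Python sketch (not the pyspark implementation; all names are made up). With varargs, the arity is unknown to the type system, so the result can only promise Any-typed buckets; a fixed-arity signature can keep each input's value type.

```python
def cogroup_varargs(*inputs):
    """Varargs cogroup: with N unknown statically, the best possible
    result type is dict[Any, list[list[Any]]] -- the per-input value
    types (str, float, ...) are erased from the signature."""
    keys = {k for pairs in inputs for k, _ in pairs}
    return {k: [[v for kk, v in pairs if kk == k] for pairs in inputs]
            for k in keys}

def cogroup2(a: "list[tuple[int, str]]", b: "list[tuple[int, float]]"):
    """Fixed-arity cogroup: each input keeps its own value type, so the
    result can promise a (list[str], list[float]) pair per key."""
    keys = {k for k, _ in a} | {k for k, _ in b}
    return {k: ([v for kk, v in a if kk == k],
                [v for kk, v in b if kk == k]) for k in keys}

g = cogroup_varargs([(1, "v1")], [(1, 9.99)])
# g[1] == [["v1"], [9.99]] -- only positions, not types, identify the inputs
```

This mirrors the Scala situation: the typed 3-arity `cogroup` overloads return precisely typed tuples, while anything built on a variable-length sequence of RDDs hands back untyped groups the caller must cast.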
I'd be okay adding this, but it can be a bit of a slippery slope, because people may then want it for joins, etc. as well. But maybe we can just limit it to cogroup for now. Regarding the pull request, though: we should add this API to Python as well. Can you look into what that will take?

I'll take a look at the Python interface soon.
Jenkins, test this please.

I'm having no luck running the python tests on my machine. I'll try again later.

Jenkins, test this please.

Merged build triggered.

Merged build started.

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15473/

Jenkins, retest this please.

Merged build triggered.

Merged build started.

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15655/

The tests should pass now.

Jenkins, retest this please.

Build triggered.

Build started.

Build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15811/

@douglaz if you up-merge this with master I think the tests should pass fine (currently it's not merging cleanly). I'd like to get this merged soon if possible, so let me know! Thanks.

@pwendell, merged with latest master.

Jenkins, test this please.

Merged build triggered.

Merged build started.
Will this break compatibility for users who were building against the previous API? It seems like this is a public API, so we might need to make a second version rather than replace the current one.
Okay I looked yet again, this entire file is not exposed in e.g. the docs, so I guess this isn't public.
Hey @douglaz, thanks for updating this. One thing missing here is tests in each of the languages -- please add them so that this code will be tested later.

Merged build finished. All automated tests passed.

All automated tests passed.

Jenkins, test this please.

Merged build triggered.

Merged build started.

Merged build finished. All automated tests passed.

All automated tests passed.

Looks good - thanks for this. I'm going to merge it.
Adds cogroup for 4 RDDs.

Author: Allan Douglas R. de Oliveira <[email protected]>

Closes apache#813 from douglaz/more_cogroups and squashes the following commits:

f8d6273 [Allan Douglas R. de Oliveira] Test python groupWith for one more case
0e9009c [Allan Douglas R. de Oliveira] Added scala tests
c3ffcdd [Allan Douglas R. de Oliveira] Added java tests
517a67f [Allan Douglas R. de Oliveira] Added tests for python groupWith
2f402d5 [Allan Douglas R. de Oliveira] Removed TODO
17474f4 [Allan Douglas R. de Oliveira] Use new cogroup function
7877a2a [Allan Douglas R. de Oliveira] Fixed code
ba02414 [Allan Douglas R. de Oliveira] Added varargs cogroup to pyspark
c4a8a51 [Allan Douglas R. de Oliveira] Added java cogroup 4
e94963c [Allan Douglas R. de Oliveira] Fixed spacing
f1ee57b [Allan Douglas R. de Oliveira] Fixed scala style issues
d7196f1 [Allan Douglas R. de Oliveira] Allow the cogroup of 4 RDDs