
Conversation

@squito
Contributor

@squito squito commented Feb 7, 2015

https://issues.apache.org/jira/browse/SPARK-1061

If you partition an RDD, save it to HDFS, and then reload it in a separate SparkContext, you lose the information that the RDD was partitioned. That prevents you from getting the savings of a narrow dependency that would otherwise be possible. This is especially painful if you have a big dataset on HDFS and you periodically get small updates that need to be joined against it.

assumePartitionedBy lets you simply assign a partitioner to an RDD, so you can get your narrow dependencies back. It's up to the application to know what the partitioner should be, but Spark will at least verify that the assignment is valid.
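
A rough usage sketch (the method name comes from this PR; the paths, key type, and partitioner size are illustrative, and in practice the reload also needs a non-splittable input format so each file maps back to one partition):

```scala
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.SparkContext._

def saveJob(sc: SparkContext): Unit = {
  val big = sc.parallelize(0L until 1000000L).map(k => (k, s"value-$k"))
  // Partition once and persist; the partitioner is forgotten on reload.
  big.partitionBy(new HashPartitioner(128)).saveAsSequenceFile("hdfs:///data/big")
}

def updateJob(sc: SparkContext): Unit = {
  val partitioner = new HashPartitioner(128)
  // Reload and re-assert the same partitioner. The assignment is verified,
  // not inferred, so the application has to track which partitioner was used.
  val big = sc.sequenceFile[Long, String]("hdfs:///data/big")
    .assumePartitionedBy(partitioner)
  val updates = sc.parallelize(Seq((42L, "new-value"))).partitionBy(partitioner)
  // Both sides now share the partitioner, so the join is a narrow dependency
  // and the big reloaded dataset is not shuffled.
  big.join(updates).count()
}
```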

@SparkQA

SparkQA commented Feb 7, 2015

Test build #26991 has finished for PR 4449 at commit e041155.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito changed the title from assumePartitioned to [SPARK-1061] assumePartitioned on Feb 7, 2015
@SparkQA

SparkQA commented Feb 7, 2015

Test build #26992 has finished for PR 4449 at commit 943984f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 7, 2015

Test build #27014 has finished for PR 4449 at commit 0e98abe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2015

Test build #27016 has finished for PR 4449 at commit b828f01.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2015

Test build #27021 has finished for PR 4449 at commit ed154ce.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Jenkins, retest this please.

@JoshRosen
Contributor

(This failure was due to me changing a Jenkins setting; I've reverted the change.)

@SparkQA

SparkQA commented Feb 8, 2015

Test build #27022 has finished for PR 4449 at commit ed154ce.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2015

Test build #27023 has finished for PR 4449 at commit ea016db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 8, 2015

Test build #27026 has finished for PR 4449 at commit f6c13a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pwendell
Contributor

pwendell commented Feb 8, 2015

This seems like a slightly awkward API to expose, since to use it you basically need to write a customized InputFormat. If someone is writing customized InputFormats, why can't they just write a custom RDD as well? Is the idea that someone would write an input format that returns only a single split for each file?

@squito
Contributor Author

squito commented Feb 9, 2015

@pwendell it's a good question; I was wondering the same thing as I was writing those unit tests, and I was going to comment on the JIRA about it. It is definitely annoying to have to write a custom input format -- but I only need to do that to turn off splits. Every once in a while this comes up on the user list too -- should we just add another version of sc.hadoopFile, sc.textFile, and sc.sequenceFile that turns off splits? Unfortunately I don't think it makes sense to pass an assumedPartitioner directly as an argument to those functions, since you really need to put a map step in the middle to extract the key.
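
(For reference, "turning off splits" just means overriding isSplitable in the input format. A minimal sketch against the old mapred API; the class name is illustrative:)

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.mapred.SequenceFileInputFormat

// Every file is read as a single split, so one HDFS file maps to one partition.
class NonSplittableSequenceFileInputFormat[K, V] extends SequenceFileInputFormat[K, V] {
  override protected def isSplitable(fs: FileSystem, filename: Path): Boolean = false
}
```

You would then pass this to sc.hadoopFile, add the map step to extract the key, and assume the partitioner on the result.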

Really this gets to a more general question: when do we add these "convenience" methods to RDD? Given that this requires application logic to track the partitioner to use, I doubt it will ever be used by other code within Spark itself. But I would still make the case for its inclusion, since (a) it leads to a big optimization that is not obvious to most users, and promoting it to a function within Spark itself makes users more likely to be aware of it; (b) it's a little tricky to get right -- I think the verify step is really important to make sure this doesn't lead to completely wrong results in the user app; and (c) I think it's a common use case. Not so common that it would make it into Spark tutorials, or even into the daily use of an experienced Spark user -- but I imagine it has a place in every "batch" use of Spark, where there is some big dataset that lives on HDFS between SparkContexts.

OTOH, we could just put this in some general location with spark-examples, and leave it out of spark itself. I guess we only need to make the change to HadoopRDD to sort the partitions.

@squito
Contributor Author

squito commented Mar 16, 2015

ping

If I haven't made a convincing argument that this is a useful addition to the core API, then I'll change the PR to only add the sorting of HadoopRDD's partitions, since that is the only change to what is already there, and I can move the rest to an external package.

@IgorBerman

@squito Imran, any progress on this issue? We have the same problem with narrowing dependencies (exactly the case you describe: a big dataset that lives on disk, with small additions joined against it in a different SparkContext each time).
Do you have some examples? Maybe a blog post ;)?

@squito
Contributor Author

squito commented Jun 2, 2015

@rapen sorry, no updates ... I think this is more or less ready, but it seems there isn't much interest in getting this into core, unfortunately. It would be nice to put this elsewhere as a standalone package for Spark. I don't have time to do that at the moment -- feel free to take a stab at it if you like.

You can also just use what's here in your project. The only problem is that you need to make a copy of HadoopRDD so you can apply the modifications here (in particular, to get a consistent ordering of the partitions).
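
(Roughly, that ordering change amounts to sorting the input splits before numbering the partitions. A hypothetical sketch, not the exact diff in this PR:)

```scala
import org.apache.hadoop.mapred.{FileSplit, InputSplit}

// Sort splits by file path (and offset) so partition indexes come out the same
// every time the dataset is reloaded.
def sortSplits(splits: Array[InputSplit]): Array[InputSplit] =
  splits.sortBy {
    case fs: FileSplit => (fs.getPath.toString, fs.getStart)
    case other         => (other.toString, 0L)
  }
```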

@IgorBerman

@squito thanks! Your work is very helpful. I'm now testing a solution based on your code: I subclassed NewHadoopRDD with the same changes you made in HadoopRDD, created a new method that builds this custom RDD (more or less copy-pasted from the new-Hadoop-API methods in SparkContext), and defined a non-splittable InputFormat (subclassed from the Avro formats). That way I don't need to change HadoopRDD and recompile Spark with it; it's just a sort of extension (maybe this is what could become part of a standalone library). I'm not a pro at Scala programming, so I'm not sure about showing this code to anyone :)
Anyway, all the shuffles disappeared! 👍
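
(A minimal sketch of that kind of non-splittable Avro input format against the new Hadoop API -- the class name is illustrative and it assumes avro-mapred is on the classpath:)

```scala
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.JobContext

// Disable splitting so each Avro file becomes exactly one partition,
// preserving the layout the data was originally written with.
class NonSplittableAvroKeyInputFormat[T] extends AvroKeyInputFormat[T] {
  override protected def isSplitable(context: JobContext, filename: Path): Boolean = false
}
```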

@danielhaviv

Hi,
Is there a chance someone could share some code that could shed some light on how to use this feature?

Thank you.
Daniel

@IgorBerman

@danielhaviv , see tests in PR

@koertkuipers
Contributor

I would like to have something like this in core.


@JoshRosen
Contributor

What's the final verdict on this? Can we do the standalone package approach for now, then close this out?

@rxin
Contributor

rxin commented Dec 31, 2015

I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!

@asfgit closed this in 7b4452b on Dec 31, 2015