[SPARK-1061] assumePartitioned #4449
Conversation
Test build #26991 has finished for PR 4449 at commit
Test build #26992 has finished for PR 4449 at commit
Test build #27014 has finished for PR 4449 at commit
Test build #27016 has finished for PR 4449 at commit
Test build #27021 has finished for PR 4449 at commit
Jenkins, retest this please.
(This failure was due to me changing a Jenkins setting; I've reverted the change.)
Test build #27022 has finished for PR 4449 at commit
Test build #27023 has finished for PR 4449 at commit
Test build #27026 has finished for PR 4449 at commit
This seems like a slightly awkward API to expose, since to use it you basically need to write a customized InputFormat. If someone is writing a customized InputFormat, why can't they just write a custom RDD as well? Is the idea that someone would write an input format that only returns a single split for each file?
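For context, the custom InputFormat in question only needs to turn off splitting so that each file becomes exactly one partition. A minimal sketch of that idea using the old mapred API follows; the class name is illustrative, not code from this PR:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.mapred.TextInputFormat

// Hypothetical non-splittable variant of TextInputFormat (old mapred API).
// Returning false from isSplitable means each input file becomes exactly one
// split, and therefore exactly one partition of the resulting HadoopRDD.
class NonSplittableTextInputFormat extends TextInputFormat {
  override protected def isSplitable(fs: FileSystem, file: Path): Boolean = false
}
```

Such a format would then be passed to something like sc.hadoopFile(path, classOf[NonSplittableTextInputFormat], classOf[LongWritable], classOf[Text]).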
@pwendell it's a good question; I was wondering the same thing a bit as I was writing those unit tests, and was going to comment on the jira about it. It is definitely annoying to have to write a custom input format -- but I only need to do that to turn off splits. Every once in a while this comes up on the user list too -- should we just add another version of … ? Really this gets to a more general question: when do we add these "convenience" methods to RDD? Given that this requires application logic to track the partitioner to use, I doubt this will ever be used by other code within spark itself. But I would still make the case for its inclusion, since (a) it leads to a big optimization that is not obvious to most users -- by promoting it to a function within spark itself, users are more likely to be aware of it; and (b) it's a little tricky to get right -- I think the … OTOH, we could just put this in some general location with spark-examples, and leave it out of spark itself. I guess we only need to make the change to …
ping. If I haven't made a convincing argument that this is a useful addition to the core API, then I'll change the PR to only add the sorting to HadoopRDD's partitions, as that is the only change to what is already there, and I can move the rest to an external package.
@squito Imran, any progress on this issue? We have the same problem with narrow dependencies (exactly the case you are describing: a big dataset that lives on disk, with small additions joined against it in a different SparkContext each time).
@rapen sorry, no updates ... I think this is more or less ready, but there doesn't seem to be much interest in getting it into core, unfortunately. It would be nice to put it elsewhere as a standalone package for Spark. I don't have time to do that at the moment -- feel free to take a stab at it if you like. You can also just use what's here in your project. The only problem is you need to make a copy of …
@squito thanks! Your work is very helpful. I'm testing a solution based on your code now. I've subclassed NewHadoopRDD with the same changes you made in HadoopRDD, then created a new method that builds this custom RDD (more or less a copy-paste of the newHadoopApi method from SparkContext), plus a NonSplittable InputFormat subclassed from the Avro formats. That way I don't need to change HadoopRDD and recompile Spark; it's just a sort of extension (maybe that's what could become part of a standalone library). I'm not a pro at Scala programming, so I'm not sure about showing this code to anyone :)
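A rough sketch of the approach described above -- subclassing an Avro input format from the new mapreduce API to turn off splitting -- assuming the avro-mapred artifact is on the classpath; the class name is made up and the exact base class may differ from what was actually used:

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.JobContext

// Hypothetical non-splittable Avro input format (new mapreduce API). With
// splitting disabled, each Avro file is read back as a single partition, so the
// reloaded RDD's file-to-partition mapping matches how the data was written.
class NonSplittableAvroKeyInputFormat extends AvroKeyInputFormat[GenericRecord] {
  override protected def isSplitable(context: JobContext, file: Path): Boolean = false
}
```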
Hi, Thank you.
@danielhaviv, see the tests in the PR.
I would like to have something like this in core.
What's the final verdict on this? Can we do the standalone package approach for now, then close this out?
I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!
https://issues.apache.org/jira/browse/SPARK-1061
If you partition an RDD, save it to HDFS, then reload it in a separate SparkContext, you've lost the information that the RDD was partitioned, so you can't get the savings of a narrow dependency that would otherwise be available. This is especially painful if you've got a big dataset on HDFS and you periodically get small updates that need to be joined against it.
assumePartitionedBy lets you simply assign a partitioner to an RDD, so you can get your narrow dependencies back. It's up to the application to know what the partitioner should be, but the method will at least verify that the assignment is OK.
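A hedged usage sketch of the idea: the assumePartitionedBy call follows the PR description rather than any released Spark API, and NonSplittableTextInputFormat refers to the illustrative format sketched earlier in this thread.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{HashPartitioner, SparkContext}

// Sketch only. Assumed names: NonSplittableTextInputFormat (illustrative, above)
// and assumePartitionedBy (the API proposed in this PR).
val sc = new SparkContext("local[4]", "assume-partitioned-demo")
val partitioner = new HashPartitioner(8)

// Job 1: key the big dataset, partition it, and save one output file per partition.
val big = sc.parallelize(1 to 100000).map(i => (i % 1000, i))
big.partitionBy(partitioner)
  .map { case (k, v) => s"$k\t$v" }
  .saveAsTextFile("hdfs:///tmp/big-partitioned")

// Job 2 (in practice a later, separate SparkContext): reload without splitting files,
// re-derive the keys, and assert the existing partitioning instead of shuffling again.
val reloaded = sc
  .hadoopFile("hdfs:///tmp/big-partitioned",
    classOf[NonSplittableTextInputFormat], classOf[LongWritable], classOf[Text])
  .map { case (_, line) =>
    val Array(k, v) = line.toString.split("\t")
    (k.toInt, v.toInt)
  }
val assumed = reloaded.assumePartitionedBy(partitioner) // proposed API: verify, no shuffle

// Joining against a small update RDD with the same partitioner is now a narrow dependency.
val updates = sc.parallelize(Seq((1, -1), (2, -2))).partitionBy(partitioner)
val joined = assumed.join(updates)
```

Note that this relies on the part files being read back in the same order the partitions were written, which is exactly what the HadoopRDD split-sorting change discussed in this PR is meant to guarantee.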