
@erikerlandson
Contributor

drop, dropRight and dropWhile methods for RDDs that return a new RDD as the result.

// example: load in some text and skip header lines
val txt = sc.textFile("data_with_header.txt")
val data = txt.drop(3)

@AmplabJenkins

Can one of the admins verify this patch?

@rxin
Contributor

rxin commented Jun 28, 2014

Thanks - I can see why this might be useful, but it is a pretty high bar now to add new APIs to the RDD interface, and we need to be very careful about APIs that might have very bad performance behaviors (dropping a large number can be very slow, in particular if it crosses many partitions).

For this reason, it might make more sense for this to be an example program, or a blog post that's easily indexable so people can find it.

@rxin
Contributor

rxin commented Jun 28, 2014

BTW it is just my personal opinion. Feel free to debate or find support :)

@erikerlandson
Contributor Author

My reasoning is that most use cases (or at least the ones I had in mind) are something like rdd.drop(n), where n is much smaller than rdd.count() — generally 1 or some other small number. FWIW, I implemented it via an implicit object, so it's not directly on the RDD class per se. Another way to look at it: these functions are no worse than rdd.take(), since they use similar logic.

However, it's true that if n is a large fraction of the size of the RDD, then it will invoke computation of a large fraction of the partitions.

@rxin
Contributor

rxin commented Jun 28, 2014

The thing is, we must scan the data twice to make sure this actually works (because we need to verify that the number of partitions we checked is sufficient). Usually a user's specific use case can be solved with a very simple workaround despite the lack of RDD.drop (e.g. for CSV files with a header that you want to drop, you can simply drop it in the first partition using a drop within mapPartitions).
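The workaround above can be sketched without a Spark runtime by modelling an RDD as a list of partitions (the partition layout and record values here are made up for illustration):

```python
from itertools import islice

# Model an RDD as a list of partitions, each a list of records.
# Illustrative data: one header line at the start of the first partition.
partitions = [["# header", "row1", "row2"], ["row3", "row4"]]

def drop_header(partitions, n=1):
    """Drop the first n records of partition 0 only, leaving every
    other partition untouched -- the mapPartitions trick described above."""
    return [list(islice(iter(p), n, None)) if i == 0 else list(p)
            for i, p in enumerate(partitions)]

print(drop_header(partitions))
```

In Scala this corresponds to `rdd.mapPartitionsWithIndex { (i, it) => if (i == 0) it.drop(n) else it }` — no second scan is needed because the caller already knows which partition holds the header.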

@erikerlandson
Contributor Author

It will scan one partition twice: the one containing the "boundary" between things dropped and not-dropped. Any partitions prior to that boundary are ignored by the resulting RDD (so they are scanned once), and any partitions after the boundary are not examined unless/until the result RDD is evaluated.
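The boundary argument can be made concrete with a small Spark-free sketch (partitions modelled as plain lists; the function name and data are illustrative, not the PR's actual implementation): only partitions up to and including the boundary are counted, and everything after it passes through unexamined.

```python
def drop(partitions, n):
    """Drop the first n records. Counts partition sizes one partition at a
    time until n records are accounted for; the partition containing the
    drop boundary is the only one whose contents must be scanned twice."""
    remaining = n
    for idx, part in enumerate(partitions):
        if remaining < len(part):
            # idx is the boundary partition: keep its tail, and pass
            # every later partition through untouched.
            return idx, [part[remaining:]] + [list(p) for p in partitions[idx + 1:]]
        remaining -= len(part)
    return len(partitions), []  # n >= total size: everything dropped

print(drop([[1, 2], [3, 4, 5], [6]], 3))
```

For drop(3) over partitions of sizes 2, 3, 1, only the first two partitions are counted; the boundary is partition 1 and partition 2 is never examined until the result is evaluated.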

@erikerlandson
Contributor Author

Tangentially, one thing I noticed is that currently all the "XxxRDDFunctions" implicits are automatically defined in SparkContext, and so I held to that pattern in this PR. However, another option might be to not automatically define it, and a user would import DropRDDFunctions for themselves if they wanted to use drop methods.

In fact, that seems like a good pattern generally for reducing unneeded imports; one might say the same thing for OrderedRDDFunctions, etc: import XxxRDDFunctions if you need it.

@erikerlandson
Contributor Author

Note, in a typical case where one is invoking something like rdd.drop(1), or other small number, only one partition gets evaluated by drop - the first one.

@erikerlandson
Contributor Author

I also envision typical use cases as being either pre- or post-processing. That is, not something that would often appear inside a tight loop.

@jayunit100

Adding the drop function to a contrib library of functions (which requires a manual import), as Erik suggests, seems like a really good option. I could see such a contrib library also being useful for other esoteric but nevertheless important tasks, like dealing with binary data formats, etc.

take RDD as input and return new RDD with elements dropped.

These methods are now implemented as lazy RDD transforms.
@erikerlandson
Contributor Author

I updated this PR so that drop(), dropRight() and dropWhile() are now lazy transforms. A description of what I did is here:
http://erikerlandson.github.io/blog/2014/07/29/deferring-spark-actions-to-lazy-transforms-with-the-promise-rdd/
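The linked post describes the actual PromiseRDD mechanics; as a rough, Spark-free analogy (the names here are mine, not from the PR), a Python generator captures the lazy flavor — no partition is touched until the result is actually consumed:

```python
def lazy_drop(partitions, n):
    """A lazy drop(n): returns a generator, so no counting happens until
    the caller starts pulling elements -- loosely analogous to deferring
    the boundary computation into a lazy RDD transform."""
    def gen():
        remaining = n
        for part in partitions:
            for record in part:
                if remaining > 0:
                    remaining -= 1  # still inside the dropped prefix
                else:
                    yield record
    return gen()

result = lazy_drop([[1, 2], [3, 4]], 3)  # nothing evaluated yet
print(list(result))                      # evaluation happens here
```

The point of making drop a transform rather than an action is exactly this deferral: the drop count only forces work when some downstream action consumes the result.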

@erikerlandson
Contributor Author

Should Jenkins run an automatic build on a PR update?

@rxin
Contributor

rxin commented Jul 30, 2014

Jenkins, test this please.

@erikerlandson
Contributor Author

O Jenkins Where Art Thou?

@erikerlandson
Contributor Author

Jenkins appears to be AWOL.

@JoshRosen
Contributor

Let me give it a try:

Jenkins, this is ok to test.

Jenkins, retest this please.

@erikerlandson
Contributor Author

Starting to worry I confused it by pushing the PR branch using '+'

@erikerlandson
Contributor Author

Should I consider creating a fresh PR, or is there some better way to get Jenkins to test?

@rxin
Contributor

rxin commented Aug 5, 2014

I'm not sure what's happening. Maybe Jenkins is lazy today. We can retry tomorrow, and if it doesn't work, create a new PR.

@erikerlandson
Contributor Author

I'm going to try closing this PR and rebooting with a fresh one.

wangyum added a commit that referenced this pull request May 26, 2023
* Update tables.scala

* Update tables.scala

* Update tables.scala

* Update TemporaryTableSuite.scala
mapr-devops pushed a commit to mapr/spark that referenced this pull request May 8, 2025