
@erikerlandson
Contributor

drop, dropRight and dropWhile methods for RDDs that return a new RDD as the result.

// example: load in some text and skip header lines
val txt = sc.textFile("data_with_header.txt")
val data = txt.drop(3)

@AmplabJenkins

Can one of the admins verify this patch?

@rxin
Contributor

rxin commented Jun 28, 2014

Thanks - I can see why this might be useful, but it is a pretty high bar now to add new APIs to the RDD interface, and we need to be very careful about APIs that might have very bad performance behaviors (dropping a large number can be very slow, in particular if it crosses many partitions).

For this reason, it might make more sense for this to be an example program, or a blog post that's easily indexable so people can find it.

@rxin
Contributor

rxin commented Jun 28, 2014

BTW it is just my personal opinion. Feel free to debate or find support :)

@erikerlandson
Contributor Author

My reasoning is that most use cases (or at least the ones I had in mind) are something like rdd.drop(n), where n is much smaller than rdd.count() — generally 1 or some other small number. FWIW, I implemented it via an implicit object, so it's not directly on the RDD class per se. Another way to look at it: these functions are no worse than rdd.take(), since they use similar logic.

However, it's true that if n is a large fraction of the size of the RDD, then it will invoke computation of a large fraction of the partitions.

@rxin
Contributor

rxin commented Jun 28, 2014

The thing is, we must scan the data twice to make sure this actually works (because we need to verify that the number of partitions we checked is sufficient). Usually a user's specific use case can be solved with a very simple workaround despite the lack of RDD.drop (e.g. for CSV files with a header that you want to drop, you can simply drop it in the first partition using a drop within mapPartitions).
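The workaround above can be sketched without a Spark runtime by modelling an RDD as a list of partitions (the partition layout and record values here are made up for illustration):

```python
from itertools import islice

# Model an RDD as a list of partitions, each a list of records.
# Illustrative data: one header line at the start of the first partition.
partitions = [["# header", "row1", "row2"], ["row3", "row4"]]

def drop_header(partitions, n=1):
    """Drop the first n records of partition 0 only, leaving every
    other partition untouched -- the mapPartitions trick described above."""
    return [list(islice(iter(p), n, None)) if i == 0 else list(p)
            for i, p in enumerate(partitions)]

print(drop_header(partitions))
```

In Scala this corresponds to `rdd.mapPartitionsWithIndex { (i, it) => if (i == 0) it.drop(n) else it }` — no second scan is needed because the caller already knows which partition holds the header.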

@erikerlandson
Contributor Author

It will scan one partition twice: the one containing the "boundary" between things dropped and not-dropped. Any partitions prior to that boundary are ignored by the resulting RDD (so they are scanned once), and any partitions after the boundary are not examined unless/until the result RDD is evaluated.
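The boundary argument can be made concrete with a small Spark-free sketch (partitions modelled as plain lists; the function name and data are illustrative, not the PR's actual implementation): only partitions up to and including the boundary are counted, and everything after it passes through unexamined.

```python
def drop(partitions, n):
    """Drop the first n records. Counts partition sizes one partition at a
    time until n records are accounted for; the partition containing the
    drop boundary is the only one whose contents must be scanned twice."""
    remaining = n
    for idx, part in enumerate(partitions):
        if remaining < len(part):
            # idx is the boundary partition: keep its tail, and pass
            # every later partition through untouched.
            return idx, [part[remaining:]] + [list(p) for p in partitions[idx + 1:]]
        remaining -= len(part)
    return len(partitions), []  # n >= total size: everything dropped

print(drop([[1, 2], [3, 4, 5], [6]], 3))
```

For drop(3) over partitions of sizes 2, 3, 1, only the first two partitions are counted; the boundary is partition 1 and partition 2 is never examined until the result is evaluated.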

@erikerlandson
Contributor Author

Tangentially, one thing I noticed is that currently all the "XxxRDDFunctions" implicits are automatically defined in SparkContext, and so I held to that pattern in this PR. However, another option might be to not automatically define it, and a user would import DropRDDFunctions for themselves if they wanted to use drop methods.

In fact, that seems like a good pattern generally for reducing unneeded imports; one might say the same thing for OrderedRDDFunctions, etc: import XxxRDDFunctions if you need it.

@erikerlandson
Contributor Author

Note, in a typical case where one is invoking something like rdd.drop(1), or other small number, only one partition gets evaluated by drop - the first one.

@erikerlandson
Contributor Author

I also envision typical use cases as being either pre- or post-processing. That is, not something that would often appear inside a tight loop.

@jayunit100

Adding the drop function to a contrib library of functions (which requires a manual import), as Erik suggests, seems like a really good option. I could see such a contrib library also being useful for other esoteric but nevertheless important tasks, like dealing with binary data formats, etc.

take RDD as input and return new RDD with elements dropped.

These methods are now implemented as lazy RDD transforms.
@erikerlandson
Contributor Author

I updated this PR so that drop(), dropRight() and dropWhile() are now lazy transforms. A description of what I did is here:
http://erikerlandson.github.io/blog/2014/07/29/deferring-spark-actions-to-lazy-transforms-with-the-promise-rdd/
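The linked post describes the actual PromiseRDD mechanics; as a rough, Spark-free analogy (the names here are mine, not from the PR), a Python generator captures the lazy flavor — no partition is touched until the result is actually consumed:

```python
def lazy_drop(partitions, n):
    """A lazy drop(n): returns a generator, so no counting happens until
    the caller starts pulling elements -- loosely analogous to deferring
    the boundary computation into a lazy RDD transform."""
    def gen():
        remaining = n
        for part in partitions:
            for record in part:
                if remaining > 0:
                    remaining -= 1  # still inside the dropped prefix
                else:
                    yield record
    return gen()

result = lazy_drop([[1, 2], [3, 4]], 3)  # nothing evaluated yet
print(list(result))                      # evaluation happens here
```

The point of making drop a transform rather than an action is exactly this deferral: the drop count only forces work when some downstream action consumes the result.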

@erikerlandson
Contributor Author

Should Jenkins run an automatic build on a PR update?

@rxin
Contributor

rxin commented Jul 30, 2014

Jenkins, test this please.

@erikerlandson
Contributor Author

O Jenkins Where Art Thou?

@erikerlandson
Contributor Author

Jenkins appears to be AWOL.

@JoshRosen
Contributor

Let me give it a try:

Jenkins, this is ok to test.

Jenkins, retest this please.

@erikerlandson
Contributor Author

Starting to worry I confused it by pushing the PR branch using '+'

@erikerlandson
Contributor Author

Should I consider creating a fresh PR, or is there some better way to get Jenkins to test?

@rxin
Contributor

rxin commented Aug 5, 2014

I'm not sure what's happening. Maybe Jenkins is lazy today. We can retry tomorrow, and if it doesn't work, create a new PR.

@erikerlandson
Contributor Author

I'm going to try closing this PR and rebooting with a fresh one.

wangyum added a commit that referenced this pull request May 26, 2023
* Update tables.scala

* Update tables.scala

* Update tables.scala

* Update TemporaryTableSuite.scala
mapr-devops pushed a commit to mapr/spark that referenced this pull request May 8, 2025