[SPARK-5201][CORE] deal with int overflow in the ParallelCollectionRDD.slice method #4002
Changes from all commits: e66e60a, 196f8a8, 651c959, 7d39b9e, b3f5577, e143d7a, 96265a1
```diff
@@ -111,7 +111,8 @@ private object ParallelCollectionRDD {
   /**
    * Slice a collection into numSlices sub-collections. One extra thing we do here is to treat Range
    * collections specially, encoding the slices as other Ranges to minimize memory cost. This makes
-   * it efficient to run Spark over RDDs representing large sets of numbers.
+   * it efficient to run Spark over RDDs representing large sets of numbers. And if the collection
+   * is an inclusive Range, we use inclusive range for the last slice.
    */
  def slice[T: ClassTag](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    if (numSlices < 1) {
```
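To make the new doc sentence concrete, here is a minimal sketch (not part of the patch; the slice boundaries assume the slicer's usual even split) of what slicing an inclusive Range into 3 pieces should now produce, with only the last slice encoded inclusively:

```scala
// Hypothetical illustration for `1 to 10` with numSlices = 3: earlier slices
// stay exclusive, the last one stays inclusive, so the endpoint 10 is used
// directly and `end + 1` is never computed.
val slices = Seq(1 until 4, 4 until 7, 7 to 10)
assert(slices.flatten == (1 to 10)) // nothing lost at the boundary
```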
```diff
@@ -127,19 +128,15 @@ private object ParallelCollectionRDD {
       })
     }
     seq match {
-      case r: Range.Inclusive => {
-        val sign = if (r.step < 0) {
-          -1
-        } else {
-          1
-        }
-        slice(new Range(
-          r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
-      }
       case r: Range => {
-        positions(r.length, numSlices).map({
-          case (start, end) =>
+        positions(r.length, numSlices).zipWithIndex.map({ case ((start, end), index) =>
+          // If the range is inclusive, use inclusive range for the last slice
+          if (r.isInclusive && index == numSlices - 1) {
+            new Range.Inclusive(r.start + start * r.step, r.end, r.step)
+          }
+          else {
             new Range(r.start + start * r.step, r.start + end * r.step, r.step)
```
Contributor
No need to match multiple cases here if we just ignore the index for the non-inclusive case. I think it's sufficient to do

Contributor (Author)
No doubt your version is more straightforward than mine. When I wrote my code, I didn't consider splitting a normal inclusive Range into inclusive sub-Ranges. However, the benefit of my implementation is that the splitting result is the same as in master for normal inclusive ranges; I suspect there may be some Spark code that relies on the exclusive-range output. And of course, I think we should update the corresponding documentation for this kind of change. I will convert the pattern matching to a single case and update the implementation once we decide which one fits better.

Contributor
I think as long as we don't change the behavior, it's preferable to rewrite it in a readable manner. Here it's pretty clear to me that if the range is inclusive, we should include the last element in the last slice, regardless of whether the range ends in a special value like Int.MaxValue.
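For context, the overflow this patch removes can be reproduced outside Spark. The following is an assumed minimal repro (not taken from the PR discussion) of what the old `Range.Inclusive` branch did:

```scala
// The old code rewrote an inclusive Range as an exclusive one via
// `r.end + sign`; when r.end == Int.MaxValue this wraps to Int.MinValue
// and the converted range becomes empty.
val r = 1 to Int.MaxValue               // Range.Inclusive, 2^31 - 1 elements
val sign = if (r.step < 0) -1 else 1
val converted = Range(r.start, r.end + sign, r.step)
assert(r.nonEmpty && converted.isEmpty) // every element silently dropped
```

This is exactly the SPARK-5201 symptom: the RDD built from `1 to Int.MaxValue` ends up with no elements.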
```diff
+          }
         }).toSeq.asInstanceOf[Seq[Seq[T]]]
       }
       case nr: NumericRange[_] => {
```
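Pulled out of the diff, the new branch can be exercised on its own. This is a hedged, self-contained sketch: `positions` is reproduced here as an assumed even-split helper matching the one the diff calls into, and `sliceRange` is a hypothetical standalone wrapper, not the Spark method itself:

```scala
// Assumed even-split helper, shaped like the `positions` used in the diff.
def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] =
  (0 until numSlices).iterator.map { i =>
    (((i * length) / numSlices).toInt, (((i + 1) * length) / numSlices).toInt)
  }

// Hypothetical standalone version of the patched Range branch (using the
// Range.apply/Range.inclusive factories; the diff's `new Range(...)` calls
// are the era-appropriate equivalent).
def sliceRange(r: Range, numSlices: Int): Seq[Range] =
  positions(r.length, numSlices).zipWithIndex.map { case ((start, end), index) =>
    if (r.isInclusive && index == numSlices - 1) {
      // The last slice of an inclusive Range stays inclusive: r.end is used
      // as-is, so r.end == Int.MaxValue is never incremented past Int range.
      Range.inclusive(r.start + start * r.step, r.end, r.step)
    } else {
      Range(r.start + start * r.step, r.start + end * r.step, r.step)
    }
  }.toSeq

// sliceRange(1 to 10, 3)           == Seq(1 until 4, 4 until 7, 7 to 10)
// sliceRange(1 to Int.MaxValue, 2) covers all 2^31 - 1 elements, whereas the
// old `r.end + sign` rewrite overflowed and produced an empty result.
```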
Contributor
Is it necessary for callers to know this? It seems like purely an implementation detail that could be documented inside the function later. Same for the scaladoc later. I would say "the sub-Ranges in each slice are exclusive". Otherwise, at a glance this looks good.

Contributor (Author)
Thanks for your comment. I will update the docs if we decide to add them.

Contributor
I agree. I would rather leave this out. I'll delete this when I merge.
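As a footnote to the thread above, the encoding really is caller-invisible because Scala compares sequences element-wise; a hedged one-liner:

```scala
// An inclusive and an exclusive Range holding the same elements are equal
// as sequences, so callers cannot tell which encoding a slice uses.
assert((7 until 11) == (7 to 10)) // both are Seq(7, 8, 9, 10)
```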