-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-5201][CORE] deal with int overflow in the ParallelCollectionRDD.slice method #4002
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… inclusive and the end of the range is (Int.MaxValue or Int.MinValue), we should use inclusive range instead of exclusive
|
Can one of the admins verify this patch? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm missing it but why is IntMinValue a special case here? Also the ({ on the next line is redundant. Just one is needed. sign isn't terribly descriptive here either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try to convert this inclusive range
-2 to Int.MinValue by -1
to exclusive range will be
-2L until -1L + Int.MinValue by -1
-1 + Int.MinValue will overflow.
As for sign, which name would you recommend? How about inclusiveRangeWithIntBoundary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, the range can go backwards. Yeah, something like needsInclusiveRange or exceptionalBoundary or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. Will change the name.
As for redundant ({, there is a infix operator toSeq, so I prefer the redundant one. And the previous code used ({
|
ok to test |
|
Test build #25419 has started for PR 4002 at commit
|
|
Test build #25419 has finished for PR 4002 at commit
|
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this need not be specific to Int.MaxValue and Int.MinValue. We can just do a special case for inclusive range in general and I think the code will be significantly more straightforward to read. I'll explain below.
…usive range, hence eliminate the need to check Int.MaxValue or Int.MinValue.
|
Test build #25499 has started for PR 4002 at commit
|
|
@andrewor14 would you mind to take a look again? I update some docs. There may be some wrongly chosen words as I am not a native speaker. Please correct me if you see any. |
|
Test build #25499 has finished for PR 4002 at commit
|
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary for callers to know this? It seems like purely an implementation detail that could be documented inside the function later. Same for the scaladoc later.
I would say "the sub Ranges in each slice are exclusive"
Otherwise at a glance this looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comment. I will update the docs if we decide to add them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I would rather leave this out. I'll delete this when I merge.
|
ping @andrewor14 |
|
Thanks @advancedxy I'm merging this into master and 1.2 |
…D.slice method There is an int overflow in the ParallelCollectionRDD.slice method. That's originally reported by SaintBacchus. ``` sc.makeRDD(1 to (Int.MaxValue)).count // result = 0 sc.makeRDD(1 to (Int.MaxValue - 1)).count // result = 2147483646 = Int.MaxValue - 1 sc.makeRDD(1 until (Int.MaxValue)).count // result = 2147483646 = Int.MaxValue - 1 ``` see #2874 for more details. This pr try to fix the overflow. However, There's another issue I don't address. ``` val largeRange = Int.MinValue to Int.MaxValue largeRange.length // throws java.lang.IllegalArgumentException: -2147483648 to 2147483647 by 1: seqs cannot contain more than Int.MaxValue elements. ``` So, the range we feed to sc.makeRDD cannot contain more than Int.MaxValue elements. This is the limitation of Scala. However I think we may want to support that kind of range. But the fix is beyond this pr. srowen andrewor14 would you mind take a look at this pr? Author: Ye Xianjin <[email protected]> Closes #4002 from advancedxy/SPARk-5201 and squashes the following commits: 96265a1 [Ye Xianjin] Update slice method comment and some responding docs. e143d7a [Ye Xianjin] Update inclusive range check for splitting inclusive range. b3f5577 [Ye Xianjin] We can include the last element in the last slice in general for inclusive range, hence eliminate the need to check Int.MaxValue or Int.MinValue. 7d39b9e [Ye Xianjin] Convert the two cases pattern matching to one case. 651c959 [Ye Xianjin] rename sign to needsInclusiveRange. add some comments 196f8a8 [Ye Xianjin] Add test cases for ranges end with Int.MaxValue or Int.MinValue e66e60a [Ye Xianjin] Deal with inclusive and exclusive ranges in one case. If the range is inclusive and the end of the range is (Int.MaxValue or Int.MinValue), we should use inclusive range instead of exclusive
There is an int overflow in the ParallelCollectionRDD.slice method. That's originally reported by SaintBacchus.
see #2874 for more details.
This pr try to fix the overflow. However, There's another issue I don't address.
So, the range we feed to sc.makeRDD cannot contain more than Int.MaxValue elements. This is the limitation of Scala. However I think we may want to support that kind of range. But the fix is beyond this pr.
@srowen @andrewor14 would you mind take a look at this pr?