Conversation

@SaintBacchus
Contributor

If the SparkPi argument is larger than about 25000, the integer 'n' inside the code overflows and may become a negative number.
That makes the (0 until n) Seq empty, so the 'reduce' action throws UnsupportedOperationException("empty collection").

The maximum size of the input to sc.parallelize is Int.MaxValue - 1, not Int.MaxValue.

In short: if the input to SparkPi is too big, it throws UnsupportedOperationException("empty collection"), because n may be a negative int.
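
A minimal sketch of the relevant SparkPi logic, paraphrased from the example (assumes `sc` is an existing SparkContext and `args` holds the command-line arguments; variable names are approximations, not the exact source):

```scala
// Paraphrased sketch of examples/SparkPi.scala, not the exact source.
val slices = if (args.length > 0) args(0).toInt else 2
val n = 100000 * slices            // Int overflow once slices exceeds Int.MaxValue / 100000
val count = sc.parallelize(0 until n, slices).map { _ =>
  val x = math.random * 2 - 1
  val y = math.random * 2 - 1
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)                    // empty range => "empty collection" exception
println("Pi is roughly " + 4.0 * count / n)
```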
Member

How about something like math.min(100000L * slices, Int.MaxValue).toInt? Or even just cap slices with val slices = math.min(..., 21474), with a comment. I know it's a demo but maybe it just makes sense to cap the value of slices to something much lower anyway.

Contributor Author

@srowen When n is larger than Int.MaxValue, it would be fine to just set a small value for slices. But a user passing a big slices value may simply want SparkPi to run for a longer time, so I only limit the size of the Seq and do not modify the parallelism of the app.

@JoshRosen
Contributor

Could we fix this by reading in the argument as a long instead of an integer?

@srowen
Member

srowen commented Oct 21, 2014

Yes, that would also fix the overflow, or at least push it much farther away. Does it ever make sense to have tens of billions of elements in the RDD for this computation, though? Allowing quadrillions probably doesn't work anyway. I'd just cap the number of slices to something sane.

@SaintBacchus
Contributor Author

@JoshRosen The n inside `0 until n` must be an integer, so we can't read the input as a Long.

@SaintBacchus
Contributor Author

@srowen I had considered that simply requiring the input to be a small int would be the easy way.
But I assume users just want their Spark app to run, not to have to ask why the input must be a small number or what went wrong when they passed a big one.

Contributor

Why Int.MaxValue - 1? I believe (1 to Int.MaxValue) will just work fine.

Contributor

Yes, this can just be Int.MaxValue.

Contributor

Maybe also add a quick comment at the end to explain why we're doing this:

val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow

@srowen
Member

srowen commented Oct 22, 2014

@SaintBacchus sure, maybe this is used as a load test in some situations. What if n map tasks computed 100000 iterations each (or some smaller quantum), rather than 100000·n map tasks each computing 1? You might still check for negative n to be tidy, but int overflow is no longer an issue then.
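
A rough sketch of that restructuring (hypothetical, not what this PR ends up doing; it assumes `sc` and `slices` from the existing example, and the per-task constant is just an assumption):

```scala
// Hypothetical restructuring: 'slices' tasks, each doing a fixed number of
// samples, so the element count never approaches Int.MaxValue.
val samplesPerTask = 100000
val hits = sc.parallelize(0 until slices, slices).map { _ =>
  var h = 0
  var i = 0
  while (i < samplesPerTask) {
    val x = math.random * 2 - 1
    val y = math.random * 2 - 1
    if (x * x + y * y < 1) h += 1
    i += 1
  }
  h
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * hits / (samplesPerTask.toDouble * slices))
```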

Contributor

I think you need to supply a new slices number if slices is too large,
something like math.min(slices, n / 100000).

So a better solution may be:

val realSlices = math.min(slices, Int.MaxValue/100000)
val n = 100000 * realSlices

@SaintBacchus SaintBacchus deleted the SparkPi branch November 9, 2014 13:32
@srowen
Member

srowen commented Dec 10, 2014

@SaintBacchus why did you close this? It seems like it still needs a fix, and you had an improvement going here.

@SaintBacchus SaintBacchus restored the SparkPi branch December 11, 2014 03:09
@SaintBacchus
Contributor Author

Hi @srowen, I considered it a very small improvement that could easily be spotted by any Spark developer checking this code, so I closed it.
If it's necessary to improve it, I'll reopen it and make the change.

@SaintBacchus SaintBacchus reopened this Dec 11, 2014
@SparkQA

SparkQA commented Dec 11, 2014

Test build #24349 has finished for PR 2874 at commit 4cdc388.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

@SaintBacchus Any update on this PR? As @JoshRosen mentioned, I think the correct fix here is to use Long instead of Int. The current solution here probably works but adds unnecessary complexity to a supposedly very easy to understand example. This is intended to be many users' first Spark application, so I think keeping it simple is a very important consideration.

@SaintBacchus
Contributor Author

@andrewor14 I explained above why it can't use Long instead of Int. Both the Range and the partitioning only work with Int and can't be converted to Long; otherwise you get:
Exception: seqs cannot contain more than Int.MaxValue elements.
Can we restrict the input and log an error, then exit the process?

@srowen
Member

srowen commented Jan 9, 2015

How about just capping the argument to the largest slices that won't make 100000 * slices overflow, and printing a message about it? It doesn't seem like there is a goal to support this as a super-long-running load test, and for these purposes, just proceeding with the largest possible value seems more friendly than failing.
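
A hypothetical sketch of that capping approach (variable names and the message text are assumptions, not code from this PR):

```scala
// Cap the requested slices at the largest value that keeps 100000 * slices
// within Int range, and proceed instead of failing.
val requested = if (args.length > 0) args(0).toInt else 2
val maxSlices = Int.MaxValue / 100000          // 21474
val slices = math.min(requested, maxSlices)
if (slices != requested) {
  println(s"Capping slices from $requested to $maxSlices to avoid Int overflow")
}
val n = 100000 * slices
```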

@andrewor14
Contributor

I see. I think it's simplest to just exit and tell the user to provide a smaller input if the provided input would cause an overflow. For instance: https://github.com/andrewor14/spark/compare/spark-pi-validate-args. I don't feel strongly for or against the alternative that @srowen suggested, but I personally find the fail-fast solution easiest to understand.
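
For reference, a fail-fast validation along those lines might look like the following (a hypothetical sketch; the linked branch is not reproduced here, and the message text is an assumption):

```scala
// Reject inputs that would overflow 100000 * slices instead of capping them.
val slices = if (args.length > 0) args(0).toInt else 2
if (slices <= 0 || slices > Int.MaxValue / 100000) {
  System.err.println(s"slices must be between 1 and ${Int.MaxValue / 100000}")
  System.exit(1)
}
```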

@andrewor14
Contributor

Actually, on second thought I think the existing solution in this PR is probably sufficient. The issue with the validation solution is that it adds a lot of extra code to this supposedly small and isolated example. For this reason I'm also OK with not printing a warning when overflow happens, since it adds extra logic.

LGTM once you remove the -1 as I pointed out in the comments.

@SaintBacchus
Contributor Author

Sorry for forgetting to explain why it must be Int.MaxValue - 1 instead of Int.MaxValue.
Since SparkPi uses parallelize to generate the RDD, there is a limit on the input seq. Going through the implementation:

    def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
      (0 until numSlices).iterator.map(i => {
        val start = ((i * length) / numSlices).toInt
        val end = (((i + 1) * length) / numSlices).toInt
        (start, end)
      })
    }

at line 122 of ParallelCollectionRDD.scala,
the max size of the input seq is Int.MaxValue - 1; otherwise end = (((i + 1) * length) / numSlices).toInt will overflow and lead to an incorrect result.
I tested this:

  sc.makeRDD(1 to (Int.MaxValue)).count       // result = 0
  sc.makeRDD(1 to (Int.MaxValue - 1)).count   // result = 2147483646 = Int.MaxValue - 1
  sc.makeRDD(1 until (Int.MaxValue)).count    // result = 2147483646 = Int.MaxValue - 1

I added a commit to explain why we have this check,
and used `until` instead of `to` to avoid the size limit of the input seq and make the code more readable.
@SparkQA

SparkQA commented Jan 11, 2015

Test build #25376 has finished for PR 2874 at commit 62d7cd7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@advancedxy
Contributor

Hi @SaintBacchus, I get the idea that there is a limit on the seq size. But I don't think the buggy code is in the implementation you listed. The actual buggy code is the first case in the pattern matching:

    case r: Range.Inclusive => {
      val sign = if (r.step < 0) {
        -1
      } else {
        1
      }
      slice(new Range(
        r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices) // r.end + sign overflows when r.end == Int.MaxValue
    }

We should consider fixing that bug. @srowen and @andrewor14, do you think we need another PR?
@SaintBacchus, do you have time for that? I can contribute my work if you are busy.
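
A tiny REPL demonstration of why that `r.end + sign` is the problem (just an illustration of the overflow, not a fix):

```scala
// For an inclusive range ending at Int.MaxValue, adding the sign wraps around.
val r = 1 to Int.MaxValue            // Range.Inclusive, r.end = Int.MaxValue, r.step = 1
val sign = if (r.step < 0) -1 else 1
val exclusiveEnd = r.end + sign      // Int.MaxValue + 1 == Int.MinValue
println(exclusiveEnd)                // -2147483648
// new Range(1, Int.MinValue, 1) is empty, which is why the count above is 0.
```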

@srowen
Member

srowen commented Jan 11, 2015

@advancedxy I do think it's a separate issue. The case of integer overflow could be handled better in both places.

@andrewor14
Contributor

I see, @advancedxy would you mind filing a JIRA for this? If it's convenient for you to do so, feel free to also create a PR for it.

As for this PR, the latest changes LGTM. The difference between to and until is trivial enough in this particular case that I think it's OK for this to go in. The existing alternative would have failed the application anyway, so I believe this piece of code is strictly better than before. Merging into master.

@asfgit asfgit closed this in f38ef65 Jan 12, 2015
@advancedxy
Contributor

@andrewor14, of course. I will file a JIRA and send a PR when I get some spare time later today.

@SaintBacchus
Contributor Author

@advancedxy I had shown the wrong code :sparkles: Thanks for pointing it out.

asfgit pushed a commit that referenced this pull request Jan 16, 2015
…D.slice method

There is an int overflow in the ParallelCollectionRDD.slice method, originally reported by SaintBacchus.
```
sc.makeRDD(1 to (Int.MaxValue)).count       // result = 0
sc.makeRDD(1 to (Int.MaxValue - 1)).count   // result = 2147483646 = Int.MaxValue - 1
sc.makeRDD(1 until (Int.MaxValue)).count    // result = 2147483646 = Int.MaxValue - 1
```
see #2874 for more details.
This PR tries to fix the overflow. However, there's another issue I don't address.
```
val largeRange = Int.MinValue to Int.MaxValue
largeRange.length // throws java.lang.IllegalArgumentException: -2147483648 to 2147483647 by 1: seqs cannot contain more than Int.MaxValue elements.
```

So, the range we feed to sc.makeRDD cannot contain more than Int.MaxValue elements. This is a limitation of Scala. However, I think we may want to support that kind of range, but the fix is beyond this PR.

srowen andrewor14 would you mind taking a look at this PR?

Author: Ye Xianjin <[email protected]>

Closes #4002 from advancedxy/SPARk-5201 and squashes the following commits:

96265a1 [Ye Xianjin] Update slice method comment and some responding docs.
e143d7a [Ye Xianjin] Update inclusive range check for splitting inclusive range.
b3f5577 [Ye Xianjin] We can include the last element in the last slice in general for inclusive range, hence eliminate the need to check Int.MaxValue or Int.MinValue.
7d39b9e [Ye Xianjin] Convert the two cases pattern matching to one case.
651c959 [Ye Xianjin] rename sign to needsInclusiveRange. add some comments
196f8a8 [Ye Xianjin] Add test cases for ranges end with Int.MaxValue or Int.MinValue
e66e60a [Ye Xianjin] Deal with inclusive and exclusive ranges in one case. If the range is inclusive and the end of the range is (Int.MaxValue or Int.MinValue), we should use inclusive range instead of exclusive
asfgit pushed a commit that referenced this pull request Jan 16, 2015
… exception

If the SparkPi argument is larger than about 25000, the integer 'n' inside the code overflows and may become a negative number.
That makes the (0 until n) Seq empty, so the 'reduce' action throws UnsupportedOperationException("empty collection").

The maximum size of the input to sc.parallelize is Int.MaxValue - 1, not Int.MaxValue.

Author: huangzhaowei <[email protected]>

Closes #2874 from SaintBacchus/SparkPi and squashes the following commits:

62d7cd7 [huangzhaowei] Add a commit to explain the modify
4cdc388 [huangzhaowei] Update SparkPi.scala
9a2fb7b [huangzhaowei] Input of the SparkPi is too big
@SaintBacchus SaintBacchus deleted the SparkPi branch December 26, 2015 06:39