Commit 0b9718a
[SPARK-3190] Avoid overflow in VertexRDD.count()
VertexRDDs with more elements than fit in an Int (Int.MaxValue, about 2.1 billion) are counted incorrectly due to integer overflow when summing partition sizes. This PR fixes the issue by converting partition sizes to Longs before summing them.
The following code previously returned -10000000. After applying this PR, it returns the correct answer of 5000000000 (5 billion).
```scala
val pairs = sc.parallelize(0L until 500L).map(_ * 10000000)
  .flatMap(start => start until (start + 10000000)).map(x => (x, x))
VertexRDD(pairs).count()
```
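The arithmetic behind the fix can be illustrated without a cluster. The following is a minimal sketch, not the patched code: `OverflowSketch`, `partitionSizes`, `intSum`, and `longSum` are hypothetical names, and the 500 sizes of 10,000,000 are assumed only to mirror the example above. It illustrates the wraparound itself and is not meant to reproduce the exact value reported above.

```scala
object OverflowSketch {
  // Hypothetical per-partition sizes (assumption): 500 partitions of
  // 10,000,000 elements each, for a true total of 5,000,000,000.
  val partitionSizes: Seq[Int] = Seq.fill(500)(10000000)

  // Summing as Int silently wraps once the running total exceeds
  // Int.MaxValue (2147483647), so the result is not the true total.
  val intSum: Int = partitionSizes.reduce(_ + _)

  // Widening each size to Long before summing keeps the total exact.
  val longSum: Long = partitionSizes.map(_.toLong).reduce(_ + _)

  def main(args: Array[String]): Unit = {
    println(s"Int sum:  $intSum")
    println(s"Long sum: $longSum")
  }
}
```

Widening before the reduction matters: converting the Int result to a Long after summing would only widen the already-wrapped value.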
Author: Ankur Dave <[email protected]>
Closes #2106 from ankurdave/SPARK-3190 and squashes the following commits:
641f468 [Ankur Dave] Avoid overflow in VertexRDD.count()
(cherry picked from commit 96df929)
Signed-off-by: Josh Rosen <[email protected]>
Parent: 069ecfe
1 file changed in graphx/src/main/scala/org/apache/spark/graphx: 1 addition and 1 deletion (line 111).