[SPARK-5860][CORE] JdbcRDD: overflow on large range with high number of partitions #4701
Conversation
Can one of the admins verify this patch?
maybe this would be more robust if we just increment by a fixed width every iteration?
BigDecimal is overkill, and it feels funny to be introducing floating point. i and length are Int already, so using Longs here is sufficient. I really think this is solved by simply changing i to i.toLong and removing the superfluous .toLong at the end.
@srowen No, that won't work. The problem is that length is a Long, I think, since it's the difference of two Longs, but (i * length) can still overflow a Long. In my test case, for example:
567279357766147899L * 20 = -7101156918386593636
So we need a type that can represent larger numbers.
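A quick way to see this in a REPL (a minimal sketch using the value from the test case above):
val length = 567279357766147899L
println(length * 20)         // -7101156918386593636: the product wraps past Long.MaxValue
println(BigInt(length) * 20) // 11345587155322957980: BigInt keeps the exact value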
@rxin I actually favor the current partition algorithm; it's pretty neat in a way. For example:
lowerBound = 1
upperBound = 100
numPartitions = 8
With fixed-width increments you get
[1,13],[14,26],[27,39],[40,52],[53,65],[66,78],[79,91],[92,100]
in which you always end up with one small partition at the end.
With the current algorithm you get
[1,12],[13,25],[26,37],[38,50],[51,62],[63,75],[76,87],[88,100]
which gives more evenly distributed partitions.
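To make the comparison concrete, here is a small sketch (the variable names are just for this illustration) that reproduces the second list with the proportional scheme:
val lowerBound = 1L
val upperBound = 100L
val numPartitions = 8
val length = BigInt(1) + upperBound - lowerBound
// partition i covers [lowerBound + i*length/n, lowerBound + (i+1)*length/n - 1]
val parts = (0 until numPartitions).map { i =>
  val start = lowerBound + (i * length / numPartitions).toLong
  val end = lowerBound + ((i + 1) * length / numPartitions).toLong - 1
  (start, end)
}
println(parts.mkString(","))
// (1,12),(13,25),(26,37),(38,50),(51,62),(63,75),(76,87),(88,100)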
Ah right, I see: since the bounds are used for ID-like values, it's not unrealistic to think that length is near 2^63, and it's a Long. Hm, does this overflow anyway in the final partition? If upperBound is near 2^63, then this can compute an end that doesn't fit back into a Long even if the intermediate result does. PS: why not BigInteger?
While we're fixing it, though, I tend to think @rxin is right and that this can just map to fixed intervals rather than perform this computation.
BTW, since the range is inclusive, I feel like the last partition should subtract 1 from its end? And this should never be larger than upperBound? There might be a few reasons to revise this logic.
It does, although the difference is +/-1 for all partitions but the last. Yeah, I get your point about the final partition, but isn't it possible that its end is >= upperBound? I haven't thought it through, I admit. A version with BigInteger here is definitely an improvement, so I'd support that. I suppose I'm still concerned this fails in exactly the corner case you're trying to fix: very large length.
@srowen Sorry, I don't understand how the last partition can get >= upperBound.
With this code:
val length = 1 + upperBound - lowerBound
(0 until numPartitions).map(i => {
  val start = lowerBound + ((BigInt(i) * length) / numPartitions).toLong
  val end = lowerBound + ((BigInt(i + 1) * length) / numPartitions).toLong - 1
  new JdbcPartition(i, start, end)
})
on the last iteration, i = numPartitions - 1, so:
end = lowerBound + ((numPartitions - 1 + 1) * (1 + upperBound - lowerBound)) / numPartitions - 1
    = lowerBound + (numPartitions / numPartitions) * (1 + upperBound - lowerBound) - 1
    = lowerBound + 1 + upperBound - lowerBound - 1
    = upperBound
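And a quick sanity check of that algebra at an extreme range (a sketch; the bounds here are just an example, with length computed as a BigInt to be safe):
val lowerBound = 1L
val upperBound = Long.MaxValue
val numPartitions = 20
val length = BigInt(1) + upperBound - lowerBound
val lastEnd = lowerBound + (BigInt(numPartitions) * length / numPartitions) - 1
assert(lastEnd == BigInt(upperBound)) // the final partition ends exactly at upperBound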
Yep that's sound, good explanation. This is also why it can't overflow when toLong is called. I think this is the right fix with BigInt.
I think there is a problem here where
length = 1 + upperBound - lowerBound
can overflow as well, if lowerBound = 0 and upperBound = Long.MAX_VALUE. I should fix this; it actually makes the change smaller.
Done
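For reference, a sketch of what getPartitions might look like after this change, with length promoted to BigInt up front so that neither the length itself nor the products can overflow (based on the snippet earlier in the thread; not necessarily verbatim from the final commit):
override def getPartitions: Array[Partition] = {
  // bounds are inclusive, hence the + 1 here and the - 1 on end
  val length = BigInt(1) + upperBound - lowerBound
  (0 until numPartitions).map { i =>
    // intermediate math is BigInt; the quotient always fits back into a Long
    val start = lowerBound + ((i * length) / numPartitions)
    val end = lowerBound + (((i + 1) * length) / numPartitions) - 1
    new JdbcPartition(i, start.toLong, end.toLong)
  }.toArray
}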
Were these changes accidental?
I just want to add a new test for this bug; is this not the right place to put it?
Yes but here it looks like you just indented existing code or something. The new test below looks right.
Oh, that's because I need to create a new table, and I put that and the old table creation in one single
try {
  ...
} finally {
  conn.close()
}
LGTM except can't we keep this catch block in common in the outer try block? I realize it would mean changing some of the variable names in the second block.
There was a problem when I tried to do that. The original author uses the inner catch block to avoid re-creating the table:
catch {
  case e: SQLException if e.getSQLState == "X0Y32" =>
    // table exists
}
which means the catch has to exist for each table being created. I was simply following that pattern.
An alternative would be to drop and re-create the table each time, which produces cleaner code but may slow down the test suite a little.
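A minimal sketch of that alternative (the helper name and DDL here are hypothetical, not from the actual suite):
import java.sql.{Connection, SQLException}

// Drop the table if it exists, then create it fresh, so each run starts clean.
def recreateTable(conn: Connection, name: String, ddl: String): Unit = {
  try {
    conn.prepareStatement(s"DROP TABLE $name").executeUpdate()
  } catch {
    case _: SQLException => // table did not exist yet; nothing to drop
  }
  conn.prepareStatement(ddl).executeUpdate()
}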
Oh right I get it, never mind. LGTM. I can merge in a bit if there are no further comments.
Fix an overflow bug in JdbcRDD when calculating partitions for large BIGINT IDs