[SPARK-10439] [sql] Add bound checks to DateTimeUtils. #8606
Conversation
There were a couple of places where Spark SQL would silently truncate data when certain timestamps were provided. In a couple of other places, the calculation of Julian day-based timestamps was changed slightly so that Spark writes data that is friendlier to Hive; in particular, Hive does not handle negative values well for either the days or the nanos part, so those are now avoided. The values that trigger these cases are very uncommon (very large values at either end of the spectrum), so this shouldn't really affect existing applications.
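For readers following along, here is a purely illustrative sketch of the kind of bound check the description refers to; the object name, method name, and the limit values below are hypothetical, not the code added by this PR.

// Hypothetical illustration only, not the actual DateTimeUtils change.
object TimestampBoundsSketch {
  // Placeholder limits in microseconds since the Unix epoch.
  val MIN_MICROS: Long = -62135596800000000L   // roughly 0001-01-01 00:00:00 UTC
  val MAX_MICROS: Long = 253402300799999999L   // roughly 9999-12-31 23:59:59.999999 UTC

  // Fail loudly instead of silently truncating out-of-range values.
  def checkBounds(us: Long): Long = {
    require(us >= MIN_MICROS && us <= MAX_MICROS,
      s"Timestamp ($us microseconds) is outside the supported range")
    us
  }
}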
Test build #42019 timed out for PR 8606 at commit

retest this please
Conflicts: sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala
@yhuai I resolved a conflict caused by #8597. In that PR you added bounds to the timestamps generated by RandomDataGenerator. I took those at face value, so now all timestamps are checked against those bounds, even though they are more restrictive than the ones I had previously (e.g., now you can't represent any BC timestamps). It would be nice to have a better explanation for why those bounds are there, though.
@vanzin I chose those values just to make the random data generator generate valid data for tests. Feel free to change them to more reasonable values.
Test build #42147 has finished for PR 8606 at commit
ok, thanks, I'll revert to the bounds I had before. I was assuming you picked those values because of some problem you ran into while testing that change.
I gave it a try and this is what I got:
scala> new java.sql.Timestamp(-4708372992000L)
res3: java.sql.Timestamp = 1820-10-18 14:36:48.0
scala> java.sql.Timestamp.valueOf("0317-02-14 06:13:20.0")
res7: java.sql.Timestamp = 0317-02-14 06:13:20.0
scala> res5.getTime
res8: Long = -52159715200000
scala> new java.sql.Timestamp(-72135740800000L)
res1: java.sql.Timestamp = 0317-02-14 06:13:20.0
So, is it the right lower bound?
It's the right value if you consider the math. But I've seen really weird behavior in how the Java classes print very large (positive or negative) timestamps, and I didn't find anything in the javadocs about limits, so I'm not sure what's the best way to proceed. We can choose an arbitrary minimum and maximum, but what would those be based on?
For example, java.sql.Timestamp will print the same formatted date string for both -793495812000 and -123456789012000L.
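A quick REPL-style way to check that claim (output omitted here, since it varies with the JDK and the default time zone):

scala> new java.sql.Timestamp(-793495812000L).toString
scala> new java.sql.Timestamp(-123456789012000L).toString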
also cc @davies since he is a better person to review this one.
Test build #42157 has finished for PR 8606 at commit

Test build #42158 has finished for PR 8606 at commit
Note that since Timestamp.valueOf() parses things in the host's time zone, these limits might be a little bit off for someone living ahead of UTC. Not sure what the best solution is here (aside from avoiding Timestamp.valueOf).
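A small sketch of that time-zone dependence; ValueOfTimeZoneDemo and epochMillisIn are hypothetical names used only for illustration. Timestamp.valueOf interprets the string in the JVM's default time zone, while the long-based constructor takes an absolute offset from the epoch.

import java.sql.Timestamp
import java.util.TimeZone

// Hypothetical demo, not part of Spark.
object ValueOfTimeZoneDemo {
  // Parse the same string under a given default time zone, then restore it.
  def epochMillisIn(tz: String, s: String): Long = {
    val saved = TimeZone.getDefault
    try {
      TimeZone.setDefault(TimeZone.getTimeZone(tz))
      Timestamp.valueOf(s).getTime // valueOf interprets the string in the default zone
    } finally {
      TimeZone.setDefault(saved)
    }
  }

  def main(args: Array[String]): Unit = {
    // The two results differ by the zone offset, so limits derived from
    // Timestamp.valueOf on one machine can be shifted on another.
    println(epochMillisIn("UTC", "1970-01-01 00:00:00"))        // 0
    println(epochMillisIn("Asia/Tokyo", "1970-01-01 00:00:00")) // -32400000
  }
}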
@yhuai I had to revert to the limits you had before because of the test failures; at least that helped me provide a better comment for the limits, instead of just magic numbers. :-p
I'd assume millisUtc is a valid timestamp, so this check is not needed.
I'm not a big fan of adding bound checking to those heavily used internal functions. If we really want to limit the range we support, we should do it on the inbound side (for data sources or in CatalystConverter) and after calculation.
@davies sorry, not sure I follow.

The check in
New limits make these unnecessary. Tweak tests accordingly.
If you think it would be better, I could move the checks to
This is not correct, it should be:
val julian_us = us + JULIAN_DAY_OF_EPOCH * MICROSECONDS_PER_DAY
val day = julian_us / MICROSECONDS_PER_DAY
val micros = julian_us % MICROSECONDS_PER_DAY
(day.toInt, micros * 1000L)
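For reference, here is a self-contained version of that suggestion, assuming the constants carry their conventional values (Julian day 2440588 for 1970-01-01 and 86,400,000,000 microseconds per day); this is a sketch, not the code as merged.

object JulianDaySketch {
  val JULIAN_DAY_OF_EPOCH = 2440588L
  val MICROSECONDS_PER_DAY = 24L * 60 * 60 * 1000 * 1000

  // Shifting to the Julian epoch before dividing keeps both the day and the
  // nanos-of-day parts non-negative for any modern timestamp, which is what
  // Hive-friendly output needs.
  def toJulianDay(us: Long): (Int, Long) = {
    val julian_us = us + JULIAN_DAY_OF_EPOCH * MICROSECONDS_PER_DAY
    val day = julian_us / MICROSECONDS_PER_DAY
    val micros = julian_us % MICROSECONDS_PER_DAY
    (day.toInt, micros * 1000L)
  }

  def main(args: Array[String]): Unit = {
    // One second before the Unix epoch: day 2440587, 23:59:59 worth of nanos.
    println(toJulianDay(-1000000L)) // (2440587, 86399000000000)
  }
}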
Good catch; my code can return nanos > 999999999 which java.sql.Timestamp does not like...
Created https://issues.apache.org/jira/browse/SPARK-10522 for this; then we could backport the fix to 1.5.
Test build #42218 has finished for PR 8606 at commit

Test build #42219 has finished for PR 8606 at commit

Test build #42223 has finished for PR 8606 at commit

Test build #42224 has finished for PR 8606 at commit