Conversation

@vanzin
Contributor

@vanzin vanzin commented Sep 4, 2015

There were a couple of places where Spark SQL would silently truncate
data if certain timestamps were provided.

In a couple of other places, the way Julian day-based timestamps are
calculated was changed slightly so that Spark writes data that is
friendlier to Hive; in particular, Hive does not handle negative values
for either the days or the nanos part well, so those are now avoided
(see the sketch below).

The values that trigger these code paths are very uncommon (very large
values at either end of the spectrum), so this shouldn't really affect
any existing applications.
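
For context, here is a rough illustration of the negative days/nanos issue (illustrative only, not the actual patch; the Julian day constant is an assumed value):

// For a pre-1970 timestamp (microseconds since the epoch), dividing the raw
// value directly yields negative day and nanos parts, which Hive rejects.
// Shifting into the Julian-day epoch before dividing keeps both non-negative.
val MicrosPerDay = 24L * 60 * 60 * 1000 * 1000
val JulianDayOfEpoch = 2440588L  // Julian day number of 1970-01-01 (assumed)

val us = -1234567890123456L                        // some timestamp before 1970
val naive = (us / MicrosPerDay, us % MicrosPerDay) // both parts negative
val shifted = us + JulianDayOfEpoch * MicrosPerDay
val fixed = (shifted / MicrosPerDay, shifted % MicrosPerDay) // both non-negative
println(s"naive=$naive fixed=$fixed")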

@SparkQA

SparkQA commented Sep 5, 2015

Test build #42019 timed out for PR 8606 at commit ffe39e4 after a configured wait of 250m.

@vanzin
Contributor Author

vanzin commented Sep 8, 2015

retest this please

Conflicts:
	sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala
@vanzin
Contributor Author

vanzin commented Sep 8, 2015

@yhuai I resolved a conflict caused by #8597. In that PR you added bounds to the timestamps generated by RandomDataGenerator, but there's no real explanation of why you chose those values.

I took those at face value, so now all timestamps are checked against those bounds, even though they are more restrictive than the ones I had previously (e.g., now you can't represent any BC timestamps). It would be nice to have a better explanation for why those bounds are there, though.

@yhuai
Contributor

yhuai commented Sep 8, 2015

@vanzin I chose those values just to make the random data generator generate valid data for tests. Feel free to change them to more reasonable values.
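
Something along these lines is all the generator needs (a sketch, not the actual RandomDataGenerator code; the bound values are whatever we settle on):

import java.sql.Timestamp
import scala.util.Random

// Generate a timestamp whose millisecond value falls in [minMillis, maxMillis).
def randomTimestamp(rng: Random, minMillis: Long, maxMillis: Long): Timestamp = {
  val span = maxMillis - minMillis
  val offset = ((rng.nextLong() % span) + span) % span
  new Timestamp(minMillis + offset)
}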

@SparkQA

SparkQA commented Sep 8, 2015

Test build #42147 has finished for PR 8606 at commit ffe39e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Sep 8, 2015

ok, thanks, I'll revert to the bounds I had before. I was assuming you picked those values because of some problem you ran into while testing that change.

Contributor

I gave it a try and I got

scala> new java.sql.Timestamp(-4708372992000L)
res3: java.sql.Timestamp = 1820-10-18 14:36:48.0

scala> java.sql.Timestamp.valueOf("0317-02-14 06:13:20.0")
res7: java.sql.Timestamp = 0317-02-14 06:13:20.0

scala> res5.getTime
res8: Long = -52159715200000

scala> new java.sql.Timestamp(-72135740800000L)
res1: java.sql.Timestamp = 0317-02-14 06:13:20.0

So, is it the right lower bound?

Contributor Author

It's the right value if you consider the math. But I've seen really weird behavior in how the Java classes print very large (positive or negative) timestamps, and I didn't find anything in the javadocs about limits, so I'm not sure what the best way to proceed is. We can choose an arbitrary minimum and maximum, but what would those be based on?

For example, java.sql.Timestamp will print the same formatted date string for both -793495812000 and -123456789012000L.
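
A quick way to reproduce (how these print depends on the JDK and the default time zone, so I won't paste the output):

// Both values are far before the epoch; compare how they format.
val a = new java.sql.Timestamp(-793495812000L)
val b = new java.sql.Timestamp(-123456789012000L)
println(a)
println(b)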

@yhuai
Contributor

yhuai commented Sep 8, 2015

also cc @davies since he is a better person to review this one.

@SparkQA

SparkQA commented Sep 8, 2015

Test build #42157 has finished for PR 8606 at commit 196ba9e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 9, 2015

Test build #42158 has finished for PR 8606 at commit 63d0c39.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Marcelo Vanzin added 2 commits September 9, 2015 13:08
Note that since Timestamp.valueOf() parses things in the host's time zone,
these limits might be a little bit off for someone living ahead of UTC. Not
sure what the best solution is here (aside from avoiding Timestamp.valueOf).
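
One possible way around Timestamp.valueOf (just a sketch, not part of this change) is to parse the literal with an explicit UTC time zone:

import java.text.SimpleDateFormat
import java.util.TimeZone

// Parse the literal as UTC instead of the host's default zone.
val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
val millisUtc = fmt.parse("1969-12-31 23:59:59").getTime
val ts = new java.sql.Timestamp(millisUtc)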
@vanzin
Contributor Author

vanzin commented Sep 9, 2015

@yhuai I had to revert to the limits you had before because of the test failures; at least that helped me provide a better comment for the limits, instead of just magic numbers. :-p

Contributor

I'd assume millisUtc is a valid timestamp, so this check is not needed.

@davies
Contributor

davies commented Sep 9, 2015

I'm not a big fan of adding bounds checking to those heavily used internal functions. If we really want to limit the range we support, we should do it at the input boundary (for data sources, or in CatalystConverter) and after calculation.

@vanzin
Contributor Author

vanzin commented Sep 9, 2015

@davies sorry, not sure I follow.

millisToDays does not seem to need the check, so I'll remove it; I didn't find any code path into it that starts from user code. But fromJavaTimestamp is called from user code, when you convert an RDD[Row] to a DataFrame, so I think it's better to fail early in that case.

The check in toJulianDay should be rarely invoked, and might even be unnecessary given the current limits.
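
To make the "fail early" idea for fromJavaTimestamp concrete, roughly something like this (illustrative only; the bound names and values below are placeholders, not the actual patch):

import java.sql.Timestamp

object TimestampBounds {
  // Placeholder bounds in microseconds since the epoch; the real values were
  // still being discussed in this PR.
  val MinMicros: Long = Long.MinValue / 2
  val MaxMicros: Long = Long.MaxValue / 2

  // Convert to a microsecond value, rejecting out-of-range input early.
  def fromJavaTimestamp(t: Timestamp): Long = {
    val micros = t.getTime * 1000L + (t.getNanos / 1000) % 1000L
    require(micros >= MinMicros && micros <= MaxMicros,
      s"Timestamp $t is outside the supported range")
    micros
  }
}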

New limits make these unnecessary. Tweak tests accordingly.
@vanzin
Contributor Author

vanzin commented Sep 9, 2015

If you think it would be better, I could move the checks to CatalystTypeConverters (DateConverter - which is missing from my patch! - and TimestampConverter).

Contributor

This is not correct; it should be:


val julian_us = us + JULIAN_DAY_OF_EPOCH * MICROSECONDS_PER_DAY
val day = julian_us / MICROSECONDS_PER_DAY
val micros = julian_us % MICROSECONDS_PER_DAY
(day.toInt, micros * 1000L)
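
For reference, here it is wrapped up so it can be tried standalone (the constant values below are what I'd expect from DateTimeUtils, but double-check them against the actual code):

object JulianDaySketch {
  // Julian day number of 1970-01-01 and microseconds per day (assumed values).
  val JULIAN_DAY_OF_EPOCH = 2440588L
  val MICROSECONDS_PER_DAY = 24L * 60 * 60 * 1000 * 1000

  // Convert microseconds since the Unix epoch to (Julian day, nanos within the day).
  def toJulianDay(us: Long): (Int, Long) = {
    val julian_us = us + JULIAN_DAY_OF_EPOCH * MICROSECONDS_PER_DAY
    val day = julian_us / MICROSECONDS_PER_DAY
    val micros = julian_us % MICROSECONDS_PER_DAY
    (day.toInt, micros * 1000L)
  }
}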

Contributor Author

Good catch; my code can return nanos > 999999999, which java.sql.Timestamp does not accept...
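
For the record, java.sql.Timestamp enforces that range explicitly, e.g.:

// setNanos rejects values outside [0, 999999999] with an IllegalArgumentException.
val ts = new java.sql.Timestamp(0L)
try {
  ts.setNanos(1000000000)
} catch {
  case e: IllegalArgumentException => println(s"rejected: ${e.getMessage}")
}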

Contributor

Created https://issues.apache.org/jira/browse/SPARK-10522 for this, so we can backport the fix to 1.5.

@vanzin vanzin closed this Sep 9, 2015
@vanzin vanzin deleted the SPARK-10439 branch September 9, 2015 23:07
@SparkQA

SparkQA commented Sep 9, 2015

Test build #42218 has finished for PR 8606 at commit edeb06e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SparkHadoopWriter(jobConf: JobConf)
    • class BlockRDD[T: ClassTag](sc: SparkContext, @transient val blockIds: Array[BlockId])
    • class ZippedWithIndexRDD[T: ClassTag](prev: RDD[T]) extends RDD[(T, Long)](prev)
    • case class Instance(w: Double, a: Vector, b: Double)
    • class DefaultSource extends RelationProvider with DataSourceRegister
    • class WeibullGenerator(
    • class IndexToString(JavaTransformer, HasInputCol, HasOutputCol):
    • class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
    • abstract class InputDStream[T: ClassTag] (ssc_ : StreamingContext)
    • abstract class ReceiverInputDStream[T: ClassTag](ssc_ : StreamingContext)

@SparkQA

SparkQA commented Sep 9, 2015

Test build #42219 has finished for PR 8606 at commit 922c09e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 10, 2015

Test build #42223 has finished for PR 8606 at commit 7a4cf9e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 10, 2015

Test build #42224 has finished for PR 8606 at commit fa8d6f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SparkHadoopWriter(jobConf: JobConf)
    • class BlockRDD[T: ClassTag](sc: SparkContext, @transient val blockIds: Array[BlockId])
    • class ZippedWithIndexRDD[T: ClassTag](prev: RDD[T]) extends RDD[(T, Long)](prev)
    • case class Instance(w: Double, a: Vector, b: Double)
    • class DefaultSource extends RelationProvider with DataSourceRegister
    • class WeibullGenerator(
    • class IndexToString(JavaTransformer, HasInputCol, HasOutputCol):
    • class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
    • abstract class InputDStream[T: ClassTag] (ssc_ : StreamingContext)
    • abstract class ReceiverInputDStream[T: ClassTag](ssc_ : StreamingContext)
