Conversation

@vanzin
Contributor

@vanzin vanzin commented Sep 4, 2015

There were a couple of places where Spark SQL would silently truncate
data if certain timestamps were provided.

In a couple of other places, the way Julian day-based timestamps are
calculated was changed slightly so that Spark writes data that is
friendlier to Hive; in particular, Hive does not handle negative values
for either the days or the nanos part well, so those are now avoided
(see the sketch below).

The values that trigger these code paths are very uncommon (very large
values at either end of the spectrum), so this shouldn't really affect
any existing applications.
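
For context, here is a rough illustration of the negative days/nanos issue (illustrative only, not the actual patch; the Julian day constant is an assumed value):

// For a pre-1970 timestamp (microseconds since the epoch), dividing the raw
// value directly yields negative day and nanos parts, which Hive rejects.
// Shifting into the Julian-day epoch before dividing keeps both non-negative.
val MicrosPerDay = 24L * 60 * 60 * 1000 * 1000
val JulianDayOfEpoch = 2440588L  // Julian day number of 1970-01-01 (assumed)

val us = -1234567890123456L                        // some timestamp before 1970
val naive = (us / MicrosPerDay, us % MicrosPerDay) // both parts negative
val shifted = us + JulianDayOfEpoch * MicrosPerDay
val fixed = (shifted / MicrosPerDay, shifted % MicrosPerDay) // both non-negative
println(s"naive=$naive fixed=$fixed")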

@SparkQA

SparkQA commented Sep 5, 2015

Test build #42019 timed out for PR 8606 at commit ffe39e4 after a configured wait of 250m.

@vanzin
Contributor Author

vanzin commented Sep 8, 2015

retest this please

Conflicts:
	sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala
@vanzin
Contributor Author

vanzin commented Sep 8, 2015

@yhuai I resolved a conflict caused by #8597. In that PR you added bounds to the timestamps generated by RandomDataGenerator, but there's no real explanation of why you chose those values.

I took those at face value, so now all timestamps are checked against those bounds, even though they are more restrictive than the ones I had previously (e.g., now you can't represent any BC timestamps). It would be nice to have a better explanation for why those bounds are there, though.

@yhuai
Contributor

yhuai commented Sep 8, 2015

@vanzin I chose those values just to make the random data generator generate valid data for tests. Feel free to change them to more reasonable values.
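
Something along these lines is all the generator needs (a sketch, not the actual RandomDataGenerator code; the bound values are whatever we settle on):

import java.sql.Timestamp
import scala.util.Random

// Generate a timestamp whose millisecond value falls in [minMillis, maxMillis).
def randomTimestamp(rng: Random, minMillis: Long, maxMillis: Long): Timestamp = {
  val span = maxMillis - minMillis
  val offset = ((rng.nextLong() % span) + span) % span
  new Timestamp(minMillis + offset)
}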

@SparkQA

SparkQA commented Sep 8, 2015

Test build #42147 has finished for PR 8606 at commit ffe39e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Sep 8, 2015

ok, thanks, I'll revert to the bounds I had before. I was assuming you picked those values because of some problem you ran into while testing that change.

Contributor

I gave it a try and I got

scala> new java.sql.Timestamp(-4708372992000L)
res3: java.sql.Timestamp = 1820-10-18 14:36:48.0

scala> java.sql.Timestamp.valueOf("0317-02-14 06:13:20.0")
res7: java.sql.Timestamp = 0317-02-14 06:13:20.0

scala> res5.getTime
res8: Long = -52159715200000

scala> new java.sql.Timestamp(-72135740800000L)
res1: java.sql.Timestamp = 0317-02-14 06:13:20.0

So, is it the right lower bound?

Contributor Author

It's the right value if you consider the math. But I've seen really weird behavior in how the Java classes print very large (positive or negative) timestamps, and I didn't find anything in the javadocs about limits, so I'm not sure what the best way to proceed is. We can choose an arbitrary minimum and maximum, but what would those be based on?

For example, java.sql.Timestamp will print the same formatted date string for both -793495812000 and -123456789012000L.
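
A quick way to reproduce (how these print depends on the JDK and the default time zone, so I won't paste the output):

// Both values are far before the epoch; compare how they format.
val a = new java.sql.Timestamp(-793495812000L)
val b = new java.sql.Timestamp(-123456789012000L)
println(a)
println(b)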

@yhuai
Contributor

yhuai commented Sep 8, 2015

also cc @davies since he is a better person to review this one.

@SparkQA

SparkQA commented Sep 8, 2015

Test build #42157 has finished for PR 8606 at commit 196ba9e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 9, 2015

Test build #42158 has finished for PR 8606 at commit 63d0c39.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Marcelo Vanzin added 2 commits September 9, 2015 13:08
Note that since Timestamp.valueOf() parses things in the host's time zone,
these limits might be a little bit off for someone living ahead of UTC. Not
sure what the best solution is here (aside from avoiding Timestamp.valueOf).
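
One possible way around Timestamp.valueOf (just a sketch, not part of this change) is to parse the literal with an explicit UTC time zone:

import java.text.SimpleDateFormat
import java.util.TimeZone

// Parse the literal as UTC instead of the host's default zone.
val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
val millisUtc = fmt.parse("1969-12-31 23:59:59").getTime
val ts = new java.sql.Timestamp(millisUtc)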
@vanzin
Contributor Author

vanzin commented Sep 9, 2015

@yhuai I had to revert to the limits you had before because of the test failures; at least that helped me provide a better comment for the limits, instead of just magic numbers. :-p

Contributor

I'd assume millisUtc is a valid timestamp, so this check is not needed.

@davies
Contributor

davies commented Sep 9, 2015

I'm not a big fan of adding bounds checking to those heavily used internal functions. If we really want to limit the range we support, we should do it at the input boundary (for data sources, or in CatalystConverter) and after calculation.

@vanzin
Contributor Author

vanzin commented Sep 9, 2015

@davies sorry, not sure I follow.

millisToDays does not seem to need the check, so I'll remove it; I didn't find any code path into it that starts from user code. But fromJavaTimestamp is called from user code, when you convert an RDD[Row] to a DataFrame, so I think it's better to fail early in that case.

The check in toJulianDay should be rarely invoked, and might even be unnecessary given the current limits.
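
To make the "fail early" idea for fromJavaTimestamp concrete, roughly something like this (illustrative only; the bound names and values below are placeholders, not the actual patch):

import java.sql.Timestamp

object TimestampBounds {
  // Placeholder bounds in microseconds since the epoch; the real values were
  // still being discussed in this PR.
  val MinMicros: Long = Long.MinValue / 2
  val MaxMicros: Long = Long.MaxValue / 2

  // Convert to a microsecond value, rejecting out-of-range input early.
  def fromJavaTimestamp(t: Timestamp): Long = {
    val micros = t.getTime * 1000L + (t.getNanos / 1000) % 1000L
    require(micros >= MinMicros && micros <= MaxMicros,
      s"Timestamp $t is outside the supported range")
    micros
  }
}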

New limits make these unnecessary. Tweak tests accordingly.
@vanzin
Contributor Author

vanzin commented Sep 9, 2015

If you think it would be better, I could move the checks to CatalystTypeConverters (DateConverter - which is missing from my patch! - and TimestampConverter).

Contributor

This is not correct; it should be:


val julian_us = us + JULIAN_DAY_OF_EPOCH * MICROSECONDS_PER_DAY
val day = julian_us / MICROSECONDS_PER_DAY
val micros = julian_us % MICROSECONDS_PER_DAY
(day.toInt, micros * 1000L)
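
For reference, here it is wrapped up so it can be tried standalone (the constant values below are what I'd expect from DateTimeUtils, but double-check them against the actual code):

object JulianDaySketch {
  // Julian day number of 1970-01-01 and microseconds per day (assumed values).
  val JULIAN_DAY_OF_EPOCH = 2440588L
  val MICROSECONDS_PER_DAY = 24L * 60 * 60 * 1000 * 1000

  // Convert microseconds since the Unix epoch to (Julian day, nanos within the day).
  def toJulianDay(us: Long): (Int, Long) = {
    val julian_us = us + JULIAN_DAY_OF_EPOCH * MICROSECONDS_PER_DAY
    val day = julian_us / MICROSECONDS_PER_DAY
    val micros = julian_us % MICROSECONDS_PER_DAY
    (day.toInt, micros * 1000L)
  }
}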

Contributor Author

Good catch; my code can return nanos > 999999999, which java.sql.Timestamp does not accept...
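
For the record, java.sql.Timestamp enforces that range explicitly, e.g.:

// setNanos rejects values outside [0, 999999999] with an IllegalArgumentException.
val ts = new java.sql.Timestamp(0L)
try {
  ts.setNanos(1000000000)
} catch {
  case e: IllegalArgumentException => println(s"rejected: ${e.getMessage}")
}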

Contributor

Created https://issues.apache.org/jira/browse/SPARK-10522 for this, so we can backport the fix to 1.5.

@vanzin vanzin closed this Sep 9, 2015
@vanzin vanzin deleted the SPARK-10439 branch September 9, 2015 23:07
@SparkQA

SparkQA commented Sep 9, 2015

Test build #42218 has finished for PR 8606 at commit edeb06e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SparkHadoopWriter(jobConf: JobConf)
    • class BlockRDD[T: ClassTag](sc: SparkContext, @transient val blockIds: Array[BlockId])
    • class ZippedWithIndexRDD[T: ClassTag](prev: RDD[T]) extends RDD[(T, Long)](prev)
    • case class Instance(w: Double, a: Vector, b: Double)
    • class DefaultSource extends RelationProvider with DataSourceRegister
    • class WeibullGenerator(
    • class IndexToString(JavaTransformer, HasInputCol, HasOutputCol):
    • class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
    • abstract class InputDStream[T: ClassTag] (ssc_ : StreamingContext)
    • abstract class ReceiverInputDStream[T: ClassTag](ssc_ : StreamingContext)

@SparkQA

SparkQA commented Sep 9, 2015

Test build #42219 has finished for PR 8606 at commit 922c09e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 10, 2015

Test build #42223 has finished for PR 8606 at commit 7a4cf9e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 10, 2015

Test build #42224 has finished for PR 8606 at commit fa8d6f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SparkHadoopWriter(jobConf: JobConf)
    • class BlockRDD[T: ClassTag](sc: SparkContext, @transient val blockIds: Array[BlockId])
    • class ZippedWithIndexRDD[T: ClassTag](prev: RDD[T]) extends RDD[(T, Long)](prev)
    • case class Instance(w: Double, a: Vector, b: Double)
    • class DefaultSource extends RelationProvider with DataSourceRegister
    • class WeibullGenerator(
    • class IndexToString(JavaTransformer, HasInputCol, HasOutputCol):
    • class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
    • abstract class InputDStream[T: ClassTag] (ssc_ : StreamingContext)
    • abstract class ReceiverInputDStream[T: ClassTag](ssc_ : StreamingContext)
