@HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Aug 24, 2016

What changes were proposed in this pull request?

This PR adds support for returning TimestampType from the date_add, date_sub, trunc, last_day, next_day and add_months functions.

The output type of each function follows the input data type (e.g. input: TimestampType, output: TimestampType).

How was this patch tested?

Unit test in DateExpressionsSuite.

@HyukjinKwon
Member Author

Hi @cloud-fan, could you check whether this is sensible? I took a quick scan and it seems there are similar instances, as below:

DateAdd
DateSub
LastDay
NextDay
AddMonths
TruncDate
DateDiff
WeekOfYear
DayOfMonth
Month
Quarter
Year
DayOfYear

I can do this here or as a follow-up if you think this approach looks okay.

@HyukjinKwon HyukjinKwon changed the title [SPARK-17174][SQL] Add type support for TimestampType for add_months [SPARK-17174][SQL] Add the support for TimestampType for add_months as output type Aug 24, 2016

- override def nullSafeEval(start: Any, months: Any): Any = {
-   DateTimeUtils.dateAddMonths(start.asInstanceOf[Int], months.asInstanceOf[Int])
+ override def nullSafeEval(start: Any, months: Any): Any = startDate.dataType match {
Contributor

You could also add the parameters start and months to the pattern match. That saves some casting.

Member Author

Ah, thanks!
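
As a rough illustration of the suggestion above, the sketch below matches on the data type together with `start` and `months`, so the typed patterns do the casting. It is not the code from this PR: `DataType`, `dateAddMonths` and `timestampAddMonths` are simplified stand-ins built on `java.time` (UTC assumed), and the month arithmetic follows `LocalDate.plusMonths`, which may differ from `DateTimeUtils` in edge cases.

```scala
import java.time.LocalDate

object AddMonthsSketch {
  // Hypothetical stand-ins for Spark's DataType hierarchy.
  sealed trait DataType
  case object DateType extends DataType
  case object TimestampType extends DataType

  val MicrosPerDay: Long = 24L * 60 * 60 * 1000 * 1000

  // Days since epoch in, days since epoch out.
  def dateAddMonths(days: Int, months: Int): Int =
    LocalDate.ofEpochDay(days.toLong).plusMonths(months.toLong).toEpochDay.toInt

  // Microseconds since epoch (UTC assumed); the time of day is preserved.
  def timestampAddMonths(micros: Long, months: Int): Long = {
    val days = java.lang.Math.floorDiv(micros, MicrosPerDay)
    val rest = java.lang.Math.floorMod(micros, MicrosPerDay)
    dateAddMonths(days.toInt, months).toLong * MicrosPerDay + rest
  }

  // The typed patterns replace the explicit asInstanceOf casts.
  def nullSafeEval(dataType: DataType, start: Any, months: Any): Any =
    (dataType, start, months) match {
      case (DateType, d: Int, m: Int) => dateAddMonths(d, m)
      case (TimestampType, t: Long, m: Int) => timestampAddMonths(t, m)
      case other => throw new IllegalArgumentException(s"Unsupported input: $other")
    }
}
```

With this shape, an unexpected combination of type and value fails with a descriptive error instead of a `ClassCastException` from a manual cast.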

@hvanhovell
Contributor

hvanhovell commented Aug 24, 2016

@HyukjinKwon this looks pretty good.

I think we should also do this for DateAdd, DateSub, LastDay, NextDay & Truncate (the naming is a bit unfortunate for the first two though). We might need to add a few options to Truncate (day, minute, second, ...).

The other functions all have non-date return types. I think we can leave them alone.

@HyukjinKwon
Member Author

Oh, yes. right. I will address your comments tomorrow. Thank you @hvanhovell

usage = "_FUNC_(start_date, num_months) - Returns the date/timestamp that is num_months after start_date.",
extended = "> SELECT _FUNC_('2016-08-31', 1);\n '2016-09-30'")
// scalastyle:on line.size.limit
case class AddMonths(startDate: Expression, numMonths: Expression)
Contributor

as it's not always date type now, should we use a new name instead of startDate?

Contributor

instant?

@SparkQA

SparkQA commented Aug 24, 2016

Test build #64347 has finished for PR 14788 at commit 777300e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-17174][SQL] Add the support for TimestampType for add_months as output type [SPARK-17174][SQL][WIP] Add the support for TimestampType for add_months as output type Aug 24, 2016
@HyukjinKwon
Member Author

Sorry, this is taking a bit longer than I thought. I will finish it up within this week.

@hvanhovell
Contributor

NP :)

@HyukjinKwon HyukjinKwon changed the title [SPARK-17174][SQL][WIP] Add the support for TimestampType for add_months as output type [SPARK-17174][SQL] Add the support for TimestampType for add_months as output type Aug 27, 2016
case "DAY" | "DD" => TRUNC_TO_DAY
case "HOUR" | "HH" => TRUNC_TO_HOUR
case "MI" => TRUNC_TO_MINUTE
case "SEC" | "SS" => TRUNC_TO_SECOND
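
For readers unfamiliar with how such truncation levels can be applied, here is a minimal sketch that maps the format aliases above to a unit size and truncates a microsecond timestamp by subtracting the remainder. The object and method names are hypothetical, the timestamp is assumed to be in UTC, and only the day-or-smaller levels shown in the diff are covered.

```scala
object TruncSketch {
  val MicrosPerSecond: Long = 1000L * 1000
  val MicrosPerMinute: Long = 60 * MicrosPerSecond
  val MicrosPerHour: Long = 60 * MicrosPerMinute
  val MicrosPerDay: Long = 24 * MicrosPerHour

  // Maps a format alias to the size of its unit in microseconds.
  def unitMicros(format: String): Option[Long] = format.toUpperCase match {
    case "DAY" | "DD" => Some(MicrosPerDay)
    case "HOUR" | "HH" => Some(MicrosPerHour)
    case "MI" => Some(MicrosPerMinute)
    case "SEC" | "SS" => Some(MicrosPerSecond)
    case _ => None
  }

  // floorMod keeps the result correct for timestamps before the epoch.
  def truncTimestamp(micros: Long, format: String): Option[Long] =
    unitMicros(format).map(size => micros - java.lang.Math.floorMod(micros, size))
}
```

Returning `None` for an unknown alias is one possible choice here; the expression in the PR may handle invalid formats differently.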
Member Author

* Add timestamp and days interval.
* Returns a timestamp value, expressed in microseconds since 1.1.1970 00:00:00.
*/
def timestampAddDays(start: SQLTimestamp, days: Int): SQLTimestamp = {
Contributor

Is it just start + days * MICROS_PER_DAY? timestampAddInterval has some complex logic to handle months, which is unnecessary here.
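
A minimal sketch of the simpler form suggested in this comment, assuming `SQLTimestamp` is a `Long` of microseconds since the epoch and that every day is exactly 24 hours long (so daylight-saving transitions are ignored). The constant is redefined locally to keep the example self-contained.

```scala
object TimestampAddDaysSketch {
  // Assumed alias: microseconds since 1970-01-01 00:00:00.
  type SQLTimestamp = Long

  val MicrosPerDay: Long = 24L * 60 * 60 * 1000 * 1000

  // Plain arithmetic; a negative `days` subtracts.
  def timestampAddDays(start: SQLTimestamp, days: Int): SQLTimestamp =
    start + days * MicrosPerDay
}
```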

@SparkQA

SparkQA commented Aug 27, 2016

Test build #64524 has finished for PR 14788 at commit eee40b8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

HyukjinKwon commented Aug 27, 2016

I ran some SQL queries in MySQL and PostgreSQL and took a look at the Oracle and IBM Informix documentation.

They are not fully consistent for NEXT_DAY, LAST_DAY and TRUNC, but it seems that NEXT_DAY and LAST_DAY return DateType, while TRUNC either supports both types or TimestampType only.

In more detail, I skimmed through the function list here for MySQL http://www.tutorialspoint.com/mysql/mysql-date-time-functions.htm and for PostgreSQL https://www.postgresql.org/docs/9.1/static/functions-datetime.html, and then tried some equivalent queries, as below:

Please let me know if you think we need more references such as Hive.

@HyukjinKwon
Member Author

HyukjinKwon commented Aug 27, 2016

It seems it is up to how we define the behaviour. I will follow your decision @cloud-fan.

My personal opinion is to support both types for TRUNC (or TimestampType only) and DateType only for LAST_DAY and NEXT_DAY.

@HyukjinKwon HyukjinKwon changed the title [SPARK-17174][SQL] Add the support for TimestampType for add_months as output type [SPARK-17174][SQL] Add the support for TimestampType for some as output type Aug 27, 2016
@HyukjinKwon HyukjinKwon changed the title [SPARK-17174][SQL] Add the support for TimestampType for some as output type [SPARK-17174][SQL] Add the support for TimestampType for some functions as output type Aug 27, 2016
@cloud-fan
Contributor

For date_add, date_sub, add_months, I think we should support both DateType and TimestampType, and the return type should depend on the input type.

For last_day, first_day, we should support both DateType and TimestampType, but the return type should always be DateType

For date_trunc, we should support both DateType and TimestampType, but the return type should always be TimestampType

cc @rxin
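
The rules proposed above can be summarised as a small lookup. This is only an illustrative sketch: the `DataType` objects are hypothetical stand-ins, the functions are keyed by name for brevity, and `next_day` is used on the assumption that it is what `first_day` refers to.

```scala
object ReturnTypeSketch {
  // Hypothetical stand-ins for Spark's DataType hierarchy.
  sealed trait DataType
  case object DateType extends DataType
  case object TimestampType extends DataType

  def resultType(function: String, input: DataType): DataType = function match {
    // Return type follows the input type.
    case "date_add" | "date_sub" | "add_months" => input
    // Always a date, whatever the input (next_day is an assumption, see above).
    case "last_day" | "next_day" => DateType
    // Always a timestamp, whatever the input.
    case "date_trunc" => TimestampType
    case other => throw new IllegalArgumentException(s"Unknown function: $other")
  }
}
```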

@HyukjinKwon
Member Author

@cloud-fan @hvanhovell Would there be other things I should double check and take care of?

@hvanhovell
Contributor

retest this please

@hvanhovell
Contributor

LGTM - I'll merge as soon as tests complete successfully

@rxin
Contributor

rxin commented Oct 10, 2016

Actually, can we avoid renaming these expressions? I don't see the point of renaming DateSub to SubDays. It just makes it more annoying to link the user-facing API with the internal expressions.

* @since 1.5.0
*/
- def date_add(start: Column, days: Int): Column = withExpr { DateAdd(start.expr, Literal(days)) }
+ def date_add(start: Column, days: Int): Column = withExpr { AddDays(start.expr, Literal(days)) }
Contributor

why change the name of these expressions?

Member Author

I am willing to revert it back if there is no specific reason. This is about #14788 (comment) - @hvanhovell

Contributor

@cloud-fan cloud-fan Oct 11, 2016

It's confusing to users that date_add adds days to the given date and add_months adds months to the given date. I think add_days and add_months are more consistent.

Other databases (e.g. MySQL, Postgres) only have date_add, which adds an interval to the given date, so they don't need separate add_days and add_months functions.

The function name is already released and may be hard to change, but changing the expression name to match the real logic seems good.

@rxin any thoughts?

Contributor

I don't get what you were suggesting here. Wouldn't it make more sense to make DateAdd expression support both adding Interval type and adding IntegralType (for days)?

Contributor

then we have date_add to add days or interval to the given date, and add_months to add months to the given date, seems a little weird...


/**
- * Returns date truncated to the unit specified by the format.
+ * Returns timestamp truncated to the unit specified by the format.
Contributor

@rxin rxin Oct 10, 2016

doesn't this actually change the data type returned and can break code?

Member Author

Yeap. I initially wanted to match the output type to the input type (DateType input gives DateType output, TimestampType input gives TimestampType output), but I decided to follow the suggestion because it seems to depend on how we define the behaviour, as the implementation differs in each DBMS.

I would like to hear the thoughts on #14788 (comment) in more detail - cc @cloud-fan

Contributor

Reynold has a very valid point. My bad for not thinking this is a problem. This will break as soon as you call a java UDF or call df.rdd.map(...). I think we need to have both a truncate date and a timestamp expression.
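
To make the breakage concrete, here is a hedged sketch of the kind of user code that would fail. It does not use Spark at all: `yearOf` is a hypothetical function standing in for a UDF or an `rdd.map` closure that was written when the truncated value was always a `java.sql.Date`.

```scala
import java.sql.{Date, Timestamp}
import scala.util.Try

object BreakingChangeSketch {
  // Hypothetical user code that assumes the column holds java.sql.Date values.
  def yearOf(value: Any): Int = value.asInstanceOf[Date].toLocalDate.getYear

  // What the user code received before, and what it would receive after the change.
  val before: Any = Date.valueOf("2015-01-01")
  val after: Any = Timestamp.valueOf("2015-01-01 00:00:00")

  // java.sql.Timestamp is not a subclass of java.sql.Date, so the cast fails.
  val resultBefore: Try[Int] = Try(yearOf(before))
  val resultAfter: Try[Int] = Try(yearOf(after))
}
```

The first call succeeds and the second fails with a `ClassCastException`, which is only discovered at runtime.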

Member Author

Ah, thank you for your kind explanation. Will definitely try to avoid such a mistake in the future.

@SparkQA

SparkQA commented Oct 10, 2016

Test build #66626 has finished for PR 14788 at commit 8c50b2c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class AddDaysBase(instant: Expression, days: Expression)
    • case class AddDays(instant: Expression, days: Expression) extends AddDaysBase(instant, days)
    • case class SubDays(instant: Expression, days: Expression) extends AddDaysBase(instant, days)

@SparkQA

SparkQA commented Oct 10, 2016

Test build #3313 has finished for PR 14788 at commit 8c50b2c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class AddDaysBase(instant: Expression, days: Expression)
    • case class AddDays(instant: Expression, days: Expression) extends AddDaysBase(instant, days)
    • case class SubDays(instant: Expression, days: Expression) extends AddDaysBase(instant, days)

@SparkQA

SparkQA commented Oct 10, 2016

Test build #3316 has finished for PR 14788 at commit 8c50b2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class AddDaysBase(instant: Expression, days: Expression)
    • case class AddDays(instant: Expression, days: Expression) extends AddDaysBase(instant, days)
    • case class SubDays(instant: Expression, days: Expression) extends AddDaysBase(instant, days)

@SparkQA

SparkQA commented Oct 11, 2016

Test build #66712 has finished for PR 14788 at commit 537fe88.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class DateSub(instant: Expression, days: Expression) extends AddDaysBase(instant, days)
    • case class TruncInstant(instant: Expression, format: Expression)

@SparkQA

SparkQA commented Oct 11, 2016

Test build #66713 has finished for PR 14788 at commit ef67829.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 11, 2016

Test build #66728 has finished for PR 14788 at commit cd78330.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

extended = "> SELECT _FUNC_('2009-02-12', 'MM')\n '2009-02-01 00:00:00'\n> SELECT _FUNC_('2015-10-27', 'YEAR');\n '2015-01-01 00:00:00'")
// scalastyle:on line.size.limit
- case class TruncDate(date: Expression, format: Expression)
+ case class TruncInstant(instant: Expression, format: Expression)
Contributor

This auto casts. I think we are still breaking API here if a user passes a Timestamp. In the old situation the user would always get a Date, and now he gets a Date or Timestamp based on the input type. So I think we need to split this into two expressions.

Member Author

@HyukjinKwon HyukjinKwon Oct 13, 2016

I misunderstood the first comment.. Will make two expressions. Thanks!

Member Author

@HyukjinKwon HyukjinKwon Oct 13, 2016

@hvanhovell Actually, should I also split the other functions I changed here? DateAdd, DateSub, etc. seem to have the same problem.

@HyukjinKwon
Member Author

@rxin @hvanhovell @cloud-fan Could I please ask which behaviour we want here?

  1. Separate expressions for both DateType and TimestampType
  2. Returns DateType for DateType and TimestampType for TimestampType
  3. Returns TimestampType
  4. Do not change this and close this PR

@hvanhovell
Contributor

I prefer option 1.

@rxin
Contributor

rxin commented Oct 22, 2016

Does option 1 really work? Wouldn't the expressions have the same user facing function names?

@hvanhovell
Contributor

hvanhovell commented Oct 22, 2016

It would work with different names, i.e.:

  • timestamp_trunc
  • timestamp_add
  • timestamp_sub
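
A hedged sketch of what option 1 could look like for truncation, using `java.time` types instead of Spark expressions. The names `trunc` and `timestampTrunc` are illustrative only, and just a few format aliases from this thread are handled. The point is that each function has a single fixed return type, so existing callers of the date version are unaffected.

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.temporal.ChronoUnit

object SeparateExpressionsSketch {
  // Date in, date out: the existing behaviour stays as it is.
  def trunc(date: LocalDate, format: String): LocalDate = format.toUpperCase match {
    case "YEAR" => date.withDayOfYear(1)
    case "MM" => date.withDayOfMonth(1)
    case other => throw new IllegalArgumentException(s"Unsupported format: $other")
  }

  // Timestamp in, timestamp out, under a separate (hypothetical) name.
  def timestampTrunc(ts: LocalDateTime, format: String): LocalDateTime =
    format.toUpperCase match {
      case "DAY" | "DD" => ts.truncatedTo(ChronoUnit.DAYS)
      case "HOUR" | "HH" => ts.truncatedTo(ChronoUnit.HOURS)
      // Coarser units reuse the date logic and reset the time to midnight.
      case other => trunc(ts.toLocalDate, other).atStartOfDay
    }
}
```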

@HyukjinKwon
Member Author

Could I then go for option 1?

@rxin
Contributor

rxin commented Oct 28, 2016

What do other databases do? Does date_add in other databases support timestamps?

@HyukjinKwon
Member Author

HyukjinKwon commented Oct 28, 2016

#14788 (comment) here is my observation. It usually seems to be option 2 or option 3. I can take a deeper look if we want to follow other databases or be very sure about this.

@HyukjinKwon
Member Author

HyukjinKwon commented Oct 28, 2016

I will be back after testing/looking into other databases tomorrow.

@HyukjinKwon
Member Author

HyukjinKwon commented Oct 29, 2016

I tried to find the functions equivalent to add/sub/trunc for date/timestamp and tested them. It generally seems to be option 2 or option 3.

In more detail:

DB2 (TRUNC)

  • input: TimestampType, output: TimestampType
  • input: DateType, output: DateType

Note: DB2 also has a TRUNC_TIMESTAMP function, which always returns TimestampType.

db2 => SELECT TRUNC(DATE('1990-01-01'), 'DAY') FROM SYSIBM.SYSDUMMY1

1
----------
12/31/1989

  1 record(s) selected.

db2 => SELECT TRUNC(TIMESTAMP('1990-01-01'), 'DAY') FROM SYSIBM.SYSDUMMY1

1
--------------------------
1989-12-31-00.00.00.000000

  1 record(s) selected.

(TRUNC_TIMESTAMP)

db2 => SELECT TRUNC_TIMESTAMP(DATE('1990-01-01'), 'DAY') FROM SYSIBM.SYSDUMMY1

1
-------------------
1989-12-31-00.00.00

  1 record(s) selected.

Oracle (TRUNC)

  • input: TimestampType/DateType, output: DateType

SQL> SELECT TRUNC(TO_DATE('1999-12-01 11:00:00', 'YYYY-MM-DD HH:MI:SS'), 'HH') FROM DUAL;

TRUNC(TO_DATE('1999-12-0111:00:00','YYYY-MM-DDHH:MI:SS'),'HH')
--------------------------------------------------------------
01-DEC-99

SQL> SELECT TRUNC(TO_TIMESTAMP('1999-12-01 11:00:00', 'YYYY-MM-DD HH:MI:SS'), 'HH') FROM DUAL;

TRUNC(TO_TIMESTAMP('1999-12-0111:00:00','YYYY-MM-DDHH:MI:SS'),'HH')
-------------------------------------------------------------------
01-DEC-99

Postgres (DATE_TRUNC)

  • input: TimestampType/DateType, output: TimestampType
  • input: CalendarIntervalType, output: CalendarIntervalType

postgres=# SELECT DATE_TRUNC('day', CAST('2015-10-10' AS DATE));
       date_trunc
------------------------
 2015-10-10 00:00:00+00
(1 row)

postgres=# SELECT DATE_TRUNC('day', CAST('2015-10-10' AS TIMESTAMP));
     date_trunc
---------------------
 2015-10-10 00:00:00
(1 row)

postgres=# select DATE_TRUNC('hour', interval '2 days 3 hours 40 minutes');
   date_trunc
-----------------
 2 days 03:00:00
(1 row)

MySQL (DATE_SUB)

  • input: TimestampType, output: TimestampType
  • input: DateType, output: DateType

Note: When the interval unit is smaller than DAY, MySQL seems to convert the result into TimestampType.

mysql> SELECT DATE_SUB('1998-01-02', INTERVAL 31 DAY);
+-----------------------------------------+
| DATE_SUB('1998-01-02', INTERVAL 31 DAY) |
+-----------------------------------------+
| 1997-12-02                              |
+-----------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT DATE_SUB('1998-01-02 00:00:00', INTERVAL 31 DAY);
+--------------------------------------------------+
| DATE_SUB('1998-01-02 00:00:00', INTERVAL 31 DAY) |
+--------------------------------------------------+
| 1997-12-02 00:00:00                              |
+--------------------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT DATE_SUB('1998-01-02', INTERVAL 0 second);
+-------------------------------------------+
| DATE_SUB('1998-01-02', INTERVAL 0 second) |
+-------------------------------------------+
| 1998-01-02 00:00:00                       |
+-------------------------------------------+
1 row in set (0.00 sec)

@HyukjinKwon
Member Author

Hi @rxin, do you mind if I ask what you think about this?

@HyukjinKwon
Member Author

gentle ping...


5 participants