-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-17174][SQL] Add the support for TimestampType for some functions as output type #14788
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @cloud-fan, could you check this if this is sensible? I just took a scan and it seems there are similar instances as below: I can do this here or as a follow-up if you think this approach looks okay. |
|
|
||
| override def nullSafeEval(start: Any, months: Any): Any = { | ||
| DateTimeUtils.dateAddMonths(start.asInstanceOf[Int], months.asInstanceOf[Int]) | ||
| override def nullSafeEval(start: Any, months: Any): Any = startDate.dataType match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also add the parameters start and months to the pattern match. That saves some casting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thanks!
|
@HyukjinKwon this looks pretty good. I think we should also do this for The other functions all have non-date return types. I think we can leave them alone. |
|
Oh, yes. right. I will address your comments tomorrow. Thank you @hvanhovell |
| usage = "_FUNC_(start_date, num_months) - Returns the date/timestamp that is num_months after start_date.", | ||
| extended = "> SELECT _FUNC_('2016-08-31', 1);\n '2016-09-30'") | ||
| // scalastyle:on line.size.limit | ||
| case class AddMonths(startDate: Expression, numMonths: Expression) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as it's not always date type now, should we use a new name instead of startDate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instant?
|
Test build #64347 has finished for PR 14788 at commit
|
|
Sorry, it takes a bit longer than I thought I will finish up this within this week. |
|
NP :) |
…day, next_day and trunc functions
| case "DAY" | "DD" => TRUNC_TO_DAY | ||
| case "HOUR" | "HH" => TRUNC_TO_HOUR | ||
| case "MI" => TRUNC_TO_MINUTE | ||
| case "SEC" | "SS" => TRUNC_TO_SECOND |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * Add timestamp and days interval. | ||
| * Returns a timestamp value, expressed in microseconds since 1.1.1970 00:00:00. | ||
| */ | ||
| def timestampAddDays(start: SQLTimestamp, days: Int): SQLTimestamp = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it just start + days * MICROS_PER_DAY? timestampAddInterval has some complex logic to handle month, which is not unnecessary here.
|
Test build #64524 has finished for PR 14788 at commit
|
|
I ran some SQLs in MySQL and PostgreSQL and took a look for Oracle and IBM's Informix documentation. It seems they are not really consistent for In more details, I skimmed through the function list here for MySQL http://www.tutorialspoint.com/mysql/mysql-date-time-functions.htm and for PostgreSQL https://www.postgresql.org/docs/9.1/static/functions-datetime.html, and then tried to apply some equivalent as below:
Please let me know if you think we need more references such as Hive. |
|
It seems it is up to how we define the behaviour. I will follow your decision @cloud-fan. My personal opinion is, support both types for |
|
For For For cc @rxin |
|
@cloud-fan @hvanhovell Would there be other things I should double check and take care of? |
|
retest this please |
|
LGTM - I'll merge as soon as tests complete successfully |
|
Actually can we avoid renaming these expressions? I don't see the point to rename DateSub to SubDays. It just makes it more annoying to link the user facing API with the internal expressions. |
| * @since 1.5.0 | ||
| */ | ||
| def date_add(start: Column, days: Int): Column = withExpr { DateAdd(start.expr, Literal(days)) } | ||
| def date_add(start: Column, days: Int): Column = withExpr { AddDays(start.expr, Literal(days)) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why change the name of these expressions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am willing to revert it back if there is no specific reason. This is about #14788 (comment) - @hvanhovell
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's confusing to users that date_add add days to the given date and add_months add months to the given date. I think add_days and add_months are more consistent.
Other databases(e.g. MySQL, Postgres) only have date_add which adds interval to the given date, so that they don't need add_days and add_months respectively.
The function name is already realsed and maybe hard to change, but changing the expression name to match the real logic seems good.
@rxin any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get what you were suggesting here. Wouldn't it make more sense to make DateAdd expression support both adding Interval type and adding IntegralType (for days)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then we have date_add to add days or interval to the given date, and add_months to add months to the given date, seems a little weird...
|
|
||
| /** | ||
| * Returns date truncated to the unit specified by the format. | ||
| * Returns timestamp truncated to the unit specified by the format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't this actually change the data type returned and can break code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeap. So, I initially wanted to match input type to output type (DateType input - DateType - output, TimestampType input - TimestampType output) but just decided to follow the suggestion because It seems it depends on how we define the type maybe as it seems the implementation is different in each DBMS.
I would like to to hear the thoughts on #14788 (comment) in more details - cc @cloud-fan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reynold has a very valid point. My bad for not thinking this is a problem. This will break as soon as you call a java UDF or call df.rdd.map(...). I think we need to have both a truncate date and a timestamp expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thank you for your kind explanation. Will definitely try to avoid such a mistake in the futute.
|
Test build #66626 has finished for PR 14788 at commit
|
|
Test build #3313 has finished for PR 14788 at commit
|
|
Test build #3316 has finished for PR 14788 at commit
|
…pe (date/timestamp)
|
Test build #66712 has finished for PR 14788 at commit
|
|
Test build #66713 has finished for PR 14788 at commit
|
|
Test build #66728 has finished for PR 14788 at commit
|
| extended = "> SELECT _FUNC_('2009-02-12', 'MM')\n '2009-02-01 00:00:00'\n> SELECT _FUNC_('2015-10-27', 'YEAR');\n '2015-01-01 00:00:00'") | ||
| // scalastyle:on line.size.limit | ||
| case class TruncDate(date: Expression, format: Expression) | ||
| case class TruncInstant(instant: Expression, format: Expression) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This auto casts. I think we are still breaking API here if a user passes a Timestamp. In the old situation the user would always get a Date, and now he gets a Date or Timestamp based on the input type. So I think we need to split this into two expressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I misunderstood the first comment.. Will make two expressions. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hvanhovell ur.. actually, should I split other functions I corrected here as well here? DateAdd, DateSub and etc. also seem having the same problems.
|
@rxin @hvanhovell @cloud-fan Could I please ask the behaviour we want here?
|
|
I prefer option 1. |
|
Does option 1 really work? Wouldn't the expressions have the same user facing function names? |
|
It would work with different names, i.e.:
|
|
Could I then go for option 1.? |
|
What do other databases do? Does date_add in other databases support timestamps? |
|
#14788 (comment) here is my observation. It seems usually option 2 or option 3. I can take a look deeper if we want to follow other databases or be very sure on this. |
|
I will be back after testing/looking into other databases tomorrow. |
|
I tried to find the functions equivalent with In more details, DB2 (
Note: DB2 has ( Oracle (
Postgres (
MySQL (
Note: When the |
|
Hi @rxin, do you mind if I ask what you do think about this? |
|
gentle ping... |
What changes were proposed in this pull request?
This PR adds the support for returning
TimestampTypefordate_add,date_sub,trunc,last_day,next_dayandadd_monthsfunctions.The output type of this function follows the input data type. (e.g. input:
TimestampType, output :TimestampType)How was this patch tested?
Unit test in
DateExpressionsSuite.