[SPARK-29774][SQL] Date and Timestamp type +/- null should be null as Postgres #26412
Conversation
Test build #113314 has finished for PR 26412 at commit
Ah, I see. The change looks reasonable to me. Just in case, can you check the behaviours in the other systems?
also check with presto:

presto> select date('1900-01-01') - null;
 _col0
-------
 NULL
(1 row)

Query 20191127_065501_00001_9md27, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
Test build #114511 has finished for PR 26412 at commit
retest this please
case Add(l @ NullType(), r @ DateType()) => DateAdd(r, Cast(l, IntegerType))
case Subtract(l @ DateType(), r @ IntegerType()) => DateSub(l, r)
case Subtract(l @ DateType(), r @ NullType()) => DateSub(l, Cast(r, IntegerType))
case Subtract(l @ DateType(), r @ DateType()) =>
Can we merge the multiple rules above into one, like this?
case b @ BinaryOperator(l @ DateType(), r @ NullType()) =>
b.withNewChildren(Seq(l, Cast(r, IntegerType)))
If so, we might leave a trivial bug here: if we set spark.sql.optimizer.maxIterations=1, it will not be transformed to DateAdd.
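A minimal spark-shell sketch of that concern (the conf name is taken from the comment above; the intermediate plans in the comments are assumptions about how the merged rule would fire, not verified output):

```scala
import org.apache.spark.sql.SparkSession

// With the merged BinaryOperator rule, the first fixed-point pass would only
// rewrite the null side: Add(date, null) => Add(date, Cast(null, IntegerType)).
// A second pass is then needed to turn Add(date, int) into DateAdd, so capping
// the fixed point at one iteration would leave the plan at the first step.
val spark = SparkSession.builder().master("local[1]").getOrCreate()
spark.conf.set("spark.sql.optimizer.maxIterations", "1")

// Hypothetical repro: inspect whether the analyzed plan contains DateAdd.
spark.sql("SELECT date '2019-01-01' + null").explain(true)
```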
Hmm..., I personally think that behaviour looks a little weird to me. Probably the root cause is that Subtract(l @ DateType(), r @ NullType()).checkInputDataTypes.isSuccess returns true. To fix this issue, we might need to modify that check code to return false. cc: @cloud-fan
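A rough sketch of the suggested fix, using a stand-in helper rather than the actual Catalyst code (`checkDatetimeOperands` is a hypothetical name):

```scala
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess}
import org.apache.spark.sql.types.{DataType, NullType}

// Reject an untyped NULL operand so that an expression like
// Subtract(DateType, NullType) no longer reports a successful type check.
def checkDatetimeOperands(left: DataType, right: DataType): TypeCheckResult =
  (left, right) match {
    case (NullType, _) | (_, NullType) =>
      TypeCheckFailure("datetime arithmetic with an untyped NULL needs an explicit cast")
    case _ => TypeCheckSuccess
  }
```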
FYI, add with numeric type and null type is also handled in TypeCoercion.
  SubtractTimestamps(l, r)
case Subtract(l @ TimestampType(), r @ DateType()) =>
  SubtractTimestamps(l, Cast(r, TimestampType))
case Subtract(l @ TimestampType(), r @ NullType()) =>
how about null - timestamp?
Yes, we need them too; checked with pg.
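For reference, a sketch of the same check from spark-shell once both directions are handled (the timestamp literal is illustrative; `spark` is the shell-provided session):

```scala
// Both directions should now yield NULL, matching PostgreSQL's
// `select null - timestamp '2019-01-01 00:00:00'`.
spark.sql("SELECT null - timestamp '2019-01-01 00:00:00'").show()
spark.sql("SELECT timestamp '2019-01-01 00:00:00' - null").show()
```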
Test build #114528 has finished for PR 26412 at commit
Test build #114534 has finished for PR 26412 at commit
gatorsmile left a comment:
All the BinaryArithmetic operators are NullIntolerant. Why is this only applied to Date/Timestamp types?
IIUC, the …
I think it's all because we hack the `Add`/`Subtract` expressions directly. How about we create `UnresolvedAdd`/`UnresolvedSubtract`/`UnresolvedMultiply`/`UnresolvedDivide` and resolve them with a dedicated analyzer rule?
This is better, I will follow this suggestion, thanks.
  Cast(TimeSub(l, r), l.dataType)
case (CalendarIntervalType, TimestampType | DateType | StringType) =>
  Cast(TimeSub(r, l), r.dataType)
case (DateType | NullType, DateType) => if (conf.usePostgreSQLDialect) {
Do we need to handle NullType here? The Subtract should work for null.
Yeah, the result is the same, but they are not semantically equal; is that OK?
Actually no, `subtract(null, date)` will not pass type checking.
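A quick sketch of that point against Catalyst internals (the exact failure message varies by Spark version):

```scala
import org.apache.spark.sql.catalyst.expressions.{Literal, Subtract}
import org.apache.spark.sql.types.{DateType, NullType}

// Without the NullType case in the resolution rule, the expression stays a
// plain Subtract, whose type check rejects the mismatched, non-numeric operands.
val expr = Subtract(Literal.create(null, NullType), Literal.default(DateType))
println(expr.checkInputDataTypes())  // expected: a TypeCheckFailure
```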
} else {
  SubtractDates(l, r)
}
case (TimestampType, TimestampType | DateType | NullType) => SubtractTimestamps(l, r)
ditto
  SubtractDates(l, r)
}
case (TimestampType, TimestampType | DateType | NullType) => SubtractTimestamps(l, r)
case (DateType | NullType, TimestampType) => SubtractTimestamps(Cast(l, TimestampType), r)
ditto
case (_, _) => Subtract(l, r)
}
case UnresolvedMultiply(l, r) => (l.dataType, r.dataType) match {
  case (CalendarIntervalType, _: NumericType | NullType) => MultiplyInterval(l, r)
ditto
}
case UnresolvedSubtract(l, r) => (l.dataType, r.dataType) match {
  case (TimestampType | DateType | StringType, CalendarIntervalType) =>
    Cast(TimeSub(l, r), l.dataType)
I notice that `TimeSub` is replaceable by `TimeAdd(l, UnaryMinus(r))`, which makes it redundant.
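The equivalence can be sketched as a one-line rewrite, assuming `UnaryMinus` accepts `CalendarIntervalType` and negates the interval:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, TimeAdd, UnaryMinus}

// l - interval == l + (-interval), so TimeSub(l, r) can be expressed as:
def timeSubAsTimeAdd(l: Expression, r: Expression): Expression =
  TimeAdd(l, UnaryMinus(r))
```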
Test build #114718 has finished for PR 26412 at commit
Test build #114886 has finished for PR 26412 at commit
Simply replace the …
Test build #114893 has finished for PR 26412 at commit
retest this please
import CatalystSqlParser._
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser._
unnecessary change
import CatalystSqlParser._
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser._
ditto
select timestamp'2011-11-11 11:11:11' + interval '2' day
-- !query 13 schema
struct<date_sub(DATE '2001-10-01', 7):date>
struct<CAST(TIMESTAMP '2011-11-11 11:11:11' + INTERVAL '2 days' AS TIMESTAMP):timestamp>
can we avoid adding cast if not necessary?
OK it's the existing behavior too, we can revisit it later.
select '2011-11-11' - interval '2' day
-- !query 17 schema
struct<DATE '2019-01-01':date>
struct<CAST(CAST(2011-11-11 AS TIMESTAMP) - INTERVAL '2 days' AS STRING):string>
it's super weird that this returns string. What was the behavior before?
OK it's the existing behavior. We can revisit it later.
Yeah, but it follows the previous behavior: https://github.com/apache/spark/pull/26412/files/c84d46ea6d384dcb1f442ca54abad48e59c92bb3#diff-383a8cdd0a9c58cae68e0a79295520a3L846
// Allowed operations:
// IntervalYearMonth - IntervalYearMonth = IntervalYearMonth
// Date - IntervalYearMonth = Date (operands not reversible)
// Timestamp - IntervalYearMonth = Timestamp (operands not reversible)
// IntervalDayTime - IntervalDayTime = IntervalDayTime
// Date - IntervalDayTime = Timestamp (operands not reversible)
// Timestamp - IntervalDayTime = Timestamp (operands not reversible)
// Timestamp - Timestamp = IntervalDayTime
// Date - Date = IntervalDayTime
// Timestamp - Date = IntervalDayTime (operands reversible)
// Date - Int = Date

Hive's behavior is more convincing; we can check this later.
looks pretty good, let's see how tests go this time.
Test build #114896 has finished for PR 26412 at commit
Test build #114897 has finished for PR 26412 at commit
thanks, merging to master!
… Postgres

### What changes were proposed in this pull request?
Add an analyzer rule to convert unresolved `Add`, `Subtract`, etc. to `TimeAdd`, `DateAdd`, etc. according to the following policy:

```scala
/**
 * For [[Add]]:
 * 1. if both sides are interval, stays the same;
 * 2. else if one side is interval, turns it to [[TimeAdd]];
 * 3. else if one side is date, turns it to [[DateAdd]];
 * 4. else stays the same.
 *
 * For [[Subtract]]:
 * 1. if both sides are interval, stays the same;
 * 2. else if the right side is an interval, turns it to [[TimeSub]];
 * 3. else if one side is timestamp, turns it to [[SubtractTimestamps]];
 * 4. else if the right side is date, turns it to [[DateDiff]]/[[SubtractDates]];
 * 5. else if the left side is date, turns it to [[DateSub]];
 * 6. else stays the same.
 *
 * For [[Multiply]]:
 * 1. If one side is interval, turns it to [[MultiplyInterval]];
 * 2. otherwise, stays the same.
 *
 * For [[Divide]]:
 * 1. If the left side is interval, turns it to [[DivideInterval]];
 * 2. otherwise, stays the same.
 */
```

Besides, we change datetime functions from implicit cast types to strict ones; all available type coercions happen in the `DateTimeOperations` coercion rule.

### Why are the changes needed?
Feature parity between PostgreSQL and Spark, and to make the null semantics consistent with Spark.

### Does this PR introduce any user-facing change?
1. date_add/date_sub functions only accept int/tinyint/smallint as the second arg; double/string etc. are forbidden, like in Hive, because they produce weird results.

### How was this patch tested?
add ut

Closes apache#26412 from yaooqinn/SPARK-29774.
Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sorry @cloud-fan, I just checked the cc. I don't think there are any differences in the column names being generated in PySpark specifically.
@yaooqinn Thanks for the work, but I can't tell the behavior before this PR from the PR description and discussions. I would suggest adding that to the PR description as well. I had to check with Spark 2.4.4 to find the previous behavior:
Hi @gengliangwang, thanks for your suggestion; I have updated the description. Can you check whether it is clear enough?
Do you mean time_add/time_sub? BTW we should have a migration guide for it.
It is date_add and date_sub; we have made them …
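A sketch of how the behavior change could be illustrated in the migration guide (spark-shell; the exact error message is version-dependent, and string literals were later special-cased again in #27965, as the commits below show):

```scala
// The second argument must now be int/tinyint/smallint:
spark.sql("SELECT date_add(date '2011-11-11', 1)").show()   // OK
// These used to be implicitly cast and now fail analysis:
// spark.sql("SELECT date_add(date '2011-11-11', 1.5)")
// spark.sql("SELECT date_sub(date '2011-11-11', '1')")
```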
…ate_sub

### What changes were proposed in this pull request?
Add a migration guide for date_add and date_sub to indicate their behavior change. It is a followup for #26412.

### Why are the changes needed?
add a migration guide

### Does this PR introduce any user-facing change?
yes, doc change

### How was this patch tested?
no

Closes #26932 from yaooqinn/SPARK-29774-f.
Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ate_add/date_sub functions

### What changes were proposed in this pull request?
#26412 introduced a behavior change: the `date_add`/`date_sub` functions can't accept string and double values in the second parameter. This is reasonable, as it's error-prone to cast string/double to int at runtime.

However, using string literals as function arguments is very common in SQL databases. To avoid breaking valid use cases where the string literal is indeed an integer, this PR proposes to add ansi_cast for string literals in date_add/date_sub functions. If the string value is not a valid integer, we fail at query compile time because of constant folding.

### Why are the changes needed?
avoid breaking changes

### Does this PR introduce any user-facing change?
Yes, now 3.0 can run `date_add('2011-11-11', '1')` like 2.4.

### How was this patch tested?
new tests.

Closes #27965 from cloud-fan/string.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…date_sub for Spark 2 compatibility (apache#286)

* [HADP-43405] Implicitly cast second argument of date_add/date_sub for Spark 2 compatibility (apache#60)

This PR adds a new mixin `LegacyCastInputTypes` for the analyzer to perform implicit type casting when `spark.sql.legacy.implicitCastInputTypes` is true, for Spark 2 compatibility. apache#26412 broke Spark 2 compatibility by not implicitly casting the second argument of the `date_add`/`date_sub` functions. `DateAdd`/`DateSub` now extend `LegacyCastInputTypes`, with `spark.sql.legacy.implicitCastInputTypes=false` by default, so the default behavior is not changed. We will enable the config in Panda.

No.

Add UT.

Co-authored-by: tianlzhang <[email protected]>
Co-authored-by: Wang, Fei <[email protected]>
What changes were proposed in this pull request?
Add an analyzer rule to convert unresolved `Add`, `Subtract`, etc. to `TimeAdd`, `DateAdd`, etc. according to the policy shown above. Besides, we change datetime functions from implicit cast types to strict ones; all available type coercions happen in the `DateTimeOperations` coercion rule.

Why are the changes needed?
Feature parity between PostgreSQL and Spark, and making the null semantics consistent with Spark.
Does this PR introduce any user-facing change?
Date and Timestamp +/- null now returns null, as the operators are `NullIntolerant`; e.g. `select timestamp'1999-12-31 00:00:00' - null` is valid now.

How was this patch tested?
add ut
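To wrap up, a spark-shell sketch of the end-to-end null semantics this PR delivers (output formatting varies by version):

```scala
// Date/Timestamp +/- null now yields NULL, as in PostgreSQL.
spark.sql("SELECT timestamp '1999-12-31 00:00:00' - null").show()
spark.sql("SELECT date '1999-12-31' + null").show()
// Both queries are valid and return a single NULL row.
```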