[SPARK-28885][SQL] Follow ANSI store assignment rules in table insertion by default #26107
Conversation
let's also add a migration guide for DS v1.
retest this please.
```scala
        true
      }
    case (_: NullType, _) if storeAssignmentPolicy == ANSI => true
```
I think we can write null to any nullable column even under the strict policy. We can have a follow-up PR to discuss it further.
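For context, a minimal sketch of what this branch permits (assuming a spark-shell session where `spark` is predefined; the table and values are made up):

```scala
// Under the ANSI policy, a NULL literal (NullType) can be stored into any
// nullable column, regardless of the column's declared type.
spark.conf.set("spark.sql.storeAssignmentPolicy", "ansi")
spark.sql("CREATE TABLE pets (name STRING, age INT) USING parquet")
spark.sql("INSERT INTO pets VALUES ('fido', NULL)")  // NullType -> nullable INT: accepted
```

Whether the same should hold under the strict policy is the follow-up question raised above.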
LGTM
+1, LGTM. This is a big change according to the vote.
Thank you, @gengliangwang and @cloud-fan .
I'll merge this to master to unblock the other PRs. This opened up several follow-up issues. Also, we can adjust the rule on strict policy later.
| "postgreSQL/float8.sql", | ||
| // SPARK-28885 String value is not allowed to be stored as date/timestamp type with | ||
| // ANSI store assignment policy. | ||
| "postgreSQL/date.sql", |
@gengliangwang Sorry, I just realized that my changes are no longer tested by date.sql and timestamp.sql when I run: build/sbt "sql/test-only *SQLQueryTestSuite -- -z date.sql". Did you disable them permanently, or are they still executed in some way by Jenkins?
How about just setting storeAssignmentPolicy=LEGACY for the PgSQL tests? They have a lot of INSERT queries with implicit casts from string literals to numeric values.
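For reference, a minimal sketch of how the legacy override could be scoped to individual statements rather than whole test files, assuming the usual `withSQLConf` test helper and the `SQLConf.STORE_ASSIGNMENT_POLICY` config entry (the table and insert are placeholders):

```scala
import org.apache.spark.sql.internal.SQLConf

// Restore the pre-3.0 coercion only for the statements that rely on it.
withSQLConf(SQLConf.STORE_ASSIGNMENT_POLICY.key -> "legacy") {
  sql("INSERT INTO int_tbl VALUES ('123')")  // implicit string-to-int cast, tolerated here
}
```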
@maropu I am not sure about that.
PostgreSQL disallows inserting string values into numeric columns except for literals. Setting storeAssignmentPolicy=LEGACY for all the PgSQL tests seems inaccurate.
But, before this PR was merged, we ran the PgSQL tests in LEGACY mode, right? Is my understanding wrong? Personally, I think we need to explicitly file issues in JIRA (or comment them out?) if we have inaccurate tests in pgSQL/.
How about changing the timestamp/date values from string literals to timestamp/date literals in those SQL files, just as in https://github.com/apache/spark/pull/26107/files#diff-431a4d1f056a06e853da8a60c46e9ca0R68?
I am not sure whether there is a guideline for porting the PgSQL test files. Such modifications are allowed, right?
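To make the proposed rewrite concrete, a hedged before/after sketch (statements paraphrased from the ported date test; the table name follows the PgSQL file):

```scala
// Before: a STRING value is written into a DATE column, which the ANSI store
// assignment policy rejects.
spark.sql("INSERT INTO DATE_TBL VALUES ('1957-04-09')")

// After: the value is already a DATE literal, so no string-to-date store
// assignment is needed and the statement passes under the ANSI policy.
spark.sql("INSERT INTO DATE_TBL VALUES (DATE '1957-04-09')")
```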
> I am not sure whether there is a guideline for porting the PgSQL test files. Such modifications are allowed, right?

Yea, I think that's ok. cc: @wangyum @dongjoon-hyun @HyukjinKwon
If you have no time to do it, I'll check next week.
I'm +1 for @gengliangwang's suggestion of enabling those tests with that modification.
…ests of SQLQueryTestSuite

### What changes were proposed in this pull request?
SPARK-28885 (#26107) introduced the ANSI store assignment rules and stopped running some of the ported PgSQL regression tests that violate them. To re-activate these tests, this PR modifies them so that they pass under the rules.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #26492 from maropu/SPARK-28885-FOLLOWUP.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
When inserting a value into a column with a different data type, Spark performs type coercion. Currently, we support 3 policies for the store assignment rules: ANSI, legacy and strict, which can be set via the option "spark.sql.storeAssignmentPolicy":

- ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the behavior is mostly the same as PostgreSQL. It disallows certain unreasonable type conversions such as converting `string` to `int` and `double` to `boolean`. It will throw a runtime exception if the value is out-of-range (overflow).
- Legacy: Spark allows the type coercion as long as it is a valid `Cast`, which is very loose. E.g., converting either `string` to `int` or `double` to `boolean` is allowed. It is the current behavior in Spark 2.x for compatibility with Hive. When inserting an out-of-range value into an integral field, the low-order bits of the value are inserted (the same as Java/Scala numeric type casting). For example, if 257 is inserted into a field of Byte type, the result is 1.
- Strict: Spark doesn't allow any possible precision loss or data truncation, e.g., converting either `double` to `int` or `decimal` to `double` is not allowed. The rules were originally for the Dataset encoder. As far as I know, no mainstream DBMS uses this policy by default.

Currently, the V1 data source uses the "Legacy" policy by default, while V2 uses "Strict". This proposal is to use the "ANSI" policy by default for both V1 and V2 in Spark 3.0.
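A hedged sketch of how the three policies differ on a single out-of-range insert (run e.g. in spark-shell, where `spark` is predefined; the table name and exact error wording are illustrative):

```scala
spark.sql("CREATE TABLE bytes_tbl (b BYTE) USING parquet")

spark.conf.set("spark.sql.storeAssignmentPolicy", "legacy")
spark.sql("INSERT INTO bytes_tbl VALUES (257)")   // wraps to 1 (low-order bits), as in Spark 2.x

spark.conf.set("spark.sql.storeAssignmentPolicy", "ansi")
spark.sql("INSERT INTO bytes_tbl VALUES (257)")   // runtime exception: 257 is out of range for BYTE

spark.conf.set("spark.sql.storeAssignmentPolicy", "strict")
spark.sql("INSERT INTO bytes_tbl VALUES (257)")   // rejected at analysis time: INT cannot be
                                                  // safely stored into a BYTE column
```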
Why are the changes needed?
Following the ANSI SQL standard is most reasonable among the 3 policies.
Does this PR introduce any user-facing change?
Yes.
The default store assignment policy is ANSI for both V1 and V2 data sources.
How was this patch tested?
Unit test