
Conversation

@gengliangwang (Member)

What changes were proposed in this pull request?

When inserting a value into a column with a different data type, Spark performs type coercion. Currently, we support three policies for the store assignment rules (ANSI, legacy, and strict), which can be set via the option "spark.sql.storeAssignmentPolicy":

  1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the behavior is mostly the same as PostgreSQL's. It disallows certain unreasonable type conversions, such as converting string to int or double to boolean, and it throws a runtime exception if the value is out of range (overflows).
  2. Legacy: Spark allows the type coercion as long as it is a valid Cast, which is very loose. E.g., converting either string to int or double to boolean is allowed. This is the behavior of Spark 2.x, kept for compatibility with Hive. When an out-of-range value is inserted into an integral field, the low-order bits of the value are inserted (the same as Java/Scala numeric type casting). For example, if 257 is inserted into a field of Byte type, the result is 1.
  3. Strict: Spark doesn't allow any possible precision loss or data truncation in store assignment; e.g., converting either double to int or decimal to double is not allowed. These rules were originally designed for the Dataset encoder. As far as I know, no mainstream DBMS uses this policy by default.

Currently, the V1 data sources use the "Legacy" policy by default, while V2 uses "Strict". This proposal is to use the "ANSI" policy by default for both V1 and V2 in Spark 3.0, as sketched below.
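
For illustration, a minimal sketch of how the three policies behave, assuming a Spark 3.0 `SparkSession` named `spark`; the table name `t` is hypothetical, and the failure modes are paraphrased from the descriptions above:

```scala
import scala.util.Try

// Hypothetical target table with a single BYTE (tinyint) column.
spark.sql("CREATE TABLE t (b BYTE) USING parquet")

// ANSI (proposed default): an out-of-range value raises a runtime exception.
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
Try(spark.sql("INSERT INTO t VALUES (257)")) // fails at runtime: 257 overflows BYTE

// LEGACY (Spark 2.x default): the low-order bits are kept, as in a
// Java/Scala narrowing cast, so 257 is stored as 1 (257.toByte == 1).
spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")
spark.sql("INSERT INTO t VALUES (257)") // stores 1

// STRICT: the int-to-byte assignment may lose precision, so the query is
// rejected at analysis time and never runs.
spark.conf.set("spark.sql.storeAssignmentPolicy", "STRICT")
Try(spark.sql("INSERT INTO t VALUES (257)")) // fails analysis: cannot safely cast
```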

Why are the changes needed?

Following the ANSI SQL standard is the most reasonable choice among the three policies.

Does this PR introduce any user-facing change?

Yes.
The default store assignment policy is ANSI for both V1 and V2 data sources.

How was this patch tested?

Unit test

SparkQA commented Oct 14, 2019

Test build #112004 has finished for PR 26107 at commit 83d87bd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

let's also add a migration guide for DS v1.

SparkQA commented Oct 14, 2019

Test build #112021 has finished for PR 26107 at commit 3fe9e46.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 14, 2019

Test build #112023 has finished for PR 26107 at commit aebd191.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 14, 2019

Test build #112024 has finished for PR 26107 at commit 4fd423b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 14, 2019

Test build #112027 has finished for PR 26107 at commit fc47811.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 14, 2019

Test build #112030 has finished for PR 26107 at commit 497f9a3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 14, 2019

Test build #112033 has finished for PR 26107 at commit b26f39c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 14, 2019

Test build #112034 has finished for PR 26107 at commit aefc6be.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 14, 2019

Test build #112052 has finished for PR 26107 at commit 49e3fba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member, Author)

retest this please.

SparkQA commented Oct 14, 2019

Test build #112055 has finished for PR 26107 at commit 49e3fba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

This reverts commit b602083.
This reverts commit cb70ddc.
SparkQA commented Oct 14, 2019

Test build #112059 has finished for PR 26107 at commit b602083.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 14, 2019

Test build #112068 has finished for PR 26107 at commit cb70ddc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 15, 2019

Test build #112075 has finished for PR 26107 at commit e7ccbac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 15, 2019

Test build #112090 has finished for PR 26107 at commit 4b77736.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
    true
  }

case (_: NullType, _) if storeAssignmentPolicy == ANSI => true
```
Contributor:

I think we can write null to any nullable column even if it's strict policy. We can have a followup PR to discuss it further.
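
As a rough, self-contained model of that suggestion (the names and the standalone function are assumptions for illustration, not Spark's internal API):

```scala
// Hypothetical model of the follow-up idea: a NULL source value is writable
// into any nullable target column under every policy, not only under ANSI.
sealed trait Policy
case object Ansi extends Policy
case object Legacy extends Policy
case object Strict extends Policy

def canWriteNull(targetNullable: Boolean, policy: Policy): Boolean =
  targetNullable // the policy would no longer matter for NULL values

assert(canWriteNull(targetNullable = true, policy = Strict))  // OK even under Strict
assert(!canWriteNull(targetNullable = false, policy = Ansi))  // never into NOT NULL columns
```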

@cloud-fan (Contributor)

LGTM

SparkQA commented Oct 15, 2019

Test build #112107 has finished for PR 26107 at commit b9abb67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment

+1, LGTM. This is a big change, in line with the vote.
Thank you, @gengliangwang and @cloud-fan.
I'll merge this to master to unblock the other PRs. This opened up several follow-up issues. Also, we can adjust the rule on the strict policy later.

"postgreSQL/float8.sql",
// SPARK-28885 String value is not allowed to be stored as date/timestamp type with
// ANSI store assignment policy.
"postgreSQL/date.sql",
@MaxGekk (Member) commented Nov 2, 2019

@gengliangwang Sorry, I just realized that my changes are no longer tested by date.sql and timestamp.sql when I run build/sbt "sql/test-only *SQLQueryTestSuite -- -z date.sql". Did you disable them forever, or are they executed in some way by Jenkins?

@maropu (Member)

How about just setting storeAssignmentPolicy=LEGACY for the PgSQL tests? They have a lot of INSERT queries with implicit casts from string literals to numeric values.

@gengliangwang (Member, Author)

@maropu I am not sure about that.
PostgreSQL disallows inserting string values into numeric columns, except for literals. Setting storeAssignmentPolicy=LEGACY for all the PgSQL tests seems inaccurate.

@maropu (Member) commented Nov 3, 2019

But before this PR was merged, we ran the PgSQL tests in LEGACY mode, right? Is my understanding wrong? Personally, I think we need to explicitly file JIRA issues (or comment the tests out?) if we have inaccurate tests in pgSQL/.

@gengliangwang (Member, Author)

How about changing the timestamp/date values from string literals to timestamp/date literals in those SQL files, just as in https://github.com/apache/spark/pull/26107/files#diff-431a4d1f056a06e853da8a60c46e9ca0R68
I am not sure whether there is a guideline for porting the PgSQL test files. Such modifications are allowed, right? (See the sketch below.)
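
For illustration, a hedged sketch of that kind of rewrite (the table name is hypothetical; the actual changes live in the ported PgSQL test files):

```scala
// Under the ANSI store assignment policy, a plain string no longer coerces
// to DATE on insert, so the ported tests would use typed literals instead.
spark.sql("CREATE TABLE d (dt DATE) USING parquet")
spark.sql("INSERT INTO d VALUES (DATE '2019-01-01')") // OK: typed date literal
// The original ported test inserted a string, which ANSI rejects:
// spark.sql("INSERT INTO d VALUES ('2019-01-01')")
```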

@maropu (Member) commented Nov 3, 2019

> I am not sure whether there is a guideline for porting the PgSQL test files. Such modifications are allowed, right?

Yea, I think that's OK. cc: @wangyum @dongjoon-hyun @HyukjinKwon
If you have no time to do it, I'll check next week.

Member

I'm +1 for @gengliangwang's suggestion of enabling those tests with that modification.

dongjoon-hyun pushed a commit that referenced this pull request on Nov 20, 2019: …ests of SQLQueryTestSuite

### What changes were proposed in this pull request?

SPARK-28885 (#26107) added support for the ANSI store assignment rules and stopped running some ported PgSQL regression tests that violate the rules. To re-activate these tests, this PR modifies them so that they pass under the rules.

### Why are the changes needed?

To make the test coverage better.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #26492 from maropu/SPARK-28885-FOLLOWUP.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>