
Conversation

@gengliangwang (Member)

What changes were proposed in this pull request?

When inserting a value into a column with a different data type, Spark performs type coercion. Currently, we support three policies for the store assignment rules (ANSI, legacy, and strict), which can be set via the option "spark.sql.storeAssignmentPolicy":

  1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the behavior is mostly the same as PostgreSQL's. It disallows certain unreasonable type conversions, such as converting string to int or double to boolean, and it throws a runtime exception if the value is out of range (overflows).
  2. Legacy: Spark allows the type coercion as long as it is a valid Cast, which is very loose. E.g., converting either string to int or double to boolean is allowed. This is the behavior of Spark 2.x, kept for compatibility with Hive. When an out-of-range value is inserted into an integral field, the low-order bits of the value are inserted (the same as Java/Scala numeric type casting). For example, if 257 is inserted into a field of Byte type, the result is 1.
  3. Strict: Spark doesn't allow any possible precision loss or data truncation in store assignment; e.g., converting either double to int or decimal to double is not allowed. These rules were originally designed for the Dataset encoder. As far as I know, no mainstream DBMS uses this policy by default.

Currently, the V1 data sources use the "Legacy" policy by default, while V2 uses "Strict". This proposal is to use the "ANSI" policy by default for both V1 and V2 in Spark 3.0, as sketched below.
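
For illustration, a minimal sketch of how the three policies behave, assuming a Spark 3.0 `SparkSession` named `spark`; the table name `t` is hypothetical, and the failure modes are paraphrased from the descriptions above:

```scala
import scala.util.Try

// Hypothetical target table with a single BYTE (tinyint) column.
spark.sql("CREATE TABLE t (b BYTE) USING parquet")

// ANSI (proposed default): an out-of-range value raises a runtime exception.
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
Try(spark.sql("INSERT INTO t VALUES (257)")) // fails at runtime: 257 overflows BYTE

// LEGACY (Spark 2.x default): the low-order bits are kept, as in a
// Java/Scala narrowing cast, so 257 is stored as 1 (257.toByte == 1).
spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")
spark.sql("INSERT INTO t VALUES (257)") // stores 1

// STRICT: the int-to-byte assignment may lose precision, so the query is
// rejected at analysis time and never runs.
spark.conf.set("spark.sql.storeAssignmentPolicy", "STRICT")
Try(spark.sql("INSERT INTO t VALUES (257)")) // fails analysis: cannot safely cast
```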

Why are the changes needed?

Following the ANSI SQL standard is the most reasonable choice among the three policies.

Does this PR introduce any user-facing change?

Yes.
The default store assignment policy is ANSI for both V1 and V2 data sources.

How was this patch tested?

Unit test

SparkQA commented Oct 14, 2019

Test build #112004 has finished for PR 26107 at commit 83d87bd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

let's also add a migration guide for DS v1.

SparkQA commented Oct 14, 2019

Test build #112021 has finished for PR 26107 at commit 3fe9e46.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 14, 2019

Test build #112023 has finished for PR 26107 at commit aebd191.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 14, 2019

Test build #112024 has finished for PR 26107 at commit 4fd423b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 14, 2019

Test build #112027 has finished for PR 26107 at commit fc47811.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 14, 2019

Test build #112030 has finished for PR 26107 at commit 497f9a3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 14, 2019

Test build #112033 has finished for PR 26107 at commit b26f39c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 14, 2019

Test build #112034 has finished for PR 26107 at commit aefc6be.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 14, 2019

Test build #112052 has finished for PR 26107 at commit 49e3fba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member, Author)

retest this please.

SparkQA commented Oct 14, 2019

Test build #112055 has finished for PR 26107 at commit 49e3fba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

This reverts commit b602083.
This reverts commit cb70ddc.
SparkQA commented Oct 14, 2019

Test build #112059 has finished for PR 26107 at commit b602083.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 14, 2019

Test build #112068 has finished for PR 26107 at commit cb70ddc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 15, 2019

Test build #112075 has finished for PR 26107 at commit e7ccbac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 15, 2019

Test build #112090 has finished for PR 26107 at commit 4b77736.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
    true
  }

case (_: NullType, _) if storeAssignmentPolicy == ANSI => true
```
Contributor:

I think we can write null to any nullable column even if it's strict policy. We can have a followup PR to discuss it further.
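
As a rough, self-contained model of that suggestion (the names and the standalone function are assumptions for illustration, not Spark's internal API):

```scala
// Hypothetical model of the follow-up idea: a NULL source value is writable
// into any nullable target column under every policy, not only under ANSI.
sealed trait Policy
case object Ansi extends Policy
case object Legacy extends Policy
case object Strict extends Policy

def canWriteNull(targetNullable: Boolean, policy: Policy): Boolean =
  targetNullable // the policy would no longer matter for NULL values

assert(canWriteNull(targetNullable = true, policy = Strict))  // OK even under Strict
assert(!canWriteNull(targetNullable = false, policy = Ansi))  // never into NOT NULL columns
```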

@cloud-fan (Contributor)

LGTM

SparkQA commented Oct 15, 2019

Test build #112107 has finished for PR 26107 at commit b9abb67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment

+1, LGTM. This is a big change, in line with the vote.
Thank you, @gengliangwang and @cloud-fan.
I'll merge this to master to unblock the other PRs. This opened up several follow-up issues. Also, we can adjust the rule on the strict policy later.

"postgreSQL/float8.sql",
// SPARK-28885 String value is not allowed to be stored as date/timestamp type with
// ANSI store assignment policy.
"postgreSQL/date.sql",
@MaxGekk (Member) commented Nov 2, 2019

@gengliangwang Sorry, I just realized that my changes are no longer tested by date.sql and timestamp.sql when I run build/sbt "sql/test-only *SQLQueryTestSuite -- -z date.sql". Did you disable them forever, or are they executed in some way by Jenkins?

@maropu (Member)

How about just setting storeAssignmentPolicy=LEGACY for the PgSQL tests? They have a lot of INSERT queries with implicit casts from string literals to numeric values.

@gengliangwang (Member, Author)

@maropu I am not sure about that.
PostgreSQL disallows inserting string values into numeric columns, except for literals. Setting storeAssignmentPolicy=LEGACY for all the PgSQL tests seems inaccurate.

@maropu (Member) commented Nov 3, 2019

But before this PR was merged, we ran the PgSQL tests in LEGACY mode, right? Is my understanding wrong? Personally, I think we need to explicitly file JIRA issues (or comment the tests out?) if we have inaccurate tests in pgSQL/.

@gengliangwang (Member, Author)

How about changing the timestamp/date values from string literals to timestamp/date literals in those SQL files, just as in https://github.com/apache/spark/pull/26107/files#diff-431a4d1f056a06e853da8a60c46e9ca0R68
I am not sure whether there is a guideline for porting the PgSQL test files. Such modifications are allowed, right? (See the sketch below.)
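
For illustration, a hedged sketch of that kind of rewrite (the table name is hypothetical; the actual changes live in the ported PgSQL test files):

```scala
// Under the ANSI store assignment policy, a plain string no longer coerces
// to DATE on insert, so the ported tests would use typed literals instead.
spark.sql("CREATE TABLE d (dt DATE) USING parquet")
spark.sql("INSERT INTO d VALUES (DATE '2019-01-01')") // OK: typed date literal
// The original ported test inserted a string, which ANSI rejects:
// spark.sql("INSERT INTO d VALUES ('2019-01-01')")
```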

@maropu (Member) commented Nov 3, 2019

> I am not sure whether there is a guideline for porting the PgSQL test files. Such modifications are allowed, right?

Yea, I think that's OK. cc: @wangyum @dongjoon-hyun @HyukjinKwon
If you have no time to do it, I'll check next week.

Member

I'm +1 for @gengliangwang's suggestion of enabling those tests with that modification.

dongjoon-hyun pushed a commit that referenced this pull request on Nov 20, 2019: …ests of SQLQueryTestSuite

### What changes were proposed in this pull request?

SPARK-28885 (#26107) added support for the ANSI store assignment rules and stopped running some ported PgSQL regression tests that violate the rules. To re-activate these tests, this PR modifies them so that they pass under the rules.

### Why are the changes needed?

To make the test coverage better.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #26492 from maropu/SPARK-28885-FOLLOWUP.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>