[SPARK-27931][SQL] Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. #25458

younggyuchun · 2019-08-15T04:01:46Z

What changes were proposed in this pull request?

This PR adds "true", "yes", "1", "false", "no", "0", and their unique prefixes as accepted input for the boolean data type, and ignores leading and trailing whitespace in the input. For reference, the following pages list the string representations that other databases accept for the boolean type:

https://www.postgresql.org/docs/devel/datatype-boolean.html
https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html
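
To make the intended rule concrete, here is a minimal sketch in plain Scala. It is not the code in this PR: the object name `BooleanPrefixSketch`, the `parse` method, and the choice to return `Option[Boolean]` are assumptions made for illustration, and the word lists include "on" and "off" because they appear in the diff under review. The sketch trims the input, lowercases it, and accepts it only when it is a prefix of words on exactly one side.

```scala
import java.util.Locale

// Hypothetical helper, not part of Spark: illustrates trimming plus unique-prefix matching.
object BooleanPrefixSketch {
  // Full spellings discussed in this PR.
  private val trueWords = Seq("true", "yes", "on", "1")
  private val falseWords = Seq("false", "no", "off", "0")

  /** Returns Some(true) or Some(false) for a recognised input, None otherwise. */
  def parse(input: String): Option[Boolean] = {
    val s = input.trim.toLowerCase(Locale.ROOT)
    if (s.isEmpty) {
      None
    } else {
      val matchesTrue = trueWords.exists(_.startsWith(s))
      val matchesFalse = falseWords.exists(_.startsWith(s))
      (matchesTrue, matchesFalse) match {
        case (true, false) => Some(true)
        case (false, true) => Some(false)
        case _ => None // unknown input, or a prefix shared by both sides
      }
    }
  }
}
```

With this sketch, `BooleanPrefixSketch.parse("   f           ")` yields `Some(false)` and `BooleanPrefixSketch.parse("tru")` yields `Some(true)`. A real cast would return SQL null rather than `None`.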

How was this patch tested?

Added new tests to CastSuite.

wangyum · 2019-08-15T04:34:31Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala

+    Set("t", "true", "y", "yes", "1", "on").map(UTF8String.fromString)
+
+  private[this] val falseStrings =
+    Set("f", "false", "n", "no", "0", "off").map(UTF8String.fromString)


It seems only PostgreSQL accepts on and off?

Yes, I guess so. Do you know of other common string representations used in other databases?

But PostgreSQL also accepts of, tru, fals, ...:

postgres=# select cast('of' as boolean), cast('tru' as boolean), cast('fals' as boolean);
 bool | bool | bool
------+------+------
 f    | t    | f
(1 row)

postgres/postgres@9729c93

Ah okay. Let me add that too. Thank you

SparkQA · 2019-08-15T06:40:25Z

Test build #109139 has finished for PR 25458 at commit 7d61642.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

younggyuchun · 2019-08-16T00:13:05Z

Unique prefixes of the strings are also accepted, for example "true", "tru", "tr" and "t".

younggyuchun · 2019-08-16T00:17:37Z

@HyukjinKwon, @srowen Could you please review this PR?

HyukjinKwon · 2019-08-16T03:18:13Z

cc @dongjoon-hyun, @cloud-fan and @gatorsmile as well.

SparkQA · 2019-08-16T06:01:22Z

Test build #109158 has finished for PR 25458 at commit 933fe86.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

younggyuchun · 2019-08-16T13:21:11Z

sql/core/src/test/resources/sql-tests/inputs/pgSQL/boolean.sql


 -- [SPARK-27931] Trim the string when cast string type to boolean type
-SELECT boolean('   f           ') AS `false`;
+SELECT boolean('   f           ') AS `true`;


@wangyum I think it should be 'true'. Is it correct?

It's false:

Let me dig into this.

younggyuchun · 2019-08-16T13:22:36Z

retest this please

SparkQA · 2019-08-16T18:58:47Z

Test build #109222 has finished for PR 25458 at commit a5aec9f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-08-16T22:33:01Z

Test build #109234 has finished for PR 25458 at commit 9e9aac3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

younggyuchun · 2019-08-19T13:02:57Z

Could you please review this PR? @HyukjinKwon @dongjoon-hyun @cloud-fan and @gatorsmile

HyukjinKwon · 2019-08-23T02:40:52Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala

+    checkCast("n", false)
    checkCast("0", false)
+    checkCast("off", false)
+    checkCast("of", false)


@younggyuchun, just to be doubly sure, did you double-check the behaviours against PostgreSQL?

@HyukjinKwon Here it is:

This is not documented: https://www.postgresql.org/docs/devel/datatype-boolean.html

Postgres may support "of" for historical reasons; I don't think we have to follow it.

@cloud-fan @dongjoon-hyun @HyukjinKwon
This build accepts several unique prefixes for the boolean data type, for example tru, tr, ye, fals, fal, fa and of, which are not documented. Do we want to reject these prefixes?

@cloud-fan . It's a documented feature in that document. We had better support it.

Unique prefixes of these strings are also accepted, for example t or n. Leading or trailing whitespace is ignored, and case does not matter.

cc @gatorsmile

BTW, @younggyuchun . Please add a negative test case.
"o" is not supported because it's not unique. It's a common prefix of "on" and "off".
The cast should return null.
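
The uniqueness argument can be checked mechanically. The sketch below is an illustration only and does not claim to match how StringUtils is written: the object name `UniquePrefixes` and its fields are hypothetical. It expands every full spelling into its prefixes and drops any prefix that appears on both the true side and the false side.

```scala
// Hypothetical helper, not part of Spark: derives the accepted prefix sets.
object UniquePrefixes {
  private val trueWords = Seq("true", "yes", "on", "1")
  private val falseWords = Seq("false", "no", "off", "0")

  // All non-empty prefixes of a word, e.g. "off" gives Set("o", "of", "off").
  private def prefixes(word: String): Set[String] =
    (1 to word.length).map(n => word.take(n)).toSet

  private val truePrefixes: Set[String] = trueWords.flatMap(prefixes).toSet
  private val falsePrefixes: Set[String] = falseWords.flatMap(prefixes).toSet

  // A prefix that appears on both sides cannot be resolved to one value.
  val ambiguous: Set[String] = truePrefixes intersect falsePrefixes

  val acceptedTrue: Set[String] = truePrefixes -- ambiguous
  val acceptedFalse: Set[String] = falsePrefixes -- ambiguous
}
```

For these word lists the only ambiguous prefix is "o", so "of" and "off" stay on the false side, "on" stays on the true side, and "o" is in neither set.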

Seems okay to me

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala

dongjoon-hyun

Hi, @younggyuchun . Thank you for this contribution. Could you add a comment to the non-trivial code part? Also, please update the PR title and description accordingly since this PR seems to be almost ready for merge. I'll review this PR again after updating.

younggyuchun · 2019-08-27T19:47:26Z

Hi @dongjoon-hyun,
Sorry for the late reply. I was on vacation :). PR title and description have been changed accordingly.

SparkQA · 2019-08-27T23:23:02Z

Test build #109830 has finished for PR 25458 at commit dacd46b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-08-30T00:54:15Z

Retest this please

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala

younggyuchun · 2019-08-30T02:19:50Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala

+    checkCast("off", false)
+    checkCast("of", false)

+    checkEvaluation(cast("o", BooleanType), null)


@dongjoon-hyun
Added a negative test for "o".

dongjoon-hyun

+1, LGTM. (Pending Jenkins).
Thank you, @younggyuchun .

maropu

The code itself looks ok to me.

SparkQA · 2019-08-30T04:40:56Z

Test build #109927 has finished for PR 25458 at commit dacd46b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-08-30T05:33:44Z

Test build #109931 has finished for PR 25458 at commit abe9a84.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-08-30T06:10:40Z

Test build #109933 has finished for PR 25458 at commit 2ea551c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-08-30T21:17:30Z

The last commit removes 4 comments from boolean.sql and it passed already in the last Jenkins. For the other tests, we verified them in the previous Jenkins runs.

 - pgSQL/boolean.sql (18 seconds, 745 milliseconds)

Merged to master. Thank you all!

younggyuchun · 2019-08-30T21:41:30Z

Thank you all.

SparkQA · 2019-08-30T22:16:34Z

Test build #109957 has finished for PR 25458 at commit b787483.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2019-08-30T23:14:42Z

@dongjoon-hyun @maropu @cloud-fan @HyukjinKwon This is a behavior change. We need to document it if it is on by default. I think it should be guarded by a global dialect flag for PostgreSQL compatibility, instead of applying it in all cases. WDYT?

gatorsmile · 2019-08-30T23:28:03Z

I believe PostgreSQL compatibility is a very important feature. We will add more and more such behavior changes in the near future. Let us move these changes behind a conf like spark.sql.compatiblity.mode=pgSQL?

dongjoon-hyun · 2019-08-30T23:29:17Z

Yep. I agree. Sounds good to me. And, what is the default value of the conf?

gatorsmile · 2019-08-30T23:30:03Z

I think spark at first? When pgSQL mode is stable, we can turn it on? I believe it might take a couple of releases to make it stable.

dongjoon-hyun · 2019-08-30T23:34:07Z

Got it. So, Apache Spark 3.0.0 starts with spark.sql.parser.ansi.enabled=false and spark.sql.compatiblity.mode=spark.

gatorsmile · 2019-08-30T23:37:25Z

Yes. I think that might be a good choice.

dongjoon-hyun · 2019-08-30T23:38:16Z

https://issues.apache.org/jira/browse/SPARK-28934 is filed. After we create the configuration, we can wrap the existing work one-by-one safely.
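
As a rough picture of what guarding the behavior behind such a flag could look like, here is a sketch in plain Scala. Everything in it is hypothetical: the `Dialect` types, `dialectOf`, and `castToBoolean` are invented for illustration, the configuration key is spelled exactly as proposed in the comments above, and the strict rule is only an assumed stand-in for the default behavior, not a statement of what Spark does.

```scala
import java.util.Locale

// Hypothetical sketch, not Spark code: gates the lenient cast behind a dialect flag.
object DialectGateSketch {
  sealed trait Dialect
  case object SparkDialect extends Dialect
  case object PgSqlDialect extends Dialect

  // Key spelled as proposed in the thread; the final name was still undecided.
  val ConfKey = "spark.sql.compatiblity.mode"

  // An unset or unrecognised value falls back to the Spark dialect (the proposed default).
  def dialectOf(conf: Map[String, String]): Dialect =
    conf.get(ConfKey) match {
      case Some(v) if v.equalsIgnoreCase("pgSQL") => PgSqlDialect
      case _ => SparkDialect
    }

  // Assumed strict rule: exact spellings only, no trimming.
  private def strict(s: String): Option[Boolean] =
    s.toLowerCase(Locale.ROOT) match {
      case "true" | "t" | "yes" | "y" | "1" => Some(true)
      case "false" | "f" | "no" | "n" | "0" => Some(false)
      case _ => None
    }

  // Lenient rule: trim, then accept a prefix that matches exactly one side.
  private def lenient(s: String): Option[Boolean] = {
    val key = s.trim.toLowerCase(Locale.ROOT)
    val t = key.nonEmpty && Seq("true", "yes", "on", "1").exists(_.startsWith(key))
    val f = key.nonEmpty && Seq("false", "no", "off", "0").exists(_.startsWith(key))
    if (t == f) None else Some(t) // both or neither matched: not a boolean
  }

  def castToBoolean(input: String, conf: Map[String, String]): Option[Boolean] =
    dialectOf(conf) match {
      case PgSqlDialect => lenient(input)
      case SparkDialect => strict(input)
    }
}
```

Keeping the dialect lookup in one place means each later compatibility change only adds a branch next to its existing behavior, which is the "wrap the existing work one-by-one" idea from the comment above.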

maropu · 2019-08-30T23:46:18Z

Sounds good to me, too.

### What changes were proposed in this pull request?

After #25158 and #25458, SQL features of PostgreSQL are introduced into Spark. AFAIK, both features are implementation-defined behaviors, which are not specified in ANSI SQL.
In such a case, this proposal is to add a configuration `spark.sql.dialect` for choosing a database dialect.
After this PR, Spark supports two database dialects, `Spark` and `PostgreSQL`. With `PostgreSQL` dialect, Spark will:

1. perform integral division with the / operator if both sides are integral types;
2. accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type.

### Why are the changes needed?

Unify the external database dialect with one configuration, instead of small flags.

### Does this PR introduce any user-facing change?

A new configuration `spark.sql.dialect` for choosing a database dialect.

### How was this patch tested?

Existing tests.

Closes #25697 from gengliangwang/dialect.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

### What changes were proposed in this pull request?

Reprocess all PostgreSQL dialect related PRs, listed in order:

- #25158: PostgreSQL integral division support [revert]
- #25170: UT changes for the integral division support [revert]
- #25458: Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. [revert]
- #25697: Combine below 2 feature tags into "spark.sql.dialect" [revert]
- #26112: Date subtraction support [keep the ANSI-compliant part]
- #26444: Rename config "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled" [revert]
- #26463: Cast to boolean support for PostgreSQL dialect [revert]
- #26584: Make the behavior of Postgre dialect independent of ansi mode config [keep the ANSI-compliant part]

### Why are the changes needed?

As discussed in http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html, we need to remove the PostgreSQL dialect from the codebase for several reasons:

1. The current approach makes the codebase complicated and hard to maintain.
2. Fully migrating PostgreSQL workloads to Spark SQL is not our focus for now.

### Does this PR introduce any user-facing change?

Yes, the config `spark.sql.dialect` will be removed.

### How was this patch tested?

Existing UT.

Closes #26763 from xuanyuanking/SPARK-30125.

Lead-authored-by: Yuanjian Li <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

[SPARK-27931][SQL] Accept 'on' and 'off' as input and trim input for a boolean data type. (7d61642)

wangyum reviewed Aug 15, 2019

dongjoon-hyun added the SQL label Aug 15, 2019

[SPARK-27931][SQL] Unique prefixes of strings are also accepted for the boolean data type. (933fe86)

Trim string whitespace so it should be true (a5aec9f)

younggyuchun commented Aug 16, 2019

Generated spark golden files. (9e9aac3)

HyukjinKwon reviewed Aug 23, 2019

dongjoon-hyun reviewed Aug 24, 2019: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala (resolved)

dongjoon-hyun requested changes Aug 24, 2019

younggyuchun changed the title ~~[SPARK-27931][SQL] Accept 'on' and 'off' as input and trim input for the boolean data type.~~ [SPARK-27931][SQL] Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. Aug 27, 2019

add a comment: "true", "yes", "1", "false", "no", "0", and unique prefixes of these strings are accepted. (dacd46b)

dongjoon-hyun reviewed Aug 30, 2019: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala (resolved)

dongjoon-hyun reviewed Aug 30, 2019: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala (outdated, resolved)

younggyu chun added 2 commits August 29, 2019 21:49

[SPARK-27931][SQL] add a "FAlse" test case and remove the redundant space. (abe9a84)

[SPARK-27931][SQL] add a negative test case for "o" (2ea551c)

younggyuchun commented Aug 30, 2019

dongjoon-hyun approved these changes Aug 30, 2019

maropu approved these changes Aug 30, 2019

HyukjinKwon approved these changes Aug 30, 2019

removed "[SPARK-27931] Trim the string when cast string type to boolean type" comment. (b787483)

dongjoon-hyun closed this in 3b07a4e Aug 30, 2019

gengliangwang mentioned this pull request Sep 5, 2019: [SPARK-28997][SQL] Add spark.sql.dialect #25697 (Closed)

xuanyuanking mentioned this pull request Dec 4, 2019: [SPARK-30125][SQL] Remove PostgreSQL dialect #26763 (Closed)
