Skip to content

Conversation

@younggyuchun
Copy link

@younggyuchun younggyuchun commented Aug 15, 2019

What changes were proposed in this pull request?

This PR aims to add "true", "yes", "1", "false", "no", "0", and unique prefixes as input for the boolean data type and ignore input whitespace. Please see the following what string representations are using for the boolean type in other databases.

https://www.postgresql.org/docs/devel/datatype-boolean.html
https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html

How was this patch tested?

Added new tests to CastSuite.

Set("t", "true", "y", "yes", "1", "on").map(UTF8String.fromString)

private[this] val falseStrings =
Set("f", "false", "n", "no", "0", "off").map(UTF8String.fromString)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems only PostgreSQL accepts on and off?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I guess so. Do you know other common string representattion used in other databases?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But PostgreSQL also acceptsof, tru, fals, ...:

postgres=# select cast('of' as boolean), cast('tru' as boolean), cast('fals' as boolean);
 bool | bool | bool
------+------+------
 f    | t    | f
(1 row)

postgres/postgres@9729c93

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay. Let me add that too. Thank you

@SparkQA
Copy link

SparkQA commented Aug 15, 2019

Test build #109139 has finished for PR 25458 at commit 7d61642.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@younggyuchun
Copy link
Author

Unique prefixes of strings, for example, "true", "tre", "tr" and "t" also accepted.

@younggyuchun
Copy link
Author

@HyukjinKwon, @srowen Could you please review this PR?

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Aug 16, 2019

cc @dongjoon-hyun, @cloud-fan and @gatorsmile as well.

@SparkQA
Copy link

SparkQA commented Aug 16, 2019

Test build #109158 has finished for PR 25458 at commit 933fe86.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


-- [SPARK-27931] Trim the string when cast string type to boolean type
SELECT boolean(' f ') AS `false`;
SELECT boolean(' f ') AS `true`;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wangyum I think it should be 'true'. Is it correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's false:
image

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me dig into this.

@younggyuchun
Copy link
Author

retest this please

@SparkQA
Copy link

SparkQA commented Aug 16, 2019

Test build #109222 has finished for PR 25458 at commit a5aec9f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 16, 2019

Test build #109234 has finished for PR 25458 at commit 9e9aac3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@younggyuchun
Copy link
Author

younggyuchun commented Aug 19, 2019

Could you please review this PR? @HyukjinKwon @dongjoon-hyun @cloud-fan and @gatorsmile

checkCast("n", false)
checkCast("0", false)
checkCast("off", false)
checkCast("of", false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@younggyuchun, just for doubly sure, did you double check the behaviours against PostgreSQL?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon Here it is:
image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not documented: https://www.postgresql.org/docs/devel/datatype-boolean.html

Postgres may support of for history reasons, I don't think we have to follow it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan @dongjoon-hyun @HyukjinKwon
This build accepts several unique prefixes for the boolean data type. For example, tru, tr, ye, fals, fal, fa and of, which are not documented. Do we want to not to accept these prefixes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan . It's a documented feature in that document. We had better support it.

Unique prefixes of these strings are also accepted, for example t or n. Leading or trailing whitespace is ignored, and case does not matter.

cc @gatorsmile

Copy link
Member

@dongjoon-hyun dongjoon-hyun Aug 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, @younggyuchun . Please add a negative test case.
o is not supported because it's not unique. It's a common prefix for on and off.
It should be null.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems okay to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @younggyuchun . Thank you for this contribution. Could you add a comment to the non-trivial code part? Also, please update the PR title and description accordingly since this PR seems to be almost ready for merge. I'll review this PR again after updating.

@younggyuchun younggyuchun changed the title [SPARK-27931][SQL] Accept 'on' and 'off' as input and trim input for the boolean data type. [SPARK-27931][SQL] Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. Aug 27, 2019
@younggyuchun
Copy link
Author

younggyuchun commented Aug 27, 2019

Hi @dongjoon-hyun,
Sorry for the late reply. I was on vacation :). PR title and description have been changed accordingly.

@SparkQA
Copy link

SparkQA commented Aug 27, 2019

Test build #109830 has finished for PR 25458 at commit dacd46b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Retest this please

checkCast("off", false)
checkCast("of", false)

checkEvaluation(cast("o", BooleanType), null)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongjoon-hyun
Add a negative test for "o"

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. (Pending Jenkins).
Thank you, @younggyuchun .

Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code itself looks ok to me.

@SparkQA
Copy link

SparkQA commented Aug 30, 2019

Test build #109927 has finished for PR 25458 at commit dacd46b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 30, 2019

Test build #109931 has finished for PR 25458 at commit abe9a84.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 30, 2019

Test build #109933 has finished for PR 25458 at commit 2ea551c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

The last commit removes 4 comments from boolean.sql and it passed already in the last Jenkins. For the other tests, we verified them in the previous Jenkins runs.

 - pgSQL/boolean.sql (18 seconds, 745 milliseconds)

Merged to master. Thank you all!

@younggyuchun
Copy link
Author

younggyuchun commented Aug 30, 2019

Thank you all.

@SparkQA
Copy link

SparkQA commented Aug 30, 2019

Test build #109957 has finished for PR 25458 at commit b787483.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

@dongjoon-hyun @maropu @cloud-fan @HyukjinKwon This is a behavior change. We need to document it if it is on by default. I think it should be guarded by a global dialect flag, for postgreSQL compatibility instead of doing it for all the cases. WDYT?

@gatorsmile
Copy link
Member

I believe postgreSQL compatibility is a very important feature. We will add more and more such behavior changes in the near future. Let us move these changes behind a conf like spark.sql.compatiblity.mode=pgSQL?

@dongjoon-hyun
Copy link
Member

Yep. I agree. Sounds good to me. And, what is the default value of the conf?

@gatorsmile
Copy link
Member

I think spark at first? When pgSQL mode is stable, we can turn it on? I believe it might take a couple of releases to make it stable.

@dongjoon-hyun
Copy link
Member

Got it. So, Apache Spark 3.0.0 starts with spark.sql.parser.ansi.enabled=false and spark.sql.compatiblity.mode=spark.

@gatorsmile
Copy link
Member

Yes. I think that might be a good choice.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Aug 30, 2019

https://issues.apache.org/jira/browse/SPARK-28934 is filed. After we create the configuration, we can wrap the existing work one-by-one safely.

@maropu
Copy link
Member

maropu commented Aug 30, 2019

Sounds good to me, too.

cloud-fan pushed a commit that referenced this pull request Sep 26, 2019
### What changes were proposed in this pull request?

After #25158 and #25458, SQL features of PostgreSQL are introduced into Spark. AFAIK, both features are implementation-defined behaviors, which are not specified in ANSI SQL.
In such a case, this proposal is to add a configuration `spark.sql.dialect` for choosing a database dialect.
After this PR, Spark supports two database dialects, `Spark` and `PostgreSQL`. With `PostgreSQL` dialect, Spark will:
1. perform integral division with the / operator if both sides are integral types;
2. accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type.

### Why are the changes needed?

Unify the external database dialect with one configuration, instead of small flags.

### Does this PR introduce any user-facing change?

A new configuration `spark.sql.dialect` for choosing a database dialect.

### How was this patch tested?

Existing tests.

Closes #25697 from gengliangwang/dialect.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Dec 10, 2019
### What changes were proposed in this pull request?
Reprocess all PostgreSQL dialect related PRs, listing in order:

- #25158: PostgreSQL integral division support [revert]
- #25170: UT changes for the integral division support [revert]
- #25458: Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. [revert]
- #25697: Combine below 2 feature tags into "spark.sql.dialect" [revert]
- #26112: Date substraction support [keep the ANSI-compliant part]
- #26444: Rename config "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled" [revert]
- #26463: Cast to boolean support for PostgreSQL dialect [revert]
- #26584: Make the behavior of Postgre dialect independent of ansi mode config [keep the ANSI-compliant part]

### Why are the changes needed?
As the discussion in http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html, we need to remove PostgreSQL dialect form code base for several reasons:
1. The current approach makes the codebase complicated and hard to maintain.
2. Fully migrating PostgreSQL workloads to Spark SQL is not our focus for now.

### Does this PR introduce any user-facing change?
Yes, the config `spark.sql.dialect` will be removed.

### How was this patch tested?
Existing UT.

Closes #26763 from xuanyuanking/SPARK-30125.

Lead-authored-by: Yuanjian Li <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants