Skip to content

Conversation

@cloud-fan
Copy link
Contributor

When we cast string to boolean in hive, it returns true if the length of string is > 0, and spark SQL follows this behavior.

However, this behavior is very different from other SQL systems:

  1. presto will return true for 't' 'true' '1', false for 'f' 'false' '0', throw exception for others.
  2. redshift will return true for 't' 'true' 'y' 'yes' '1', false for 'f' 'false' 'n' 'no' '0', null for others.
  3. postgresql will return true for 't' 'true' 'y' 'yes' 'on' '1', false for 'f' 'false' 'n' 'no' 'off' '0', throw exception for others.
  4. vertica will return true for 't' 'true' 'y' 'yes' '1', false for 'f' 'false' 'n' 'no' '0', null for others.
  5. impala throw exception when try to cast string to boolean.
  6. mysql, oracle, sqlserver don't have boolean type

Whether we should change the cast behavior according to other SQL system or not is not decided yet, this PR is a test to see if we changed, how many compatibility tests will fail.

@SparkQA
Copy link

SparkQA commented Sep 10, 2015

Test build #42267 has finished for PR 8698 at commit e7b50f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

cc @yhuai, looks no hive compatibility tests is broken :)

@liancheng
Copy link
Contributor

Would be nice to have a test for a persisted table partitioned by a boolean column.

sqlContext.range(2).selectExpr("(id % 2 = 0) as b", "id").write.partitionBy("b").saveAsTable("t")
sqlContext.table("t").show()

Currently this snippet produces wrong answer (all boolean values are true).

@liancheng
Copy link
Contributor

Although this change doesn't break any existing Hive compatibility tests, it's still a breaking change. We might want to have a separate SQL option to let users be able to fallback to the old behavior. The partitioned table case should be fixed in a separate PR (don't use Cast for boolean columns).

@marmbrus
Copy link
Contributor

A compatibility option would be reasonable. My vote would be for the vertica option. @rxin ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't this be a static variable somewhere?

@rxin
Copy link
Contributor

rxin commented Sep 10, 2015

I think having a config flag is reasonable. If we want that, it should be a single flag that dictates whether we should follow Hive, or our own standards.

However, in this case it seems it is too much work to bring a flag, and the benefit isn't huge yet. So I would just follow Vertica's approach.

@marmbrus
Copy link
Contributor

+1 to @rxin 's suggestions

@cloud-fan cloud-fan changed the title [SPARK-10442][SQL][WIP] fix string to boolean cast [SPARK-10442][SQL] fix string to boolean cast Sep 11, 2015
@SparkQA
Copy link

SparkQA commented Sep 11, 2015

Test build #42311 has finished for PR 8698 at commit 8706165.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Copy link
Contributor

LGTM

@yhuai
Copy link
Contributor

yhuai commented Sep 11, 2015

LGTM. I am merging it to master.

@asfgit asfgit closed this in d5d6473 Sep 11, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants